Message ID | 20180925130845.9962-10-jarkko.sakkinen@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Intel SGX1 support | expand |
Minor nit: On Tue, Sep 25, 2018 at 6:12 AM Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote: > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > by (c) as the kernel doesn't really have any other reasonable option, > e.g. we could kill the task or panic, but neither is warranted. Not killing the task is quite nice, but... > + /* > + * Access is blocked by the Enclave Page Cache Map (EPCM), > + * i.e. the access is allowed by the PTE but not the EPCM. > + * This usually happens when the EPCM is yanked out from > + * under us, e.g. by hardware after a suspend/resume cycle. > + * In any case, there is nothing that can be done by the > + * kernel to resolve the fault (short of killing the task). Maybe s/killing the task/sending a signal/? --Andy
On Tue, Sep 25, 2018 at 03:53:48PM -0700, Andy Lutomirski wrote: > Minor nit: > > On Tue, Sep 25, 2018 at 6:12 AM Jarkko Sakkinen > <jarkko.sakkinen@linux.intel.com> wrote: > > > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > > > > by (c) as the kernel doesn't really have any other reasonable option, > > e.g. we could kill the task or panic, but neither is warranted. > > Not killing the task is quite nice, but... > > > + /* > > + * Access is blocked by the Enclave Page Cache Map (EPCM), > > + * i.e. the access is allowed by the PTE but not the EPCM. > > + * This usually happens when the EPCM is yanked out from > > + * under us, e.g. by hardware after a suspend/resume cycle. > > + * In any case, there is nothing that can be done by the > > + * kernel to resolve the fault (short of killing the task). > > Maybe s/killing the task/sending a signal/? My intent was to document that, unlike all other page faults, the kernel can't fix the source of the fault even if it were omniscient. How about this? With formatting changes since it's long-winded... /* * Access is blocked by the Enclave Page Cache Map (EPCM), i.e. the * access is allowed by the PTE but not the EPCM. This usually happens * when the EPCM is yanked out from under us, e.g. by hardware after a * suspend/resume cycle. In any case, software, i.e. the kernel, can't * fix the source of the fault as the EPCM can't be directly modified * by software. Handle the fault as an access error in order to signal * userspace, e.g. so that userspace can rebuild their enclave(s), even * though userspace may not have actually violated access permissions. */
> On Sep 26, 2018, at 10:35 AM, Sean Christopherson <sean.j.christopherson@intel.com> wrote: > >> On Tue, Sep 25, 2018 at 03:53:48PM -0700, Andy Lutomirski wrote: >> Minor nit: >> >> On Tue, Sep 25, 2018 at 6:12 AM Jarkko Sakkinen >> <jarkko.sakkinen@linux.intel.com> wrote: >>> >>> From: Sean Christopherson <sean.j.christopherson@intel.com> >>> >> >>> by (c) as the kernel doesn't really have any other reasonable option, >>> e.g. we could kill the task or panic, but neither is warranted. >> >> Not killing the task is quite nice, but... >> >>> + /* >>> + * Access is blocked by the Enclave Page Cache Map (EPCM), >>> + * i.e. the access is allowed by the PTE but not the EPCM. >>> + * This usually happens when the EPCM is yanked out from >>> + * under us, e.g. by hardware after a suspend/resume cycle. >>> + * In any case, there is nothing that can be done by the >>> + * kernel to resolve the fault (short of killing the task). >> >> Maybe s/killing the task/sending a signal/? > > My intent was to document that, unlike all other page faults, the > kernel can't fix the source of the fault even if it were omniscient. > > How about this? With formatting changes since it's long-winded... > > /* > * Access is blocked by the Enclave Page Cache Map (EPCM), i.e. the > * access is allowed by the PTE but not the EPCM. This usually happens > * when the EPCM is yanked out from under us, e.g. by hardware after a > * suspend/resume cycle. In any case, software, i.e. the kernel, can't > * fix the source of the fault as the EPCM can't be directly modified > * by software. Handle the fault as an access error in order to signal > * userspace, e.g. so that userspace can rebuild their enclave(s), even > * though userspace may not have actually violated access permissions. > */ > Looks good to me.
On 09/26/2018 11:12 AM, Andy Lutomirski wrote: >> e omniscient. >> >> How about this? With formatting changes since it's long-winded... >> >> /* >> * Access is blocked by the Enclave Page Cache Map (EPCM), i.e. the >> * access is allowed by the PTE but not the EPCM. This usually happens >> * when the EPCM is yanked out from under us, e.g. by hardware after a >> * suspend/resume cycle. In any case, software, i.e. the kernel, can't >> * fix the source of the fault as the EPCM can't be directly modified >> * by software. Handle the fault as an access error in order to signal >> * userspace, e.g. so that userspace can rebuild their enclave(s), even >> * though userspace may not have actually violated access permissions. >> */ >> > Looks good to me. Including the actual architectural definition of the bit might add some clarity. The SDM explicitly says (Vol 3a section 4.7): The fault resulted from violation of SGX-specific access-control requirements. Which totally squares with returning true from access_error(). There's also a tidbit that says: This flag is 1 if the exception is unrelated to paging and resulted from violation of SGX-specific access-control requirements. ... such a violation can occur only if there is no ordinary page fault... This is pretty important. It means that *none* of the other paging-related stuff that we're doing applies. We also need to clarify how this can happen. Is it through something than an app does, or is it solely when the hardware does something under the covers, like suspend/resume.
On Wed, Sep 26, 2018 at 01:16:59PM -0700, Dave Hansen wrote: > On 09/26/2018 11:12 AM, Andy Lutomirski wrote: > >> e omniscient. > >> > >> How about this? With formatting changes since it's long-winded... > >> > >> /* > >> * Access is blocked by the Enclave Page Cache Map (EPCM), i.e. the > >> * access is allowed by the PTE but not the EPCM. This usually happens > >> * when the EPCM is yanked out from under us, e.g. by hardware after a > >> * suspend/resume cycle. In any case, software, i.e. the kernel, can't > >> * fix the source of the fault as the EPCM can't be directly modified > >> * by software. Handle the fault as an access error in order to signal > >> * userspace, e.g. so that userspace can rebuild their enclave(s), even > >> * though userspace may not have actually violated access permissions. > >> */ > >> > > Looks good to me. > > Including the actual architectural definition of the bit might add some > clarity. The SDM explicitly says (Vol 3a section 4.7): > > The fault resulted from violation of SGX-specific access-control > requirements. > > Which totally squares with returning true from access_error(). > > There's also a tidbit that says: > > This flag is 1 if the exception is unrelated to paging and > resulted from violation of SGX-specific access-control > requirements. ... such a violation can occur only if there > is no ordinary page fault... > > This is pretty important. It means that *none* of the other > paging-related stuff that we're doing applies. > > We also need to clarify how this can happen. Is it through something > than an app does, or is it solely when the hardware does something under > the covers, like suspend/resume. Are you looking for something in the changelog, the comment, or just a response? If it's the latter... On bare metal with a bug-free kernel, the only scenario I'm aware of where we'll encounter these faults is when hardware pulls the rug out from under us. In a virtualized environment all bets are off because the architecture allows VMMs to silently "destroy" the EPC at will, e.g. KVM, and I believe Hyper-V, will take advantage of this behavior to support live migration. Post migration, the destination system will generate PF_SGX because the EPC{M} can't be migrated between system, i.e. the destination EPCM sees all EPC pages as invalid.
On 09/26/2018 01:44 PM, Sean Christopherson wrote: > On Wed, Sep 26, 2018 at 01:16:59PM -0700, Dave Hansen wrote: >> We also need to clarify how this can happen. Is it through something >> than an app does, or is it solely when the hardware does something under >> the covers, like suspend/resume. > > Are you looking for something in the changelog, the comment, or just > a response? If it's the latter... Comments, please. > On bare metal with a bug-free kernel, the only scenario I'm aware of > where we'll encounter these faults is when hardware pulls the rug out > from under us. In a virtualized environment all bets are off because > the architecture allows VMMs to silently "destroy" the EPC at will, > e.g. KVM, and I believe Hyper-V, will take advantage of this behavior > to support live migration. Post migration, the destination system > will generate PF_SGX because the EPC{M} can't be migrated between > system, i.e. the destination EPCM sees all EPC pages as invalid. OK, cool. That's good background fodder for the changelog. But, for the comment, I'm happy with something like this: /* * The fault resulted from violation of SGX-specific access- * controls. This is expected to be the result of some lower * layer action (CPU suspend/resume, VM migration) and is * not related to anything the OS did. Treat it as an access * error to ensure it is passed up to the app via a signal where * it can be handled. */ I really don't think we need to delve too deeply into the relationship between EPCM and PTEs or anything. Let's just say, "it's not the kernel's fault, it's not the app's fault, so throw up our hands".
On Wed, Sep 26, 2018 at 1:55 PM Dave Hansen <dave.hansen@intel.com> wrote: > > On 09/26/2018 01:44 PM, Sean Christopherson wrote: > > On Wed, Sep 26, 2018 at 01:16:59PM -0700, Dave Hansen wrote: > >> We also need to clarify how this can happen. Is it through something > >> than an app does, or is it solely when the hardware does something under > >> the covers, like suspend/resume. > > > > Are you looking for something in the changelog, the comment, or just > > a response? If it's the latter... > > Comments, please. > > > On bare metal with a bug-free kernel, the only scenario I'm aware of > > where we'll encounter these faults is when hardware pulls the rug out > > from under us. In a virtualized environment all bets are off because > > the architecture allows VMMs to silently "destroy" the EPC at will, > > e.g. KVM, and I believe Hyper-V, will take advantage of this behavior > > to support live migration. Post migration, the destination system > > will generate PF_SGX because the EPC{M} can't be migrated between > > system, i.e. the destination EPCM sees all EPC pages as invalid. > > OK, cool. > > That's good background fodder for the changelog. > > But, for the comment, I'm happy with something like this: > > /* > * The fault resulted from violation of SGX-specific access- > * controls. This is expected to be the result of some lower > * layer action (CPU suspend/resume, VM migration) and is > * not related to anything the OS did. Treat it as an access > * error to ensure it is passed up to the app via a signal where > * it can be handled. > */ > > I really don't think we need to delve too deeply into the relationship > between EPCM and PTEs or anything. Let's just say, "it's not the > kernel's fault, it's not the app's fault, so throw up our hands". There is a non-nitpicky consideration here. Logically, user code is going to do this (totally made-up pseudocode): enclave_t enclave = load_and_init_enclave(...); int ret = sgx_run(enclave, some pointers to non-enclave-memory buffers, ...); and, with the code in this patch, a correct implementation of sgx_run() requires installing a signal handler. This is nasty, since signal handlers, expecially for something like SIGSEGV or SIGBUS, are not fantastic to say the least in libraries. Could we perhaps have a little vDSO entry (or syscall, I suppose) that runs an enclave an returns an error code, and rig up the #PF handler to check if the error happened in the vDSO entry and fix it up rather than sending a signal? On Windows, this is much less of a concern, because Windows has real scoped fault handling. But Linux doesn't, at least not yet. -- Andy Lutomirski AMA Capital Management, LLC
On 09/26/2018 02:15 PM, Andy Lutomirski wrote: > Could we perhaps have a little vDSO entry (or syscall, I suppose) that > runs an enclave an returns an error code, and rig up the #PF handler > to check if the error happened in the vDSO entry and fix it up rather > than sending a signal? Yeah, signals suck. So, instead of doing the enclave entry instruction (EENTER is it?), the app would do the vDSO call. It would have some calling convention, like "set %rax to 0 before entering". Then, we just teach the page fault handler about the %RIP in the vDSO that can fault and how to move one instruction later, munge %RIP to a value that tells about the error, then return from the fault. It would basically be like the kernel exception tables, but for userspace. Right? How would a syscall work, though? I assume we can't just enter the enclave from ring0.
On Wed, Sep 26, 2018 at 2:45 PM Dave Hansen <dave.hansen@intel.com> wrote: > > On 09/26/2018 02:15 PM, Andy Lutomirski wrote: > > Could we perhaps have a little vDSO entry (or syscall, I suppose) that > > runs an enclave an returns an error code, and rig up the #PF handler > > to check if the error happened in the vDSO entry and fix it up rather > > than sending a signal? > > Yeah, signals suck. > > So, instead of doing the enclave entry instruction (EENTER is it?), the > app would do the vDSO call. It would have some calling convention, like > "set %rax to 0 before entering". Then, we just teach the page fault > handler about the %RIP in the vDSO that can fault and how to move one > instruction later, munge %RIP to a value that tells about the error, > then return from the fault. It would basically be like the kernel > exception tables, but for userspace. Right? Yeah. Maybe like this: xorl %eax,%eax eenter_insn: ENCLU[whatever] eenter_landing_pad: ret And the kernel would use the existing vdso2c vdso-symbol-finding mechanism to do the fixup. > > How would a syscall work, though? I assume we can't just enter the > enclave from ring0. My understanding of how AEX works is a bit vague, but maybe a syscall could reuse the mechanism? The vDSO approach seems considerably simpler. We do need to make sure that a fault that happens on or after return from an AEX event does the right thing. But I'm still vague on how that works, sigh. --Andy
On Tue, Sep 25, 2018 at 03:53:48PM -0700, Andy Lutomirski wrote: > Minor nit: > > On Tue, Sep 25, 2018 at 6:12 AM Jarkko Sakkinen > <jarkko.sakkinen@linux.intel.com> wrote: > > > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > > > > by (c) as the kernel doesn't really have any other reasonable option, > > e.g. we could kill the task or panic, but neither is warranted. > > Not killing the task is quite nice, but... > > > + /* > > + * Access is blocked by the Enclave Page Cache Map (EPCM), > > + * i.e. the access is allowed by the PTE but not the EPCM. > > + * This usually happens when the EPCM is yanked out from > > + * under us, e.g. by hardware after a suspend/resume cycle. > > + * In any case, there is nothing that can be done by the > > + * kernel to resolve the fault (short of killing the task). > > Maybe s/killing the task/sending a signal/? Agreed. Thanks. /Jarkko
On Wed, Sep 26, 2018 at 01:16:59PM -0700, Dave Hansen wrote: > On 09/26/2018 11:12 AM, Andy Lutomirski wrote: > >> e omniscient. > >> > >> How about this? With formatting changes since it's long-winded... > >> > >> /* > >> * Access is blocked by the Enclave Page Cache Map (EPCM), i.e. the > >> * access is allowed by the PTE but not the EPCM. This usually happens > >> * when the EPCM is yanked out from under us, e.g. by hardware after a > >> * suspend/resume cycle. In any case, software, i.e. the kernel, can't > >> * fix the source of the fault as the EPCM can't be directly modified > >> * by software. Handle the fault as an access error in order to signal > >> * userspace, e.g. so that userspace can rebuild their enclave(s), even > >> * though userspace may not have actually violated access permissions. > >> */ > >> > > Looks good to me. > > Including the actual architectural definition of the bit might add some > clarity. The SDM explicitly says (Vol 3a section 4.7): > > The fault resulted from violation of SGX-specific access-control > requirements. > > Which totally squares with returning true from access_error(). > > There's also a tidbit that says: > > This flag is 1 if the exception is unrelated to paging and > resulted from violation of SGX-specific access-control > requirements. ... such a violation can occur only if there > is no ordinary page fault... > > This is pretty important. It means that *none* of the other > paging-related stuff that we're doing applies. > > We also need to clarify how this can happen. Is it through something > than an app does, or is it solely when the hardware does something under > the covers, like suspend/resume. When you change page permissions lets say with mprotect after the and try to do an invalid access according to the EPCM permissions this can happen. /Jarkko
On Wed, Sep 26, 2018 at 02:45:17PM -0700, Dave Hansen wrote: > On 09/26/2018 02:15 PM, Andy Lutomirski wrote: > > Could we perhaps have a little vDSO entry (or syscall, I suppose) that > > runs an enclave an returns an error code, and rig up the #PF handler > > to check if the error happened in the vDSO entry and fix it up rather > > than sending a signal? > > Yeah, signals suck. > > So, instead of doing the enclave entry instruction (EENTER is it?), the > app would do the vDSO call. It would have some calling convention, like > "set %rax to 0 before entering". Then, we just teach the page fault > handler about the %RIP in the vDSO that can fault and how to move one > instruction later, munge %RIP to a value that tells about the error, > then return from the fault. It would basically be like the kernel > exception tables, but for userspace. Right? > > How would a syscall work, though? I assume we can't just enter the > enclave from ring0. Enclave cannot be entered from ring-0. For me this plan sounds simple and sound. /Jarkko
On Wed, Sep 26, 2018 at 03:37:45PM -0700, Andy Lutomirski wrote: > Yeah. Maybe like this: > > xorl %eax,%eax > eenter_insn: > ENCLU[whatever] > eenter_landing_pad: > ret > > And the kernel would use the existing vdso2c vdso-symbol-finding > mechanism to do the fixup. > > > > > How would a syscall work, though? I assume we can't just enter the > > enclave from ring0. > > My understanding of how AEX works is a bit vague, but maybe a syscall > could reuse the mechanism? The vDSO approach seems considerably > simpler. > > We do need to make sure that a fault that happens on or after return > from an AEX event does the right thing. But I'm still vague on how > that works, sigh. > > --Andy Returning from AEX does not differ from any other memory access event so AFAIK it should be handled right with the proposed solution already. For convenience I think we could have a fixed trampoline for AEX e.g. this how it is implemented in the open source LE that I did: sgx_get_token: push %rbx mov $0x02, %rax mov %rsi, %rbx mov %rdx, %rsi mov $sgx_async_exit, %rcx sgx_async_exit: ENCLU pop %rbx ret BTW, if I converted the in-kernel LE as a standalone test program, would that be useful for basic testing of the series? /Jarkko
> On Sep 27, 2018, at 7:21 AM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote: > >> On Wed, Sep 26, 2018 at 03:37:45PM -0700, Andy Lutomirski wrote: >> Yeah. Maybe like this: > > xorl %eax,%eax > eenter_insn: >> ENCLU[whatever] >> eenter_landing_pad: >> ret >> >> And the kernel would use the existing vdso2c vdso-symbol-finding >> mechanism to do the fixup. >> >>> >>> How would a syscall work, though? I assume we can't just enter the >>> enclave from ring0. >> >> My understanding of how AEX works is a bit vague, but maybe a syscall >> could reuse the mechanism? The vDSO approach seems considerably >> simpler. >> >> We do need to make sure that a fault that happens on or after return >> from an AEX event does the right thing. But I'm still vague on how >> that works, sigh. >> >> --Andy > > Returning from AEX does not differ from any other memory access event so > AFAIK it should be handled right with the proposed solution already. > For convenience I think we could have a fixed trampoline for AEX e.g. > this how it is implemented in the open source LE that I did: > > sgx_get_token: > push %rbx > mov $0x02, %rax > mov %rsi, %rbx > mov %rdx, %rsi > mov $sgx_async_exit, %rcx > sgx_async_exit: > ENCLU > pop %rbx > ret > > BTW, if I converted the in-kernel LE as a standalone test program, would > that be useful for basic testing of the series? > > Definitely. Especially if you stick it in selftests/x86 and make it exit cleanly (error code 0) on unsupported hardware.
On 09/27/2018 06:42 AM, Jarkko Sakkinen wrote: >> This flag is 1 if the exception is unrelated to paging and >> resulted from violation of SGX-specific access-control >> requirements. ... such a violation can occur only if there >> is no ordinary page fault... >> >> This is pretty important. It means that *none* of the other >> paging-related stuff that we're doing applies. >> >> We also need to clarify how this can happen. Is it through something >> than an app does, or is it solely when the hardware does something under >> the covers, like suspend/resume. > When you change page permissions lets say with mprotect after the and > try to do an invalid access according to the EPCM permissions this can > happen. So, there are pages that are non-executable, non-readable, or non-writable both via the page tables and via underlying SGX permissions. Then, we allow an mprotect() and a later access will result in one of these SGX faults? What permissions are these, exactly? Is it even a good idea to let that mprotect() go through in the first place? Either way, it sounds like we have some new conditions to spell out in that comment.
On Thu, Sep 27, 2018 at 07:58:41AM -0700, Dave Hansen wrote: > On 09/27/2018 06:42 AM, Jarkko Sakkinen wrote: > >> This flag is 1 if the exception is unrelated to paging and > >> resulted from violation of SGX-specific access-control > >> requirements. ... such a violation can occur only if there > >> is no ordinary page fault... > >> > >> This is pretty important. It means that *none* of the other > >> paging-related stuff that we're doing applies. > >> > >> We also need to clarify how this can happen. Is it through something > >> than an app does, or is it solely when the hardware does something under > >> the covers, like suspend/resume. > > When you change page permissions lets say with mprotect after the and > > try to do an invalid access according to the EPCM permissions this can > > happen. > > So, there are pages that are non-executable, non-readable, or > non-writable both via the page tables and via underlying SGX > permissions. Then, we allow an mprotect() and a later access will > result in one of these SGX faults? The permissions are intersection of PTE and EPCM permissions. EPCM permissions are part of the enclave measurement. For SGX1 they are static. For SGX2 they can be changed with EMODPR/EACCEPT protocol (i.e. measurement can be updated after enclave initialization). > What permissions are these, exactly? Is it even a good idea to let that > mprotect() go through in the first place? You define RWX for each page when you do EADD. > Either way, it sounds like we have some new conditions to spell out in > that comment. Agreed. /Jarkko
On 09/27/2018 08:39 AM, Jarkko Sakkinen wrote: > On Thu, Sep 27, 2018 at 07:58:41AM -0700, Dave Hansen wrote: >> On 09/27/2018 06:42 AM, Jarkko Sakkinen wrote: >>>> This flag is 1 if the exception is unrelated to paging and >>>> resulted from violation of SGX-specific access-control >>>> requirements. ... such a violation can occur only if there >>>> is no ordinary page fault... >>>> >>>> This is pretty important. It means that *none* of the other >>>> paging-related stuff that we're doing applies. >>>> >>>> We also need to clarify how this can happen. Is it through something >>>> than an app does, or is it solely when the hardware does something under >>>> the covers, like suspend/resume. >>> When you change page permissions lets say with mprotect after the and >>> try to do an invalid access according to the EPCM permissions this can >>> happen. >> >> So, there are pages that are non-executable, non-readable, or >> non-writable both via the page tables and via underlying SGX >> permissions. Then, we allow an mprotect() and a later access will >> result in one of these SGX faults? > > The permissions are intersection of PTE and EPCM permissions. Right, but this *fault* bit is not. > EPCM permissions are part of the enclave measurement. For SGX1 they are > static. For SGX2 they can be changed with EMODPR/EACCEPT protocol (i.e. > measurement can be updated after enclave initialization). What does this all have to do with enclave measurement? >> What permissions are these, exactly? Is it even a good idea to let that >> mprotect() go through in the first place? > > You define RWX for each page when you do EADD. Are those permissions reflected into the VMAs mapping the enclave memory?
Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> writes: > From: Sean Christopherson <sean.j.christopherson@intel.com> > > Signal SIGSEGV(SEGV_SGXERR) for all faults with PF_SGX set in the > error code. The PF_SGX bit is set if and only if the #PF is detected > by the Enclave Page Cache Map (EPCM), which is consulted only after > an access walks the kernel's page tables, i.e.: > > a. the access was allowed by the kernel > b. the kernel's tables have become less restrictive than the EPCM > c. the kernel cannot fixup the cause of the fault > > Noteably, (b) implies that either the kernel has botched the EPC > mappings or the EPCM has been invalidated due to a power event. In > either case, userspace needs to be alerted so that it can take > appropriate action, e.g. restart the enclave. This is reinforced > by (c) as the kernel doesn't really have any other reasonable option, > e.g. we could kill the task or panic, but neither is warranted. > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> > --- > arch/x86/mm/fault.c | 20 +++++++++++++++++--- > 1 file changed, 17 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 85d20516b2f3..3fb2b2838d6c 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -960,10 +960,13 @@ static noinline void > bad_area_access_error(struct pt_regs *regs, unsigned long error_code, > unsigned long address, struct vm_area_struct *vma) > { > + int si_code = SEGV_ACCERR; > + > if (bad_area_access_from_pkeys(error_code, vma)) > - __bad_area(regs, error_code, address, vma, SEGV_PKUERR); > - else > - __bad_area(regs, error_code, address, vma, SEGV_ACCERR); > + si_code = SEGV_PKUERR; > + else if (unlikely(error_code & X86_PF_SGX)) > + si_code = SEGV_SGXERR; > + __bad_area(regs, error_code, address, vma, si_code); > } This conflicts with a cleanup in this area I have sitting in linux-next. It isn't in the x86 tree but you can find my siginfo tree at: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git siginfo-next In my tree bad area no longer takes a vma parameter. If you are going to make changes to the fault handling code this cycle please let's figure out how to build it on top of my clean ups. Thank you, Eric Biederman
On Thu, 2018-09-27 at 21:43 +0200, Eric W. Biederman wrote: > Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> writes: > > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > > > Signal SIGSEGV(SEGV_SGXERR) for all faults with PF_SGX set in the > > error code. The PF_SGX bit is set if and only if the #PF is detected > > by the Enclave Page Cache Map (EPCM), which is consulted only after > > an access walks the kernel's page tables, i.e.: > > > > a. the access was allowed by the kernel > > b. the kernel's tables have become less restrictive than the EPCM > > c. the kernel cannot fixup the cause of the fault > > > > Noteably, (b) implies that either the kernel has botched the EPC > > mappings or the EPCM has been invalidated due to a power event. In > > either case, userspace needs to be alerted so that it can take > > appropriate action, e.g. restart the enclave. This is reinforced > > by (c) as the kernel doesn't really have any other reasonable option, > > e.g. we could kill the task or panic, but neither is warranted. > > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > > Cc: Dave Hansen <dave.hansen@linux.intel.com> > > Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> > > --- > > arch/x86/mm/fault.c | 20 +++++++++++++++++--- > > 1 file changed, 17 insertions(+), 3 deletions(-) > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > index 85d20516b2f3..3fb2b2838d6c 100644 > > --- a/arch/x86/mm/fault.c > > +++ b/arch/x86/mm/fault.c > > @@ -960,10 +960,13 @@ static noinline void > > bad_area_access_error(struct pt_regs *regs, unsigned long error_code, > > unsigned long address, struct vm_area_struct *vma) > > { > > + int si_code = SEGV_ACCERR; > > + > > if (bad_area_access_from_pkeys(error_code, vma)) > > - __bad_area(regs, error_code, address, vma, SEGV_PKUERR); > > - else > > - __bad_area(regs, error_code, address, vma, SEGV_ACCERR); > > + si_code = SEGV_PKUERR; > > + else if (unlikely(error_code & X86_PF_SGX)) > > + si_code = SEGV_SGXERR; > > + __bad_area(regs, error_code, address, vma, si_code); > > } > > This conflicts with a cleanup in this area I have sitting in linux-next. > It isn't in the x86 tree but you can find my siginfo tree at: > > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git > siginfo-next > > In my tree bad area no longer takes a vma parameter. > > If you are going to make changes to the fault handling code this cycle > please let's figure out how to build it on top of my clean ups. We are now going with the proposed vdso solution for v15. Thank you for your feedback. I'll sync with you if that route would show non-feasible (still prototyping the vdso solution). > Thank you, > Eric Biederman /Jarkko
On Wed, 2018-09-26 at 14:15 -0700, Andy Lutomirski wrote: > On Wed, Sep 26, 2018 at 1:55 PM Dave Hansen <dave.hansen@intel.com> wrote: > > > > > > On 09/26/2018 01:44 PM, Sean Christopherson wrote: > > > > > > On Wed, Sep 26, 2018 at 01:16:59PM -0700, Dave Hansen wrote: > > > > > > > > We also need to clarify how this can happen. Is it through something > > > > than an app does, or is it solely when the hardware does something under > > > > the covers, like suspend/resume. > > > Are you looking for something in the changelog, the comment, or just > > > a response? If it's the latter... > > Comments, please. > > > > > > > > On bare metal with a bug-free kernel, the only scenario I'm aware of > > > where we'll encounter these faults is when hardware pulls the rug out > > > from under us. In a virtualized environment all bets are off because > > > the architecture allows VMMs to silently "destroy" the EPC at will, > > > e.g. KVM, and I believe Hyper-V, will take advantage of this behavior > > > to support live migration. Post migration, the destination system > > > will generate PF_SGX because the EPC{M} can't be migrated between > > > system, i.e. the destination EPCM sees all EPC pages as invalid. > > OK, cool. > > > > That's good background fodder for the changelog. > > > > But, for the comment, I'm happy with something like this: > > > > /* > > * The fault resulted from violation of SGX-specific access- > > * controls. This is expected to be the result of some lower > > * layer action (CPU suspend/resume, VM migration) and is > > * not related to anything the OS did. Treat it as an access > > * error to ensure it is passed up to the app via a signal where > > * it can be handled. > > */ > > > > I really don't think we need to delve too deeply into the relationship > > between EPCM and PTEs or anything. Let's just say, "it's not the > > kernel's fault, it's not the app's fault, so throw up our hands". > There is a non-nitpicky consideration here. Logically, user code is > going to do this (totally made-up pseudocode): > > enclave_t enclave = load_and_init_enclave(...); > int ret = sgx_run(enclave, some pointers to non-enclave-memory buffers, ...); > > and, with the code in this patch, a correct implementation of > sgx_run() requires installing a signal handler. This is nasty, since > signal handlers, expecially for something like SIGSEGV or SIGBUS, are > not fantastic to say the least in libraries. > > Could we perhaps have a little vDSO entry (or syscall, I suppose) that > runs an enclave an returns an error code, and rig up the #PF handler > to check if the error happened in the vDSO entry and fix it up rather > than sending a signal? If we want to avoid having to install a signal handler then I'm pretty sure we'd need to fixup all #GPs and "bad access" #PFs that occur on EENTER or in the enclave, not just PF_SGX faults. SGX1 hardware takes a #GP instead of a #PF on EPCM faults, and SGX2 hardware allows enclaves to allocate/free/adjust EPC pages at runtime, e.g. an enclave runtime might want to intercept #PFs from within the enclave so that the enclave can dynamically grow its stack. > On Windows, this is much less of a concern, because Windows has real > scoped fault handling. But Linux doesn't, at least not yet. > > > -- > Andy Lutomirski > AMA Capital Management, LLC
On 10/01/2018 07:29 AM, Sean Christopherson wrote: >> Could we perhaps have a little vDSO entry (or syscall, I suppose) that >> runs an enclave an returns an error code, and rig up the #PF handler >> to check if the error happened in the vDSO entry and fix it up rather >> than sending a signal? > > If we want to avoid having to install a signal handler then I'm pretty > sure we'd need to fixup all #GPs and "bad access" #PFs that occur on > EENTER or in the enclave, not just PF_SGX faults. SGX1 hardware takes > a #GP instead of a #PF on EPCM faults, and SGX2 hardware allows enclaves > to allocate/free/adjust EPC pages at runtime, e.g. an enclave runtime > might want to intercept #PFs from within the enclave so that the enclave > can dynamically grow its stack. I think the technique Andy describes can be used for that as well. It basically works for any case where we know which instructions will take an exception (any exception), call the instruction from a fixed location, and know the fault(s) it can throw. To me, it's almost like turning these faulting instructions into mini syscall instructions. They enter the kernel only when they need help, though, instead of always.
On 2018-09-27 06:56, Jarkko Sakkinen wrote: > On Wed, Sep 26, 2018 at 02:45:17PM -0700, Dave Hansen wrote: >> On 09/26/2018 02:15 PM, Andy Lutomirski wrote: >>> Could we perhaps have a little vDSO entry (or syscall, I suppose) that >>> runs an enclave an returns an error code, and rig up the #PF handler >>> to check if the error happened in the vDSO entry and fix it up rather >>> than sending a signal? > > For me this plan sounds simple and sound. I support avoiding the need for a signal handler for various SGX-specific operations. Looking forward to v15. I have some thoughts regarding the design of the vDSO function. Please consider the following as you work on the next patch set. 1) Even though the vDSO function exists, userspace may still call `ENCLU[EENTER]` manually, so the fault handling as described in the current patch should also be maintained. 2) All the information that would normally be provided through the signal handler (x86 fault number, reason) should be provided to userspace. 3) vDSO functions should be provided for all standard non-enclave ENCLU leafs, and should support most ways that an application might want to use. This includes: * EENTER with a automatic AEX handler (as in Jarkko's sgx_get_token example) * EENTER & ERESUME with a user-specified AEX handler, or possibly just a special return value from the ENCLU function on AEX 4) I think the vDSO functions should have a special calling convention (not conforming to the standard SysV ABI), such that most registers are passed between user space and enclave space untouched. Basically, only RAX, RBX, RCX are available as input and output registers. -- Jethro Beekman | Fortanix
On 10/01/2018 02:42 PM, Jethro Beekman wrote: > > 1) Even though the vDSO function exists, userspace may still call > `ENCLU[EENTER]` manually, so the fault handling as described in the > current patch should also be maintained. Why?
On Mon, Oct 01, 2018 at 07:29:03AM -0700, Sean Christopherson wrote: > On Wed, 2018-09-26 at 14:15 -0700, Andy Lutomirski wrote: > > runs an enclave an returns an error code, and rig up the #PF handler > > to check if the error happened in the vDSO entry and fix it up rather > > than sending a signal? > > > If we want to avoid having to install a signal handler then I'm pretty > sure we'd need to fixup all #GPs and "bad access" #PFs that occur on > EENTER or in the enclave, not just PF_SGX faults. SGX1 hardware takes > a #GP instead of a #PF on EPCM faults, and SGX2 hardware allows enclaves > to allocate/free/adjust EPC pages at runtime, e.g. an enclave runtime > might want to intercept #PFs from within the enclave so that the enclave > can dynamically grow its stack. If I've understood Andy's proposal correctly, the run-time would get the same information as with a signal. The delivery path for this information would be just different. /Jarkko
On Mon, Oct 01, 2018 at 09:42:48PM +0000, Jethro Beekman wrote: > 1) Even though the vDSO function exists, userspace may still call > `ENCLU[EENTER]` manually, so the fault handling as described in the current > patch should also be maintained. You mean the way it was is in v13 and not the way it is in v14? > 2) All the information that would normally be provided through the signal > handler (x86 fault number, reason) should be provided to userspace. As I've understood it, this should be just a change in the delivery path. /Jarkko
On Mon, Oct 01, 2018 at 03:03:30PM -0700, Dave Hansen wrote: > On 10/01/2018 02:42 PM, Jethro Beekman wrote: > > > > 1) Even though the vDSO function exists, userspace may still call > > `ENCLU[EENTER]` manually, so the fault handling as described in the > > current patch should also be maintained. > > Why? Circling back to this question, what if we take the easy way out and simply signal SIGSEGV without an SGX-specific code? I.e. treat #PF with X86_PF_SGX as an access error, no more no less. That should be sufficient for userspace to function, albeit with a little more effort, but presumably no more than would be needed to run on SGX1 hardware. AFAIK there isn't a way to prevent userspace from manually invoking EENTER, short of doing some really nasty text poking or PTE swizzling. We could declare using EENTER as unsupported, but that seems like cutting off the nose to spite the face. Supporting userspace EENTER in a limited capacity would allow people to do whatever crazy tricks they're wont to do without having to deal with absurd requests for the vDSO interface. If we go this route we could also add the vDSO stuff after basic SGX support is in mainline, obviously with approval from the powers that be.
On 10/31/18 2:30 PM, Sean Christopherson wrote: > On Mon, Oct 01, 2018 at 03:03:30PM -0700, Dave Hansen wrote: >> On 10/01/2018 02:42 PM, Jethro Beekman wrote: >>> >>> 1) Even though the vDSO function exists, userspace may still call >>> `ENCLU[EENTER]` manually, so the fault handling as described in the >>> current patch should also be maintained. >> >> Why? > > Circling back to this question, what if we take the easy way out and > simply signal SIGSEGV without an SGX-specific code? I.e. treat #PF > with X86_PF_SGX as an access error, no more no less. That should be > sufficient for userspace to function, albeit with a little more effort, > but presumably no more than would be needed to run on SGX1 hardware. There are two sides to this ABI: what the kernel does to support SGX and what userspace does. If we do what you suggest, we remove any (most?) needed kernel changes and foist the burden entirely into userspace. But, we end up with two ABIs: the old one and the new vDSO one. IOW, once we start doing SIGSEGV, we have to do it forever, despite if we have a newer mechanism. > AFAIK there isn't a way to prevent userspace from manually invoking > EENTER, short of doing some really nasty text poking or PTE swizzling. > We could declare using EENTER as unsupported, Yep, userspace can call it all it wants, and we can also say that calling it outside the vdso is "undefined".
On 2018-10-31 14:35, Dave Hansen wrote: > On 10/31/18 2:30 PM, Sean Christopherson wrote: >> AFAIK there isn't a way to prevent userspace from manually invoking >> EENTER, short of doing some really nasty text poking or PTE swizzling. >> We could declare using EENTER as unsupported, > > Yep, userspace can call it all it wants, and we can also say that > calling it outside the vdso is "undefined". Is there a precedent for this? Are there any other ring 3 x86 instructions that Linux is claiming to be "undefined" when executed by a user process? -- Jethro Beekman | Fortanix
On 10/31/18 2:53 PM, Jethro Beekman wrote: > On 2018-10-31 14:35, Dave Hansen wrote: >> On 10/31/18 2:30 PM, Sean Christopherson wrote: >>> AFAIK there isn't a way to prevent userspace from manually invoking >>> EENTER, short of doing some really nasty text poking or PTE swizzling. >>> We could declare using EENTER as unsupported, >> >> Yep, userspace can call it all it wants, and we can also say that >> calling it outside the vdso is "undefined". > > Is there a precedent for this? Are there any other ring 3 x86 > instructions that Linux is claiming to be "undefined" when executed by a > user process? We did it for MPX. "Don't use MPX unless you first tell the kernel, or we'll eat your puppy."
> On Oct 31, 2018, at 2:58 PM, Dave Hansen <dave.hansen@intel.com> wrote: > >> On 10/31/18 2:53 PM, Jethro Beekman wrote: >>> On 2018-10-31 14:35, Dave Hansen wrote: >>>> On 10/31/18 2:30 PM, Sean Christopherson wrote: >>>> AFAIK there isn't a way to prevent userspace from manually invoking >>>> EENTER, short of doing some really nasty text poking or PTE swizzling. >>>> We could declare using EENTER as unsupported, >>> >>> Yep, userspace can call it all it wants, and we can also say that >>> calling it outside the vdso is "undefined". >> >> Is there a precedent for this? Are there any other ring 3 x86 >> instructions that Linux is claiming to be "undefined" when executed by a >> user process? > > We did it for MPX. "Don't use MPX unless you first tell the kernel, or > we'll eat your puppy." I think EENTER in plain user code should have well defined semantics, and it should get regular signals with the appropriate bits set in the error code field in the ucontext. But we should probably simultaneously offer a nicer API, and the libraries will use it because it’s nicer.
On Wed, 31 Oct 2018, Sean Christopherson wrote: > On Mon, Oct 01, 2018 at 03:03:30PM -0700, Dave Hansen wrote: >> On 10/01/2018 02:42 PM, Jethro Beekman wrote: >>> >>> 1) Even though the vDSO function exists, userspace may still call >>> `ENCLU[EENTER]` manually, so the fault handling as described in the >>> current patch should also be maintained. >> >> Why? > > Circling back to this question, what if we take the easy way out and > simply signal SIGSEGV without an SGX-specific code? I.e. treat #PF > with X86_PF_SGX as an access error, no more no less. That should be > sufficient for userspace to function, albeit with a little more effort, > but presumably no more than would be needed to run on SGX1 hardware. > > AFAIK there isn't a way to prevent userspace from manually invoking > EENTER, short of doing some really nasty text poking or PTE swizzling. > We could declare using EENTER as unsupported, but that seems like > cutting off the nose to spite the face. Supporting userspace EENTER > in a limited capacity would allow people to do whatever crazy tricks > they're wont to do without having to deal with absurd requests for > the vDSO interface. > > If we go this route we could also add the vDSO stuff after basic SGX > support is in mainline, obviously with approval from the powers that > be. > Yeah, this would give stable behavior when vDSO functions are not available. Here's a question: if we implement this behavior, could be upstream series without vDSO's first and after those changes have been landed we would continue with the vDSO's? /Jarkko
On Thu, 1 Nov 2018, Jarkko Sakkinen wrote: > On Wed, 31 Oct 2018, Sean Christopherson wrote: >> On Mon, Oct 01, 2018 at 03:03:30PM -0700, Dave Hansen wrote: >>> On 10/01/2018 02:42 PM, Jethro Beekman wrote: >>>> >>>> 1) Even though the vDSO function exists, userspace may still call >>>> `ENCLU[EENTER]` manually, so the fault handling as described in the >>>> current patch should also be maintained. >>> >>> Why? >> >> Circling back to this question, what if we take the easy way out and >> simply signal SIGSEGV without an SGX-specific code? I.e. treat #PF >> with X86_PF_SGX as an access error, no more no less. That should be >> sufficient for userspace to function, albeit with a little more effort, >> but presumably no more than would be needed to run on SGX1 hardware. >> >> AFAIK there isn't a way to prevent userspace from manually invoking >> EENTER, short of doing some really nasty text poking or PTE swizzling. >> We could declare using EENTER as unsupported, but that seems like >> cutting off the nose to spite the face. Supporting userspace EENTER >> in a limited capacity would allow people to do whatever crazy tricks >> they're wont to do without having to deal with absurd requests for >> the vDSO interface. >> >> If we go this route we could also add the vDSO stuff after basic SGX >> support is in mainline, obviously with approval from the powers that >> be. >> > > Yeah, this would give stable behavior when vDSO functions are not > available. > > Here's a question: if we implement this behavior, could be upstream > series without vDSO's first and after those changes have been landed > we would continue with the vDSO's? Right, it was in your last paragraph, sorry. Yeah, I fully support this idea. It will be easier also to work on the vDSO's once we have something landed (instead of working on a moving platform). /Jarkko
On 10/31/18 3:52 PM, Andy Lutomirski wrote: > I think EENTER in plain user code should have well defined semantics, > and it should get regular signals with the appropriate bits set in > the error code field in the ucontext. But we should probably > simultaneously offer a nicer API, and the libraries will use it > because it’s nicer. FWIW, if we have a signal-based version and a VDSO-based version, nobody will use the VDSO one. The Intel libraries are surely going to keep using the approach they've been using for years and I doubt their owners will be tempted even by a simpler interface to change one line of code. If we want to do the VDSO route, I think we probably need to go whole-hog and toss the signal-based one.
On Thu, Nov 1, 2018 at 10:51 AM Dave Hansen <dave.hansen@intel.com> wrote: > > On 10/31/18 3:52 PM, Andy Lutomirski wrote: > > I think EENTER in plain user code should have well defined semantics, > > and it should get regular signals with the appropriate bits set in > > the error code field in the ucontext. But we should probably > > simultaneously offer a nicer API, and the libraries will use it > > because it’s nicer. > > FWIW, if we have a signal-based version and a VDSO-based version, nobody > will use the VDSO one. > > The Intel libraries are surely going to keep using the approach they've > been using for years and I doubt their owners will be tempted even by a > simpler interface to change one line of code. > > If we want to do the VDSO route, I think we probably need to go > whole-hog and toss the signal-based one. That's a fair point. We don't want a situation where a widely-used SGX library registers a SIGSEGV handler.
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 85d20516b2f3..3fb2b2838d6c 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -960,10 +960,13 @@ static noinline void bad_area_access_error(struct pt_regs *regs, unsigned long error_code, unsigned long address, struct vm_area_struct *vma) { + int si_code = SEGV_ACCERR; + if (bad_area_access_from_pkeys(error_code, vma)) - __bad_area(regs, error_code, address, vma, SEGV_PKUERR); - else - __bad_area(regs, error_code, address, vma, SEGV_ACCERR); + si_code = SEGV_PKUERR; + else if (unlikely(error_code & X86_PF_SGX)) + si_code = SEGV_SGXERR; + __bad_area(regs, error_code, address, vma, si_code); } static void @@ -1153,6 +1156,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma) if (error_code & X86_PF_PK) return 1; + /* + * Access is blocked by the Enclave Page Cache Map (EPCM), + * i.e. the access is allowed by the PTE but not the EPCM. + * This usually happens when the EPCM is yanked out from + * under us, e.g. by hardware after a suspend/resume cycle. + * In any case, there is nothing that can be done by the + * kernel to resolve the fault (short of killing the task). + */ + if (unlikely(error_code & X86_PF_SGX)) + return 1; + /* * Make sure to check the VMA so that we do not perform * faults just to hit a X86_PF_PK as soon as we fill in a