Message ID | 20220920063948.3556917-4-zhiquan1.li@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | x86/sgx: fine grained SGX MCA behavior |
On 9/19/22 23:39, Zhiquan Li wrote:
> +	if (flags & MF_ACTION_REQUIRED) {
> +		/*
> +		 * Provide extra info to the task so that it can make further
> +		 * decision but not simply kill it. This is quite useful for
> +		 * virtualization case.
> +		 */

"Quite useful to the virtualization case", eh?

This code can only even be *run* in the virtualization case.

> +		if (page->flags & SGX_EPC_PAGE_KVM_GUEST) {
> +			/*
> +			 * The 'encl_owner' field is repurposed, when allocating EPC
> +			 * page it was assigned to the virtual address of virtual EPC
> +			 * page.
> +			 */

That comment is talking about history, which is kinda silly.  Let's just
focus on what the code is doing today.

> +			vaddr = (void *)((unsigned long)page->vepc_vaddr & PAGE_MASK);

First, why even align this?  What does it matter?

Second, is there a reason not to use PTR_ALIGN() here?

> +			ret = force_sig_mceerr(BUS_MCEERR_AR, vaddr, PAGE_SHIFT);
> +			if (ret < 0)
> +				pr_err("Memory failure: Error sending signal to %s:%d: %d\n",
> +					current->comm, current->pid, ret);
> +		} else
> +			force_sig(SIGBUS);
> +	}
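For reference, a minimal sketch of the two spellings in question.  The
open-coded mask rounds the pointer down to its page boundary, so the
matching kernel helper would be PTR_ALIGN_DOWN() rather than PTR_ALIGN();
this is illustrative only, and the reply below concludes the rounding is
unnecessary in the first place.

	void __user *vaddr;

	/* Open-coded page alignment, as in the patch: */
	vaddr = (void __user *)((unsigned long)page->vepc_vaddr & PAGE_MASK);

	/* Equivalent with the existing helper (PTR_ALIGN() would round up): */
	vaddr = PTR_ALIGN_DOWN((void __user *)page->vepc_vaddr, PAGE_SIZE);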
On 2022/10/11 07:20, Dave Hansen wrote:
> On 9/19/22 23:39, Zhiquan Li wrote:
>> +	if (flags & MF_ACTION_REQUIRED) {
>> +		/*
>> +		 * Provide extra info to the task so that it can make further
>> +		 * decision but not simply kill it. This is quite useful for
>> +		 * virtualization case.
>> +		 */
>
> "Quite useful to the virtualization case", eh?
>
> This code can only even be *run* in the virtualization case.
>

OK, I'll change it in V10.

>> +		if (page->flags & SGX_EPC_PAGE_KVM_GUEST) {
>> +			/*
>> +			 * The 'encl_owner' field is repurposed, when allocating EPC
>> +			 * page it was assigned to the virtual address of virtual EPC
>> +			 * page.
>> +			 */
>
> That comment is talking about history, which is kinda silly.  Let's just
> focus on what the code is doing today.
>

How about moving the comment above down to here?  It's talking about what
the code is doing today, so this comment can then be removed.

>> +			vaddr = (void *)((unsigned long)page->vepc_vaddr & PAGE_MASK);
>
> First, why even align this?  What does it matter?
>
> Second, is there a reason not to use PTR_ALIGN() here?
>

You're right, the alignment is unnecessary.  The host kernel just reports
the error address, aligned or not, and the guest kernel's MCA code will
convert it to a PFN.

Best Regards,
Zhiquan

>> +			ret = force_sig_mceerr(BUS_MCEERR_AR, vaddr, PAGE_SHIFT);
>> +			if (ret < 0)
>> +				pr_err("Memory failure: Error sending signal to %s:%d: %d\n",
>> +					current->comm, current->pid, ret);
>> +		} else
>> +			force_sig(SIGBUS);
>> +	}
>
On 9/19/22 23:39, Zhiquan Li wrote:
> Today, if a guest accesses an SGX EPC page with memory failure,
> the kernel behavior will kill the entire guest.  This blast
> radius is too large.  It would be idea to kill only the SGX

                                   ideal ^

> application inside the guest.
>
> To fix this, send a SIGBUS to host userspace (like QEMU) which can
> follow up by injecting a #MC to the guest.

This doesn't make any sense to me.  It's *ALREADY* sending a SIGBUS.
So, whatever is making this better, it's not "send a SIGBUS" that's
doing it.

What does this patch actually do to reduce the blast radius?

> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
> being shared by multiple VMs via fork().  However KVM doesn't support
> running a VM across multiple mm structures, and the de facto userspace
> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
> this should not happen.

This is out of the blue.  Why is this here?

What happens if a hypervisor *DOES* fork()?  What's the fallout?
On 2022/10/11 22:04, Dave Hansen wrote:
> On 9/19/22 23:39, Zhiquan Li wrote:
>> Today, if a guest accesses an SGX EPC page with memory failure,
>> the kernel behavior will kill the entire guest.  This blast
>> radius is too large.  It would be idea to kill only the SGX
>
> ideal ^
>
>> application inside the guest.
>>
>> To fix this, send a SIGBUS to host userspace (like QEMU) which can
>> follow up by injecting a #MC to the guest.
>
> This doesn't make any sense to me.  It's *ALREADY* sending a SIGBUS.
> So, whatever is making this better, it's not "send a SIGBUS" that's
> doing it.
>
> What does this patch actually do to reduce the blast radius?
>

Thanks for your attention, Dave.

This part comes from your comments previously:

https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#m6d62670eb530fab178eefaaaf31a22c4475e818d

The key is that the SIGBUS should be sent with code BUS_MCEERR_AR and the
virtual address of the virtual EPC page.  The hypervisor can identify it
by the specific code and inject a #MC into the guest.

Can we change the statement like this?

	To fix this, send a SIGBUS with code BUS_MCEERR_AR and virtual
	address of virtual EPC page to host userspace (like QEMU) which can
	follow up by injecting a #MC to the guest.

>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
>> being shared by multiple VMs via fork().  However KVM doesn't support
>> running a VM across multiple mm structures, and the de facto userspace
>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
>> this should not happen.
>
> This is out of the blue.  Why is this here?
>
> What happens if a hypervisor *DOES* fork()?  What's the fallout?

This part originates from the discussion below:

https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t

It intends to answer the question:

	Do you think the processes sharing the same enclave need to be
	killed, even they had not touched the EPC page with hardware error?

Dave, do you mean it's not appropriate to put it here?

Best Regards,
Zhiquan
On Wed, 2022-10-12 at 13:09 +0800, Zhiquan Li wrote: > > > SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance > > > being shared by multiple VMs via fork(). However KVM doesn't support > > > running a VM across multiple mm structures, and the de facto userspace > > > hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice > > > this should not happen. > > > > This is out of the blue. Why is this here? > > > > What happens if a hypervisor *DOES* fork()? What's the fallout? > > This part originates from below discussion: > > https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t > > It intents to answer the question: > > Do you think the processes sharing the same enclave need to be > killed, even they had not touched the EPC page with hardware error? Sharing virtual EPC instance will very likely unexpectedly break enclaves in all VMs. Whether kernel should explicitly prevent is another topic. To me I don't see strong reason to enforce in the kernel. For instance, multiple VMs can map the same file as memory backend with MAP_SHARED, in which case they can all break. Userspace should use virtual EPC in the right way. But the point is above is not directly related to your patch. On host where multiple processes can share one enclave legally, it does the same thing. I think you can just remove that paragraph from changelog.
On Wed, Oct 12, 2022 at 11:01:49AM +0000, Huang, Kai wrote: > On Wed, 2022-10-12 at 13:09 +0800, Zhiquan Li wrote: > > > > SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance > > > > being shared by multiple VMs via fork(). However KVM doesn't support > > > > running a VM across multiple mm structures, and the de facto userspace > > > > hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice > > > > this should not happen. > > > > > > This is out of the blue. Why is this here? > > > > > > What happens if a hypervisor *DOES* fork()? What's the fallout? > > > > This part originates from below discussion: > > > > https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t > > > > It intents to answer the question: > > > > Do you think the processes sharing the same enclave need to be > > killed, even they had not touched the EPC page with hardware error? > > Sharing virtual EPC instance will very likely unexpectedly break enclaves in all > VMs. Whether kernel should explicitly prevent is another topic. To me I don't > see strong reason to enforce in the kernel. For instance, multiple VMs can map > the same file as memory backend with MAP_SHARED, in which case they can all > break. Userspace should use virtual EPC in the right way. Broadly speaking, for most of the time, and for any topic, kernel should not prevent anything, unless it can break kernel's internal state. > But the point is above is not directly related to your patch. On host where > multiple processes can share one enclave legally, it does the same thing. I > think you can just remove that paragraph from changelog. > > -- > Thanks, > -Kai > > BR, Jarkko
On 10/11/22 22:09, Zhiquan Li wrote:
>>> application inside the guest.
>>>
>>> To fix this, send a SIGBUS to host userspace (like QEMU) which can
>>> follow up by injecting a #MC to the guest.
>>
>> This doesn't make any sense to me.  It's *ALREADY* sending a SIGBUS.
>> So, whatever is making this better, it's not "send a SIGBUS" that's
>> doing it.
>>
>> What does this patch actually do to reduce the blast radius?
>
> Thanks for your attention, Dave.
>
> This part comes from your comments previously:
>
> https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#m6d62670eb530fab178eefaaaf31a22c4475e818d
>
> The key is the SIGBUS should with code BUS_MCEERR_AR and virtual address
> of virtual EPC page.  Hypervisor can identify it with the specific code
> and inject #MC to the guest.
>
> Can we change the statement like this?
>
> 	To fix this, send a SIGBUS with code BUS_MCEERR_AR and virtual
> 	address of virtual EPC page to host userspace (like QEMU) which can
> 	follow up by injecting a #MC to the guest.

This is really just mechanically restating what the patch does.  It
doesn't help me understand how it achieves the goal.  I guess I'll just
go and write the changelog for you.  Here's what I was missing:

	There is already a signal-based ABI to tell userspace about
	machine checks.  But, SGX does not use that ABI.  Today, the
	kernel delivers a generic SIGBUS if a machine check occurs when
	accessing SGX memory.  Userspace can not differentiate that
	SIGBUS from <add example here>, so it is very unlikely to be
	able to recover from the signal and the app will die.

	To fix this, have the SGX machine check code generate a SIGBUS
	which leverages the existing BUS_MCEERR_AR ABI.  This enlightens
	userspace about why the SIGBUS was generated and gives it a
	chance of being able to handle the signal.

	QEMU, for instance, has code to handle these BUS_MCEERR_AR
	signals today.  Without this patch, QEMU will just die in the
	face of a generic SIGBUS, and take the whole VM with it.  With
	this patch <explain what QEMU actually does here>.

	In short, BUS_MCEERR_AR enables QEMU to reduce the blast radius
	down from the whole QEMU process to a single page.

This patch doesn't *actually* reduce the blast radius.  It enables QEMU
to do that.

>>>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
>>>> being shared by multiple VMs via fork().  However KVM doesn't support
>>>> running a VM across multiple mm structures, and the de facto userspace
>>>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
>>>> this should not happen.
>>>
>>> This is out of the blue.  Why is this here?
>>>
>>> What happens if a hypervisor *DOES* fork()?  What's the fallout?
>>
>> This part originates from below discussion:
>>
>> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
>>
>> It intents to answer the question:
>>
>> 	Do you think the processes sharing the same enclave need to be
>> 	killed, even they had not touched the EPC page with hardware error?
>>
>> Dave, do you mean it's not appropriate to be put here?

It's actually a pretty important point, but it's still out of the blue.

You also didn't answer my question.
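For context on the BUS_MCEERR_AR ABI referenced above, the sketch below
shows how a host userspace process (a QEMU-like VMM; this is not QEMU's
actual code) can distinguish an action-required machine check from a
generic SIGBUS via siginfo.  The handler body is illustrative only; a
real VMM would translate the reported host virtual address to a guest
physical address and inject a #MC instead of printing.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	if (info->si_code == BUS_MCEERR_AR) {
		/*
		 * Action-required memory error: si_addr is the poisoned host
		 * virtual address, si_addr_lsb its granularity (PAGE_SHIFT
		 * for a whole poisoned page).  A VMM would translate the HVA
		 * to a GPA here and inject a #MC into the guest.
		 */
		fprintf(stderr, "recoverable MCE at %p, lsb %d\n",
			info->si_addr, info->si_addr_lsb);
		return;
	}

	/* Generic SIGBUS (BUS_ADRALN, BUS_ADRERR, BUS_OBJERR, ...). */
	abort();
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGBUS, &sa, NULL);

	/* ... run the VM / touch guest memory here ... */
	return 0;
}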
On Wed, 2022-10-12 at 14:54 +0300, jarkko@kernel.org wrote:
> > Sharing virtual EPC instance will very likely unexpectedly break enclaves
> > in all VMs.  Whether kernel should explicitly prevent is another topic.
> > To me I don't see strong reason to enforce in the kernel.  For instance,
> > multiple VMs can map the same file as memory backend with MAP_SHARED, in
> > which case they can all break.  Userspace should use virtual EPC in the
> > right way.
>
> Broadly speaking, for most of the time, and for any topic, kernel should
> not prevent anything, unless it can break kernel's internal state.

Good to know.  Thank you.
On 2022/10/12 19:01, Huang, Kai wrote: > On Wed, 2022-10-12 at 13:09 +0800, Zhiquan Li wrote: >>>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance >>>> being shared by multiple VMs via fork(). However KVM doesn't support >>>> running a VM across multiple mm structures, and the de facto userspace >>>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice >>>> this should not happen. >>> >>> This is out of the blue. Why is this here? >>> >>> What happens if a hypervisor *DOES* fork()? What's the fallout? >> >> This part originates from below discussion: >> >> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t >> >> It intents to answer the question: >> >> Do you think the processes sharing the same enclave need to be >> killed, even they had not touched the EPC page with hardware error? > > Sharing virtual EPC instance will very likely unexpectedly break enclaves in all > VMs. Whether kernel should explicitly prevent is another topic. To me I don't > see strong reason to enforce in the kernel. For instance, multiple VMs can map > the same file as memory backend with MAP_SHARED, in which case they can all > break. Userspace should use virtual EPC in the right way. > > But the point is above is not directly related to your patch. On host where > multiple processes can share one enclave legally, it does the same thing. I > think you can just remove that paragraph from changelog. > OK, I'll remove it since V10. Thank you all the same, Kai.
On 2022/10/12 22:36, Dave Hansen wrote:
> This is really just mechanically restating what the patch does.  It
> doesn't help me understand how it achieves the goal.  I guess I'll just
> go and write the changelog for you.  Here's what I was missing:
>
> 	There is already a signal-based ABI to tell userspace about
> 	machine checks.  But, SGX does not use that ABI.  Today, the
> 	kernel delivers a generic SIGBUS if a machine check occurs when
> 	accessing SGX memory.  Userspace can not differentiate that
> 	SIGBUS from <add example here>, so it is very unlikely to be
> 	able to recover from the signal and the app will die.
>
> 	To fix this, have the SGX machine check code generate a SIGBUS
> 	which leverages the existing BUS_MCEERR_AR ABI.  This enlightens
> 	userspace about why the SIGBUS was generated and gives it a
> 	chance of being able to handle the signal.
>
> 	QEMU, for instance, has code to handle these BUS_MCEERR_AR
> 	signals today.  Without this patch, QEMU will just die in the
> 	face of a generic SIGBUS, and take the whole VM with it.  With
> 	this patch <explain what QEMU actually does here>.
>
> 	In short, BUS_MCEERR_AR enables QEMU to reduce the blast radius
> 	down from the whole QEMU process to a single page.
>
> This patch doesn't *actually* reduce the blast radius.  It enables QEMU
> to do that.
>

Perfect, Dave!  I'm ashamed to have to get your hands dirty in the end.

I'd like to replace the angle-bracket content as below, could you have a
look?

	There is already a signal-based ABI to tell userspace about
	machine checks.  But, SGX does not use that ABI.  Today, the
	kernel delivers a generic SIGBUS if a machine check occurs when
	accessing SGX memory.  Userspace can not differentiate that
	SIGBUS from <a recoverable machine check error>, so it is very
	unlikely to be able to recover from the signal and the app will
	die.

	...

	QEMU, for instance, has code to handle these BUS_MCEERR_AR
	signals today.  Without this patch, QEMU will just die in the
	face of a generic SIGBUS, and take the whole VM with it.  With
	this patch <QEMU will be aware of the machine check error and
	retrieve the virtual address (HVA) from siginfo, then convert it
	to GPA and inject a #MC into VM.  Guest kernel will handle this
	#MC, find and kill the task who had accessed the error memory.>

	...

>>>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
>>>> being shared by multiple VMs via fork().  However KVM doesn't support
>>>> running a VM across multiple mm structures, and the de facto userspace
>>>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
>>>> this should not happen.
>>>
>>> This is out of the blue.  Why is this here?
>>>
>>> What happens if a hypervisor *DOES* fork()?  What's the fallout?
>>
>> This part originates from below discussion:
>>
>> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
>>
>> It intents to answer the question:
>>
>> 	Do you think the processes sharing the same enclave need to be
>> 	killed, even they had not touched the EPC page with hardware error?
>>
>> Dave, do you mean it's not appropriate to be put here?
>
> It's actually a pretty important point, but it's still out of the blue.
>
> You also didn't answer my question.

Oh, sorry, I focused on answering "Why is this here?" but forgot to
answer "What's the fallout?"

It's a very good question.

Looks like Kai had answered it before:

	For instance, an application can open /dev/sgx_vepc, mmap(), and
	fork().  Then if the child does mmap() against the fd opened by
	the parent, the child will share the same vepc with the parent.

	...

	Sharing virtual EPC instance will very likely unexpectedly break
	enclaves in all VMs.

	...

https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t

Moreover, I checked the code in the scenarios below, with the child
sharing the virtual EPC instance with the parent:
- child closes /dev/sgx_vepc earlier than parent.
- child re-mmap() /dev/sgx_vepc to the same memory region as parent.
- child mmap() /dev/sgx_vepc to different memory region.
- child accesses the memory region of mmap() inherited from parent.

It's just that the app messes itself up; the vepc instance is protected
very well.

Maybe there are other corner cases I've not considered?

Best Regards,
Zhiquan
On 10/13/22 07:40, Zhiquan Li wrote:
> 	There is already a signal-based ABI to tell userspace about
> 	machine checks.  But, SGX does not use that ABI.  Today, the
> 	kernel delivers a generic SIGBUS if a machine check occurs when
> 	accessing SGX memory.  Userspace can not differentiate that
> 	SIGBUS from <a recoverable machine check error>,

I wanted an example here of a SIGBUS that can occur from sources *OTHER*
than a recoverable machine check.
On 10/13/22 07:40, Zhiquan Li wrote: >>>> What happens if a hypervisor *DOES* fork()? What's the fallout? >>> This part originates from below discussion: >>> >>> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t >>> >>> It intents to answer the question: >>> >>> Do you think the processes sharing the same enclave need to be >>> killed, even they had not touched the EPC page with hardware error? >>> >>> Dave, do you mean it's not appropriate to be put here? >> It's actually a pretty important point, but it's still out of the blue. >> >> You also didn't answer my question. > Oh, sorry, I focused on answering "Why is this here?" but forgot to > answer "What's the fallout?" > > It's a very good question. > > Looks like Kai had answered it before: > > For instance, an application can open /dev/sgx_vepc, mmap(), and > fork(). Then if the child does mmap() against the fd opened by > the parent, the child will share the same vepc with the parent. > > ... > > Sharing virtual EPC instance will very likely unexpectedly break > enclaves in all VMs. How, though? This basically says, "I think things will break." I want to know a few more specifics than that before we merge this. There are ABI implications. > https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t > > Moreover, I checked the code in below scenario while child sharing the > virtual EPC instance with parent: > - child closes /dev/sgx_vepc earlier than parent. > - child re-mmap() /dev/sgx_vepc to the same memory region as parent. > - child mmap() /dev/sgx_vepc to different memory region. > - child accesses the memory region of mmap() inherited from parent. > > It's just that the app messes itself up, the vepc instance is protected > very well. > Maybe there are other corner cases I've not considered? ... and what happens when *THIS* patch is in play? What if there is a machine check in SGX memory?
On Thu, 2022-10-13 at 08:44 -0700, Dave Hansen wrote:
> On 10/13/22 07:40, Zhiquan Li wrote:
> > > > > What happens if a hypervisor *DOES* fork()?  What's the fallout?
> > > > This part originates from below discussion:
> > > >
> > > > https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> > > >
> > > > It intents to answer the question:
> > > >
> > > > 	Do you think the processes sharing the same enclave need to be
> > > > 	killed, even they had not touched the EPC page with hardware error?
> > > >
> > > > Dave, do you mean it's not appropriate to be put here?
> > > It's actually a pretty important point, but it's still out of the blue.
> > >
> > > You also didn't answer my question.
> > Oh, sorry, I focused on answering "Why is this here?" but forgot to
> > answer "What's the fallout?"
> >
> > It's a very good question.
> >
> > Looks like Kai had answered it before:
> >
> > 	For instance, an application can open /dev/sgx_vepc, mmap(), and
> > 	fork().  Then if the child does mmap() against the fd opened by
> > 	the parent, the child will share the same vepc with the parent.
> >
> > 	...
> >
> > 	Sharing virtual EPC instance will very likely unexpectedly break
> > 	enclaves in all VMs.
>
> How, though?  This basically says, "I think things will break."  I want
> to know a few more specifics than that before we merge this.  There are
> ABI implications.

This is because virtual EPC is just a raw resource to the guest, and how
the guest uses virtual EPC to run enclaves is completely controlled by
the guest.  When virtual EPC is shared among VMs, it can happen that one
guest _thinks_ one EPC page is still free but in fact it has already been
used by another VM as a valid enclave page.  Also, one VM can just
unconditionally sanitize all EPC pages before using any of them (just
like Linux does).  All of those can cause unexpected SGX errors, which
can lead to failing to create enclaves, and/or breaking existing enclaves
running in all VMs.

> > https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> >
> > Moreover, I checked the code in below scenario while child sharing the
> > virtual EPC instance with parent:
> > - child closes /dev/sgx_vepc earlier than parent.
> > - child re-mmap() /dev/sgx_vepc to the same memory region as parent.
> > - child mmap() /dev/sgx_vepc to different memory region.
> > - child accesses the memory region of mmap() inherited from parent.
> >
> > It's just that the app messes itself up, the vepc instance is protected
> > very well.
> > Maybe there are other corner cases I've not considered?
>
> ... and what happens when *THIS* patch is in play?  What if there is a
> machine check in SGX memory?

With this patch, when a #MC happens on one virtual EPC page, it will be
sent to the VM, and the behaviour inside a VM depends on the guest's
implementation.  But anyway, based on Tony's reply, the enclave will be
marked as "bad" by hardware and it will eventually be killed:

https://lore.kernel.org/linux-sgx/55ffd9475f5d46f68dd06c4323bec871@intel.com/
https://lore.kernel.org/linux-sgx/5b6ad3e2af614caf9b41092797ffcd86@intel.com/

If the virtual EPC is shared by other VMs, the worst case is that when
other VMs use this bad EPC page (as we cannot take the bad EPC page out
of the VM for now), some SGX error (ENCLS/ENCLU error) or #MC could
happen again.  But this doesn't make things worse, as when sharing
virtual EPC pages among VMs you are likely to break enclaves in VMs
anyway (as mentioned above).
On 10/13/22 14:49, Huang, Kai wrote:
>> ... and what happens when *THIS* patch is in play?  What if there is a
>> machine check in SGX memory?
> With this patch, when #MC happens on one virtual EPC page, it will be send to
> the VM, and the behaviour inside a VM depends on guest's implementation.  But
> anyway based on Tony's reply, the enclave will be marked as "bad" by hardware
> and it will eventually be killed:
>
> https://lore.kernel.org/linux-sgx/55ffd9475f5d46f68dd06c4323bec871@intel.com/
> https://lore.kernel.org/linux-sgx/5b6ad3e2af614caf9b41092797ffcd86@intel.com/
>
> If the virtual EPC is shared by other VMs, the worst case is when other VMs use
> this bad EPC page (as we cannot take the bad EPC page out of VM for now), some
> SGX error (ENCLS/ENCLU error) or #MC could happen again.  But this doesn't make
> things worse, as when sharing virtual EPC page among VMs you are likely to break
> enclaves in VMs anyway (as mentioned above).

Guys, maybe I'm just not being specific enough.  But, Kai, you're off on
a tangent here that I didn't intend to ask about.  You're actually adding
confusion and delay here, not helping answer the question.

This was specifically a question about fork() plus VEPC plus machine
checks.  I'm still trying to get to the bottom of:

	What happens if a hypervisor *DOES* fork()?  What's the fallout?

Specifically, with this machine check SIGBUS implementation, one EPC
page can only have one host virtual address.  But, it's possible to
mmap() a single VEPC page in multiple processes at multiple host virtual
addresses.

So, the implementation only allows one host virtual address.  But, we
can imagine a scenario where there are *TWO* valid virtual addresses
where the VEPC is mapped.

What might happen in this case?  What's the fallout?
On Thu, 2022-10-13 at 15:02 -0700, Hansen, Dave wrote:
> Specifically, with this machine check SIGBUS implementation, one EPC
> page can only have on host virtual address.  But, it's possible to
> mmap() a single VEPC page in multiple processes at multiple host virtual
> addresses.
>
> So, the implementation only allows one host virtual address.  But, we
> can imagine a scenario where there are *TWO* valid virtual addresses
> where the VEPC is mapped.
>
> What might happen in this case?  What's the fallout?

My understanding is there can be two #MCs happening on two virtual
addresses.

Each #MC will be injected into the VM which triggered it, for that VM to
handle.  Today, both VMs get killed.
On 10/13/22 15:15, Huang, Kai wrote:
> On Thu, 2022-10-13 at 15:02 -0700, Hansen, Dave wrote:
>> Specifically, with this machine check SIGBUS implementation, one EPC
>> page can only have on host virtual address.  But, it's possible to
>> mmap() a single VEPC page in multiple processes at multiple host virtual
>> addresses.
>>
>> So, the implementation only allows one host virtual address.  But, we
>> can imagine a scenario where there are *TWO* valid virtual addresses
>> where the VEPC is mapped.
>>
>> What might happen in this case?  What's the fallout?
>
> My understanding is there can be two #MC happening on two virtual addresses.
>
> Each #MC will be injected to the VM which triggers that #MC to handle.

OK, but which address will each of them see in siginfo->si_addr?  There's
only one ->vepc_vaddr.

It seems to me like this might result in the siginfo getting populated
with the ->si_addr from the *OTHER* process and *OTHER* virtual address.
Basically, whoever allocates the page sets up *their* virtual address.
Anyone else who gets a machine check on that page will see a virtual
address that has nothing to do with its virtual address space.

Is that problematic?
On Thu, 2022-10-13 at 15:28 -0700, Dave Hansen wrote:
> On 10/13/22 15:15, Huang, Kai wrote:
> > On Thu, 2022-10-13 at 15:02 -0700, Hansen, Dave wrote:
> > > Specifically, with this machine check SIGBUS implementation, one EPC
> > > page can only have on host virtual address.  But, it's possible to
> > > mmap() a single VEPC page in multiple processes at multiple host virtual
> > > addresses.
> > >
> > > So, the implementation only allows one host virtual address.  But, we
> > > can imagine a scenario where there are *TWO* valid virtual addresses
> > > where the VEPC is mapped.
> > >
> > > What might happen in this case?  What's the fallout?
> >
> > My understanding is there can be two #MC happening on two virtual addresses.
> >
> > Each #MC will be injected to the VM which triggers that #MC to handle.
>
> OK but which address with each of them see in siginfo->si_addr?  There's
> only one ->vepc_vaddr.
>
> It seems to me like this might result in the siginfo getting populated
> with the ->si_addr from the *OTHER* process and *OTHER* virtual address.
> Basically, whoever allocates the page sets up *their* virtual address.
> Anyone else who gets a machine check on that page will see a virtual
> address that has nothing to do with its virtual address space.
>
> Is that problematic?

Thanks Dave.  I see the problem now.  I already forgot the reason that I
said "whether to add a comment is good enough" in my original reply.

So, with fork() allowed, putting the 'virtual address' into the page's
owner doesn't seem right.  I think perhaps we have two options:

1) Enforce in the kernel that virtual EPC doesn't support fork(), so the
page's owner can be used to carry the 'virtual address'.

2) In arch_memory_failure(), find the virtual address based on the
current process -- but I am not entirely sure whether that can work,
i.e. whether it is ever called from interrupt context.

What's your suggestion?
On 10/13/22 16:40, Huang, Kai wrote:
> What's your suggestion?
I'm fresh out of suggestions.
On Thu, 2022-10-13 at 16:57 -0700, Dave Hansen wrote: > On 10/13/22 16:40, Huang, Kai wrote: > > What's your suggestion? > > I'm fresh out of suggestions. Thanks Dave. I'll dig more and find out a solution.
On 10/13/22 22:42, Zhiquan Li wrote: > On 2022/10/13 23:39, Dave Hansen wrote: >> I wanted an example here of a SIGBUS that can occur from sources *OTHER* >> than a recoverable machine check. > Oh, got it, it should like this: > > There is already a signal-based ABI to tell userspace about > machine checks. But, SGX does not use that ABI. Today, the > kernel delivers a generic SIGBUS if a machine check occurs when > accessing SGX memory. Userspace can not differentiate that > SIGBUS from invalid address alignment, nonexistent physical > address or object-specific hardware error, so it is very > unlikely to be able to recover from the signal and the app will > die. That looks much better, thanks.
On 2022/10/13 23:39, Dave Hansen wrote:
> I wanted an example here of a SIGBUS that can occur from sources *OTHER*
> than a recoverable machine check.

Oh, got it, it should look like this:

	There is already a signal-based ABI to tell userspace about
	machine checks.  But, SGX does not use that ABI.  Today, the
	kernel delivers a generic SIGBUS if a machine check occurs when
	accessing SGX memory.  Userspace can not differentiate that
	SIGBUS from invalid address alignment, nonexistent physical
	address or object-specific hardware error, so it is very
	unlikely to be able to recover from the signal and the app will
	die.

Thanks for the clarification, Dave.

Best Regards,
Zhiquan
On Fri, 2022-10-14 at 00:19 +0000, Huang, Kai wrote:
> On Thu, 2022-10-13 at 16:57 -0700, Dave Hansen wrote:
> > On 10/13/22 16:40, Huang, Kai wrote:
> > > What's your suggestion?
> >
> > I'm fresh out of suggestions.
>
> Thanks Dave.  I'll dig more and find out a solution.
>

Hi Dave,

Not related to this series, but I think there's a bug in the current
virtual EPC driver's page fault handler in the case of fork() (sorry, I
didn't notice it in the first place):

static int __sgx_vepc_fault(struct sgx_vepc *vepc,
			    struct vm_area_struct *vma, unsigned long addr)
{
	......
	/* Calculate index of EPC page in virtual EPC's page_array */
	index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);

	epc_page = xa_load(&vepc->page_array, index);
	if (epc_page)
		return 0;

	...
}

As you can see, if the EPC page has already been populated at a given
index of one virtual EPC instance, the current fault handler just assumes
the mapping is already there and returns success immediately.  This
causes a bug when one virtual EPC instance is shared by multiple
processes via fork(): if the EPC page at one index is already populated
by the parent process, when the child accesses the same page using a
different virtual address, the fault handler just returns success w/o
actually setting up the mapping for the child, resulting in an endless
page fault.

This needs to be fixed in no matter what way.

My thinking is we can just enforce in the kernel that one virtual EPC
instance can only be mmap()-ed by the process which opens /dev/sgx_vepc
(a new virtual EPC instance is created on each open).  KVM userspace
should never use fork() to share a virtual EPC instance, otherwise
userspace is using it in the wrong way.  Thus enforcing this in the
kernel won't break any KVM userspace.

Does this make sense?  If it is OK to you, I'll send out a patch to fix.
On Wed, Oct 19, 2022 at 10:59:20AM +0000, Huang, Kai wrote: > static int __sgx_vepc_fault(struct sgx_vepc *vepc, > struct vm_area_struct *vma, unsigned long addr) > { > ...... > /* Calculate index of EPC page in virtual EPC's page_array */ > index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start); > > epc_page = xa_load(&vepc->page_array, index); > if (epc_page) > return 0; > > ... > } > > As you can see if the EPC page has already been populated at a given index of > one virtual EPC instance, the current fault handler just assumes the mapping is > already there and returns success immediately. This causes a bug when one > virtual EPC instance is shared by multi processes via fork(): if the EPC page at > one index is already populated by the parent process, when the child accesses > the same page using different virtual address, the fault handler just returns > success w/o actually setting up the mapping for the child, resulting in endless > page fault. > > This needs to be fixed in no matter what way. I think you mean that vm_insert_pfn() does not happen for child because of early return? I did not understand the part about "different virtual addresses", as it is the same mapping. R, Jarkko
On 2022/10/24 04:39, jarkko@kernel.org wrote:
>> As you can see if the EPC page has already been populated at a given index of
>> one virtual EPC instance, the current fault handler just assumes the mapping is
>> already there and returns success immediately.  This causes a bug when one
>> virtual EPC instance is shared by multi processes via fork(): if the EPC page at
>> one index is already populated by the parent process, when the child accesses
>> the same page using different virtual address, the fault handler just returns
>> success w/o actually setting up the mapping for the child, resulting in endless
>> page fault.
>>
>> This needs to be fixed in no matter what way.
> I think you mean that vm_insert_pfn() does not happen for child because
> of early return?  I did not understand the part about "different virtual
> addresses", as it is the same mapping.
>

If userspace does something like this, the child will get a "different
virtual address":

	... parent runs enclave within VM ...
	if (fork() == 0) {
		int *caddr = mmap(NULL, 4096, PROT_READ, MAP_SHARED, vepc_fd, 0);
		printf("child: %d\n", caddr[0]);
	}

- "vepc_fd" is inherited from the parent which had opened /dev/sgx_vepc.
- mmap() will create a VMA in the child with a "different virtual address".
- "caddr[0]" will cause a page fault as it's a new mapping.

1. Then the kernel will run into the code snippet referenced by Kai.
2. The early return 0 will result in sgx_vepc_fault() returning
   VM_FAULT_NOPAGE.
3. This return value will make the condition true in
   do_user_addr_fault():

	if (likely(!(fault & VM_FAULT_ERROR)))
		return;

4. Since this page fault has not been handled and "$RIP" is still the
   original value, it will result in the same page fault again.  Namely,
   it's an endless page fault.

But the problem is neither the early return in __sgx_vepc_fault() nor
the return of VM_FAULT_NOPAGE in sgx_vepc_fault().  The root cause has
been highlighted by Kai: one virtual EPC instance can only be mmap()-ed
by the process which opens /dev/sgx_vepc.

In fact, sharing a virtual EPC instance in userspace doesn't make any
sense.  Even though it can be shared by the child, the virtual EPC page
cannot be used by the child correctly.

Best Regards,
Zhiquan
On Sun, 2022-10-23 at 23:39 +0300, jarkko@kernel.org wrote: > On Wed, Oct 19, 2022 at 10:59:20AM +0000, Huang, Kai wrote: > > static int __sgx_vepc_fault(struct sgx_vepc *vepc, > > struct vm_area_struct *vma, unsigned long addr) > > { > > ...... > > /* Calculate index of EPC page in virtual EPC's page_array */ > > index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start); > > > > epc_page = xa_load(&vepc->page_array, index); > > if (epc_page) > > return 0; > > > > ... > > } > > > > As you can see if the EPC page has already been populated at a given index of > > one virtual EPC instance, the current fault handler just assumes the mapping is > > already there and returns success immediately. This causes a bug when one > > virtual EPC instance is shared by multi processes via fork(): if the EPC page at > > one index is already populated by the parent process, when the child accesses > > the same page using different virtual address, the fault handler just returns > > success w/o actually setting up the mapping for the child, resulting in endless > > page fault. > > > > This needs to be fixed in no matter what way. > > I think you mean that vm_insert_pfn() does not happen for child because > of early return? > Yes exactly. Sorry for not pointing out directly. > I did not understand the part about "different virtual > addresses", as it is the same mapping. The child can use mmap() to get a new mapping. Whether the virtual address is different from the parent's doesn't matter actually.
On Mon, Oct 24, 2022 at 09:32:13AM +0800, Zhiquan Li wrote: > > > On 2022/10/24 04:39, jarkko@kernel.org wrote: > >> As you can see if the EPC page has already been populated at a given index of > >> one virtual EPC instance, the current fault handler just assumes the mapping is > >> already there and returns success immediately. This causes a bug when one > >> virtual EPC instance is shared by multi processes via fork(): if the EPC page at > >> one index is already populated by the parent process, when the child accesses > >> the same page using different virtual address, the fault handler just returns > >> success w/o actually setting up the mapping for the child, resulting in endless > >> page fault. > >> > >> This needs to be fixed in no matter what way. > > I think you mean that vm_insert_pfn() does not happen for child because > > of early return? I did not understand the part about "different virtual > > addresses", as it is the same mapping. > > > > If userspace do something like this, the child will get "different > virtual address": > > ... parent run enclave within VM ... > if (fork() == 0) { > int *caddr = mmap(NULL, 4096, PROT_READ, MAP_SHARED, vepc_fd, 0); > printf("child: %d\n", caddr[0]); > } > > > - "vepc_fd" is inherited from parent which had opened /dev/sgx_vepc. > - mmap() will create a VMA in child with "different virtual addresses". > - "caddr[0]" will cause a page fault as it's a new mapping. > > 1. Then kernel will run into the code snippet referenced by Kai. > 2. The early return 0 will result in sgx_vepc_fault() return > "VM_FAULT_NOPAGE". > 3. This return value will make the condition as true at > function do_user_addr_fault() > > if (likely(!(fault & VM_FAULT_ERROR))) > return; > > 4. Since this page fault has not been handled and "$RIP" is still the > original value, it will result in the same page fault again. Namely, > it's an endless page fault. > > But the problem is neither the early return in __sgx_vepc_fault() nor > the return of VM_FAULT_NOPAGE at sgx_vepc_fault(). The root cause has > been highlighted by Kai, one virtual EPC instance > can only be mmap()-ed by the process which opens /dev/sgx_vepc. > > In fact, to share a virtual EPC instance in userspace doesn't make any > sense. Even though it can be shared by child, the virtual EPC page > cannot be used by child correctly. OK, makes sense, thanks for the explanation! Why would we want to enforce for user space not to do this, even if it does cause malfunctioning program? BR, Jarkko
On Mon, Oct 24, 2022 at 10:23:10PM +0000, Huang, Kai wrote: > On Sun, 2022-10-23 at 23:39 +0300, jarkko@kernel.org wrote: > > On Wed, Oct 19, 2022 at 10:59:20AM +0000, Huang, Kai wrote: > > > static int __sgx_vepc_fault(struct sgx_vepc *vepc, > > > struct vm_area_struct *vma, unsigned long addr) > > > { > > > ...... > > > /* Calculate index of EPC page in virtual EPC's page_array */ > > > index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start); > > > > > > epc_page = xa_load(&vepc->page_array, index); > > > if (epc_page) > > > return 0; > > > > > > ... > > > } > > > > > > As you can see if the EPC page has already been populated at a given index of > > > one virtual EPC instance, the current fault handler just assumes the mapping is > > > already there and returns success immediately. This causes a bug when one > > > virtual EPC instance is shared by multi processes via fork(): if the EPC page at > > > one index is already populated by the parent process, when the child accesses > > > the same page using different virtual address, the fault handler just returns > > > success w/o actually setting up the mapping for the child, resulting in endless > > > page fault. > > > > > > This needs to be fixed in no matter what way. > > > > I think you mean that vm_insert_pfn() does not happen for child because > > of early return? > > > > Yes exactly. Sorry for not pointing out directly. Np. > > > I did not understand the part about "different virtual > > addresses", as it is the same mapping. > > The child can use mmap() to get a new mapping. Whether the virtual address is > different from the parent's doesn't matter actually. Thanks for the response, I had one additional query responded to Zhiquan. BR, Jarkko
On 2022/11/1 08:46, jarkko@kernel.org wrote:
> Why would we want to enforce for user space not to do this, even
> if it does cause malfunctioning program?
>

We want to resolve the problem at the source rather than just deal with
the symptom passively derived from it.  For instance, we might be able to
return VM_FAULT_SIGBUS to kill the malicious application, but if the
malicious child touches the memory earlier than the parent, despite not
being able to use the virtual EPC page, then the parent will be the
victim.

Even though it's not a security threat, there is no practical
significance in sharing a virtual EPC instance.  So we would like to
prevent it from the beginning.

Best Regards,
Zhiquan

> BR, Jarkko
On Tue, 2022-11-01 at 02:46 +0200, jarkko@kernel.org wrote: > On Mon, Oct 24, 2022 at 09:32:13AM +0800, Zhiquan Li wrote: > > > > > > On 2022/10/24 04:39, jarkko@kernel.org wrote: > > > > As you can see if the EPC page has already been populated at a given index of > > > > one virtual EPC instance, the current fault handler just assumes the mapping is > > > > already there and returns success immediately. This causes a bug when one > > > > virtual EPC instance is shared by multi processes via fork(): if the EPC page at > > > > one index is already populated by the parent process, when the child accesses > > > > the same page using different virtual address, the fault handler just returns > > > > success w/o actually setting up the mapping for the child, resulting in endless > > > > page fault. > > > > > > > > This needs to be fixed in no matter what way. > > > I think you mean that vm_insert_pfn() does not happen for child because > > > of early return? I did not understand the part about "different virtual > > > addresses", as it is the same mapping. > > > > > > > If userspace do something like this, the child will get "different > > virtual address": > > > > ... parent run enclave within VM ... > > if (fork() == 0) { > > int *caddr = mmap(NULL, 4096, PROT_READ, MAP_SHARED, vepc_fd, 0); > > printf("child: %d\n", caddr[0]); > > } > > > > > > - "vepc_fd" is inherited from parent which had opened /dev/sgx_vepc. > > - mmap() will create a VMA in child with "different virtual addresses". > > - "caddr[0]" will cause a page fault as it's a new mapping. > > > > 1. Then kernel will run into the code snippet referenced by Kai. > > 2. The early return 0 will result in sgx_vepc_fault() return > > "VM_FAULT_NOPAGE". > > 3. This return value will make the condition as true at > > function do_user_addr_fault() > > > > if (likely(!(fault & VM_FAULT_ERROR))) > > return; > > > > 4. Since this page fault has not been handled and "$RIP" is still the > > original value, it will result in the same page fault again. Namely, > > it's an endless page fault. > > > > But the problem is neither the early return in __sgx_vepc_fault() nor > > the return of VM_FAULT_NOPAGE at sgx_vepc_fault(). The root cause has > > been highlighted by Kai, one virtual EPC instance > > can only be mmap()-ed by the process which opens /dev/sgx_vepc. > > > > In fact, to share a virtual EPC instance in userspace doesn't make any > > sense. Even though it can be shared by child, the virtual EPC page > > cannot be used by child correctly. > > OK, makes sense, thanks for the explanation! > > Why would we want to enforce for user space not to do this, even > if it does cause malfunctioning program? > > BR, Jarkko Hi Jarkko, Dave, I've been re-thinking about this #MC handle on virtual EPC by stepping back to the beginning, and I think we have more problems than this "whether kernel should enforce child cannot mmap() virtual EPC". First of all, if we want to use epc->owner to carry the userspace virtual address, "make kernel enforce child cannot mmap() virtual EPC" alone isn't good enough -- nothing prevents userspace to call mmap() several times to map the same virtual EPC pages. So additionally, we also need to "make kernel enforce one virtual EPC can only be mmap()-ed once". Secondly, I am thinking that the current arch_memory_failure() cannot really handle #MC for virtual EPC page correctly. 
The problem is, even we mark the page as poisoned, and send signal to userspace to inject #MC to guest to handle, the poisoned virtual EPC page is never unmapped from the guest and then freed. This means a malicious guest can always try to use the poisoned EPC page again after it receives #MC on some EPC page. I am not entirely sure what kind behaviour/attack can be done in such case, but it seems the right behaviour should be the KVM to inject the #MC and unmap the poisoned EPC page from guest. And if guest ever tries to use this "guest's EPC page" (GFN) again, KVM should kill the guest. Hi Sean, If you ever see this, could you also comment?
On Fri, Nov 04, 2022, Huang, Kai wrote: > > > In fact, to share a virtual EPC instance in userspace doesn't make any > > > sense. Even though it can be shared by child, the virtual EPC page > > > cannot be used by child correctly. > > > > OK, makes sense, thanks for the explanation! > > > > Why would we want to enforce for user space not to do this, even > > if it does cause malfunctioning program? > > > > BR, Jarkko > > Hi Jarkko, Dave, > > I've been re-thinking about this #MC handle on virtual EPC by stepping back to > the beginning, and I think we have more problems than this "whether kernel > should enforce child cannot mmap() virtual EPC". IMO, virtual EPC should be restricted to a single mm_struct, which is what was originally proposed many years ago[*]. I should have pushed back harder, but by that point I had mostly stopped caring about SGX. There is no use case for sharing a virtual EPC, and conceptually it just doesn't make sense because all use would need to be mutually exclusive on a per-page basis to keep the EPCM happy. [*] https://lore.kernel.org/kvm/ace9d4cb10318370f6145aaced0cfa73dda36477.1609890536.git.kai.huang@intel.com > First of all, if we want to use epc->owner to carry the userspace virtual > address, "make kernel enforce child cannot mmap() virtual EPC" alone isn't good > enough -- nothing prevents userspace to call mmap() several times to map the > same virtual EPC pages. So additionally, we also need to "make kernel enforce > one virtual EPC can only be mmap()-ed once". > > Secondly, I am thinking that the current arch_memory_failure() cannot really > handle #MC for virtual EPC page correctly. The problem is, even we mark the > page as poisoned, and send signal to userspace to inject #MC to guest to handle, > the poisoned virtual EPC page is never unmapped from the guest and then freed. > > This means a malicious guest can always try to use the poisoned EPC page again > after it receives #MC on some EPC page. I am not entirely sure what kind > behaviour/attack can be done in such case, but it seems the right behaviour > should be the KVM to inject the #MC and unmap the poisoned EPC page from guest. > And if guest ever tries to use this "guest's EPC page" (GFN) again, KVM should > kill the guest. Just zap the PTEs for the affected mm_struct, mmu_notifier => KVM will do the rest.
On 11/4/22 09:26, Sean Christopherson wrote: >> I've been re-thinking about this #MC handle on virtual EPC by stepping back to >> the beginning, and I think we have more problems than this "whether kernel >> should enforce child cannot mmap() virtual EPC". > IMO, virtual EPC should be restricted to a single mm_struct, which is what was > originally proposed many years ago[*]. I should have pushed back harder, but by > that point I had mostly stopped caring about SGX. Considering that we have VM_DONTCOPY set on the vepc VMA, this shouldn't be too hard to pull off. We could just return -EBUSY if another mm comes around and tries to mmap() the fd.
On Fri, 2022-11-04 at 16:26 +0000, Sean Christopherson wrote: > On Fri, Nov 04, 2022, Huang, Kai wrote: > > > > In fact, to share a virtual EPC instance in userspace doesn't make any > > > > sense. Even though it can be shared by child, the virtual EPC page > > > > cannot be used by child correctly. > > > > > > OK, makes sense, thanks for the explanation! > > > > > > Why would we want to enforce for user space not to do this, even > > > if it does cause malfunctioning program? > > > > > > BR, Jarkko > > > > Hi Jarkko, Dave, > > > > I've been re-thinking about this #MC handle on virtual EPC by stepping back to > > the beginning, and I think we have more problems than this "whether kernel > > should enforce child cannot mmap() virtual EPC". > > IMO, virtual EPC should be restricted to a single mm_struct, which is what was > originally proposed many years ago[*]. I should have pushed back harder, but by > that point I had mostly stopped caring about SGX. > > There is no use case for sharing a virtual EPC, and conceptually it just doesn't > make sense because all use would need to be mutually exclusive on a per-page basis > to keep the EPCM happy. > > [*] https://lore.kernel.org/kvm/ace9d4cb10318370f6145aaced0cfa73dda36477.1609890536.git.kai.huang@intel.com Thanks for your time. Yes I agree. Also, IMHO we can further enforce to only allow associating one VMA with one virtual EPC instance. It also doesn't make sense to have multi-mappings (VMAs) to the same virtual EPC pages even for one mm_struct, right? This also makes zapping PTE easier (as mentioned below). > > > First of all, if we want to use epc->owner to carry the userspace virtual > > address, "make kernel enforce child cannot mmap() virtual EPC" alone isn't good > > enough -- nothing prevents userspace to call mmap() several times to map the > > same virtual EPC pages. So additionally, we also need to "make kernel enforce > > one virtual EPC can only be mmap()-ed once". > > > > Secondly, I am thinking that the current arch_memory_failure() cannot really > > handle #MC for virtual EPC page correctly. The problem is, even we mark the > > page as poisoned, and send signal to userspace to inject #MC to guest to handle, > > the poisoned virtual EPC page is never unmapped from the guest and then freed. > > > > This means a malicious guest can always try to use the poisoned EPC page again > > after it receives #MC on some EPC page. I am not entirely sure what kind > > behaviour/attack can be done in such case, but it seems the right behaviour > > should be the KVM to inject the #MC and unmap the poisoned EPC page from guest. > > And if guest ever tries to use this "guest's EPC page" (GFN) again, KVM should > > kill the guest. > > Just zap the PTEs for the affected mm_struct, mmu_notifier => KVM will do the rest. Zapping PTEs alone can, i.e. take the poisoned virtual EPC page away from the guest and map a new one, but it doesn't inject #MC to the guest. W/o injecting #MC, the guest will just see sudden lose of enclave which consumes that EPC page (enclave is marked as bad by hardware if any page caused #MC). But I guess it's also acceptable since we already have other cases that guest can possibly see sudden lose of enclave (i.e. live migration)? Btw, currently the virtual EPC driver doesn't have infrastructure to zap PTEs. If we can associate one virtual EPC instance to one VMA, then zapping PTE can be easily done by bookkeeping this VMA. Does this look good?
On Fri, 2022-11-04 at 09:34 -0700, Dave Hansen wrote: > On 11/4/22 09:26, Sean Christopherson wrote: > > > I've been re-thinking about this #MC handle on virtual EPC by stepping back to > > > the beginning, and I think we have more problems than this "whether kernel > > > should enforce child cannot mmap() virtual EPC". > > IMO, virtual EPC should be restricted to a single mm_struct, which is what was > > originally proposed many years ago[*]. I should have pushed back harder, but by > > that point I had mostly stopped caring about SGX. > > Considering that we have VM_DONTCOPY set on the vepc VMA, this shouldn't > be too hard to pull off. We could just return -EBUSY if another mm > comes around and tries to mmap() the fd. Yes. We can record the MM which opens /dev/sgx_vepc and reject mmap() from other MMs.
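To make the direction above concrete, here is a rough sketch of recording
the opener's mm in the virtual EPC instance and having mmap() return
-EBUSY for any other mm.  This is an assumption of how it could look, not
the actual merged code; the 'mm' field and the omitted locking and
refcounting are hypothetical.

static int sgx_vepc_open(struct inode *inode, struct file *file)
{
	struct sgx_vepc *vepc;

	vepc = kzalloc(sizeof(*vepc), GFP_KERNEL);
	if (!vepc)
		return -ENOMEM;
	mutex_init(&vepc->lock);
	xa_init(&vepc->page_array);
	vepc->mm = current->mm;	/* hypothetical field: remember the opener */
	file->private_data = vepc;

	return 0;
}

static int sgx_vepc_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct sgx_vepc *vepc = file->private_data;

	/* Reject mappings from any mm other than the one that did open(). */
	if (vepc->mm != current->mm)
		return -EBUSY;

	vma->vm_ops = &sgx_vepc_vm_ops;
	/* Don't copy the VMA on fork(); a child would have to mmap() again. */
	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTDUMP | VM_DONTCOPY;
	vma->vm_private_data = vepc;

	return 0;
}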
On Wed, Nov 02, 2022 at 09:38:55AM +0800, Zhiquan Li wrote: > > On 2022/11/1 08:46, jarkko@kernel.org wrote: > > Why would we want to enforce for user space not to do this, even > > if it does cause malfunctioning program? > > > > We want to resolve the problem at the source rather than just deal with > the symptom passively derived from it. For instance, we might be able to > return VM_FAULT_SIGBUS to kill the malicious application, but if the > malicious child touch the memory earlier than parent despite it cannot > use the virtual EPC page, then the parent will be victim. > > Even thought it's not a security threaten, there is no practical > significance for sharing a virtual EPC instance. So we would like to > prevent it from the beginning. Can you phrase this to the commit message? This makes sense as a motivation. BR, Jarkko
On 2022/11/7 19:36, jarkko@kernel.org wrote:
> Can you phrase this to the commit message?  This makes sense as a
> motivation.
>

No problem, I will add it to the v10 patch once the current concerns are
clarified and we have a clear solution.

Best Regards,
Zhiquan
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index b319bedcaf1e..160c8dbee0ab 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -679,6 +679,8 @@ int arch_memory_failure(unsigned long pfn, int flags)
 	struct sgx_epc_page *page = sgx_paddr_to_page(pfn << PAGE_SHIFT);
 	struct sgx_epc_section *section;
 	struct sgx_numa_node *node;
+	void __user *vaddr;
+	int ret;
 
 	/*
 	 * mm/memory-failure.c calls this routine for all errors
@@ -695,8 +697,26 @@ int arch_memory_failure(unsigned long pfn, int flags)
 	 * error. The signal may help the task understand why the
 	 * enclave is broken.
 	 */
-	if (flags & MF_ACTION_REQUIRED)
-		force_sig(SIGBUS);
+	if (flags & MF_ACTION_REQUIRED) {
+		/*
+		 * Provide extra info to the task so that it can make further
+		 * decision but not simply kill it. This is quite useful for
+		 * virtualization case.
+		 */
+		if (page->flags & SGX_EPC_PAGE_KVM_GUEST) {
+			/*
+			 * The 'encl_owner' field is repurposed, when allocating EPC
+			 * page it was assigned to the virtual address of virtual EPC
+			 * page.
+			 */
+			vaddr = (void *)((unsigned long)page->vepc_vaddr & PAGE_MASK);
+			ret = force_sig_mceerr(BUS_MCEERR_AR, vaddr, PAGE_SHIFT);
+			if (ret < 0)
+				pr_err("Memory failure: Error sending signal to %s:%d: %d\n",
+					current->comm, current->pid, ret);
+		} else
+			force_sig(SIGBUS);
+	}
 
 	section = &sgx_epc_sections[page->section];
 	node = section->node;