
[v9,3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization

Message ID 20220920063948.3556917-4-zhiquan1.li@intel.com (mailing list archive)
State New, archived
Series x86/sgx: fine grained SGX MCA behavior

Commit Message

Zhiquan Li Sept. 20, 2022, 6:39 a.m. UTC
Today, if a guest accesses an SGX EPC page with memory failure,
the kernel behavior will kill the entire guest.  This blast
radius is too large.  It would be idea to kill only the SGX
application inside the guest.

To fix this, send a SIGBUS to host userspace (like QEMU) which can
follow up by injecting a #MC to the guest.

SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
being shared by multiple VMs via fork().  However KVM doesn't support
running a VM across multiple mm structures, and the de facto userspace
hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
this should not happen.

Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/linux-sgx/443cb425-009c-2784-56f4-5e707122de76@intel.com/T/#m1d1f4098f4fad78034e8706a60e4d79c119db407
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

---
Changes since V8:
- Remove excess Acked-by.

Changes since V7:
- Add Acked-by from Jarkko.

Changes since V6:
- Fix build warning due to type changes.

Changes since V5:
- Use the 'vepc_vaddr' field instead of casting the 'owner' field.
- Clean up the commit message suggested by Dave.
  Link: https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#m2ff4778948cdc9ee65f09672f1d02f8dc467247b
- Add Reviewed-by from Jarkko.

Changes since V4:
- Switch the order of the two variables so all variables are in
  reverse Christmas tree style.
- Do not initialize "ret" because it will be overridden by the return
  value of force_sig_mceerr() unconditionally.

Changes since V2:
- Retrieve virtual address from "owner" field of struct sgx_epc_page,
  instead of struct sgx_vepc_page.
- Replace EPC page flag SGX_EPC_PAGE_IS_VEPC with
  SGX_EPC_PAGE_KVM_GUEST as they are duplicated.

Changes since V1:
- Add Acked-by from Kai Huang.
- Add Kai's excellent explanation of why we need not consider the case
  where one virtual EPC instance is shared by two guests.
---
 arch/x86/kernel/cpu/sgx/main.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

Comments

Dave Hansen Oct. 10, 2022, 11:20 p.m. UTC | #1
On 9/19/22 23:39, Zhiquan Li wrote:
> +	if (flags & MF_ACTION_REQUIRED) {
> +		/*
> +		 * Provide extra info to the task so that it can make further
> +		 * decision but not simply kill it. This is quite useful for
> +		 * virtualization case.
> +		 */

"Quite useful to the virtualization case", eh?

This code can only even be *run* in the virtualization case.

> +		if (page->flags & SGX_EPC_PAGE_KVM_GUEST) {
> +			/*
> +			 * The 'encl_owner' field is repurposed, when allocating EPC
> +			 * page it was assigned to the virtual address of virtual EPC
> +			 * page.
> +			 */

That comment is talking about history, which is kinda silly.  Let's just
focus on what the code is doing today.

> +			vaddr = (void *)((unsigned long)page->vepc_vaddr & PAGE_MASK);

First, why even align this?  What does it matter?

Second, is there a reason not to use PTR_ALIGN() here?

> +			ret = force_sig_mceerr(BUS_MCEERR_AR, vaddr, PAGE_SHIFT);
> +			if (ret < 0)
> +				pr_err("Memory failure: Error sending signal to %s:%d: %d\n",
> +					current->comm, current->pid, ret);
> +		} else
> +			force_sig(SIGBUS);
> +	}
Zhiquan Li Oct. 11, 2022, 4:44 a.m. UTC | #2
On 2022/10/11 07:20, Dave Hansen wrote:
> On 9/19/22 23:39, Zhiquan Li wrote:
>> +	if (flags & MF_ACTION_REQUIRED) {
>> +		/*
>> +		 * Provide extra info to the task so that it can make further
>> +		 * decision but not simply kill it. This is quite useful for
>> +		 * virtualization case.
>> +		 */
> 
> "Quite useful to the virtualization case", eh?
> 
> This code can only even be *run* in the virtualization case.
> 

OK, I'll change it in V10.

>> +		if (page->flags & SGX_EPC_PAGE_KVM_GUEST) {
>> +			/*
>> +			 * The 'encl_owner' field is repurposed, when allocating EPC
>> +			 * page it was assigned to the virtual address of virtual EPC
>> +			 * page.
>> +			 */
> 
> That comment is talking about history, which is kinda silly.  Let's just
> focus on what the code is doing today.
> 

How about moving the comment above down to here, since it describes what
the code is doing today?  Then this comment can be removed.

>> +			vaddr = (void *)((unsigned long)page->vepc_vaddr & PAGE_MASK);
> 
> First, why even align this?  What does it matter?
> 
> Second, is there a reason not to use PTR_ALIGN() here?
> 

You're right.  The alignment is unnecessary: the host kernel just reports
the error address, and whether or not it is aligned, the guest kernel's
MCA code will convert it to a PFN.

Best Regards,
Zhiquan

>> +			ret = force_sig_mceerr(BUS_MCEERR_AR, vaddr, PAGE_SHIFT);
>> +			if (ret < 0)
>> +				pr_err("Memory failure: Error sending signal to %s:%d: %d\n",
>> +					current->comm, current->pid, ret);
>> +		} else
>> +			force_sig(SIGBUS);
>> +	}
>
Dave Hansen Oct. 11, 2022, 2:04 p.m. UTC | #3
On 9/19/22 23:39, Zhiquan Li wrote:
> Today, if a guest accesses an SGX EPC page with memory failure,
> the kernel behavior will kill the entire guest.  This blast
> radius is too large.  It would be idea to kill only the SGX

				ideal ^

> application inside the guest.
> 
> To fix this, send a SIGBUS to host userspace (like QEMU) which can
> follow up by injecting a #MC to the guest.

This doesn't make any sense to me.  It's *ALREADY* sending a SIGBUS.
So, whatever is making this better, it's not "send a SIGBUS" that's
doing it.

What does this patch actually do to reduce the blast radius?

> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
> being shared by multiple VMs via fork().  However KVM doesn't support
> running a VM across multiple mm structures, and the de facto userspace
> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
> this should not happen.

This is out of the blue.  Why is this here?

What happens if a hypervisor *DOES* fork()?  What's the fallout?
Zhiquan Li Oct. 12, 2022, 5:09 a.m. UTC | #4
On 2022/10/11 22:04, Dave Hansen wrote:
> On 9/19/22 23:39, Zhiquan Li wrote:
>> Today, if a guest accesses an SGX EPC page with memory failure,
>> the kernel behavior will kill the entire guest.  This blast
>> radius is too large.  It would be idea to kill only the SGX
> 
> 				ideal ^
> 
>> application inside the guest.
>>
>> To fix this, send a SIGBUS to host userspace (like QEMU) which can
>> follow up by injecting a #MC to the guest.
> 
> This doesn't make any sense to me.  It's *ALREADY* sending a SIGBUS.
> So, whatever is making this better, it's not "send a SIGBUS" that's
> doing it.
> 
> What does this patch actually do to reduce the blast radius?
> 

Thanks for your attention, Dave.

This part comes from your comments previously:

https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#m6d62670eb530fab178eefaaaf31a22c4475e818d

The key is that the SIGBUS should be sent with code BUS_MCEERR_AR and the
virtual address of the virtual EPC page.  The hypervisor can identify it
by that specific code and inject a #MC into the guest.

Can we change the statement like this?

    To fix this, send a SIGBUS with code BUS_MCEERR_AR and virtual
    address of virtual EPC page to host userspace (like QEMU) which can
    follow up by injecting a #MC to the guest.

>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
>> being shared by multiple VMs via fork().  However KVM doesn't support
>> running a VM across multiple mm structures, and the de facto userspace
>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
>> this should not happen.
> 
> This is out of the blue.  Why is this here?
> 
> What happens if a hypervisor *DOES* fork()?  What's the fallout?

This part originates from below discussion:

https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t

It intends to answer the question:

    Do you think the processes sharing the same enclave need to be
    killed, even they had not touched the EPC page with hardware error?

Dave, do you mean it's not appropriate to be put here?

Best Regards,
Zhiquan
Huang, Kai Oct. 12, 2022, 11:01 a.m. UTC | #5
On Wed, 2022-10-12 at 13:09 +0800, Zhiquan Li wrote:
> > > SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
> > > being shared by multiple VMs via fork().  However KVM doesn't support
> > > running a VM across multiple mm structures, and the de facto userspace
> > > hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
> > > this should not happen.
> > 
> > This is out of the blue.  Why is this here?
> > 
> > What happens if a hypervisor *DOES* fork()?  What's the fallout?
> 
> This part originates from below discussion:
> 
> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> 
> It intents to answer the question:
> 
>     Do you think the processes sharing the same enclave need to be
>     killed, even they had not touched the EPC page with hardware error?

Sharing virtual EPC instance will very likely unexpectedly break enclaves in all
VMs.  Whether kernel should explicitly prevent is another topic. To me I don't
see strong reason to enforce in the kernel.  For instance, multiple VMs can map
the same file as memory backend with MAP_SHARED, in which case they can all
break.  Userspace should use virtual EPC in the right way.

But the point is above is not directly related to your patch.  On host where
multiple processes can share one enclave legally, it does the same thing.  I
think you can just remove that paragraph from changelog.
Jarkko Sakkinen Oct. 12, 2022, 11:54 a.m. UTC | #6
On Wed, Oct 12, 2022 at 11:01:49AM +0000, Huang, Kai wrote:
> On Wed, 2022-10-12 at 13:09 +0800, Zhiquan Li wrote:
> > > > SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
> > > > being shared by multiple VMs via fork().  However KVM doesn't support
> > > > running a VM across multiple mm structures, and the de facto userspace
> > > > hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
> > > > this should not happen.
> > > 
> > > This is out of the blue.  Why is this here?
> > > 
> > > What happens if a hypervisor *DOES* fork()?  What's the fallout?
> > 
> > This part originates from below discussion:
> > 
> > https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> > 
> > It intents to answer the question:
> > 
> >     Do you think the processes sharing the same enclave need to be
> >     killed, even they had not touched the EPC page with hardware error?
> 
> Sharing virtual EPC instance will very likely unexpectedly break enclaves in all
> VMs.  Whether kernel should explicitly prevent is another topic. To me I don't
> see strong reason to enforce in the kernel.  For instance, multiple VMs can map
> the same file as memory backend with MAP_SHARED, in which case they can all
> break.  Userspace should use virtual EPC in the right way.

Broadly speaking, for most of the time, and for any topic, kernel should
not prevent anything, unless it can break kernel's internal state.

> But the point is above is not directly related to your patch.  On host where
> multiple processes can share one enclave legally, it does the same thing.  I
> think you can just remove that paragraph from changelog.

BR, Jarkko
Dave Hansen Oct. 12, 2022, 2:36 p.m. UTC | #7
On 10/11/22 22:09, Zhiquan Li wrote:
>>> application inside the guest.
>>>
>>> To fix this, send a SIGBUS to host userspace (like QEMU) which can
>>> follow up by injecting a #MC to the guest.
>>
>> This doesn't make any sense to me.  It's *ALREADY* sending a SIGBUS.
>> So, whatever is making this better, it's not "send a SIGBUS" that's
>> doing it.
>>
>> What does this patch actually do to reduce the blast radius?
> 
> Thanks for your attention, Dave.
> 
> This part comes from your comments previously:
> 
> https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#m6d62670eb530fab178eefaaaf31a22c4475e818d
> 
> The key is the SIGBUS should with code BUS_MCEERR_AR and virtual address
> of virtual EPC page. Hypervisor can identify it with the specific code
> and inject #MC to the guest.
> 
> Can we change the statement like this?
> 
>     To fix this, send a SIGBUS with code BUS_MCEERR_AR and virtual
>     address of virtual EPC page to host userspace (like QEMU) which can
>     follow up by injecting a #MC to the guest.

This is really just mechanically restating what the patch does.  It
doesn't help me understand how it achieves the goal.  I guess I'll just
go and write the changelog for you.  Here's what I was missing:

	There is already a signal-based ABI to tell userspace about
	machine checks.  But, SGX does not use that ABI.  Today, the
	kernel delivers a generic SIGBUS if a machine check occurs when
	accessing SGX memory.  Userspace can not differentiate that
	SIGBUS from <add example here>, so it is very unlikely to be
	able to recover from the signal and the app will die.

	To fix this, have the SGX machine check code generate a SIGBUS
	which leverages the existing BUS_MCEERR_AR ABI.  This enlightens
	userspace about why the SIGBUS was generated and gives it a
	chance of being able to handle the signal.

	QEMU, for instance, has code to handle these BUS_MCEERR_AR
	signals today.  Without this patch, QEMU will just die in the
	face of a generic SIGBUS, and take the whole VM with it.  With
	this patch <explain what QEMU actually does here>.

	In short, BUS_MCEERR_AR enables QEMU to reduce the blast radius
	down from the whole QEMU process to a single page.

This patch doesn't *actually* reduce the blast radius.  It enables QEMU
to do that.
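Mechanically, "QEMU handles these BUS_MCEERR_AR signals" means: read the
poisoned host virtual address from si_addr, translate it to a guest physical
address, and inject a machine check into the vCPU (on KVM, via the
KVM_X86_SET_MCE vCPU ioctl).  The translation step can be sketched against a
hypothetical region table; the struct and helper names below are illustrative,
not QEMU's real data structures:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical guest RAM region: a host mapping base paired with the
 * guest physical base it backs. */
struct mem_region {
	uintptr_t hva_base;
	uint64_t  gpa_base;
	size_t    size;
};

/* Translate a host virtual address (as delivered in si_addr) to a guest
 * physical address.  Returns true and fills *gpa if a region covers it. */
static bool hva_to_gpa(const struct mem_region *regions, size_t n,
		       uintptr_t hva, uint64_t *gpa)
{
	for (size_t i = 0; i < n; i++) {
		if (hva >= regions[i].hva_base &&
		    hva < regions[i].hva_base + regions[i].size) {
			*gpa = regions[i].gpa_base + (hva - regions[i].hva_base);
			return true;
		}
	}
	return false;  /* not guest memory; fall back to dying on the SIGBUS */
}
```

A real hypervisor would then fill a struct kvm_x86_mce with that GPA in its
addr field and issue KVM_X86_SET_MCE on the vCPU fd; what happens next depends
entirely on the guest kernel's own memory-failure handling.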

>>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
>>> being shared by multiple VMs via fork().  However KVM doesn't support
>>> running a VM across multiple mm structures, and the de facto userspace
>>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
>>> this should not happen.
>>
>> This is out of the blue.  Why is this here?
>>
>> What happens if a hypervisor *DOES* fork()?  What's the fallout?
> 
> This part originates from below discussion:
> 
> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> 
> It intents to answer the question:
> 
>     Do you think the processes sharing the same enclave need to be
>     killed, even they had not touched the EPC page with hardware error?
> 
> Dave, do you mean it's not appropriate to be put here?

It's actually a pretty important point, but it's still out of the blue.

You also didn't answer my question.
Huang, Kai Oct. 12, 2022, 8:56 p.m. UTC | #8
On Wed, 2022-10-12 at 14:54 +0300, jarkko@kernel.org wrote:
> > Sharing virtual EPC instance will very likely unexpectedly break enclaves in
> > all
> > VMs.  Whether kernel should explicitly prevent is another topic. To me I
> > don't
> > see strong reason to enforce in the kernel.  For instance, multiple VMs can
> > map
> > the same file as memory backend with MAP_SHARED, in which case they can all
> > break.  Userspace should use virtual EPC in the right way.
> 
> Broadly speaking, for most of the time, and for any topic, kernel should
> not prevent anything, unless it can break kernel's internal state.

Good to know.  Thanks you.
Zhiquan Li Oct. 13, 2022, 2:05 a.m. UTC | #9
On 2022/10/12 19:01, Huang, Kai wrote:
> On Wed, 2022-10-12 at 13:09 +0800, Zhiquan Li wrote:
>>>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
>>>> being shared by multiple VMs via fork().  However KVM doesn't support
>>>> running a VM across multiple mm structures, and the de facto userspace
>>>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
>>>> this should not happen.
>>>
>>> This is out of the blue.  Why is this here?
>>>
>>> What happens if a hypervisor *DOES* fork()?  What's the fallout?
>>
>> This part originates from below discussion:
>>
>> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
>>
>> It intents to answer the question:
>>
>>     Do you think the processes sharing the same enclave need to be
>>     killed, even they had not touched the EPC page with hardware error?
> 
> Sharing virtual EPC instance will very likely unexpectedly break enclaves in all
> VMs.  Whether kernel should explicitly prevent is another topic. To me I don't
> see strong reason to enforce in the kernel.  For instance, multiple VMs can map
> the same file as memory backend with MAP_SHARED, in which case they can all
> break.  Userspace should use virtual EPC in the right way.
> 
> But the point is above is not directly related to your patch.  On host where
> multiple processes can share one enclave legally, it does the same thing.  I
> think you can just remove that paragraph from changelog.
> 

OK, I'll remove it since V10.
Thank you all the same, Kai.
Zhiquan Li Oct. 13, 2022, 2:40 p.m. UTC | #10
On 2022/10/12 22:36, Dave Hansen wrote:
> This is really just mechanically restating what the patch does.  It
> doesn't help me understand how it achieves the goal.  I guess I'll just
> go and write the changelog for you.  Here's what I was missing:
> 
> 	There is already a signal-based ABI to tell userspace about
> 	machine checks.  But, SGX does not use that ABI.  Today, the
> 	kernel delivers a generic SIGBUS if a machine check occurs when
> 	accessing SGX memory.  Userspace can not differentiate that
> 	SIGBUS from <add example here>, so it is very unlikely to be
> 	able to recover from the signal and the app will die.
> 
> 	To fix this, have the SGX machine check code generate a SIGBUS
> 	which leverages the existing BUS_MCEERR_AR ABI.  This enlightens
> 	userspace about why the SIGBUS was generated and gives it a
> 	chance of being able to handle the signal.
> 
> 	QEMU, for instance, has code to handle these BUS_MCEERR_AR
> 	signals today.  Without this patch, QEMU will just die in the
> 	face of a generic SIGBUS, and take the whole VM with it.  With
> 	this patch <explain what QEMU actually does here>.
> 
> 	In short, BUS_MCEERR_AR enables QEMU to reduce the blast radius
> 	down from the whole QEMU process to a single page.
> 
> This patch doesn't *actually* reduce the blast radius.  It enables QEMU
> to do that.
> 

Perfect, Dave!  I'm sorry I ended up making you get your hands dirty.
I'd like to replace the angle-bracketed content as below; could you have
a look?

	There is already a signal-based ABI to tell userspace about
	machine checks.  But, SGX does not use that ABI.  Today, the
	kernel delivers a generic SIGBUS if a machine check occurs when
	accessing SGX memory.  Userspace can not differentiate that
	SIGBUS from <a recoverable machine check error>, so it is very
	unlikely to be able to recover from the signal and the app will
	die.
	...
	QEMU, for instance, has code to handle these BUS_MCEERR_AR
	signals today.  Without this patch, QEMU will just die in the
	face of a generic SIGBUS, and take the whole VM with it.  With
	this patch <QEMU will be aware of the machine check error and
	retrieve the virtual address (HVA) from siginfo, then convert it
	to GPA and inject a #MC into VM. Guest kernel will handle this
	#MC, find and kill the task who had accessed the error memory.>
	...

>>>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
>>>> being shared by multiple VMs via fork().  However KVM doesn't support
>>>> running a VM across multiple mm structures, and the de facto userspace
>>>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
>>>> this should not happen.
>>> This is out of the blue.  Why is this here?
>>>
>>> What happens if a hypervisor *DOES* fork()?  What's the fallout?
>> This part originates from below discussion:
>>
>> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
>>
>> It intents to answer the question:
>>
>>     Do you think the processes sharing the same enclave need to be
>>     killed, even they had not touched the EPC page with hardware error?
>>
>> Dave, do you mean it's not appropriate to be put here?
> It's actually a pretty important point, but it's still out of the blue.
> 
> You also didn't answer my question.

Oh, sorry, I focused on answering "Why is this here?" but forgot to
answer "What's the fallout?"

It's a very good question.

Looks like Kai had answered it before:

	For instance, an application can open /dev/sgx_vepc, mmap(), and
	fork().  Then if the child does mmap() against the fd opened by
	the parent, the child will share the same vepc with the parent.

	...

	Sharing virtual EPC instance will very likely unexpectedly break
	enclaves in all VMs.
	...

https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t

Moreover, I checked the code in the scenarios below, where the child
shares the virtual EPC instance with the parent:
- child closes /dev/sgx_vepc earlier than parent.
- child re-mmap() /dev/sgx_vepc to the same memory region as parent.
- child mmap() /dev/sgx_vepc to different memory region.
- child accesses the memory region of mmap() inherited from parent.

In all of them the app only messes itself up; the vepc instance itself is
protected very well.
Maybe there are other corner cases I've not considered?

Best Regards,
Zhiquan
Dave Hansen Oct. 13, 2022, 3:39 p.m. UTC | #11
On 10/13/22 07:40, Zhiquan Li wrote:
> 	There is already a signal-based ABI to tell userspace about
> 	machine checks.  But, SGX does not use that ABI.  Today, the
> 	kernel delivers a generic SIGBUS if a machine check occurs when
> 	accessing SGX memory.  Userspace can not differentiate that
> 	SIGBUS from <a recoverable machine check error>,

I wanted an example here of a SIGBUS that can occur from sources *OTHER*
than a recoverable machine check.
Dave Hansen Oct. 13, 2022, 3:44 p.m. UTC | #12
On 10/13/22 07:40, Zhiquan Li wrote:
>>>> What happens if a hypervisor *DOES* fork()?  What's the fallout?
>>> This part originates from below discussion:
>>>
>>> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
>>>
>>> It intents to answer the question:
>>>
>>>     Do you think the processes sharing the same enclave need to be
>>>     killed, even they had not touched the EPC page with hardware error?
>>>
>>> Dave, do you mean it's not appropriate to be put here?
>> It's actually a pretty important point, but it's still out of the blue.
>>
>> You also didn't answer my question.
> Oh, sorry, I focused on answering "Why is this here?" but forgot to
> answer "What's the fallout?"
> 
> It's a very good question.
> 
> Looks like Kai had answered it before:
> 
> 	For instance, an application can open /dev/sgx_vepc, mmap(), and
> 	fork().  Then if the child does mmap() against the fd opened by
> 	the parent, the child will share the same vepc with the parent.
> 
> 	...
> 
> 	Sharing virtual EPC instance will very likely unexpectedly break
> 	enclaves in all VMs.

How, though?  This basically says, "I think things will break."  I want
to know a few more specifics than that before we merge this.  There are
ABI implications.

> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> 
> Moreover, I checked the code in below scenario while child sharing the
> virtual EPC instance with parent:
> - child closes /dev/sgx_vepc earlier than parent.
> - child re-mmap() /dev/sgx_vepc to the same memory region as parent.
> - child mmap() /dev/sgx_vepc to different memory region.
> - child accesses the memory region of mmap() inherited from parent.
> 
> It's just that the app messes itself up, the vepc instance is protected
> very well.
> Maybe there are other corner cases I've not considered?

... and what happens when *THIS* patch is in play?  What if there is a
machine check in SGX memory?
Huang, Kai Oct. 13, 2022, 9:49 p.m. UTC | #13
On Thu, 2022-10-13 at 08:44 -0700, Dave Hansen wrote:
> On 10/13/22 07:40, Zhiquan Li wrote:
> > > > > What happens if a hypervisor *DOES* fork()?  What's the fallout?
> > > > This part originates from below discussion:
> > > > 
> > > > https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> > > > 
> > > > It intents to answer the question:
> > > > 
> > > >     Do you think the processes sharing the same enclave need to be
> > > >     killed, even they had not touched the EPC page with hardware error?
> > > > 
> > > > Dave, do you mean it's not appropriate to be put here?
> > > It's actually a pretty important point, but it's still out of the blue.
> > > 
> > > You also didn't answer my question.
> > Oh, sorry, I focused on answering "Why is this here?" but forgot to
> > answer "What's the fallout?"
> > 
> > It's a very good question.
> > 
> > Looks like Kai had answered it before:
> > 
> > 	For instance, an application can open /dev/sgx_vepc, mmap(), and
> > 	fork().  Then if the child does mmap() against the fd opened by
> > 	the parent, the child will share the same vepc with the parent.
> > 
> > 	...
> > 
> > 	Sharing virtual EPC instance will very likely unexpectedly break
> > 	enclaves in all VMs.
> 
> How, though?  This basically says, "I think things will break."  I want
> to know a few more specifics than that before we merge this.  There are
> ABI implications.

This is because virtual EPC is just a raw resource to guest, and how guest uses
virtual EPC to run enclaves is completely controlled by the guest.  When virtual
EPC is shared among VMs, it can happen that one guest _thinks_ one EPC page is
still free but in fact it has been already used by another VM as a valid enclave
page.  Also, one VM can just unconditionally sanitize all EPC pages before using
any of them (just like Linux does).  All of those can cause unexpected SGX
error, which can lead to failing to create enclave, and/or breaking existing
enclaves running in all VMs.

> 
> > https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> > 
> > Moreover, I checked the code in below scenario while child sharing the
> > virtual EPC instance with parent:
> > - child closes /dev/sgx_vepc earlier than parent.
> > - child re-mmap() /dev/sgx_vepc to the same memory region as parent.
> > - child mmap() /dev/sgx_vepc to different memory region.
> > - child accesses the memory region of mmap() inherited from parent.
> > 
> > It's just that the app messes itself up, the vepc instance is protected
> > very well.
> > Maybe there are other corner cases I've not considered?
> 
> ... and what happens when *THIS* patch is in play?  What if there is a
> machine check in SGX memory?

With this patch, when a #MC happens on one virtual EPC page, it will be
sent to the VM, and the behaviour inside the VM depends on the guest's
implementation.  But anyway, based on Tony's reply, the enclave will be
marked as "bad" by hardware and it will eventually be killed:

https://lore.kernel.org/linux-sgx/55ffd9475f5d46f68dd06c4323bec871@intel.com/
https://lore.kernel.org/linux-sgx/5b6ad3e2af614caf9b41092797ffcd86@intel.com/

If the virtual EPC is shared by other VMs, the worst case is when other VMs use
this bad EPC page (as we cannot take the bad EPC page out of VM for now), some
SGX error (ENCLS/ENCLU error) or #MC could happen again.  But this doesn't make
things worse, as when sharing virtual EPC page among VMs you are likely to break
enclaves in VMs anyway (as mentioned above).
Dave Hansen Oct. 13, 2022, 10:02 p.m. UTC | #14
On 10/13/22 14:49, Huang, Kai wrote:
>> ... and what happens when *THIS* patch is in play?  What if there is a
>> machine check in SGX memory?
> With this patch, when #MC happens on one virtual EPC page, it will be send to
> the VM, and the behaviour inside a VM depends on guest's implementation.  But
> anyway based on Tony's reply, the enclave will be marked as "bad" by hardware
> and it will eventually be killed:
> 
> https://lore.kernel.org/linux-sgx/55ffd9475f5d46f68dd06c4323bec871@intel.com/
> https://lore.kernel.org/linux-sgx/5b6ad3e2af614caf9b41092797ffcd86@intel.com/
> 
> If the virtual EPC is shared by other VMs, the worst case is when other VMs use
> this bad EPC page (as we cannot take the bad EPC page out of VM for now), some
> SGX error (ENCLS/ENCLU error) or #MC could happen again.  But this doesn't make
> things worse, as when sharing virtual EPC page among VMs you are likely to break
> enclaves in VMs anyway (as mentioned above).

Guys, maybe I'm just not being specific enough.  But, Kai, you're off on
a tangent here that I didn't intend to ask about.  You're actually
adding confusion and delay here, not helping answer the question.

This was specifically a question about fork() plus VEPC plus machine
checks.  I'm still trying to get to the bottom of:

	What happens if a hypervisor *DOES* fork()?  What's the fallout?

Specifically, with this machine check SIGBUS implementation, one EPC
page can only have one host virtual address.  But, it's possible to
mmap() a single VEPC page in multiple processes at multiple host virtual
addresses.

So, the implementation only allows one host virtual address.  But, we
can imagine a scenario where there are *TWO* valid virtual addresses
where the VEPC is mapped.

What might happen in this case?  What's the fallout?
Huang, Kai Oct. 13, 2022, 10:15 p.m. UTC | #15
On Thu, 2022-10-13 at 15:02 -0700, Hansen, Dave wrote:
> Specifically, with this machine check SIGBUS implementation, one EPC
> page can only have on host virtual address.  But, it's possible to
> mmap() a single VEPC page in multiple processes at multiple host virtual
> addresses.
> 
> So, the implementation only allows one host virtual address.  But, we
> can imagine a scenario where there are *TWO* valid virtual addresses
> where the VEPC is mapped.
> 
> What might happen in this case?  What's the fallout?

My understanding is there can be two #MC happening on two virtual addresses.

Each #MC will be injected to the VM which triggers that #MC to handle.

Today, both VM gets killed.
Dave Hansen Oct. 13, 2022, 10:28 p.m. UTC | #16
On 10/13/22 15:15, Huang, Kai wrote:
> On Thu, 2022-10-13 at 15:02 -0700, Hansen, Dave wrote:
>> Specifically, with this machine check SIGBUS implementation, one EPC
>> page can only have on host virtual address.  But, it's possible to
>> mmap() a single VEPC page in multiple processes at multiple host virtual
>> addresses.
>>
>> So, the implementation only allows one host virtual address.  But, we
>> can imagine a scenario where there are *TWO* valid virtual addresses
>> where the VEPC is mapped.
>>
>> What might happen in this case?  What's the fallout?
> 
> My understanding is there can be two #MC happening on two virtual addresses.
> 
> Each #MC will be injected to the VM which triggers that #MC to handle.

OK, but which address will each of them see in siginfo->si_addr?  There's
only one ->vepc_vaddr.

It seems to me like this might result in the siginfo getting populated
with the ->si_addr from the *OTHER* process and *OTHER* virtual address.
 Basically, whoever allocates the page sets up *their* virtual address.
 Anyone else who gets a machine check on that page will see a virtual
address that has nothing to do with its virtual address space.

Is that problematic?
Huang, Kai Oct. 13, 2022, 11:40 p.m. UTC | #17
On Thu, 2022-10-13 at 15:28 -0700, Dave Hansen wrote:
> On 10/13/22 15:15, Huang, Kai wrote:
> > On Thu, 2022-10-13 at 15:02 -0700, Hansen, Dave wrote:
> > > Specifically, with this machine check SIGBUS implementation, one EPC
> > > page can only have on host virtual address.  But, it's possible to
> > > mmap() a single VEPC page in multiple processes at multiple host virtual
> > > addresses.
> > > 
> > > So, the implementation only allows one host virtual address.  But, we
> > > can imagine a scenario where there are *TWO* valid virtual addresses
> > > where the VEPC is mapped.
> > > 
> > > What might happen in this case?  What's the fallout?
> > 
> > My understanding is there can be two #MC happening on two virtual addresses.
> > 
> > Each #MC will be injected to the VM which triggers that #MC to handle.
> 
> OK, but which address will each of them see in siginfo->si_addr?  There's
> only one ->vepc_vaddr.
> 
> It seems to me like this might result in the siginfo getting populated
> with the ->si_addr from the *OTHER* process and *OTHER* virtual address.
>  Basically, whoever allocates the page sets up *their* virtual address.
>  Anyone else who gets a machine check on that page will see a virtual
> address that has nothing to do with its virtual address space.
> 
> Is that problematic?

Thanks Dave.  I see the problem now.  I had already forgotten why I said
"whether to add a comment is good enough" in my original reply.

So with fork() allowed, putting the 'virtual address' into the 'owner' field
doesn't seem right.  I think we have two options:

1) enforce in the kernel that virtual EPC doesn't support fork(), so the
page's 'owner' field can safely carry the 'virtual address'.
2) in arch_memory_failure(), find the virtual address based on the current
process, but I am not entirely sure whether that can work, i.e. is it ever
called from interrupt context?

What's your suggestion?
Dave Hansen Oct. 13, 2022, 11:57 p.m. UTC | #18
On 10/13/22 16:40, Huang, Kai wrote:
> What's your suggestion?

I'm fresh out of suggestions.
Huang, Kai Oct. 14, 2022, 12:19 a.m. UTC | #19
On Thu, 2022-10-13 at 16:57 -0700, Dave Hansen wrote:
> On 10/13/22 16:40, Huang, Kai wrote:
> > What's your suggestion?
> 
> I'm fresh out of suggestions.

Thanks Dave.  I'll dig more and find out a solution.
Dave Hansen Oct. 14, 2022, 5:41 a.m. UTC | #20
On 10/13/22 22:42, Zhiquan Li wrote:
> On 2022/10/13 23:39, Dave Hansen wrote:
>> I wanted an example here of a SIGBUS that can occur from sources *OTHER*
>> than a recoverable machine check.
> Oh, got it, it should look like this:
> 
> 	There is already a signal-based ABI to tell userspace about
> 	machine checks.  But, SGX does not use that ABI.  Today, the
> 	kernel delivers a generic SIGBUS if a machine check occurs when
> 	accessing SGX memory.  Userspace can not differentiate that
> 	SIGBUS from invalid address alignment, nonexistent physical
> 	address or object-specific hardware error, so it is very
> 	unlikely to be able to recover from the signal and the app will
> 	die.

That looks much better, thanks.
Zhiquan Li Oct. 14, 2022, 5:42 a.m. UTC | #21
On 2022/10/13 23:39, Dave Hansen wrote:
> I wanted an example here of a SIGBUS that can occur from sources *OTHER*
> than a recoverable machine check.

Oh, got it, it should look like this:

	There is already a signal-based ABI to tell userspace about
	machine checks.  But, SGX does not use that ABI.  Today, the
	kernel delivers a generic SIGBUS if a machine check occurs when
	accessing SGX memory.  Userspace can not differentiate that
	SIGBUS from invalid address alignment, nonexistent physical
	address or object-specific hardware error, so it is very
	unlikely to be able to recover from the signal and the app will
	die.

Thanks for the clarification, Dave.

Best Regards,
Zhiquan
Huang, Kai Oct. 19, 2022, 10:59 a.m. UTC | #22
On Fri, 2022-10-14 at 00:19 +0000, Huang, Kai wrote:
> On Thu, 2022-10-13 at 16:57 -0700, Dave Hansen wrote:
> > On 10/13/22 16:40, Huang, Kai wrote:
> > > What's your suggestion?
> > 
> > I'm fresh out of suggestions.
> 
> Thanks Dave.  I'll dig more and find out a solution.
> 

Hi Dave,

Not related to this series, I think there's a bug in the current virtual EPC
driver page fault handler in case of fork() (sorry didn't notice it at the first
place):

static int __sgx_vepc_fault(struct sgx_vepc *vepc,
                            struct vm_area_struct *vma, unsigned long addr)
{
	......
        /* Calculate index of EPC page in virtual EPC's page_array */
        index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);

        epc_page = xa_load(&vepc->page_array, index);
        if (epc_page)
                return 0;

	...
}

As you can see, if the EPC page has already been populated at a given index of
one virtual EPC instance, the current fault handler just assumes the mapping is
already there and returns success immediately.  This causes a bug when one
virtual EPC instance is shared by multiple processes via fork(): if the EPC page
at one index was already populated by the parent process, then when the child
accesses the same page via a different virtual address, the fault handler
returns success without actually setting up the mapping for the child,
resulting in an endless page fault loop.

This needs to be fixed one way or another.

My thinking is we can just enforce in the kernel that one virtual EPC instance
can only be mmap()-ed by the process which opens /dev/sgx_vepc (a new virtual
EPC instance is created on each open).  KVM userspace should never use fork() to
share a virtual EPC instance; otherwise userspace is using it the wrong way.
Thus enforcing this in the kernel won't break any KVM userspace.

Does this make sense?

If this looks OK to you, I'll send out a patch to fix it.
Jarkko Sakkinen Oct. 23, 2022, 8:39 p.m. UTC | #23
On Wed, Oct 19, 2022 at 10:59:20AM +0000, Huang, Kai wrote:
> static int __sgx_vepc_fault(struct sgx_vepc *vepc,
>                             struct vm_area_struct *vma, unsigned long addr)
> {
> 	......
>         /* Calculate index of EPC page in virtual EPC's page_array */
>         index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);
> 
>         epc_page = xa_load(&vepc->page_array, index);
>         if (epc_page)
>                 return 0;
> 
> 	...
> }
> 
> As you can see if the EPC page has already been populated at a given index of
> one virtual EPC instance, the current fault handler just assumes the mapping is
> already there and returns success immediately.  This causes a bug when one
> virtual EPC instance is shared by multi processes via fork(): if the EPC page at
> one index is already populated by the parent process, when the child accesses
> the same page using different virtual address, the fault handler just returns
> success w/o actually setting up the mapping for the child, resulting in endless
> page fault.
> 
> This needs to be fixed in no matter what way.

I think you mean that vm_insert_pfn() does not happen for the child because
of the early return? I did not understand the part about "different virtual
addresses", as it is the same mapping.

R, Jarkko
Zhiquan Li Oct. 24, 2022, 1:32 a.m. UTC | #24
On 2022/10/24 04:39, jarkko@kernel.org wrote:
>> As you can see if the EPC page has already been populated at a given index of
>> one virtual EPC instance, the current fault handler just assumes the mapping is
>> already there and returns success immediately.  This causes a bug when one
>> virtual EPC instance is shared by multi processes via fork(): if the EPC page at
>> one index is already populated by the parent process, when the child accesses
>> the same page using different virtual address, the fault handler just returns
>> success w/o actually setting up the mapping for the child, resulting in endless
>> page fault.
>>
>> This needs to be fixed in no matter what way.
> I think you mean that vm_insert_pfn() does not happen for child because
> of early return? I did not understand the part about "different virtual
> addresses", as it is the same mapping.
> 

If userspace does something like this, the child will get a "different
virtual address":

  ... parent runs an enclave within the VM ...
  if (fork() == 0) {
      int *caddr = mmap(NULL, 4096, PROT_READ, MAP_SHARED, vepc_fd, 0);
      printf("child: %d\n", caddr[0]);
  }


- "vepc_fd" is inherited from the parent, which had opened /dev/sgx_vepc.
- mmap() will create a VMA in the child at a "different virtual address".
- "caddr[0]" will cause a page fault as it's a new mapping.

1. The kernel will then run into the code snippet referenced by Kai.
2. The early return 0 will make sgx_vepc_fault() return
   "VM_FAULT_NOPAGE".
3. This return value makes the condition true in
   do_user_addr_fault():

   if (likely(!(fault & VM_FAULT_ERROR)))
       return;

4. Since this page fault has not been handled and "$RIP" still holds the
   original value, it will trigger the same page fault again.  Namely,
   it's an endless page fault loop.

But the problem is neither the early return in __sgx_vepc_fault() nor
the return of VM_FAULT_NOPAGE from sgx_vepc_fault().  The root cause is
what Kai highlighted: one virtual EPC instance should only be mmap()-ed
by the process which opens /dev/sgx_vepc.

In fact, sharing a virtual EPC instance in userspace doesn't make any
sense.  Even though it can be shared with a child, the virtual EPC pages
cannot be used by the child correctly.

Best Regards,
Zhiquan
Huang, Kai Oct. 24, 2022, 10:23 p.m. UTC | #25
On Sun, 2022-10-23 at 23:39 +0300, jarkko@kernel.org wrote:
> On Wed, Oct 19, 2022 at 10:59:20AM +0000, Huang, Kai wrote:
> > static int __sgx_vepc_fault(struct sgx_vepc *vepc,
> >                             struct vm_area_struct *vma, unsigned long addr)
> > {
> > 	......
> >         /* Calculate index of EPC page in virtual EPC's page_array */
> >         index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);
> > 
> >         epc_page = xa_load(&vepc->page_array, index);
> >         if (epc_page)
> >                 return 0;
> > 
> > 	...
> > }
> > 
> > As you can see if the EPC page has already been populated at a given index of
> > one virtual EPC instance, the current fault handler just assumes the mapping is
> > already there and returns success immediately.  This causes a bug when one
> > virtual EPC instance is shared by multi processes via fork(): if the EPC page at
> > one index is already populated by the parent process, when the child accesses
> > the same page using different virtual address, the fault handler just returns
> > success w/o actually setting up the mapping for the child, resulting in endless
> > page fault.
> > 
> > This needs to be fixed in no matter what way.
> 
> I think you mean that vm_insert_pfn() does not happen for child because
> of early return?
> 

Yes exactly.  Sorry for not pointing it out directly.

> I did not understand the part about "different virtual
> addresses", as it is the same mapping.

The child can use mmap() to get a new mapping.  Whether the virtual address is
different from the parent's doesn't actually matter.
Jarkko Sakkinen Nov. 1, 2022, 12:46 a.m. UTC | #26
On Mon, Oct 24, 2022 at 09:32:13AM +0800, Zhiquan Li wrote:
> 
> 
> On 2022/10/24 04:39, jarkko@kernel.org wrote:
> >> As you can see if the EPC page has already been populated at a given index of
> >> one virtual EPC instance, the current fault handler just assumes the mapping is
> >> already there and returns success immediately.  This causes a bug when one
> >> virtual EPC instance is shared by multi processes via fork(): if the EPC page at
> >> one index is already populated by the parent process, when the child accesses
> >> the same page using different virtual address, the fault handler just returns
> >> success w/o actually setting up the mapping for the child, resulting in endless
> >> page fault.
> >>
> >> This needs to be fixed in no matter what way.
> > I think you mean that vm_insert_pfn() does not happen for child because
> > of early return? I did not understand the part about "different virtual
> > addresses", as it is the same mapping.
> > 
> 
> If userspace do something like this, the child will get "different
> virtual address":
> 
>   ... parent run enclave within VM ...
>   if (fork() == 0) {
>       int *caddr = mmap(NULL, 4096, PROT_READ, MAP_SHARED, vepc_fd, 0);
>       printf("child: %d\n", caddr[0]);
>   }
> 
> 
> - "vepc_fd" is inherited from parent which had opened /dev/sgx_vepc.
> - mmap() will create a VMA in child with "different virtual addresses".
> - "caddr[0]" will cause a page fault as it's a new mapping.
> 
> 1. Then kernel will run into the code snippet referenced by Kai.
> 2. The early return 0 will result in sgx_vepc_fault() return
>    "VM_FAULT_NOPAGE".
> 3. This return value will make the condition as true at
>    function do_user_addr_fault()
> 
>    if (likely(!(fault & VM_FAULT_ERROR)))
>        return;
> 
> 4. Since this page fault has not been handled and "$RIP" is still the
>    original value, it will result in the same page fault again.  Namely,
>    it's an endless page fault.
> 
> But the problem is neither the early return in __sgx_vepc_fault() nor
> the return of VM_FAULT_NOPAGE at sgx_vepc_fault().  The root cause has
> been highlighted by Kai, one virtual EPC instance
> can only be mmap()-ed by the process which opens /dev/sgx_vepc.
> 
> In fact, to share a virtual EPC instance in userspace doesn't make any
> sense.  Even though it can be shared by child, the virtual EPC page
> cannot be used by child correctly.

OK, makes sense, thanks for the explanation!

Why would we want to enforce in the kernel that user space cannot do this,
even if doing so only causes a malfunctioning program?

BR, Jarkko
Jarkko Sakkinen Nov. 1, 2022, 12:53 a.m. UTC | #27
On Mon, Oct 24, 2022 at 10:23:10PM +0000, Huang, Kai wrote:
> On Sun, 2022-10-23 at 23:39 +0300, jarkko@kernel.org wrote:
> > On Wed, Oct 19, 2022 at 10:59:20AM +0000, Huang, Kai wrote:
> > > static int __sgx_vepc_fault(struct sgx_vepc *vepc,
> > >                             struct vm_area_struct *vma, unsigned long addr)
> > > {
> > > 	......
> > >         /* Calculate index of EPC page in virtual EPC's page_array */
> > >         index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);
> > > 
> > >         epc_page = xa_load(&vepc->page_array, index);
> > >         if (epc_page)
> > >                 return 0;
> > > 
> > > 	...
> > > }
> > > 
> > > As you can see if the EPC page has already been populated at a given index of
> > > one virtual EPC instance, the current fault handler just assumes the mapping is
> > > already there and returns success immediately.  This causes a bug when one
> > > virtual EPC instance is shared by multi processes via fork(): if the EPC page at
> > > one index is already populated by the parent process, when the child accesses
> > > the same page using different virtual address, the fault handler just returns
> > > success w/o actually setting up the mapping for the child, resulting in endless
> > > page fault.
> > > 
> > > This needs to be fixed in no matter what way.
> > 
> > I think you mean that vm_insert_pfn() does not happen for child because
> > of early return?
> > 
> 
> Yes exactly.  Sorry for not pointing out directly.

Np.

> 
> > I did not understand the part about "different virtual
> > addresses", as it is the same mapping.
> 
> The child can use mmap() to get a new mapping.  Whether the virtual address is
> different from the parent's doesn't matter actually.

Thanks for the response, I had one additional query responded to Zhiquan.

BR, Jarkko
Zhiquan Li Nov. 2, 2022, 1:38 a.m. UTC | #28
On 2022/11/1 08:46, jarkko@kernel.org wrote:
> Why would we want to enforce for user space not to do this, even
> if it does cause malfunctioning program?
> 

We want to resolve the problem at its source rather than just deal
passively with the symptoms derived from it.  For instance, we might be
able to return VM_FAULT_SIGBUS to kill the malicious application, but if
the malicious child touches the memory earlier than the parent, then even
though the child cannot use the virtual EPC page, the parent becomes the
victim.

Even though it's not a security threat, there is no practical benefit to
sharing a virtual EPC instance.  So we would like to prevent it from the
beginning.

Best Regards,
Zhiquan

> BR, Jarkko
Huang, Kai Nov. 4, 2022, 10:17 a.m. UTC | #29
On Tue, 2022-11-01 at 02:46 +0200, jarkko@kernel.org wrote:
> On Mon, Oct 24, 2022 at 09:32:13AM +0800, Zhiquan Li wrote:
> > 
> > 
> > On 2022/10/24 04:39, jarkko@kernel.org wrote:
> > > > As you can see if the EPC page has already been populated at a given index of
> > > > one virtual EPC instance, the current fault handler just assumes the mapping is
> > > > already there and returns success immediately.  This causes a bug when one
> > > > virtual EPC instance is shared by multi processes via fork(): if the EPC page at
> > > > one index is already populated by the parent process, when the child accesses
> > > > the same page using different virtual address, the fault handler just returns
> > > > success w/o actually setting up the mapping for the child, resulting in endless
> > > > page fault.
> > > > 
> > > > This needs to be fixed in no matter what way.
> > > I think you mean that vm_insert_pfn() does not happen for child because
> > > of early return? I did not understand the part about "different virtual
> > > addresses", as it is the same mapping.
> > > 
> > 
> > If userspace do something like this, the child will get "different
> > virtual address":
> > 
> >   ... parent run enclave within VM ...
> >   if (fork() == 0) {
> >       int *caddr = mmap(NULL, 4096, PROT_READ, MAP_SHARED, vepc_fd, 0);
> >       printf("child: %d\n", caddr[0]);
> >   }
> > 
> > 
> > - "vepc_fd" is inherited from parent which had opened /dev/sgx_vepc.
> > - mmap() will create a VMA in child with "different virtual addresses".
> > - "caddr[0]" will cause a page fault as it's a new mapping.
> > 
> > 1. Then kernel will run into the code snippet referenced by Kai.
> > 2. The early return 0 will result in sgx_vepc_fault() return
> >    "VM_FAULT_NOPAGE".
> > 3. This return value will make the condition as true at
> >    function do_user_addr_fault()
> > 
> >    if (likely(!(fault & VM_FAULT_ERROR)))
> >        return;
> > 
> > 4. Since this page fault has not been handled and "$RIP" is still the
> >    original value, it will result in the same page fault again.  Namely,
> >    it's an endless page fault.
> > 
> > But the problem is neither the early return in __sgx_vepc_fault() nor
> > the return of VM_FAULT_NOPAGE at sgx_vepc_fault().  The root cause has
> > been highlighted by Kai, one virtual EPC instance
> > can only be mmap()-ed by the process which opens /dev/sgx_vepc.
> > 
> > In fact, to share a virtual EPC instance in userspace doesn't make any
> > sense.  Even though it can be shared by child, the virtual EPC page
> > cannot be used by child correctly.
> 
> OK, makes sense, thanks for the explanation!
> 
> Why would we want to enforce for user space not to do this, even
> if it does cause malfunctioning program?
> 
> BR, Jarkko

Hi Jarkko, Dave,

I've been re-thinking about this #MC handling on virtual EPC by stepping back
to the beginning, and I think we have more problems than just "whether the
kernel should enforce that a child cannot mmap() virtual EPC".

First of all, if we want to use epc->owner to carry the userspace virtual
address, "make the kernel enforce that a child cannot mmap() virtual EPC" alone
isn't good enough -- nothing prevents userspace from calling mmap() several
times to map the same virtual EPC pages.  So additionally, we also need to
"make the kernel enforce that one virtual EPC can only be mmap()-ed once".

Secondly, I am thinking that the current arch_memory_failure() cannot really
handle a #MC on a virtual EPC page correctly.  The problem is, even though we
mark the page as poisoned and send a signal to userspace to inject a #MC into
the guest to handle, the poisoned virtual EPC page is never unmapped from the
guest and then freed.

This means a malicious guest can always try to use the poisoned EPC page again
after it receives the #MC on some EPC page.  I am not entirely sure what kind
of behaviour/attack is possible in such a case, but it seems the right
behaviour is for KVM to inject the #MC and unmap the poisoned EPC page from
the guest.  And if the guest ever tries to use that "guest's EPC page" (GFN)
again, KVM should kill the guest.

Hi Sean,

If you ever see this, could you also comment?
Sean Christopherson Nov. 4, 2022, 4:26 p.m. UTC | #30
On Fri, Nov 04, 2022, Huang, Kai wrote:
> > > In fact, to share a virtual EPC instance in userspace doesn't make any
> > > sense.  Even though it can be shared by child, the virtual EPC page
> > > cannot be used by child correctly.
> > 
> > OK, makes sense, thanks for the explanation!
> > 
> > Why would we want to enforce for user space not to do this, even
> > if it does cause malfunctioning program?
> > 
> > BR, Jarkko
> 
> Hi Jarkko, Dave,
> 
> I've been re-thinking about this #MC handle on virtual EPC by stepping back to
> the beginning, and I think we have more problems than this "whether kernel
> should enforce child cannot mmap() virtual EPC".

IMO, virtual EPC should be restricted to a single mm_struct, which is what was
originally proposed many years ago[*].  I should have pushed back harder, but by
that point I had mostly stopped caring about SGX.

There is no use case for sharing a virtual EPC, and conceptually it just doesn't
make sense because all use would need to be mutually exclusive on a per-page basis
to keep the EPCM happy.

[*] https://lore.kernel.org/kvm/ace9d4cb10318370f6145aaced0cfa73dda36477.1609890536.git.kai.huang@intel.com

> First of all, if we want to use epc->owner to carry the userspace virtual
> address, "make kernel enforce child cannot mmap() virtual EPC" alone isn't good
> enough -- nothing prevents userspace to call mmap() several times to map the
> same virtual EPC pages.  So additionally, we also need to "make kernel enforce
> one virtual EPC can only be mmap()-ed once".
> 
> Secondly, I am thinking that the current arch_memory_failure() cannot really
> handle #MC for virtual EPC page correctly.  The problem is, even we mark the
> page as poisoned, and send signal to userspace to inject #MC to guest to handle,
> the poisoned virtual EPC page is never unmapped from the guest and then freed.
> 
> This means a malicious guest can always try to use the poisoned EPC page again
> after it receives #MC on some EPC page.  I am not entirely sure what kind
> behaviour/attack can be done in such case, but it seems the right behaviour
> should be the KVM to inject the #MC and unmap the poisoned EPC page from guest.
> And if guest ever tries to use this "guest's EPC page" (GFN) again, KVM should
> kill the guest.

Just zap the PTEs for the affected mm_struct, mmu_notifier => KVM will do the rest.
Dave Hansen Nov. 4, 2022, 4:34 p.m. UTC | #31
On 11/4/22 09:26, Sean Christopherson wrote:
>> I've been re-thinking about this #MC handle on virtual EPC by stepping back to
>> the beginning, and I think we have more problems than this "whether kernel
>> should enforce child cannot mmap() virtual EPC".
> IMO, virtual EPC should be restricted to a single mm_struct, which is what was
> originally proposed many years ago[*].  I should have pushed back harder, but by
> that point I had mostly stopped caring about SGX.

Considering that we have VM_DONTCOPY set on the vepc VMA, this shouldn't
be too hard to pull off.  We could just return -EBUSY if another mm
comes around and tries to mmap() the fd.
Huang, Kai Nov. 7, 2022, 8:54 a.m. UTC | #32
On Fri, 2022-11-04 at 16:26 +0000, Sean Christopherson wrote:
> On Fri, Nov 04, 2022, Huang, Kai wrote:
> > > > In fact, to share a virtual EPC instance in userspace doesn't make any
> > > > sense.  Even though it can be shared by child, the virtual EPC page
> > > > cannot be used by child correctly.
> > > 
> > > OK, makes sense, thanks for the explanation!
> > > 
> > > Why would we want to enforce for user space not to do this, even
> > > if it does cause malfunctioning program?
> > > 
> > > BR, Jarkko
> > 
> > Hi Jarkko, Dave,
> > 
> > I've been re-thinking about this #MC handle on virtual EPC by stepping back to
> > the beginning, and I think we have more problems than this "whether kernel
> > should enforce child cannot mmap() virtual EPC".
> 
> IMO, virtual EPC should be restricted to a single mm_struct, which is what was
> originally proposed many years ago[*].  I should have pushed back harder, but by
> that point I had mostly stopped caring about SGX.
> 
> There is no use case for sharing a virtual EPC, and conceptually it just doesn't
> make sense because all use would need to be mutually exclusive on a per-page basis
> to keep the EPCM happy.
> 
> [*] https://lore.kernel.org/kvm/ace9d4cb10318370f6145aaced0cfa73dda36477.1609890536.git.kai.huang@intel.com

Thanks for your time.  Yes I agree.

Also, IMHO we can further enforce that only one VMA can be associated with one
virtual EPC instance.  It also doesn't make sense to have multiple mappings
(VMAs) of the same virtual EPC pages even within one mm_struct, right?

This also makes zapping PTEs easier (as mentioned below).

> 
> > First of all, if we want to use epc->owner to carry the userspace virtual
> > address, "make kernel enforce child cannot mmap() virtual EPC" alone isn't good
> > enough -- nothing prevents userspace to call mmap() several times to map the
> > same virtual EPC pages.  So additionally, we also need to "make kernel enforce
> > one virtual EPC can only be mmap()-ed once".
> > 
> > Secondly, I am thinking that the current arch_memory_failure() cannot really
> > handle #MC for virtual EPC page correctly.  The problem is, even we mark the
> > page as poisoned, and send signal to userspace to inject #MC to guest to handle,
> > the poisoned virtual EPC page is never unmapped from the guest and then freed.
> > 
> > This means a malicious guest can always try to use the poisoned EPC page again
> > after it receives #MC on some EPC page.  I am not entirely sure what kind
> > behaviour/attack can be done in such case, but it seems the right behaviour
> > should be the KVM to inject the #MC and unmap the poisoned EPC page from guest.
> > And if guest ever tries to use this "guest's EPC page" (GFN) again, KVM should
> > kill the guest.
> 
> Just zap the PTEs for the affected mm_struct, mmu_notifier => KVM will do the rest.

Zapping the PTEs alone can take the poisoned virtual EPC page away from the
guest and map in a new one, but it doesn't inject a #MC into the guest.
Without injecting a #MC, the guest will just see a sudden loss of the enclave
which consumed that EPC page (the enclave is marked as bad by hardware if any
of its pages caused a #MC).

But I guess that's also acceptable, since there are already other cases where
the guest can see a sudden loss of an enclave (e.g. live migration)?

Btw, currently the virtual EPC driver doesn't have the infrastructure to zap
PTEs.  If we can associate one virtual EPC instance with one VMA, then zapping
the PTEs can easily be done by bookkeeping that VMA.

Does this look good?
Huang, Kai Nov. 7, 2022, 8:55 a.m. UTC | #33
On Fri, 2022-11-04 at 09:34 -0700, Dave Hansen wrote:
> On 11/4/22 09:26, Sean Christopherson wrote:
> > > I've been re-thinking about this #MC handle on virtual EPC by stepping back to
> > > the beginning, and I think we have more problems than this "whether kernel
> > > should enforce child cannot mmap() virtual EPC".
> > IMO, virtual EPC should be restricted to a single mm_struct, which is what was
> > originally proposed many years ago[*].  I should have pushed back harder, but by
> > that point I had mostly stopped caring about SGX.
> 
> Considering that we have VM_DONTCOPY set on the vepc VMA, this shouldn't
> be too hard to pull off.  We could just return -EBUSY if another mm
> comes around and tries to mmap() the fd.

Yes.  We can record the MM which opens /dev/sgx_vepc and reject mmap() from
other MMs.
Jarkko Sakkinen Nov. 7, 2022, 11:36 a.m. UTC | #34
On Wed, Nov 02, 2022 at 09:38:55AM +0800, Zhiquan Li wrote:
> 
> On 2022/11/1 08:46, jarkko@kernel.org wrote:
> > Why would we want to enforce for user space not to do this, even
> > if it does cause malfunctioning program?
> > 
> 
> We want to resolve the problem at the source rather than just deal with
> the symptom passively derived from it. For instance, we might be able to
> return VM_FAULT_SIGBUS to kill the malicious application, but if the
> malicious child touch the memory earlier than parent despite it cannot
> use the virtual EPC page, then the parent will be victim.
> 
> Even thought it's not a security threaten, there is no practical
> significance for sharing a virtual EPC instance. So we would like to
> prevent it from the beginning.

Can you phrase this to the commit message? This makes sense as a
motivation.

BR, Jarkko
Zhiquan Li Nov. 7, 2022, 12:19 p.m. UTC | #35
On 2022/11/7 19:36, jarkko@kernel.org wrote:
> Can you phrase this to the commit message? This makes sense as a
> motivation.
> 

No problem, I will add it to v10 patch after current concerns are
clarified and get a clear solution.

Best Regards,
Zhiquan

Patch

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index b319bedcaf1e..160c8dbee0ab 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -679,6 +679,8 @@  int arch_memory_failure(unsigned long pfn, int flags)
 	struct sgx_epc_page *page = sgx_paddr_to_page(pfn << PAGE_SHIFT);
 	struct sgx_epc_section *section;
 	struct sgx_numa_node *node;
+	void __user *vaddr;
+	int ret;
 
 	/*
 	 * mm/memory-failure.c calls this routine for all errors
@@ -695,8 +697,26 @@  int arch_memory_failure(unsigned long pfn, int flags)
 	 * error. The signal may help the task understand why the
 	 * enclave is broken.
 	 */
-	if (flags & MF_ACTION_REQUIRED)
-		force_sig(SIGBUS);
+	if (flags & MF_ACTION_REQUIRED) {
+		/*
+		 * Provide extra info to the task so that it can make a further
+		 * decision rather than simply being killed. This is quite
+		 * useful in the virtualization case.
+		 */
+		if (page->flags & SGX_EPC_PAGE_KVM_GUEST) {
+			/*
+			 * The 'encl_owner' field is repurposed: when this EPC page
+			 * was allocated, it was assigned the virtual address of the
+			 * virtual EPC page.
+			 */
+			vaddr = (void *)((unsigned long)page->vepc_vaddr & PAGE_MASK);
+			ret = force_sig_mceerr(BUS_MCEERR_AR, vaddr, PAGE_SHIFT);
+			if (ret < 0)
+				pr_err("Memory failure: Error sending signal to %s:%d: %d\n",
+					current->comm, current->pid, ret);
+		} else
+			force_sig(SIGBUS);
+	}
 
 	section = &sgx_epc_sections[page->section];
 	node = section->node;