diff mbox series

[RFC] seccomp: Implement syscall isolation based on memory areas

Message ID 20200530055953.817666-1-krisman@collabora.com (mailing list archive)
State New, archived
Headers show
Series [RFC] seccomp: Implement syscall isolation based on memory areas | expand

Commit Message

Gabriel Krisman Bertazi May 30, 2020, 5:59 a.m. UTC
Modern Windows applications are executing system call instructions
directly from the application's code without going through the WinAPI.
This breaks Wine emulation, because it doesn't have a chance to
intercept and emulate these syscalls before they are submitted to Linux.

In addition, we cannot simply trap every system call of the application
to userspace using PTRACE_SYSEMU, because performance would suffer,
since our main use case is to run Windows games over Linux.  Therefore,
we need some in-kernel filtering to decide whether the syscall was
issued by the wine code or by the windows application.

The filtering cannot really be done based solely on the syscall number,
because those could collide with existing Linux syscalls.  Instead, our
proposed solution is to trap syscalls based on the userspace memory
region that triggered the syscall, as wine is responsible for the
Windows code allocations and it can apply correct memory protections to
those areas.

Therefore, this patch reuses the seccomp infrastructure to trap
system calls, but introduces a new mode to trap based on a vma attribute
that describes whether the userspace memory region is allowed to execute
syscalls or not.  The protection is defined at mmap/mprotect time with a
new protection flag PROT_NOSYSCALL.  This setting only takes effect if
the new SECCOMP_MODE_MEMMAP is enabled through seccomp().

It goes without saying that this is in no way a security mechanism
despite being built on top of seccomp, since an evil application can
always jump to a whitelisted memory region and run the syscall.  This
is not a concern for Wine games.  Nevertheless, we reuse seccomp as a
way to avoid adding a new mechanism to essentially do the same job of
filtering system calls.

* Why not SECCOMP_MODE_FILTER?

We experimented with dynamically generating BPF filters for whitelisted
memory regions and using SECCOMP_MODE_FILTER, but there are a few
reasons why it isn't enough nor a good idea for our use case:

1. We cannot set the filters at program initialization time and forget
about it, since there is no way of knowing which modules will be loaded,
whether native and windows.  Filter would need a way to be updated
frequently during game execution.

2. We cannot predict which Linux libraries will issue syscalls directly.
Most of the time, whitelisting libc and a few other libraries is enough,
but there are no guarantees other Linux libraries won't issue syscalls
directly and break the execution.  Adding every linux library that is
loaded also has a large performance cost due to the large resulting
filter.

3. As I mentioned before, performance is critical.  In our testing with
just a single memory segment blacklisted/whitelisted, the minimum size
of a bpf filter would be 4 instructions.  In that scenario,
SECCOMP_MODE_FILTER added an average overhead of 10% to the execution
time of sysinfo(2) in comparison to seccomp disabled, while the impact
of SECCOMP_MODE_MEMMAP was averaged around 1.5%.

Indeed, points 1 and 2 could be worked around with some userspace work
and improved SECCOMP_MODE_FILTER support, but at a high performance and
some stability cost, to obtain the semantics we want.  Still, the
performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
enough that I believe it should be considered as an upstream solution.

Sending as an RFC for now to get the discussion started.  In particular:

1) Would this design be acceptable?

2) I'm not comfortable with using the high part of vm_flags, though I'm
not sure where else it would fit.  Suggestions?

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Will Drewry <wad@chromium.org>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Paul Gofman <gofmanp@gmail.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 arch/Kconfig                           | 21 +++++++++
 arch/x86/Kconfig                       |  1 +
 include/linux/mm.h                     |  8 ++++
 include/linux/mman.h                   |  4 +-
 include/uapi/asm-generic/mman-common.h |  2 +
 include/uapi/linux/seccomp.h           |  2 +
 kernel/seccomp.c                       | 59 ++++++++++++++++++++++++++
 mm/mprotect.c                          |  2 +-
 8 files changed, 97 insertions(+), 2 deletions(-)

Comments

Kees Cook May 30, 2020, 5:30 p.m. UTC | #1
On Sat, May 30, 2020 at 01:59:53AM -0400, Gabriel Krisman Bertazi wrote:
> Modern Windows applications are executing system call instructions
> directly from the application's code without going through the WinAPI.
> This breaks Wine emulation, because it doesn't have a chance to
> intercept and emulate these syscalls before they are submitted to Linux.
> 
> In addition, we cannot simply trap every system call of the application
> to userspace using PTRACE_SYSEMU, because performance would suffer,
> since our main use case is to run Windows games over Linux.  Therefore,
> we need some in-kernel filtering to decide whether the syscall was
> issued by the wine code or by the windows application.

Interesting use-case! It seems like you're in the position of needing to
invert the assumption about syscalls: before you knew everything going
through WinAPI needed emulation, and now you need to assume everything
not going through a native library needs emulation. Oof.

Is it possible to disassemble and instrument the Windows code to insert
breakpoints (or emulation calls) at all the Windows syscall points?

> [...]
> * Why not SECCOMP_MODE_FILTER?
> 
> We experimented with dynamically generating BPF filters for whitelisted
> memory regions and using SECCOMP_MODE_FILTER, but there are a few
> reasons why it isn't enough nor a good idea for our use case:
> 
> 1. We cannot set the filters at program initialization time and forget
> about it, since there is no way of knowing which modules will be loaded,
> whether native and windows.  Filter would need a way to be updated
> frequently during game execution.
> 
> 2. We cannot predict which Linux libraries will issue syscalls directly.
> Most of the time, whitelisting libc and a few other libraries is enough,
> but there are no guarantees other Linux libraries won't issue syscalls
> directly and break the execution.  Adding every linux library that is
> loaded also has a large performance cost due to the large resulting
> filter.

Just so I can understand the expected use: given the dynamic nature of
the library loading, how would Wine be marking the VMAs?

> 3. As I mentioned before, performance is critical.  In our testing with
> just a single memory segment blacklisted/whitelisted, the minimum size
> of a bpf filter would be 4 instructions.  In that scenario,
> SECCOMP_MODE_FILTER added an average overhead of 10% to the execution
> time of sysinfo(2) in comparison to seccomp disabled, while the impact
> of SECCOMP_MODE_MEMMAP was averaged around 1.5%.

Was the BPF JIT enabled? I was recently examining filter performance too:
https://lore.kernel.org/linux-security-module/202005291043.A63D910A8@keescook/

> Indeed, points 1 and 2 could be worked around with some userspace work
> and improved SECCOMP_MODE_FILTER support, but at a high performance and
> some stability cost, to obtain the semantics we want.  Still, the
> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
> enough that I believe it should be considered as an upstream solution.

It looks like you're using SECCOMP_RET_TRAP for this? Signal handling
can be pretty slow. Did you try SECCOMP_RET_USER_NOTIF?

> Sending as an RFC for now to get the discussion started.  In particular:
> 
> 1) Would this design be acceptable?

Maybe? I want to spend a little time exploring alternatives first. :)

> 2) I'm not comfortable with using the high part of vm_flags, though I'm
> not sure where else it would fit.  Suggestions?

I don't know this area very well. Hopefully some of the mm folks can
comment on it. It seems like there needs to be a more dynamic way mark
VMAs for arbitrary purposes (since this isn't _actually_ a PROT* flag,
it's just a marking for seccomp to check).

Regardless, here's a review of the specifics of patch, without regard to
the design question above. :)

> 
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Will Drewry <wad@chromium.org>
> CC: H. Peter Anvin <hpa@zytor.com>
> CC: Paul Gofman <gofmanp@gmail.com>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
> ---
>  arch/Kconfig                           | 21 +++++++++
>  arch/x86/Kconfig                       |  1 +
>  include/linux/mm.h                     |  8 ++++
>  include/linux/mman.h                   |  4 +-
>  include/uapi/asm-generic/mman-common.h |  2 +
>  include/uapi/linux/seccomp.h           |  2 +
>  kernel/seccomp.c                       | 59 ++++++++++++++++++++++++++
>  mm/mprotect.c                          |  2 +-
>  8 files changed, 97 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 786a85d4ad40..e5c10231c9c2 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -471,6 +471,27 @@ config SECCOMP_FILTER
>  
>  	  See Documentation/userspace-api/seccomp_filter.rst for details.
>  
> +config HAVE_ARCH_SECCOMP_MEMMAP
> +	bool
> +	help
> +	  An arch should select this symbol if it provides all of these things:
> +	  - syscall_get_arch()
> +	  - syscall_get_arguments()
> +	  - syscall_rollback()
> +	  - syscall_set_return_value()
> +	  - SIGSYS siginfo_t support
> +	  - secure_computing is called from a ptrace_event()-safe context
> +	  - secure_computing return value is checked and a return value of -1
> +	    results in the system call being skipped immediately.
> +	  - seccomp syscall wired up
> +
> +config SECCOMP_MEMMAP
> +	def_bool y
> +	depends on HAVE_ARCH_SECCOMP_MEMMAP && SECCOMP
> +	help
> +	  Enable tasks to trap syscalls based on the page that triggered
> +	  the syscall.
> +

Since I don't see anything distinct in the requirements for
SECCOMP_MEMMAP, so I don't think it needs a separate CONFIG: it is almost
entirely built on the SECCOMP_FILTER infrastructure. I'd just drop the
CONFIGs.

>  config HAVE_ARCH_STACKLEAK
>  	bool
>  	help
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2d3f963fd6f1..70998b01c533 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -145,6 +145,7 @@ config X86
>  	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
>  	select HAVE_ARCH_PREL32_RELOCATIONS
>  	select HAVE_ARCH_SECCOMP_FILTER
> +	select HAVE_ARCH_SECCOMP_MEMMAP

(and it's not arch-specific)

>  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>  	select HAVE_ARCH_STACKLEAK
>  	select HAVE_ARCH_TRACEHOOK
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5a323422d783..6271fb7fd5bc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -294,11 +294,13 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
> +#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
>  #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
>  #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
>  #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
>  #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
> +#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
>  #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>  
>  #ifdef CONFIG_ARCH_HAS_PKEYS
> @@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp);
>  # define VM_GROWSUP	VM_NONE
>  #endif
>  
> +#if defined(CONFIG_X86)
> +# define VM_NOSYSCALL	VM_HIGH_ARCH_5
> +#else
> +# define VM_NOSYSCALL	VM_NONE
> +#endif
> +
>  /* Bits set in the VMA until the stack is in its final location */
>  #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
>  

I have no idea what the correct way to mark VMAs would be. As you
mentioned, this seems ... wrong. :)

> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 4b08e9c9c538..a5ca42eb685a 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -94,7 +94,8 @@ static inline void vm_unacct_memory(long pages)
>   */
>  static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
>  {
> -	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0;
> +	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM
> +			 | PROT_NOSYSCALL)) == 0;
>  }
>  #define arch_validate_prot arch_validate_prot
>  #endif
> @@ -119,6 +120,7 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
>  	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
>  	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
>  	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
> +	       _calc_vm_trans(prot, PROT_NOSYSCALL, VM_NOSYSCALL) |
>  	       arch_calc_vm_prot_bits(prot, pkey);
>  }
>  
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index f94f65d429be..4eafbaa3f4bc 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -13,6 +13,8 @@
>  #define PROT_SEM	0x8		/* page may be used for atomic ops */
>  /*			0x10		   reserved for arch-specific use */
>  /*			0x20		   reserved for arch-specific use */
> +#define PROT_NOSYSCALL	0x40		/* page cannot trigger syscalls */
> +
>  #define PROT_NONE	0x0		/* page can not be accessed */
>  #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
>  #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */

AIUI, all of the above is to plumb the VMA marking through an mprotect()
call. I wonder if perhaps madvise() would be better? I'm not sure how
tight we are on new flags there, but I think it would be cleaner to use
that interface. Take a look at MADV_WIPEONFORK / VM_WIPEONFORK.

> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index c1735455bc53..c7d042a409e7 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -10,12 +10,14 @@
>  #define SECCOMP_MODE_DISABLED	0 /* seccomp is not in use. */
>  #define SECCOMP_MODE_STRICT	1 /* uses hard-coded filter. */
>  #define SECCOMP_MODE_FILTER	2 /* uses user-supplied filter. */
> +#define SECCOMP_MODE_MEMMAP	3 /* Lock syscalls per memory region. */

Making this incompatible with FILTER might cause problems for the future
(more and more process launchers are starting to set filters).

So this would, perhaps, be a new flag for struct seccomp, rather than a
new operating mode.

>  
>  /* Valid operations for seccomp syscall. */
>  #define SECCOMP_SET_MODE_STRICT		0
>  #define SECCOMP_SET_MODE_FILTER		1
>  #define SECCOMP_GET_ACTION_AVAIL	2
>  #define SECCOMP_GET_NOTIF_SIZES		3
> +#define SECCOMP_SET_MODE_MEMMAP		4
>  
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
>  #define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 55a6184f5990..ebf09b02db8d 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -930,6 +930,55 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  }
>  #endif
>  
> +#ifdef CONFIG_SECCOMP_MEMMAP
> +static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
> +{
> +	struct vm_area_struct *vma = find_vma(current->mm,
> +					      sd->instruction_pointer);
> +	if (!vma)
> +		BUG();

No new kernel code should use BUG:
https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on

I would maybe pr_warn_once(), but then treat it as if it was marked with
VM_NOSYSCALL.

> +
> +	if (!(vma->vm_flags & VM_NOSYSCALL))
> +		return 0;
> +
> +	syscall_rollback(current, task_pt_regs(current));
> +	seccomp_send_sigsys(this_syscall, 0);
> +
> +	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true);
> +
> +	return -1;
> +}

This really just looks like an ip_address filter, but I get what you
mean about stacking filters, etc. This may finally be the day we turn to
eBPF in seccomp, since that would give you access to using a map lookup
on ip_address, and the map could be updated externally (removing the
need for the madvise() changes).

> +
> +static long seccomp_set_mode_memmap(unsigned int flags)
> +{
> +	const unsigned long seccomp_mode = SECCOMP_MODE_MEMMAP;
> +	long ret = 0;
> +
> +	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> +		return -EINVAL;

This should just test for any bits in flags. TSYNC + memmap should be
avoided, IMO.

> +
> +	spin_lock_irq(&current->sighand->siglock);
> +
> +	if (seccomp_may_assign_mode(seccomp_mode))
> +		seccomp_assign_mode(current, seccomp_mode, flags);
> +	else
> +		ret = -EINVAL;
> +
> +	spin_unlock_irq(&current->sighand->siglock);
> +
> +	return ret;
> +}
> +#else
> +static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
> +{
> +	BUG();
> +}
> +static long seccomp_set_mode_memmap(unsigned int flags)
> +{
> +	return -EINVAL;
> +}
> +#endif /* CONFIG_SECCOMP_MEMMAP */
> +
>  int __secure_computing(const struct seccomp_data *sd)
>  {
>  	int mode = current->seccomp.mode;
> @@ -948,6 +997,8 @@ int __secure_computing(const struct seccomp_data *sd)
>  		return 0;
>  	case SECCOMP_MODE_FILTER:
>  		return __seccomp_filter(this_syscall, sd, false);
> +	case SECCOMP_MODE_MEMMAP:
> +		return __seccomp_memmap(this_syscall, sd);
>  	default:
>  		BUG();
>  	}
> @@ -1425,6 +1476,10 @@ static long do_seccomp(unsigned int op, unsigned int flags,
>  			return -EINVAL;
>  
>  		return seccomp_get_notif_sizes(uargs);
> +	case SECCOMP_SET_MODE_MEMMAP:
> +		if (uargs != NULL)
> +			return -EINVAL;
> +		return seccomp_set_mode_memmap(flags);
>  	default:
>  		return -EINVAL;
>  	}
> @@ -1462,6 +1517,10 @@ long prctl_set_seccomp(unsigned long seccomp_mode, void __user *filter)
>  		op = SECCOMP_SET_MODE_FILTER;
>  		uargs = filter;
>  		break;
> +	case SECCOMP_MODE_MEMMAP:
> +		op = SECCOMP_SET_MODE_MEMMAP;
> +		uargs = NULL;
> +		break;

The prctl() interface is deprecated, so no new features should be added
there.

>  	default:
>  		return -EINVAL;
>  	}
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 494192ca954b..6b5c00e8bbdc 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -591,7 +591,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
>  		 * cleared from the VMA.
>  		 */
>  		mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC |
> -					VM_FLAGS_CLEAR;
> +			VM_NOSYSCALL | VM_FLAGS_CLEAR;
>  
>  		new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey);
>  		newflags = calc_vm_prot_bits(prot, new_vma_pkey);
> -- 
> 2.27.0.rc2
> 

So, yes, interesting. I'll continue to think about how this could work
best. Can you share what your filters looked like when you were
originally trying those?

Thanks!

-Kees
Andy Lutomirski May 30, 2020, 10:09 p.m. UTC | #2
> On May 29, 2020, at 11:00 PM, Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
> 
> Modern Windows applications are executing system call instructions
> directly from the application's code without going through the WinAPI.
> This breaks Wine emulation, because it doesn't have a chance to
> intercept and emulate these syscalls before they are submitted to Linux.
> 
> In addition, we cannot simply trap every system call of the application
> to userspace using PTRACE_SYSEMU, because performance would suffer,
> since our main use case is to run Windows games over Linux.  Therefore,
> we need some in-kernel filtering to decide whether the syscall was
> issued by the wine code or by the windows application.

Do you really need in-kernel filtering?  What if you could have efficient userspace filtering instead?  That is, set something up so that all syscalls, except those from a special address, are translated to CALL thunk where the thunk is configured per task.  Then the thunk can do whatever emulation is needed.

Getting the details and especially the interaction with any seccomp filters that may be installed right could be tricky, but the performance should be decent, at least on non-PTI systems.

(If we go this route, I suspect that the correct interaction with seccomp is that this type of redirection takes precedence over seccomp and seccomp filters are not invoked for redirected syscalls. After all, a redirected syscall is, functionally, not a syscall at all.)

>
Gabriel Krisman Bertazi May 31, 2020, 12:26 a.m. UTC | #3
Andy Lutomirski <luto@amacapital.net> writes:

>> On May 29, 2020, at 11:00 PM, Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
>> 
>> Modern Windows applications are executing system call instructions
>> directly from the application's code without going through the WinAPI.
>> This breaks Wine emulation, because it doesn't have a chance to
>> intercept and emulate these syscalls before they are submitted to Linux.
>> 
>> In addition, we cannot simply trap every system call of the application
>> to userspace using PTRACE_SYSEMU, because performance would suffer,
>> since our main use case is to run Windows games over Linux.  Therefore,
>> we need some in-kernel filtering to decide whether the syscall was
>> issued by the wine code or by the windows application.
>
> Do you really need in-kernel filtering?  What if you could have
> efficient userspace filtering instead?  That is, set something up so
> that all syscalls, except those from a special address, are translated
> to CALL thunk where the thunk is configured per task.  Then the thunk
> can do whatever emulation is needed.

Hi,

I suggested something similar to my customer, by using
libsyscall-intercept.  The idea would be overwritting the syscall
instruction with a call to the entry point.  I'm not a specialist on the
specifics of Windows games, (cc'ed Paul Gofman, who can provide more
details on that side), but as far as I understand, the reason why that
is not feasible is that the anti-cheat protection in games will abort
execution if the binary region was modified either on-disk or in-memory.

Is there some mechanism to do that without modiyfing the application?

> Getting the details and especially the interaction with any seccomp
> filters that may be installed right could be tricky, but the performance
> should be decent, at least on non-PTI systems.
>
> (If we go this route, I suspect that the correct interaction with
> seccomp is that this type of redirection takes precedence over seccomp
> and seccomp filters are not invoked for redirected syscalls. After all,
> a redirected syscall is, functionally, not a syscall at all.)
>
Andy Lutomirski May 31, 2020, 12:59 a.m. UTC | #4
> On May 30, 2020, at 5:26 PM, Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
> 
> Andy Lutomirski <luto@amacapital.net> writes:
> 
>>>> On May 29, 2020, at 11:00 PM, Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
>>> 
>>> Modern Windows applications are executing system call instructions
>>> directly from the application's code without going through the WinAPI.
>>> This breaks Wine emulation, because it doesn't have a chance to
>>> intercept and emulate these syscalls before they are submitted to Linux.
>>> 
>>> In addition, we cannot simply trap every system call of the application
>>> to userspace using PTRACE_SYSEMU, because performance would suffer,
>>> since our main use case is to run Windows games over Linux.  Therefore,
>>> we need some in-kernel filtering to decide whether the syscall was
>>> issued by the wine code or by the windows application.
>> 
>> Do you really need in-kernel filtering?  What if you could have
>> efficient userspace filtering instead?  That is, set something up so
>> that all syscalls, except those from a special address, are translated
>> to CALL thunk where the thunk is configured per task.  Then the thunk
>> can do whatever emulation is needed.
> 
> Hi,
> 
> I suggested something similar to my customer, by using
> libsyscall-intercept.  The idea would be overwritting the syscall
> instruction with a call to the entry point.  I'm not a specialist on the
> specifics of Windows games, (cc'ed Paul Gofman, who can provide more
> details on that side), but as far as I understand, the reason why that
> is not feasible is that the anti-cheat protection in games will abort
> execution if the binary region was modified either on-disk or in-memory.
> 
> Is there some mechanism to do that without modiyfing the application?

I’m suggesting that the kernel learn how to help you, maybe like this:

prctl(PR_SET_SYSCALL_THUNK, target, address_of_unredirected_syscall, 0, 0, 0, 0);

This would be inherited on clone/fork and cleared on execve.

> 
>> Getting the details and especially the interaction with any seccomp
>> filters that may be installed right could be tricky, but the performance
>> should be decent, at least on non-PTI systems.
>> 
>> (If we go this route, I suspect that the correct interaction with
>> seccomp is that this type of redirection takes precedence over seccomp
>> and seccomp filters are not invoked for redirected syscalls. After all,
>> a redirected syscall is, functionally, not a syscall at all.)
>> 
> 
> 
> -- 
> Gabriel Krisman Bertazi
Gabriel Krisman Bertazi May 31, 2020, 5:56 a.m. UTC | #5
Kees Cook <keescook@chromium.org> writes:

> On Sat, May 30, 2020 at 01:59:53AM -0400, Gabriel Krisman Bertazi wrote:
>> Modern Windows applications are executing system call instructions
>> directly from the application's code without going through the WinAPI.
>> This breaks Wine emulation, because it doesn't have a chance to
>> intercept and emulate these syscalls before they are submitted to Linux.
>> 
>> In addition, we cannot simply trap every system call of the application
>> to userspace using PTRACE_SYSEMU, because performance would suffer,
>> since our main use case is to run Windows games over Linux.  Therefore,
>> we need some in-kernel filtering to decide whether the syscall was
>> issued by the wine code or by the windows application.
>
> Interesting use-case! It seems like you're in the position of needing to
> invert the assumption about syscalls: before you knew everything going
> through WinAPI needed emulation, and now you need to assume everything
> not going through a native library needs emulation. Oof.
>
> Is it possible to disassemble and instrument the Windows code to insert
> breakpoints (or emulation calls) at all the Windows syscall points?

Hi Kees,

I considered instrumenting the syscall instructions with calls to some
wrapper, but I was told that modifying the game in memory or on disk
will trigger all sorts of anti-cheating mechanisms (my main use case are
windows games).

>> [...]
>> * Why not SECCOMP_MODE_FILTER?
>> 
>> We experimented with dynamically generating BPF filters for whitelisted
>> memory regions and using SECCOMP_MODE_FILTER, but there are a few
>> reasons why it isn't enough nor a good idea for our use case:
>> 
>> 1. We cannot set the filters at program initialization time and forget
>> about it, since there is no way of knowing which modules will be loaded,
>> whether native and windows.  Filter would need a way to be updated
>> frequently during game execution.
>> 
>> 2. We cannot predict which Linux libraries will issue syscalls directly.
>> Most of the time, whitelisting libc and a few other libraries is enough,
>> but there are no guarantees other Linux libraries won't issue syscalls
>> directly and break the execution.  Adding every linux library that is
>> loaded also has a large performance cost due to the large resulting
>> filter.
>
> Just so I can understand the expected use: given the dynamic nature of
> the library loading, how would Wine be marking the VMAs?

Paul (cc'ed) is the wine expert, but my understanding is that memory
allocation and initial program load of the emulated binary will go
through wine.  It does the allocation and mark the vma accordingly
before returning the allocated range to the windows application.

>> 3. As I mentioned before, performance is critical.  In our testing with
>> just a single memory segment blacklisted/whitelisted, the minimum size
>> of a bpf filter would be 4 instructions.  In that scenario,
>> SECCOMP_MODE_FILTER added an average overhead of 10% to the execution
>> time of sysinfo(2) in comparison to seccomp disabled, while the impact
>> of SECCOMP_MODE_MEMMAP was averaged around 1.5%.
>
> Was the BPF JIT enabled? I was recently examining filter performance too:
> https://lore.kernel.org/linux-security-module/202005291043.A63D910A8@keescook/

yes:

root@dilma:~# sysctl -a | grep -i jit
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0

>
>> Indeed, points 1 and 2 could be worked around with some userspace work
>> and improved SECCOMP_MODE_FILTER support, but at a high performance and
>> some stability cost, to obtain the semantics we want.  Still, the
>> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
>> enough that I believe it should be considered as an upstream solution.
>
> It looks like you're using SECCOMP_RET_TRAP for this? Signal handling
> can be pretty slow. Did you try SECCOMP_RET_USER_NOTIF?

I experimented with SECCOMP_RET_TRAP and SECCOMP_RET_TRACE, but I hadn't
consider USER_NOTIF.  It seems to be a quite recent feature and I wasn't
aware of it.  I will try it and let you know.

>> diff --git a/include/linux/mman.h b/include/linux/mman.h
>> index 4b08e9c9c538..a5ca42eb685a 100644
>> --- a/include/linux/mman.h
>> +++ b/include/linux/mman.h
>> @@ -94,7 +94,8 @@ static inline void vm_unacct_memory(long pages)
>>   */
>>  static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
>>  {
>> -	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0;
>> +	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM
>> +			 | PROT_NOSYSCALL)) == 0;
>>  }
>>  #define arch_validate_prot arch_validate_prot
>>  #endif
>> @@ -119,6 +120,7 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
>>  	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
>>  	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
>>  	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
>> +	       _calc_vm_trans(prot, PROT_NOSYSCALL, VM_NOSYSCALL) |
>>  	       arch_calc_vm_prot_bits(prot, pkey);
>>  }
>>  
>> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
>> index f94f65d429be..4eafbaa3f4bc 100644
>> --- a/include/uapi/asm-generic/mman-common.h
>> +++ b/include/uapi/asm-generic/mman-common.h
>> @@ -13,6 +13,8 @@
>>  #define PROT_SEM	0x8		/* page may be used for atomic ops */
>>  /*			0x10		   reserved for arch-specific use */
>>  /*			0x20		   reserved for arch-specific use */
>> +#define PROT_NOSYSCALL	0x40		/* page cannot trigger syscalls */
>> +
>>  #define PROT_NONE	0x0		/* page can not be accessed */
>>  #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
>>  #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */
>
> AIUI, all of the above is to plumb the VMA marking through an mprotect()
> call. I wonder if perhaps madvise() would be better? I'm not sure how
> tight we are on new flags there, but I think it would be cleaner to use
> that interface. Take a look at MADV_WIPEONFORK / VM_WIPEONFORK.

Right, will do.  I considered it, but wasn't sure if madvise semantics
gave any guarantees of the result, but is seems to be the case for at
least for WIPEONFORK.  it is important to have it configurable on mmap
too, to avoid the extra syscall.  Plumbing through the mmap flags
argument shouldn't be a problem.

>> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
>> index c1735455bc53..c7d042a409e7 100644
>> --- a/include/uapi/linux/seccomp.h
>> +++ b/include/uapi/linux/seccomp.h
>> @@ -10,12 +10,14 @@
>>  #define SECCOMP_MODE_DISABLED	0 /* seccomp is not in use. */
>>  #define SECCOMP_MODE_STRICT	1 /* uses hard-coded filter. */
>>  #define SECCOMP_MODE_FILTER	2 /* uses user-supplied filter. */
>> +#define SECCOMP_MODE_MEMMAP	3 /* Lock syscalls per memory region. */
>
> Making this incompatible with FILTER might cause problems for the future
> (more and more process launchers are starting to set filters).
>
> So this would, perhaps, be a new flag for struct seccomp, rather than a
> new operating mode.

Yes, agreed.  I guess it would be a real scenario, even for us, to have
isolation against malicious syscalls (MODE_FILTER) combined
with region protection (MEMAP) for emulation.

>
>>  
>>  /* Valid operations for seccomp syscall. */
>>  #define SECCOMP_SET_MODE_STRICT		0
>>  #define SECCOMP_SET_MODE_FILTER		1
>>  #define SECCOMP_GET_ACTION_AVAIL	2
>>  #define SECCOMP_GET_NOTIF_SIZES		3
>> +#define SECCOMP_SET_MODE_MEMMAP		4
>>  
>>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
>>  #define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 55a6184f5990..ebf09b02db8d 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -930,6 +930,55 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>>  }
>>  #endif
>>  
>> +#ifdef CONFIG_SECCOMP_MEMMAP
>> +static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
>> +{
>> +	struct vm_area_struct *vma = find_vma(current->mm,
>> +					      sd->instruction_pointer);
>> +	if (!vma)
>> +		BUG();
>
> No new kernel code should use BUG:
> https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on
>
> I would maybe pr_warn_once(), but then treat it as if it was marked with
> VM_NOSYSCALL.
>
>> +
>> +	if (!(vma->vm_flags & VM_NOSYSCALL))
>> +		return 0;
>> +
>> +	syscall_rollback(current, task_pt_regs(current));
>> +	seccomp_send_sigsys(this_syscall, 0);
>> +
>> +	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true);
>> +
>> +	return -1;
>> +}
>
> This really just looks like an ip_address filter, but I get what you
> mean about stacking filters, etc. This may finally be the day we turn to
> eBPF in seccomp, since that would give you access to using a map lookup
> on ip_address, and the map could be updated externally (removing the
> need for the madvise() changes).

And 64-bit comparisons :)

that would be a good solution, we'd still need to look at some details,
like disabling/updating filters at runtime when some new library is
loaded, but since we can update externally, I think it covers it.

From the wine side, it should be enough to get the semantics we want,
but I don't know about performance.

>> +
>> +static long seccomp_set_mode_memmap(unsigned int flags)
>> +{
>> +	const unsigned long seccomp_mode = SECCOMP_MODE_MEMMAP;
>> +	long ret = 0;
>> +
>> +	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>> +		return -EINVAL;
>
> This should just test for any bits in flags. TSYNC + memmap should be
> avoided, IMO.

Maybe it is worth to expose SECCOMP_FILTER_FLAG_LOG?

>>  	default:
>>  		return -EINVAL;
>>  	}
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 494192ca954b..6b5c00e8bbdc 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -591,7 +591,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
>>  		 * cleared from the VMA.
>>  		 */
>>  		mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC |
>> -					VM_FLAGS_CLEAR;
>> +			VM_NOSYSCALL | VM_FLAGS_CLEAR;
>>  
>>  		new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey);
>>  		newflags = calc_vm_prot_bits(prot, new_vma_pkey);
>> -- 
>> 2.27.0.rc2
>> 
>
> So, yes, interesting. I'll continue to think about how this could work
> best. Can you share what your filters looked like when you were
> originally trying those?

The results I shared in the previous email had the following filter:

  BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))),
  BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 0xf000, 0, 1),
  BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
  BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

Which is much much simpler than the real thing.  I also had a small
filter generator that generated segments like the following ( ip_HW and
ip_LW are the high and low part of the instruction pointer):

  0007: BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ip_HW)
  0008: BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, EH, 0, REL_ADDR(next))
  0009: BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ip_LW)
  000a: BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, EL, REL_ADDR(next), 0)
  000b: BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, SL, REL_ADDR(filter->jump_allow_target), REL_ADDR(next))
  000c: next:

Thanks,
Paul Gofman May 31, 2020, 12:39 p.m. UTC | #6
Hi!

thanks for looking into this.

On 5/31/20 08:56, Gabriel Krisman Bertazi wrote:
>
>> Is it possible to disassemble and instrument the Windows code to insert
>> breakpoints (or emulation calls) at all the Windows syscall points?
> Hi Kees,
>
> I considered instrumenting the syscall instructions with calls to some
> wrapper, but I was told that modifying the game in memory or on disk
> will trigger all sorts of anti-cheating mechanisms (my main use case are
> windows games).

Yes, this is the case. Besides, before instrumenting, we would need some
way to find those syscalls in the highly obfuscated dynamically
generated code, the whole purpose of which is to prevent disassembling,
debugging and finding things like that in it.

Ultimately, even for the cases when it would be technically feasible I
don't think Wine could ever go for modifying the program's code (unless
of course this is the part of what the program is doing itself using
winapi calls). Wine is trying to implement the API as close to Windows
as possible so the DRM can work with Wine, modifying the program's code
is something different.

>
>>> [...]
>>> * Why not SECCOMP_MODE_FILTER?
>>>
>>> We experimented with dynamically generating BPF filters for whitelisted
>>> memory regions and using SECCOMP_MODE_FILTER, but there are a few
>>> reasons why it isn't enough nor a good idea for our use case:
>>>
>>> 1. We cannot set the filters at program initialization time and forget
>>> about it, since there is no way of knowing which modules will be loaded,
>>> whether native and windows.  Filter would need a way to be updated
>>> frequently during game execution.
>>>
>>> 2. We cannot predict which Linux libraries will issue syscalls directly.
>>> Most of the time, whitelisting libc and a few other libraries is enough,
>>> but there are no guarantees other Linux libraries won't issue syscalls
>>> directly and break the execution.  Adding every linux library that is
>>> loaded also has a large performance cost due to the large resulting
>>> filter.
>> Just so I can understand the expected use: given the dynamic nature of
>> the library loading, how would Wine be marking the VMAs?
> Paul (cc'ed) is the wine expert, but my understanding is that memory
> allocation and initial program load of the emulated binary will go
> through wine.  It does the allocation and mark the vma accordingly
> before returning the allocated range to the windows application.
Yes, exactly. Pretty much any memory allocation which Wine does needs
syscalls (if those are ever encountered later during executing code from
those areas) to be trapped by Wine and passed to Wine's implementation
of the corresponding Windows API function. Linux native libraries
loading and memory allocations performed by them go outside of Wine control.
>
>>> Indeed, points 1 and 2 could be worked around with some userspace work
>>> and improved SECCOMP_MODE_FILTER support, but at a high performance and
>>> some stability cost, to obtain the semantics we want.  Still, the
>>> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
>>> enough that I believe it should be considered as an upstream solution.
>> It looks like you're using SECCOMP_RET_TRAP for this? Signal handling
>> can be pretty slow. Did you try SECCOMP_RET_USER_NOTIF?

We are not much concerned with the overhead of the trapped syscall in
our use case, those are very rare. What we are concerned with is the
performance impact on the normal Linux syscalls when the syscall
trapping is enabled. When I was measuring that impact I've got the same
10% overhead for the untrapped syscalls (that is, always hitting
SECCOMP_RET_ALLOW case) with the filters Gabriel mentioned.


>>
>>> +
>>> +	if (!(vma->vm_flags & VM_NOSYSCALL))
>>> +		return 0;
>>> +
>>> +	syscall_rollback(current, task_pt_regs(current));
>>> +	seccomp_send_sigsys(this_syscall, 0);
>>> +
>>> +	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true);
>>> +
>>> +	return -1;
>>> +}
>> This really just looks like an ip_address filter, but I get what you
>> mean about stacking filters, etc. This may finally be the day we turn to
>> eBPF in seccomp, since that would give you access to using a map lookup
>> on ip_address, and the map could be updated externally (removing the
>> need for the madvise() changes).
> And 64-bit comparisons :)
>
> that would be a good solution, we'd still need to look at some details,
> like disabling/updating filters at runtime when some new library is
> loaded, but since we can update externally, I think it covers it.
I am afraid that for a general case the filter add / update / removal
should be done not just when a new library is loaded or unloaded but
pretty much any time new (executable) memory region is allocated or
deallocated, or PROT_EXEC should be changed on allocated pages . But I
am not sure I've got enough details yet on the suggested approach here
and might be missing important details. I guess maybe we could discuss
the details separately with Gabriel first.
Paul Gofman May 31, 2020, 12:56 p.m. UTC | #7
On 5/31/20 03:59, Andy Lutomirski wrote:
>
> I’m suggesting that the kernel learn how to help you, maybe like this:
>
> prctl(PR_SET_SYSCALL_THUNK, target, address_of_unredirected_syscall, 0, 0, 0, 0);
>
> This would be inherited on clone/fork and cleared on execve.
>
If we are talking about explicit specification of syscall addresses to
be trapped by Wine, the problem here is that we don't have any way of
knowing the exact addresses of syscalls to be redirected. We would need
some way to find those syscalls in the highly obfuscated dynamically
generated code, the whole purpose of which is to prevent  disassembling,
debugging and finding things like that in it. What we do know is that if
a syscall is executed from any memory which Wine allocates for Windows
application then it should be treated as Windows syscall and routed to
the Wine's dispatch function. Those code areas can be dynamically
allocated and deallocated.

If we are talking about explicit specification of syscall addresses not
to be trapped, it might be technically possible but at the moment looks
so messy so might be considered not feasible. Wine has a great number of
external dependencies. Most of them depends on some other libraries in
turn. Loading of those libraries goes out of Wine control. Linux
libraries are allowed to issue direct syscalls from their code. I am not
sure we can depend on them not doing it and always calling the same
glibc wrapper.
Matthew Wilcox May 31, 2020, 4:49 p.m. UTC | #8
On Sun, May 31, 2020 at 03:39:33PM +0300, Paul Gofman wrote:
> > Paul (cc'ed) is the wine expert, but my understanding is that memory
> > allocation and initial program load of the emulated binary will go
> > through wine.  It does the allocation and mark the vma accordingly
> > before returning the allocated range to the windows application.
> Yes, exactly. Pretty much any memory allocation which Wine does needs
> syscalls (if those are ever encountered later during executing code from
> those areas) to be trapped by Wine and passed to Wine's implementation
> of the corresponding Windows API function. Linux native libraries
> loading and memory allocations performed by them go outside of Wine control.

I don't like Gabriel's approach very much.  Could we do something like
issue a syscall before executing a Windows region and then issue another
syscall when exiting?  If so, we could switch the syscall entry point (ie
change MSR_LSTAR).  I'm thinking something like a personality() syscall.
But maybe that would be too high an overhead.
Paul Gofman May 31, 2020, 5:10 p.m. UTC | #9
On 5/31/20 19:49, Matthew Wilcox wrote:
> On Sun, May 31, 2020 at 03:39:33PM +0300, Paul Gofman wrote:
>>> Paul (cc'ed) is the wine expert, but my understanding is that memory
>>> allocation and initial program load of the emulated binary will go
>>> through wine.  It does the allocation and mark the vma accordingly
>>> before returning the allocated range to the windows application.
>> Yes, exactly. Pretty much any memory allocation which Wine does needs
>> syscalls (if those are ever encountered later during executing code from
>> those areas) to be trapped by Wine and passed to Wine's implementation
>> of the corresponding Windows API function. Linux native libraries
>> loading and memory allocations performed by them go outside of Wine control.
> I don't like Gabriel's approach very much.  Could we do something like
> issue a syscall before executing a Windows region and then issue another
> syscall when exiting?  If so, we could switch the syscall entry point (ie
> change MSR_LSTAR).  I'm thinking something like a personality() syscall.
> But maybe that would be too high an overhead.
>
IIRC Gabriel had such idea that we discussed. We can potentially track
the boundary between the Windows and native code exectution. But issuing
syscall every time we cross that boundary may have a prohibitive
performance impact, that happens way too often. What we could do is to
put the flag somewhere, but that flag has to be per thread. E. g., we
could use Linux gs: based thread local storage, or fs: based address
(that's what Windows using for thread local data and thus Wine maintains
also). If Seccomp filters could access such a memory location (fetch a
byte from there and put into the structure accessible by BPF_LD) we
could use SECCOMP_MODE_FILTER, I think.
Matthew Wilcox May 31, 2020, 5:31 p.m. UTC | #10
On Sun, May 31, 2020 at 08:10:18PM +0300, Paul Gofman wrote:
> On 5/31/20 19:49, Matthew Wilcox wrote:
> > On Sun, May 31, 2020 at 03:39:33PM +0300, Paul Gofman wrote:
> >>> Paul (cc'ed) is the wine expert, but my understanding is that memory
> >>> allocation and initial program load of the emulated binary will go
> >>> through wine.  It does the allocation and mark the vma accordingly
> >>> before returning the allocated range to the windows application.
> >> Yes, exactly. Pretty much any memory allocation which Wine does needs
> >> syscalls (if those are ever encountered later during executing code from
> >> those areas) to be trapped by Wine and passed to Wine's implementation
> >> of the corresponding Windows API function. Linux native libraries
> >> loading and memory allocations performed by them go outside of Wine control.
> > I don't like Gabriel's approach very much.  Could we do something like
> > issue a syscall before executing a Windows region and then issue another
> > syscall when exiting?  If so, we could switch the syscall entry point (ie
> > change MSR_LSTAR).  I'm thinking something like a personality() syscall.
> > But maybe that would be too high an overhead.
>
> IIRC Gabriel had such idea that we discussed. We can potentially track
> the boundary between the Windows and native code exectution. But issuing
> syscall every time we cross that boundary may have a prohibitive
> performance impact, that happens way too often. What we could do is to
> put the flag somewhere, but that flag has to be per thread. E. g., we
> could use Linux gs: based thread local storage, or fs: based address
> (that's what Windows using for thread local data and thus Wine maintains
> also). If Seccomp filters could access such a memory location (fetch a
> byte from there and put into the structure accessible by BPF_LD) we
> could use SECCOMP_MODE_FILTER, I think.

If it's the cost of the syscall that's the problem, there are ways
around that.  We'd still want a personality() call to indicate that
the syscall handler should look (somewhere) to determine the current
personality, but that could be issued at the start of execution rather
than when we switch between Windows & Linux code.
Paul Gofman May 31, 2020, 6:01 p.m. UTC | #11
On 5/31/20 20:31, Matthew Wilcox wrote:
> If it's the cost of the syscall that's the problem, there are ways
> around that.  We'd still want a personality() call to indicate that
> the syscall handler should look (somewhere) to determine the current
> personality, but that could be issued at the start of execution rather
> than when we switch between Windows & Linux code.

Sure, we can call personality() at start and specify the location to
look at, the only thing is that the location should be thread specific,
that is, based on fs: or gs: or whatever else which would allow us to
have different threads in different "personality" state. If anything
needs to be set up at thread start we can do that also of course.

If there will be any proof of concept solution I will be happy to make a
proof of concept Wine patch using that and do some testing.
Andy Lutomirski May 31, 2020, 6:10 p.m. UTC | #12
On Sun, May 31, 2020 at 5:56 AM Paul Gofman <gofmanp@gmail.com> wrote:
>
> On 5/31/20 03:59, Andy Lutomirski wrote:
> >
> > I’m suggesting that the kernel learn how to help you, maybe like this:
> >
> > prctl(PR_SET_SYSCALL_THUNK, target, address_of_unredirected_syscall, 0, 0, 0, 0);
> >
> > This would be inherited on clone/fork and cleared on execve.
> >
> If we are talking about explicit specification of syscall addresses to
> be trapped by Wine, the problem here is that we don't have any way of
> knowing the exact addresses of syscalls to be redirected. We would need
> some way to find those syscalls in the highly obfuscated dynamically
> generated code, the whole purpose of which is to prevent  disassembling,
> debugging and finding things like that in it. What we do know is that if
> a syscall is executed from any memory which Wine allocates for Windows
> application then it should be treated as Windows syscall and routed to
> the Wine's dispatch function. Those code areas can be dynamically
> allocated and deallocated.

That's not what I meant.  I meant that you would set the kernel up to
redirect *all* syscalls from the thread with the sole exception of one
syscall instruction in the thunk.  This would catch Windows syscalls
and Linux syscalls.  The thunk would determine whether the original
syscall was Linux or Windows and handle it accordingly.

This may interact poorly with the DRM scheme.  The redzone might need
to be respected, or stack switching might be needed.
Paul Gofman May 31, 2020, 6:36 p.m. UTC | #13
On 5/31/20 21:10, Andy Lutomirski wrote:
>
> That's not what I meant.  I meant that you would set the kernel up to
> redirect *all* syscalls from the thread with the sole exception of one
> syscall instruction in the thunk.  This would catch Windows syscalls
> and Linux syscalls.  The thunk would determine whether the original
> syscall was Linux or Windows and handle it accordingly.
>
> This may interact poorly with the DRM scheme.  The redzone might need
> to be respected, or stack switching might be needed.

Oh yeah, I see now, thanks. Sure, we could trap every syscall and have a
Seccomp-allowed trampoline for executing native ones with the existing
Seccomp implementation. But this is going to have prohibitive
performance impact. Our present use case specifics is that vast majority
of syscalls do not need to be emulated, they are native. And just a few
go from the Windows application which we need to trap and route to our
handler to let the program continue, while we do not care too much about
the overhead for those few. So the hope was that the kernel can route
that majority of Linux native syscalls inside with the minor overhead.
I've read the suggestion to use SECCOMP_RET_USER_NOTIF instead of
SECCOMP_RET_TRAP, is handling the trap this way supposed to be much
quicker than handling the sigsys from SECCOMP_RET_TRAP? More
specifically, would not SECCOMP_RET_USER_NOTIF effectively serialize all
the syscalls waiting in a single queue for processing, while
SECCOMP_RET_TRAP can be processed without exclusive locking?
Andy Lutomirski May 31, 2020, 6:57 p.m. UTC | #14
On Sun, May 31, 2020 at 11:36 AM Paul Gofman <gofmanp@gmail.com> wrote:
>
> On 5/31/20 21:10, Andy Lutomirski wrote:
> >
> > That's not what I meant.  I meant that you would set the kernel up to
> > redirect *all* syscalls from the thread with the sole exception of one
> > syscall instruction in the thunk.  This would catch Windows syscalls
> > and Linux syscalls.  The thunk would determine whether the original
> > syscall was Linux or Windows and handle it accordingly.
> >
> > This may interact poorly with the DRM scheme.  The redzone might need
> > to be respected, or stack switching might be needed.
>
> Oh yeah, I see now, thanks. Sure, we could trap every syscall and have a
> Seccomp-allowed trampoline for executing native ones with the existing
> Seccomp implementation. But this is going to have prohibitive
> performance impact. Our present use case specifics is that vast majority
> of syscalls do not need to be emulated, they are native. And just a few
> go from the Windows application which we need to trap and route to our
> handler to let the program continue, while we do not care too much about
> the overhead for those few. So the hope was that the kernel can route
> that majority of Linux native syscalls inside with the minor overhead.
> I've read the suggestion to use SECCOMP_RET_USER_NOTIF instead of
> SECCOMP_RET_TRAP, is handling the trap this way supposed to be much
> quicker than handling the sigsys from SECCOMP_RET_TRAP? More
> specifically, would not SECCOMP_RET_USER_NOTIF effectively serialize all
> the syscalls waiting in a single queue for processing, while
> SECCOMP_RET_TRAP can be processed without exclusive locking?
>
>

Using SECCOMP_RET_USER_NOTIF is likely to be considerably more
expensive than my scheme.  On a non-PTI system, my approach will add a
few tens of ns to each syscall.  On a PTI system, it will be worse.
But using any kind of notifier for all syscalls will cause a context
switch to a different user program for each syscall, and that will be
much slower.

I think that the implementation may well want to live in seccomp, but
doing this as a seccomp filter isn't quite right.  It's not a security
thing -- it's an emulation thing.  Seccomp is all about making
inescapable sandboxes, but that's not what you're doing at all, and
the fact that seccomp filters are preserved across execve() sounds
like it'll be annoying for you.

What if there was a special filter type that ran a BPF program on each
syscall, and the program was allowed to access user memory to make its
decisions, e.g. to look at some list of memory addresses.  But this
would explicitly *not* be a security feature -- execve() would remove
the filter, and the filter's outcome would be one of redirecting
execution or allowing the syscall.  If the "allow" outcome occurs,
then regular seccomp filters run.  Obviously the exact semantics here
would need some care.
Paul Gofman May 31, 2020, 7:37 p.m. UTC | #15
On 5/31/20 21:57, Andy Lutomirski wrote:
>
> I think that the implementation may well want to live in seccomp, but
> doing this as a seccomp filter isn't quite right.  It's not a security
> thing -- it's an emulation thing.  Seccomp is all about making
> inescapable sandboxes, but that's not what you're doing at all, and
> the fact that seccomp filters are preserved across execve() sounds
> like it'll be annoying for you.

Yes, sure, preserving those filters (more broadly, lack the ability to
change them any time in an arbitrary way) is the major problem
preventing us from using seccomp filters as is for a generic solution.
If not that, growing the table too much (which might be the case if we
mark all the denied address ranges there) may potentially be a
performance problem, but not necessarily, that's something to be tested.

>
> What if there was a special filter type that ran a BPF program on each
> syscall, and the program was allowed to access user memory to make its
> decisions, e.g. to look at some list of memory addresses.  But this
> would explicitly *not* be a security feature -- execve() would remove
> the filter, and the filter's outcome would be one of redirecting
> execution or allowing the syscall.  If the "allow" outcome occurs,
> then regular seccomp filters run.  Obviously the exact semantics here
> would need some care.

Yes, absolutely, we are not implementing any sandboxing in Wine and are
not seeing this as a security feature.

Is the approach discussed in another branch of this thread [1] is some
way similar to what you suggest? If instead of the list of memory
addresses we can use some single flag which we can set by thread when
crossing Windows program / native boundary, we won't have to grow the
lookup table indefinitely. Otherwise I am afraid the list of addresses
might be growing big, but I don't have reasons to think it necessarily
won't work, that's something we could evaluate further and also test
performance given some brief proof of concept implementation.


1. https://lkml.org/lkml/2020/5/31/199
Andy Lutomirski May 31, 2020, 9:03 p.m. UTC | #16
On Sun, May 31, 2020 at 11:57 AM Andy Lutomirski <luto@kernel.org> wrote:
>
>
> What if there was a special filter type that ran a BPF program on each
> syscall, and the program was allowed to access user memory to make its
> decisions, e.g. to look at some list of memory addresses.  But this
> would explicitly *not* be a security feature -- execve() would remove
> the filter, and the filter's outcome would be one of redirecting
> execution or allowing the syscall.  If the "allow" outcome occurs,
> then regular seccomp filters run.  Obviously the exact semantics here
> would need some care.

Let me try to flesh this out a little.

A task could install a syscall emulation filter (maybe using the
seccomp() syscall, maybe using something else).  There would be at
most one such filter per process.  Upon doing a syscall, the kernel
will first do initial syscall fixups (e.g. SYSENTER/SYSCALL32 magic
argument translation) and would then invoke the filter.  The filter is
an eBPF program (sorry Kees) and, as input, it gets access to the
task's register state and to an indication of which type of syscall
entry this was.  This will inherently be rather architecture specific
-- x86 choices could be int80, int80(translated), and syscall64.  (We
could expose SYSCALL32 separately, I suppose, but SYSENTER is such a
mess that I'm not sure this would be productive.)  The program can
access user memory, and it returns one of two results: allow the
syscall or send SIGSYS.  If the program tries to access user memory
and faults, the result is SIGSYS.

(I would love to do this with cBPF, but I'm not sure how to pull this
off.  Accessing user memory is handy for making the lookup flexible
enough to detect Windows vs Linux.  It would be *really* nice to
finally settle the unprivileged eBPF subset discussion so that we can
figure out how to make eBPF work here.)

execve() clears the filter.  clone() copies the filter.

Does this seem reasonable?  Is the implementation complexity small
enough?  Is the eBPF thing going to be a showstopper?

Using a signal instead of a bespoke thunk simplifies a lot of thorny
details but is also enough slower that catching all syscalls might be
a performance problem.
Brendan Shanks May 31, 2020, 11:33 p.m. UTC | #17
> On May 31, 2020, at 11:57 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> Using SECCOMP_RET_USER_NOTIF is likely to be considerably more
> expensive than my scheme.  On a non-PTI system, my approach will add a
> few tens of ns to each syscall.  On a PTI system, it will be worse.
> But using any kind of notifier for all syscalls will cause a context
> switch to a different user program for each syscall, and that will be
> much slower.

There’s also no way (at least to my understanding) to modify register state from SECCOMP_RET_USER_NOTIF, which is how the existing -staging SIGSYS handler works:

<https://github.com/wine-staging/wine-staging/blob/master/patches/ntdll-Syscall_Emulation/0001-ntdll-Support-x86_64-syscall-emulation.patch#L62>

> I think that the implementation may well want to live in seccomp, but
> doing this as a seccomp filter isn't quite right.  It's not a security
> thing -- it's an emulation thing.  Seccomp is all about making
> inescapable sandboxes, but that's not what you're doing at all, and
> the fact that seccomp filters are preserved across execve() sounds
> like it'll be annoying for you.

Definitely. Regardless of what approach is taken, we don’t want it to persist across execve.

> What if there was a special filter type that ran a BPF program on each
> syscall, and the program was allowed to access user memory to make its
> decisions, e.g. to look at some list of memory addresses.  But this
> would explicitly *not* be a security feature -- execve() would remove
> the filter, and the filter's outcome would be one of redirecting
> execution or allowing the syscall.  If the "allow" outcome occurs,
> then regular seccomp filters run.  Obviously the exact semantics here
> would need some care.

Although if that’s running a BPF filter on every syscall, wouldn’t it also incur the ~10% overhead that Paul and Gabriel have seen with existing seccomp?


Brendan Shanks
CodeWeavers
Andy Lutomirski June 1, 2020, 1:51 a.m. UTC | #18
> On May 31, 2020, at 4:50 PM, Brendan Shanks <bshanks@codeweavers.com> wrote:
> 
> 
>> On May 31, 2020, at 11:57 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> 
>> Using SECCOMP_RET_USER_NOTIF is likely to be considerably more
>> expensive than my scheme.  On a non-PTI system, my approach will add a
>> few tens of ns to each syscall.  On a PTI system, it will be worse.
>> But using any kind of notifier for all syscalls will cause a context
>> switch to a different user program for each syscall, and that will be
>> much slower.
> 
> There’s also no way (at least to my understanding) to modify register state from SECCOMP_RET_USER_NOTIF, which is how the existing -staging SIGSYS handler works:
> 
> <https://github.com/wine-staging/wine-staging/blob/master/patches/ntdll-Syscall_Emulation/0001-ntdll-Support-x86_64-syscall-emulation.patch#L62>
> 
>> I think that the implementation may well want to live in seccomp, but
>> doing this as a seccomp filter isn't quite right.  It's not a security
>> thing -- it's an emulation thing.  Seccomp is all about making
>> inescapable sandboxes, but that's not what you're doing at all, and
>> the fact that seccomp filters are preserved across execve() sounds
>> like it'll be annoying for you.
> 
> Definitely. Regardless of what approach is taken, we don’t want it to persist across execve.
> 
>> What if there was a special filter type that ran a BPF program on each
>> syscall, and the program was allowed to access user memory to make its
>> decisions, e.g. to look at some list of memory addresses.  But this
>> would explicitly *not* be a security feature -- execve() would remove
>> the filter, and the filter's outcome would be one of redirecting
>> execution or allowing the syscall.  If the "allow" outcome occurs,
>> then regular seccomp filters run.  Obviously the exact semantics here
>> would need some care.
> 
> Although if that’s running a BPF filter on every syscall, wouldn’t it also incur the ~10% overhead that Paul and Gabriel have seen with existing seccomp?
> 
> 

Unlikely. Some benchmarking is needed, but the seccomp ptrace overhead is likely *huge* compared to the overhead of just a filter.

As wild guess numbers on made up modern hardware, cache hot:

Empty syscall: 50ns, or 300ns with PTI

Empty syscall accepted by simple seccomp filter: 10ns more than an empty syscall without seccomp

Seccomp ptrace round trip: 6 us  Worse with PTI

Seccomp user notif round trip: 4 us

Syscall hypothetically redirected back to same process: about the same as an empty filtered accepted syscall, plus however long it takes to run the handler. Add 900ns if using SIGSYS instead of plain redirection. Add an extra 500ns on current kernels because signal delivery sucks, but I can fix this.

Take these numbers with a huge grain of salt.  But the point is that the BPF part is the least of your worries.
Gabriel Krisman Bertazi June 1, 2020, 5:53 p.m. UTC | #19
Matthew Wilcox <willy@infradead.org> writes:

> On Sun, May 31, 2020 at 03:39:33PM +0300, Paul Gofman wrote:
>> > Paul (cc'ed) is the wine expert, but my understanding is that memory
>> > allocation and initial program load of the emulated binary will go
>> > through wine.  It does the allocation and mark the vma accordingly
>> > before returning the allocated range to the windows application.
>> Yes, exactly. Pretty much any memory allocation which Wine does needs
>> syscalls (if those are ever encountered later during executing code from
>> those areas) to be trapped by Wine and passed to Wine's implementation
>> of the corresponding Windows API function. Linux native libraries
>> loading and memory allocations performed by them go outside of Wine control.
>
> I don't like Gabriel's approach very much.  Could we do something like

Hi Matthew,

I don't oppose your suggestion, as Paul said, it should be enough for
us.  But could you elaborate on the problems you see in the original
approach, even if only for my own education?

> issue a syscall before executing a Windows region and then issue another
> syscall when exiting?  If so, we could switch the syscall entry point (ie
> change MSR_LSTAR).  I'm thinking something like a personality() syscall.
> But maybe that would be too high an overhead.
>
Gabriel Krisman Bertazi June 1, 2020, 5:54 p.m. UTC | #20
Paul Gofman <gofmanp@gmail.com> writes:

> On 5/31/20 20:31, Matthew Wilcox wrote:
>> If it's the cost of the syscall that's the problem, there are ways
>> around that.  We'd still want a personality() call to indicate that
>> the syscall handler should look (somewhere) to determine the current
>> personality, but that could be issued at the start of execution rather
>> than when we switch between Windows & Linux code.
>
> Sure, we can call personality() at start and specify the location to
> look at, the only thing is that the location should be thread specific,
> that is, based on fs: or gs: or whatever else which would allow us to
> have different threads in different "personality" state. If anything
> needs to be set up at thread start we can do that also of course.
>
> If there will be any proof of concept solution I will be happy to make a
> proof of concept Wine patch using that and do some testing.

Let me give that a try and share the patches with you, so we can look at
how this implementation would look like.
Gabriel Krisman Bertazi June 1, 2020, 6:06 p.m. UTC | #21
Andy Lutomirski <luto@kernel.org> writes:

> On Sun, May 31, 2020 at 11:57 AM Andy Lutomirski <luto@kernel.org> wrote:
>>
>>
>> What if there was a special filter type that ran a BPF program on each
>> syscall, and the program was allowed to access user memory to make its
>> decisions, e.g. to look at some list of memory addresses.  But this
>> would explicitly *not* be a security feature -- execve() would remove
>> the filter, and the filter's outcome would be one of redirecting
>> execution or allowing the syscall.  If the "allow" outcome occurs,
>> then regular seccomp filters run.  Obviously the exact semantics here
>> would need some care.
>
> Let me try to flesh this out a little.
>
> A task could install a syscall emulation filter (maybe using the
> seccomp() syscall, maybe using something else).  There would be at
> most one such filter per process.  Upon doing a syscall, the kernel
> will first do initial syscall fixups (e.g. SYSENTER/SYSCALL32 magic
> argument translation) and would then invoke the filter.  The filter is
> an eBPF program (sorry Kees) and, as input, it gets access to the
> task's register state and to an indication of which type of syscall
> entry this was.  This will inherently be rather architecture specific
> -- x86 choices could be int80, int80(translated), and syscall64.  (We
> could expose SYSCALL32 separately, I suppose, but SYSENTER is such a
> mess that I'm not sure this would be productive.)  The program can
> access user memory, and it returns one of two results: allow the
> syscall or send SIGSYS.  If the program tries to access user memory
> and faults, the result is SIGSYS.
>
> (I would love to do this with cBPF, but I'm not sure how to pull this
> off.  Accessing user memory is handy for making the lookup flexible
> enough to detect Windows vs Linux.  It would be *really* nice to
> finally settle the unprivileged eBPF subset discussion so that we can
> figure out how to make eBPF work here.)
>
> execve() clears the filter.  clone() copies the filter.
>
> Does this seem reasonable?  Is the implementation complexity small
> enough?  Is the eBPF thing going to be a showstopper?
>
> Using a signal instead of a bespoke thunk simplifies a lot of thorny
> details but is also enough slower that catching all syscalls might be
> a performance problem.

If we can have something close to the numbers you shared, it seems to be
good for us.  Using the thunk instead of a signal seems very interesting
for performance.

Though, I'm not convinced about this not being part of seccomp just
because it is not security.  The suggestion from Kees to convert
seccomp to eBPF filters and stack them would provide similar semantics
and reuse the infrastructure.

Finnaly, as you said, I'm afraid that eBPF will be a show stopper,
unless unpriviledged eBPF becomes a thing. Wine cannot count on
CAP_SYS_ADMIN.
Kees Cook June 1, 2020, 8:08 p.m. UTC | #22
On Sun, May 31, 2020 at 02:03:48PM -0700, Andy Lutomirski wrote:
> On Sun, May 31, 2020 at 11:57 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> >
> > What if there was a special filter type that ran a BPF program on each
> > syscall, and the program was allowed to access user memory to make its
> > decisions, e.g. to look at some list of memory addresses.  But this
> > would explicitly *not* be a security feature -- execve() would remove
> > the filter, and the filter's outcome would be one of redirecting
> > execution or allowing the syscall.  If the "allow" outcome occurs,
> > then regular seccomp filters run.  Obviously the exact semantics here
> > would need some care.
> 
> Let me try to flesh this out a little.
> 
> A task could install a syscall emulation filter (maybe using the
> seccomp() syscall, maybe using something else).  There would be at
> most one such filter per process.  Upon doing a syscall, the kernel
> will first do initial syscall fixups (e.g. SYSENTER/SYSCALL32 magic
> argument translation) and would then invoke the filter.  The filter is
> an eBPF program (sorry Kees) and, as input, it gets access to the

FWIW, I agree: something like this needs to use eBPF -- this isn't
being designed as a security boundary. It's more like eBPF ptrace.

> task's register state and to an indication of which type of syscall
> entry this was.  This will inherently be rather architecture specific
> -- x86 choices could be int80, int80(translated), and syscall64.  (We
> could expose SYSCALL32 separately, I suppose, but SYSENTER is such a
> mess that I'm not sure this would be productive.)  The program can
> access user memory, and it returns one of two results: allow the
> syscall or send SIGSYS.  If the program tries to access user memory
> and faults, the result is SIGSYS.
> 
> (I would love to do this with cBPF, but I'm not sure how to pull this
> off.  Accessing user memory is handy for making the lookup flexible
> enough to detect Windows vs Linux.  It would be *really* nice to
> finally settle the unprivileged eBPF subset discussion so that we can
> figure out how to make eBPF work here.)

And yes, this is the next road-block: finding a way to safely do
unprivileged eBPF.
Andy Lutomirski June 1, 2020, 11:18 p.m. UTC | #23
On Mon, Jun 1, 2020 at 1:08 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Sun, May 31, 2020 at 02:03:48PM -0700, Andy Lutomirski wrote:
> > On Sun, May 31, 2020 at 11:57 AM Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > >
> > > What if there was a special filter type that ran a BPF program on each
> > > syscall, and the program was allowed to access user memory to make its
> > > decisions, e.g. to look at some list of memory addresses.  But this
> > > would explicitly *not* be a security feature -- execve() would remove
> > > the filter, and the filter's outcome would be one of redirecting
> > > execution or allowing the syscall.  If the "allow" outcome occurs,
> > > then regular seccomp filters run.  Obviously the exact semantics here
> > > would need some care.
> >
> > Let me try to flesh this out a little.
> >
> > A task could install a syscall emulation filter (maybe using the
> > seccomp() syscall, maybe using something else).  There would be at
> > most one such filter per process.  Upon doing a syscall, the kernel
> > will first do initial syscall fixups (e.g. SYSENTER/SYSCALL32 magic
> > argument translation) and would then invoke the filter.  The filter is
> > an eBPF program (sorry Kees) and, as input, it gets access to the
>
> FWIW, I agree: something like this needs to use eBPF -- this isn't
> being designed as a security boundary. It's more like eBPF ptrace.

On a bit more consideration, I think that I have the model a bit
wrong.  We shouldn't think of this as a *syscall* filter but as a
filter for architectural privilege transitions in general.  After all,
there is no particular guarantee that any given emulated program has a
syscall ABI that is even remotely compatible with Linux.  So maybe the
filter is fed events like SYSCALL64, SYSCALL32, SYSENTER, #GP, #PF
(the bad kind that would otherwise get a signal), #UD, etc.  And the
filter can examine process state and take some reasonable action.
Think if it as a personality scheme that's programmable by user code.
I imagine that even schemes like NaCl could make some use of this.

This allows all kinds of interesting things.  For example, it should
give Wine a much nicer emulation of Windows SEH and vectored signals.
And maybe it could finally allow Linux userspace to have some sensible
equivalent of those Windows features -- being able to write library
code that could sanely handle, say, math errors would be quite handy.

This could be mocked up with cBPF, but I think a cBPF version will
struggle to be a performant solution for Wine because it will have a
hard time distinguishing between Windows and Linux syscalls.

--Andy
Sargun Dhillon June 5, 2020, 6:06 a.m. UTC | #24
On Fri, May 29, 2020 at 11:01 PM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> Modern Windows applications are executing system call instructions
> directly from the application's code without going through the WinAPI.
> This breaks Wine emulation, because it doesn't have a chance to
> intercept and emulate these syscalls before they are submitted to Linux.
>
> In addition, we cannot simply trap every system call of the application
> to userspace using PTRACE_SYSEMU, because performance would suffer,
> since our main use case is to run Windows games over Linux.  Therefore,
> we need some in-kernel filtering to decide whether the syscall was
> issued by the wine code or by the windows application.
>
> The filtering cannot really be done based solely on the syscall number,
> because those could collide with existing Linux syscalls.  Instead, our
> proposed solution is to trap syscalls based on the userspace memory
> region that triggered the syscall, as wine is responsible for the
> Windows code allocations and it can apply correct memory protections to
> those areas.
>
> Therefore, this patch reuses the seccomp infrastructure to trap
> system calls, but introduces a new mode to trap based on a vma attribute
> that describes whether the userspace memory region is allowed to execute
> syscalls or not.  The protection is defined at mmap/mprotect time with a
> new protection flag PROT_NOSYSCALL.  This setting only takes effect if
> the new SECCOMP_MODE_MEMMAP is enabled through seccomp().
>
> It goes without saying that this is in no way a security mechanism
> despite being built on top of seccomp, since an evil application can
> always jump to a whitelisted memory region and run the syscall.  This
> is not a concern for Wine games.  Nevertheless, we reuse seccomp as a
> way to avoid adding a new mechanism to essentially do the same job of
> filtering system calls.
>
> * Why not SECCOMP_MODE_FILTER?
>
> We experimented with dynamically generating BPF filters for whitelisted
> memory regions and using SECCOMP_MODE_FILTER, but there are a few
> reasons why it isn't enough nor a good idea for our use case:
>
> 1. We cannot set the filters at program initialization time and forget
> about it, since there is no way of knowing which modules will be loaded,
> whether native and windows.  Filter would need a way to be updated
> frequently during game execution.
>
> 2. We cannot predict which Linux libraries will issue syscalls directly.
> Most of the time, whitelisting libc and a few other libraries is enough,
> but there are no guarantees other Linux libraries won't issue syscalls
> directly and break the execution.  Adding every linux library that is
> loaded also has a large performance cost due to the large resulting
> filter.
>
> 3. As I mentioned before, performance is critical.  In our testing with
> just a single memory segment blacklisted/whitelisted, the minimum size
> of a bpf filter would be 4 instructions.  In that scenario,
> SECCOMP_MODE_FILTER added an average overhead of 10% to the execution
> time of sysinfo(2) in comparison to seccomp disabled, while the impact
> of SECCOMP_MODE_MEMMAP was averaged around 1.5%.
>
> Indeed, points 1 and 2 could be worked around with some userspace work
> and improved SECCOMP_MODE_FILTER support, but at a high performance and
> some stability cost, to obtain the semantics we want.  Still, the
> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
> enough that I believe it should be considered as an upstream solution.
>
> Sending as an RFC for now to get the discussion started.  In particular:
I have a totally different question. I am experimenting with a
patchset which is designed
to help with the "extended syscall" case (as Kees calls it).
Effectively syscalls like openat2,
where the syscall arguments are passed as a (potentially mixed size)
structure need to be
able to be inspected through user notif. `We can kind-of deal with
this with other syscalls
with mechanisms like pidfd_getfd, addfd, and potentially being able to
(re)set the registers
prior to actual invocation of the syscall. Unfortunately, you cannot
do the same trick with
user memory, because it opens you up to a time-of-check, time-of-use
attack, since the
kernel copies the syscall arguments from the invoking program again.

One of the things I've been experimenting with is using tricks like
userfaultfd / mprotect to
try to deal with this. I think that I might have to add some
capability to the kernel to actually
deal with this. In general, the approach is:
1. Syscall is invoked, and wakes up the manager
2. The manager gets the arguments, and a handle (either the ID, or an
FD). It then uses this
ID to read memory. Either something like process_vm_readv, an ioctl, or read.
3. When the kernel reads these arguments, it splits the VMA for the
address the pointer
lies in, and sets up access() with a special mapping that checks if
the page has been
tampered with by userspace in the read ranges between the manager read
and the writes.
We can either SIGBUS or stall writes to the range if we want to make
things "simple",
or we can mess with uaccess bits and EPERM if the kernel tries to read
that memory.
4. When the syscall returns, or the kernel writes to that area, we
reset the mapping.

I'm wondering if you're dynamically generating these special mappings
with protection,
and how many of them you're generating. How often are you generating them? What
kind of performance cost do you see in normal programs?
Gabriel Krisman Bertazi June 11, 2020, 7:38 p.m. UTC | #25
Andy Lutomirski <luto@kernel.org> writes:

> On Sun, May 31, 2020 at 11:57 AM Andy Lutomirski <luto@kernel.org> wrote:
>>
>>
>> What if there was a special filter type that ran a BPF program on each
>> syscall, and the program was allowed to access user memory to make its
>> decisions, e.g. to look at some list of memory addresses.  But this
>> would explicitly *not* be a security feature -- execve() would remove
>> the filter, and the filter's outcome would be one of redirecting
>> execution or allowing the syscall.  If the "allow" outcome occurs,
>> then regular seccomp filters run.  Obviously the exact semantics here
>> would need some care.
>
> Let me try to flesh this out a little.
>
> A task could install a syscall emulation filter (maybe using the
> seccomp() syscall, maybe using something else).  There would be at
> most one such filter per process.  Upon doing a syscall, the kernel
> will first do initial syscall fixups (e.g. SYSENTER/SYSCALL32 magic
> argument translation) and would then invoke the filter.  The filter is
> an eBPF program (sorry Kees) and, as input, it gets access to the
> task's register state and to an indication of which type of syscall
> entry this was.  This will inherently be rather architecture specific
> -- x86 choices could be int80, int80(translated), and syscall64.  (We
> could expose SYSCALL32 separately, I suppose, but SYSENTER is such a
> mess that I'm not sure this would be productive.)  The program can
> access user memory, and it returns one of two results: allow the
> syscall or send SIGSYS.  If the program tries to access user memory
> and faults, the result is SIGSYS.
>
> (I would love to do this with cBPF, but I'm not sure how to pull this
> off.  Accessing user memory is handy for making the lookup flexible
> enough to detect Windows vs Linux.  It would be *really* nice to
> finally settle the unprivileged eBPF subset discussion so that we can
> figure out how to make eBPF work here.)
>
> execve() clears the filter.  clone() copies the filter.
>
> Does this seem reasonable?  Is the implementation complexity small
> enough?  Is the eBPF thing going to be a showstopper?
>
> Using a signal instead of a bespoke thunk simplifies a lot of thorny
> details but is also enough slower that catching all syscalls might be
> a performance problem.

As far as I understand, the eBPF event emulation filter would be a
future-proof mechanism to simplify a lot of wine operations, so, fwiw it
seems an ideal mechanism for us.

If I understand correctly, though, the thunk idea is orthogonal to the
filter itself.  It could be used by the filter in the future
instead of SIGSYS, or we could support it as a sole mechanism to capture
everything.

As a first step, which would solve the existing Wine problem, would you
be open to the sole thunk support implementation, like you proposed?
This would allow us to move forward with this solution without waiting
for unprotected eBPF, and will still be useful in the future once we
get the entire eBPF filter model.

The interface could look like what you proposed, a prctl:

  prctl(PR_SET_SYSCALL_THUNK, target, address_of_unredirected_syscall, 0, 0, 0, 0);

This would immediately solve our use-case, and a few others. For
instance, I believe libsyscall-intercept could start to use that too.

I can explore a bit more, it could be combined to a personality that
enable/disable the thunk on wine entry points, if we notice that
filtering all syscalls in userspace is bad, but that is a second step
that might be unnecessary.
Robert O'Callahan June 25, 2020, 11:14 p.m. UTC | #26
rr (https://rr-project.org, https://arxiv.org/abs/1705.05937) grapples
with a similar problem. We need to intercept commonly-executed system
calls and wrap them with our own processing, with minimal overhead. I
think our basic approach might work for Wine without kernel changes.

We use SECCOMP_SET_MODE_FILTER with a simple filter that returns
SECCOMP_RET_TRAP on all syscalls except for those called from a single
specific trampoline page (which get SECCOMP_RET_ALLOW). rr ptraces its
children. So, when user-space makes a syscall, the seccomp filter
triggers a ptrace trap. The ptracer looks at the code around the
syscall and if it matches certain common patterns, the ptracer patches
the code with a jump to a stub that does extra work and issues a real
syscall via the trampoline. Thus, each library syscall instruction is
slow the first time and fast every subsequent time. "Weird" syscalls
that the ptracer chooses not to patch do incur the context-switch
penalty every time so their overhead does increase a lot ... but it
sounds like that might be OK in Wine's case?

A more efficient variant of this approach which would work in some
cases (but maybe not Wine?) would be to avoid using a ptracer and give
the process a SIGSYS handler which does the patching.

Rob
Gabriel Krisman Bertazi June 25, 2020, 11:48 p.m. UTC | #27
"Robert O'Callahan" <robert@ocallahan.org> writes:

> rr (https://rr-project.org, https://arxiv.org/abs/1705.05937) grapples
> with a similar problem. We need to intercept commonly-executed system
> calls and wrap them with our own processing, with minimal overhead. I
> think our basic approach might work for Wine without kernel changes.
>
> We use SECCOMP_SET_MODE_FILTER with a simple filter that returns
> SECCOMP_RET_TRAP on all syscalls except for those called from a single
> specific trampoline page (which get SECCOMP_RET_ALLOW). rr ptraces its
> children. So, when user-space makes a syscall, the seccomp filter
> triggers a ptrace trap. The ptracer looks at the code around the
> syscall and if it matches certain common patterns, the ptracer patches
> the code with a jump to a stub that does extra work and issues a real
> syscall via the trampoline. Thus, each library syscall instruction is
> slow the first time and fast every subsequent time. "Weird" syscalls
> that the ptracer chooses not to patch do incur the context-switch
> penalty every time so their overhead does increase a lot ... but it
> sounds like that might be OK in Wine's case?
>
> A more efficient variant of this approach which would work in some
> cases (but maybe not Wine?) would be to avoid using a ptracer and give
> the process a SIGSYS handler which does the patching.

We couldn't patch Windows code because of the aforementioned DRM and
anti-cheat mechanisms, but I suppose this limitation doesn't apply to
Wine/native code, and if this assumption is correct, this approach could
work.

One complexity might be the consistent model for the syscall live
patching.  I don't know how much of the problem is diminished from the
original userspace live-patching problem, but I believe at least part of
it applies.  And fencing every thread to patch would kill performance.
Also, we cannot just patch everything at the beginning.  How does rr
handle that?

Another problem is that we will want to support i386 and other
architectures.  For int 0x80, it is trickier to encode a branch to
another region, given the limited instruction space, and the patching
might not be possible in hot paths.  I did port libsyscall-intercept to
x86-32 once and I could correctly patch glibc, but it's not guaranteed
that an updated libc or something else won't break it.

I'm not sure the benefit of not needing enhanced kernel support
justifies the complexity and performance cost required to make this work
reliably, in particular since the semantics for a kernel implementation
that we are discussing doesn't seem overly intrusive and might have
other applications like in the generic filter Andy mentioned.
Robert O'Callahan June 26, 2020, 1:03 a.m. UTC | #28
On Fri, Jun 26, 2020 at 11:49 AM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
> We couldn't patch Windows code because of the aforementioned DRM and
> anti-cheat mechanisms, but I suppose this limitation doesn't apply to
> Wine/native code, and if this assumption is correct, this approach could
> work.
>
> One complexity might be the consistent model for the syscall live
> patching.  I don't know how much of the problem is diminished from the
> original userspace live-patching problem, but I believe at least part of
> it applies.  And fencing every thread to patch would kill performance.
> Also, we cannot just patch everything at the beginning.  How does rr
> handle that?

That's a good point. rr only allows one tracee thread to run at a time
for other reasons, so when we consider patching a syscall instruction,
we inspect all thread states to see if the patch would interfere with
any other thread, and skip patching it in that unlikely case. (We'll
try to patch it again next time that instruction is executed.) Wine
would need some other solution, but indeed that could be a
showstopper.

> Another problem is that we will want to support i386 and other
> architectures.  For int 0x80, it is trickier to encode a branch to
> another region, given the limited instruction space, and the patching
> might not be possible in hot paths.

This is no worse than for x86-64 `syscall`, which is also two bytes.
We have pretty much always been able to patch the frequently executed
syscalls by replacing both the syscall instruction and an instruction
before or after the syscall with a five-byte jump, and folding the
extra instruction into the stub.

>I did port libsyscall-intercept to
> x86-32 once and I could correctly patch glibc, but it's not guaranteed
> that an updated libc or something else won't break it.

We haven't found this to be much of a problem in rr. From time to time
we have to add new patch patterns. The good news is that if things
change so a syscall can't be patched, the result is degraded
performance, not functional breakage.

> I'm not sure the benefit of not needing enhanced kernel support
> justifies the complexity and performance cost required to make this work
> reliably, in particular since the semantics for a kernel implementation
> that we are discussing doesn't seem overly intrusive and might have
> other applications like in the generic filter Andy mentioned.

That's fair --- our solution is complex. (But worth it --- for us,
it's valuable that rr works on quite old Linux kernels.)

As for performance, it performs well for us. I think we'd prefer our
current approach to Andy's hypothetical PR_SET_SYSCALL_THUNK because
the latter would have higher overhead (two trips into the kernel per
syscall). We definitely don't want to require CAP_SYS_ADMIN so that
rules out any eBPF-based alternative too. I would love to see a
low-overhead unprivileged syscall interception mechanism that would
obsolete our patching approach --- preferably one that's also
stackable so rr can record and replay processes that use the new
mechanism --- but I don't think any of the proposals here are that,
yet, unfortunately.

Rob
diff mbox series

Patch

diff --git a/arch/Kconfig b/arch/Kconfig
index 786a85d4ad40..e5c10231c9c2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -471,6 +471,27 @@  config SECCOMP_FILTER
 
 	  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+config HAVE_ARCH_SECCOMP_MEMMAP
+	bool
+	help
+	  An arch should select this symbol if it provides all of these things:
+	  - syscall_get_arch()
+	  - syscall_get_arguments()
+	  - syscall_rollback()
+	  - syscall_set_return_value()
+	  - SIGSYS siginfo_t support
+	  - secure_computing is called from a ptrace_event()-safe context
+	  - secure_computing return value is checked and a return value of -1
+	    results in the system call being skipped immediately.
+	  - seccomp syscall wired up
+
+config SECCOMP_MEMMAP
+	def_bool y
+	depends on HAVE_ARCH_SECCOMP_MEMMAP && SECCOMP
+	help
+	  Enable tasks to trap syscalls based on the page that triggered
+	  the syscall.
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2d3f963fd6f1..70998b01c533 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -145,6 +145,7 @@  config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
+	select HAVE_ARCH_SECCOMP_MEMMAP
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a323422d783..6271fb7fd5bc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -294,11 +294,13 @@  extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -340,6 +342,12 @@  extern unsigned int kobjsize(const void *objp);
 # define VM_GROWSUP	VM_NONE
 #endif
 
+#if defined(CONFIG_X86)
+# define VM_NOSYSCALL	VM_HIGH_ARCH_5
+#else
+# define VM_NOSYSCALL	VM_NONE
+#endif
+
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
 
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 4b08e9c9c538..a5ca42eb685a 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -94,7 +94,8 @@  static inline void vm_unacct_memory(long pages)
  */
 static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
 {
-	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0;
+	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM
+			 | PROT_NOSYSCALL)) == 0;
 }
 #define arch_validate_prot arch_validate_prot
 #endif
@@ -119,6 +120,7 @@  calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
 	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
 	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
 	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
+	       _calc_vm_trans(prot, PROT_NOSYSCALL, VM_NOSYSCALL) |
 	       arch_calc_vm_prot_bits(prot, pkey);
 }
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..4eafbaa3f4bc 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -13,6 +13,8 @@ 
 #define PROT_SEM	0x8		/* page may be used for atomic ops */
 /*			0x10		   reserved for arch-specific use */
 /*			0x20		   reserved for arch-specific use */
+#define PROT_NOSYSCALL	0x40		/* page cannot trigger syscalls */
+
 #define PROT_NONE	0x0		/* page can not be accessed */
 #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
 #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index c1735455bc53..c7d042a409e7 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -10,12 +10,14 @@ 
 #define SECCOMP_MODE_DISABLED	0 /* seccomp is not in use. */
 #define SECCOMP_MODE_STRICT	1 /* uses hard-coded filter. */
 #define SECCOMP_MODE_FILTER	2 /* uses user-supplied filter. */
+#define SECCOMP_MODE_MEMMAP	3 /* Lock syscalls per memory region. */
 
 /* Valid operations for seccomp syscall. */
 #define SECCOMP_SET_MODE_STRICT		0
 #define SECCOMP_SET_MODE_FILTER		1
 #define SECCOMP_GET_ACTION_AVAIL	2
 #define SECCOMP_GET_NOTIF_SIZES		3
+#define SECCOMP_SET_MODE_MEMMAP		4
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 55a6184f5990..ebf09b02db8d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -930,6 +930,55 @@  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 }
 #endif
 
+#ifdef CONFIG_SECCOMP_MEMMAP
+static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
+{
+	struct vm_area_struct *vma = find_vma(current->mm,
+					      sd->instruction_pointer);
+	if (!vma)
+		BUG();
+
+	if (!(vma->vm_flags & VM_NOSYSCALL))
+		return 0;
+
+	syscall_rollback(current, task_pt_regs(current));
+	seccomp_send_sigsys(this_syscall, 0);
+
+	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true);
+
+	return -1;
+}
+
+static long seccomp_set_mode_memmap(unsigned int flags)
+{
+	const unsigned long seccomp_mode = SECCOMP_MODE_MEMMAP;
+	long ret = 0;
+
+	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
+		return -EINVAL;
+
+	spin_lock_irq(&current->sighand->siglock);
+
+	if (seccomp_may_assign_mode(seccomp_mode))
+		seccomp_assign_mode(current, seccomp_mode, flags);
+	else
+		ret = -EINVAL;
+
+	spin_unlock_irq(&current->sighand->siglock);
+
+	return ret;
+}
+#else
+static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
+{
+	BUG();
+}
+static long seccomp_set_mode_memmap(unsigned int flags)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_SECCOMP_MEMMAP */
+
 int __secure_computing(const struct seccomp_data *sd)
 {
 	int mode = current->seccomp.mode;
@@ -948,6 +997,8 @@  int __secure_computing(const struct seccomp_data *sd)
 		return 0;
 	case SECCOMP_MODE_FILTER:
 		return __seccomp_filter(this_syscall, sd, false);
+	case SECCOMP_MODE_MEMMAP:
+		return __seccomp_memmap(this_syscall, sd);
 	default:
 		BUG();
 	}
@@ -1425,6 +1476,10 @@  static long do_seccomp(unsigned int op, unsigned int flags,
 			return -EINVAL;
 
 		return seccomp_get_notif_sizes(uargs);
+	case SECCOMP_SET_MODE_MEMMAP:
+		if (uargs != NULL)
+			return -EINVAL;
+		return seccomp_set_mode_memmap(flags);
 	default:
 		return -EINVAL;
 	}
@@ -1462,6 +1517,10 @@  long prctl_set_seccomp(unsigned long seccomp_mode, void __user *filter)
 		op = SECCOMP_SET_MODE_FILTER;
 		uargs = filter;
 		break;
+	case SECCOMP_MODE_MEMMAP:
+		op = SECCOMP_SET_MODE_MEMMAP;
+		uargs = NULL;
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 494192ca954b..6b5c00e8bbdc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -591,7 +591,7 @@  static int do_mprotect_pkey(unsigned long start, size_t len,
 		 * cleared from the VMA.
 		 */
 		mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC |
-					VM_FLAGS_CLEAR;
+			VM_NOSYSCALL | VM_FLAGS_CLEAR;
 
 		new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey);
 		newflags = calc_vm_prot_bits(prot, new_vma_pkey);