diff mbox series

Adds a new ioctl32 syscall for backwards compatibility layers

Message ID 20210106064807.253112-1-Sonicadvance1@gmail.com (mailing list archive)
State New, archived
Headers show
Series Adds a new ioctl32 syscall for backwards compatibility layers | expand

Commit Message

Ryan Houdek Jan. 6, 2021, 6:48 a.m. UTC
From: Ryan Houdek <Sonicadvance1@gmail.com>

Problem presented:
A backwards compatibility layer that allows running x86-64 and x86
processes inside of an AArch64 process.
  - CPU is emulated
  - Syscall interface is mostly passthrough
  - Some syscalls require patching or emulation depending on behaviour
  - Not viable from the emulator design to use an AArch32 host process

x86-64 and x86 userspace emulator source:
https://github.com/FEX-Emu/FEX
Usage of ioctl32 is currently in a downstream fork. This will be the
first user of the syscall.

Cross documentation:
https://github.com/FEX-Emu/FEX/wiki/32Bit-x86-Woes#ioctl---54

ioctls are opaque from the emulator perspective and the data wants to be
passed through a syscall as unimpeded as possible.
Sadly due to ioctl struct differences between x86 and x86-64, we need a
syscall that exposes the compatibility ioctl handler to userspace in a
64bit process.

This is necessary behaves of the behaviour differences that occur
between an x86 process doing an ioctl and an x86-64 process doing an
ioctl.

Both of which are captured and passed through the AArch64 ioctl space.
This is implementing a new ioctl32 syscall that allows us to pass 32bit
x86 ioctls through to the kernel with zero or minimal manipulation.

The only supported hosts where we care about this currently is AArch64
and x86-64 (For testing purposes).
PPC64LE, MIPS64LE, and RISC-V64 might be interesting to support in the
future; But I don't have any platforms that get anywhere near Cortex-A77
performance in those architectures. Nor do I have the time to bring up
the emulator on them.
x86-64 can get to the compatibility ioctl through the int $0x80 handler.

This does not solve the following problems:
1) compat_alloc_user_space inside ioctl
2) ioctls that check task mode instead of entry point for behaviour
3) ioctls allocating memory
4) struct packing problems between architectures

Workarounds for the problems presented:
1a) Do a stack pivot to the lower 32bits from userspace
  - Forces host 64bit process to have its thread stacks to live in 32bit
  space. Not ideal.
  - Only do a stack pivot on ioctl to save previous 32bit VA space
1b) Teach kernel that compat_alloc_userspace can return a 64bit pointer
  - x86-64 truncates stack from this function
  - AArch64 returns the full stack pointer
  - Only ~29 users. Validating all of them support a 64bit stack is
  trivial?

2a) Any application using these can be checked for compatibility in
userspace and put on a block list.
2b) Fix any ioctls doing broken behaviour based on task mode rather than
ioctl entry point

3a) Userspace consumes all VA space above 32bit. Forcing allocations to
occur in lower 32bits
  - This is the current implementation
3b) Ensure any allocation in the ioctl handles ioctl entrypoint rather
than just allow generic memory allocations in full VA space
  - This is hard to guarantee

4a) Blocklist any application using ioctls that have different struct
packing across the boundary
  - Can happen when struct packing of 32bit x86 application goes down
  the aarch64 compat_ioctl path
  - Userspace is a AArch64 process passing 32bit x86 ioctl structures
  through the compat_ioctl path which is typically for AArch32 processes
  - None currently identified
4b) Work with upstream kernel and userspace projects to evaluate and fix
  - Identify the problem ioctls
  - Implement a new ioctl with more sane struct packing that matches
  cross-arch
  - Implement new ioctl while maintaining backwards compatibility with
  previous ioctl handler
  - Change upstream project to use the new compatibility ioctl
  - ioctl deprecation will be case by case per device and project
4b) Userspace implements a full ioctl emulation layer
  - Parses the full ioctl tree
  - Either passes through ioctls that it doesn't understand or
  transforms ioctls that it knows are trouble
  - Has the downside that it can still run in to edge cases that will
  fail
  - Performance of additional tracking is a concern
  - Prone to failure keeping the kernel ioctl and userspace ioctl
  handling in sync
  - Really want to have it in the kernel space as much as possible

Signed-off-by: Ryan Houdek <Sonicadvance1@gmail.com>
---
 arch/arm64/include/asm/unistd.h         |  2 +-
 arch/arm64/include/asm/unistd32.h       |  2 ++
 fs/ioctl.c                              | 16 ++++++++++++++--
 include/linux/syscalls.h                |  2 ++
 include/uapi/asm-generic/unistd.h       |  9 ++++++++-
 kernel/sys_ni.c                         |  3 +++
 tools/include/uapi/asm-generic/unistd.h |  9 ++++++++-
 7 files changed, 38 insertions(+), 5 deletions(-)

Comments

Arnd Bergmann Jan. 6, 2021, 8:48 a.m. UTC | #1
On Wed, Jan 6, 2021 at 7:48 AM <sonicadvance1@gmail.com> wrote:
> From: Ryan Houdek <Sonicadvance1@gmail.com>
...
> This does not solve the following problems:
> 1) compat_alloc_user_space inside ioctl
> 2) ioctls that check task mode instead of entry point for behaviour
> 3) ioctls allocating memory
> 4) struct packing problems between architectures
>
> Workarounds for the problems presented:
> 1a) Do a stack pivot to the lower 32bits from userspace
>   - Forces host 64bit process to have its thread stacks to live in 32bit
>   space. Not ideal.
>   - Only do a stack pivot on ioctl to save previous 32bit VA space
> 1b) Teach kernel that compat_alloc_userspace can return a 64bit pointer
>   - x86-64 truncates stack from this function
>   - AArch64 returns the full stack pointer
>   - Only ~29 users. Validating all of them support a 64bit stack is
>   trivial?

I've almost completed the removal of compat_alloc_user_space(),
that should no longer be a concern when the syscall gets added.

> 2a) Any application using these can be checked for compatibility in
> userspace and put on a block list.
> 2b) Fix any ioctls doing broken behaviour based on task mode rather than
> ioctl entry point

What the ioctls() actually check is 'in_compat_syscall()', which is not
the mode of the task but the type of syscall. There is actually a general
trend to use this helper more rather than less, and I think the only
way forward here is to ensure that this returns true when entering
through the new syscall number.

For x86, this has another complication, as some ioctls also need to
check whether they are in an ia32 task (with packed u64 and 32-bit
__kernel_old_time_t) or an x32 task (with aligned u64 and 64-bit
__kernel_old_time_t). If the new syscall gets wired up on x86 as well,
you'd need to decide which of the two behaviors you want.

> 3a) Userspace consumes all VA space above 32bit. Forcing allocations to
> occur in lower 32bits
>   - This is the current implementation
> 3b) Ensure any allocation in the ioctl handles ioctl entrypoint rather
> than just allow generic memory allocations in full VA space
>   - This is hard to guarantee

What kind of allocation do you mean here? Can you give an example of
an ioctl that does this?

> 4a) Blocklist any application using ioctls that have different struct
> packing across the boundary
>   - Can happen when struct packing of 32bit x86 application goes down
>   the aarch64 compat_ioctl path
>   - Userspace is a AArch64 process passing 32bit x86 ioctl structures
>   through the compat_ioctl path which is typically for AArch32 processes
>   - None currently identified
> 4b) Work with upstream kernel and userspace projects to evaluate and fix
>   - Identify the problem ioctls
>   - Implement a new ioctl with more sane struct packing that matches
>   cross-arch
>   - Implement new ioctl while maintaining backwards compatibility with
>   previous ioctl handler
>   - Change upstream project to use the new compatibility ioctl
>   - ioctl deprecation will be case by case per device and project
> 4b) Userspace implements a full ioctl emulation layer
>   - Parses the full ioctl tree
>   - Either passes through ioctls that it doesn't understand or
>   transforms ioctls that it knows are trouble
>   - Has the downside that it can still run in to edge cases that will
>   fail
>   - Performance of additional tracking is a concern
>   - Prone to failure keeping the kernel ioctl and userspace ioctl
>   handling in sync
>   - Really want to have it in the kernel space as much as possible

I think there are only a few ioctls that are affected, and you can
probably get a list from qemu, which emulates them in user space
already. Doing that transformation should not be all that hard
in the end.

If we want to do this in the kernel, this probably requires changes
to the syscall calling convention. Adding a flag to pick a particular
style of ioctl arguments would work, and this could enable the
case of emulating arm32 ioctls on x86-64 hosts.

> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 86a9d7b3eabe..949788f5ba40 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -38,7 +38,7 @@
>  #define __ARM_NR_compat_set_tls                (__ARM_NR_COMPAT_BASE + 5)
>  #define __ARM_NR_COMPAT_END            (__ARM_NR_COMPAT_BASE + 0x800)
>
> -#define __NR_compat_syscalls           442
> +#define __NR_compat_syscalls           443
>  #endif
>
>  #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index cccfbbefbf95..35e3bc83dbdc 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -891,6 +891,8 @@ __SYSCALL(__NR_faccessat2, sys_faccessat2)
>  __SYSCALL(__NR_process_madvise, sys_process_madvise)
>  #define __NR_epoll_pwait2 441
>  __SYSCALL(__NR_epoll_pwait2, compat_sys_epoll_pwait2)
> +#define __NR_ioctl32 442
> +__SYSCALL(__NR_ioctl32, compat_sys_ioctl)
>

I'm not sure why you want this in 32-bit processes, can't they just call
the normal ioctl() function?

>  }
> +
> +COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
> +                       compat_ulong_t, arg)
> +{
> +       return do_ioctl32(fd, cmd, arg);
> +}
> +
> +SYSCALL_DEFINE3(ioctl32, unsigned int, fd, unsigned int, cmd,
> +                       compat_ulong_t, arg)
> +{
> +       return do_ioctl32(fd, cmd, arg);
> +}

These two look identical to me, I don't think you need to add a wrapper
here at all, but can just use the normal compat_sys_ioctl entry point
unless you want to add a 'flags' argument to control the struct padding.

> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 728752917785..18279e5b7b4f 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -862,8 +862,15 @@ __SYSCALL(__NR_process_madvise, sys_process_madvise)
>  #define __NR_epoll_pwait2 441
>  __SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2)
>
> +#define __NR_ioctl32 442
> +#ifdef CONFIG_COMPAT
> +__SC_COMP(__NR_ioctl32, sys_ioctl32, compat_sys_ioctl)
> +#else
> +__SC_COMP(__NR_ioctl32, sys_ni_syscall, sys_ni_syscall)
> +#endif
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 442
> +#define __NR_syscalls 443

(already mentioned on IRC)

If you add it here, the same number should be assigned across all architectures,
or at least a comment added to declare the number as reserved, to keep
the following syscalls in sync.

        Arnd
Amanieu d'Antras Jan. 6, 2021, 10:57 p.m. UTC | #2
I encountered a similar problem when writing the Tango binary
translator (https://www.amanieusystems.com/). Tango allows AArch32
programs to run on AArch64 CPUs that don't support AArch32 (e.g.
ThunderX). The technology has been licensed to several customers who
are primarily using it to run Android instances on cloud servers.

Doing compat system call translation entirely in user doesn't work out
for several reasons:
- As Ryan stated, it is impractical and error-prone to manually
translate all ioctls in user space. It would be much better to reuse
the existing compat layer in the kernel. This is even worse if you
take out-of-tree drivers (which have their own ioctls) into account.
- There are quite a few system calls which create VM mappings: the
usual suspects (mmap, etc) but also io_setup and some ioctls (e.g. in
the out-of-tree Mali GPU drivers).
- Seccomp filters simply don't work. They could be emulated in user
space but this is insecure since the translator is not designed to be
a secure sandbox. They also don't propagate across execve.
- Some syscalls rely on information that is only known to the kernel.
For example, a hugetlbfs fd can only be mapped at a huge page
boundary, but the translator cannot know this when selecting a virtual
address in the low 4GB for the mapping.

The solution that we ended up with was to allow AArch64 processes to
issue AArch32 syscalls by using a special system call number. This has
several effects for the duration of the syscall:
- The compat syscall table is used instead of the primary one.
- is_compat_task/in_compat_syscall return true.
- get_unmapped_area returns addresses below 4G and uses a separate
mmap_base. The separate mmap_base is needed to allow 32-bit
applications to benefit from address space randomization. Tango still
uses normal mmap syscalls for private memory allocations that are not
visible to the translated 32-bit process.
- KSTK_EIP and KSTK_ESP return the values of x13/x15 instead of pc/sp.
This is necessary for correct functioning seccomp filters and SELinux
checks respectively. Tango will set x13 and x15 to the correct values
when issuing a 32-bit syscall.
- System call restart after a signal sets things up so that
restart_syscall is executed as a 32-bit syscall after the signal.

I feel that a solution along these lines would also solve Ryan's
problem since the vast majority of the syscall translation work is
from the difference between 32-bit and 64-bit data structures in
memory. The remaining few differences in the syscall ABI between
AArch32 and x86-32 can be handled in user mode by the translator
program.

The kernel patch (based on the v5.4 LTS kernel) can be found at
https://github.com/Amanieu/linux/tree/tango-v5.4.





On Wed, Jan 6, 2021 at 6:49 AM <sonicadvance1@gmail.com> wrote:
>
> From: Ryan Houdek <Sonicadvance1@gmail.com>
>
> Problem presented:
> A backwards compatibility layer that allows running x86-64 and x86
> processes inside of an AArch64 process.
>   - CPU is emulated
>   - Syscall interface is mostly passthrough
>   - Some syscalls require patching or emulation depending on behaviour
>   - Not viable from the emulator design to use an AArch32 host process
>
> x86-64 and x86 userspace emulator source:
> https://github.com/FEX-Emu/FEX
> Usage of ioctl32 is currently in a downstream fork. This will be the
> first user of the syscall.
>
> Cross documentation:
> https://github.com/FEX-Emu/FEX/wiki/32Bit-x86-Woes#ioctl---54
>
> ioctls are opaque from the emulator perspective and the data wants to be
> passed through a syscall as unimpeded as possible.
> Sadly due to ioctl struct differences between x86 and x86-64, we need a
> syscall that exposes the compatibility ioctl handler to userspace in a
> 64bit process.
>
> This is necessary behaves of the behaviour differences that occur
> between an x86 process doing an ioctl and an x86-64 process doing an
> ioctl.
>
> Both of which are captured and passed through the AArch64 ioctl space.
> This is implementing a new ioctl32 syscall that allows us to pass 32bit
> x86 ioctls through to the kernel with zero or minimal manipulation.
>
> The only supported hosts where we care about this currently is AArch64
> and x86-64 (For testing purposes).
> PPC64LE, MIPS64LE, and RISC-V64 might be interesting to support in the
> future; But I don't have any platforms that get anywhere near Cortex-A77
> performance in those architectures. Nor do I have the time to bring up
> the emulator on them.
> x86-64 can get to the compatibility ioctl through the int $0x80 handler.
>
> This does not solve the following problems:
> 1) compat_alloc_user_space inside ioctl
> 2) ioctls that check task mode instead of entry point for behaviour
> 3) ioctls allocating memory
> 4) struct packing problems between architectures
>
> Workarounds for the problems presented:
> 1a) Do a stack pivot to the lower 32bits from userspace
>   - Forces host 64bit process to have its thread stacks to live in 32bit
>   space. Not ideal.
>   - Only do a stack pivot on ioctl to save previous 32bit VA space
> 1b) Teach kernel that compat_alloc_userspace can return a 64bit pointer
>   - x86-64 truncates stack from this function
>   - AArch64 returns the full stack pointer
>   - Only ~29 users. Validating all of them support a 64bit stack is
>   trivial?
>
> 2a) Any application using these can be checked for compatibility in
> userspace and put on a block list.
> 2b) Fix any ioctls doing broken behaviour based on task mode rather than
> ioctl entry point
>
> 3a) Userspace consumes all VA space above 32bit. Forcing allocations to
> occur in lower 32bits
>   - This is the current implementation
> 3b) Ensure any allocation in the ioctl handles ioctl entrypoint rather
> than just allow generic memory allocations in full VA space
>   - This is hard to guarantee
>
> 4a) Blocklist any application using ioctls that have different struct
> packing across the boundary
>   - Can happen when struct packing of 32bit x86 application goes down
>   the aarch64 compat_ioctl path
>   - Userspace is a AArch64 process passing 32bit x86 ioctl structures
>   through the compat_ioctl path which is typically for AArch32 processes
>   - None currently identified
> 4b) Work with upstream kernel and userspace projects to evaluate and fix
>   - Identify the problem ioctls
>   - Implement a new ioctl with more sane struct packing that matches
>   cross-arch
>   - Implement new ioctl while maintaining backwards compatibility with
>   previous ioctl handler
>   - Change upstream project to use the new compatibility ioctl
>   - ioctl deprecation will be case by case per device and project
> 4b) Userspace implements a full ioctl emulation layer
>   - Parses the full ioctl tree
>   - Either passes through ioctls that it doesn't understand or
>   transforms ioctls that it knows are trouble
>   - Has the downside that it can still run in to edge cases that will
>   fail
>   - Performance of additional tracking is a concern
>   - Prone to failure keeping the kernel ioctl and userspace ioctl
>   handling in sync
>   - Really want to have it in the kernel space as much as possible
>
> Signed-off-by: Ryan Houdek <Sonicadvance1@gmail.com>
> ---
>  arch/arm64/include/asm/unistd.h         |  2 +-
>  arch/arm64/include/asm/unistd32.h       |  2 ++
>  fs/ioctl.c                              | 16 ++++++++++++++--
>  include/linux/syscalls.h                |  2 ++
>  include/uapi/asm-generic/unistd.h       |  9 ++++++++-
>  kernel/sys_ni.c                         |  3 +++
>  tools/include/uapi/asm-generic/unistd.h |  9 ++++++++-
>  7 files changed, 38 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 86a9d7b3eabe..949788f5ba40 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -38,7 +38,7 @@
>  #define __ARM_NR_compat_set_tls                (__ARM_NR_COMPAT_BASE + 5)
>  #define __ARM_NR_COMPAT_END            (__ARM_NR_COMPAT_BASE + 0x800)
>
> -#define __NR_compat_syscalls           442
> +#define __NR_compat_syscalls           443
>  #endif
>
>  #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index cccfbbefbf95..35e3bc83dbdc 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -891,6 +891,8 @@ __SYSCALL(__NR_faccessat2, sys_faccessat2)
>  __SYSCALL(__NR_process_madvise, sys_process_madvise)
>  #define __NR_epoll_pwait2 441
>  __SYSCALL(__NR_epoll_pwait2, compat_sys_epoll_pwait2)
> +#define __NR_ioctl32 442
> +__SYSCALL(__NR_ioctl32, compat_sys_ioctl)
>
>  /*
>   * Please add new compat syscalls above this comment and update
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 4e6cc0a7d69c..116b9bca8c07 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -790,8 +790,8 @@ long compat_ptr_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  }
>  EXPORT_SYMBOL(compat_ptr_ioctl);
>
> -COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
> -                      compat_ulong_t, arg)
> +long do_ioctl32(unsigned int fd, unsigned int cmd,
> +                       compat_ulong_t arg)
>  {
>         struct fd f = fdget(fd);
>         int error;
> @@ -850,4 +850,16 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
>
>         return error;
>  }
> +
> +COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
> +                       compat_ulong_t, arg)
> +{
> +       return do_ioctl32(fd, cmd, arg);
> +}
> +
> +SYSCALL_DEFINE3(ioctl32, unsigned int, fd, unsigned int, cmd,
> +                       compat_ulong_t, arg)
> +{
> +       return do_ioctl32(fd, cmd, arg);
> +}
>  #endif
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index f3929aff39cf..470f928831eb 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -386,6 +386,8 @@ asmlinkage long sys_inotify_rm_watch(int fd, __s32 wd);
>  /* fs/ioctl.c */
>  asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd,
>                                 unsigned long arg);
> +asmlinkage long sys_ioctl32(unsigned int fd, unsigned int cmd,
> +                               compat_ulong_t arg);
>
>  /* fs/ioprio.c */
>  asmlinkage long sys_ioprio_set(int which, int who, int ioprio);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 728752917785..18279e5b7b4f 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -862,8 +862,15 @@ __SYSCALL(__NR_process_madvise, sys_process_madvise)
>  #define __NR_epoll_pwait2 441
>  __SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2)
>
> +#define __NR_ioctl32 442
> +#ifdef CONFIG_COMPAT
> +__SC_COMP(__NR_ioctl32, sys_ioctl32, compat_sys_ioctl)
> +#else
> +__SC_COMP(__NR_ioctl32, sys_ni_syscall, sys_ni_syscall)
> +#endif
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 442
> +#define __NR_syscalls 443
>
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 19aa806890d5..5a2f25eb341c 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -302,6 +302,9 @@ COND_SYSCALL(recvmmsg_time32);
>  COND_SYSCALL_COMPAT(recvmmsg_time32);
>  COND_SYSCALL_COMPAT(recvmmsg_time64);
>
> +COND_SYSCALL(ioctl32);
> +COND_SYSCALL_COMPAT(ioctl32);
> +
>  /*
>   * Architecture specific syscalls: see further below
>   */
> diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
> index 728752917785..18279e5b7b4f 100644
> --- a/tools/include/uapi/asm-generic/unistd.h
> +++ b/tools/include/uapi/asm-generic/unistd.h
> @@ -862,8 +862,15 @@ __SYSCALL(__NR_process_madvise, sys_process_madvise)
>  #define __NR_epoll_pwait2 441
>  __SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2)
>
> +#define __NR_ioctl32 442
> +#ifdef CONFIG_COMPAT
> +__SC_COMP(__NR_ioctl32, sys_ioctl32, compat_sys_ioctl)
> +#else
> +__SC_COMP(__NR_ioctl32, sys_ni_syscall, sys_ni_syscall)
> +#endif
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 442
> +#define __NR_syscalls 443
>
>  /*
>   * 32 bit systems traditionally used different
> --
> 2.27.0
>
Ryan Houdek Jan. 15, 2021, 2:21 a.m. UTC | #3
On Wed, Jan 6, 2021 at 12:49 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> On Wed, Jan 6, 2021 at 7:48 AM <sonicadvance1@gmail.com> wrote:
> > From: Ryan Houdek <Sonicadvance1@gmail.com>
> ...
> > This does not solve the following problems:
> > 1) compat_alloc_user_space inside ioctl
> > 2) ioctls that check task mode instead of entry point for behaviour
> > 3) ioctls allocating memory
> > 4) struct packing problems between architectures
> >
> > Workarounds for the problems presented:
> > 1a) Do a stack pivot to the lower 32bits from userspace
> >   - Forces host 64bit process to have its thread stacks to live in 32bit
> >   space. Not ideal.
> >   - Only do a stack pivot on ioctl to save previous 32bit VA space
> > 1b) Teach kernel that compat_alloc_userspace can return a 64bit pointer
> >   - x86-64 truncates stack from this function
> >   - AArch64 returns the full stack pointer
> >   - Only ~29 users. Validating all of them support a 64bit stack is
> >   trivial?
>
> I've almost completed the removal of compat_alloc_user_space(),
> that should no longer be a concern when the syscall gets added.
>
> > 2a) Any application using these can be checked for compatibility in
> > userspace and put on a block list.
> > 2b) Fix any ioctls doing broken behaviour based on task mode rather than
> > ioctl entry point
>
> What the ioctls() actually check is 'in_compat_syscall()', which is not
> the mode of the task but the type of syscall. There is actually a general
> trend to use this helper more rather than less, and I think the only
> way forward here is to ensure that this returns true when entering
> through the new syscall number.
>
> For x86, this has another complication, as some ioctls also need to
> check whether they are in an ia32 task (with packed u64 and 32-bit
> __kernel_old_time_t) or an x32 task (with aligned u64 and 64-bit
> __kernel_old_time_t). If the new syscall gets wired up on x86 as well,
> you'd need to decide which of the two behaviors you want.

I can have a follow-up patch that makes this do ni_syscall on x86_64
since we can go through the int 0x80 handler, or x32 handler path and
choose whichever one there.

>
> > 3a) Userspace consumes all VA space above 32bit. Forcing allocations to
> > occur in lower 32bits
> >   - This is the current implementation
> > 3b) Ensure any allocation in the ioctl handles ioctl entrypoint rather
> > than just allow generic memory allocations in full VA space
> >   - This is hard to guarantee
>
> What kind of allocation do you mean here? Can you give an example of
> an ioctl that does this?

My concern here would be something like DRM allocating memory and
returning a pointer to userspace that ends up in 64bit space.
I can see something like `drm_get_unmapped_area` calls in to
`current->mm->get_unmapped_area` which I believe only ends up falling
down TASK_SIZE checks.
Which could potentially return pointers in the 64bit address space
range in this case. Theoretically can be resolved either by thieving
the full 64bit VA range, or doing something like the Tango layer
patches that on syscall entry changes the syscall to a "compat"
syscall.
compat syscall flag like Tango might be nicer here?

>
> > 4a) Blocklist any application using ioctls that have different struct
> > packing across the boundary
> >   - Can happen when struct packing of 32bit x86 application goes down
> >   the aarch64 compat_ioctl path
> >   - Userspace is a AArch64 process passing 32bit x86 ioctl structures
> >   through the compat_ioctl path which is typically for AArch32 processes
> >   - None currently identified
> > 4b) Work with upstream kernel and userspace projects to evaluate and fix
> >   - Identify the problem ioctls
> >   - Implement a new ioctl with more sane struct packing that matches
> >   cross-arch
> >   - Implement new ioctl while maintaining backwards compatibility with
> >   previous ioctl handler
> >   - Change upstream project to use the new compatibility ioctl
> >   - ioctl deprecation will be case by case per device and project
> > 4b) Userspace implements a full ioctl emulation layer
> >   - Parses the full ioctl tree
> >   - Either passes through ioctls that it doesn't understand or
> >   transforms ioctls that it knows are trouble
> >   - Has the downside that it can still run in to edge cases that will
> >   fail
> >   - Performance of additional tracking is a concern
> >   - Prone to failure keeping the kernel ioctl and userspace ioctl
> >   handling in sync
> >   - Really want to have it in the kernel space as much as possible
>
> I think there are only a few ioctls that are affected, and you can
> probably get a list from qemu, which emulates them in user space
> already. Doing that transformation should not be all that hard
> in the end.
>
> If we want to do this in the kernel, this probably requires changes
> to the syscall calling convention. Adding a flag to pick a particular
> style of ioctl arguments would work, and this could enable the
> case of emulating arm32 ioctls on x86-64 hosts.
>
> > diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> > index 86a9d7b3eabe..949788f5ba40 100644
> > --- a/arch/arm64/include/asm/unistd.h
> > +++ b/arch/arm64/include/asm/unistd.h
> > @@ -38,7 +38,7 @@
> >  #define __ARM_NR_compat_set_tls                (__ARM_NR_COMPAT_BASE + 5)
> >  #define __ARM_NR_COMPAT_END            (__ARM_NR_COMPAT_BASE + 0x800)
> >
> > -#define __NR_compat_syscalls           442
> > +#define __NR_compat_syscalls           443
> >  #endif
> >
> >  #define __ARCH_WANT_SYS_CLONE
> > diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> > index cccfbbefbf95..35e3bc83dbdc 100644
> > --- a/arch/arm64/include/asm/unistd32.h
> > +++ b/arch/arm64/include/asm/unistd32.h
> > @@ -891,6 +891,8 @@ __SYSCALL(__NR_faccessat2, sys_faccessat2)
> >  __SYSCALL(__NR_process_madvise, sys_process_madvise)
> >  #define __NR_epoll_pwait2 441
> >  __SYSCALL(__NR_epoll_pwait2, compat_sys_epoll_pwait2)
> > +#define __NR_ioctl32 442
> > +__SYSCALL(__NR_ioctl32, compat_sys_ioctl)
> >
>
> I'm not sure why you want this in 32-bit processes, can't they just call
> the normal ioctl() function?
>

I'll have a follow up patch that on 32bit hosts this is a ni_syscall instead

> >  }
> > +
> > +COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
> > +                       compat_ulong_t, arg)
> > +{
> > +       return do_ioctl32(fd, cmd, arg);
> > +}
> > +
> > +SYSCALL_DEFINE3(ioctl32, unsigned int, fd, unsigned int, cmd,
> > +                       compat_ulong_t, arg)
> > +{
> > +       return do_ioctl32(fd, cmd, arg);
> > +}
>
> These two look identical to me, I don't think you need to add a wrapper
> here at all, but can just use the normal compat_sys_ioctl entry point
> unless you want to add a 'flags' argument to control the struct padding.

I tried having the dispatch table call directly in to the COMPAT one
and the way things were lining up weren't allowing me to do this.
Since this is a bit unique in how it operates, I'm not quite sure if
there is another example I could pull from for this.

>
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index 728752917785..18279e5b7b4f 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -862,8 +862,15 @@ __SYSCALL(__NR_process_madvise, sys_process_madvise)
> >  #define __NR_epoll_pwait2 441
> >  __SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2)
> >
> > +#define __NR_ioctl32 442
> > +#ifdef CONFIG_COMPAT
> > +__SC_COMP(__NR_ioctl32, sys_ioctl32, compat_sys_ioctl)
> > +#else
> > +__SC_COMP(__NR_ioctl32, sys_ni_syscall, sys_ni_syscall)
> > +#endif
> > +
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 442
> > +#define __NR_syscalls 443
>
> (already mentioned on IRC)
>

> If you add it here, the same number should be assigned across all architectures,
> or at least a comment added to declare the number as reserved, to keep
> the following syscalls in sync.

I can have a followup patch that adds this to all the backends.
>
>         Arnd
Arnd Bergmann Jan. 15, 2021, 9 a.m. UTC | #4
On Fri, Jan 15, 2021 at 3:06 AM Ryan Houdek <sonicadvance1@gmail.com> wrote:
> On Wed, Jan 6, 2021 at 12:49 AM Arnd Bergmann <arnd@kernel.org> wrote:
>> On Wed, Jan 6, 2021 at 7:48 AM <sonicadvance1@gmail.com> wrote:
>> > From: Ryan Houdek <Sonicadvance1@gmail.com>
>> ...
>>
>> For x86, this has another complication, as some ioctls also need to
>> check whether they are in an ia32 task (with packed u64 and 32-bit
>> __kernel_old_time_t) or an x32 task (with aligned u64 and 64-bit
>> __kernel_old_time_t). If the new syscall gets wired up on x86 as well,
>> you'd need to decide which of the two behaviors you want.
>
>
> I can have a follow-up patch that makes this do ni-syscall on x86_64 since
> we can go through the int 0x80 handler, or x32 handler path and choose
> whichever one there.

I'd say for consistency

>> > 3a) Userspace consumes all VA space above 32bit. Forcing allocations to
>> > occur in lower 32bits
>> >   - This is the current implementation
>> > 3b) Ensure any allocation in the ioctl handles ioctl entrypoint rather
>> > than just allow generic memory allocations in full VA space
>> >   - This is hard to guarantee
>>
>> What kind of allocation do you mean here? Can you give an example of
>> an ioctl that does this?
>>
> My concern here would be something like DRM allocating memory and
> returning a pointer to userspace that ends up in 64bit space.
> I can see something like `drm_get_unmapped_area` calls in to
>  `current->mm->get_unmapped_area` which I believe only ends up falling
> down TASK_SIZE checks.

I see.

> Which could potentially return pointers in the 64bit address space range
> in this case. Theoretically can be resolved either by thieving the full 64bit
> VA range, or doing something like the Tango layer patches that on
> syscall entry changes the syscall to a "compat" syscall.
> compat syscall flag like Tango might be nicer here?

Not sure how that flag is best encoded, but yes, it would have to be
somewhere that arch_get_unmapped_area() and
arch_get_mmap_end() can find. Clearly we want a solution that works
for both tango and for your work, as well as being portable to any
architecture.

>> >  }
>> > +
>> > +COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
>> > +                       compat_ulong_t, arg)
>> > +{
>> > +       return do_ioctl32(fd, cmd, arg);
>> > +}
>> > +
>> > +SYSCALL_DEFINE3(ioctl32, unsigned int, fd, unsigned int, cmd,
>> > +                       compat_ulong_t, arg)
>> > +{
>> > +       return do_ioctl32(fd, cmd, arg);
>> > +}
>>
>> These two look identical to me, I don't think you need to add a wrapper
>> here at all, but can just use the normal compat_sys_ioctl entry point
>> unless you want to add a 'flags' argument to control the struct padding.
>
>
> I tried having the dispatch table call directly in to the COMPAT one and
> the way things were lining up weren't allowing me to do this.
> Since this is a bit unique in how it operates, I'm not quite sure if there is
> another example I could pull from for this.

For the asm-generic/unistd.h, you should be able to write htis as

#if __BITS_PER_LONG == 64
__SC_COMP(__NR_ioctl32, compat_sys_ioctl, sys_ni_syscall)
#endif

Which means that the native syscall in a 64-bit process always
points to compat_sys_ioctl, while a 32-bit process always gets
-ENOSYS.

Similarly, the syscall_64.tbl file on x86 and the other 64-bit
architectures would use
442    64      ioctl32         compat_sys_ioctl

FWIW, I suppose you can rename compat_sys_ioctl to
sys_ioctl32 treewide, if that name makes more sense.

      Arnd
Andy Lutomirski Jan. 16, 2021, 12:07 a.m. UTC | #5
On Fri, Jan 15, 2021 at 1:03 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> On Fri, Jan 15, 2021 at 3:06 AM Ryan Houdek <sonicadvance1@gmail.com> wrote:
> > On Wed, Jan 6, 2021 at 12:49 AM Arnd Bergmann <arnd@kernel.org> wrote:
> >> On Wed, Jan 6, 2021 at 7:48 AM <sonicadvance1@gmail.com> wrote:
> >> > From: Ryan Houdek <Sonicadvance1@gmail.com>
> >> ...
> >>
> >> For x86, this has another complication, as some ioctls also need to
> >> check whether they are in an ia32 task (with packed u64 and 32-bit
> >> __kernel_old_time_t) or an x32 task (with aligned u64 and 64-bit
> >> __kernel_old_time_t). If the new syscall gets wired up on x86 as well,
> >> you'd need to decide which of the two behaviors you want.
> >
> >
> > I can have a follow-up patch that makes this do ni-syscall on x86_64 since
> > we can go through the int 0x80 handler, or x32 handler path and choose
> > whichever one there.
>
> I'd say for consistency
>

We need to make it crystal clear on x86 what this ioctl does.  We have
a silly selection of options:

 - ioctl32() via SYSCALL, x32 bit clear -- presumably does an i386 ioctl?
 - ioctl32() via SYSCALL, x32 bit set -- this needs to do something
clearly documented.
 - ioctl32() via int80 -- presumably you're not wiring this up

In any case, the compat alloc thing should just go away.  It's a hack
and serves no real purpose.

Finally, I'm not convinced that this patch works correctly.  We have
in_compat_syscall(), and code that uses it may well be reachable from
ioctl.  I personally would like to see in_compat_syscall() go away,
but some other people (Hi, Christoph!) disagree, and usage seems to be
increasing, not decreasing.
Christoph Hellwig Jan. 16, 2021, 9:07 a.m. UTC | #6
On Fri, Jan 15, 2021 at 04:07:46PM -0800, Andy Lutomirski wrote:
> Finally, I'm not convinced that this patch works correctly.  We have
> in_compat_syscall(), and code that uses it may well be reachable from
> ioctl.

ioctls are the prime user of in_compat_syscall().

> I personally would like to see in_compat_syscall() go away,
> but some other people (Hi, Christoph!) disagree, and usage seems to be
> increasing, not decreasing.

I'm absolutely against it going away.  in_compat_syscall helped to
remove so much crap compared to the explicit compat syscalls.
David Laight Jan. 17, 2021, 6:31 p.m. UTC | #7
From: Christoph Hellwig
> Sent: 16 January 2021 09:07
...
> > I personally would like to see in_compat_syscall() go away,
> > but some other people (Hi, Christoph!) disagree, and usage seems to be
> > increasing, not decreasing.
> 
> I'm absolutely against it going away.  in_compat_syscall helped to
> remove so much crap compared to the explicit compat syscalls.

The only other real option is to pass the 'syscall type' explicitly
through all the layers into every piece of code that might need it.

So passing it as a 'parameter' that is (probably) current->syscall_type
does make sense.

It might even make sense have separate bits for the required emulations.
So you'd have separate bits for '32bit pointers' and '64bit items 32bit
aligned' (etc).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 86a9d7b3eabe..949788f5ba40 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@ 
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		442
+#define __NR_compat_syscalls		443
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index cccfbbefbf95..35e3bc83dbdc 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -891,6 +891,8 @@  __SYSCALL(__NR_faccessat2, sys_faccessat2)
 __SYSCALL(__NR_process_madvise, sys_process_madvise)
 #define __NR_epoll_pwait2 441
 __SYSCALL(__NR_epoll_pwait2, compat_sys_epoll_pwait2)
+#define __NR_ioctl32 442
+__SYSCALL(__NR_ioctl32, compat_sys_ioctl)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 4e6cc0a7d69c..116b9bca8c07 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -790,8 +790,8 @@  long compat_ptr_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 }
 EXPORT_SYMBOL(compat_ptr_ioctl);
 
-COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
-		       compat_ulong_t, arg)
+long do_ioctl32(unsigned int fd, unsigned int cmd,
+			compat_ulong_t arg)
 {
 	struct fd f = fdget(fd);
 	int error;
@@ -850,4 +850,16 @@  COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
 
 	return error;
 }
+
+COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
+			compat_ulong_t, arg)
+{
+	return do_ioctl32(fd, cmd, arg);
+}
+
+SYSCALL_DEFINE3(ioctl32, unsigned int, fd, unsigned int, cmd,
+			compat_ulong_t, arg)
+{
+	return do_ioctl32(fd, cmd, arg);
+}
 #endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f3929aff39cf..470f928831eb 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -386,6 +386,8 @@  asmlinkage long sys_inotify_rm_watch(int fd, __s32 wd);
 /* fs/ioctl.c */
 asmlinkage long sys_ioctl(unsigned int fd, unsigned int cmd,
 				unsigned long arg);
+asmlinkage long sys_ioctl32(unsigned int fd, unsigned int cmd,
+				compat_ulong_t arg);
 
 /* fs/ioprio.c */
 asmlinkage long sys_ioprio_set(int which, int who, int ioprio);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 728752917785..18279e5b7b4f 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -862,8 +862,15 @@  __SYSCALL(__NR_process_madvise, sys_process_madvise)
 #define __NR_epoll_pwait2 441
 __SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2)
 
+#define __NR_ioctl32 442
+#ifdef CONFIG_COMPAT
+__SC_COMP(__NR_ioctl32, sys_ioctl32, compat_sys_ioctl)
+#else
+__SC_COMP(__NR_ioctl32, sys_ni_syscall, sys_ni_syscall)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 442
+#define __NR_syscalls 443
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 19aa806890d5..5a2f25eb341c 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -302,6 +302,9 @@  COND_SYSCALL(recvmmsg_time32);
 COND_SYSCALL_COMPAT(recvmmsg_time32);
 COND_SYSCALL_COMPAT(recvmmsg_time64);
 
+COND_SYSCALL(ioctl32);
+COND_SYSCALL_COMPAT(ioctl32);
+
 /*
  * Architecture specific syscalls: see further below
  */
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index 728752917785..18279e5b7b4f 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -862,8 +862,15 @@  __SYSCALL(__NR_process_madvise, sys_process_madvise)
 #define __NR_epoll_pwait2 441
 __SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2)
 
+#define __NR_ioctl32 442
+#ifdef CONFIG_COMPAT
+__SC_COMP(__NR_ioctl32, sys_ioctl32, compat_sys_ioctl)
+#else
+__SC_COMP(__NR_ioctl32, sys_ni_syscall, sys_ni_syscall)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 442
+#define __NR_syscalls 443
 
 /*
  * 32 bit systems traditionally used different