Message ID: 20200716185424.011950288@linutronix.de
Series: entry, x86, kvm: Generic entry/exit functionality for host and guest
On Thu, Jul 16, 2020 at 08:22:09PM +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> On syscall entry certain work needs to be done:
>
>    - Establish state (lockdep, context tracking, tracing)
>    - Conditional work (ptrace, seccomp, audit...)
>
> This code is needlessly duplicated and different in all
> architectures.
>
> Provide a generic version based on the x86 implementation which has all the
> RCU and instrumentation bits right.

Ahh! You're reading my mind! I was just thinking about this while
reviewing the proposed syscall redirection series[1], pondering the lack
of free x86 TIF flags, and realizing that nearly everything in that
series (and for seccomp and other things) didn't need to be
arch-specific. And now that series absolutely needs to be rebased, and
it'll magically work for every arch that switches to the generic entry
code. :)

Notes below...

[1] https://lore.kernel.org/lkml/20200716193141.4068476-2-krisman@collabora.com/

> +/*
> + * Define dummy _TIF work flags if not defined by the architecture or for
> + * disabled functionality.
> + */

When I was thinking about this last week I was pondering having a split
between the arch-agnostic TIF flags and the arch-specific TIF flags:
each arch could have a single "there is agnostic work to be done" TIF
in its thread_info, and the agnostic flags could live in task_struct or
something. Anyway, I'll keep reading...

> +/**
> + * syscall_enter_from_user_mode - Check and handle work before invoking
> + *				  a syscall
> + * @regs:	Pointer to current's pt_regs
> + * @syscall:	The syscall number
> + *
> + * Invoked from architecture specific syscall entry code with interrupts
> + * disabled. The calling code has to be non-instrumentable. When the
> + * function returns all state is correct and the subsequent functions can be
> + * instrumented.
> + *
> + * Returns: The original or a modified syscall number
> + *
> + * If the returned syscall number is -1 then the syscall should be
> + * skipped. In this case the caller may invoke syscall_set_error() or
> + * syscall_set_return_value() first. If neither of those are called and -1
> + * is returned, then the syscall will fail with ENOSYS.

There's been some recent confusion over "has the syscall changed, or
did seccomp request it be skipped?" that was explored in arm64[2]
(though I see Will and Keno in CC already). There might need to be a
clearer way to distinguish between "wild userspace issued a -1 syscall"
and "seccomp or ptrace asked for the syscall to be skipped". The
difference is mostly about when ENOSYS gets set, with respect to calls
to syscall_set_return_value(), but if the syscall gets changed, the
arch may need to recheck the value and consider ENOSYS, etc. IIUC, what
Will ended up with[3] was having syscall_trace_enter() return the
syscall return value instead of the new syscall.

[2] https://lore.kernel.org/lkml/20200704125027.GB21185@willie-the-truck/
[3] https://lore.kernel.org/lkml/20200703083914.GA18516@willie-the-truck/

> +static long syscall_trace_enter(struct pt_regs *regs, long syscall,
> +				unsigned long ti_work)
> +{
> +	long ret = 0;
> +
> +	/* Handle ptrace */
> +	if (ti_work & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU)) {
> +		ret = arch_syscall_enter_tracehook(regs);
> +		if (ret || (ti_work & _TIF_SYSCALL_EMU))
> +			return -1L;
> +	}
> +
> +	/* Do seccomp after ptrace, to catch any tracer changes. */
> +	if (ti_work & _TIF_SECCOMP) {
> +		ret = arch_syscall_enter_seccomp(regs);
> +		if (ret == -1L)
> +			return ret;
> +	}
> +
> +	if (unlikely(ti_work & _TIF_SYSCALL_TRACEPOINT))
> +		trace_sys_enter(regs, syscall);
> +
> +	arch_syscall_enter_audit(regs);
> +
> +	return ret ? : syscall;
> +}

Modulo the notes about -1 vs the syscall number above, this looks
correct to me for ptrace and seccomp.
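To make that ambiguity concrete, here is a hand-written illustration
(not from the series) of what the arch-side caller sees; nothing in the
return value tells the two -1 cases apart:

        syscall = syscall_enter_from_user_mode(regs, syscall);
        if (syscall == -1L) {
                /*
                 * Either user space genuinely invoked syscall(-1), in
                 * which case the arch must fail it with -ENOSYS, or
                 * seccomp/ptrace requested a skip and may already have
                 * stored a result via syscall_set_return_value().
                 */
        }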
Kees Cook <keescook@chromium.org> writes:
> On Thu, Jul 16, 2020 at 08:22:09PM +0200, Thomas Gleixner wrote:
>> This code is needlessly duplicated and different in all
>> architectures.
>>
>> Provide a generic version based on the x86 implementation which has all the
>> RCU and instrumentation bits right.
>
> Ahh! You're reading my mind!

I told you about that plan at the last conference over a beer :)

> I was just thinking about this while reviewing the proposed syscall
> redirection series[1], pondering the lack of free x86 TIF flags, and
> realizing that nearly everything in that series (and for seccomp and
> other things) didn't need to be arch-specific. And now that series
> absolutely needs to be rebased, and it'll magically work for every
> arch that switches to the generic entry code. :)

That's the plan.

> Notes below...
>
> [1] https://lore.kernel.org/lkml/20200716193141.4068476-2-krisman@collabora.com/

Saw that fly by. *shudder*

>> +/*
>> + * Define dummy _TIF work flags if not defined by the architecture or for
>> + * disabled functionality.
>> + */
>
> When I was thinking about this last week I was pondering having a split
> between the arch-agnostic TIF flags and the arch-specific TIF flags:
> each arch could have a single "there is agnostic work to be done" TIF
> in its thread_info, and the agnostic flags could live in task_struct or
> something. Anyway, I'll keep reading...

That's going to be nasty. We'd rather go and expand the TIF storage to
64bit and then do the following in a generic header:

#ifndef TIF_ARCH_SPECIFIC
# define TIF_ARCH_SPECIFIC
#endif

enum tif_bits {
        TIF_NEED_RESCHED = 0,
        TIF_...,
        TIF_LAST_GENERIC,
        TIF_ARCH_SPECIFIC,
};

and in the arch specific one:

#define TIF_ARCH_SPECIFIC \
        TIF_ARCH_1,       \
        TIF_ARCH_2,

or something like that.

>> +/**
>> + * syscall_enter_from_user_mode - Check and handle work before invoking
>> + *				   a syscall
>> + * @regs:	Pointer to current's pt_regs
>> + * @syscall:	The syscall number
>> + *
>> + * Invoked from architecture specific syscall entry code with interrupts
>> + * disabled. The calling code has to be non-instrumentable. When the
>> + * function returns all state is correct and the subsequent functions can be
>> + * instrumented.
>> + *
>> + * Returns: The original or a modified syscall number
>> + *
>> + * If the returned syscall number is -1 then the syscall should be
>> + * skipped. In this case the caller may invoke syscall_set_error() or
>> + * syscall_set_return_value() first. If neither of those are called and -1
>> + * is returned, then the syscall will fail with ENOSYS.
>
> There's been some recent confusion over "has the syscall changed, or
> did seccomp request it be skipped?" that was explored in arm64[2]
> (though I see Will and Keno in CC already). There might need to be a
> clearer way to distinguish between "wild userspace issued a -1 syscall"
> and "seccomp or ptrace asked for the syscall to be skipped". The
> difference is mostly about when ENOSYS gets set, with respect to calls
> to syscall_set_return_value(), but if the syscall gets changed, the
> arch may need to recheck the value and consider ENOSYS, etc. IIUC, what
> Will ended up with[3] was having syscall_trace_enter() return the
> syscall return value instead of the new syscall.

I was chatting with Will about that yesterday. IIRC he plans to fix the
immediate issue on arm64 first and then move arm64 over to the generic
variant. That's the reason why I reshuffled the patch series so the
generic parts come first, which allows me to provide Will a branch with
just those. If there are any changes needed we can just feed them back
into that branch and fix up the affected architecture trees.

IOW, that should not block progress on this stuff.

Thanks,

        tglx
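The enum trick above compiles stand-alone once one detail is pinned
down: the placeholder must not carry a trailing comma of its own, so
that an empty expansion stays legal C. A minimal sketch, with all bit
names illustrative:

        /* Arch header (optional); supplies its bits with a trailing comma: */
        #define TIF_ARCH_SPECIFIC       \
                TIF_ARCH_NOTSC,         \
                TIF_ARCH_SLD,

        /* Generic header: */
        #ifndef TIF_ARCH_SPECIFIC
        # define TIF_ARCH_SPECIFIC
        #endif

        enum tif_bits {
                TIF_NEED_RESCHED = 0,
                TIF_SIGPENDING,
                TIF_NOTIFY_RESUME,
                TIF_LAST_GENERIC,
                TIF_ARCH_SPECIFIC       /* no comma: expands to nothing or to arch bits */
                TIF_NR_BITS
        };

With the arch define present, TIF_NR_BITS lands after the arch bits;
without it, directly after TIF_LAST_GENERIC.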
On Thu, Jul 16, 2020 at 11:55:59PM +0200, Thomas Gleixner wrote:
> Kees Cook <keescook@chromium.org> writes:
> > On Thu, Jul 16, 2020 at 08:22:09PM +0200, Thomas Gleixner wrote:
> >> This code is needlessly duplicated and different in all
> >> architectures.
> >>
> >> Provide a generic version based on the x86 implementation which has all the
> >> RCU and instrumentation bits right.
> >
> > Ahh! You're reading my mind!
>
> I told you about that plan at the last conference over a beer :)

Thank you for incepting it in my head, then! ;)

> > [1] https://lore.kernel.org/lkml/20200716193141.4068476-2-krisman@collabora.com/
>
> Saw that fly by. *shudder*

Aw, it's nice. Better emulation! :)

> >> +/*
> >> + * Define dummy _TIF work flags if not defined by the architecture or for
> >> + * disabled functionality.
> >> + */
> >
> > When I was thinking about this last week I was pondering having a split
> > between the arch-agnostic TIF flags and the arch-specific TIF flags:
> > each arch could have a single "there is agnostic work to be done" TIF
> > in its thread_info, and the agnostic flags could live in task_struct or
> > something. Anyway, I'll keep reading...
>
> That's going to be nasty. We'd rather go and expand the TIF storage to
> 64bit and then do the following in a generic header:

I thought the point was to make the TIF_WORK check as fast as possible,
even on the 32-bit word systems. I mean, it's not a huge performance
hit, but *shrug*

> #ifndef TIF_ARCH_SPECIFIC
> # define TIF_ARCH_SPECIFIC
> #endif
>
> enum tif_bits {
>         TIF_NEED_RESCHED = 0,
>         TIF_...,
>         TIF_LAST_GENERIC,
>         TIF_ARCH_SPECIFIC,
> };
>
> and in the arch specific one:
>
> #define TIF_ARCH_SPECIFIC \
>         TIF_ARCH_1,       \
>         TIF_ARCH_2,
>
> or something like that.

Okay, yeah, that can work.

> > There's been some recent confusion over "has the syscall changed, or
> > did seccomp request it be skipped?" that was explored in arm64[2]
> > (though I see Will and Keno in CC already). There might need to be a
> > clearer way to distinguish between "wild userspace issued a -1 syscall"
> > and "seccomp or ptrace asked for the syscall to be skipped". The
> > difference is mostly about when ENOSYS gets set, with respect to calls
> > to syscall_set_return_value(), but if the syscall gets changed, the
> > arch may need to recheck the value and consider ENOSYS, etc. IIUC, what
> > Will ended up with[3] was having syscall_trace_enter() return the
> > syscall return value instead of the new syscall.
>
> I was chatting with Will about that yesterday. IIRC he plans to fix the
> immediate issue on arm64 first and then move arm64 over to the generic
> variant. That's the reason why I reshuffled the patch series so the
> generic parts come first, which allows me to provide Will a branch with
> just those. If there are any changes needed we can just feed them back
> into that branch and fix up the affected architecture trees.
>
> IOW, that should not block progress on this stuff.

Ok, great! I just wanted to make sure that didn't surprise anyone. :)
Kees Cook <keescook@chromium.org> writes:
> On Thu, Jul 16, 2020 at 11:55:59PM +0200, Thomas Gleixner wrote:
>> Kees Cook <keescook@chromium.org> writes:
>> >> +/*
>> >> + * Define dummy _TIF work flags if not defined by the architecture or for
>> >> + * disabled functionality.
>> >> + */
>> >
>> > When I was thinking about this last week I was pondering having a split
>> > between the arch-agnostic TIF flags and the arch-specific TIF flags:
>> > each arch could have a single "there is agnostic work to be done" TIF
>> > in its thread_info, and the agnostic flags could live in task_struct or
>> > something. Anyway, I'll keep reading...
>>
>> That's going to be nasty. We'd rather go and expand the TIF storage to
>> 64bit and then do the following in a generic header:
>
> I thought the point was to make the TIF_WORK check as fast as possible,
> even on the 32-bit word systems. I mean, it's not a huge performance
> hit, but *shrug*

For 64bit it's definitely faster to have 64bit flags.

For 32bit it's debatable whether having to fiddle with two words and
taking care about ordering is faster or not. It's a pain on 32bit in
any case.

But what we can do is disentangle the flags into those which really
need to be grouped into a single 32bit TIF word and architecture
specific ones which can live in a different place.

Looking at x86, which has the most pressure on TIF bits:

Real TIF bits

        TIF_SYSCALL_TRACE       Entry/exit
        TIF_SYSCALL_TRACEPOINT  Entry/exit
        TIF_NOTIFY_RESUME       Entry/exit
        TIF_SIGPENDING          Entry/exit
        TIF_NEED_RESCHED        Entry/exit
        TIF_SINGLESTEP          Entry/exit
        TIF_SYSCALL_EMU         Entry/exit
        TIF_SYSCALL_AUDIT       Entry/exit
        TIF_SECCOMP             Entry/exit
        TIF_USER_RETURN_NOTIFY  Entry/exit (x86 specific)
        TIF_UPROBE              Entry/exit
        TIF_PATCH_PENDING       Entry/exit
        TIF_POLLING_NRFLAG      Scheduler (related to TIF_NEED_RESCHED)
        TIF_MEMDIE              Historical, but not required to be a real one
        TIF_FSCHECK             Historical, but not required to be a real one

Context switch group

        TIF_NOCPUID             X86 Context switch
        TIF_NOTSC               X86 Context switch
        TIF_SLD                 X86 Context switch
        TIF_BLOCKSTEP           X86 Context switch
        TIF_IO_BITMAP           X86 Context switch + Entry/exit (see below)
        TIF_NEED_FPU_LOAD       X86 Context switch + Entry/exit (see below)
        TIF_SSBD                X86 Context switch
        TIF_SPEC_IB             X86 Context switch
        TIF_SPEC_FORCE_UPDATE   X86 Context switch

No group requirements

        TIF_IA32                X86 random
        TIF_FORCED_TF           X86 random
        TIF_LAZY_MMU_UPDATES    X86 random
        TIF_ADDR32              X86 random
        TIF_X32                 X86 random

So the only interesting ones are TIF_IO_BITMAP and TIF_NEED_FPU_LOAD,
but those are really not required to be in the real TIF group because
they are independently evaluated _after_ the real TIF flags on exit to
user space, and that requires a reread of the flags anyway. So if we
put the context switch and the random bits into a separate word right
after thread_info->flags, then the second word is in the same cacheline
and it won't matter. That way we gain plenty of free bits on 32 bit and
have no dependency between the two words at all.

The alternative is to play nasty games with TIF_IA32, TIF_ADDR32 and
TIF_X32 to free up bits for 32bit and make the flags field 64 bit on
64 bit kernels, but I prefer to do the above separation.

Thanks,

        tglx
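The two-word layout described above would look roughly like this; the
second field's name is made up and the helper is illustrative:

        struct thread_info {
                unsigned long   flags;          /* entry/exit work bits, atomic updates */
                unsigned long   switch_flags;   /* context switch + "random" bits */
                /* ... arch specific fields ... */
        };

        /* The entry/exit work check still only reads the first word: */
        static inline bool syscall_work_pending(void)
        {
                return READ_ONCE(current_thread_info()->flags) & SYSCALL_ENTER_WORK;
        }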
On Fri, Jul 17, 2020 at 12:29 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Kees Cook <keescook@chromium.org> writes:
> > On Thu, Jul 16, 2020 at 11:55:59PM +0200, Thomas Gleixner wrote:
> >> Kees Cook <keescook@chromium.org> writes:
> >> >> +/*
> >> >> + * Define dummy _TIF work flags if not defined by the architecture or for
> >> >> + * disabled functionality.
> >> >> + */
> >> >
> >> > When I was thinking about this last week I was pondering having a split
> >> > between the arch-agnostic TIF flags and the arch-specific TIF flags:
> >> > each arch could have a single "there is agnostic work to be done" TIF
> >> > in its thread_info, and the agnostic flags could live in task_struct or
> >> > something. Anyway, I'll keep reading...
> >>
> >> That's going to be nasty. We'd rather go and expand the TIF storage to
> >> 64bit and then do the following in a generic header:
> >
> > I thought the point was to make the TIF_WORK check as fast as possible,
> > even on the 32-bit word systems. I mean, it's not a huge performance
> > hit, but *shrug*
>
> For 64bit it's definitely faster to have 64bit flags.
>
> For 32bit it's debatable whether having to fiddle with two words and
> taking care about ordering is faster or not. It's a pain on 32bit in
> any case.
>
> But what we can do is disentangle the flags into those which really
> need to be grouped into a single 32bit TIF word and architecture
> specific ones which can live in a different place.
>
> Looking at x86, which has the most pressure on TIF bits:
>
> Real TIF bits
>
>         TIF_SYSCALL_TRACE       Entry/exit
>         TIF_SYSCALL_TRACEPOINT  Entry/exit
>         TIF_NOTIFY_RESUME       Entry/exit
>         TIF_SIGPENDING          Entry/exit
>         TIF_NEED_RESCHED        Entry/exit
>         TIF_SINGLESTEP          Entry/exit
>         TIF_SYSCALL_EMU         Entry/exit
>         TIF_SYSCALL_AUDIT       Entry/exit
>         TIF_SECCOMP             Entry/exit
>         TIF_USER_RETURN_NOTIFY  Entry/exit (x86 specific)
>         TIF_UPROBE              Entry/exit
>         TIF_PATCH_PENDING       Entry/exit
>         TIF_POLLING_NRFLAG      Scheduler (related to TIF_NEED_RESCHED)
>         TIF_MEMDIE              Historical, but not required to be a real one
>         TIF_FSCHECK             Historical, but not required to be a real one
>
> Context switch group
>
>         TIF_NOCPUID             X86 Context switch
>         TIF_NOTSC               X86 Context switch
>         TIF_SLD                 X86 Context switch
>         TIF_BLOCKSTEP           X86 Context switch
>         TIF_IO_BITMAP           X86 Context switch + Entry/exit (see below)
>         TIF_NEED_FPU_LOAD       X86 Context switch + Entry/exit (see below)
>         TIF_SSBD                X86 Context switch
>         TIF_SPEC_IB             X86 Context switch
>         TIF_SPEC_FORCE_UPDATE   X86 Context switch
>
> No group requirements
>
>         TIF_IA32                X86 random
>         TIF_FORCED_TF           X86 random
>         TIF_LAZY_MMU_UPDATES    X86 random
>         TIF_ADDR32              X86 random
>         TIF_X32                 X86 random
>
> So the only interesting ones are TIF_IO_BITMAP and TIF_NEED_FPU_LOAD,
> but those are really not required to be in the real TIF group because
> they are independently evaluated _after_ the real TIF flags on exit to
> user space, and that requires a reread of the flags anyway. So if we
> put the context switch and the random bits into a separate word right
> after thread_info->flags, then the second word is in the same cacheline
> and it won't matter. That way we gain plenty of free bits on 32 bit and
> have no dependency between the two words at all.
>
> The alternative is to play nasty games with TIF_IA32, TIF_ADDR32 and
> TIF_X32 to free up bits for 32bit and make the flags field 64 bit on
> 64 bit kernels, but I prefer to do the above separation.

I'm all for cleaning it up, but I don't think any nasty games would be
needed regardless. IMO at least the following flags are nonsense and
don't belong in TIF_anything at all:

TIF_IA32, TIF_X32: can probably be deleted. Someone would just need to
finish the work.

TIF_ADDR32: also probably removable, but I'm less confident.

TIF_FORCED_TF: This is purely a ptrace artifact and could easily go
somewhere else entirely.

So getting those five bits back would be straightforward.

FWIW, TIF_USER_RETURN_NOTIFY is a bit of an odd duck: it's an
entry/exit word *and* a context switch word. The latter is because it's
logically a per-cpu flag, not a per-task flag, and the context switch
code moves it around so it's always set on the running task.
TIF_NEED_RESCHED is sort of in this category, too. We could introduce a
percpu entry_exit_work field to simplify this at some small performance
cost. TIF_POLLING_NRFLAG would go along with it. (The latter does not,
strictly speaking, belong as a TIF_ flag at all, but it does need to be
in the same atomic word as TIF_NEED_RESCHED.) Making this change would
arguably be a decent cleanup.

--Andy
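Such a percpu entry_exit_work field might look roughly like this; every
name below is hypothetical, and the small cost comes from the exit path
having to consult a second word besides ti->flags:

        DEFINE_PER_CPU(unsigned long, entry_exit_work);

        #define EEW_USER_RETURN_NOTIFY  BIT(0)

        /*
         * Arming no longer touches task flags, so there is nothing for
         * the context switch code to move between tasks:
         */
        static inline void arm_user_return_work(void)
        {
                raw_cpu_or(entry_exit_work, EEW_USER_RETURN_NOTIFY);
        }

        /* Checked by the exit-to-user path alongside ti->flags: */
        static inline bool cpu_exit_work_pending(void)
        {
                return raw_cpu_read(entry_exit_work) != 0;
        }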
Andy Lutomirski <luto@kernel.org> writes:
> On Fri, Jul 17, 2020 at 12:29 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> The alternative is to play nasty games with TIF_IA32, TIF_ADDR32 and
>> TIF_X32 to free up bits for 32bit and make the flags field 64 bit on
>> 64 bit kernels, but I prefer to do the above separation.
>
> I'm all for cleaning it up, but I don't think any nasty games would be
> needed regardless. IMO at least the following flags are nonsense and
> don't belong in TIF_anything at all:
>
> TIF_IA32, TIF_X32: can probably be deleted. Someone would just need to
> finish the work.
>
> TIF_ADDR32: also probably removable, but I'm less confident.
>
> TIF_FORCED_TF: This is purely a ptrace artifact and could easily go
> somewhere else entirely.
>
> So getting those five bits back would be straightforward.
>
> FWIW, TIF_USER_RETURN_NOTIFY is a bit of an odd duck: it's an
> entry/exit word *and* a context switch word. The latter is because
> it's logically a per-cpu flag, not a per-task flag, and the context
> switch code moves it around so it's always set on the running task.

Gah, I missed the context switch thing of that. That stuff is hideous.

Thanks,

        tglx
On Sat, Jul 18, 2020 at 7:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > On Fri, Jul 17, 2020 at 12:29 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> The alternative is to play nasty games with TIF_IA32, TIF_ADDR32 and
> >> TIF_X32 to free up bits for 32bit and make the flags field 64 bit on
> >> 64 bit kernels, but I prefer to do the above separation.
> >
> > I'm all for cleaning it up, but I don't think any nasty games would be
> > needed regardless. IMO at least the following flags are nonsense and
> > don't belong in TIF_anything at all:
> >
> > TIF_IA32, TIF_X32: can probably be deleted. Someone would just need to
> > finish the work.
> >
> > TIF_ADDR32: also probably removable, but I'm less confident.
> >
> > TIF_FORCED_TF: This is purely a ptrace artifact and could easily go
> > somewhere else entirely.
> >
> > So getting those five bits back would be straightforward.
> >
> > FWIW, TIF_USER_RETURN_NOTIFY is a bit of an odd duck: it's an
> > entry/exit word *and* a context switch word. The latter is because
> > it's logically a per-cpu flag, not a per-task flag, and the context
> > switch code moves it around so it's always set on the running task.
>
> Gah, I missed the context switch thing of that. That stuff is hideous.

It's also delightful because anything that screws up that dance (such
as failure to do the exit-to-usermode path exactly right) likely
results in an insta-root-hole. If we fail to run user return notifiers,
we can run user code with incorrect syscall MSRs, etc.
Andy Lutomirski <luto@kernel.org> writes:
> On Sat, Jul 18, 2020 at 7:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Andy Lutomirski <luto@kernel.org> writes:
>> > FWIW, TIF_USER_RETURN_NOTIFY is a bit of an odd duck: it's an
>> > entry/exit word *and* a context switch word. The latter is because
>> > it's logically a per-cpu flag, not a per-task flag, and the context
>> > switch code moves it around so it's always set on the running task.
>>
>> Gah, I missed the context switch thing of that. That stuff is hideous.
>
> It's also delightful because anything that screws up that dance (such
> as failure to do the exit-to-usermode path exactly right) likely
> results in an insta-root-hole. If we fail to run user return
> notifiers, we can run user code with incorrect syscall MSRs, etc.

Looking at it deeper, having that thing in the loop is a pointless
exercise. This really wants to be done _after_ the loop.

Thanks,

        tglx
> On Jul 19, 2020, at 3:17 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
>>> On Sat, Jul 18, 2020 at 7:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>> Andy Lutomirski <luto@kernel.org> writes:
>>>> FWIW, TIF_USER_RETURN_NOTIFY is a bit of an odd duck: it's an
>>>> entry/exit word *and* a context switch word. The latter is because
>>>> it's logically a per-cpu flag, not a per-task flag, and the context
>>>> switch code moves it around so it's always set on the running task.
>>>
>>> Gah, I missed the context switch thing of that. That stuff is hideous.
>>
>> It's also delightful because anything that screws up that dance (such
>> as failure to do the exit-to-usermode path exactly right) likely
>> results in an insta-root-hole. If we fail to run user return
>> notifiers, we can run user code with incorrect syscall MSRs, etc.
>
> Looking at it deeper, having that thing in the loop is a pointless
> exercise. This really wants to be done _after_ the loop.

As long as we're confident that nothing after the loop can set the flag
again.

> Thanks,
>
>         tglx
Andy Lutomirski <luto@amacapital.net> writes:
>> On Jul 19, 2020, at 3:17 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Andy Lutomirski <luto@kernel.org> writes:
>>>> On Sat, Jul 18, 2020 at 7:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>>> Andy Lutomirski <luto@kernel.org> writes:
>>>>> FWIW, TIF_USER_RETURN_NOTIFY is a bit of an odd duck: it's an
>>>>> entry/exit word *and* a context switch word. The latter is because
>>>>> it's logically a per-cpu flag, not a per-task flag, and the context
>>>>> switch code moves it around so it's always set on the running task.
>>>>
>>>> Gah, I missed the context switch thing of that. That stuff is hideous.
>>>
>>> It's also delightful because anything that screws up that dance (such
>>> as failure to do the exit-to-usermode path exactly right) likely
>>> results in an insta-root-hole. If we fail to run user return
>>> notifiers, we can run user code with incorrect syscall MSRs, etc.
>>
>> Looking at it deeper, having that thing in the loop is a pointless
>> exercise. This really wants to be done _after_ the loop.
>
> As long as we're confident that nothing after the loop can set the flag
> again.

Yes, because that's the direct way off to user space.

Thanks,

        tglx
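In sketch form, the restructuring being discussed; the loop helper and
work mask here only approximate what the generic exit code would look
like and are not from the posted patch:

        static void exit_to_user_mode_prepare(struct pt_regs *regs)
        {
                unsigned long ti_work = READ_ONCE(current_thread_info()->flags);

                /* Loop until all pending exit work is handled */
                while (ti_work & EXIT_TO_USER_WORK)     /* hypothetical mask */
                        ti_work = exit_to_user_mode_loop(regs, ti_work);

                /*
                 * Interrupts stay disabled from here to the return to
                 * user space, i.e. this is the direct way off to user
                 * space, so the flag cannot be set again behind our
                 * back. Handle it once instead of on every iteration.
                 */
                if (ti_work & _TIF_USER_RETURN_NOTIFY)
                        fire_user_return_notifiers();
        }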
On Thu, Jul 16, 2020 at 12:50 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> From: Thomas Gleixner <tglx@linutronix.de>
>
> On syscall entry certain work needs to be done:
>
>    - Establish state (lockdep, context tracking, tracing)
>    - Conditional work (ptrace, seccomp, audit...)
>
> This code is needlessly duplicated and different in all
> architectures.
>
> +
> +/**
> + * arch_syscall_enter_seccomp - Architecture specific seccomp invocation
> + * @regs:	Pointer to current's pt_regs
> + *
> + * Returns: The original or a modified syscall number
> + *
> + * Invoked from syscall_enter_from_user_mode(). Can be replaced by
> + * architecture specific code.
> + */
> +static inline long arch_syscall_enter_seccomp(struct pt_regs *regs);

Ick. I'd rather see arch_populate_seccomp_data() and kill this hook.
But we can clean this up later.

> +/**
> + * arch_syscall_enter_audit - Architecture specific audit invocation
> + * @regs:	Pointer to current's pt_regs
> + *
> + * Invoked from syscall_enter_from_user_mode(). Must be replaced by
> + * architecture specific code if the architecture supports audit.
> + */
> +static inline void arch_syscall_enter_audit(struct pt_regs *regs);
> +

Let's pass u32 arch here.

> +/**
> + * syscall_enter_from_user_mode - Check and handle work before invoking
> + *				  a syscall
> + * @regs:	Pointer to current's pt_regs
> + * @syscall:	The syscall number
> + *
> + * Invoked from architecture specific syscall entry code with interrupts
> + * disabled. The calling code has to be non-instrumentable. When the
> + * function returns all state is correct and the subsequent functions can be
> + * instrumented.
> + *
> + * Returns: The original or a modified syscall number
> + *
> + * If the returned syscall number is -1 then the syscall should be
> + * skipped. In this case the caller may invoke syscall_set_error() or
> + * syscall_set_return_value() first. If neither of those are called and -1
> + * is returned, then the syscall will fail with ENOSYS.
> + *
> + * The following functionality is handled here:
> + *
> + *  1) Establish state (lockdep, RCU (context tracking), tracing)
> + *  2) TIF flag dependent invocations of arch_syscall_enter_tracehook(),
> + *     arch_syscall_enter_seccomp(), trace_sys_enter()
> + *  3) Invocation of arch_syscall_enter_audit()
> + */
> +long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);

This should IMO also take u32 arch.

I'm also uneasy about having this do the idtentry/irqentry stuff as
well as the syscall stuff. Is there a particular reason you did it this
way instead of having callers do:

    idtentry_enter();
    instrumentation_begin();
    syscall_enter_from_user_mode();

FWIW, I think we could make this even better -- couldn't this get
folded together with syscall *exit* and become:

    idtentry_enter();
    instrumentation_begin();
    generic_syscall();
    instrumentation_end();
    idtentry_exit();

and generic_syscall() would call arch_dispatch_syscall(regs, arch,
syscall_nr)?

> +
> +/**
> + * irqentry_enter_from_user_mode - Establish state before invoking the irq handler
> + * @regs:	Pointer to current's pt_regs
> + *
> + * Invoked from architecture specific entry code with interrupts disabled.
> + * Can only be called when the interrupt entry came from user mode. The
> + * calling code must be non-instrumentable. When the function returns all
> + * state is correct and the subsequent functions can be instrumented.
> + *
> + * The function establishes state (lockdep, RCU (context tracking), tracing)
> + */
> +void irqentry_enter_from_user_mode(struct pt_regs *regs);

Unless the rest of the series works differently from what I expect, I
don't love this name. How about normal_entry_from_user_mode() or
ordinary_entry_from_user_mode()? After all, this seems to cover IRQ,
all the non-horrible exceptions, and (internally) syscalls.

--Andy
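Folded together along those lines, the shape might be something like
the following; every name is hypothetical, following the pseudocode
above, and it assumes a syscall_enter_from_user_mode() stripped of the
idtentry/instrumentation work, which is exactly the point of the
question:

        noinstr void generic_syscall(struct pt_regs *regs, u32 arch, long nr)
        {
                idtentry_enter(regs);
                instrumentation_begin();

                nr = syscall_enter_from_user_mode(regs, nr);
                if (nr != -1L)
                        arch_dispatch_syscall(regs, arch, nr);
                syscall_exit_work(regs);        /* hypothetical exit-side twin */

                instrumentation_end();
                idtentry_exit(regs);
        }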
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -27,6 +27,9 @@ config HAVE_IMA_KEXEC
 config HOTPLUG_SMT
 	bool
 
+config GENERIC_ENTRY
+	bool
+
 config OPROFILE
 	tristate "OProfile system profiling"
 	depends on PROFILING
--- /dev/null
+++ b/include/linux/entry-common.h
@@ -0,0 +1,156 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_ENTRYCOMMON_H
+#define __LINUX_ENTRYCOMMON_H
+
+#include <linux/tracehook.h>
+#include <linux/syscalls.h>
+#include <linux/seccomp.h>
+#include <linux/sched.h>
+
+#include <asm/entry-common.h>
+
+/*
+ * Define dummy _TIF work flags if not defined by the architecture or for
+ * disabled functionality.
+ */
+#ifndef _TIF_SYSCALL_TRACE
+# define _TIF_SYSCALL_TRACE		(0)
+#endif
+
+#ifndef _TIF_SYSCALL_EMU
+# define _TIF_SYSCALL_EMU		(0)
+#endif
+
+#ifndef _TIF_SYSCALL_TRACEPOINT
+# define _TIF_SYSCALL_TRACEPOINT	(0)
+#endif
+
+#ifndef _TIF_SECCOMP
+# define _TIF_SECCOMP			(0)
+#endif
+
+#ifndef _TIF_SYSCALL_AUDIT
+# define _TIF_SYSCALL_AUDIT		(0)
+#endif
+
+/*
+ * TIF flags handled in syscall_enter_from_user_mode()
+ */
+#ifndef ARCH_SYSCALL_ENTER_WORK
+# define ARCH_SYSCALL_ENTER_WORK	(0)
+#endif
+
+#define SYSCALL_ENTER_WORK						\
+	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP |	\
+	 _TIF_SYSCALL_TRACEPOINT | _TIF_SYSCALL_EMU |			\
+	 ARCH_SYSCALL_ENTER_WORK)
+
+/**
+ * arch_check_user_regs - Architecture specific sanity check for user mode regs
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Defaults to an empty implementation. Can be replaced by architecture
+ * specific code.
+ *
+ * Invoked from syscall_enter_from_user_mode() in the non-instrumentable
+ * section. Use __always_inline so the compiler cannot push it out of line
+ * and make it instrumentable.
+ */
+static __always_inline void arch_check_user_regs(struct pt_regs *regs);
+
+#ifndef arch_check_user_regs
+static __always_inline void arch_check_user_regs(struct pt_regs *regs) {}
+#endif
+
+/**
+ * arch_syscall_enter_tracehook - Wrapper around tracehook_report_syscall_entry()
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Returns: 0 on success or an error code to skip the syscall.
+ *
+ * Defaults to tracehook_report_syscall_entry(). Can be replaced by
+ * architecture specific code.
+ *
+ * Invoked from syscall_enter_from_user_mode()
+ */
+static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs);
+
+#ifndef arch_syscall_enter_tracehook
+static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs)
+{
+	return tracehook_report_syscall_entry(regs);
+}
+#endif
+
+/**
+ * arch_syscall_enter_seccomp - Architecture specific seccomp invocation
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Returns: The original or a modified syscall number
+ *
+ * Invoked from syscall_enter_from_user_mode(). Can be replaced by
+ * architecture specific code.
+ */
+static inline long arch_syscall_enter_seccomp(struct pt_regs *regs);

+#ifndef arch_syscall_enter_seccomp
+static inline long arch_syscall_enter_seccomp(struct pt_regs *regs)
+{
+	return secure_computing(NULL);
+}
+#endif
+
+/**
+ * arch_syscall_enter_audit - Architecture specific audit invocation
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Invoked from syscall_enter_from_user_mode(). Must be replaced by
+ * architecture specific code if the architecture supports audit.
+ */
+static inline void arch_syscall_enter_audit(struct pt_regs *regs);
+
+#ifndef arch_syscall_enter_audit
+static inline void arch_syscall_enter_audit(struct pt_regs *regs) { }
+#endif
+
+/**
+ * syscall_enter_from_user_mode - Check and handle work before invoking
+ *				  a syscall
+ * @regs:	Pointer to current's pt_regs
+ * @syscall:	The syscall number
+ *
+ * Invoked from architecture specific syscall entry code with interrupts
+ * disabled. The calling code has to be non-instrumentable. When the
+ * function returns all state is correct and the subsequent functions can be
+ * instrumented.
+ *
+ * Returns: The original or a modified syscall number
+ *
+ * If the returned syscall number is -1 then the syscall should be
+ * skipped. In this case the caller may invoke syscall_set_error() or
+ * syscall_set_return_value() first. If neither of those are called and -1
+ * is returned, then the syscall will fail with ENOSYS.
+ *
+ * The following functionality is handled here:
+ *
+ *  1) Establish state (lockdep, RCU (context tracking), tracing)
+ *  2) TIF flag dependent invocations of arch_syscall_enter_tracehook(),
+ *     arch_syscall_enter_seccomp(), trace_sys_enter()
+ *  3) Invocation of arch_syscall_enter_audit()
+ */
+long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);
+
+/**
+ * irqentry_enter_from_user_mode - Establish state before invoking the irq handler
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Invoked from architecture specific entry code with interrupts disabled.
+ * Can only be called when the interrupt entry came from user mode. The
+ * calling code must be non-instrumentable. When the function returns all
+ * state is correct and the subsequent functions can be instrumented.
+ *
+ * The function establishes state (lockdep, RCU (context tracking), tracing)
+ */
+void irqentry_enter_from_user_mode(struct pt_regs *regs);
+
+#endif
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -48,6 +48,7 @@ obj-y += irq/
 obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
+obj-y += entry/
 
 obj-$(CONFIG_CHECKPOINT_RESTORE) += kcmp.o
 obj-$(CONFIG_FREEZER) += freezer.o
--- /dev/null
+++ b/kernel/entry/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_GENERIC_ENTRY) += common.o
--- /dev/null
+++ b/kernel/entry/common.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/context_tracking.h>
+#include <linux/entry-common.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/syscalls.h>
+
+/**
+ * enter_from_user_mode - Establish state when coming from user mode
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Syscall/interrupt entry disables interrupts, but user mode is traced as
+ * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
+ *
+ * 1) Tell lockdep that interrupts are disabled
+ * 2) Invoke context tracking if enabled to reactivate RCU
+ * 3) Trace interrupts off state
+ */
+static __always_inline void enter_from_user_mode(struct pt_regs *regs)
+{
+	arch_check_user_regs(regs);
+	lockdep_hardirqs_off(CALLER_ADDR0);
+
+	CT_WARN_ON(ct_state() != CONTEXT_USER);
+	user_exit_irqoff();
+
+	instrumentation_begin();
+	trace_hardirqs_off_finish();
+	instrumentation_end();
+}
+
+static long syscall_trace_enter(struct pt_regs *regs, long syscall,
+				unsigned long ti_work)
+{
+	long ret = 0;
+
+	/* Handle ptrace */
+	if (ti_work & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU)) {
+		ret = arch_syscall_enter_tracehook(regs);
+		if (ret || (ti_work & _TIF_SYSCALL_EMU))
+			return -1L;
+	}
+
+	/* Do seccomp after ptrace, to catch any tracer changes. */
+	if (ti_work & _TIF_SECCOMP) {
+		ret = arch_syscall_enter_seccomp(regs);
+		if (ret == -1L)
+			return ret;
+	}
+
+	if (unlikely(ti_work & _TIF_SYSCALL_TRACEPOINT))
+		trace_sys_enter(regs, syscall);
+
+	arch_syscall_enter_audit(regs);
+
+	return ret ? : syscall;
+}
+
+noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
+{
+	unsigned long ti_work;
+
+	enter_from_user_mode(regs);
+	instrumentation_begin();
+
+	local_irq_enable();
+	ti_work = READ_ONCE(current_thread_info()->flags);
+	if (ti_work & SYSCALL_ENTER_WORK)
+		syscall = syscall_trace_enter(regs, syscall, ti_work);
+	instrumentation_end();
+
+	return syscall;
+}
+
+noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
+{
+	enter_from_user_mode(regs);
+}
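For orientation, a sketch of how a converted architecture's syscall
entry might use the new helper, loosely modeled on x86's
do_syscall_64(); the arch-side names and the -ENOSYS preload are
assumptions about the architecture, not part of this patch:

        noinstr void arch_do_syscall(struct pt_regs *regs, long nr)
        {
                nr = syscall_enter_from_user_mode(regs, nr);

                instrumentation_begin();
                if (likely((unsigned long)nr < NR_syscalls)) {
                        regs->ax = sys_call_table[nr](regs);
                } else if (nr != -1L) {
                        /* Out-of-range syscall number from user space */
                        regs->ax = -ENOSYS;
                }
                /*
                 * nr == -1: skip requested. seccomp/ptrace may have
                 * stored a return value via syscall_set_return_value();
                 * otherwise the -ENOSYS that x86's entry asm preloads
                 * into regs->ax stands.
                 */
                instrumentation_end();

                /* exit work (signals, resched, audit exit, ...) follows here */
        }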