Message ID: 3f649f5658a163645e3ce15156176c325283762e.1405992946.git.luto@amacapital.net (mailing list archive)
State: New, archived
Hi Andy,

I am really sorry for the delay.

This is on top of the recent change from Kees, right? Could you remind me
where I can find the tree this series is based on? So that I could actually
apply these changes...

On 07/21, Andy Lutomirski wrote:
>
> +long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
> +				unsigned long phase1_result)
> {
> 	long ret = 0;
> +	u32 work = ACCESS_ONCE(current_thread_info()->flags) &
> +		_TIF_WORK_SYSCALL_ENTRY;
> +
> +	BUG_ON(regs != task_pt_regs(current));
>
> 	user_exit();
>
> @@ -1458,17 +1562,20 @@ long syscall_trace_enter(struct pt_regs *regs)
> 	 * do_debug() and we need to set it again to restore the user
> 	 * state. If we entered on the slow path, TF was already set.
> 	 */
> -	if (test_thread_flag(TIF_SINGLESTEP))
> +	if (work & _TIF_SINGLESTEP)
> 		regs->flags |= X86_EFLAGS_TF;

This looks suspicious, but perhaps I misread this change.

If I understand correctly, syscall_trace_enter() can avoid _phase2() above.
But we should always call user_exit() unconditionally?

And we should always set X86_EFLAGS_TF if TIF_SINGLESTEP? IIRC, TF can
actually be cleared on a 32bit kernel if we step over the sysenter insn?

Oleg.
On Mon, Jul 28, 2014 at 10:37 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> Hi Andy,
>
> I am really sorry for the delay.
>
> This is on top of the recent change from Kees, right? Could you remind me
> where I can find the tree this series is based on? So that I could actually
> apply these changes...

https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath

The first four patches are already applied there.

>
> On 07/21, Andy Lutomirski wrote:
>>
>> +long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
>> +				unsigned long phase1_result)
>> {
>> 	long ret = 0;
>> +	u32 work = ACCESS_ONCE(current_thread_info()->flags) &
>> +		_TIF_WORK_SYSCALL_ENTRY;
>> +
>> +	BUG_ON(regs != task_pt_regs(current));
>>
>> 	user_exit();
>>
>> @@ -1458,17 +1562,20 @@ long syscall_trace_enter(struct pt_regs *regs)
>> 	 * do_debug() and we need to set it again to restore the user
>> 	 * state. If we entered on the slow path, TF was already set.
>> 	 */
>> -	if (test_thread_flag(TIF_SINGLESTEP))
>> +	if (work & _TIF_SINGLESTEP)
>> 		regs->flags |= X86_EFLAGS_TF;
>
> This looks suspicious, but perhaps I misread this change.
>
> If I understand correctly, syscall_trace_enter() can avoid _phase2() above.
> But we should always call user_exit() unconditionally?

Damnit. I read every function called by user_exit, and none of
them give any indication of why they're needed for traced syscalls but
not for untraced syscalls. On a second look, it seems that TIF_NOHZ
controls it. I'll update the code to call user_exit iff TIF_NOHZ is
set.

If that's still wrong, then I don't see how the current code is correct either.

>
> And we should always set X86_EFLAGS_TF if TIF_SINGLESTEP? IIRC, TF can be
> actually cleared on a 32bit kernel if we step over sysenter insn?

I don't follow. If TIF_SINGLESTEP, then phase1 will return a nonzero
value, and phase2 will set TF. I admit I don't really understand all
the TF machinations.

--Andy
On 07/28, Andy Lutomirski wrote:
>
> On Mon, Jul 28, 2014 at 10:37 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > Hi Andy,
> >
> > I am really sorry for the delay.
> >
> > This is on top of the recent change from Kees, right? Could you remind me
> > where I can find the tree this series is based on? So that I could actually
> > apply these changes...
>
> https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath
>
> The first four patches are already applied there.

Thanks!

> > If I understand correctly, syscall_trace_enter() can avoid _phase2() above.
> > But we should always call user_exit() unconditionally?
>
> Damnit. I read every function called by user_exit, and none of
> them give any indication of why they're needed for traced syscalls but
> not for untraced syscalls. On a second look, it seems that TIF_NOHZ
> controls it.

Yes, just to trigger the slow path, I guess.

> I'll update the code to call user_exit iff TIF_NOHZ is
> set.

Or perhaps it would be better to not add another user of this (strange) flag
and just call user_exit() unconditionally. But, yes, you need to use
"work = flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ)" then.

> > And we should always set X86_EFLAGS_TF if TIF_SINGLESTEP? IIRC, TF can be
> > actually cleared on a 32bit kernel if we step over sysenter insn?
>
> I don't follow. If TIF_SINGLESTEP, then phase1 will return a nonzero
> value,

Ah yes, thanks, I missed this.

Oleg.
On Tue, Jul 29, 2014 at 9:54 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 07/28, Andy Lutomirski wrote:
>>
>> On Mon, Jul 28, 2014 at 10:37 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> > Hi Andy,
>> >
>> > I am really sorry for the delay.
>> >
>> > This is on top of the recent change from Kees, right? Could you remind me
>> > where I can find the tree this series is based on? So that I could actually
>> > apply these changes...
>>
>> https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/fastpath
>>
>> The first four patches are already applied there.
>
> Thanks!
>
>> > If I understand correctly, syscall_trace_enter() can avoid _phase2() above.
>> > But we should always call user_exit() unconditionally?
>>
>> Damnit. I read every function called by user_exit, and none of
>> them give any indication of why they're needed for traced syscalls but
>> not for untraced syscalls. On a second look, it seems that TIF_NOHZ
>> controls it.
>
> Yes, just to trigger the slow path, I guess.
>
>> I'll update the code to call user_exit iff TIF_NOHZ is
>> set.
>
> Or perhaps it would be better to not add another user of this (strange) flag
> and just call user_exit() unconditionally. But, yes, you need to use
> "work = flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ)" then.

user_exit looks slow enough to me that a branch to try to avoid it may
be worthwhile. I bet that explicitly checking the flag is actually
both faster and clearer.

That's what I did for v4.

--Andy

>
>> > And we should always set X86_EFLAGS_TF if TIF_SINGLESTEP? IIRC, TF can be
>> > actually cleared on a 32bit kernel if we step over sysenter insn?
>>
>> I don't follow. If TIF_SINGLESTEP, then phase1 will return a nonzero
>> value,
>
> Ah yes, thanks, I missed this.
>
> Oleg.
On 07/29, Andy Lutomirski wrote:
>
> On Tue, Jul 29, 2014 at 9:54 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Yes, just to trigger the slow path, I guess.
> >
> >> I'll update the code to call user_exit iff TIF_NOHZ is
> >> set.
> >
> > Or perhaps it would be better to not add another user of this (strange) flag
> > and just call user_exit() unconditionally. But, yes, you need to use
> > "work = flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ)" then.
>
> user_exit looks slow enough to me that a branch to try to avoid it may
> be worthwhile. I bet that explicitly checking the flag is
> actually both faster and clearer.

I don't think so (unless I am confused again), note that user_exit() uses
jump label. But this doesn't matter. I meant that we should avoid TIF_NOHZ
if possible because I think it should die somehow (currently I do not know
how ;). And because it is ugly to check the same condition twice:

	if (work & TIF_NOHZ) {
		// user_exit()
		if (context_tracking_is_enabled())
			context_tracking_user_exit();
	}

TIF_NOHZ is set if and only if context_tracking_is_enabled() is true.
So I think that

	work = current_thread_info()->flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ);

	user_exit();

looks a bit better. But I won't argue.

> That's what I did for v4.

I am going to read it today. Not that I think I can help or find something
wrong.

Oleg.
On Tue, Jul 29, 2014 at 10:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 07/29, Andy Lutomirski wrote:
>>
>> On Tue, Jul 29, 2014 at 9:54 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >
>> > Yes, just to trigger the slow path, I guess.
>> >
>> >> I'll update the code to call user_exit iff TIF_NOHZ is
>> >> set.
>> >
>> > Or perhaps it would be better to not add another user of this (strange) flag
>> > and just call user_exit() unconditionally. But, yes, you need to use
>> > "work = flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ)" then.
>>
>> user_exit looks slow enough to me that a branch to try to avoid it may
>> be worthwhile. I bet that explicitly checking the flag is
>> actually both faster and clearer.
>
> I don't think so (unless I am confused again), note that user_exit() uses
> jump label. But this doesn't matter. I meant that we should avoid TIF_NOHZ
> if possible because I think it should die somehow (currently I do not know
> how ;). And because it is ugly to check the same condition twice:
>
>	if (work & TIF_NOHZ) {
>		// user_exit()
>		if (context_tracking_is_enabled())
>			context_tracking_user_exit();
>	}
>
> TIF_NOHZ is set if and only if context_tracking_is_enabled() is true.
> So I think that
>
>	work = current_thread_info()->flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ);
>
>	user_exit();
>
> looks a bit better. But I won't argue.

I don't get it. context_tracking_is_enabled is global, and TIF_NOHZ
is per-task. Isn't this stuff determined per-task or per-cpu or
something?

IOW, if one CPU is running something that's very heavily
userspace-oriented and another CPU is doing something syscall- or
sleep-heavy, then shouldn't only the first CPU end up paying the price
of context tracking?

>
>> That's what I did for v4.
>
> I am going to read it today. Not that I think I can help or find something
> wrong.
>
> Oleg.
On 07/29, Andy Lutomirski wrote:
>
> On Tue, Jul 29, 2014 at 10:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > I don't think so (unless I am confused again), note that user_exit() uses
> > jump label. But this doesn't matter. I meant that we should avoid TIF_NOHZ
> > if possible because I think it should die somehow (currently I do not know
> > how ;). And because it is ugly to check the same condition twice:
> >
> >	if (work & TIF_NOHZ) {
> >		// user_exit()
> >		if (context_tracking_is_enabled())
> >			context_tracking_user_exit();
> >	}
> >
> > TIF_NOHZ is set if and only if context_tracking_is_enabled() is true.
> > So I think that
> >
> >	work = current_thread_info()->flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ);
> >
> >	user_exit();
> >
> > looks a bit better. But I won't argue.
>
> I don't get it.

Don't worry, you are not alone.

> context_tracking_is_enabled is global, and TIF_NOHZ
> is per-task. Isn't this stuff determined per-task or per-cpu or
> something?
>
> IOW, if one CPU is running something that's very heavily
> userspace-oriented and another CPU is doing something syscall- or
> sleep-heavy, then shouldn't only the first CPU end up paying the price
> of context tracking?

Please see another email I sent to Frederic.

Oleg.
On Tue, Jul 29, 2014 at 11:16 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 07/29, Andy Lutomirski wrote:
>>
>> On Tue, Jul 29, 2014 at 10:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> >
>> > TIF_NOHZ is set if and only if context_tracking_is_enabled() is true.
>> > So I think that
>> >
>> >	work = current_thread_info()->flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ);
>> >
>> >	user_exit();
>> >
>> > looks a bit better. But I won't argue.
>>
>> I don't get it.
>
> Don't worry, you are not alone.
>
>> context_tracking_is_enabled is global, and TIF_NOHZ
>> is per-task. Isn't this stuff determined per-task or per-cpu or
>> something?
>>
>> IOW, if one CPU is running something that's very heavily
>> userspace-oriented and another CPU is doing something syscall- or
>> sleep-heavy, then shouldn't only the first CPU end up paying the price
>> of context tracking?
>
> Please see another email I sent to Frederic.
>

I'll add at least this argument in favor of my approach: if context
tracking works at all, then it had better not demand that syscall
entry call user_exit if TIF_NOHZ is *not* set. So adding the
condition ought to be safe, barring dumb bugs in my code.

--Andy

> Oleg.
On 07/29, Andy Lutomirski wrote:
>
> On Tue, Jul 29, 2014 at 11:16 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > On 07/29, Andy Lutomirski wrote:
> >>
> >> On Tue, Jul 29, 2014 at 10:31 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >> >
> >> > TIF_NOHZ is set if and only if context_tracking_is_enabled() is true.
> >> > So I think that
> >> >
> >> >	work = current_thread_info()->flags & (_TIF_WORK_SYSCALL_ENTRY & ~TIF_NOHZ);
> >> >
> >> >	user_exit();
> >> >
> >> > looks a bit better. But I won't argue.
> >>
> >> I don't get it.
> >
> > Don't worry, you are not alone.
> >
> >> context_tracking_is_enabled is global, and TIF_NOHZ
> >> is per-task. Isn't this stuff determined per-task or per-cpu or
> >> something?
> >>
> >> IOW, if one CPU is running something that's very heavily
> >> userspace-oriented and another CPU is doing something syscall- or
> >> sleep-heavy, then shouldn't only the first CPU end up paying the price
> >> of context tracking?
> >
> > Please see another email I sent to Frederic.
>
> I'll add at least this argument in favor of my approach: if context
> tracking works at all, then it had better not demand that syscall
> entry call user_exit if TIF_NOHZ is *not* set.

I disagree. At least I disagree that you should enforce this in the
syscall_trace_enter() paths, and in any case this has nothing to do with
these changes. But again, I won't insist, so please forget.

> So adding the
> condition ought to be safe, barring dumb bugs in my code.

Yes, I think it is technically correct.

Oleg.
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 6205f0c..86fc2bb 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -75,6 +75,11 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
 extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 			 int error_code, int si_code);
 
+
+extern unsigned long syscall_trace_enter_phase1(struct pt_regs *, u32 arch);
+extern long syscall_trace_enter_phase2(struct pt_regs *, u32 arch,
+				       unsigned long phase1_result);
+
 extern long syscall_trace_enter(struct pt_regs *);
 extern void syscall_trace_leave(struct pt_regs *);
 
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 39296d2..9ec6972 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1441,13 +1441,117 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
 	force_sig_info(SIGTRAP, &info, tsk);
 }
 
+static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
+{
+#ifdef CONFIG_X86_64
+	if (arch == AUDIT_ARCH_X86_64) {
+		audit_syscall_entry(arch, regs->orig_ax, regs->di,
+				    regs->si, regs->dx, regs->r10);
+	} else
+#endif
+	{
+		audit_syscall_entry(arch, regs->orig_ax, regs->bx,
+				    regs->cx, regs->dx, regs->si);
+	}
+}
+
 /*
- * We must return the syscall number to actually look up in the table.
- * This can be -1L to skip running any syscall at all.
+ * We can return 0 to resume the syscall or anything else to go to phase
+ * 2.  If we resume the syscall, we need to put something appropriate in
+ * regs->orig_ax.
+ *
+ * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
+ * are fully functional.
+ *
+ * For phase 2's benefit, our return value is:
+ * 0:	resume the syscall
+ * 1:	go to phase 2; no seccomp phase 2 needed
+ * 2:	go to phase 2; pass return value to seccomp
  */
-long syscall_trace_enter(struct pt_regs *regs)
+unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
+{
+	unsigned long ret = 0;
+	u32 work;
+
+	BUG_ON(regs != task_pt_regs(current));
+
+	work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
+
+#ifdef CONFIG_SECCOMP
+	/*
+	 * Do seccomp first -- it should minimize exposure of other
+	 * code, and keeping seccomp fast is probably more valuable
+	 * than the rest of this.
+	 */
+	if (work & _TIF_SECCOMP) {
+		struct seccomp_data sd;
+
+		sd.arch = arch;
+		sd.nr = regs->orig_ax;
+		sd.instruction_pointer = regs->ip;
+#ifdef CONFIG_X86_64
+		if (arch == AUDIT_ARCH_X86_64) {
+			sd.args[0] = regs->di;
+			sd.args[1] = regs->si;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->r10;
+			sd.args[4] = regs->r8;
+			sd.args[5] = regs->r9;
+		} else
+#endif
+		{
+			sd.args[0] = regs->bx;
+			sd.args[1] = regs->cx;
+			sd.args[2] = regs->dx;
+			sd.args[3] = regs->si;
+			sd.args[4] = regs->di;
+			sd.args[5] = regs->bp;
+		}
+
+		BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
+		BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);
+
+		ret = seccomp_phase1(&sd);
+		if (ret == SECCOMP_PHASE1_SKIP) {
+			regs->orig_ax = -1;
+			ret = 0;
+		} else if (ret != SECCOMP_PHASE1_OK) {
+			return ret;  /* Go directly to phase 2 */
+		}
+
+		work &= ~_TIF_SECCOMP;
+	}
+#endif
+
+	/* Do our best to finish without phase 2. */
+	if (work == 0)
+		return ret;  /* seccomp only (ret == 0 here) */
+
+#ifdef CONFIG_AUDITSYSCALL
+	if (work == _TIF_SYSCALL_AUDIT) {
+		/*
+		 * If there is no more work to be done except auditing,
+		 * then audit in phase 1.  Phase 2 always audits, so, if
+		 * we audit here, then we can't go on to phase 2.
+		 */
+		do_audit_syscall_entry(regs, arch);
+		return 0;
+	}
+#endif
+
+	return 1;  /* Something is enabled that we can't handle in phase 1 */
+}
+
+/* Returns the syscall nr to run (which should match regs->orig_ax). */
+long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
+				unsigned long phase1_result)
 {
 	long ret = 0;
+	u32 work = ACCESS_ONCE(current_thread_info()->flags) &
+		_TIF_WORK_SYSCALL_ENTRY;
+
+	BUG_ON(regs != task_pt_regs(current));
 
 	user_exit();
 
@@ -1458,17 +1562,20 @@ long syscall_trace_enter(struct pt_regs *regs)
 	 * do_debug() and we need to set it again to restore the user
 	 * state.  If we entered on the slow path, TF was already set.
 	 */
-	if (test_thread_flag(TIF_SINGLESTEP))
+	if (work & _TIF_SINGLESTEP)
 		regs->flags |= X86_EFLAGS_TF;
 
-	/* do the secure computing check first */
-	if (secure_computing()) {
+	/*
+	 * Call seccomp_phase2 before running the other hooks so that
+	 * they can see any changes made by a seccomp tracer.
+	 */
+	if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
 		/* seccomp failures shouldn't expose any additional code. */
 		ret = -1L;
 		goto out;
 	}
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_EMU)))
+	if (unlikely(work & _TIF_SYSCALL_EMU))
 		ret = -1L;
 
 	if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
@@ -1478,23 +1585,23 @@
 	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
 		trace_sys_enter(regs, regs->orig_ax);
 
-	if (is_ia32_task())
-		audit_syscall_entry(AUDIT_ARCH_I386,
-				    regs->orig_ax,
-				    regs->bx, regs->cx,
-				    regs->dx, regs->si);
-#ifdef CONFIG_X86_64
-	else
-		audit_syscall_entry(AUDIT_ARCH_X86_64,
-				    regs->orig_ax,
-				    regs->di, regs->si,
-				    regs->dx, regs->r10);
-#endif
+	do_audit_syscall_entry(regs, arch);
 
 out:
 	return ret ?: regs->orig_ax;
 }
 
+long syscall_trace_enter(struct pt_regs *regs)
+{
+	u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
+	unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);
+
+	if (phase1_result == 0)
+		return regs->orig_ax;
+	else
+		return syscall_trace_enter_phase2(regs, arch, phase1_result);
+}
+
 void syscall_trace_leave(struct pt_regs *regs)
 {
 	bool step;
This splits syscall_trace_enter into syscall_trace_enter_phase1 and
syscall_trace_enter_phase2.  Only phase 2 has full pt_regs, and only
phase 2 is permitted to modify any of pt_regs except for orig_ax.

The intent is that phase 1 can be called from the syscall fast path.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/ptrace.h |   5 ++
 arch/x86/kernel/ptrace.c      | 145 ++++++++++++++++++++++++++++++++++++------
 2 files changed, 131 insertions(+), 19 deletions(-)