Message ID | 87pn8c1uj6.fsf_-_@x220.int.ebiederm.org
State      | New, archived
Series     | [RFC] exec: Conceal the other threads from wakeups during exec
On Thu, Jul 30, 2020 at 4:00 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The key is the function make_task_wakekill which could probably
> benefit from a little more review and refinement but appears to
> be basically correct.

You really need to explain a lot more why you think this is all a good idea.

For example, what if one of those other threads is waiting in line for
a critical lock, and the wait-queue you basically disabled was the
exclusive wait after lock handoff?

That means that the lock will now effectively be held by that thread.
No, it wasn't woken up, but it had the lock handed to it, and it's now
entirely unresponsive until it is killed.

How is that different from the deadlocks you're actually trying to fix?

These are the kinds of problems that the freezer() code had too, with
freezing things that held locks etc.

This approach does seem better than the freezer thing, and if I read
it right it will gather things in the signal handler code, but it's
not obvious that gathering them in random places where they sleep for
random reasons is safe or a good idea.

I can imagine _so_ many dead systems if you just basically froze
something that holds the mmap lock and is sleeping on a page fault,
for example.

Maybe I'm missing something, but I really think your "let's freeze
things" approach is seriously misguided. You're concentrating on some
small problem and trying to solve that, and not seeing the HUGE HONKING
problems that your approach is fundamentally introducing.

             Linus
Eric,

I won't comment on the intent, but I too do not understand this idea.

On 07/30, Eric W. Biederman wrote:
>
> [This change requires more work to handle TASK_STOPPED and TASK_TRACED]

Yes. And it is not clear to me how you can solve this.

> [This adds a new lock ordering dependency siglock -> pi_lock -> rq_lock ]

Not really, ttwu() can be safely called with siglock held and it takes
pi_lock + rq_lock. Say, signal_wake_up().

> +int make_task_wakekill(struct task_struct *p)
> +{
> +	unsigned long flags;
> +	int cpu, success = 0;
> +	struct rq_flags rf;
> +	struct rq *rq;
> +	long state;
> +
> +	/* Assumes p != current */
> +	preempt_disable();
> +	/*
> +	 * If we are going to change a thread waiting for CONDITION we
> +	 * need to ensure that CONDITION=1 done by the caller can not be
> +	 * reordered with p->state check below. This pairs with mb() in
> +	 * set_current_state() the waiting thread does.
> +	 */
> +	raw_spin_lock_irqsave(&p->pi_lock, flags);
> +	smp_mb__after_spinlock();
> +	state = p->state;
> +
> +	/* FIXME handle TASK_STOPPED and TASK_TRACED */
> +	if ((state == TASK_KILLABLE) ||
> +	    (state == TASK_INTERRUPTIBLE)) {
> +		success = 1;
> +		cpu = task_cpu(p);
> +		rq = cpu_rq(cpu);
> +		rq_lock(rq, &rf);
> +		p->state = TASK_WAKEKILL;

You can only do this if the task was already deactivated. Just suppose it
is preempted or does something like

	set_current_state(TASK_INTERRUPTIBLE);

	if (CONDITION) {
		// make_task_wakekill() sets state = TASK_WAKEKILL
		__set_current_state(TASK_RUNNING);
		return;
	}

	schedule();

Oleg.
Oleg Nesterov <oleg@redhat.com> writes:

> Eric, I won't comment on the intent, but I too do not understand this idea.
>
> On 07/30, Eric W. Biederman wrote:
>>
>> [This change requires more work to handle TASK_STOPPED and TASK_TRACED]
>
> Yes. And it is not clear to me how you can solve this.

I was imagining putting TASK_STOPPED and TASK_TRACED in a loop that
verifies they should still be in that state before exiting, so they
could handle spurious wake-ups.

There are many subtleties in that code, especially in the conversion of
TASK_STOPPED to TASK_TRACED.  So I suspect something more would be
required, but I have not looked yet to see how tricky that would be.

>> [This adds a new lock ordering dependency siglock -> pi_lock -> rq_lock ]
>
> Not really, ttwu() can be safely called with siglock held and it takes
> pi_lock + rq_lock. Say, signal_wake_up().

Good point.

>> +int make_task_wakekill(struct task_struct *p)
>> +{
>> +	unsigned long flags;
>> +	int cpu, success = 0;
>> +	struct rq_flags rf;
>> +	struct rq *rq;
>> +	long state;
>> +
>> +	/* Assumes p != current */
>> +	preempt_disable();
>> +	/*
>> +	 * If we are going to change a thread waiting for CONDITION we
>> +	 * need to ensure that CONDITION=1 done by the caller can not be
>> +	 * reordered with p->state check below. This pairs with mb() in
>> +	 * set_current_state() the waiting thread does.
>> +	 */
>> +	raw_spin_lock_irqsave(&p->pi_lock, flags);
>> +	smp_mb__after_spinlock();
>> +	state = p->state;
>> +
>> +	/* FIXME handle TASK_STOPPED and TASK_TRACED */
>> +	if ((state == TASK_KILLABLE) ||
>> +	    (state == TASK_INTERRUPTIBLE)) {
>> +		success = 1;
>> +		cpu = task_cpu(p);
>> +		rq = cpu_rq(cpu);
>> +		rq_lock(rq, &rf);
>> +		p->state = TASK_WAKEKILL;
>
> You can only do this if the task was already deactivated. Just suppose it
> is preempted or does something like
>
>	set_current_state(TASK_INTERRUPTIBLE);
>
>	if (CONDITION) {
>		// make_task_wakekill() sets state = TASK_WAKEKILL
>		__set_current_state(TASK_RUNNING);
>		return;
>	}
>
>	schedule();

You are quite right.  So that bit of code would need to be:

	if (!task->on_rq)
		goto out;

	if ((state == TASK_KILLABLE) ||
	    (state == TASK_INTERRUPTIBLE)) {
	...

Eric
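A slightly fuller sketch of that guard, written against Oleg's requirement
that the task must already have been deactivated before its state may be
rewritten; this is illustrative only, and the placement of the new check
and the "out:" label are assumptions rather than reviewed code:

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	smp_mb__after_spinlock();
	state = p->state;

	/*
	 * Only rewrite the state of a task that has genuinely been
	 * deactivated.  A task that merely set TASK_INTERRUPTIBLE and was
	 * preempted, or that is about to notice CONDITION and set itself
	 * back to TASK_RUNNING, still has ->on_rq set; skip it and let the
	 * caller signal it and retry instead.
	 */
	if (p->on_rq)
		goto out;

	/* FIXME handle TASK_STOPPED and TASK_TRACED */
	if ((state == TASK_KILLABLE) ||
	    (state == TASK_INTERRUPTIBLE)) {
		success = 1;
		cpu = task_cpu(p);
		rq = cpu_rq(cpu);
		rq_lock(rq, &rf);
		p->state = TASK_WAKEKILL;
		rq_unlock(rq, &rf);
	} else if (state == TASK_WAKEKILL)
		success = 1;

out:
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
	preempt_enable();
	return success;

A stricter version would presumably also need to wait for ->on_cpu to
drop, the way try_to_wake_up() does, before touching ->state.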
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Jul 30, 2020 at 4:00 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> The key is the function make_task_wakekill which could probably
>> benefit from a little more review and refinement but appears to
>> be basically correct.
>
> You really need to explain a lot more why you think this is all a good idea.
>
> For example, what if one of those other threads is waiting in line for
> a critical lock, and the wait-queue you basically disabled was the
> exclusive wait after lock handoff?
>
> That means that the lock will now effectively be held by that thread.
> No, it wasn't woken up, but it had the lock handed to it, and it's now
> entirely unresponsive until it is killed.
>
> How is that different from the deadlocks you're actually trying to fix?
>
> These are the kinds of problems that the freezer() code had too, with
> freezing things that held locks etc.
>
> This approach does seem better than the freezer thing, and if I read
> it right it will gather things in the signal handler code, but it's
> not obvious that gathering them in random places where they sleep for
> random reasons is safe or a good idea.
>
> I can imagine _so_ many dead systems if you just basically froze
> something that holds the mmap lock and is sleeping on a page fault,
> for example.
>
> Maybe I'm missing something, but I really think your "let's freeze
> things" approach is seriously misguided. You're concentrating on some
> small problem and trying to solve that, and not seeing the HUGE HONKING
> problems that your approach is fundamentally introducing.

Very good point.  That would be a priority inversion on mmap_lock.
Without great care that could indeed result in lockups.

That definitely requires that the sleeping points which can be converted
opt in, which potentially makes this much more work.

Thanks, that helps kill my bright idea as I expressed it.

Part of what I was trying to solve (because I ran into the problem while
I was reading the code) was that the freezer, the cgroup v2 freezer, and
other waits do not compose nicely.

Even limited to opt-in locations I think the trick of being able to
transform the wait-state may solve that composition problem.

That said, I was really just posting this so that if the ideas were good
they could inspire future code, and if the ideas were bad they could be
sunk.  When it comes to sorting out the future, especially in exec, I
will know which ideas don't fly, so it will be easier to make the case
for ideas that will work.

Eric
On Fri, Jul 31, 2020 at 10:19 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Even limited to opt-in locations I think the trick of being able to
> transform the wait-state may solve that composition problem.

So the part I found intriguing was the "catch things in the signal
handling path".

Catching things there - and *only* there - would avoid a lot of the
problems we had with the freezer. When you're about to return to user
mode, there are no lock inversions etc.

And it kind of makes conceptual sense to do, since what you're trying
to capture is the signal group - so using the signal state to do so
seems like a natural thing to do. No touching of any runqueues or
scheduler data structures, do everything _purely_ with the signal
handling pathways.

So that "feels" ok to me.

That said, I do wonder if there are nasty nasty latency issues with
odd users. Normally, you'd expect that execve() with other threads in
the group shouldn't be a performance issue, because people simply
shouldn't do that. So it might be ok.

And if you capture them all in the signal handling pathway, that ends
up being a very convenient place to zap them all too, so maybe my
latency worry is misguided.

IOW, I think that you could try to do your "freeze other threads" not
at all like the freezer, but more like a "collect all threads in their
signal handler parts as the first phase of zapping them".

So maybe this approach is salvageable. I see where something like the
above could work well. But I say that with a lot of handwaving, and
maybe if I see the patch I'd go "Christ, I was a complete idiot for
ever even suggesting that".

             Linus
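For reference, the gathering point being described here already exists in
the posted patch: when an exec is in flight, get_signal() parks a thread
in TASK_WAKEKILL and notifies the exec'ing task.  The relevant hunk (from
the kernel/signal.c change below, running with sighand->siglock held) is:

	/* Is this task concealing itself from wake-ups during exec? */
	if (unlikely(signal->group_execing_task)) {
		set_current_state(TASK_WAKEKILL);
		wake_up_process(signal->group_execing_task);
		spin_unlock_irq(&sighand->siglock);
		schedule();
		goto relock;
	}

What Linus is suggesting is, roughly, to make this the only place where
threads are captured, rather than also rewriting the state of tasks that
are already asleep elsewhere.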
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Jul 31, 2020 at 10:19 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Even limited to opt-in locations I think the trick of being able to
>> transform the wait-state may solve that composition problem.
>
> So the part I found intriguing was the "catch things in the signal
> handling path".
>
> Catching things there - and *only* there - would avoid a lot of the
> problems we had with the freezer. When you're about to return to user
> mode, there are no lock inversions etc.
>
> And it kind of makes conceptual sense to do, since what you're trying
> to capture is the signal group - so using the signal state to do so
> seems like a natural thing to do. No touching of any runqueues or
> scheduler data structures, do everything _purely_ with the signal
> handling pathways.
>
> So that "feels" ok to me.
>
> That said, I do wonder if there are nasty nasty latency issues with
> odd users. Normally, you'd expect that execve() with other threads in
> the group shouldn't be a performance issue, because people simply
> shouldn't do that. So it might be ok.
>
> And if you capture them all in the signal handling pathway, that ends
> up being a very convenient place to zap them all too, so maybe my
> latency worry is misguided.
>
> IOW, I think that you could try to do your "freeze other threads" not
> at all like the freezer, but more like a "collect all threads in their
> signal handler parts as the first phase of zapping them".
>
> So maybe this approach is salvageable. I see where something like the
> above could work well. But I say that with a lot of handwaving, and
> maybe if I see the patch I'd go "Christ, I was a complete idiot for
> ever even suggesting that".

Yes.  The tricky bit is that there are a handful of stops that must be
handled, or it is impossible to stop everything without causing
disruption if the exec fails.  The big ones are TASK_STOPPED and
TASK_TRACED.  There is another in wait_for_vfork_done.

At which point I am looking at writing a wrapper around schedule that
changes the task state to something like TASK_WAKEKILL when asked, and
then restores the state when released.  Something that is independent of
which freezer the code is using.

It could be the scheduler too, with a special bit in the state that says
opt-in.  But if we have to opt in it is probably much less error prone
to write the code as a wrapper around schedule, and only modify the core
scheduling code if necessary.

If I can make TASK_STOPPED and TASK_TRACED handle spurious wake-ups I
think I can build something that is independent of the rest of the
freezers, so the code doesn't have to go 3 deep on wrappers of different
freezers at those locations.  It is already 2 layers deep.

But I really don't intend to work on that again for a while.  Right now
I am in the final stages of resurrecting:
https://lore.kernel.org/linux-fsdevel/87a7ohs5ow.fsf@xmission.com/

The hard part looks like cleaning up and resurrecting Oleg's patch to
prevent the abuse of files_struct->count.
https://lore.kernel.org/linux-fsdevel/20180915160423.GA31461@redhat.com/

I am close, but dotting all of the i's and crossing all of the t's is
taking a bit.

Eric
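A very rough sketch of what such a wrapper around schedule() might look
like; the name schedule_concealable(), the lockless read of
group_execing_task, and the parking loop are all assumptions about an
idea that has not been designed yet, not working code:

/*
 * Hypothetical opt-in replacement for schedule() at sleep sites that are
 * safe to park during exec.  If an exec is concealing this thread group,
 * sleep in TASK_WAKEKILL and notify the exec'ing thread; otherwise sleep
 * in whatever state the caller already set.  Real code would need to
 * close the races with exec_conceal_threads() (probably under siglock)
 * and would still need to handle TASK_STOPPED/TASK_TRACED separately.
 */
static void schedule_concealable(void)
{
	long caller_state = current->state;
	struct task_struct *execing;

	while ((execing = READ_ONCE(current->signal->group_execing_task))) {
		set_current_state(TASK_WAKEKILL);
		wake_up_process(execing);
		schedule();
	}

	/* Go back to sleeping in the state the caller originally chose. */
	set_current_state(caller_state);
	schedule();
}

Callers would use it exactly where they call schedule() today, inside
their usual wait loops, so a spurious wake-up from the exec path would be
handled by the existing retest of the wait condition.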
diff --git a/fs/exec.c b/fs/exec.c
index 3698252719a3..5e4b0187ac05 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1145,6 +1145,95 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
+static void exec_reveal_threads_locked(void)
+{
+	struct task_struct *p = current, *t;
+	struct signal_struct *signal = p->signal;
+
+	if (signal->group_execing_task) {
+		signal->group_execing_task = NULL;
+		__for_each_thread(signal, t) {
+			if (t == p)
+				continue;
+			/*
+			 * This might be a spurious wake up task t but
+			 * this should be fine as t should verify it
+			 * is the appropriate time to wake up and if
+			 * not fall back asleep.
+			 *
+			 * Any performance (rather than correctness)
+			 * implications of this code should be unimportant
+			 * as it is only called on error.
+			 */
+			wake_up_state(t, TASK_WAKEKILL);
+		}
+	}
+}
+
+static void exec_reveal_threads(void)
+{
+	spinlock_t *lock = &current->sighand->siglock;
+
+	spin_lock_irq(lock);
+	exec_reveal_threads_locked();
+	spin_unlock_irq(lock);
+}
+
+/*
+ * Conceal all other threads in the thread group from wakeups
+ */
+static int exec_conceal_threads(void)
+{
+	struct task_struct *me = current, *t;
+	struct signal_struct *signal = me->signal;
+	spinlock_t *lock = &me->sighand->siglock;
+	int ret = 0;
+
+	if (thread_group_empty(me))
+		return 0;
+
+	spin_lock_irq(lock);
+	if (signal_pending(me) || signal_group_exit(signal) ||
+	    signal->group_execing_task) {
+		spin_unlock(lock);
+		return -ERESTARTNOINTR;
+	}
+
+	signal->group_execing_task = me;
+	for (;;) {
+		unsigned int todo = 0;
+
+		__for_each_thread(signal, t) {
+			if ((t == me) || (t->flags & PF_EXITING))
+				continue;
+
+			if (make_task_wakekill(t))
+				continue;
+
+			signal_wake_up(t, 0);
+			todo++;
+		}
+
+		if ((todo == 0) || __fatal_signal_pending(me))
+			break;
+
+		set_current_state(TASK_KILLABLE);
+		spin_unlock_irq(lock);
+
+		schedule();
+
+		spin_lock_irq(lock);
+		if (__fatal_signal_pending(me))
+			break;
+	}
+	if (__fatal_signal_pending(me)) {
+		ret = -ERESTARTNOINTR;
+		exec_reveal_threads_locked();
+	}
+	spin_unlock_irq(lock);
+	return ret;
+}
+
 static int de_thread(struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
@@ -1885,10 +1974,15 @@ static int bprm_execve(struct linux_binprm *bprm,
 	struct files_struct *displaced;
 	int retval;
 
-	retval = unshare_files(&displaced);
+	/* Conceal any other threads from wakeups */
+	retval = exec_conceal_threads();
 	if (retval)
 		return retval;
 
+	retval = unshare_files(&displaced);
+	if (retval)
+		goto out_ret;
+
 	retval = prepare_bprm_creds(bprm);
 	if (retval)
 		goto out_files;
@@ -1949,6 +2043,8 @@ static int bprm_execve(struct linux_binprm *bprm,
 out_files:
 	if (displaced)
 		reset_files_struct(displaced);
+out_ret:
+	exec_reveal_threads();
 	return retval;
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index edb2020875ad..dcd79e78b651 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1726,6 +1726,8 @@ extern void kick_process(struct task_struct *tsk);
 static inline void kick_process(struct task_struct *tsk) { }
 #endif
 
+extern int make_task_wakekill(struct task_struct *tsk);
+
 extern void __set_task_comm(struct task_struct *tsk, const char *from, bool exec);
 
 static inline void set_task_comm(struct task_struct *tsk, const char *from)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1bad18a1d8ba..647b7d0d2231 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -106,6 +106,9 @@ struct signal_struct {
 	int			notify_count;
 	struct task_struct	*group_exit_task;
 
+	/* Task that is performing exec */
+	struct task_struct	*group_execing_task;
+
 	/* thread group stop support, overloads group_exit_code too */
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
diff --git a/kernel/fork.c b/kernel/fork.c
index bf215af7a904..686c6901eabd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2247,6 +2247,12 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
+	/* Don't allow creation of new tasks during exec */
+	if (unlikely(current->signal->group_execing_task)) {
+		retval = -ERESTARTNOINTR;
+		goto bad_fork_cancel_cgroup;
+	}
+
 	/* past the last point of failure */
 	if (pidfile)
 		fd_install(pidfd, pidfile);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f360326861e..1ac8b81f22de 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2648,6 +2648,44 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	return success;
 }
 
+int make_task_wakekill(struct task_struct *p)
+{
+	unsigned long flags;
+	int cpu, success = 0;
+	struct rq_flags rf;
+	struct rq *rq;
+	long state;
+
+	/* Assumes p != current */
+	preempt_disable();
+	/*
+	 * If we are going to change a thread waiting for CONDITION we
+	 * need to ensure that CONDITION=1 done by the caller can not be
+	 * reordered with p->state check below. This pairs with mb() in
+	 * set_current_state() the waiting thread does.
+	 */
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	smp_mb__after_spinlock();
+	state = p->state;
+
+	/* FIXME handle TASK_STOPPED and TASK_TRACED */
+	if ((state == TASK_KILLABLE) ||
+	    (state == TASK_INTERRUPTIBLE)) {
+		success = 1;
+		cpu = task_cpu(p);
+		rq = cpu_rq(cpu);
+		rq_lock(rq, &rf);
+		p->state = TASK_WAKEKILL;
+		rq_unlock(rq, &rf);
+	}
+	else if (state == TASK_WAKEKILL)
+		success = 1;
+
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	preempt_enable();
+	return success;
+}
+
 /**
  * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
  * @p: Process for which the function is to be invoked.
diff --git a/kernel/signal.c b/kernel/signal.c
index 5ca48cc5da76..19f73eda1d54 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2590,6 +2590,15 @@ bool get_signal(struct ksignal *ksig)
 		goto fatal;
 	}
 
+	/* Is this task concealing itself from wake-ups during exec? */
+	if (unlikely(signal->group_execing_task)) {
+		set_current_state(TASK_WAKEKILL);
+		wake_up_process(signal->group_execing_task);
+		spin_unlock_irq(&sighand->siglock);
+		schedule();
+		goto relock;
+	}
+
 	for (;;) {
 		struct k_sigaction *ka;
@@ -2849,6 +2858,8 @@ void exit_signals(struct task_struct *tsk)
 	    task_participate_group_stop(tsk))
 		group_stop = CLD_STOPPED;
 out:
+	if (unlikely(tsk->signal->group_execing_task))
+		wake_up_process(tsk->signal->group_execing_task);
 	spin_unlock_irq(&tsk->sighand->siglock);
 
 	/*
Right now I think I see solutions to the problems of exec without using
this code.  However this code is tested and working for the common cases
(and has no lockdep warnings), and the techniques it is using could
potentially be used to simplify the freezer, the cgroup v1 freezer, or
the cgroup v2 freezer.

The key is the function make_task_wakekill which could probably benefit
from a little more review and refinement but appears to be basically
correct.

---

[This change requires more work to handle TASK_STOPPED and TASK_TRACED]
[This adds a new lock ordering dependency siglock -> pi_lock -> rq_lock ]

Many of the challenges of implementing a simple version of exec come
from the fact that the code handles exec'ing multi-threaded processes.
To the best of my knowledge processes with more than one thread calling
exec are not common, and as all of the threads will be killed by exec
there does not appear to be any useful work a thread can reliably do
during exec.

Therefore make it simpler to get exec correct by concealing the other
threads from wakeups at the beginning of exec.  This removes an entire
class of races, and makes it tractable to fix some of the long standing
issues with exec.

One issue that this change makes it easier to solve is the issue of
dealing with the file table.  Today exec unshares the file table at the
beginning to ensure there are no weird races with file descriptors.
Unfortunately this unsharing can unshare the file table when only
threads of the current process share it, which results in unnecessary
unsharing and POSIX locks being inappropriately dropped by a
multi-threaded exec.  With all of the threads frozen the thread count is
stable and it is easy to tell if the file table really needs to be
unshared by exec.

Further, this change allows seccomp to stop taking cred_guard_mutex.
The seccomp code takes cred_guard_mutex to protect against another
thread that is in the middle of calling exec, and this change guarantees
that if one thread is calling exec all of the other threads have stopped
running, so this problematic kind of concurrency between threads can no
longer happen.

The code in de_thread was modified to unmask the threads at the same
time as it is killing them, ensuring that code continues to work as it
does today, and without introducing any races where a thread might
perform any problematic work in the middle of de_thread.

The code in fork is modified to fail if another thread in the parent is
in the middle of exec.

A new generic scheduler function make_task_wakekill is added.  I think
the locking is ok, changing task->state under both pi_lock and the
rq_lock, but I have not done a detailed look yet to confirm that I am
not missing something subtle.

A new function exec_conceal_threads is added, to set group_execing_task,
walk through all of the threads, and change their state to TASK_WAKEKILL
if it is not that already.

A new companion function exec_reveal_threads sends a wake up to all of
the other threads and clears group_execing_task.  This may cause a
spurious wake up, but that is an uncommon case, and the code for
TASK_UNINTERRUPTIBLE and TASK_INTERRUPTIBLE is expected to handle
spurious wake-ups, so it should be fine.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c                    | 98 +++++++++++++++++++++++++++++++++++-
 include/linux/sched.h        |  2 +
 include/linux/sched/signal.h |  3 ++
 kernel/fork.c                |  6 +++
 kernel/sched/core.c          | 38 ++++++++++++++
 kernel/signal.c              | 11 ++++
 6 files changed, 157 insertions(+), 1 deletion(-)
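As a purely illustrative aside (not part of the patch), the case the
changelog calls uncommon - one thread of a multi-threaded process calling
exec while its siblings are blocked in the kernel - is easy to construct
from userspace.  Everything below is ordinary POSIX and exists only to
show the scenario that exec_conceal_threads() has to cope with:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Sibling thread: blocks interruptibly in the kernel until a signal. */
static void *sleeper(void *arg)
{
	(void)arg;
	pause();
	return NULL;
}

int main(void)
{
	pthread_t tid;
	char *const argv[] = { "/bin/true", NULL };

	if (pthread_create(&tid, NULL, sleeper, NULL)) {
		perror("pthread_create");
		return 1;
	}

	sleep(1);	/* give the sibling time to block in pause() */

	/*
	 * One thread execs while the other is asleep in the kernel.  exec
	 * must get the sleeping sibling out of the way; with this RFC it
	 * would first be concealed from wakeups and later killed in
	 * de_thread().
	 */
	execv(argv[0], argv);
	perror("execv");
	return 1;
}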