Message ID | 20230821150909.GA2431@redhat.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | BPF |
Series | bpf: task_group_seq_get_next: cleanup the usage of next_thread() |

On 8/21/23 08:09, Oleg Nesterov wrote:
> 1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
>    can safely iterate the task->thread_group list. Even if this task exits
>    right after get_pid_task() (or goto retry) and pid_alive() returns 0
>
>    Kill the unnecessary pid_alive() check.

This function will return next_task holding a refcount, and release the
refcount until the next time calling the same function. Meanwhile,
the returned task A may be killed, and its next task B may be
killed after A as well, before calling this function again.
However, even task B is destroyed (free), A's next is still pointing to
task B. When this function is called again for the same iterator,
it doesn't promise that B is still there.

Does that make sense to you?

> 2. next_thread() simply can't return NULL, kill the bogus "if (!next_task)"
>    check.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  kernel/bpf/task_iter.c | 7 -------
>  1 file changed, 7 deletions(-)
>
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index c4ab9d6cdbe9..4d1125108014 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -75,15 +75,8 @@ static struct task_struct *task_group_seq_get_next(struct bpf_iter_seq_task_comm
>  		return NULL;
>
>  retry:
> -	if (!pid_alive(task)) {
> -		put_task_struct(task);
> -		return NULL;
> -	}
> -
>  	next_task = next_thread(task);
>  	put_task_struct(task);
> -	if (!next_task)
> -		return NULL;
>
>  	saved_tid = *tid;
>  	*tid = __task_pid_nr_ns(next_task, PIDTYPE_PID, common->ns);

On 08/21, Kui-Feng Lee wrote:
>
>
> On 8/21/23 08:09, Oleg Nesterov wrote:
> > 1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
> >    can safely iterate the task->thread_group list. Even if this task exits
> >    right after get_pid_task() (or goto retry) and pid_alive() returns 0
> >
> >    Kill the unnecessary pid_alive() check.
>
> This function will return next_task holding a refcount, and release the
> refcount until the next time calling the same function. Meanwhile,
> the returned task A may be killed, and its next task B may be
> killed after A as well, before calling this function again.
> However, even task B is destroyed (free), A's next is still pointing to
> task B. When this function is called again for the same iterator,
> it doesn't promise that B is still there.

Not sure I understand...

OK, if we have a task pointer with incremented refcount and do not hold
rcu lock, then yes, you can't remove the pid_alive() check in this code:

	rcu_read_lock();
	if (pid_alive(task))
		do_something(next_thread(task));
	rcu_read_unlock();

because task and then task->next can exit and do call_rcu(delayed_put_task_struct)
before we take rcu_read_lock().

But if you do something like

	rcu_read_lock();

	task = find_task_in_some_rcu_protected_list();
	do_something(next_thread(task));

	rcu_read_unlock();

then next_thread(task) should be safe without pid_alive().

And iiuc task_group_seq_get_next() always does

	rcu_read_lock();	// the caller does lock/unlock

	task = get_pid_task(pid, PIDTYPE_PID);
	if (!task)
		return;

	next_task = next_thread(task);

	rcu_read_unlock();

Yes, both task and task->next can exit right after get_pid_task(), but since
this can only happen after we took rcu_read_lock(), delayed_put_task_struct()
can't be called until we drop rcu lock.

What have I missed?

Oleg.
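
The two cases in the message above can be modelled in userspace. What follows is only a toy sketch and not kernel code: every `toy_*` name is hypothetical, there is no concurrency, and the model is cruder than real RCU (it holds back all deferred frees while any reader is active, and frees immediately when there is none, which stands in for the worst case). It shows only the ordering argument: a free queued before the read-side section starts may already have happened, while a free queued inside the section cannot run until the section ends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 8

/* Hypothetical stand-in for an object freed via a deferred callback. */
struct toy_obj {
	bool freed;
};

static int toy_readers;				/* active read-side sections */
static struct toy_obj *toy_pending[TOY_MAX_PENDING];
static size_t toy_npending;

/* Run the queued "frees", but only when no reader is active. */
static void toy_run_callbacks(void)
{
	if (toy_readers > 0)
		return;
	for (size_t i = 0; i < toy_npending; i++)
		toy_pending[i]->freed = true;
	toy_npending = 0;
}

static void toy_rcu_read_lock(void)
{
	toy_readers++;
}

static void toy_rcu_read_unlock(void)
{
	assert(toy_readers > 0);
	toy_readers--;
	toy_run_callbacks();
}

/* Queue a deferred free; with no reader active it runs right away. */
static void toy_call_rcu(struct toy_obj *obj)
{
	assert(toy_npending < TOY_MAX_PENDING);
	toy_pending[toy_npending++] = obj;
	toy_run_callbacks();
}
```

In this model, an object passed to `toy_call_rcu()` before `toy_rcu_read_lock()` is already marked freed once the reader starts, which corresponds to the case where the `pid_alive()` check is needed. An object queued after `toy_rcu_read_lock()` stays unfreed until `toy_rcu_read_unlock()`, which corresponds to the lookup-under-lock case.
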

So I still think the pid_alive() check should die... and when I look at this
code again I don't understand why it abuses task_struct->usage, I'll send
another patch on top of this one.

On 08/21, Oleg Nesterov wrote:
>
> On 08/21, Kui-Feng Lee wrote:
> >
> >
> > On 8/21/23 08:09, Oleg Nesterov wrote:
> > > 1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
> > >    can safely iterate the task->thread_group list. Even if this task exits
> > >    right after get_pid_task() (or goto retry) and pid_alive() returns 0
> > >
> > >    Kill the unnecessary pid_alive() check.
> >
> > This function will return next_task holding a refcount, and release the
> > refcount until the next time calling the same function. Meanwhile,
> > the returned task A may be killed, and its next task B may be
> > killed after A as well, before calling this function again.
> > However, even task B is destroyed (free), A's next is still pointing to
> > task B. When this function is called again for the same iterator,
> > it doesn't promise that B is still there.
>
> Not sure I understand...
>
> OK, if we have a task pointer with incremented refcount and do not hold
> rcu lock, then yes, you can't remove the pid_alive() check in this code:
>
> 	rcu_read_lock();
> 	if (pid_alive(task))
> 		do_something(next_thread(task));
> 	rcu_read_unlock();
>
> because task and then task->next can exit and do call_rcu(delayed_put_task_struct)
> before we take rcu_read_lock().
>
> But if you do something like
>
> 	rcu_read_lock();
>
> 	task = find_task_in_some_rcu_protected_list();
> 	do_something(next_thread(task));
>
> 	rcu_read_unlock();
>
> then next_thread(task) should be safe without pid_alive().
>
> And iiuc task_group_seq_get_next() always does
>
> 	rcu_read_lock();	// the caller does lock/unlock
>
> 	task = get_pid_task(pid, PIDTYPE_PID);
> 	if (!task)
> 		return;
>
> 	next_task = next_thread(task);
>
> 	rcu_read_unlock();
>
> Yes, both task and task->next can exit right after get_pid_task(), but since
> this can only happen after we took rcu_read_lock(), delayed_put_task_struct()
> can't be called until we drop rcu lock.
>
> What have I missed?
>
> Oleg.

On 8/21/23 11:34, Oleg Nesterov wrote:
> On 08/21, Kui-Feng Lee wrote:
>>
>>
>> On 8/21/23 08:09, Oleg Nesterov wrote:
>>> 1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
>>>     can safely iterate the task->thread_group list. Even if this task exits
>>>     right after get_pid_task() (or goto retry) and pid_alive() returns 0
>>>
>>>     Kill the unnecessary pid_alive() check.
>>
>> This function will return next_task holding a refcount, and release the
>> refcount until the next time calling the same function. Meanwhile,
>> the returned task A may be killed, and its next task B may be
>> killed after A as well, before calling this function again.
>> However, even task B is destroyed (free), A's next is still pointing to
>> task B. When this function is called again for the same iterator,
>> it doesn't promise that B is still there.
>
> Not sure I understand...
>
> OK, if we have a task pointer with incremented refcount and do not hold
> rcu lock, then yes, you can't remove the pid_alive() check in this code:
>
> 	rcu_read_lock();
> 	if (pid_alive(task))
> 		do_something(next_thread(task));
> 	rcu_read_unlock();
>
> because task and then task->next can exit and do call_rcu(delayed_put_task_struct)
> before we take rcu_read_lock().
>
> But if you do something like
>
> 	rcu_read_lock();
>
> 	task = find_task_in_some_rcu_protected_list();
> 	do_something(next_thread(task));
>
> 	rcu_read_unlock();
>
> then next_thread(task) should be safe without pid_alive().
>
> And iiuc task_group_seq_get_next() always does
>
> 	rcu_read_lock();	// the caller does lock/unlock
>
> 	task = get_pid_task(pid, PIDTYPE_PID);
> 	if (!task)
> 		return;
>
> 	next_task = next_thread(task);
>
> 	rcu_read_unlock();
>
> Yes, both task and task->next can exit right after get_pid_task(), but since
> this can only happen after we took rcu_read_lock(), delayed_put_task_struct()
> can't be called until we drop rcu lock.
>
> What have I missed?

Then, it makes sense to me!
Thank you for the explanation.

>
> Oleg.
>

OK, it seems that you are not going to take these preparatory
cleanups ;)

I'll resend along with the s/next_thread/__next_thread/ change.
I was going to do the last change later, but this recent discussion
https://lore.kernel.org/all/20230824143112.GA31208@redhat.com/
makes me think we should do this right now.

On 08/21, Oleg Nesterov wrote:
>
> 1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
>    can safely iterate the task->thread_group list. Even if this task exits
>    right after get_pid_task() (or goto retry) and pid_alive() returns 0.
>
>    Kill the unnecessary pid_alive() check.
>
> 2. next_thread() simply can't return NULL, kill the bogus "if (!next_task)"
>    check.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  kernel/bpf/task_iter.c | 7 -------
>  1 file changed, 7 deletions(-)
>
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index c4ab9d6cdbe9..4d1125108014 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -75,15 +75,8 @@ static struct task_struct *task_group_seq_get_next(struct bpf_iter_seq_task_comm
>  		return NULL;
>
>  retry:
> -	if (!pid_alive(task)) {
> -		put_task_struct(task);
> -		return NULL;
> -	}
> -
>  	next_task = next_thread(task);
>  	put_task_struct(task);
> -	if (!next_task)
> -		return NULL;
>
>  	saved_tid = *tid;
>  	*tid = __task_pid_nr_ns(next_task, PIDTYPE_PID, common->ns);
> --
> 2.25.1.362.g51ebf55
>
>

Oleg Nesterov <oleg@redhat.com> writes:

> OK, it seems that you are not going to take these preparatory
> cleanups ;)
>
> I'll resend along with the s/next_thread/__next_thread/ change.
> I was going to do the last change later, but this recent discussion
> https://lore.kernel.org/all/20230824143112.GA31208@redhat.com/
> makes me think we should do this right now.

For the record I find this code confusing, and wrong.

It looks like it wants to keep the task_struct pointer or possibly
the struct pid pointer like proc does, but then it winds up keeping
a userspace pid value and regenerating both the struct pid pointer
and the struct task_struct pointer.

Which means that task_group_seq_get_next is unnecessarily slow and has
a built in race condition which means it could wind up iterating through
a different process.

This whole thing looks to be a bad (aka racy) reimplementation of
first_tid and next_tid from proc. I thought the changes were to adapt
to the needs of bpf, but on closer examination the code is just racy.

For this code to be correct bpf_iter_seq_task_common needs to store at a
minimum a struct pid pointer.

Oleg your patch makes it easier to see how far this is from
first_tid/next_tid in proc.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Eric
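
The race described above (a saved numeric pid resolving to a different process on a later call) can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel's pid machinery: `toy_pid`, `toy_proc`, the lookup table and all `toy_*` functions are hypothetical, and reference counting of the handle is left out entirely. The handle type only plays the role that a `struct pid` pointer would play in the suggestion above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PID_MAX 8

/* Hypothetical stand-in for a process. */
struct toy_proc {
	const char *name;
};

/* Hypothetical handle: its identity outlives the number it was given. */
struct toy_pid {
	int nr;
	struct toy_proc *proc;	/* NULL once the process has exited */
};

/* Number -> currently attached handle. */
static struct toy_pid *toy_pid_table[TOY_PID_MAX];

static void toy_attach(struct toy_pid *pid, int nr, struct toy_proc *proc)
{
	assert(nr >= 0 && nr < TOY_PID_MAX);
	pid->nr = nr;
	pid->proc = proc;
	toy_pid_table[nr] = pid;
}

/* Exit: the number becomes free for reuse and the old handle goes dead. */
static void toy_exit(struct toy_pid *pid)
{
	toy_pid_table[pid->nr] = NULL;
	pid->proc = NULL;
}

/* Racy style: re-resolve a saved number on every call. */
static struct toy_proc *toy_lookup_by_nr(int nr)
{
	struct toy_pid *pid = toy_pid_table[nr];

	return pid ? pid->proc : NULL;
}

/* Handle style: a saved handle never resolves to a later process. */
static struct toy_proc *toy_lookup_by_handle(const struct toy_pid *pid)
{
	return pid->proc;
}
```

If a process attached under number 5 exits and a second process is then attached under the same number, `toy_lookup_by_nr(5)` returns the second process, whereas `toy_lookup_by_handle()` on the first handle returns NULL, so an iterator built on the handle would stop instead of wandering into an unrelated process.
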

On 08/25, Eric W. Biederman wrote:
>
> For the record I find this code confusing, and wrong.

Oh, yes...

> and has
> a built in race condition which means it could wind up iterating through
> a different process.

Yes, common->pid and/or common->pid_visiting can be reused but I am not
going to try to fix this ;)

> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Thanks!

Oleg.

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index c4ab9d6cdbe9..4d1125108014 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -75,15 +75,8 @@ static struct task_struct *task_group_seq_get_next(struct bpf_iter_seq_task_comm
 		return NULL;
 
 retry:
-	if (!pid_alive(task)) {
-		put_task_struct(task);
-		return NULL;
-	}
-
 	next_task = next_thread(task);
 	put_task_struct(task);
-	if (!next_task)
-		return NULL;
 
 	saved_tid = *tid;
 	*tid = __task_pid_nr_ns(next_task, PIDTYPE_PID, common->ns);

1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
   can safely iterate the task->thread_group list. Even if this task exits
   right after get_pid_task() (or goto retry) and pid_alive() returns 0.

   Kill the unnecessary pid_alive() check.

2. next_thread() simply can't return NULL, kill the bogus "if (!next_task)"
   check.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/bpf/task_iter.c | 7 -------
 1 file changed, 7 deletions(-)
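
Point 2 follows from the shape of the data structure: a thread's successor is reached by following the `next` pointer of a circular list, so there is always a successor, possibly the thread itself. The userspace sketch below illustrates this under stated assumptions. The `toy_*` types and helpers are hypothetical stand-ins written for this example, not the kernel's `list_head` or `next_thread()` definitions, and no RCU or locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a circular doubly linked list node. */
struct toy_list_head {
	struct toy_list_head *next, *prev;
};

struct toy_task {
	int tid;
	struct toy_list_head thread_group;	/* ring of threads in the group */
};

#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* A lone thread's list node points back at itself. */
static void toy_task_init(struct toy_task *t, int tid)
{
	t->tid = tid;
	t->thread_group.next = &t->thread_group;
	t->thread_group.prev = &t->thread_group;
}

/* Link 'new' in just before 'leader', i.e. at the tail of the ring. */
static void toy_add_thread(struct toy_task *leader, struct toy_task *new)
{
	struct toy_list_head *head = &leader->thread_group;
	struct toy_list_head *node = &new->thread_group;

	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Follow ->next and map the list node back to its task; never NULL. */
static struct toy_task *toy_next_thread(const struct toy_task *t)
{
	assert(t->thread_group.next != NULL);
	return toy_container_of(t->thread_group.next, struct toy_task,
				thread_group);
}
```

For a single-threaded group `toy_next_thread()` returns the task itself, and for a group of three it cycles through all members and wraps back to the first, so in this model a NULL check on the result can never fire.
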