diff mbox series

bpf: task_group_seq_get_next: cleanup the usage of next_thread()

Message ID 20230821150909.GA2431@redhat.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series bpf: task_group_seq_get_next: cleanup the usage of next_thread() | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-VM_Test-29 success Logs for veristat
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1332 this patch: 1332
netdev/cc_maintainers warning 10 maintainers not CCed: daniel@iogearbox.net kpsingh@kernel.org martin.lau@linux.dev john.fastabend@gmail.com song@kernel.org sdf@google.com yonghong.song@linux.dev jolsa@kernel.org haoluo@google.com ast@kernel.org
netdev/build_clang success Errors and warnings before: 1353 this patch: 1353
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1355 this patch: 1355
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 15 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-0 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-5 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-1 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-12 fail Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on s390x with gcc

Commit Message

Oleg Nesterov Aug. 21, 2023, 3:09 p.m. UTC
1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
   can safely iterate the task->thread_group list. Even if this task exits
   right after get_pid_task() (or goto retry) and pid_alive() returns 0.

   Kill the unnecessary pid_alive() check.

2. next_thread() simply can't return NULL, kill the bogus "if (!next_task)"
   check.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/bpf/task_iter.c | 7 -------
 1 file changed, 7 deletions(-)

Comments

Kui-Feng Lee Aug. 21, 2023, 5:55 p.m. UTC | #1
On 8/21/23 08:09, Oleg Nesterov wrote:
> 1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
>     can safely iterate the task->thread_group list. Even if this task exits
>     right after get_pid_task() (or goto retry) and pid_alive() returns 0 >
>     Kill the unnecessary pid_alive() check.

This function will return next_task holding a refcount, and release the
refcount until the next time calling the same function. Meanwhile,
the returned task A may be killed, and its next task B may be
killed after A as well, before calling this function again.
However, even task B is destroyed (free), A's next is still pointing to
task B. When this function is called again for the same iterator,
it doesn't promise that B is still there.

Does that make sense to you?

> 
> 2. next_thread() simply can't return NULL, kill the bogus "if (!next_task)"
>     check.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>   kernel/bpf/task_iter.c | 7 -------
>   1 file changed, 7 deletions(-)
> 
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index c4ab9d6cdbe9..4d1125108014 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -75,15 +75,8 @@ static struct task_struct *task_group_seq_get_next(struct bpf_iter_seq_task_comm
>   		return NULL;
>   
>   retry:
> -	if (!pid_alive(task)) {
> -		put_task_struct(task);
> -		return NULL;
> -	}
> -
>   	next_task = next_thread(task);
>   	put_task_struct(task);
> -	if (!next_task)
> -		return NULL;
>   
>   	saved_tid = *tid;
>   	*tid = __task_pid_nr_ns(next_task, PIDTYPE_PID, common->ns);
Oleg Nesterov Aug. 21, 2023, 6:34 p.m. UTC | #2
On 08/21, Kui-Feng Lee wrote:
>
>
> On 8/21/23 08:09, Oleg Nesterov wrote:
> >1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
> >    can safely iterate the task->thread_group list. Even if this task exits
> >    right after get_pid_task() (or goto retry) and pid_alive() returns 0 >
> >    Kill the unnecessary pid_alive() check.
>
> This function will return next_task holding a refcount, and release the
> refcount until the next time calling the same function. Meanwhile,
> the returned task A may be killed, and its next task B may be
> killed after A as well, before calling this function again.
> However, even task B is destroyed (free), A's next is still pointing to
> task B. When this function is called again for the same iterator,
> it doesn't promise that B is still there.

Not sure I understand...

OK, if we have a task pointer with incremented refcount and do not hold
rcu lock, then yes, you can't remove the pid_alive() check in this code:

	rcu_read_lock();
	if (pid_alive(task))
		do_something(next_thread(task));
	rcu_read_unlock();

because task and then task->next can exit and do call_rcu(delayed_put_task_struct)
before we take rcu_read_lock().

But if you do something like

	rcu_read_lock();

	task = find_task_in_some_rcu_protected_list();
	do_something(next_thread(task));

	rcu_read_unlock();

then next_thread(task) should be safe without pid_alive().

And iiuc task_group_seq_get_next() always does

	rcu_read_lock();	// the caller does lock/unlock

	task = get_pid_task(pid, PIDTYPE_PID);
	if (!task)
		return;
	
	next_task = next_thread(task);

	rcu_read_unlock();

Yes, both task and task->next can exit right after get_pid_task(), but since
can only happen after we took rcu_read_lock(), delayed_put_task_struct() can't
be called until we drop rcu lock.

What have I missed?

Oleg.
Oleg Nesterov Aug. 21, 2023, 7:54 p.m. UTC | #3
So I still think the pid_alive() check should die...

and when I look at this code again I don't understand why does it abuse
task_struct->usage, I'll send another patch on top of this one.

On 08/21, Oleg Nesterov wrote:
>
> On 08/21, Kui-Feng Lee wrote:
> >
> >
> > On 8/21/23 08:09, Oleg Nesterov wrote:
> > >1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
> > >    can safely iterate the task->thread_group list. Even if this task exits
> > >    right after get_pid_task() (or goto retry) and pid_alive() returns 0 >
> > >    Kill the unnecessary pid_alive() check.
> >
> > This function will return next_task holding a refcount, and release the
> > refcount until the next time calling the same function. Meanwhile,
> > the returned task A may be killed, and its next task B may be
> > killed after A as well, before calling this function again.
> > However, even task B is destroyed (free), A's next is still pointing to
> > task B. When this function is called again for the same iterator,
> > it doesn't promise that B is still there.
>
> Not sure I understand...
>
> OK, if we have a task pointer with incremented refcount and do not hold
> rcu lock, then yes, you can't remove the pid_alive() check in this code:
>
> 	rcu_read_lock();
> 	if (pid_alive(task))
> 		do_something(next_thread(task));
> 	rcu_read_unlock();
>
> because task and then task->next can exit and do call_rcu(delayed_put_task_struct)
> before we take rcu_read_lock().
>
> But if you do something like
>
> 	rcu_read_lock();
>
> 	task = find_task_in_some_rcu_protected_list();
> 	do_something(next_thread(task));
>
> 	rcu_read_unlock();
>
> then next_thread(task) should be safe without pid_alive().
>
> And iiuc task_group_seq_get_next() always does
>
> 	rcu_read_lock();	// the caller does lock/unlock
>
> 	task = get_pid_task(pid, PIDTYPE_PID);
> 	if (!task)
> 		return;
>
> 	next_task = next_thread(task);
>
> 	rcu_read_unlock();
>
> Yes, both task and task->next can exit right after get_pid_task(), but since
> can only happen after we took rcu_read_lock(), delayed_put_task_struct() can't
> be called until we drop rcu lock.
>
> What have I missed?
>
> Oleg.
Kui-Feng Lee Aug. 21, 2023, 8:24 p.m. UTC | #4
On 8/21/23 11:34, Oleg Nesterov wrote:
> On 08/21, Kui-Feng Lee wrote:
>>
>>
>> On 8/21/23 08:09, Oleg Nesterov wrote:
>>> 1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
>>>     can safely iterate the task->thread_group list. Even if this task exits
>>>     right after get_pid_task() (or goto retry) and pid_alive() returns 0 >
>>>     Kill the unnecessary pid_alive() check.
>>
>> This function will return next_task holding a refcount, and release the
>> refcount until the next time calling the same function. Meanwhile,
>> the returned task A may be killed, and its next task B may be
>> killed after A as well, before calling this function again.
>> However, even task B is destroyed (free), A's next is still pointing to
>> task B. When this function is called again for the same iterator,
>> it doesn't promise that B is still there.
> 
> Not sure I understand...
> 
> OK, if we have a task pointer with incremented refcount and do not hold
> rcu lock, then yes, you can't remove the pid_alive() check in this code:
> 
> 	rcu_read_lock();
> 	if (pid_alive(task))
> 		do_something(next_thread(task));
> 	rcu_read_unlock();
> 
> because task and then task->next can exit and do call_rcu(delayed_put_task_struct)
> before we take rcu_read_lock().
> 
> But if you do something like
> 
> 	rcu_read_lock();
> 
> 	task = find_task_in_some_rcu_protected_list();
> 	do_something(next_thread(task));
> 
> 	rcu_read_unlock();
> 
> then next_thread(task) should be safe without pid_alive().
> 
> And iiuc task_group_seq_get_next() always does
> 
> 	rcu_read_lock();	// the caller does lock/unlock
> 
> 	task = get_pid_task(pid, PIDTYPE_PID);
> 	if (!task)
> 		return;
> 	
> 	next_task = next_thread(task);
> 
> 	rcu_read_unlock();
> 
> Yes, both task and task->next can exit right after get_pid_task(), but since
> can only happen after we took rcu_read_lock(), delayed_put_task_struct() can't
> be called until we drop rcu lock.
> 
> What have I missed?

Then, it makes sense to me! Thank you for the explanation.

> 
> Oleg.
>
Oleg Nesterov Aug. 25, 2023, 12:41 p.m. UTC | #5
OK, it seems that you are not going to take these preparatory
cleanups ;)

I'll resend along with the s/next_thread/__next_thread/ change.
I was going to do the last change later, but this recent discussion
https://lore.kernel.org/all/20230824143112.GA31208@redhat.com/
makes me think we should do this right now.

On 08/21, Oleg Nesterov wrote:
>
> 1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
>    can safely iterate the task->thread_group list. Even if this task exits
>    right after get_pid_task() (or goto retry) and pid_alive() returns 0.
>
>    Kill the unnecessary pid_alive() check.
>
> 2. next_thread() simply can't return NULL, kill the bogus "if (!next_task)"
>    check.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  kernel/bpf/task_iter.c | 7 -------
>  1 file changed, 7 deletions(-)
>
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index c4ab9d6cdbe9..4d1125108014 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -75,15 +75,8 @@ static struct task_struct *task_group_seq_get_next(struct bpf_iter_seq_task_comm
>  		return NULL;
>
>  retry:
> -	if (!pid_alive(task)) {
> -		put_task_struct(task);
> -		return NULL;
> -	}
> -
>  	next_task = next_thread(task);
>  	put_task_struct(task);
> -	if (!next_task)
> -		return NULL;
>
>  	saved_tid = *tid;
>  	*tid = __task_pid_nr_ns(next_task, PIDTYPE_PID, common->ns);
> --
> 2.25.1.362.g51ebf55
>
>
Eric W. Biederman Aug. 25, 2023, 1:36 p.m. UTC | #6
Oleg Nesterov <oleg@redhat.com> writes:

> OK, it seems that you are not going to take these preparatory
> cleanups ;)
>
> I'll resend along with the s/next_thread/__next_thread/ change.
> I was going to do the last change later, but this recent discussion
> https://lore.kernel.org/all/20230824143112.GA31208@redhat.com/
> makes me think we should do this right now.

For the record I find this code confusing, and wrong.

It looks like it wants to keep the task_struct pointer or possibly the
struct pid pointer like proc does, but then it winds up keeping a
userspace pid value and regenerating both the struct pid pointer and
the struct task_struct pointer.

Which means that task_group_seq_get_next is unnecessarily slow and has
a built in race condition which means it could wind up iterating through
a different process.

This whole thing looks to be a bad (aka racy) reimplementation of
first_tid and next_tid from proc.  I thought the changes were to
adapt to the needs of bpf, but on closer examination the code is
just racy.

For this code to be correct bpf_iter_seq_task_common needs to store
at a minimum a struct pid pointer.


Oleg your patch makes it easier to see what the how
far this is from first_tid/next_tid in proc.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Eric
Oleg Nesterov Aug. 25, 2023, 1:50 p.m. UTC | #7
On 08/25, Eric W. Biederman wrote:
>
> For the record I find this code confusing, and wrong.

Oh, yes...

> and has
> a built in race condition which means it could wind up iterating through
> a different process.

Yes, common->pid and/or common->pid_visiting can be reused

but I am not going to try to fix this ;)

> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Thanks!

Oleg.
diff mbox series

Patch

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index c4ab9d6cdbe9..4d1125108014 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -75,15 +75,8 @@  static struct task_struct *task_group_seq_get_next(struct bpf_iter_seq_task_comm
 		return NULL;
 
 retry:
-	if (!pid_alive(task)) {
-		put_task_struct(task);
-		return NULL;
-	}
-
 	next_task = next_thread(task);
 	put_task_struct(task);
-	if (!next_task)
-		return NULL;
 
 	saved_tid = *tid;
 	*tid = __task_pid_nr_ns(next_task, PIDTYPE_PID, common->ns);