diff mbox series

[1/6] fs/exec: Drop task_lock() inside __get_task_comm()

Message ID 20240602023754.25443-2-laoar.shao@gmail.com (mailing list archive)
State Not Applicable
Headers show
Series kernel: Avoid memcpy of task comm | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
netdev/tree_selection success Not a local patch
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-18 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17-O2
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18-O2
bpf/vmtest-bpf-next-VM_Test-42 success Logs for x86_64-llvm-18 / veristat
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-13 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-8 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-10 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-7 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release

Commit Message

Yafang Shao June 2, 2024, 2:37 a.m. UTC
Quoted from Linus [0]:

  Since user space can randomly change their names anyway, using locking
  was always wrong for readers (for writers it probably does make sense
  to have some lock - although practically speaking nobody cares there
  either, but at least for a writer some kind of race could have
  long-term mixed results

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/all/CAHk-=wivfrF0_zvf+oj6==Sh=-npJooP8chLPEfaFV0oNYTTBA@mail.gmail.com [0]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
---
 fs/exec.c             | 7 +++++--
 include/linux/sched.h | 2 +-
 2 files changed, 6 insertions(+), 3 deletions(-)

Comments

Eric W. Biederman June 2, 2024, 3:51 a.m. UTC | #1
Yafang Shao <laoar.shao@gmail.com> writes:

> Quoted from Linus [0]:
>
>   Since user space can randomly change their names anyway, using locking
>   was always wrong for readers (for writers it probably does make sense
>   to have some lock - although practically speaking nobody cares there
>   either, but at least for a writer some kind of race could have
>   long-term mixed results

Ugh.
Ick.

This code is buggy.

I won't argue that Linus is wrong, about removing the
task_lock.

Unfortunately strscpy_pad does not work properly with the
task_lock removed, and buf_size larger that TASK_COMM_LEN.
There is a race that will allow reading past the end
of tsk->comm, if we read while tsk->common is being
updated.

So __get_task_comm needs to look something like:

char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
{
	size_t len = buf_size;
        if (len > TASK_COMM_LEN)
        	len = TASK_COMM_LEN;
	memcpy(buf, tsk->comm, len);
        buf[len -1] = '\0';
	return buf;
}

What shows up in buf past the '\0' is not guaranteed in the above
version but I would be surprised if anyone cares.

If people do care the code can do something like:
char *last = strchr(buf);
memset(last, '\0', buf_size - (last - buf));

To zero everything in the buffer past the first '\0' byte.


Eric


> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/all/CAHk-=wivfrF0_zvf+oj6==Sh=-npJooP8chLPEfaFV0oNYTTBA@mail.gmail.com [0]
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Kees Cook <keescook@chromium.org>
> ---
>  fs/exec.c             | 7 +++++--
>  include/linux/sched.h | 2 +-
>  2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index b3c40fbb325f..b43992d35a8a 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1227,12 +1227,15 @@ static int unshare_sighand(struct task_struct *me)
>  	return 0;
>  }
>  
> +/*
> + * User space can randomly change their names anyway, so locking for readers
> + * doesn't make sense. For writers, locking is probably necessary, as a race
> + * condition could lead to long-term mixed results.
> + */
>  char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
>  {
> -	task_lock(tsk);
>  	/* Always NUL terminated and zero-padded */
>  	strscpy_pad(buf, tsk->comm, buf_size);
> -	task_unlock(tsk);
>  	return buf;
>  }
>  EXPORT_SYMBOL_GPL(__get_task_comm);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c75fd46506df..56a927393a38 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1083,7 +1083,7 @@ struct task_struct {
>  	 *
>  	 * - normally initialized setup_new_exec()
>  	 * - access it with [gs]et_task_comm()
> -	 * - lock it with task_lock()
> +	 * - lock it with task_lock() for writing
>  	 */
>  	char				comm[TASK_COMM_LEN];
Yafang Shao June 2, 2024, 6:56 a.m. UTC | #2
On Sun, Jun 2, 2024 at 11:52 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > Quoted from Linus [0]:
> >
> >   Since user space can randomly change their names anyway, using locking
> >   was always wrong for readers (for writers it probably does make sense
> >   to have some lock - although practically speaking nobody cares there
> >   either, but at least for a writer some kind of race could have
> >   long-term mixed results
>
> Ugh.
> Ick.
>
> This code is buggy.
>
> I won't argue that Linus is wrong, about removing the
> task_lock.
>
> Unfortunately strscpy_pad does not work properly with the
> task_lock removed, and buf_size larger that TASK_COMM_LEN.
> There is a race that will allow reading past the end
> of tsk->comm, if we read while tsk->common is being
> updated.

It appears so. Thanks for pointing it out. Additionally, other code,
such as the BPF helper bpf_get_current_comm(), also uses strscpy_pad()
directly without the task_lock. It seems we should change that as
well.

>
> So __get_task_comm needs to look something like:
>
> char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
> {
>         size_t len = buf_size;
>         if (len > TASK_COMM_LEN)
>                 len = TASK_COMM_LEN;
>         memcpy(buf, tsk->comm, len);
>         buf[len -1] = '\0';
>         return buf;
> }

Thanks for your suggestion.

>
> What shows up in buf past the '\0' is not guaranteed in the above
> version but I would be surprised if anyone cares.

I believe we pad it to prevent the leakage of kernel data. In this
case, since no kernel data will be leaked, the following change may be
unnecessary.

>
> If people do care the code can do something like:
> char *last = strchr(buf);
> memset(last, '\0', buf_size - (last - buf));
>
> To zero everything in the buffer past the first '\0' byte.
>
Alexei Starovoitov June 2, 2024, 4:35 p.m. UTC | #3
On Sat, Jun 1, 2024 at 11:57 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sun, Jun 2, 2024 at 11:52 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > Yafang Shao <laoar.shao@gmail.com> writes:
> >
> > > Quoted from Linus [0]:
> > >
> > >   Since user space can randomly change their names anyway, using locking
> > >   was always wrong for readers (for writers it probably does make sense
> > >   to have some lock - although practically speaking nobody cares there
> > >   either, but at least for a writer some kind of race could have
> > >   long-term mixed results
> >
> > Ugh.
> > Ick.
> >
> > This code is buggy.
> >
> > I won't argue that Linus is wrong, about removing the
> > task_lock.
> >
> > Unfortunately strscpy_pad does not work properly with the
> > task_lock removed, and buf_size larger that TASK_COMM_LEN.
> > There is a race that will allow reading past the end
> > of tsk->comm, if we read while tsk->common is being
> > updated.
>
> It appears so. Thanks for pointing it out. Additionally, other code,
> such as the BPF helper bpf_get_current_comm(), also uses strscpy_pad()
> directly without the task_lock. It seems we should change that as
> well.

Hmm. What race do you see?
If lock is removed from __get_task_comm() it probably can be removed from
__set_task_comm() as well.
And both are calling strscpy_pad to write and read comm.
So I don't see how it would read past sizeof(comm),
because 'buf' passed into __set_task_comm is NUL-terminated.
So the concurrent read will find it.

> >
> > So __get_task_comm needs to look something like:
> >
> > char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
> > {
> >         size_t len = buf_size;
> >         if (len > TASK_COMM_LEN)
> >                 len = TASK_COMM_LEN;
> >         memcpy(buf, tsk->comm, len);
> >         buf[len -1] = '\0';
> >         return buf;
> > }
>
> Thanks for your suggestion.
>
> >
> > What shows up in buf past the '\0' is not guaranteed in the above
> > version but I would be surprised if anyone cares.
>
> I believe we pad it to prevent the leakage of kernel data. In this
> case, since no kernel data will be leaked, the following change may be
> unnecessary.

It's not about leaking of kernel data, but more about not writing
garbage past NUL.
Because comm[] is a part of some record that is used as a key
in a hash map.
Eric W. Biederman June 2, 2024, 5:52 p.m. UTC | #4
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Sat, Jun 1, 2024 at 11:57 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>
>> On Sun, Jun 2, 2024 at 11:52 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> >
>> > Yafang Shao <laoar.shao@gmail.com> writes:
>> >
>> > > Quoted from Linus [0]:
>> > >
>> > >   Since user space can randomly change their names anyway, using locking
>> > >   was always wrong for readers (for writers it probably does make sense
>> > >   to have some lock - although practically speaking nobody cares there
>> > >   either, but at least for a writer some kind of race could have
>> > >   long-term mixed results
>> >
>> > Ugh.
>> > Ick.
>> >
>> > This code is buggy.
>> >
>> > I won't argue that Linus is wrong, about removing the
>> > task_lock.
>> >
>> > Unfortunately strscpy_pad does not work properly with the
>> > task_lock removed, and buf_size larger that TASK_COMM_LEN.
>> > There is a race that will allow reading past the end
>> > of tsk->comm, if we read while tsk->common is being
>> > updated.
>>
>> It appears so. Thanks for pointing it out. Additionally, other code,
>> such as the BPF helper bpf_get_current_comm(), also uses strscpy_pad()
>> directly without the task_lock. It seems we should change that as
>> well.
>
> Hmm. What race do you see?
> If lock is removed from __get_task_comm() it probably can be removed from
> __set_task_comm() as well.
> And both are calling strscpy_pad to write and read comm.
> So I don't see how it would read past sizeof(comm),
> because 'buf' passed into __set_task_comm is NUL-terminated.
> So the concurrent read will find it.

The read may race with a write that is changing the location
of '\0'.  Especially if the new value is shorter than
the old value.

If you are performing lockless reads and depending upon a '\0'
terminator without limiting yourself to the size of the buffer
there needs to be a big fat comment as to how in the world
you are guaranteed that a '\0' inside the buffer will always
be found.

Eric
Eric W. Biederman June 2, 2024, 5:56 p.m. UTC | #5
Yafang Shao <laoar.shao@gmail.com> writes:

> On Sun, Jun 2, 2024 at 11:52 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > Quoted from Linus [0]:
>> >
>> >   Since user space can randomly change their names anyway, using locking
>> >   was always wrong for readers (for writers it probably does make sense
>> >   to have some lock - although practically speaking nobody cares there
>> >   either, but at least for a writer some kind of race could have
>> >   long-term mixed results
>>
>> Ugh.
>> Ick.
>>
>> This code is buggy.
>>
>> I won't argue that Linus is wrong, about removing the
>> task_lock.
>>
>> Unfortunately strscpy_pad does not work properly with the
>> task_lock removed, and buf_size larger that TASK_COMM_LEN.
>> There is a race that will allow reading past the end
>> of tsk->comm, if we read while tsk->common is being
>> updated.
>
> It appears so. Thanks for pointing it out. Additionally, other code,
> such as the BPF helper bpf_get_current_comm(), also uses strscpy_pad()
> directly without the task_lock. It seems we should change that as
> well.

Which suggests that we could really use a helper that handles all of
the tricky business of reading the tsk->comm lock-free.

Eric
Alexei Starovoitov June 2, 2024, 6:23 p.m. UTC | #6
On Sun, Jun 2, 2024 at 10:53 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Sat, Jun 1, 2024 at 11:57 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>
> >> On Sun, Jun 2, 2024 at 11:52 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> >
> >> > Yafang Shao <laoar.shao@gmail.com> writes:
> >> >
> >> > > Quoted from Linus [0]:
> >> > >
> >> > >   Since user space can randomly change their names anyway, using locking
> >> > >   was always wrong for readers (for writers it probably does make sense
> >> > >   to have some lock - although practically speaking nobody cares there
> >> > >   either, but at least for a writer some kind of race could have
> >> > >   long-term mixed results
> >> >
> >> > Ugh.
> >> > Ick.
> >> >
> >> > This code is buggy.
> >> >
> >> > I won't argue that Linus is wrong, about removing the
> >> > task_lock.
> >> >
> >> > Unfortunately strscpy_pad does not work properly with the
> >> > task_lock removed, and buf_size larger that TASK_COMM_LEN.
> >> > There is a race that will allow reading past the end
> >> > of tsk->comm, if we read while tsk->common is being
> >> > updated.
> >>
> >> It appears so. Thanks for pointing it out. Additionally, other code,
> >> such as the BPF helper bpf_get_current_comm(), also uses strscpy_pad()
> >> directly without the task_lock. It seems we should change that as
> >> well.
> >
> > Hmm. What race do you see?
> > If lock is removed from __get_task_comm() it probably can be removed from
> > __set_task_comm() as well.
> > And both are calling strscpy_pad to write and read comm.
> > So I don't see how it would read past sizeof(comm),
> > because 'buf' passed into __set_task_comm is NUL-terminated.
> > So the concurrent read will find it.
>
> The read may race with a write that is changing the location
> of '\0'.  Especially if the new value is shorter than
> the old value.

so ?
strscpy_pad in __[gs]et_task_comm will read/write either long
or byte at a time.
Assume 64 bit and, say, we had comm where 2nd u64 had NUL.
Now two cpus are racing. One is writing shorter comm.
Another is reading.
The latter can read 1st u64 without NUL and will proceed
to read 2nd u64. Either it will read the old u64 with NUL in it
or it will read all zeros in 2nd u64 or some zeros in 2nd u64
depending on how the compiler generated memset(.., 0, ..)
as part of strscpy_pad().
_pad() part is critical here.
If it was just strscpy() then there would indeed be a chance
of reading both u64-s and not finding NUL in any of them.

> If you are performing lockless reads and depending upon a '\0'
> terminator without limiting yourself to the size of the buffer
> there needs to be a big fat comment as to how in the world
> you are guaranteed that a '\0' inside the buffer will always
> be found.

I think Yafang can certainly add such a comment next to
__[gs]et_task_comm.

I prefer to avoid open coding memcpy + mmemset when strscpy_pad works.
Linus Torvalds June 2, 2024, 8:11 p.m. UTC | #7
On Sun, 2 Jun 2024 at 10:53, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The read may race with a write that is changing the location
> of '\0'.  Especially if the new value is shorter than
> the old value.

It *shouldn't* happen.

So 'strscpy()' itself is written to be NUL-safe, in that if it ever
copies a NUL character, it will stop. Admittedly the byte loop at the
end might technically need a READ_ONCE() for that to eb strictly true
in theory, but in practice it already is.

And even if the new string is shorter, the comm[] array will always
have a NUL terminator _somewhere_, in how the last byte is never
non-NUL.

Now, the only real issue is if something writes *to* the  comm[] array
without following the rules properly - like writing a non-NULL
character to the end of the array before then filling it in with NUL
again.

But that would be a bug on the comm[] writing side, I feel, not a bug
on the reader side.

               Linus
Yafang Shao June 3, 2024, 11:35 a.m. UTC | #8
On Mon, Jun 3, 2024 at 2:23 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Jun 2, 2024 at 10:53 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >
> > > On Sat, Jun 1, 2024 at 11:57 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >>
> > >> On Sun, Jun 2, 2024 at 11:52 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > >> >
> > >> > Yafang Shao <laoar.shao@gmail.com> writes:
> > >> >
> > >> > > Quoted from Linus [0]:
> > >> > >
> > >> > >   Since user space can randomly change their names anyway, using locking
> > >> > >   was always wrong for readers (for writers it probably does make sense
> > >> > >   to have some lock - although practically speaking nobody cares there
> > >> > >   either, but at least for a writer some kind of race could have
> > >> > >   long-term mixed results
> > >> >
> > >> > Ugh.
> > >> > Ick.
> > >> >
> > >> > This code is buggy.
> > >> >
> > >> > I won't argue that Linus is wrong, about removing the
> > >> > task_lock.
> > >> >
> > >> > Unfortunately strscpy_pad does not work properly with the
> > >> > task_lock removed, and buf_size larger that TASK_COMM_LEN.
> > >> > There is a race that will allow reading past the end
> > >> > of tsk->comm, if we read while tsk->common is being
> > >> > updated.
> > >>
> > >> It appears so. Thanks for pointing it out. Additionally, other code,
> > >> such as the BPF helper bpf_get_current_comm(), also uses strscpy_pad()
> > >> directly without the task_lock. It seems we should change that as
> > >> well.
> > >
> > > Hmm. What race do you see?
> > > If lock is removed from __get_task_comm() it probably can be removed from
> > > __set_task_comm() as well.
> > > And both are calling strscpy_pad to write and read comm.
> > > So I don't see how it would read past sizeof(comm),
> > > because 'buf' passed into __set_task_comm is NUL-terminated.
> > > So the concurrent read will find it.
> >
> > The read may race with a write that is changing the location
> > of '\0'.  Especially if the new value is shorter than
> > the old value.
>
> so ?
> strscpy_pad in __[gs]et_task_comm will read/write either long
> or byte at a time.
> Assume 64 bit and, say, we had comm where 2nd u64 had NUL.
> Now two cpus are racing. One is writing shorter comm.
> Another is reading.
> The latter can read 1st u64 without NUL and will proceed
> to read 2nd u64. Either it will read the old u64 with NUL in it
> or it will read all zeros in 2nd u64 or some zeros in 2nd u64
> depending on how the compiler generated memset(.., 0, ..)
> as part of strscpy_pad().
> _pad() part is critical here.
> If it was just strscpy() then there would indeed be a chance
> of reading both u64-s and not finding NUL in any of them.
>
> > If you are performing lockless reads and depending upon a '\0'
> > terminator without limiting yourself to the size of the buffer
> > there needs to be a big fat comment as to how in the world
> > you are guaranteed that a '\0' inside the buffer will always
> > be found.
>
> I think Yafang can certainly add such a comment next to
> __[gs]et_task_comm.
>
> I prefer to avoid open coding memcpy + mmemset when strscpy_pad works.

Thanks for your explanation.
I will add a comment for it in the next version.
Matus Jokay June 4, 2024, 1:02 p.m. UTC | #9
Hi Yafang,

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c75fd46506df..56a927393a38 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1083,7 +1083,7 @@ struct task_struct {
>  	 *
>  	 * - normally initialized setup_new_exec()
>  	 * - access it with [gs]et_task_comm()
> -	 * - lock it with task_lock()
> +	 * - lock it with task_lock() for writing
since you are updating this comment, you might as well fix other comments concerning ->comm, e.g.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c75fd46506df..95b382d790d9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1081,9 +1081,9 @@ struct task_struct {
 	/*
 	 * executable name, excluding path.
 	 *
-	 * - normally initialized setup_new_exec()
-	 * - access it with [gs]et_task_comm()
-	 * - lock it with task_lock()
+	 * - normally initialized begin_new_exec()
+	 * - access it with __[gs]et_task_comm()
+	 * - lock it with task_lock() for writing
 	 */
 	char				comm[TASK_COMM_LEN];
Matus Jokay June 4, 2024, 8:01 p.m. UTC | #10
Sorry guys for the mistake,

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c75fd46506df..56a927393a38 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1083,7 +1083,7 @@ struct task_struct {
>  	 *
>  	 * - normally initialized setup_new_exec()
>  	 * - access it with [gs]et_task_comm()
> -	 * - lock it with task_lock()
> +	 * - lock it with task_lock() for writing
there should be fixed only the comment about ->comm initialization during exec.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c75fd46506df..48aa5c85ed9e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1081,9 +1081,9 @@ struct task_struct {
 	/*
 	 * executable name, excluding path.
 	 *
-	 * - normally initialized setup_new_exec()
+	 * - normally initialized begin_new_exec()
 	 * - access it with [gs]et_task_comm()
-	 * - lock it with task_lock()
+	 * - lock it with task_lock() for writing
 	 */
 	char				comm[TASK_COMM_LEN];
 
Again, sorry for the noise. It's a very minor fix, but maybe even a small fix to the documentation can help increase the readability of the code.

--
Thanks
mY
Yafang Shao June 5, 2024, 2:48 a.m. UTC | #11
On Wed, Jun 5, 2024 at 4:01 AM Matus Jokay <matus.jokay@stuba.sk> wrote:
>
> Sorry guys for the mistake,
>
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index c75fd46506df..56a927393a38 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1083,7 +1083,7 @@ struct task_struct {
> >        *
> >        * - normally initialized setup_new_exec()
> >        * - access it with [gs]et_task_comm()
> > -      * - lock it with task_lock()
> > +      * - lock it with task_lock() for writing
> there should be fixed only the comment about ->comm initialization during exec.
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c75fd46506df..48aa5c85ed9e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1081,9 +1081,9 @@ struct task_struct {
>         /*
>          * executable name, excluding path.
>          *
> -        * - normally initialized setup_new_exec()
> +        * - normally initialized begin_new_exec()
>          * - access it with [gs]et_task_comm()
> -        * - lock it with task_lock()
> +        * - lock it with task_lock() for writing
>          */
>         char                            comm[TASK_COMM_LEN];
>
> Again, sorry for the noise. It's a very minor fix, but maybe even a small fix to the documentation can help increase the readability of the code.
>

Thank you for your improvement. It is very helpful. I will include it
in the next version.
Eric W. Biederman June 10, 2024, 12:34 p.m. UTC | #12
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Sun, Jun 2, 2024 at 10:53 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> If you are performing lockless reads and depending upon a '\0'
>> terminator without limiting yourself to the size of the buffer
>> there needs to be a big fat comment as to how in the world
>> you are guaranteed that a '\0' inside the buffer will always
>> be found.
>
> I think Yafang can certainly add such a comment next to
> __[gs]et_task_comm.
>
> I prefer to avoid open coding memcpy + mmemset when strscpy_pad works.

Looking through the code in set_task_comm
strscpy_pad only works when both the source and designation are aligned.
Otherwise it performs a byte a time copy, and is most definitely
susceptible to the race I observed.

Further I looked a couple of the uses of set_task_com, in
fs/proc/base.c, kernel/kthread.c, and kernel/sys.c.

Nowhere do I see a guarantee that the source buffer is word aligned
or even something that would reasonably cause a compiler to place the
buffer that is being passed to set_task_comm to be word aligned.

As far as I can tell it is completely up to the compiler if it will
cause strscpy_pad to honor the word at a time guarantee needed
to make strscpy_pad safe for reading the information.

This is not to say we can't make it safe.

The easiest would be to create an aligned temporary buffer in
set_task_comm, and preserve the existing interface.  Alternatively
a type that has the appropriate size and alignment could be used
as input to set_task_comm and it could be caller's responsibility
to use it.

While we can definitely make reading task->comm happen without taking
the lock.  Doing so without updating set_task_comm to provide the
guarantees needed to make it safe, looks like a case of play silly
games, win silly prizes.

Eric
Alexei Starovoitov June 10, 2024, 11:01 p.m. UTC | #13
On Mon, Jun 10, 2024 at 5:34 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Sun, Jun 2, 2024 at 10:53 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>
> >> If you are performing lockless reads and depending upon a '\0'
> >> terminator without limiting yourself to the size of the buffer
> >> there needs to be a big fat comment as to how in the world
> >> you are guaranteed that a '\0' inside the buffer will always
> >> be found.
> >
> > I think Yafang can certainly add such a comment next to
> > __[gs]et_task_comm.
> >
> > I prefer to avoid open coding memcpy + mmemset when strscpy_pad works.
>
> Looking through the code in set_task_comm
> strscpy_pad only works when both the source and designation are aligned.
> Otherwise it performs a byte a time copy, and is most definitely
> susceptible to the race I observed.

Byte copy doesn't have an issue either.
Due to padding there is always a zero there.
Worst case in the last byte. So dst buffer will be zero terminated.
diff mbox series

Patch

diff --git a/fs/exec.c b/fs/exec.c
index b3c40fbb325f..b43992d35a8a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1227,12 +1227,15 @@  static int unshare_sighand(struct task_struct *me)
 	return 0;
 }
 
+/*
+ * User space can randomly change their names anyway, so locking for readers
+ * doesn't make sense. For writers, locking is probably necessary, as a race
+ * condition could lead to long-term mixed results.
+ */
 char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
 {
-	task_lock(tsk);
 	/* Always NUL terminated and zero-padded */
 	strscpy_pad(buf, tsk->comm, buf_size);
-	task_unlock(tsk);
 	return buf;
 }
 EXPORT_SYMBOL_GPL(__get_task_comm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c75fd46506df..56a927393a38 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1083,7 +1083,7 @@  struct task_struct {
 	 *
 	 * - normally initialized setup_new_exec()
 	 * - access it with [gs]et_task_comm()
-	 * - lock it with task_lock()
+	 * - lock it with task_lock() for writing
 	 */
 	char				comm[TASK_COMM_LEN];