Message ID | 254505c9-2b76-ebeb-306c-02aaf1704b88@kernel.dk (mailing list archive)
---|---
State | New, archived
Series | [RFC] signalfd: add support for SFD_TASK
On Wed, Nov 27, 2019 at 6:11 AM Jens Axboe <axboe@kernel.dk> wrote:
> I posted this a few weeks back, took another look at it and refined it a
> bit. I'd like some input on the viability of this approach.
>
> A new signalfd setup flag is added, SFD_TASK. This is only valid if used
> with SFD_CLOEXEC. If set, the task setting up the signalfd descriptor is
> remembered in the signalfd context, and will be the one we use for
> checking signals in the poll/read handlers in signalfd.
>
> This is needed to make signalfd useful with io_uring and aio, of which
> the former in particular has my interest.
>
> I _think_ this is sane. To prevent the case of a task clearing O_CLOEXEC
> on the signalfd descriptor, forking, and then exiting, we grab a
> reference to the task when we assign it. If that original task exits, we
> catch it in signalfd_flush() and ensure waiters are woken up.

Mh... that's not really reliable, because you only get ->flush() from
the last exiting thread (or more precisely, the last exiting task that
shares the files_struct).

What is your goal here? To have a reference to a task without keeping
the entire task_struct around in memory if someone leaks the signalfd
to another process - basically like a weak pointer? If so, you could
store a refcounted reference to "struct pid" instead of a refcounted
reference to the task_struct, and then do the lookup of the
task_struct on ->poll and ->read (similar to what procfs does).

In other words:

> diff --git a/fs/signalfd.c b/fs/signalfd.c
> index 44b6845b071c..4bbdab9438c1 100644
> --- a/fs/signalfd.c
> +++ b/fs/signalfd.c
> @@ -50,28 +50,62 @@ void signalfd_cleanup(struct sighand_struct *sighand)
>
>  struct signalfd_ctx {
>  	sigset_t sigmask;
> +	struct task_struct *task;

Turn this into "struct pid *task_pid".
> +static int signalfd_flush(struct file *file, void *data)
> +{
> +	struct signalfd_ctx *ctx = file->private_data;
> +	struct task_struct *tsk = ctx->task;
> +
> +	if (tsk == current) {
> +		ctx->task = NULL;
> +		wake_up(&tsk->sighand->signalfd_wqh);
> +		put_task_struct(tsk);
> +	}
> +
> +	return 0;
> +}

Get rid of this.

> +static struct task_struct *signalfd_get_task(struct signalfd_ctx *ctx)
> +{
> +	struct task_struct *tsk = ctx->task ?: current;
> +
> +	get_task_struct(tsk);
> +	return tsk;
> +}

Replace this with something like:

	if (ctx->task_pid)
		return get_pid_task(ctx->task_pid, PIDTYPE_PID); /* will return NULL if the task is gone */
	else
		return get_task_struct(current);

and add NULL checks to the places that call this.

> @@ -167,10 +201,11 @@ static ssize_t signalfd_dequeue(struct signalfd_ctx *ctx, kernel_siginfo_t *info
>  					int nonblock)
>  {
>  	ssize_t ret;
> +	struct task_struct *tsk = signalfd_get_task(ctx);

(Here we could even optimize away the refcounting using RCU if we
wanted to, since unlike in the ->poll handler, we don't need to be
able to block.)

> 	if (ufd == -1) {
> -		ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> +		ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> 		if (!ctx)
> 			return -ENOMEM;
>
> 		ctx->sigmask = *mask;
> +		if (flags & SFD_TASK) {
> +			ctx->task = current;
> +			get_task_struct(ctx->task);
> +		}

and here do "ctx->task_pid = get_task_pid(current, PIDTYPE_PID)"
On 11/27/19 12:23 PM, Jann Horn wrote:
> On Wed, Nov 27, 2019 at 6:11 AM Jens Axboe <axboe@kernel.dk> wrote:
>> I posted this a few weeks back, took another look at it and refined it a
>> bit. I'd like some input on the viability of this approach.
>>
>> A new signalfd setup flag is added, SFD_TASK. This is only valid if used
>> with SFD_CLOEXEC. If set, the task setting up the signalfd descriptor is
>> remembered in the signalfd context, and will be the one we use for
>> checking signals in the poll/read handlers in signalfd.
>>
>> This is needed to make signalfd useful with io_uring and aio, of which
>> the former in particular has my interest.
>>
>> I _think_ this is sane. To prevent the case of a task clearing O_CLOEXEC
>> on the signalfd descriptor, forking, and then exiting, we grab a
>> reference to the task when we assign it. If that original task exits, we
>> catch it in signalfd_flush() and ensure waiters are woken up.
>
> Mh... that's not really reliable, because you only get ->flush() from
> the last exiting thread (or more precisely, the last exiting task that
> shares the files_struct).
>
> What is your goal here? To have a reference to a task without keeping
> the entire task_struct around in memory if someone leaks the signalfd
> to another process - basically like a weak pointer? If so, you could
> store a refcounted reference to "struct pid" instead of a refcounted
> reference to the task_struct, and then do the lookup of the
> task_struct on ->poll and ->read (similar to what procfs does).

Yeah, I think that works out much better (and cleaner). How about this,
then? Follows your advice and turns it into a struct pid instead. I
don't particularly like the -ESRCH in dequeue and setup, what do you
think? For poll, POLLERR seems like a prudent choice.

Tested with the test cases I sent out yesterday, works for me.
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 44b6845b071c..ccb1173b20aa 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -50,6 +50,7 @@ void signalfd_cleanup(struct sighand_struct *sighand)
 
 struct signalfd_ctx {
 	sigset_t sigmask;
+	struct pid *task_pid;
 };
 
 static int signalfd_release(struct inode *inode, struct file *file)
@@ -58,20 +59,41 @@ static int signalfd_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+static void signalfd_put_task(struct signalfd_ctx *ctx, struct task_struct *tsk)
+{
+	if (ctx->task_pid)
+		put_task_struct(tsk);
+}
+
+static struct task_struct *signalfd_get_task(struct signalfd_ctx *ctx)
+{
+	if (ctx->task_pid)
+		return get_pid_task(ctx->task_pid, PIDTYPE_PID);
+
+	return current;
+}
+
 static __poll_t signalfd_poll(struct file *file, poll_table *wait)
 {
 	struct signalfd_ctx *ctx = file->private_data;
+	struct task_struct *tsk;
 	__poll_t events = 0;
 
-	poll_wait(file, &current->sighand->signalfd_wqh, wait);
+	tsk = signalfd_get_task(ctx);
+	if (tsk) {
+		poll_wait(file, &tsk->sighand->signalfd_wqh, wait);
 
-	spin_lock_irq(&current->sighand->siglock);
-	if (next_signal(&current->pending, &ctx->sigmask) ||
-	    next_signal(&current->signal->shared_pending,
-			&ctx->sigmask))
-		events |= EPOLLIN;
-	spin_unlock_irq(&current->sighand->siglock);
+		spin_lock_irq(&tsk->sighand->siglock);
+		if (next_signal(&tsk->pending, &ctx->sigmask) ||
+		    next_signal(&tsk->signal->shared_pending,
+				&ctx->sigmask))
+			events |= EPOLLIN;
+		spin_unlock_irq(&tsk->sighand->siglock);
+		signalfd_put_task(ctx, tsk);
+	} else {
+		events |= EPOLLERR;
+	}
 
 	return events;
 }
@@ -167,10 +189,15 @@ static ssize_t signalfd_dequeue(struct signalfd_ctx *ctx, kernel_siginfo_t *info
 				int nonblock)
 {
 	ssize_t ret;
+	struct task_struct *tsk;
 	DECLARE_WAITQUEUE(wait, current);
 
-	spin_lock_irq(&current->sighand->siglock);
-	ret = dequeue_signal(current, &ctx->sigmask, info);
+	tsk = signalfd_get_task(ctx);
+	if (!tsk)
+		return -ESRCH;
+
+	spin_lock_irq(&tsk->sighand->siglock);
+	ret = dequeue_signal(tsk, &ctx->sigmask, info);
 	switch (ret) {
 	case 0:
 		if (!nonblock)
 			break;
 		ret = -EAGAIN;
 		/* fall through */
 	default:
-		spin_unlock_irq(&current->sighand->siglock);
+		spin_unlock_irq(&tsk->sighand->siglock);
+		signalfd_put_task(ctx, tsk);
 		return ret;
 	}
 
-	add_wait_queue(&current->sighand->signalfd_wqh, &wait);
+	add_wait_queue(&tsk->sighand->signalfd_wqh, &wait);
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		ret = dequeue_signal(current, &ctx->sigmask, info);
+		ret = dequeue_signal(tsk, &ctx->sigmask, info);
 		if (ret != 0)
 			break;
 		if (signal_pending(current)) {
 			ret = -ERESTARTSYS;
 			break;
 		}
-		spin_unlock_irq(&current->sighand->siglock);
+		spin_unlock_irq(&tsk->sighand->siglock);
 		schedule();
-		spin_lock_irq(&current->sighand->siglock);
+		spin_lock_irq(&tsk->sighand->siglock);
 	}
-	spin_unlock_irq(&current->sighand->siglock);
+	spin_unlock_irq(&tsk->sighand->siglock);
 
-	remove_wait_queue(&current->sighand->signalfd_wqh, &wait);
+	remove_wait_queue(&tsk->sighand->signalfd_wqh, &wait);
 	__set_current_state(TASK_RUNNING);
 
+	signalfd_put_task(ctx, tsk);
 	return ret;
 }
 
@@ -267,19 +296,24 @@ static int do_signalfd4(int ufd, sigset_t *mask, int flags)
 	/* Check the SFD_* constants for consistency.  */
 	BUILD_BUG_ON(SFD_CLOEXEC != O_CLOEXEC);
 	BUILD_BUG_ON(SFD_NONBLOCK != O_NONBLOCK);
+	BUILD_BUG_ON(SFD_TASK & (SFD_CLOEXEC | SFD_NONBLOCK));
 
-	if (flags & ~(SFD_CLOEXEC | SFD_NONBLOCK))
+	if (flags & ~(SFD_CLOEXEC | SFD_NONBLOCK | SFD_TASK))
+		return -EINVAL;
+	if ((flags & (SFD_CLOEXEC | SFD_TASK)) == SFD_TASK)
 		return -EINVAL;
 
 	sigdelsetmask(mask, sigmask(SIGKILL) | sigmask(SIGSTOP));
 	signotset(mask);
 
 	if (ufd == -1) {
-		ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+		ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
 		if (!ctx)
 			return -ENOMEM;
 
 		ctx->sigmask = *mask;
+		if (flags & SFD_TASK)
+			ctx->task_pid = get_task_pid(current, PIDTYPE_PID);
 
 		/*
 		 * When we call this, the initialization must be complete, since
@@ -290,6 +324,7 @@ static int do_signalfd4(int ufd, sigset_t *mask, int flags)
 		if (ufd < 0)
 			kfree(ctx);
 	} else {
+		struct task_struct *tsk;
 		struct fd f = fdget(ufd);
 		if (!f.file)
 			return -EBADF;
@@ -298,11 +333,17 @@ static int do_signalfd4(int ufd, sigset_t *mask, int flags)
 			fdput(f);
 			return -EINVAL;
 		}
-		spin_lock_irq(&current->sighand->siglock);
+		tsk = signalfd_get_task(ctx);
+		if (!tsk) {
+			fdput(f);
+			return -ESRCH;
+		}
+		spin_lock_irq(&tsk->sighand->siglock);
 		ctx->sigmask = *mask;
-		spin_unlock_irq(&current->sighand->siglock);
+		spin_unlock_irq(&tsk->sighand->siglock);
 
-		wake_up(&current->sighand->signalfd_wqh);
+		wake_up(&tsk->sighand->signalfd_wqh);
+		signalfd_put_task(ctx, tsk);
 		fdput(f);
 	}
 
diff --git a/include/uapi/linux/signalfd.h b/include/uapi/linux/signalfd.h
index 83429a05b698..064c5dc3eb99 100644
--- a/include/uapi/linux/signalfd.h
+++ b/include/uapi/linux/signalfd.h
@@ -16,6 +16,7 @@
 /* Flags for signalfd4.  */
 #define SFD_CLOEXEC	O_CLOEXEC
 #define SFD_NONBLOCK	O_NONBLOCK
+#define SFD_TASK	00000001
 
 struct signalfd_siginfo {
 	__u32 ssi_signo;
On Wed, Nov 27, 2019 at 9:48 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 11/27/19 12:23 PM, Jann Horn wrote:
> > On Wed, Nov 27, 2019 at 6:11 AM Jens Axboe <axboe@kernel.dk> wrote:
> >> I posted this a few weeks back, took another look at it and refined it a
> >> bit. I'd like some input on the viability of this approach.
> >>
> >> A new signalfd setup flag is added, SFD_TASK. This is only valid if used
> >> with SFD_CLOEXEC. If set, the task setting up the signalfd descriptor is
> >> remembered in the signalfd context, and will be the one we use for
> >> checking signals in the poll/read handlers in signalfd.
> >>
> >> This is needed to make signalfd useful with io_uring and aio, of which
> >> the former in particular has my interest.
> >>
> >> I _think_ this is sane. To prevent the case of a task clearing O_CLOEXEC
> >> on the signalfd descriptor, forking, and then exiting, we grab a
> >> reference to the task when we assign it. If that original task exits, we
> >> catch it in signalfd_flush() and ensure waiters are woken up.
> >
> > Mh... that's not really reliable, because you only get ->flush() from
> > the last exiting thread (or more precisely, the last exiting task that
> > shares the files_struct).
> >
> > What is your goal here? To have a reference to a task without keeping
> > the entire task_struct around in memory if someone leaks the signalfd
> > to another process - basically like a weak pointer? If so, you could
> > store a refcounted reference to "struct pid" instead of a refcounted
> > reference to the task_struct, and then do the lookup of the
> > task_struct on ->poll and ->read (similar to what procfs does).
>
> Yeah, I think that works out much better (and cleaner). How about this,
> then? Follows your advice and turns it into a struct pid instead. I
> don't particularly like the -ESRCH in dequeue and setup, what do you
> think? For poll, POLLERR seems like a prudent choice.

-ESRCH may be kinda weird, but I also can't think of anything
better... and it does describe the problem pretty accurately: The task
whose signal state you're trying to inspect is gone. I went through
the list of errnos, and everything else sounded more weird...

One more thing, though: We'll have to figure out some way to
invalidate the fd when the target goes through execve(), in particular
if it's a setuid execution. Otherwise we'll be able to just steal
signals that were intended for the other task, that's probably not
good.

So we should:
 a) prevent using ->wait() on an old signalfd once the task has gone
    through execve()
 b) kick off all existing waiters
 c) most importantly, prevent ->read() on an old signalfd once the
    task has gone through execve()

We probably want to avoid using the cred_guard_mutex here, since it is
quite broad and has some deadlocking issues; it might make sense to
put the update of ->self_exec_id in fs/exec.c under something like the
siglock, and then for a) and c) we can check whether the
->self_exec_id changed while holding the siglock, and for b) we can
add a call to signalfd_cleanup() after the ->self_exec_id change.

> +static void signalfd_put_task(struct signalfd_ctx *ctx, struct task_struct *tsk)
> +{
> +	if (ctx->task_pid)
> +		put_task_struct(tsk);
> +}
> +
> +static struct task_struct *signalfd_get_task(struct signalfd_ctx *ctx)
> +{
> +	if (ctx->task_pid)
> +		return get_pid_task(ctx->task_pid, PIDTYPE_PID);
> +
> +	return current;
> +}

This works, and I guess it's a question of coding style... but I'd
kinda prefer to do the refcount operation in both cases, so that the
semantics of the returned reference are simply "holds a reference"
instead of "either holds a reference or borrows from current depending
on ctx->task_pid". But if you feel strongly about it, feel free to
keep it as-is.

[...]

> -	add_wait_queue(&current->sighand->signalfd_wqh, &wait);
> +	add_wait_queue(&tsk->sighand->signalfd_wqh, &wait);
> 	for (;;) {
> 		set_current_state(TASK_INTERRUPTIBLE);
> -		ret = dequeue_signal(current, &ctx->sigmask, info);
> +		ret = dequeue_signal(tsk, &ctx->sigmask, info);
> 		if (ret != 0)
> 			break;
> 		if (signal_pending(current)) {
> 			ret = -ERESTARTSYS;
> 			break;
> 		}
> -		spin_unlock_irq(&current->sighand->siglock);
> +		spin_unlock_irq(&tsk->sighand->siglock);
> 		schedule();

Should we be dropping the reference to the task before schedule() and
re-acquiring it afterwards so that if we're blocked on a signalfd read
and then the corresponding task dies, the refcount can drop to zero
and we can get woken up? Probably doesn't matter, but seems a bit
cleaner to me.

> -		spin_lock_irq(&current->sighand->siglock);
> +		spin_lock_irq(&tsk->sighand->siglock);
> 	}
> -	spin_unlock_irq(&current->sighand->siglock);
> +	spin_unlock_irq(&tsk->sighand->siglock);
>
> -	remove_wait_queue(&current->sighand->signalfd_wqh, &wait);
> +	remove_wait_queue(&tsk->sighand->signalfd_wqh, &wait);
> 	__set_current_state(TASK_RUNNING);
>
> +	signalfd_put_task(ctx, tsk);
> 	return ret;
> }
>
> @@ -267,19 +296,24 @@ static int do_signalfd4(int ufd, sigset_t *mask, int flags)
> 	/* Check the SFD_* constants for consistency.  */
> 	BUILD_BUG_ON(SFD_CLOEXEC != O_CLOEXEC);
> 	BUILD_BUG_ON(SFD_NONBLOCK != O_NONBLOCK);
> +	BUILD_BUG_ON(SFD_TASK & (SFD_CLOEXEC | SFD_NONBLOCK));
>
> -	if (flags & ~(SFD_CLOEXEC | SFD_NONBLOCK))
> +	if (flags & ~(SFD_CLOEXEC | SFD_NONBLOCK | SFD_TASK))
> +		return -EINVAL;
> +	if ((flags & (SFD_CLOEXEC | SFD_TASK)) == SFD_TASK)
> 		return -EINVAL;

(non-actionable comment: It seems kinda weird that you can specify
these parameters with no effect for the `ufd != -1` case... but since
the existing parameters already work that way, I guess it's
consistent.)
On 11/27/19 4:27 PM, Jann Horn wrote:
> On Wed, Nov 27, 2019 at 9:48 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 11/27/19 12:23 PM, Jann Horn wrote:
>>> On Wed, Nov 27, 2019 at 6:11 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>> I posted this a few weeks back, took another look at it and refined it a
>>>> bit. I'd like some input on the viability of this approach.
>>>>
>>>> A new signalfd setup flag is added, SFD_TASK. This is only valid if used
>>>> with SFD_CLOEXEC. If set, the task setting up the signalfd descriptor is
>>>> remembered in the signalfd context, and will be the one we use for
>>>> checking signals in the poll/read handlers in signalfd.
>>>>
>>>> This is needed to make signalfd useful with io_uring and aio, of which
>>>> the former in particular has my interest.
>>>>
>>>> I _think_ this is sane. To prevent the case of a task clearing O_CLOEXEC
>>>> on the signalfd descriptor, forking, and then exiting, we grab a
>>>> reference to the task when we assign it. If that original task exits, we
>>>> catch it in signalfd_flush() and ensure waiters are woken up.
>>>
>>> Mh... that's not really reliable, because you only get ->flush() from
>>> the last exiting thread (or more precisely, the last exiting task that
>>> shares the files_struct).
>>>
>>> What is your goal here? To have a reference to a task without keeping
>>> the entire task_struct around in memory if someone leaks the signalfd
>>> to another process - basically like a weak pointer? If so, you could
>>> store a refcounted reference to "struct pid" instead of a refcounted
>>> reference to the task_struct, and then do the lookup of the
>>> task_struct on ->poll and ->read (similar to what procfs does).
>>
>> Yeah, I think that works out much better (and cleaner). How about this,
>> then? Follows your advice and turns it into a struct pid instead. I
>> don't particularly like the -ESRCH in dequeue and setup, what do you
>> think? For poll, POLLERR seems like a prudent choice.
>
> -ESRCH may be kinda weird, but I also can't think of anything
> better... and it does describe the problem pretty accurately: The task
> whose signal state you're trying to inspect is gone. I went through
> the list of errnos, and everything else sounded more weird...

Right, that's why I ultimately ended up with -ESRCH. But I'll take that
as consensus :-)

> One more thing, though: We'll have to figure out some way to
> invalidate the fd when the target goes through execve(), in particular
> if it's a setuid execution. Otherwise we'll be able to just steal
> signals that were intended for the other task, that's probably not
> good.
>
> So we should:
>  a) prevent using ->wait() on an old signalfd once the task has gone
>     through execve()
>  b) kick off all existing waiters
>  c) most importantly, prevent ->read() on an old signalfd once the
>     task has gone through execve()
>
> We probably want to avoid using the cred_guard_mutex here, since it is
> quite broad and has some deadlocking issues; it might make sense to
> put the update of ->self_exec_id in fs/exec.c under something like the
> siglock, and then for a) and c) we can check whether the
> ->self_exec_id changed while holding the siglock, and for b) we can
> add a call to signalfd_cleanup() after the ->self_exec_id change.

OK, that seems like one for after the break. Was hoping there'd be a
more trivial way to accomplish that, I'll give it some thought.

>> +static void signalfd_put_task(struct signalfd_ctx *ctx, struct task_struct *tsk)
>> +{
>> +	if (ctx->task_pid)
>> +		put_task_struct(tsk);
>> +}
>> +
>> +static struct task_struct *signalfd_get_task(struct signalfd_ctx *ctx)
>> +{
>> +	if (ctx->task_pid)
>> +		return get_pid_task(ctx->task_pid, PIDTYPE_PID);
>> +
>> +	return current;
>> +}
>
> This works, and I guess it's a question of coding style... but I'd
> kinda prefer to do the refcount operation in both cases, so that the
> semantics of the returned reference are simply "holds a reference"
> instead of "either holds a reference or borrows from current depending
> on ctx->task_pid". But if you feel strongly about it, feel free to
> keep it as-is.

I don't feel super strongly about it, but I wanted to avoid adding an
unnecessary get/put of the current task for the existing use cases of
signalfd. So I'll probably just keep it as-is.

>> -	add_wait_queue(&current->sighand->signalfd_wqh, &wait);
>> +	add_wait_queue(&tsk->sighand->signalfd_wqh, &wait);
>> 	for (;;) {
>> 		set_current_state(TASK_INTERRUPTIBLE);
>> -		ret = dequeue_signal(current, &ctx->sigmask, info);
>> +		ret = dequeue_signal(tsk, &ctx->sigmask, info);
>> 		if (ret != 0)
>> 			break;
>> 		if (signal_pending(current)) {
>> 			ret = -ERESTARTSYS;
>> 			break;
>> 		}
>> -		spin_unlock_irq(&current->sighand->siglock);
>> +		spin_unlock_irq(&tsk->sighand->siglock);
>> 		schedule();
>
> Should we be dropping the reference to the task before schedule() and
> re-acquiring it afterwards so that if we're blocked on a signalfd read
> and then the corresponding task dies, the refcount can drop to zero
> and we can get woken up? Probably doesn't matter, but seems a bit
> cleaner to me.

That would be simple enough to do, as we know that tsk is either still
the same, or we need to abort. Hence no need to fiddle waitqueues at
that point. I'll make that change.

>> -		spin_lock_irq(&current->sighand->siglock);
>> +		spin_lock_irq(&tsk->sighand->siglock);
>> 	}
>> -	spin_unlock_irq(&current->sighand->siglock);
>> +	spin_unlock_irq(&tsk->sighand->siglock);
>>
>> -	remove_wait_queue(&current->sighand->signalfd_wqh, &wait);
>> +	remove_wait_queue(&tsk->sighand->signalfd_wqh, &wait);
>> 	__set_current_state(TASK_RUNNING);
>>
>> +	signalfd_put_task(ctx, tsk);
>> 	return ret;
>> }
>>
>> @@ -267,19 +296,24 @@ static int do_signalfd4(int ufd, sigset_t *mask, int flags)
>> 	/* Check the SFD_* constants for consistency.  */
>> 	BUILD_BUG_ON(SFD_CLOEXEC != O_CLOEXEC);
>> 	BUILD_BUG_ON(SFD_NONBLOCK != O_NONBLOCK);
>> +	BUILD_BUG_ON(SFD_TASK & (SFD_CLOEXEC | SFD_NONBLOCK));
>>
>> -	if (flags & ~(SFD_CLOEXEC | SFD_NONBLOCK))
>> +	if (flags & ~(SFD_CLOEXEC | SFD_NONBLOCK | SFD_TASK))
>> +		return -EINVAL;
>> +	if ((flags & (SFD_CLOEXEC | SFD_TASK)) == SFD_TASK)
>> 		return -EINVAL;
>
> (non-actionable comment: It seems kinda weird that you can specify
> these parameters with no effect for the `ufd != -1` case... but since
> the existing parameters already work that way, I guess it's
> consistent.)

Yeah, just following what it already does, though I do agree it is weird
with the two separate cases and it only impacting one of them. Didn't
want to make it behave differently.
On 28/11/2019 00.27, Jann Horn wrote:

> One more thing, though: We'll have to figure out some way to
> invalidate the fd when the target goes through execve(), in particular
> if it's a setuid execution. Otherwise we'll be able to just steal
> signals that were intended for the other task, that's probably not
> good.
>
> So we should:
> a) prevent using ->wait() on an old signalfd once the task has gone
> through execve()
> b) kick off all existing waiters
> c) most importantly, prevent ->read() on an old signalfd once the
> task has gone through execve()
>
> We probably want to avoid using the cred_guard_mutex here, since it is
> quite broad and has some deadlocking issues; it might make sense to
> put the update of ->self_exec_id in fs/exec.c under something like the
> siglock,

What prevents one from exec'ing a trivial helper 2^32-1 times before
exec'ing into the victim binary?

Rasmus
On Thu, Nov 28, 2019 at 10:02 AM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
> On 28/11/2019 00.27, Jann Horn wrote:
>
> > One more thing, though: We'll have to figure out some way to
> > invalidate the fd when the target goes through execve(), in particular
> > if it's a setuid execution. Otherwise we'll be able to just steal
> > signals that were intended for the other task, that's probably not
> > good.
> >
> > So we should:
> > a) prevent using ->wait() on an old signalfd once the task has gone
> > through execve()
> > b) kick off all existing waiters
> > c) most importantly, prevent ->read() on an old signalfd once the
> > task has gone through execve()
> >
> > We probably want to avoid using the cred_guard_mutex here, since it is
> > quite broad and has some deadlocking issues; it might make sense to
> > put the update of ->self_exec_id in fs/exec.c under something like the
> > siglock,
>
> What prevents one from exec'ing a trivial helper 2^32-1 times before
> exec'ing into the victim binary?

Uh, yeah... that thing should probably become 64 bits wide, too.
On Thu, Nov 28, 2019 at 11:07 AM Jann Horn <jannh@google.com> wrote:
> On Thu, Nov 28, 2019 at 10:02 AM Rasmus Villemoes
> <linux@rasmusvillemoes.dk> wrote:
> > On 28/11/2019 00.27, Jann Horn wrote:
> >
> > > One more thing, though: We'll have to figure out some way to
> > > invalidate the fd when the target goes through execve(), in particular
> > > if it's a setuid execution. Otherwise we'll be able to just steal
> > > signals that were intended for the other task, that's probably not
> > > good.
> > >
> > > So we should:
> > > a) prevent using ->wait() on an old signalfd once the task has gone
> > > through execve()
> > > b) kick off all existing waiters
> > > c) most importantly, prevent ->read() on an old signalfd once the
> > > task has gone through execve()
> > >
> > > We probably want to avoid using the cred_guard_mutex here, since it is
> > > quite broad and has some deadlocking issues; it might make sense to
> > > put the update of ->self_exec_id in fs/exec.c under something like the
> > > siglock,
> >
> > What prevents one from exec'ing a trivial helper 2^32-1 times before
> > exec'ing into the victim binary?
>
> Uh, yeah... that thing should probably become 64 bits wide, too.

Actually, that'd still be wrong even with the existing kernel code for
two reasons:

- if you reparent to a subreaper, the existing exec_id comparison breaks
- the new check here is going to break if a non-leader thread goes
through execve(), because of the weird magic where the thread going
through execve steals the thread id (PID) of the leader

I'm gone for the day, but will try to dust off the years-old patch for
this that I have lying around somewhere tomorrow. I should probably
send it through akpm's tree with cc stable, given that this is already
kinda broken in existing releases...
On Thu, Nov 28, 2019 at 8:18 PM Jann Horn <jannh@google.com> wrote:
> On Thu, Nov 28, 2019 at 11:07 AM Jann Horn <jannh@google.com> wrote:
> > On Thu, Nov 28, 2019 at 10:02 AM Rasmus Villemoes
> > <linux@rasmusvillemoes.dk> wrote:
> > > On 28/11/2019 00.27, Jann Horn wrote:
> > >
> > > > One more thing, though: We'll have to figure out some way to
> > > > invalidate the fd when the target goes through execve(), in particular
> > > > if it's a setuid execution. Otherwise we'll be able to just steal
> > > > signals that were intended for the other task, that's probably not
> > > > good.
> > > >
> > > > So we should:
> > > > a) prevent using ->wait() on an old signalfd once the task has gone
> > > > through execve()
> > > > b) kick off all existing waiters
> > > > c) most importantly, prevent ->read() on an old signalfd once the
> > > > task has gone through execve()
> > > >
> > > > We probably want to avoid using the cred_guard_mutex here, since it is
> > > > quite broad and has some deadlocking issues; it might make sense to
> > > > put the update of ->self_exec_id in fs/exec.c under something like the
> > > > siglock,
> > >
> > > What prevents one from exec'ing a trivial helper 2^32-1 times before
> > > exec'ing into the victim binary?
> >
> > Uh, yeah... that thing should probably become 64 bits wide, too.
>
> Actually, that'd still be wrong even with the existing kernel code for
> two reasons:
>
> - if you reparent to a subreaper, the existing exec_id comparison breaks

... actually, I was wrong about this, this case is fine because the
->exit_signal is reset in reparent_leader().
On Thu, Nov 28, 2019 at 8:18 PM Jann Horn <jannh@google.com> wrote:
> On Thu, Nov 28, 2019 at 11:07 AM Jann Horn <jannh@google.com> wrote:
> > On Thu, Nov 28, 2019 at 10:02 AM Rasmus Villemoes
> > <linux@rasmusvillemoes.dk> wrote:
> > > On 28/11/2019 00.27, Jann Horn wrote:
> > >
> > > > One more thing, though: We'll have to figure out some way to
> > > > invalidate the fd when the target goes through execve(), in particular
> > > > if it's a setuid execution. Otherwise we'll be able to just steal
> > > > signals that were intended for the other task, that's probably not
> > > > good.
> > > >
> > > > So we should:
> > > > a) prevent using ->wait() on an old signalfd once the task has gone
> > > > through execve()
> > > > b) kick off all existing waiters
> > > > c) most importantly, prevent ->read() on an old signalfd once the
> > > > task has gone through execve()
> > > >
> > > > We probably want to avoid using the cred_guard_mutex here, since it is
> > > > quite broad and has some deadlocking issues; it might make sense to
> > > > put the update of ->self_exec_id in fs/exec.c under something like the
> > > > siglock,
> > >
> > > What prevents one from exec'ing a trivial helper 2^32-1 times before
> > > exec'ing into the victim binary?
> >
> > Uh, yeah... that thing should probably become 64 bits wide, too.
>
> Actually, that'd still be wrong even with the existing kernel code for
> two reasons:
>
> - if you reparent to a subreaper, the existing exec_id comparison breaks
> - the new check here is going to break if a non-leader thread goes
> through execve(), because of the weird magic where the thread going
> through execve steals the thread id (PID) of the leader
>
> I'm gone for the day, but will try to dust off the years-old patch for
> this that I have lying around somewhere tomorrow. I should probably
> send it through akpm's tree with cc stable, given that this is already
> kinda broken in existing releases...

I'm taking that back, given that I was wrong when writing this mail.
But I've attached the old patch, in case you want to reuse it. That
cpu-plus-64-bits scheme was Andy Lutomirski's idea.

If you use that, you'd have to take the cred_guard_mutex for ->poll
and ->read, but I guess that's probably fine for signalfd.
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 44b6845b071c..4bbdab9438c1 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -50,28 +50,62 @@ void signalfd_cleanup(struct sighand_struct *sighand)
 
 struct signalfd_ctx {
 	sigset_t sigmask;
+	struct task_struct *task;
 };
 
+static int signalfd_flush(struct file *file, void *data)
+{
+	struct signalfd_ctx *ctx = file->private_data;
+	struct task_struct *tsk = ctx->task;
+
+	if (tsk == current) {
+		ctx->task = NULL;
+		wake_up(&tsk->sighand->signalfd_wqh);
+		put_task_struct(tsk);
+	}
+
+	return 0;
+}
+
 static int signalfd_release(struct inode *inode, struct file *file)
 {
-	kfree(file->private_data);
+	struct signalfd_ctx *ctx = file->private_data;
+
+	if (ctx->task)
+		put_task_struct(ctx->task);
+	kfree(ctx);
 	return 0;
 }
 
+static void signalfd_put_task(struct task_struct *tsk)
+{
+	put_task_struct(tsk);
+}
+
+static struct task_struct *signalfd_get_task(struct signalfd_ctx *ctx)
+{
+	struct task_struct *tsk = ctx->task ?: current;
+
+	get_task_struct(tsk);
+	return tsk;
+}
+
 static __poll_t signalfd_poll(struct file *file, poll_table *wait)
 {
 	struct signalfd_ctx *ctx = file->private_data;
+	struct task_struct *tsk = signalfd_get_task(ctx);
 	__poll_t events = 0;
 
-	poll_wait(file, &current->sighand->signalfd_wqh, wait);
+	poll_wait(file, &tsk->sighand->signalfd_wqh, wait);
 
-	spin_lock_irq(&current->sighand->siglock);
-	if (next_signal(&current->pending, &ctx->sigmask) ||
-	    next_signal(&current->signal->shared_pending,
+	spin_lock_irq(&tsk->sighand->siglock);
+	if (next_signal(&tsk->pending, &ctx->sigmask) ||
+	    next_signal(&tsk->signal->shared_pending,
 		    &ctx->sigmask))
 		events |= EPOLLIN;
-	spin_unlock_irq(&current->sighand->siglock);
+	spin_unlock_irq(&tsk->sighand->siglock);
+	signalfd_put_task(tsk);
 
 	return events;
 }
@@ -167,10 +201,11 @@ static ssize_t signalfd_dequeue(struct signalfd_ctx *ctx, kernel_siginfo_t *info
 				int nonblock)
 {
 	ssize_t ret;
+	struct task_struct *tsk = signalfd_get_task(ctx);
 	DECLARE_WAITQUEUE(wait, current);
 
-	spin_lock_irq(&current->sighand->siglock);
-	ret = dequeue_signal(current, &ctx->sigmask, info);
+	spin_lock_irq(&tsk->sighand->siglock);
+	ret = dequeue_signal(tsk, &ctx->sigmask, info);
 	switch (ret) {
 	case 0:
 		if (!nonblock)
 			break;
 		ret = -EAGAIN;
 		/* fall through */
 	default:
-		spin_unlock_irq(&current->sighand->siglock);
+		spin_unlock_irq(&tsk->sighand->siglock);
+		signalfd_put_task(tsk);
 		return ret;
 	}
 
-	add_wait_queue(&current->sighand->signalfd_wqh, &wait);
+	add_wait_queue(&tsk->sighand->signalfd_wqh, &wait);
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		ret = dequeue_signal(current, &ctx->sigmask, info);
+		ret = dequeue_signal(tsk, &ctx->sigmask, info);
 		if (ret != 0)
 			break;
 		if (signal_pending(current)) {
 			ret = -ERESTARTSYS;
 			break;
 		}
-		spin_unlock_irq(&current->sighand->siglock);
+		spin_unlock_irq(&tsk->sighand->siglock);
 		schedule();
-		spin_lock_irq(&current->sighand->siglock);
+		spin_lock_irq(&tsk->sighand->siglock);
+		if (tsk != current && !ctx->task) {
+			ret = -ESRCH;
+			break;
+		}
 	}
-	spin_unlock_irq(&current->sighand->siglock);
+	spin_unlock_irq(&tsk->sighand->siglock);
 
-	remove_wait_queue(&current->sighand->signalfd_wqh, &wait);
+	remove_wait_queue(&tsk->sighand->signalfd_wqh, &wait);
 	__set_current_state(TASK_RUNNING);
 
+	signalfd_put_task(tsk);
 	return ret;
 }
 
@@ -254,6 +295,7 @@ static const struct file_operations signalfd_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= signalfd_show_fdinfo,
 #endif
+	.flush		= signalfd_flush,
 	.release	= signalfd_release,
 	.poll		= signalfd_poll,
 	.read		= signalfd_read,
@@ -267,19 +309,26 @@ static int do_signalfd4(int ufd, sigset_t *mask, int flags)
 	/* Check the SFD_* constants for consistency.  */
 	BUILD_BUG_ON(SFD_CLOEXEC != O_CLOEXEC);
 	BUILD_BUG_ON(SFD_NONBLOCK != O_NONBLOCK);
+	BUILD_BUG_ON(SFD_TASK & (SFD_CLOEXEC | SFD_NONBLOCK));
 
-	if (flags & ~(SFD_CLOEXEC | SFD_NONBLOCK))
+	if (flags & ~(SFD_CLOEXEC | SFD_NONBLOCK | SFD_TASK))
+		return -EINVAL;
+	if ((flags & (SFD_CLOEXEC | SFD_TASK)) == SFD_TASK)
 		return -EINVAL;
 
 	sigdelsetmask(mask, sigmask(SIGKILL) | sigmask(SIGSTOP));
 	signotset(mask);
 
 	if (ufd == -1) {
-		ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+		ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
 		if (!ctx)
 			return -ENOMEM;
 
 		ctx->sigmask = *mask;
+		if (flags & SFD_TASK) {
+			ctx->task = current;
+			get_task_struct(ctx->task);
+		}
 
 		/*
 		 * When we call this, the initialization must be complete, since
@@ -290,6 +339,7 @@ static int do_signalfd4(int ufd, sigset_t *mask, int flags)
 		if (ufd < 0)
 			kfree(ctx);
 	} else {
+		struct task_struct *tsk;
 		struct fd f = fdget(ufd);
 		if (!f.file)
 			return -EBADF;
@@ -298,11 +348,13 @@ static int do_signalfd4(int ufd, sigset_t *mask, int flags)
 			fdput(f);
 			return -EINVAL;
 		}
-		spin_lock_irq(&current->sighand->siglock);
+		tsk = signalfd_get_task(ctx);
+		spin_lock_irq(&tsk->sighand->siglock);
 		ctx->sigmask = *mask;
-		spin_unlock_irq(&current->sighand->siglock);
+		spin_unlock_irq(&tsk->sighand->siglock);
 
-		wake_up(&current->sighand->signalfd_wqh);
+		wake_up(&tsk->sighand->signalfd_wqh);
+		signalfd_put_task(tsk);
 		fdput(f);
 	}
 
diff --git a/include/uapi/linux/signalfd.h b/include/uapi/linux/signalfd.h
index 83429a05b698..064c5dc3eb99 100644
--- a/include/uapi/linux/signalfd.h
+++ b/include/uapi/linux/signalfd.h
@@ -16,6 +16,7 @@
 /* Flags for signalfd4.  */
 #define SFD_CLOEXEC	O_CLOEXEC
 #define SFD_NONBLOCK	O_NONBLOCK
+#define SFD_TASK	00000001
 
 struct signalfd_siginfo {
 	__u32 ssi_signo;