[RFC,04/14] pipe: Add O_NOTIFICATION_PIPE [ver #2]

Series: pipe: Keyrings, Block and USB notifications [ver #2]

[RFC,00/14] pipe: Keyrings, Block and USB notifications [ver #2]
[RFC,01/14] uapi: General notification queue definitions [ver #2]
[RFC,02/14] security: Add hooks to rule on setting a watch [ver #2]
[RFC,03/14] security: Add a hook for the point of notification insertion [ver #2]
[RFC,04/14] pipe: Add O_NOTIFICATION_PIPE [ver #2]
[RFC,05/14] pipe: Add general notification queue support [ver #2]
[RFC,06/14] keys: Add a notification facility [ver #2]
[RFC,07/14] Add sample notification program [ver #2]
[RFC,08/14] pipe: Allow buffers to be marked read-whole-or-error for notifications [ver #2]
[RFC,09/14] pipe: Add notification lossage handling [ver #2]
[RFC,10/14] Add a general, global device notification watch list [ver #2]
[RFC,11/14] block: Add block layer notifications [ver #2]
[RFC,12/14] usb: Add USB subsystem notifications [ver #2]
[RFC,13/14] selinux: Implement the watch_key security hook [ver #2]
[RFC,14/14] smack: Implement the watch_key and post_notification hooks [ver #2]

Message ID

157313375678.29677.15875689548927466028.stgit@warthog.procyon.org.uk (mailing list archive)

State

New, archived

Headers

Organization: Red Hat UK Ltd. Registered Address: Red Hat UK Ltd, Amberley
 Place, 107-111 Peascod Street, Windsor, Berkshire, SI4 1TE, United
 Kingdom.
 Registered in England and Wales under Company Registration No. 3798903
Subject: [RFC PATCH 04/14] pipe: Add O_NOTIFICATION_PIPE [ver #2]
From: David Howells <dhowells@redhat.com>
To: torvalds@linux-foundation.org
Cc: dhowells@redhat.com,
        Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
        Casey Schaufler <casey@schaufler-ca.com>,
        Stephen Smalley <sds@tycho.nsa.gov>, nicolas.dichtel@6wind.com,
        raven@themaw.net, Christian Brauner <christian@brauner.io>,
        dhowells@redhat.com, keyrings@vger.kernel.org,
        linux-usb@vger.kernel.org, linux-block@vger.kernel.org,
        linux-security-module@vger.kernel.org,
        linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
        linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org
Date: Thu, 07 Nov 2019 13:35:56 +0000
Message-ID: 
 <157313375678.29677.15875689548927466028.stgit@warthog.procyon.org.uk>
In-Reply-To: 
 <157313371694.29677.15388731274912671071.stgit@warthog.procyon.org.uk>
References: 
 <157313371694.29677.15388731274912671071.stgit@warthog.procyon.org.uk>
User-Agent: StGit/unknown-version
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Sender: linux-usb-owner@vger.kernel.org
Precedence: bulk

Commit Message

David Howells Nov. 7, 2019, 1:35 p.m. UTC

Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate
that the pipe being created is going to be used for notifications.  This
suppresses the use of splice(), vmsplice(), tee() and sendfile() on the
pipe as calling iov_iter_revert() on a pipe when a kernel notification
message has been inserted into the middle of a multi-buffer splice will be
messy.

The flag is given the same value as O_EXCL, as it seems unlikely that
O_EXCL will ever be applicable to pipes and I don't want to use up
another O_* bit unnecessarily.  An alternative could be to add a pipe3()
system call.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/uapi/linux/watch_queue.h |    3 +++
 1 file changed, 3 insertions(+)
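
For illustration only, here is a minimal userspace sketch of how the flag
described above might be passed to pipe2().  The helper name
make_notification_pipe() is hypothetical, and the local fallback definition
of O_NOTIFICATION_PIPE simply mirrors the value given in this patch.  On a
kernel that does not implement the flag, pipe2() is expected to fail (for
example with EINVAL), so the helper reports a negative errno value rather
than assuming success.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Value taken from the patch in this thread; defined locally in case the
 * installed UAPI headers do not provide it.
 */
#ifndef O_NOTIFICATION_PIPE
#define O_NOTIFICATION_PIPE O_EXCL
#endif

/*
 * Hypothetical helper: try to create a notification pipe.
 * Returns 0 and fills fds[] on success, or a negative errno value if the
 * running kernel rejects the flag.
 */
static int make_notification_pipe(int fds[2])
{
	assert(fds != NULL);

	if (pipe2(fds, O_NOTIFICATION_PIPE) == -1)
		return -errno;
	return 0;
}
```

A caller would check the return value before using fds[0] as the read end,
and should not assume that splice(), vmsplice(), tee() or sendfile() will
work on either end of such a pipe.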

Comments

Andy Lutomirski Nov. 7, 2019, 6:16 p.m. UTC | #1

On Thu, Nov 7, 2019 at 5:39 AM David Howells <dhowells@redhat.com> wrote:
>
> Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate
> that the pipe being created is going to be used for notifications.  This
> suppresses the use of splice(), vmsplice(), tee() and sendfile() on the
> pipe as calling iov_iter_revert() on a pipe when a kernel notification
> message has been inserted into the middle of a multi-buffer splice will be
> messy.

How messy?  And is there some way to make it impossible for this to
happen?  Adding a new flag to pipe2() to avoid messy kernel code seems
like a poor tradeoff.

David Howells Nov. 7, 2019, 6:48 p.m. UTC | #2

Andy Lutomirski <luto@kernel.org> wrote:

> > Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate
> > that the pipe being created is going to be used for notifications.  This
> > suppresses the use of splice(), vmsplice(), tee() and sendfile() on the
> > pipe as calling iov_iter_revert() on a pipe when a kernel notification
> > message has been inserted into the middle of a multi-buffer splice will be
> > messy.
>
> How messy?

Well, iov_iter_revert() on a pipe iterator simply walks backwards along the
ring discarding the last N contiguous slots (where N is normally the number of
slots that were filled by whatever operation is being reverted).

However, unless the code that transfers stuff into the pipe takes the
spinlock and disables softirqs for the duration of its ring filling, what were
N contiguous slots may now have kernel notifications interspersed - even if it
has been holding the pipe mutex.
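
The failure mode can be sketched with a toy userspace model.  This is not the
kernel's pipe ring or iov_iter code; struct toy_ring, the slot_owner tags and
the toy_* helpers are all hypothetical names invented for the illustration.
The naive revert drops the last N slots on the assumption that they all
belong to the operation being reverted.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 8

/* Who filled a slot in the toy ring. */
enum slot_owner { SLOT_EMPTY, SLOT_SPLICE, SLOT_NOTIFY };

struct toy_ring {
	enum slot_owner slot[RING_SLOTS];
	size_t head;	/* number of slots filled so far */
};

/* Append one slot; returns 0, or -1 if the ring is full. */
static int toy_push(struct toy_ring *r, enum slot_owner who)
{
	if (r->head == RING_SLOTS)
		return -1;
	r->slot[r->head++] = who;
	return 0;
}

/*
 * Naive revert: discard the last n slots, assuming they are contiguous and
 * all belong to the operation being reverted.
 */
static void toy_revert_naive(struct toy_ring *r, size_t n)
{
	assert(n <= r->head);
	while (n--)
		r->slot[--r->head] = SLOT_EMPTY;
}

/* Count the filled slots that belong to a given owner. */
static size_t toy_count(const struct toy_ring *r, enum slot_owner who)
{
	size_t i, n = 0;

	for (i = 0; i < r->head; i++)
		if (r->slot[i] == who)
			n++;
	return n;
}
```

If a splice fills one slot, a notification is posted, and the splice then
fills a second slot, reverting two slots in this model throws away the
notification and leaves the first splice slot behind.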

So, now what do you do?  You have to free up just the buffers relevant to the
iterator and then you can either compact down the ring to free up the space or
you can leave null slots and let the read side clean them up, thereby
reducing the capacity of the pipe temporarily.
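
The two options can be sketched in the same kind of toy userspace model.
Again, none of this is kernel code: struct toy_ring and the toy_* helpers are
hypothetical, and a real implementation would have to do this work under the
spinlock.  The revert frees only the slots owned by the splice; compaction is
then an optional second step.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 8

/* SLOT_NULL marks a slot that was freed but still occupies ring space. */
enum slot_owner { SLOT_NULL, SLOT_SPLICE, SLOT_NOTIFY };

struct toy_ring {
	enum slot_owner slot[RING_SLOTS];
	size_t head;	/* one past the last slot in use */
};

/*
 * Walk backwards from the head, freeing only the last n slots owned by the
 * splice and leaving notification slots alone.
 */
static void toy_revert_owned(struct toy_ring *r, size_t n)
{
	size_t i = r->head;

	while (n > 0 && i > 0) {
		i--;
		if (r->slot[i] == SLOT_SPLICE) {
			r->slot[i] = SLOT_NULL;
			n--;
		}
	}
	assert(n == 0);	/* caller must not revert more than it filled */
}

/* Option 1: compact the ring, squeezing out null slots. */
static void toy_compact(struct toy_ring *r)
{
	size_t src, dst = 0;

	for (src = 0; src < r->head; src++)
		if (r->slot[src] != SLOT_NULL)
			r->slot[dst++] = r->slot[src];
	for (src = dst; src < r->head; src++)
		r->slot[src] = SLOT_NULL;
	r->head = dst;
}

/*
 * Option 2 is to skip toy_compact() and let the reader discard null slots.
 * Until then, null slots below the head still count against capacity.
 */
static size_t toy_free_slots(const struct toy_ring *r)
{
	return RING_SLOTS - r->head;
}
```

In this model, leaving the null slots in place keeps the revert cheap but
reports less free space until they are cleaned up, which is the temporary
capacity loss mentioned above.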

Either way, iov_iter_revert() gets more complex and has to hold the spinlock.

And if you don't take the spinlock whilst you're reverting, more notifications
can come in to make your life more interesting.

There's also a problem with splicing out from a notification pipe that the
messages are scribed onto preallocated buffers, but now the buffers need
refcounts and, in any case, are of limited quantity.

> And is there some way to make it impossible for this to happen?

Yes.  That's what I'm doing by declaring the pipe to be unspliceable up front.

> Adding a new flag to pipe2() to avoid messy kernel code seems
> like a poor tradeoff.

By far the easiest place to check whether a pipe can be spliced to is in
get_pipe_info().  That's checking the file anyway.  After that, you can't make
the check until the pipe is locked.

Furthermore, if it's not done upfront, the change to the pipe might happen
during a splicing operation that's residing in pipe_wait()... which drops the
pipe mutex.
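
As a rough illustration of the up-front check, here is a toy userspace model.
It is not the kernel's get_pipe_info(); struct toy_pipe, struct toy_file and
the for_splice parameter are hypothetical names chosen for the sketch.  The
point is that the notification property is fixed at creation, so the lookup
can refuse splice users before any lock is taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_pipe {
	bool is_notification;	/* set once at creation, never changed */
};

struct toy_file {
	struct toy_pipe *pipe;	/* NULL if the file is not a pipe */
};

/*
 * Return the pipe behind a file, or NULL if the file is not a pipe or if
 * the caller wants to splice and the pipe was created for notifications.
 */
static struct toy_pipe *toy_get_pipe_info(const struct toy_file *file,
					  bool for_splice)
{
	struct toy_pipe *pipe;

	assert(file != NULL);
	pipe = file->pipe;
	if (pipe == NULL)
		return NULL;
	if (for_splice && pipe->is_notification)
		return NULL;
	return pipe;
}
```

Because the flag never changes after creation in this model, a splice that
sleeps and drops the pipe mutex cannot find the pipe turned into a
notification pipe when it wakes up.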

David

Andy Lutomirski Nov. 8, 2019, 5:06 a.m. UTC | #3

On Thu, Nov 7, 2019 at 10:48 AM David Howells <dhowells@redhat.com> wrote:
>
> Andy Lutomirski <luto@kernel.org> wrote:
>
> > > Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate
> > > that the pipe being created is going to be used for notifications.  This
> > > suppresses the use of splice(), vmsplice(), tee() and sendfile() on the
> > > pipe as calling iov_iter_revert() on a pipe when a kernel notification
> > > message has been inserted into the middle of a multi-buffer splice will be
> > > messy.
> >
> > How messy?
>
> Well, iov_iter_revert() on a pipe iterator simply walks backwards along the
> ring discarding the last N contiguous slots (where N is normally the number of
> slots that were filled by whatever operation is being reverted).
>
> However, unless the code that transfers stuff into the pipe takes the
> spinlock and disables softirqs for the duration of its ring filling, what were
> N contiguous slots may now have kernel notifications interspersed - even if it
> has been holding the pipe mutex.
>
> So, now what do you do?  You have to free up just the buffers relevant to the
> iterator and then you can either compact down the ring to free up the space or
> you can leave null slots and let the read side clean them up, thereby
> reducing the capacity of the pipe temporarily.
>
> Either way, iov_iter_revert() gets more complex and has to hold the spinlock.

I feel like I'm missing something fundamental here.

I can open a normal pipe from userspace (with pipe() or pipe2()), and
I can have two threads.  One thread writes to the pipe with write().
The other thread writes with splice().  Everything works fine.  What's
special about notifications?

David Howells Nov. 8, 2019, 6:42 a.m. UTC | #4

Andy Lutomirski <luto@kernel.org> wrote:

> I can open a normal pipe from userspace (with pipe() or pipe2()), and
> I can have two threads.  One thread writes to the pipe with write().
> The other thread writes with splice().  Everything works fine.

Yes.  Every operation you do on a pipe from userspace is serialised with the
pipe mutex - and both ends share the same pipe.

> What's special about notifications?

The post_notification() cannot take the pipe mutex.  It has to be callable
from softirq context.  Linus's idea is that when you're actually altering the
ring pointers you should hold the wake-queue spinlock, and post_notification()
holds the wake queue spinlock for the duration of the operation.

This means that post_notification() can be writing to the pipe whilst a
userspace-invoked operation is holding the pipe mutex and is also doing
something to the ring.

David

diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
index 5f3d21e8a34b..9df72227f515 100644
--- a/include/uapi/linux/watch_queue.h
+++ b/include/uapi/linux/watch_queue.h
@@ -3,6 +3,9 @@ 
 #define _UAPI_LINUX_WATCH_QUEUE_H
 
 #include <linux/types.h>
+#include <linux/fcntl.h>
+
+#define O_NOTIFICATION_PIPE	O_EXCL	/* Parameter to pipe2() selecting notification pipe */
 
 enum watch_notification_type {
 	WATCH_TYPE_META		= 0,	/* Special record */
