Message ID | 157313375678.29677.15875689548927466028.stgit@warthog.procyon.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | pipe: Keyrings, Block and USB notifications [ver #2] | expand |
On Thu, Nov 7, 2019 at 5:39 AM David Howells <dhowells@redhat.com> wrote: > > Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate > that the pipe being created is going to be used for notifications. This > suppresses the use of splice(), vmsplice(), tee() and sendfile() on the > pipe as calling iov_iter_revert() on a pipe when a kernel notification > message has been inserted into the middle of a multi-buffer splice will be > messy. How messy? And is there some way to make it impossible for this to happen? Adding a new flag to pipe2() to avoid messy kernel code seems like a poor tradeoff.
Andy Lutomirski <luto@kernel.org> wrote: > > Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate > > that the pipe being created is going to be used for notifications. This > > suppresses the use of splice(), vmsplice(), tee() and sendfile() on the > > pipe as calling iov_iter_revert() on a pipe when a kernel notification > > message has been inserted into the middle of a multi-buffer splice will be > > messy. > > How messy? Well, iov_iter_revert() on a pipe iterator simply walks backwards along the ring discarding the last N contiguous slots (where N is normally the number of slots that were filled by whatever operation is being reverted). However, unless the code that transfers stuff into the pipe takes the spinlock spinlock and disables softirqs for the duration of its ring filling, what were N contiguous slots may now have kernel notifications interspersed - even if it has been holding the pipe mutex. So, now what do you do? You have to free up just the buffers relevant to the iterator and then you can either compact down the ring to free up the space or you can leave null slots and let the read side clean them up, thereby reducing the capacity of the pipe temporarily. Either way, iov_iter_revert() gets more complex and has to hold the spinlock. And if you don't take the spinlock whilst you're reverting, more notifications can come in to make your life more interesting. There's also a problem with splicing out from a notification pipe that the messages are scribed onto preallocated buffers, but now the buffers need refcounts and, in any case, are of limited quantity. > And is there some way to make it impossible for this to happen? Yes. That's what I'm doing by declaring the pipe to be unspliceable up front. > Adding a new flag to pipe2() to avoid messy kernel code seems > like a poor tradeoff. By far the easiest place to check whether a pipe can be spliced to is in get_pipe_info(). That's checking the file anyway. After that, you can't make the check until the pipe is locked. Furthermore, if it's not done upfront, the change to the pipe might happen during a splicing operation that's residing in pipe_wait()... which drops the pipe mutex. David
On Thu, Nov 7, 2019 at 10:48 AM David Howells <dhowells@redhat.com> wrote: > > Andy Lutomirski <luto@kernel.org> wrote: > > > > Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate > > > that the pipe being created is going to be used for notifications. This > > > suppresses the use of splice(), vmsplice(), tee() and sendfile() on the > > > pipe as calling iov_iter_revert() on a pipe when a kernel notification > > > message has been inserted into the middle of a multi-buffer splice will be > > > messy. > > > > How messy? > > Well, iov_iter_revert() on a pipe iterator simply walks backwards along the > ring discarding the last N contiguous slots (where N is normally the number of > slots that were filled by whatever operation is being reverted). > > However, unless the code that transfers stuff into the pipe takes the spinlock > spinlock and disables softirqs for the duration of its ring filling, what were > N contiguous slots may now have kernel notifications interspersed - even if it > has been holding the pipe mutex. > > So, now what do you do? You have to free up just the buffers relevant to the > iterator and then you can either compact down the ring to free up the space or > you can leave null slots and let the read side clean them up, thereby > reducing the capacity of the pipe temporarily. > > Either way, iov_iter_revert() gets more complex and has to hold the spinlock. I feel like I'm missing something fundamental here. I can open a normal pipe from userspace (with pipe() or pipe2()), and I can have two threads. One thread writes to the pipe with write(). The other thread writes with splice(). Everything works fine. What's special about notifications?
Andy Lutomirski <luto@kernel.org> wrote: > I can open a normal pipe from userspace (with pipe() or pipe2()), and > I can have two threads. One thread writes to the pipe with write(). > The other thread writes with splice(). Everything works fine. Yes. Every operation you do on a pipe from userspace is serialised with the pipe mutex - and both ends share the same pipe. > What's special about notifications? The post_notification() cannot take the pipe mutex. It has to be callable from softirq context. Linus's idea is that when you're actually altering the ring pointers you should hold the wake-queue spinlock, and post_notification() holds the wake queue spinlock for the duration of the operation. This means that post_notification() can be writing to the pipe whilst a userspace-invoked operation is holding the pipe mutex and is also doing something to the ring. David
diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h index 5f3d21e8a34b..9df72227f515 100644 --- a/include/uapi/linux/watch_queue.h +++ b/include/uapi/linux/watch_queue.h @@ -3,6 +3,9 @@ #define _UAPI_LINUX_WATCH_QUEUE_H #include <linux/types.h> +#include <linux/fcntl.h> + +#define O_NOTIFICATION_PIPE O_EXCL /* Parameter to pipe2() selecting notification pipe */ enum watch_notification_type { WATCH_TYPE_META = 0, /* Special record */
Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate that the pipe being created is going to be used for notifications. This suppresses the use of splice(), vmsplice(), tee() and sendfile() on the pipe as calling iov_iter_revert() on a pipe when a kernel notification message has been inserted into the middle of a multi-buffer splice will be messy. The flag is given the same value as O_EXCL as it seems unlikely that this flag will ever be applicable to pipes and I don't want to use up another O_* bit unnecessarily. An alternative could be to add a pipe3() system call. Signed-off-by: David Howells <dhowells@redhat.com> --- include/uapi/linux/watch_queue.h | 3 +++ 1 file changed, 3 insertions(+)