Message ID | 30394.1571936252@warthog.procyon.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | pipe: Notification queue preparation [ver #2] | expand |
On Thu, Oct 24, 2019 at 12:57 PM David Howells <dhowells@redhat.com> wrote: > > pipe: Add fsync() support > > The keyrings testsuite needs the ability to wait for all the outstanding > notifications in the queue to have been processed so that it can then go > through them to find out whether the notifications it expected have been > emitted. Can't you just do ioctl(fd, FIONREAD, &count); in a loop instead? "No paperwork. Just sprinkle some msleep() crack on him, and let's get out of here" Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > > The keyrings testsuite needs the ability to wait for all the outstanding > > notifications in the queue to have been processed so that it can then go > > through them to find out whether the notifications it expected have been > > emitted. > > Can't you just do > > ioctl(fd, FIONREAD, &count); > > in a loop instead? "No paperwork. Just sprinkle some msleep() crack on > him, and let's get out of here" Using FIONREAD like this means that I would have to quiesce the tests in order to sync up. For the moment that's fine, but at some point I would like to be able to stress test the system by running tests in parallel against the same keyring. Each test needs to check with the monitor whether its keys have generated the appropriate notifications against a backdrop of events being continuously generated by other tests. I can hold this patch for now. Let me see if I can come up with a better way to do it. Maybe it can be done by dead reckoning, holding up until either we've counted out a complete ring-full of notifications or read() has come up empty. David
On Thu, Oct 24, 2019 at 05:57:32PM +0100, David Howells wrote: > pipe: Add fsync() support > > The keyrings testsuite needs the ability to wait for all the outstanding > notifications in the queue to have been processed so that it can then go > through them to find out whether the notifications it expected have been > emitted. > > Implement fsync() support for pipes to provide this. The tailmost buffer > at the point of calling is marked and fsync adds itself to the list of > waiters, noting the tail position to be waited for and marking the buffer > as no longer mergeable. Then when the buffer is consumed, if the flag is > set, any matching waiters are woken up. I am _really_ worried about overloading fsync for this behavior. fsync hasn't done anything for 50 years, and suddenly adding any action is not helpful. If you can't use FIONREAD please add a new ioctls instead, and document it properly.
On 24/10/2019 19.57, David Howells wrote: > pipe: Add fsync() support > > The keyrings testsuite needs the ability to wait for all the outstanding > notifications in the queue to have been processed so that it can then go > through them to find out whether the notifications it expected have been > emitted. Similar synchronization is required for reusing memory after vmsplice()? I don't see other way how sender could safely change these pages. > > Implement fsync() support for pipes to provide this. The tailmost buffer > at the point of calling is marked and fsync adds itself to the list of > waiters, noting the tail position to be waited for and marking the buffer > as no longer mergeable. Then when the buffer is consumed, if the flag is > set, any matching waiters are woken up. > > Signed-off-by: David Howells <dhowells@redhat.com> > --- > fs/fuse/dev.c | 1 > fs/pipe.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++ > fs/splice.c | 3 ++ > include/linux/pipe_fs_i.h | 22 ++++++++++++++++ > lib/iov_iter.c | 2 - > 5 files changed, 88 insertions(+), 1 deletion(-) > > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c > index 5ef57a322cb8..9617a35579cb 100644 > --- a/fs/fuse/dev.c > +++ b/fs/fuse/dev.c > @@ -1983,6 +1983,7 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe, > if (rem >= ibuf->len) { > *obuf = *ibuf; > ibuf->ops = NULL; > + pipe_wake_fsync(pipe, ibuf, tail); > tail++; > pipe_commit_read(pipe, tail); > } else { > diff --git a/fs/pipe.c b/fs/pipe.c > index 6a982a88f658..8e5fd7314be1 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -30,6 +30,12 @@ > > #include "internal.h" > > +struct pipe_fsync { > + struct list_head link; /* Link in pipe->fsync */ > + struct completion done; > + unsigned int tail; /* The buffer being waited for */ > +}; > + > /* > * The max size that a non-root user is allowed to grow the pipe. Can > * be set by root in /proc/sys/fs/pipe-max-size > @@ -269,6 +275,58 @@ static bool pipe_buf_can_merge(struct pipe_buffer *buf) > return buf->ops == &anon_pipe_buf_ops; > } > > +/* > + * Wait for all the data currently in the pipe to be consumed. > + */ > +static int pipe_fsync(struct file *file, loff_t a, loff_t b, int datasync) > +{ > + struct pipe_inode_info *pipe = file->private_data; > + struct pipe_buffer *buf; > + struct pipe_fsync fsync; > + unsigned int head, tail, mask; > + > + pipe_lock(pipe); > + > + head = pipe->head; > + tail = pipe->tail; > + mask = pipe->ring_size - 1; > + > + if (pipe_empty(head, tail)) { > + pipe_unlock(pipe); > + return 0; > + } > + > + init_completion(&fsync.done); > + fsync.tail = tail; > + buf = &pipe->bufs[tail & mask]; > + buf->flags |= PIPE_BUF_FLAG_FSYNC; > + pipe_buf_mark_unmergeable(buf); > + list_add_tail(&fsync.link, &pipe->fsync); > + pipe_unlock(pipe); > + > + if (wait_for_completion_interruptible(&fsync.done) < 0) { > + pipe_lock(pipe); > + list_del(&fsync.link); > + pipe_unlock(pipe); > + return -EINTR; > + } > + > + return 0; > +} > + > +void __pipe_wake_fsync(struct pipe_inode_info *pipe, unsigned int tail) > +{ > + struct pipe_fsync *fsync, *p; > + > + list_for_each_entry_safe(fsync, p, &pipe->fsync, link) { > + if (fsync->tail == tail) { > + list_del_init(&fsync->link); > + complete(&fsync->done); > + } > + } > +} > +EXPORT_SYMBOL(__pipe_wake_fsync); > + > static ssize_t > pipe_read(struct kiocb *iocb, struct iov_iter *to) > { > @@ -325,6 +383,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > if (!buf->len) { > pipe_buf_release(pipe, buf); > spin_lock_irq(&pipe->wait.lock); > + pipe_wake_fsync(pipe, buf, tail); > tail++; > pipe_commit_read(pipe, tail); > do_wakeup = 1; > @@ -717,6 +776,7 @@ struct pipe_inode_info *alloc_pipe_info(void) > pipe->ring_size = pipe_bufs; > pipe->user = user; > mutex_init(&pipe->mutex); > + INIT_LIST_HEAD(&pipe->fsync); > return pipe; > } > > @@ -1060,6 +1120,7 @@ const struct file_operations pipefifo_fops = { > .llseek = no_llseek, > .read_iter = pipe_read, > .write_iter = pipe_write, > + .fsync = pipe_fsync, > .poll = pipe_poll, > .unlocked_ioctl = pipe_ioctl, > .release = pipe_release, > diff --git a/fs/splice.c b/fs/splice.c > index 3f72bc31b6ec..e106367e1be6 100644 > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -523,6 +523,7 @@ static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_des > > if (!buf->len) { > pipe_buf_release(pipe, buf); > + pipe_wake_fsync(pipe, buf, tail); > tail++; > pipe_commit_read(pipe, tail); > if (pipe->files) > @@ -771,6 +772,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out, > ret -= buf->len; > buf->len = 0; > pipe_buf_release(pipe, buf); > + pipe_wake_fsync(pipe, buf, tail); > tail++; > pipe_commit_read(pipe, tail); > if (pipe->files) > @@ -1613,6 +1615,7 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe, > */ > *obuf = *ibuf; > ibuf->ops = NULL; > + pipe_wake_fsync(ipipe, ibuf, i_tail); > i_tail++; > pipe_commit_read(ipipe, i_tail); > input_wakeup = true; > diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h > index 90055ff16550..1a3027089558 100644 > --- a/include/linux/pipe_fs_i.h > +++ b/include/linux/pipe_fs_i.h > @@ -8,6 +8,7 @@ > #define PIPE_BUF_FLAG_ATOMIC 0x02 /* was atomically mapped */ > #define PIPE_BUF_FLAG_GIFT 0x04 /* page is a gift */ > #define PIPE_BUF_FLAG_PACKET 0x08 /* read() as a packet */ > +#define PIPE_BUF_FLAG_FSYNC 0x10 /* fsync() is waiting for this buffer to die */ > > /** > * struct pipe_buffer - a linux kernel pipe buffer > @@ -43,6 +44,7 @@ struct pipe_buffer { > * @w_counter: writer counter > * @fasync_readers: reader side fasync > * @fasync_writers: writer side fasync > + * @fsync: Waiting fsyncs > * @bufs: the circular array of pipe buffers > * @user: the user who created this pipe > **/ > @@ -62,6 +64,7 @@ struct pipe_inode_info { > struct page *tmp_page; > struct fasync_struct *fasync_readers; > struct fasync_struct *fasync_writers; > + struct list_head fsync; > struct pipe_buffer *bufs; > struct user_struct *user; > }; > @@ -268,6 +271,25 @@ extern const struct pipe_buf_operations nosteal_pipe_buf_ops; > long pipe_fcntl(struct file *, unsigned int, unsigned long arg); > struct pipe_inode_info *get_pipe_info(struct file *file); > > +void __pipe_wake_fsync(struct pipe_inode_info *pipe, unsigned int tail); > + > +/** > + * pipe_wake_fsync - Wake up anyone waiting with fsync for this point > + * @pipe: The pipe that owns the buffer > + * @buf: The pipe buffer in question > + * @tail: The index in the ring of the buffer > + * > + * Check to see if anyone is waiting for the pipe ring to clear up to and > + * including this buffer, and, if they are, wake them up. > + */ > +static inline void pipe_wake_fsync(struct pipe_inode_info *pipe, > + struct pipe_buffer *buf, > + unsigned int tail) > +{ > + if (unlikely(buf->flags & PIPE_BUF_FLAG_FSYNC)) > + __pipe_wake_fsync(pipe, tail); > +} > + > int create_pipe_files(struct file **, int); > unsigned int round_pipe_size(unsigned long size); > > diff --git a/lib/iov_iter.c b/lib/iov_iter.c > index e22f4e283f6d..38d52524cd21 100644 > --- a/lib/iov_iter.c > +++ b/lib/iov_iter.c > @@ -404,7 +404,7 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by > buf->offset = offset; > buf->len = bytes; > > - pipe_commit_read(pipe, i_head); > + pipe_commit_write(pipe, i_head); > i->iov_offset = offset + bytes; > i->head = i_head; > out: > >
Christoph Hellwig <hch@infradead.org> wrote: > I am _really_ worried about overloading fsync for this behavior. fsync > hasn't done anything for 50 years, and suddenly adding any action > is not helpful. If you can't use FIONREAD please add a new ioctls > instead, and document it properly. Okay. David
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote: > Similar synchronization is required for reusing memory after vmsplice()? > I don't see other way how sender could safely change these pages. Sounds like a point - if you have multiple parallel contributors to the pipe via vmsplice(), then FIONREAD is of no use. To use use FIONREAD, you have to let the pipe become empty before you can be sure. David
On Thu, Oct 31, 2019 at 8:16 AM David Howells <dhowells@redhat.com> wrote: > > Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote: > > > Similar synchronization is required for reusing memory after vmsplice()? > > I don't see other way how sender could safely change these pages. > > Sounds like a point - if you have multiple parallel contributors to the pipe > via vmsplice(), then FIONREAD is of no use. To use use FIONREAD, you have to > let the pipe become empty before you can be sure. Well, the rules for vmsplice is simply to not change the source pages. It's zero-copy, after all. If you want to change the source pages, you need to just use write() instead. That said, even then the right model isn't fsync(). If you really want to have something like "notify me when this buffer has been used", it should be some kind of sequence count thing, not a "wait for empty". Which might be useful in theory, but would be something quite different (and honestly, I wouldn't expect it to find all that widespread use) Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > Similar synchronization is required for reusing memory after vmsplice()? > > > I don't see other way how sender could safely change these pages. Actually, it's probably worse than that. If the output of the pipe gets teed or spliced somewhere else, you still don't know when the vmspliced pages are finished with. David
> On Nov 2, 2019, at 12:34 PM, David Howells <dhowells@redhat.com> wrote: > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > >>>> Similar synchronization is required for reusing memory after vmsplice()? >>>> I don't see other way how sender could safely change these pages. > > Actually, it's probably worse than that. If the output of the pipe gets teed > or spliced somewhere else, you still don't know when the vmspliced pages are > finished with. > > I sometimes wonder whether vmsplice should be disallowed or severely restricted. Even ignoring these usability issues, it makes me very uncomfortable that you can have some data queue up on a pipe, tee() it, and get *different* data in the original pipe and the teed copy because the sender used vmsplice and is messing with you. Add in the fact that it’s not obvious that vmsplice *can* be used correctly, and I’m wondering if we should just remove it or make it just do write() under the hood. I suppose the kernel could guarantee that it stops referring to the vmsplice source pages as soon as anything sees *or* tees the data. This way it would be, at least in principle, possible to say “hey, the pipe has consumed the first n vmspliced bytes, so I can reuse that memory”.
On Sat, Nov 2, 2019 at 1:31 PM Andy Lutomirski <luto@amacapital.net> wrote: > > Add in the fact that it’s not obvious that vmsplice *can* be used correctly, and I’m wondering if we should just remove it or make it just do write() under the hood. Sure it can. Just don't modify the data you vmsplice. It's really that simple. That said, if we don't have any actual users, then we should look at removing it (maybe turning it into "write()" as you say). Not because it's hard to use, but simply because it probably doesn't have that many uses. Linus
On Sat, Nov 2, 2019 at 3:03 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, Nov 2, 2019 at 1:31 PM Andy Lutomirski <luto@amacapital.net> wrote: > > > > Add in the fact that it’s not obvious that vmsplice *can* be used correctly, and I’m wondering if we should just remove it or make it just do write() under the hood. > > Sure it can. Just don't modify the data you vmsplice. It's really that simple. > > That said, if we don't have any actual users, then we should look at > removing it (maybe turning it into "write()" as you say). Not because > it's hard to use, but simply because it probably doesn't have that > many uses. Looking at debian code search, there are _some_ uses (including openssl and fuse): https://codesearch.debian.net/search?q=%3D+vmsplice%28&literal=1 but I didn't check any more closely what they do. Linus
> On Nov 2, 2019, at 3:04 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, Nov 2, 2019 at 1:31 PM Andy Lutomirski <luto@amacapital.net> wrote: >> >> Add in the fact that it’s not obvious that vmsplice *can* be used correctly, and I’m wondering if we should just remove it or make it just do write() under the hood. > > Sure it can. Just don't modify the data you vmsplice. It's really that simple. So you allocate memory, vmsplice, and munmap() without reusing it? Just plain free() won’t be good enough. I suspect the TLB overhead will make this a loss in most workloads? Or maybe you vmsplice from a read-only mapping of a file that you know no one modifies? This could be useful, but you can just splice() from the file directly.
On Sat, Nov 2, 2019 at 3:30 PM Andy Lutomirski <luto@amacapital.net> wrote: > > So you allocate memory, vmsplice, and munmap() without reusing it? You can re-use it as much as you want. Just don't write to it. So the traditional argument for this was "I do a caching http server". If you don't ever load the data into user space at all and just push file data out, you just use splice() from the file to the target. But if you generate some of the data in memory, and you cache it, you use vmsplice(). And then it really is very easy to set up: make sure you generate your caches with a new clean private mmap, and you can throw them out with munmap (or just over-mmap it with the new cache, of course). If you don't cache it, then there's no advantage to vmsplice() - just write() it and forget about it. The whole (and only) point of vmsplice() is when you want to zero-copy the data, and that's generally likely only an advantage if you can do it multiple times. But I don't think anybody actually _did_ any of that. But that's basically the argument for the three splice operations: write/vmsplice/splice(). Which one you use depends on the lifetime and the source of your data. write() is obviously for the copy case (the source data might not be stable), while splice() is for the "data from another source", and vmsplace() is "data is from stable data in my vm". There's the reverse op, of course, but we never implemented that: mmap() on the pipe could do the reverse of a vmsplice() (moving from the pipe to the vm), but it would only work if everything was page-aligned, which it effectively never is. It's basically a benchmark-only operation. And the existence of vmsplice() is because we actually had code to play games with making write() do a zero-copy but mark the source as being COW. It was _wonderful_ for benchmarks, and was completely useless for real world case because in the real world you always took the COW fault. So vmsplice() is basically a "hey, I know what I'm doing, and you can just take the page as-is because the source is stable". Linus
On Sat, Nov 2, 2019 at 4:02 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > But I don't think anybody actually _did_ any of that. But that's > basically the argument for the three splice operations: > write/vmsplice/splice(). Which one you use depends on the lifetime and > the source of your data. write() is obviously for the copy case (the > source data might not be stable), while splice() is for the "data from > another source", and vmsplace() is "data is from stable data in my > vm". Btw, it's really worth noting that "splice()" and friends are from a more happy-go-lucky time when we were experimenting with new interfaces, and in a day and age when people thought that interfaces like "sendpage()" and zero-copy and playing games with the VM was a great thing to do. It turns out that VM games are almost always more expensive than just copying the data in the first place, but hey, people didn't know that, and zero-copy was seen a big deal. The reality is that almost nobody uses splice and vmsplice at all, and they have been a much bigger headache than they are worth. If I could go back in time and not do them, I would. But there have been a few very special uses that seem to actually like the interfaces. But it's entirely possible that we should kill vmsplice() (likely by just implementing the semantics as "write()") because it's not common enough to have the complexity. Linus
On Sat, Nov 2, 2019 at 4:10 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, Nov 2, 2019 at 4:02 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > But I don't think anybody actually _did_ any of that. But that's > > basically the argument for the three splice operations: > > write/vmsplice/splice(). Which one you use depends on the lifetime and > > the source of your data. write() is obviously for the copy case (the > > source data might not be stable), while splice() is for the "data from > > another source", and vmsplace() is "data is from stable data in my > > vm". > > Btw, it's really worth noting that "splice()" and friends are from a > more happy-go-lucky time when we were experimenting with new > interfaces, and in a day and age when people thought that interfaces > like "sendpage()" and zero-copy and playing games with the VM was a > great thing to do. I suppose a nicer interface might be: madvise(buf, len, MADV_STABILIZE); (MADV_STABILIZE is an imaginary operation that write protects the memory a la fork() but without the copying part.) vmsplice_safer(fd, ...); Where vmsplice_safer() is like vmsplice, except that it only works on write-protected pages. If you vmsplice_safer() some memory and then write to the memory, the pipe keeps the old copy. But this can all be done with memfd and splice, too, I think. > > It turns out that VM games are almost always more expensive than just > copying the data in the first place, but hey, people didn't know that, > and zero-copy was seen a big deal. > > The reality is that almost nobody uses splice and vmsplice at all, and > they have been a much bigger headache than they are worth. If I could > go back in time and not do them, I would. But there have been a few > very special uses that seem to actually like the interfaces. > > But it's entirely possible that we should kill vmsplice() (likely by > just implementing the semantics as "write()") because it's not common > enough to have the complexity. I think this is the right choice. FWIW, the openssl vmsplice() call looks dubious, but I suspect it's okay because it's vmsplicing to a netlink socket, and the kernel code on the other end won't read the data after it returns a response. --Andy
On 03/11/2019 02.14, Andy Lutomirski wrote: > On Sat, Nov 2, 2019 at 4:10 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> On Sat, Nov 2, 2019 at 4:02 PM Linus Torvalds >> <torvalds@linux-foundation.org> wrote: >>> >>> But I don't think anybody actually _did_ any of that. But that's >>> basically the argument for the three splice operations: >>> write/vmsplice/splice(). Which one you use depends on the lifetime and >>> the source of your data. write() is obviously for the copy case (the >>> source data might not be stable), while splice() is for the "data from >>> another source", and vmsplace() is "data is from stable data in my >>> vm". >> >> Btw, it's really worth noting that "splice()" and friends are from a >> more happy-go-lucky time when we were experimenting with new >> interfaces, and in a day and age when people thought that interfaces >> like "sendpage()" and zero-copy and playing games with the VM was a >> great thing to do. > > I suppose a nicer interface might be: > > > madvise(buf, len, MADV_STABILIZE); > > (MADV_STABILIZE is an imaginary operation that write protects the > memory a la fork() but without the copying part.) > > vmsplice_safer(fd, ...); > > Where vmsplice_safer() is like vmsplice, except that it only works on > write-protected pages. If you vmsplice_safer() some memory and then > write to the memory, the pipe keeps the old copy. > > But this can all be done with memfd and splice, too, I think. Looks monstrous. This will kill all fun and profit. =) I think vmsplice should at least deprecate and ignore SPLICE_F_GIFT. It almost never works - if page still mapped then page_count in generic_pipe_buf_steal() will be at least 2 (pte and pipe gup). But if user munmap vma between splicing and consuming (and page not stuck in lazy tlb and per-cpu vectors) then page from anon lru could be spliced into file. Ouch. And looks like fuse device still accepts SPLICE_F_MOVE. > > >> >> It turns out that VM games are almost always more expensive than just >> copying the data in the first place, but hey, people didn't know that, >> and zero-copy was seen a big deal. >> >> The reality is that almost nobody uses splice and vmsplice at all, and >> they have been a much bigger headache than they are worth. If I could >> go back in time and not do them, I would. But there have been a few >> very special uses that seem to actually like the interfaces. >> >> But it's entirely possible that we should kill vmsplice() (likely by >> just implementing the semantics as "write()") because it's not common >> enough to have the complexity. > > I think this is the right choice. > > FWIW, the openssl vmsplice() call looks dubious, but I suspect it's > okay because it's vmsplicing to a netlink socket, and the kernel code > on the other end won't read the data after it returns a response. > > --Andy >
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 5ef57a322cb8..9617a35579cb 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -1983,6 +1983,7 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe, if (rem >= ibuf->len) { *obuf = *ibuf; ibuf->ops = NULL; + pipe_wake_fsync(pipe, ibuf, tail); tail++; pipe_commit_read(pipe, tail); } else { diff --git a/fs/pipe.c b/fs/pipe.c index 6a982a88f658..8e5fd7314be1 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -30,6 +30,12 @@ #include "internal.h" +struct pipe_fsync { + struct list_head link; /* Link in pipe->fsync */ + struct completion done; + unsigned int tail; /* The buffer being waited for */ +}; + /* * The max size that a non-root user is allowed to grow the pipe. Can * be set by root in /proc/sys/fs/pipe-max-size @@ -269,6 +275,58 @@ static bool pipe_buf_can_merge(struct pipe_buffer *buf) return buf->ops == &anon_pipe_buf_ops; } +/* + * Wait for all the data currently in the pipe to be consumed. + */ +static int pipe_fsync(struct file *file, loff_t a, loff_t b, int datasync) +{ + struct pipe_inode_info *pipe = file->private_data; + struct pipe_buffer *buf; + struct pipe_fsync fsync; + unsigned int head, tail, mask; + + pipe_lock(pipe); + + head = pipe->head; + tail = pipe->tail; + mask = pipe->ring_size - 1; + + if (pipe_empty(head, tail)) { + pipe_unlock(pipe); + return 0; + } + + init_completion(&fsync.done); + fsync.tail = tail; + buf = &pipe->bufs[tail & mask]; + buf->flags |= PIPE_BUF_FLAG_FSYNC; + pipe_buf_mark_unmergeable(buf); + list_add_tail(&fsync.link, &pipe->fsync); + pipe_unlock(pipe); + + if (wait_for_completion_interruptible(&fsync.done) < 0) { + pipe_lock(pipe); + list_del(&fsync.link); + pipe_unlock(pipe); + return -EINTR; + } + + return 0; +} + +void __pipe_wake_fsync(struct pipe_inode_info *pipe, unsigned int tail) +{ + struct pipe_fsync *fsync, *p; + + list_for_each_entry_safe(fsync, p, &pipe->fsync, link) { + if (fsync->tail == tail) { + list_del_init(&fsync->link); + complete(&fsync->done); + } + } +} +EXPORT_SYMBOL(__pipe_wake_fsync); + static ssize_t pipe_read(struct kiocb *iocb, struct iov_iter *to) { @@ -325,6 +383,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) if (!buf->len) { pipe_buf_release(pipe, buf); spin_lock_irq(&pipe->wait.lock); + pipe_wake_fsync(pipe, buf, tail); tail++; pipe_commit_read(pipe, tail); do_wakeup = 1; @@ -717,6 +776,7 @@ struct pipe_inode_info *alloc_pipe_info(void) pipe->ring_size = pipe_bufs; pipe->user = user; mutex_init(&pipe->mutex); + INIT_LIST_HEAD(&pipe->fsync); return pipe; } @@ -1060,6 +1120,7 @@ const struct file_operations pipefifo_fops = { .llseek = no_llseek, .read_iter = pipe_read, .write_iter = pipe_write, + .fsync = pipe_fsync, .poll = pipe_poll, .unlocked_ioctl = pipe_ioctl, .release = pipe_release, diff --git a/fs/splice.c b/fs/splice.c index 3f72bc31b6ec..e106367e1be6 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -523,6 +523,7 @@ static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_des if (!buf->len) { pipe_buf_release(pipe, buf); + pipe_wake_fsync(pipe, buf, tail); tail++; pipe_commit_read(pipe, tail); if (pipe->files) @@ -771,6 +772,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out, ret -= buf->len; buf->len = 0; pipe_buf_release(pipe, buf); + pipe_wake_fsync(pipe, buf, tail); tail++; pipe_commit_read(pipe, tail); if (pipe->files) @@ -1613,6 +1615,7 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe, */ *obuf = *ibuf; ibuf->ops = NULL; + pipe_wake_fsync(ipipe, ibuf, i_tail); i_tail++; pipe_commit_read(ipipe, i_tail); input_wakeup = true; diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index 90055ff16550..1a3027089558 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -8,6 +8,7 @@ #define PIPE_BUF_FLAG_ATOMIC 0x02 /* was atomically mapped */ #define PIPE_BUF_FLAG_GIFT 0x04 /* page is a gift */ #define PIPE_BUF_FLAG_PACKET 0x08 /* read() as a packet */ +#define PIPE_BUF_FLAG_FSYNC 0x10 /* fsync() is waiting for this buffer to die */ /** * struct pipe_buffer - a linux kernel pipe buffer @@ -43,6 +44,7 @@ struct pipe_buffer { * @w_counter: writer counter * @fasync_readers: reader side fasync * @fasync_writers: writer side fasync + * @fsync: Waiting fsyncs * @bufs: the circular array of pipe buffers * @user: the user who created this pipe **/ @@ -62,6 +64,7 @@ struct pipe_inode_info { struct page *tmp_page; struct fasync_struct *fasync_readers; struct fasync_struct *fasync_writers; + struct list_head fsync; struct pipe_buffer *bufs; struct user_struct *user; }; @@ -268,6 +271,25 @@ extern const struct pipe_buf_operations nosteal_pipe_buf_ops; long pipe_fcntl(struct file *, unsigned int, unsigned long arg); struct pipe_inode_info *get_pipe_info(struct file *file); +void __pipe_wake_fsync(struct pipe_inode_info *pipe, unsigned int tail); + +/** + * pipe_wake_fsync - Wake up anyone waiting with fsync for this point + * @pipe: The pipe that owns the buffer + * @buf: The pipe buffer in question + * @tail: The index in the ring of the buffer + * + * Check to see if anyone is waiting for the pipe ring to clear up to and + * including this buffer, and, if they are, wake them up. + */ +static inline void pipe_wake_fsync(struct pipe_inode_info *pipe, + struct pipe_buffer *buf, + unsigned int tail) +{ + if (unlikely(buf->flags & PIPE_BUF_FLAG_FSYNC)) + __pipe_wake_fsync(pipe, tail); +} + int create_pipe_files(struct file **, int); unsigned int round_pipe_size(unsigned long size); diff --git a/lib/iov_iter.c b/lib/iov_iter.c index e22f4e283f6d..38d52524cd21 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -404,7 +404,7 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by buf->offset = offset; buf->len = bytes; - pipe_commit_read(pipe, i_head); + pipe_commit_write(pipe, i_head); i->iov_offset = offset + bytes; i->head = i_head; out:
pipe: Add fsync() support The keyrings testsuite needs the ability to wait for all the outstanding notifications in the queue to have been processed so that it can then go through them to find out whether the notifications it expected have been emitted. Implement fsync() support for pipes to provide this. The tailmost buffer at the point of calling is marked and fsync adds itself to the list of waiters, noting the tail position to be waited for and marking the buffer as no longer mergeable. Then when the buffer is consumed, if the flag is set, any matching waiters are woken up. Signed-off-by: David Howells <dhowells@redhat.com> --- fs/fuse/dev.c | 1 fs/pipe.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++ fs/splice.c | 3 ++ include/linux/pipe_fs_i.h | 22 ++++++++++++++++ lib/iov_iter.c | 2 - 5 files changed, 88 insertions(+), 1 deletion(-)