mbox series

[GIT,PULL] pipe: Notification queue preparation

Message ID 31452.1574721589@warthog.procyon.org.uk (mailing list archive)
State Mainlined
Commit 6a965666b7e7475c2f8c8e724703db58b8a8a445
Headers show
Series [GIT,PULL] pipe: Notification queue preparation | expand

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-pipe-prep-20191115

Message

David Howells Nov. 25, 2019, 10:39 p.m. UTC
Hi Linus,

Can you pull this please?  This is my set of preparatory patches for
building a general notification queue on top of pipes.  It makes a number
of significant changes:

 (1) It removes the nr_exclusive argument from __wake_up_sync_key() as this
     is always 1.  This prepares for step 2.

 (2) Adds wake_up_interruptible_sync_poll_locked() so that poll can be
     woken up from a function that's holding the poll waitqueue spinlock.

     [btw, I realise that I haven't un-sync'd the
      wake_up_interruptible_sync_poll() calls as you tentatively suggested.
      I can send a follow up patch to fix that if you still want it]

 (3) Change the pipe buffer ring to be managed in terms of unbounded head
     and tail indices rather than bounded index and length.  This means
     that reading the pipe only needs to modify one index, not two.

 (4) A selection of helper functions are provided to query the state of the
     pipe buffer, plus a couple to apply updates to the pipe indices.

 (5) The pipe ring is allowed to have kernel-reserved slots.  This allows
     many notification messages to be spliced in by the kernel without
     allowing userspace to pin too many pages if it writes to the same
     pipe.

 (6) Advance the head and tail indices inside the pipe waitqueue lock and
     use step 2 to poke poll without having to take the lock twice.

 (7) Rearrange pipe_write() to preallocate the buffer it is going to write
     into and then drop the spinlock.  This allows kernel notifications to
     then be added the ring whilst it is filling the buffer it allocated.
     The read side is stalled because the pipe mutex is still held.

 (8) Don't wake up readers on a pipe if there was already data in it when
     we added more.

 (9) Don't wake up writers on a pipe if the ring wasn't full before we
     removed a buffer.

PATCHES	BENCHMARK	BEST		TOTAL BYTES	AVG BYTES	STDDEV
=======	===============	===============	===============	===============	===============
-	pipe		      307457969	    36348556755	      302904639	       10622403
-	splice		      287117614	    26933658717	      224447155	      160777958
-	vmsplice	      435180375	    51302964090	      427524700	       19083037

rm-nrx	pipe		      311091179	    37093181356	      309109844	        7221622
rm-nrx	splice		      285628049	    27916298942	      232635824	      158296431
rm-nrx	vmsplice	      417703153	    47570362546	      396419687	       33960822

wakesl	pipe		      310698731	    36772541631	      306437846	        8249347
wakesl	splice		      286193726	    28600435451	      238336962	      141169318
wakesl	vmsplice	      436175803	    50723895824	      422699131	       40724240

ht	pipe		      305534565	    36426079543	      303550662	        5673885
ht	splice		      243632025	    23319439010	      194328658	      150479853
ht	vmsplice	      432825176	    49101781001	      409181508	       44102509

k-rsv	pipe		      308691523	    36652267561	      305435563	       12972559
k-rsv	splice		      244793528	    23625172865	      196876440	      125319143
k-rsv	vmsplice	      436119082	    49460808579	      412173404	       55547525

r-adv-t	pipe		      310094218	    36860182219	      307168185	        8081101
r-adv-t	splice		      285527382	    27085052687	      225708772	      206918887
r-adv-t	vmsplice	      336885948	    40128756927	      334406307	        5895935

r-cond	pipe		      308727804	    36635828180	      305298568	        9976806
r-cond	splice		      284467568	    28445793054	      237048275	      200284329
r-cond	vmsplice	      449679489	    51134833848	      426123615	       66790875

w-preal	pipe		      307416578	    36662086426	      305517386	        6216663
w-preal	splice		      282655051	    28455249109	      237127075	      194154549
w-preal	vmsplice	      437002601	    47832160621	      398601338	       96513019

w-redun	pipe		      307279630	    36329750422	      302747920	        8913567
w-redun	splice		      284324488	    27327152734	      227726272	      219735663
w-redun	vmsplice	      451141971	    51485257719	      429043814	       51388217

w-ckful	pipe		      305055247	    36374947350	      303124561	        5400728
w-ckful	splice		      281575308	    26841554544	      223679621	      215942886
w-ckful	vmsplice	      436653588	    47564907110	      396374225	       82255342

The patches column indicates the point in the patchset at which the benchmarks
were taken:

	0	No patches
	rm-nrx	"Remove the nr_exclusive argument from __wake_up_sync_key()"
	wakesl	"Add wake_up_interruptible_sync_poll_locked()"
	ht	"pipe: Use head and tail pointers for the ring, not cursor and length"
	k-rsv	"pipe: Allow pipes to have kernel-reserved slots"
	r-adv-t	"pipe: Advance tail pointer inside of wait spinlock in pipe_read()"
	r-cond	"pipe: Conditionalise wakeup in pipe_read()"
	w-preal	"pipe: Rearrange sequence in pipe_write() to preallocate slot"
	w-redun	"pipe: Remove redundant wakeup from pipe_write()"
	w-ckful	"pipe: Check for ring full inside of the spinlock in pipe_write()"

Changes:

 (*) Fix some bugs spotted by kbuild.

 ver #3:

 (*) Get rid of pipe_commit_{read,write}.

 (*) Port the virtio_console driver.

 (*) Fix pipe_zero().

 (*) Amend some comments.

 (*) Added an additional patch that changes the threshold at which readers
     wake writers for Konstantin Khlebnikov.

 ver #2:

 (*) Split the notification patches out into a separate branch.

 (*) Removed the nr_exclusive parameter from __wake_up_sync_key().

 (*) Renamed the locked wakeup function.

 (*) Add helpers for empty, full, occupancy.

 (*) Split the addition of ->max_usage out into its own patch.

 (*) Fixed some bits pointed out by Rasmus Villemoes.

 ver #1:

 (*) Build on top of standard pipes instead of having a driver.

David
---
The following changes since commit da0c9ea146cbe92b832f1b0f694840ea8eb33cce:

  Linux 5.4-rc2 (2019-10-06 14:27:30 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-pipe-prep-20191115

for you to fetch changes up to 3c0edea9b29f9be6c093f236f762202b30ac9431:

  pipe: Remove sync on wake_ups (2019-11-15 16:22:54 +0000)

----------------------------------------------------------------
Pipework for general notification queue

----------------------------------------------------------------
David Howells (12):
      pipe: Reduce #inclusion of pipe_fs_i.h
      Remove the nr_exclusive argument from __wake_up_sync_key()
      Add wake_up_interruptible_sync_poll_locked()
      pipe: Use head and tail pointers for the ring, not cursor and length
      pipe: Allow pipes to have kernel-reserved slots
      pipe: Advance tail pointer inside of wait spinlock in pipe_read()
      pipe: Conditionalise wakeup in pipe_read()
      pipe: Rearrange sequence in pipe_write() to preallocate slot
      pipe: Remove redundant wakeup from pipe_write()
      pipe: Check for ring full inside of the spinlock in pipe_write()
      pipe: Increase the writer-wakeup threshold to reduce context-switch count
      pipe: Remove sync on wake_ups

 drivers/char/virtio_console.c |  16 ++-
 fs/exec.c                     |   1 -
 fs/fuse/dev.c                 |  31 +++--
 fs/ocfs2/aops.c               |   1 -
 fs/pipe.c                     | 232 +++++++++++++++++++++---------------
 fs/splice.c                   | 190 +++++++++++++++++------------
 include/linux/pipe_fs_i.h     |  64 +++++++++-
 include/linux/uio.h           |   4 +-
 include/linux/wait.h          |  11 +-
 kernel/exit.c                 |   2 +-
 kernel/sched/wait.c           |  37 ++++--
 lib/iov_iter.c                | 269 ++++++++++++++++++++++++------------------
 security/smack/smack_lsm.c    |   1 -
 13 files changed, 529 insertions(+), 330 deletions(-)

Comments

pr-tracker-bot@kernel.org Nov. 30, 2019, 11:05 p.m. UTC | #1
The pull request you sent on Mon, 25 Nov 2019 22:39:49 +0000:

> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-pipe-prep-20191115

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/6a965666b7e7475c2f8c8e724703db58b8a8a445

Thank you!
David Sterba Dec. 5, 2019, 12:58 p.m. UTC | #2
On Mon, Nov 25, 2019 at 10:39:49PM +0000, David Howells wrote:
> David Howells (12):
>       pipe: Reduce #inclusion of pipe_fs_i.h
>       Remove the nr_exclusive argument from __wake_up_sync_key()
>       Add wake_up_interruptible_sync_poll_locked()
>       pipe: Use head and tail pointers for the ring, not cursor and length

This commit (8cefc107ca54c8b06438b7dc9) causes hangs of 'btrfs send'
commands, that's eg. inside fstests or in btrfs-progs testsuite. The
'send' uses pipe/splice to get data from kernel to userspace.

The test that reliably worked for me leaves the process hanging with
this stacktrace (no CPU or IO activity):

[<0>] pipe_write+0x1be/0x4b0
[<0>] new_sync_write+0x11e/0x1c0
[<0>] vfs_write+0xc1/0x1d0
[<0>] kernel_write+0x2c/0x40
[<0>] send_cmd+0x78/0xf0 [btrfs]
[<0>] send_extent_data+0x4b1/0x52c [btrfs]
[<0>] process_extent+0xe46/0xe9d [btrfs]
[<0>] changed_cb+0xcf5/0xd2f [btrfs]
[<0>] send_subvol+0x8ad/0xc0b [btrfs]
[<0>] btrfs_ioctl_send+0xe50/0xf30 [btrfs]
[<0>] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[<0>] btrfs_ioctl+0x18e1/0x3450 [btrfs]
[<0>] do_vfs_ioctl+0xa5/0x710
[<0>] ksys_ioctl+0x70/0x80
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x56/0x220
[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe

I've bisected that to the mentioned commit, using test in btrfs-progs
testsuite with command "make TEST=016\* test-misc". Normally the test
can run up to 10 seconds, in the buggy case it stays there.

I can help testing fixes or gathering more information, please let me
know.

Full bisect log:

# bad: [3c0edea9b29f9be6c093f236f762202b30ac9431] pipe: Remove sync on wake_ups
# good: [da0c9ea146cbe92b832f1b0f694840ea8eb33cce] Linux 5.4-rc2
git bisect start '3c0edea9b29f9be6c093f236f762202b30ac9431' 'd055b4fb4d165b06d912e7f846610d120c3bb9fb^'
# bad: [b667b867344301e24f21d4a4c844675ff61d89e1] pipe: Advance tail pointer inside of wait spinlock in pipe_read()
git bisect bad b667b867344301e24f21d4a4c844675ff61d89e1
# good: [f94df9890e98f2090c6a8d70c795134863b70201] Add wake_up_interruptible_sync_poll_locked()
git bisect good f94df9890e98f2090c6a8d70c795134863b70201
# bad: [6718b6f855a0b4962d54bd625be2718cb820cec6] pipe: Allow pipes to have kernel-reserved slots
git bisect bad 6718b6f855a0b4962d54bd625be2718cb820cec6
# bad: [8cefc107ca54c8b06438b7dc9cc08bc0a11d5b98] pipe: Use head and tail pointers for the ring, not cursor and length
git bisect bad 8cefc107ca54c8b06438b7dc9cc08bc0a11d5b98
# first bad commit: [8cefc107ca54c8b06438b7dc9cc08bc0a11d5b98] pipe: Use head and tail pointers for the ring, not cursor and length

>       pipe: Allow pipes to have kernel-reserved slots
>       pipe: Advance tail pointer inside of wait spinlock in pipe_read()
>       pipe: Conditionalise wakeup in pipe_read()
>       pipe: Rearrange sequence in pipe_write() to preallocate slot
>       pipe: Remove redundant wakeup from pipe_write()
>       pipe: Check for ring full inside of the spinlock in pipe_write()
>       pipe: Increase the writer-wakeup threshold to reduce context-switch count
>       pipe: Remove sync on wake_ups
David Howells Dec. 5, 2019, 1:56 p.m. UTC | #3
David Sterba <dsterba@suse.cz> wrote:

> [<0>] pipe_write+0x1be/0x4b0

Can you get me a line number of that?  Assuming you've built with -g, load
vmlinux into gdb and do "i li pipe_write+0x1be".

David
Linus Torvalds Dec. 5, 2019, 5:12 p.m. UTC | #4
On Thu, Dec 5, 2019 at 5:57 AM David Howells <dhowells@redhat.com> wrote:
>
> David Sterba <dsterba@suse.cz> wrote:
>
> > [<0>] pipe_write+0x1be/0x4b0
>
> Can you get me a line number of that?  Assuming you've built with -g, load
> vmlinux into gdb and do "i li pipe_write+0x1be".

If the kernel is built with debug info (which you need for the gdb
command anyway), it's much better to just use

   ./scripts/decode_stacktrace.sh

which gives all the information for the whole backtrace.

It would be interesting to hear if somebody else is waiting on the
read side too.

             Linus
David Sterba Dec. 5, 2019, 5:21 p.m. UTC | #5
On Thu, Dec 05, 2019 at 01:56:57PM +0000, David Howells wrote:
> David Sterba <dsterba@suse.cz> wrote:
> 
> > [<0>] pipe_write+0x1be/0x4b0
> 
> Can you get me a line number of that?  Assuming you've built with -g, load
> vmlinux into gdb and do "i li pipe_write+0x1be".

I built it with -g (DEBUG_INFO) but there's no output for the command (gdb 8.2):

(gdb) i li pipe_write+0x1be
Function "pipe_write+0x1be" not defined.

But the address can tell something:

(gdb) l *(pipe_write+0x1be)
0xffffffff81390b8e is in pipe_write (fs/pipe.c:509).
warning: Source file is more recent than executable.
504                             kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
505                             do_wakeup = 0;
506                     }
507                     pipe->waiting_writers++;
508                     pipe_wait(pipe);
509                     pipe->waiting_writers--;
510             }
511     out:
512             __pipe_unlock(pipe);
513             if (do_wakeup) {

I rerun the test again (with a different address where it's stuck), there's
nothing better I can get from the debug info, it always points to pipe_wait,
disassembly points to:

   0xffffffff81390b71 <+417>:   jne    0xffffffff81390c23 <pipe_write+595>
   0xffffffff81390b77 <+423>:   test   %ecx,%ecx
   0xffffffff81390b79 <+425>:   jne    0xffffffff81390b95 <pipe_write+453>
   0xffffffff81390b7b <+427>:   addl   $0x1,0x110(%rbx)
   0xffffffff81390b82 <+434>:   mov    %rbx,%rdi
   0xffffffff81390b85 <+437>:   callq  0xffffffff813908c0 <pipe_wait>
   0xffffffff81390b8a <+442>:   subl   $0x1,0x110(%rbx)

(pipe_write+0x1ba == 0xffffffff81390b8a)
Linus Torvalds Dec. 5, 2019, 5:24 p.m. UTC | #6
On Thu, Dec 5, 2019 at 9:12 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It would be interesting to hear if somebody else is waiting on the
> read side too.

Looking once more at that commit, I find at least one bug, but I'm
pretty sure that's not the cause of this problem.

DavidH, watch out for things like this:

-       for (idx = 0; idx < pipe->nrbufs && rem < len; idx++)
+       for (idx = tail; idx < head && rem < len; idx++)

which is a completely buggy conversion.

You can't compare tail and head with an inequality, it gets the
wraparound case wrong.

But I found only one of those, and it's fuse-specific, plus the
overflow would take a long time to trigger, so I'm pretty sure this
isn't what DavidS is reporting.

               Linus
David Howells Dec. 5, 2019, 5:25 p.m. UTC | #7
I've just posted a couple of patches - can you check to see if they fix your
problem?

https://lore.kernel.org/linux-fsdevel/157556649610.20869.8537079649495343567.stgit@warthog.procyon.org.uk/T/#t

Thanks,
David
Linus Torvalds Dec. 5, 2019, 5:33 p.m. UTC | #8
On Thu, Dec 5, 2019 at 9:22 AM David Sterba <dsterba@suse.cz> wrote:
>
> I rerun the test again (with a different address where it's stuck), there's
> nothing better I can get from the debug info, it always points to pipe_wait,
> disassembly points to.

Hah. I see another bug.

"pipe_wait()" depends on the fact that all events that wake it up
happen with the pipe lock held.

But we do some of the "do_wakeup()" handling outside the pipe lock now
on the reader side

        __pipe_unlock(pipe);

        /* Signal writers asynchronously that there is more room. */
        if (do_wakeup) {
                wake_up_interruptible_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
                kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
        }

However, that isn't new to this series _either_, so I don't think
that's it. It does wake up things inside the lock _too_ if it ended up
emptying a whole buffer.

So it could be triggered by timing and behavior changes, but I doubt
this pipe_wait() thing is it either. The fact that it bisects to the
thing that changes things to use head/tail pointers makes me think
there's some other incorrect update or comparison somewhere.

That said, "pipe_wait()" is an abomination. It should use a proper
wait condition and use wait_event(), but the code predates all of
that. I suspect pipe_wait() goes back to the dark ages with the BKL
and no actual races between kernel code.

               Linus
David Sterba Dec. 5, 2019, 6:18 p.m. UTC | #9
On Thu, Dec 05, 2019 at 05:25:20PM +0000, David Howells wrote:
> I've just posted a couple of patches - can you check to see if they fix your
> problem?
> 
> https://lore.kernel.org/linux-fsdevel/157556649610.20869.8537079649495343567.stgit@warthog.procyon.org.uk/T/#t

Not fixed, the test still hangs with the same call stack.