@@ -87,7 +87,6 @@ noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
else if (held_rw == BTRFS_READ_LOCK)
held_rw = BTRFS_READ_LOCK_BLOCKING;
}
- btrfs_set_path_blocking(p);
for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
if (p->nodes[i] && p->locks[i]) {
@@ -2536,8 +2535,16 @@ setup_nodes_for_search(struct btrfs_trans_handle *trans,
if (*write_lock_level < level + 1) {
*write_lock_level = level + 1;
- btrfs_release_path(p);
- goto again;
+
+ ASSERT(p->locks[level] == BTRFS_WRITE_LOCK ||
+ p->locks[level] == BTRFS_READ_LOCK);
+
+ /* if it's not the root node or the lock is not WRITE_LOCK */
+ if ((level < BTRFS_MAX_LEVEL - 1 && p->nodes[level + 1]) ||
+ p->locks[level] != BTRFS_WRITE_LOCK) {
+ btrfs_release_path(p);
+ goto again;
+ }
}
btrfs_set_path_blocking(p);
@@ -2555,10 +2562,32 @@ setup_nodes_for_search(struct btrfs_trans_handle *trans,
BTRFS_NODEPTRS_PER_BLOCK(root) / 2) {
int sret;
+ if (btrfs_header_nritems(b) > BTRFS_NODEPTRS_PER_BLOCK(root) / 4) {
+ ret = 0;
+ goto done;
+ }
+
+ /*
+ * If this is a root node with more than one item, then don't
+ * balance at all since it's totally unnecessary.
+ */
+ if (level < BTRFS_MAX_LEVEL - 1 && !p->nodes[level + 1] &&
+ btrfs_header_nritems(b) != 1) {
+ ret = 0;
+ goto done;
+ }
+
if (*write_lock_level < level + 1) {
*write_lock_level = level + 1;
- btrfs_release_path(p);
- goto again;
+ ASSERT(p->locks[level] == BTRFS_WRITE_LOCK ||
+ p->locks[level] == BTRFS_READ_LOCK);
+
+ /* if it's not the root node or the lock is not WRITE_LOCK */
+ if ((level < BTRFS_MAX_LEVEL - 1 && p->nodes[level + 1]) ||
+ p->locks[level] != BTRFS_WRITE_LOCK) {
+ btrfs_release_path(p);
+ goto again;
+ }
}
btrfs_set_path_blocking(p);
@@ -2851,8 +2880,15 @@ cow_done:
if (slot == 0 && ins_len &&
write_lock_level < level + 1) {
write_lock_level = level + 1;
- btrfs_release_path(p);
- goto again;
+ ASSERT(p->locks[level] == BTRFS_WRITE_LOCK ||
+ p->locks[level] == BTRFS_READ_LOCK);
+
+ /* if it's not the root node or the lock is not WRITE_LOCK */
+ if ((level < BTRFS_MAX_LEVEL - 1 && p->nodes[level + 1]) ||
+ p->locks[level] != BTRFS_WRITE_LOCK) {
+ btrfs_release_path(p);
+ goto again;
+ }
}
unlock_up(p, level, lowest_unlock,
@@ -5130,6 +5166,8 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
int level;
int ret = 1;
int keep_locks = path->keep_locks;
+ u64 blocknr;
+ u64 blockgen;
path->keep_locks = 1;
again:
@@ -5197,16 +5235,31 @@ find_next_key:
ret = 0;
goto out;
}
- btrfs_set_path_blocking(path);
- cur = read_node_slot(root, cur, slot);
- BUG_ON(!cur); /* -ENOMEM */
- btrfs_tree_read_lock(cur);
+ unlock_up(path, level, 1, 0, NULL);
+
+ blocknr = btrfs_node_blockptr(cur, slot);
+ blockgen = btrfs_node_ptr_generation(cur, slot);
+ cur = btrfs_find_tree_block(root->fs_info, blocknr);
+ if (cur && btrfs_buffer_uptodate(cur, blockgen, 1) > 0) {
+ int tmp;
+ tmp = btrfs_tree_read_lock_atomic(cur);
+ if (!tmp) {
+ btrfs_set_path_blocking(path);
+ btrfs_tree_read_lock(cur);
+ btrfs_clear_path_blocking(path, cur, BTRFS_READ_LOCK);
+ }
+ } else {
+ btrfs_set_path_blocking(path);
+ cur = read_node_slot(root, cur, slot);
+ BUG_ON(!cur); /* -ENOMEM */
+
+ btrfs_tree_read_lock(cur);
+ btrfs_clear_path_blocking(path, cur, BTRFS_READ_LOCK);
+ }
path->locks[level - 1] = BTRFS_READ_LOCK;
path->nodes[level - 1] = cur;
- unlock_up(path, level, 1, 0, NULL);
- btrfs_clear_path_blocking(path, NULL, 0);
}
out:
path->keep_locks = keep_locks;
@@ -228,6 +228,7 @@ int btrfs_check_dir_item_collision(struct btrfs_root *root, u64 dir,
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
+ path->leave_spinning = 1;
key.objectid = dir;
key.type = BTRFS_DIR_ITEM_KEY;
@@ -704,6 +704,7 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
+ path->leave_spinning = 1;
again:
next_offset = (u64)-1;
found_next = 0;
@@ -834,10 +835,8 @@ insert:
} else {
ins_size = csum_size;
}
- path->leave_spinning = 1;
ret = btrfs_insert_empty_item(trans, root, path, &file_key,
ins_size);
- path->leave_spinning = 0;
if (ret < 0)
goto fail_unlock;
if (WARN_ON(ret != 0))
@@ -715,12 +715,15 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
int update_refs;
int found = 0;
int leafs_visited = 0;
+ int old_spinning = path->leave_spinning;
if (drop_cache)
btrfs_drop_extent_cache(inode, start, end - 1, 0);
- if (start >= BTRFS_I(inode)->disk_i_size && !replace_extent)
+ if (start >= BTRFS_I(inode)->disk_i_size && !replace_extent) {
modify_tree = 0;
+ path->leave_spinning = 1;
+ }
update_refs = (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
root == root->fs_info->tree_root);
@@ -809,6 +812,7 @@ next_slot:
search_start = max(key.offset, start);
if (recow || !modify_tree) {
modify_tree = -1;
+ path->leave_spinning = 0;
btrfs_release_path(path);
continue;
}
@@ -1011,6 +1015,7 @@ delete_extent_item:
btrfs_release_path(path);
if (drop_end)
*drop_end = found ? min(end, extent_end) : end;
+ path->leave_spinning = old_spinning;
return ret;
}
@@ -528,6 +528,8 @@ static int btrfs_find_highest_objectid(struct btrfs_root *root, u64 *objectid)
if (!path)
return -ENOMEM;
+ path->leave_spinning = 1;
+
search_key.objectid = BTRFS_LAST_FREE_OBJECTID;
search_key.type = -1;
search_key.offset = (u64)-1;
@@ -5347,6 +5347,7 @@ static int btrfs_inode_by_name(struct inode *dir, struct dentry *dentry,
if (!path)
return -ENOMEM;
+ path->leave_spinning = 1;
di = btrfs_lookup_dir_item(NULL, root, path, btrfs_ino(dir), name,
namelen, 0);
if (IS_ERR(di))
@@ -6027,6 +6028,8 @@ static int btrfs_set_inode_index_count(struct inode *inode)
if (!path)
return -ENOMEM;
+ path->leave_spinning = 1;
+
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
if (ret < 0)
goto out;
@@ -33,6 +33,7 @@ int btrfs_insert_orphan_item(struct btrfs_trans_handle *trans,
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
+ path->leave_spinning = 1;
ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
@@ -54,6 +55,7 @@ int btrfs_del_orphan_item(struct btrfs_trans_handle *trans,
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
+ path->leave_spinning = 1;
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
if (ret < 0)
@@ -147,6 +147,7 @@ int btrfs_update_root(struct btrfs_trans_handle *trans, struct btrfs_root
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
+ path->leave_spinning = 1;
ret = btrfs_search_slot(trans, root, key, path, 0, 1);
if (ret < 0) {
@@ -46,6 +46,7 @@ ssize_t __btrfs_getxattr(struct inode *inode, const char *name,
if (!path)
return -ENOMEM;
+ path->leave_spinning = 1;
/* lookup the xattr by name */
di = btrfs_lookup_xattr(NULL, root, path, btrfs_ino(inode), name,
strlen(name), 0);
@@ -105,6 +106,7 @@ static int do_setxattr(struct btrfs_trans_handle *trans,
if (!path)
return -ENOMEM;
path->skip_release_on_error = 1;
+ path->leave_spinning = 1;
if (!value) {
di = btrfs_lookup_xattr(trans, root, path, btrfs_ino(inode),
Kent Overstreet posted some dbench test numbers in the announcement of
bcachefs[1], in which btrfs's performance is much worse than that of ext4
and xfs, especially in the case of multiple threads.

This difference can be observed on fast storage. I ran 'dbench -t10 64'
with a 1.6T NVMe disk:

Processor: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Memory: 504G

I took some time to dig into it. perf shows that in the multi-threaded
case we spend most of the cpu cycles on the spin_lock_irqsave() and
spin_unlock_irqrestore() pair, which is called by wait_event() in btree
locking.

72.84%  72.84%  dbench  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
         |
         ---native_queued_spin_lock_slowpath
            |
            |--71.64%-- _raw_spin_lock_irqsave
            |          |
            |          |--52.17%-- prepare_to_wait_event
            |          |          |
            |          |          |--94.33%-- btrfs_tree_lock
            |          |          |          |
            |          |          |          |--99.10%-- btrfs_lock_root_node
            |          |          |          |          btrfs_search_slot
            |          |          |          |          |
            |          |          |          |          |--26.44%-- btrfs_lookup_dir_item
            |          |          |          |          |          |
            |          |          |          |          |          |--99.31%-- __btrfs_unlink_inode

Serious contention on the btree lock can also be confirmed by another
fact: if you use a subvolume instead of a directory for each dbench
client, the multi-threaded numbers become considerably better. For 64
clients,

Throughput 5904.71 MB/sec  64 clients  64 procs  max_latency=816.715 ms

I did a few things to avoid waiting for blocking writers and readers:

1) Use path->leave_spinning=1 as much as possible; this leaves us
   holding the spinning lock after searching the btree (a minimal
   caller-side sketch appears after the diffstat below).

2) Find the cases where we don't have to take a blocking lock, for
   example, we don't need a blocking lock when the parent node has more
   than 1/4 of the items it can hold.

3) Avoid unnecessary "goto again", e.g. on the btree root level, just
   update write_lock_level if we already hold BTRFS_WRITE_LOCK.

4) Remove btrfs_set_path_blocking() in btrfs_clear_path_blocking();
   this contributes a large part of the improved numbers. The function
   was introduced to avoid a lockdep warning, but after I turned lockdep
   on, xfstests didn't report such a warning.

5) Make btrfs_search_forward() use a non-sleeping function to find the
   eb; this fixes a deadlock that shows up with the previous changes.

Here are the end results for 64 clients. Compared to vanilla 4.2, btrfs
runs 15x faster but with higher latency (w/o = without this patch,
w = with this patch):

                     tput(MB/sec)   max_latency(ms)
 xfs                    2742.93          21.855
 ext4                   7182.92          19.053
 btrfs+subvol w/o       5904.71         816.715
 btrfs+dir    w/o        122.778        718.674
*btrfs+dir    w         1715.77        1366.981

I've marked it as RFC since I'm not confident about the lockdep part.

Any comments are welcome!

[1]: https://lkml.org/lkml/2015/8/21/22

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/ctree.c     | 79 +++++++++++++++++++++++++++++++++++++++++++---------
 fs/btrfs/dir-item.c  |  1 +
 fs/btrfs/file-item.c |  3 +-
 fs/btrfs/file.c      |  7 ++++-
 fs/btrfs/inode-map.c |  2 ++
 fs/btrfs/inode.c     |  3 ++
 fs/btrfs/orphan.c    |  2 ++
 fs/btrfs/root-tree.c |  1 +
 fs/btrfs/xattr.c     |  2 ++
 9 files changed, 84 insertions(+), 16 deletions(-)
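
For reference, change 1) boils down to the caller-side pattern below. This is
a minimal sketch of a hypothetical read-only lookup (the function name and the
key handling are illustrative, not part of the patch); the real call sites are
the hunks above that set path->leave_spinning, e.g. btrfs_insert_orphan_item
and __btrfs_getxattr.

/*
 * Minimal sketch (hypothetical caller, not from the patch) of the
 * leave_spinning pattern: keep the spinning locks after the search
 * instead of letting btrfs_search_slot() convert them to blocking
 * locks, so short, non-sleeping lookups stay off the blocking-lock
 * wait queues.
 */
static int lookup_item_spinning(struct btrfs_root *root,
				struct btrfs_key *key)
{
	struct btrfs_path *path;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	/* keep the spinning locks held when the search returns */
	path->leave_spinning = 1;

	ret = btrfs_search_slot(NULL, root, key, path, 0, 0);

	/* ... examine path->nodes[0] / path->slots[0] without sleeping ... */

	btrfs_free_path(path);
	return ret;
}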