[00/52,RFC] virtio-fs: shared file system for virtual machines

Message ID 20181210171318.16998-1-vgoyal@redhat.com (mailing list archive)

Message

Vivek Goyal Dec. 10, 2018, 5:12 p.m. UTC
Hi,

Here are RFC patches for virtio-fs. Looking for feedback on this approach.

These patches should apply on top of 4.20-rc5. We have also put code for
various components here.

https://gitlab.com/virtio-fs

Problem Description
===================
We want to be able to take a directory tree on the host and share it with one
or more guests. Our goal is to do this in a fast, consistent and secure manner.
Our primary use case is Kata Containers, but it should be usable in other
scenarios as well.

Containers may rely on local file system semantics for shared volumes,
read-write mounts that multiple containers access simultaneously.  File
system changes must be visible to other containers with the same consistency
expected of a local file system, including mmap MAP_SHARED.

Existing Solutions
==================
We looked at existing solutions: virtio-9p already provides basic shared
file system functionality, although it does not offer local file system
semantics, causing some workloads and test suites to fail. In addition,
virtio-9p performance has been an issue for Kata Containers and we believe
this cannot be alleviated without major changes that do not fit into the 9P
protocol.

Design Overview
===============
With the goal of designing something with better performance and local file
system semantics, several ideas were proposed:

- Use the FUSE protocol (instead of 9p) for communication between guest
  and host. The guest kernel is the FUSE client and a FUSE server runs
  on the host to serve the requests. Benchmark results (see below) are
  encouraging and show this approach performs well (2x to 8x improvement
  depending on the test being run).

- For data access inside the guest, portions of files are mmapped into the
  QEMU address space and the guest accesses this memory using DAX. That way
  the guest page cache is bypassed and there is only one copy of the data
  (on the host). This will also enable mmap(MAP_SHARED) between guests.

- For metadata coherency, there is a shared memory region which contains
  version numbers associated with metadata. Any guest changing metadata
  updates the corresponding version number, and other guests refresh their
  metadata on the next access (see the sketch below). This is still
  experimental and the implementation is not complete.
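
To make the version table idea concrete, here is a minimal userspace-style
sketch of the check a guest could perform before trusting cached metadata.
The struct layout and names below are made up for this mail; they are not
the uapi or data layout from these patches.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdatomic.h>

    /* One slot in the shared version table (layout is illustrative only). */
    struct version_slot {
            _Atomic uint64_t version;  /* bumped by the guest that changed metadata */
    };

    /*
     * Before trusting cached attributes, compare the version recorded at
     * cache time with the current value in the shared region.  This is a
     * plain shared-memory read: no request to the host and no vmexit.
     */
    static bool cached_metadata_valid(struct version_slot *slot,
                                      uint64_t cached_version)
    {
            return atomic_load_explicit(&slot->version, memory_order_acquire)
                   == cached_version;
    }

If the versions differ, the guest falls back to fetching fresh metadata from
the host over the regular FUSE request path.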

How virtio-fs differs from existing approaches
==============================================
The unique idea behind virtio-fs is to take advantage of the co-location
of the virtual machine and hypervisor to avoid communication (vmexits).

DAX allows file contents to be accessed without communication with the
hypervisor. The shared memory region for metadata avoids communication in
the common case where metadata is unchanged.
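
As a rough sketch of the DAX path (the names below are illustrative, not the
exact uapi added by this series): the guest sends a mapping request over FUSE,
the host mmaps the requested file range into the device's DAX/cache window,
and subsequent guest accesses to that range are ordinary loads and stores into
shared memory.

    #include <stdint.h>

    /*
     * Illustrative request layout only.  The series introduces
     * FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING commands; conceptually the
     * guest asks the host to map a file range into the shared window.
     */
    struct setupmapping_sketch {
            uint64_t fh;        /* host-side file handle */
            uint64_t foffset;   /* offset into the file */
            uint64_t len;       /* length of the range to map */
            uint64_t moffset;   /* offset into the DAX/cache window */
            uint64_t flags;     /* read and/or write access */
    };

    /*
     * Once the mapping is established, reading the range is just a memory
     * load from the window: no further requests to the host, no vmexit and
     * no copy into the guest page cache.
     */
    static uint64_t read_u64_from_window(const volatile void *window,
                                         uint64_t moffset)
    {
            return *(const volatile uint64_t *)
                    ((const volatile char *)window + moffset);
    }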

By replacing expensive communication with cheaper shared memory accesses, we
expect to achieve better performance than approaches based on network file
system protocols. This also makes it easier to achieve local file system
semantics (coherency).

These techniques are not applicable to network file system protocols since
the communications channel is bypassed by taking advantage of shared memory
on a local machine. This is why we decided to build virtio-fs rather than
focus on 9P or NFS.

HOWTO
======
We have put instructions on how to use it here.

https://virtio-fs.gitlab.io/

Caching Modes
=============
Like virtio-9p, virtio-fs supports different caching modes, which also
determine the coherency level. The “cache=FOO” and “writeback” options control
the level of coherence between the guest and host filesystems. The “shared”
option only has an effect on coherence between virtio-fs filesystem instances
running inside different guests.

- cache=none
  metadata, data and pathname lookup results are not cached in the guest. They
  are always fetched from the host and any changes are immediately pushed to
  the host.

- cache=always
  metadata, data and pathname lookup results are cached in the guest and never
  expire.

- cache=auto
  metadata and pathname lookup caches expire after a configurable amount of
  time (the default is 1 second). Data is cached while the file is open
  (close-to-open consistency).

- writeback/no_writeback
  These options control the writeback strategy. If writeback is disabled,
  normal writes are immediately synchronized with the host fs. If writeback is
  enabled, writes may be cached in the guest until the file is closed or
  fsync(2) is called. This option has no effect on mmapped writes or writes
  going through the DAX mechanism.

- shared/no_shared
  These options control the use of the shared version table. If shared mode is
  enabled, metadata and pathname lookup results are cached in the guest, but
  are refreshed when another virtio-fs instance makes changes.

DAX
===
- DAX can be turned on/off when mounting virtio-fs inside the guest.

WHAT WORKS
==========
- As of now, the cache options none, auto and always are working. The shared
  option is still being worked on.

- DAX on/off seems to work. It is not as fast as we were expecting it to be;
  we still need to look into optimization opportunities.

TODO
====
- Complete "cache=shared" implementation.
- Look into improving performance for dax. It seems slow.
- Lot of bug fixing, cleanup and performance improvement. 

RESULTS
=======
- pjdfstests are passing (tried cache=none/auto/always and dax on/off).

  https://github.com/pjd/pjdfstest

  (One symlink test fails and that seems to be due to xfs on the host. Yet to
   look into it.)

- We have run some basic tests and compared with virtio-9p; virtio-fs seems
  to be faster. I ran the "smallfile" utility and a simple fio job to test
  mmap performance.

Test Setup
-----------
- A Fedora 28 host with 32G RAM, 2 sockets (6 cores per socket, 2
  threads per core)

- A PCIe SSD on the host as the backing store.

- A VM with 16 vCPUs, 6GB of memory and a 2GB cache window (for DAX
  mmap).
  
fio mmap
--------
Wrote a simple fio job to do mmapped reads. Ran the test on 1 file and on 4
files with different caching modes. File size is 4G. Caches were dropped in
the guest before each run; the cache on the host was untouched, so the data
on the host was most likely already cached. These results are the average of
3 runs.

		cache mode 	1-file(one thread) 	4-files(4 threads)

virtio-9p	mmap		28 MB/s			140 MB/s
virtio-fs	none + dax	126 MB/s		501 MB/s


virtio-9p	loose	 	31 MB/s			135 MB/s
virtio-fs	always		235 MB/s		858 MB/s
virtio-fs	always + dax	121 MB/s		487 MB/s


smallfile
---------
https://github.com/distributed-system-analysis/smallfile

I ran a bunch of operations like create, ls-l, read, append, rename and
delete-renamed, measured performance over 3 runs and took the average.
Caches were dropped before each operation started running. Effectively, the
following command was used for each operation:

# python smallfile_cli.py --operation create --threads 8 --file-size 1024 --files 2048 --top <test-dir>


		cache mode 	operation	(files/sec) 

virtio-9p	none		create		194
virtio-fs	none		create		714

virtio-9p	mmap		create		201
virtio-fs	none + dax	create		759

virtio-9p	loose		create		16
virtio-fs	always		create          685
virtio-fs	always + dax	create		735

virtio-9p	none		ls-l		2038
virtio-fs	none		ls-l		4615

virtio-9p	mmap		ls-l		2087	
virtio-fs	none + dax	ls-l		4616

virtio-9p	loose		ls-l		1619
virtio-fs	always		ls-l		13571
virtio-fs	always + dax	ls-l		12626

virtio-9p	none		read		199
virtio-fs	none		read		1405

virtio-9p	mmap		read		203	
virtio-fs	none + dax	read		1345

virtio-9p	loose		read		207
virtio-fs	always		read		1436
virtio-fs	always + dax	read		1368

virtio-9p	none		append		197
virtio-fs	none		append		717

virtio-9p	mmap		append		200	
virtio-fs	none + dax	append		645

virtio-9p	loose		append		16	
virtio-fs	always		append		651	
virtio-fs	always + dax	append		704	

virtio-9p	none		rename		2442
virtio-fs	none		rename		5797

virtio-9p	mmap		rename		2518	
virtio-fs	none + dax	rename		6386

virtio-9p	loose		rename		4178
virtio-fs	always		rename		15834
virtio-fs	always + dax	rename		15529

Thanks
Vivek

Dr. David Alan Gilbert (5):
  virtio-fs: Add VIRTIO_PCI_CAP_SHARED_MEMORY_CFG and utility to find
    them
  virito-fs: Make dax optional
  virtio: Free fuse devices on umount
  virtio-fs: Retrieve shm capabilities for version table
  virtio-fs: Map using the values from the capabilities

Miklos Szeredi (8):
  fuse: simplify fuse_fill_super_common() calling
  fuse: delete dentry if timeout is zero
  fuse: multiplex cached/direct_io/dax file operations
  virtio-fs: pass version table pointer to fuse
  fuse: don't crash if version table is NULL
  fuse: add shared version support (virtio-fs only)
  fuse: shared version cleanups
  fuse: fix fuse_permission() for the default_permissions case

Stefan Hajnoczi (17):
  fuse: add skeleton virtio_fs.ko module
  fuse: add probe/remove virtio driver
  fuse: rely on mutex_unlock() barrier instead of fput()
  fuse: extract fuse_fill_super_common()
  virtio_fs: get mount working
  fuse: export fuse_end_request()
  fuse: export fuse_len_args()
  fuse: add fuse_iqueue_ops callbacks
  fuse: process requests queues
  fuse: export fuse_get_unique()
  fuse: implement FUSE_FORGET for virtio-fs
  virtio_fs: Set up dax_device
  dax: remove block device dependencies
  fuse: add fuse_conn->dax_dev field
  fuse: map virtio_fs DAX window BAR
  fuse: Implement basic DAX read/write support commands
  fuse: add DAX mmap support

Vivek Goyal (22):
  virtio-fs: Retrieve shm capabilities for cache
  virtio-fs: Map cache using the values from the capabilities
  Limit number of pages returned by direct_access()
  fuse: Introduce fuse_dax_mapping
  Create a list of free memory ranges
  fuse: Introduce setupmapping/removemapping commands
  Introduce interval tree basic data structures
  fuse: Maintain a list of busy elements
  Do fallocate() to grow file before mapping for file growing writes
  dax: Pass dax_dev to dax_writeback_mapping_range()
  fuse: Define dax address space operations
  fuse, dax: Take ->i_mmap_sem lock during dax page fault
  fuse: Add logic to free up a memory range
  fuse: Add logic to do direct reclaim of memory
  fuse: Kick worker when free memory drops below 20% of total ranges
  Dispatch FORGET requests later instead of dropping them
  Release file in process context
  fuse: Do not block on inode lock while freeing memory range
  fuse: Reschedule dax free work if too many EAGAIN attempts
  fuse: Wait for memory ranges to become free
  fuse: Take inode lock for dax inode truncation
  fuse: Clear setuid bit even in direct I/O path

 drivers/dax/super.c             |    3 +-
 fs/dax.c                        |   23 +-
 fs/ext4/inode.c                 |    2 +-
 fs/fuse/Kconfig                 |   11 +
 fs/fuse/Makefile                |    1 +
 fs/fuse/cuse.c                  |    3 +-
 fs/fuse/dev.c                   |   80 ++-
 fs/fuse/dir.c                   |  282 +++++++--
 fs/fuse/file.c                  | 1012 +++++++++++++++++++++++++++--
 fs/fuse/fuse_i.h                |  234 ++++++-
 fs/fuse/inode.c                 |  278 ++++++--
 fs/fuse/readdir.c               |   12 +-
 fs/fuse/virtio_fs.c             | 1336 +++++++++++++++++++++++++++++++++++++++
 fs/splice.c                     |    3 +-
 fs/xfs/xfs_aops.c               |    2 +-
 include/linux/dax.h             |    6 +-
 include/linux/fs.h              |    2 +
 include/uapi/linux/fuse.h       |   39 ++
 include/uapi/linux/virtio_fs.h  |   46 ++
 include/uapi/linux/virtio_ids.h |    1 +
 include/uapi/linux/virtio_pci.h |   10 +
 21 files changed, 3151 insertions(+), 235 deletions(-)
 create mode 100644 fs/fuse/virtio_fs.c
 create mode 100644 include/uapi/linux/virtio_fs.h

Comments

Stefan Hajnoczi Dec. 11, 2018, 12:54 p.m. UTC | #1
On Mon, Dec 10, 2018 at 12:12:26PM -0500, Vivek Goyal wrote:
> Hi,
> 
> Here are RFC patches for virtio-fs. Looking for feedback on this approach.
> 
> These patches should apply on top of 4.20-rc5. We have also put code for
> various components here.
> 
> https://gitlab.com/virtio-fs

A draft specification for the virtio-fs device is available here:

https://stefanha.github.io/virtio/virtio-fs.html#x1-38800010 (HTML)

https://github.com/stefanha/virtio/commit/e1cac3777ef03bc9c5c8ee91bcc6ba478272e6b6

Stefan
Konrad Rzeszutek Wilk Dec. 12, 2018, 8:30 p.m. UTC | #2
On Mon, Dec 10, 2018 at 12:12:26PM -0500, Vivek Goyal wrote:
> Hi,
> 
> Here are RFC patches for virtio-fs. Looking for feedback on this approach.
> 
> These patches should apply on top of 4.20-rc5. We have also put code for
> various components here.
> 
> https://gitlab.com/virtio-fs
> 
> Problem Description
> ===================
> We want to be able to take a directory tree on the host and share it with
> guest[s]. Our goal is to be able to do it in a fast, consistent and secure
> manner. Our primary use case is kata containers, but it should be usable in
> other scenarios as well.
> 
> Containers may rely on local file system semantics for shared volumes,
> read-write mounts that multiple containers access simultaneously.  File
> system changes must be visible to other containers with the same consistency
> expected of a local file system, including mmap MAP_SHARED.
> 
> Existing Solutions
> ==================
> We looked at existing solutions and virtio-9p already provides basic shared
> file system functionality although does not offer local file system semantics,
> causing some workloads and test suites to fail. In addition, virtio-9p
> performance has been an issue for Kata Containers and we believe this cannot
> be alleviated without major changes that do not fit into the 9P protocol.
> 
> Design Overview
> ===============
> With the goal of designing something with better performance and local file
> system semantics, a bunch of ideas were proposed.
> 
> - Use fuse protocol (instead of 9p) for communication between guest
>   and host. Guest kernel will be fuse client and a fuse server will
>   run on host to serve the requests. Benchmark results (see below) are
>   encouraging and show this approach performs well (2x to 8x improvement
>   depending on test being run).
> 
> - For data access inside guest, mmap portion of file in QEMU address
>   space and guest accesses this memory using dax. That way guest page
>   cache is bypassed and there is only one copy of data (on host). This
>   will also enable mmap(MAP_SHARED) between guests.
> 
> - For metadata coherency, there is a shared memory region which contains
>   version number associated with metadata and any guest changing metadata
>   updates version number and other guests refresh metadata on next
>   access. This is still experimental and implementation is not complete.

What about Windows guests or BSD ones? Is there a plan to make that work with them as well?

What about the Virtio spec? Plans to make changes there as well?
Vivek Goyal Dec. 12, 2018, 9:22 p.m. UTC | #3
On Wed, Dec 12, 2018 at 03:30:49PM -0500, Konrad Rzeszutek Wilk wrote:
> On Mon, Dec 10, 2018 at 12:12:26PM -0500, Vivek Goyal wrote:
> > Hi,
> > 
> > Here are RFC patches for virtio-fs. Looking for feedback on this approach.
> > 
> > These patches should apply on top of 4.20-rc5. We have also put code for
> > various components here.
> > 
> > https://gitlab.com/virtio-fs
> > 
> > Problem Description
> > ===================
> > We want to be able to take a directory tree on the host and share it with
> > guest[s]. Our goal is to be able to do it in a fast, consistent and secure
> > manner. Our primary use case is kata containers, but it should be usable in
> > other scenarios as well.
> > 
> > Containers may rely on local file system semantics for shared volumes,
> > read-write mounts that multiple containers access simultaneously.  File
> > system changes must be visible to other containers with the same consistency
> > expected of a local file system, including mmap MAP_SHARED.
> > 
> > Existing Solutions
> > ==================
> > We looked at existing solutions and virtio-9p already provides basic shared
> > file system functionality although does not offer local file system semantics,
> > causing some workloads and test suites to fail. In addition, virtio-9p
> > performance has been an issue for Kata Containers and we believe this cannot
> > be alleviated without major changes that do not fit into the 9P protocol.
> > 
> > Design Overview
> > ===============
> > With the goal of designing something with better performance and local file
> > system semantics, a bunch of ideas were proposed.
> > 
> > - Use fuse protocol (instead of 9p) for communication between guest
> >   and host. Guest kernel will be fuse client and a fuse server will
> >   run on host to serve the requests. Benchmark results (see below) are
> >   encouraging and show this approach performs well (2x to 8x improvement
> >   depending on test being run).
> > 
> > - For data access inside guest, mmap portion of file in QEMU address
> >   space and guest accesses this memory using dax. That way guest page
> >   cache is bypassed and there is only one copy of data (on host). This
> >   will also enable mmap(MAP_SHARED) between guests.
> > 
> > - For metadata coherency, there is a shared memory region which contains
> >   version number associated with metadata and any guest changing metadata
> >   updates version number and other guests refresh metadata on next
> >   access. This is still experimental and implementation is not complete.
> 
> What about Windows guests or BSD ones? Is there a plan to make that work with them as well?

Hi Konrad,

I have not thought much about making it work on Windows or BSD yet. Does
FUSE work with Windows? I am assuming it does with BSD. As long as FUSE
works, I am assuming that at least the basic mode can be made to work.

> 
> What about the Virtio spec? Plans to make changes there as well?

There are plans to change that. Stefan posted a proposal here.

https://lists.oasis-open.org/archives/virtio-dev/201812/msg00073.html

Thanks
Vivek
Aneesh Kumar K.V Feb. 12, 2019, 3:56 p.m. UTC | #4
Vivek Goyal <vgoyal@redhat.com> writes:

> Hi,
>
> Here are RFC patches for virtio-fs. Looking for feedback on this approach.
>
> These patches should apply on top of 4.20-rc5. We have also put code for
> various components here.
>
> https://gitlab.com/virtio-fs
>
> Problem Description
> ===================
> We want to be able to take a directory tree on the host and share it with
> guest[s]. Our goal is to be able to do it in a fast, consistent and secure
> manner. Our primary use case is kata containers, but it should be usable in
> other scenarios as well.
>
> Containers may rely on local file system semantics for shared volumes,
> read-write mounts that multiple containers access simultaneously.  File
> system changes must be visible to other containers with the same consistency
> expected of a local file system, including mmap MAP_SHARED.
>
> Existing Solutions
> ==================
> We looked at existing solutions and virtio-9p already provides basic shared
> file system functionality although does not offer local file system semantics,
> causing some workloads and test suites to fail.

Can you elaborate on this? Is this with 9p2000.L? We did quite a lot of
work to make sure the POSIX test suite passes on the 9p file system. Also,
was the mount option cache=loose?

-aneesh
Vivek Goyal Feb. 12, 2019, 6:57 p.m. UTC | #5
On Tue, Feb 12, 2019 at 09:26:48PM +0530, Aneesh Kumar K.V wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > Hi,
> >
> > Here are RFC patches for virtio-fs. Looking for feedback on this approach.
> >
> > These patches should apply on top of 4.20-rc5. We have also put code for
> > various components here.
> >
> > https://gitlab.com/virtio-fs
> >
> > Problem Description
> > ===================
> > We want to be able to take a directory tree on the host and share it with
> > guest[s]. Our goal is to be able to do it in a fast, consistent and secure
> > manner. Our primary use case is kata containers, but it should be usable in
> > other scenarios as well.
> >
> > Containers may rely on local file system semantics for shared volumes,
> > read-write mounts that multiple containers access simultaneously.  File
> > system changes must be visible to other containers with the same consistency
> > expected of a local file system, including mmap MAP_SHARED.
> >
> > Existing Solutions
> > ==================
> > We looked at existing solutions and virtio-9p already provides basic shared
> > file system functionality although does not offer local file system semantics,
> > causing some workloads and test suites to fail.
> 
> Can you elaborate on this? Is this with 9p2000.L ? We did quiet a lot of
> work to make sure posix test suite pass on 9p file system. Also 
> was the mount option with cache=loose?

Hi Aneesh,

Yes, this is with 9p2000.L and cache=loose. I used the following mount options:

mount -t 9p -o trans=virtio hostShared /mnt/virtio-9p/ -oversion=9p2000.L,posixacl,cache=loose

We noticed primarily two issues.

- Ran pjdfstests and a lot of them are failing. I think the Kata Containers
  folks also experienced pjdfstest failures. I have never looked into the
  details of why they fail.

- We thought mmap(MAP_SHARED) would not work with virtio-9p when two
  clients running in two different VMs map the same file with MAP_SHARED.

Having said that, the biggest concern with virtio-9p seems to be performance.
We are looking for ways to improve performance with virtio-fs. We are hoping
DAX can provide faster data access, and the FUSE protocol itself seems to be
faster (in preliminary testing results).

Thanks
Vivek