Message ID | 20190903114203.8278-10-mszeredi@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | virtio-fs: shared file system for virtual machines | expand |
On Tue, Sep 03, 2019 at 01:42:02PM +0200, Miklos Szeredi wrote: > From: Stefan Hajnoczi <stefanha@redhat.com> > > Add a basic file system module for virtio-fs. This does not yet contain > shared data support between host and guest or metadata coherency speedups. > However it is already significantly faster than virtio-9p. > > Design Overview > =============== > > With the goal of designing something with better performance and local file > system semantics, a bunch of ideas were proposed. > > - Use fuse protocol (instead of 9p) for communication between guest and > host. Guest kernel will be fuse client and a fuse server will run on > host to serve the requests. > > - For data access inside guest, mmap portion of file in QEMU address space > and guest accesses this memory using dax. That way guest page cache is > bypassed and there is only one copy of data (on host). This will also > enable mmap(MAP_SHARED) between guests. > > - For metadata coherency, there is a shared memory region which contains > version number associated with metadata and any guest changing metadata > updates version number and other guests refresh metadata on next access. > This is yet to be implemented. > > How virtio-fs differs from existing approaches > ============================================== > > The unique idea behind virtio-fs is to take advantage of the co-location of > the virtual machine and hypervisor to avoid communication (vmexits). > > DAX allows file contents to be accessed without communication with the > hypervisor. The shared memory region for metadata avoids communication in > the common case where metadata is unchanged. > > By replacing expensive communication with cheaper shared memory accesses, > we expect to achieve better performance than approaches based on network > file system protocols. In addition, this also makes it easier to achieve > local file system semantics (coherency). > > These techniques are not applicable to network file system protocols since > the communications channel is bypassed by taking advantage of shared memory > on a local machine. This is why we decided to build virtio-fs rather than > focus on 9P or NFS. > > Caching Modes > ============= > > Like virtio-9p, different caching modes are supported which determine the > coherency level as well. The “cache=FOO” and “writeback” options control > the level of coherence between the guest and host filesystems. > > - cache=none > metadata, data and pathname lookup are not cached in guest. They are > always fetched from host and any changes are immediately pushed to host. > > - cache=always > metadata, data and pathname lookup are cached in guest and never expire. > > - cache=auto > metadata and pathname lookup cache expires after a configured amount of > time (default is 1 second). Data is cached while the file is open > (close to open consistency). > > - writeback/no_writeback > These options control the writeback strategy. If writeback is disabled, > then normal writes will immediately be synchronized with the host fs. > If writeback is enabled, then writes may be cached in the guest until > the file is closed or an fsync(2) performed. This option has no effect > on mmap-ed writes or writes going through the DAX mechanism. > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com> > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> What's with all of the TODOs? Some of these are really scary, looks like they need to be figured out before this is merged. Endian-ness for fuse header also looks wrong. > --- > fs/fuse/Kconfig | 11 + > fs/fuse/Makefile | 1 + > fs/fuse/fuse_i.h | 5 + > fs/fuse/virtio_fs.c | 1072 +++++++++++++++++++++++++++++++ > include/uapi/linux/virtio_fs.h | 19 + > include/uapi/linux/virtio_ids.h | 1 + > 6 files changed, 1109 insertions(+) > create mode 100644 fs/fuse/virtio_fs.c > create mode 100644 include/uapi/linux/virtio_fs.h > > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig > index 24fc5a5c1b97..0635cba19971 100644 > --- a/fs/fuse/Kconfig > +++ b/fs/fuse/Kconfig > @@ -27,3 +27,14 @@ config CUSE > > If you want to develop or use a userspace character device > based on CUSE, answer Y or M. > + > +config VIRTIO_FS > + tristate "Virtio Filesystem" > + depends on FUSE_FS > + select VIRTIO > + help > + The Virtio Filesystem allows guests to mount file systems from the > + host. > + > + If you want to share files between guests or with the host, answer Y > + or M. > diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile > index 9485019c2a14..6419a2b3510d 100644 > --- a/fs/fuse/Makefile > +++ b/fs/fuse/Makefile > @@ -5,5 +5,6 @@ > > obj-$(CONFIG_FUSE_FS) += fuse.o > obj-$(CONFIG_CUSE) += cuse.o > +obj-$(CONFIG_VIRTIO_FS) += virtio_fs.o > > fuse-objs := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > index dbf73e5d5b38..85e2dcad68c1 100644 > --- a/fs/fuse/fuse_i.h > +++ b/fs/fuse/fuse_i.h > @@ -444,6 +444,11 @@ struct fuse_req { > > /** Request is stolen from fuse_file->reserved_req */ > struct file *stolen_file; > + > +#if IS_ENABLED(CONFIG_VIRTIO_FS) > + /** virtio-fs's physically contiguous buffer for in and out args */ > + void *argbuf; > +#endif > }; > > struct fuse_iqueue; > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c > new file mode 100644 > index 000000000000..197e79e536f9 > --- /dev/null > +++ b/fs/fuse/virtio_fs.c > @@ -0,0 +1,1072 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * virtio-fs: Virtio Filesystem > + * Copyright (C) 2018 Red Hat, Inc. > + */ > + > +#include <linux/fs.h> > +#include <linux/module.h> > +#include <linux/virtio.h> > +#include <linux/virtio_fs.h> > +#include <linux/delay.h> > +#include <linux/fs_context.h> > +#include <linux/highmem.h> > +#include "fuse_i.h" > + > +/* List of virtio-fs device instances and a lock for the list */ > +static DEFINE_MUTEX(virtio_fs_mutex); > +static LIST_HEAD(virtio_fs_instances); > + > +enum { > + VQ_HIPRIO, > + VQ_REQUEST > +}; > + > +/* Per-virtqueue state */ > +struct virtio_fs_vq { > + spinlock_t lock; > + struct virtqueue *vq; /* protected by ->lock */ > + struct work_struct done_work; > + struct list_head queued_reqs; > + struct delayed_work dispatch_work; > + struct fuse_dev *fud; > + bool connected; > + long in_flight; > + char name[24]; I'd keep names somewhere separate as they are not used on data path. > +} ____cacheline_aligned_in_smp; > + > +/* A virtio-fs device instance */ > +struct virtio_fs { > + struct list_head list; /* on virtio_fs_instances */ > + char *tag; > + struct virtio_fs_vq *vqs; > + unsigned int nvqs; /* number of virtqueues */ > + unsigned int num_queues; /* number of request queues */ > +}; > + > +struct virtio_fs_forget { > + struct fuse_in_header ih; > + struct fuse_forget_in arg; These structures are all native endian. Passing them to host will make cross-endian setups painful to support, and hardware implementations impossible. How about converting everything to LE? > + /* This request can be temporarily queued on virt queue */ > + struct list_head list; > +}; > + > +static inline struct virtio_fs_vq *vq_to_fsvq(struct virtqueue *vq) > +{ > + struct virtio_fs *fs = vq->vdev->priv; > + > + return &fs->vqs[vq->index]; > +} > + > +static inline struct fuse_pqueue *vq_to_fpq(struct virtqueue *vq) > +{ > + return &vq_to_fsvq(vq)->fud->pq; > +} > + > +/* Add a new instance to the list or return -EEXIST if tag name exists*/ > +static int virtio_fs_add_instance(struct virtio_fs *fs) > +{ > + struct virtio_fs *fs2; > + bool duplicate = false; > + > + mutex_lock(&virtio_fs_mutex); > + > + list_for_each_entry(fs2, &virtio_fs_instances, list) { > + if (strcmp(fs->tag, fs2->tag) == 0) > + duplicate = true; > + } > + > + if (!duplicate) > + list_add_tail(&fs->list, &virtio_fs_instances); This is O(N^2) as it's presumably called for each istance. Doesn't scale - please switch to a tree or such. > + > + mutex_unlock(&virtio_fs_mutex); > + > + if (duplicate) > + return -EEXIST; > + return 0; > +} > + > +/* Return the virtio_fs with a given tag, or NULL */ > +static struct virtio_fs *virtio_fs_find_instance(const char *tag) > +{ > + struct virtio_fs *fs; > + > + mutex_lock(&virtio_fs_mutex); > + > + list_for_each_entry(fs, &virtio_fs_instances, list) { > + if (strcmp(fs->tag, tag) == 0) > + goto found; > + } > + > + fs = NULL; /* not found */ > + > +found: > + mutex_unlock(&virtio_fs_mutex); > + > + return fs; > +} > + > +static void virtio_fs_free_devs(struct virtio_fs *fs) > +{ > + unsigned int i; > + > + /* TODO lock */ Doesn't inspire confidence, does it? > + > + for (i = 0; i < fs->nvqs; i++) { > + struct virtio_fs_vq *fsvq = &fs->vqs[i]; > + > + if (!fsvq->fud) > + continue; > + > + flush_work(&fsvq->done_work); > + flush_delayed_work(&fsvq->dispatch_work); > + > + /* TODO need to quiesce/end_requests/decrement dev_count */ Indeed. Won't this crash if we don't? > + fuse_dev_free(fsvq->fud); > + fsvq->fud = NULL; > + } > +} > + > +/* Read filesystem name from virtio config into fs->tag (must kfree()). */ > +static int virtio_fs_read_tag(struct virtio_device *vdev, struct virtio_fs *fs) > +{ > + char tag_buf[sizeof_field(struct virtio_fs_config, tag)]; > + char *end; > + size_t len; > + > + virtio_cread_bytes(vdev, offsetof(struct virtio_fs_config, tag), > + &tag_buf, sizeof(tag_buf)); > + end = memchr(tag_buf, '\0', sizeof(tag_buf)); > + if (end == tag_buf) > + return -EINVAL; /* empty tag */ > + if (!end) > + end = &tag_buf[sizeof(tag_buf)]; > + > + len = end - tag_buf; > + fs->tag = devm_kmalloc(&vdev->dev, len + 1, GFP_KERNEL); > + if (!fs->tag) > + return -ENOMEM; > + memcpy(fs->tag, tag_buf, len); > + fs->tag[len] = '\0'; > + return 0; > +} > + > +/* Work function for hiprio completion */ > +static void virtio_fs_hiprio_done_work(struct work_struct *work) > +{ > + struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq, > + done_work); > + struct virtqueue *vq = fsvq->vq; > + > + /* Free completed FUSE_FORGET requests */ > + spin_lock(&fsvq->lock); > + do { > + unsigned int len; > + void *req; > + > + virtqueue_disable_cb(vq); > + > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) { > + kfree(req); > + fsvq->in_flight--; > + } > + } while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq))); > + spin_unlock(&fsvq->lock); > +} > + > +static void virtio_fs_dummy_dispatch_work(struct work_struct *work) > +{ > +} > + > +static void virtio_fs_hiprio_dispatch_work(struct work_struct *work) > +{ > + struct virtio_fs_forget *forget; > + struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq, > + dispatch_work.work); > + struct virtqueue *vq = fsvq->vq; > + struct scatterlist sg; > + struct scatterlist *sgs[] = {&sg}; > + bool notify; > + int ret; > + > + pr_debug("virtio-fs: worker %s called.\n", __func__); > + while (1) { > + spin_lock(&fsvq->lock); > + forget = list_first_entry_or_null(&fsvq->queued_reqs, > + struct virtio_fs_forget, list); > + if (!forget) { > + spin_unlock(&fsvq->lock); > + return; > + } > + > + list_del(&forget->list); > + if (!fsvq->connected) { > + spin_unlock(&fsvq->lock); > + kfree(forget); > + continue; > + } > + > + sg_init_one(&sg, forget, sizeof(*forget)); This passes to host a structure including "struct list_head list"; Not a good idea. > + > + /* Enqueue the request */ > + dev_dbg(&vq->vdev->dev, "%s\n", __func__); > + ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC); This is easier as add_outbuf. Also - why GFP_ATOMIC? > + if (ret < 0) { > + if (ret == -ENOMEM || ret == -ENOSPC) { > + pr_debug("virtio-fs: Could not queue FORGET: err=%d. Will try later\n", > + ret); > + list_add_tail(&forget->list, > + &fsvq->queued_reqs); > + schedule_delayed_work(&fsvq->dispatch_work, > + msecs_to_jiffies(1)); Can't we we requeue after some buffers get consumed? > + } else { > + pr_debug("virtio-fs: Could not queue FORGET: err=%d. Dropping it.\n", > + ret); > + kfree(forget); > + } > + spin_unlock(&fsvq->lock); > + return; > + } > + > + fsvq->in_flight++; > + notify = virtqueue_kick_prepare(vq); > + spin_unlock(&fsvq->lock); > + > + if (notify) > + virtqueue_notify(vq); > + pr_debug("virtio-fs: worker %s dispatched one forget request.\n", > + __func__); > + } > +} > + > +/* Allocate and copy args into req->argbuf */ > +static int copy_args_to_argbuf(struct fuse_req *req) > +{ > + unsigned int offset = 0; > + unsigned int num_in; > + unsigned int num_out; > + unsigned int len; > + unsigned int i; > + > + num_in = req->in.numargs - req->in.argpages; > + num_out = req->out.numargs - req->out.argpages; > + len = fuse_len_args(num_in, (struct fuse_arg *)req->in.args) + > + fuse_len_args(num_out, req->out.args); > + > + req->argbuf = kmalloc(len, GFP_ATOMIC); > + if (!req->argbuf) > + return -ENOMEM; > + > + for (i = 0; i < num_in; i++) { > + memcpy(req->argbuf + offset, > + req->in.args[i].value, > + req->in.args[i].size); > + offset += req->in.args[i].size; > + } > + > + return 0; > +} > + > +/* Copy args out of and free req->argbuf */ > +static void copy_args_from_argbuf(struct fuse_req *req) > +{ > + unsigned int remaining; > + unsigned int offset; > + unsigned int num_in; > + unsigned int num_out; > + unsigned int i; > + > + remaining = req->out.h.len - sizeof(req->out.h); > + num_in = req->in.numargs - req->in.argpages; > + num_out = req->out.numargs - req->out.argpages; > + offset = fuse_len_args(num_in, (struct fuse_arg *)req->in.args); > + > + for (i = 0; i < num_out; i++) { > + unsigned int argsize = req->out.args[i].size; > + > + if (req->out.argvar && > + i == req->out.numargs - 1 && > + argsize > remaining) { > + argsize = remaining; > + } > + > + memcpy(req->out.args[i].value, req->argbuf + offset, argsize); > + offset += argsize; > + > + if (i != req->out.numargs - 1) > + remaining -= argsize; > + } > + > + /* Store the actual size of the variable-length arg */ > + if (req->out.argvar) > + req->out.args[req->out.numargs - 1].size = remaining; > + > + kfree(req->argbuf); > + req->argbuf = NULL; > +} > + > +/* Work function for request completion */ > +static void virtio_fs_requests_done_work(struct work_struct *work) > +{ > + struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq, > + done_work); > + struct fuse_pqueue *fpq = &fsvq->fud->pq; > + struct fuse_conn *fc = fsvq->fud->fc; > + struct virtqueue *vq = fsvq->vq; > + struct fuse_req *req; > + struct fuse_req *next; > + unsigned int len, i, thislen; > + struct page *page; > + LIST_HEAD(reqs); > + > + /* Collect completed requests off the virtqueue */ > + spin_lock(&fsvq->lock); > + do { > + virtqueue_disable_cb(vq); > + > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) { > + spin_lock(&fpq->lock); > + list_move_tail(&req->list, &reqs); > + spin_unlock(&fpq->lock); > + } > + } while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq))); > + spin_unlock(&fsvq->lock); > + > + /* End requests */ > + list_for_each_entry_safe(req, next, &reqs, list) { > + /* TODO check unique */ > + /* TODO fuse_len_args(out) against oh.len */ > + > + copy_args_from_argbuf(req); > + > + if (req->out.argpages && req->out.page_zeroing) { > + len = req->out.args[req->out.numargs - 1].size; > + for (i = 0; i < req->num_pages; i++) { > + thislen = req->page_descs[i].length; > + if (len < thislen) { > + WARN_ON(req->page_descs[i].offset); > + page = req->pages[i]; > + zero_user_segment(page, len, thislen); > + len = 0; > + } else { > + len -= thislen; > + } > + } > + } > + > + spin_lock(&fpq->lock); > + clear_bit(FR_SENT, &req->flags); > + list_del_init(&req->list); > + spin_unlock(&fpq->lock); > + > + fuse_request_end(fc, req); > + } > +} > + > +/* Virtqueue interrupt handler */ > +static void virtio_fs_vq_done(struct virtqueue *vq) > +{ > + struct virtio_fs_vq *fsvq = vq_to_fsvq(vq); > + > + dev_dbg(&vq->vdev->dev, "%s %s\n", __func__, fsvq->name); > + > + schedule_work(&fsvq->done_work); > +} > + > +/* Initialize virtqueues */ > +static int virtio_fs_setup_vqs(struct virtio_device *vdev, > + struct virtio_fs *fs) > +{ > + struct virtqueue **vqs; > + vq_callback_t **callbacks; > + const char **names; > + unsigned int i; > + int ret; > + > + virtio_cread(vdev, struct virtio_fs_config, num_queues, > + &fs->num_queues); > + if (fs->num_queues == 0) > + return -EINVAL; > + > + fs->nvqs = 1 + fs->num_queues; > + > + fs->vqs = devm_kcalloc(&vdev->dev, fs->nvqs, > + sizeof(fs->vqs[VQ_HIPRIO]), GFP_KERNEL); > + if (!fs->vqs) > + return -ENOMEM; > + > + vqs = kmalloc_array(fs->nvqs, sizeof(vqs[VQ_HIPRIO]), GFP_KERNEL); > + callbacks = kmalloc_array(fs->nvqs, sizeof(callbacks[VQ_HIPRIO]), > + GFP_KERNEL); > + names = kmalloc_array(fs->nvqs, sizeof(names[VQ_HIPRIO]), GFP_KERNEL); > + if (!vqs || !callbacks || !names) { > + ret = -ENOMEM; > + goto out; > + } > + > + callbacks[VQ_HIPRIO] = virtio_fs_vq_done; > + snprintf(fs->vqs[VQ_HIPRIO].name, sizeof(fs->vqs[VQ_HIPRIO].name), > + "hiprio"); > + names[VQ_HIPRIO] = fs->vqs[VQ_HIPRIO].name; > + INIT_WORK(&fs->vqs[VQ_HIPRIO].done_work, virtio_fs_hiprio_done_work); > + INIT_LIST_HEAD(&fs->vqs[VQ_HIPRIO].queued_reqs); > + INIT_DELAYED_WORK(&fs->vqs[VQ_HIPRIO].dispatch_work, > + virtio_fs_hiprio_dispatch_work); > + spin_lock_init(&fs->vqs[VQ_HIPRIO].lock); > + > + /* Initialize the requests virtqueues */ > + for (i = VQ_REQUEST; i < fs->nvqs; i++) { > + spin_lock_init(&fs->vqs[i].lock); > + INIT_WORK(&fs->vqs[i].done_work, virtio_fs_requests_done_work); > + INIT_DELAYED_WORK(&fs->vqs[i].dispatch_work, > + virtio_fs_dummy_dispatch_work); > + INIT_LIST_HEAD(&fs->vqs[i].queued_reqs); > + snprintf(fs->vqs[i].name, sizeof(fs->vqs[i].name), > + "requests.%u", i - VQ_REQUEST); > + callbacks[i] = virtio_fs_vq_done; > + names[i] = fs->vqs[i].name; > + } > + > + ret = virtio_find_vqs(vdev, fs->nvqs, vqs, callbacks, names, NULL); > + if (ret < 0) > + goto out; > + > + for (i = 0; i < fs->nvqs; i++) { > + fs->vqs[i].vq = vqs[i]; > + fs->vqs[i].connected = true; > + } > +out: > + kfree(names); > + kfree(callbacks); > + kfree(vqs); > + return ret; > +} > + > +/* Free virtqueues (device must already be reset) */ > +static void virtio_fs_cleanup_vqs(struct virtio_device *vdev, > + struct virtio_fs *fs) > +{ > + vdev->config->del_vqs(vdev); > +} > + > +static int virtio_fs_probe(struct virtio_device *vdev) > +{ > + struct virtio_fs *fs; > + int ret; > + > + fs = devm_kzalloc(&vdev->dev, sizeof(*fs), GFP_KERNEL); > + if (!fs) > + return -ENOMEM; > + vdev->priv = fs; > + > + ret = virtio_fs_read_tag(vdev, fs); > + if (ret < 0) > + goto out; > + > + ret = virtio_fs_setup_vqs(vdev, fs); > + if (ret < 0) > + goto out; > + > + /* TODO vq affinity */ > + /* TODO populate notifications vq */ what's notifications vq? > + > + /* Bring the device online in case the filesystem is mounted and > + * requests need to be sent before we return. > + */ > + virtio_device_ready(vdev); > + > + ret = virtio_fs_add_instance(fs); > + if (ret < 0) > + goto out_vqs; > + > + return 0; > + > +out_vqs: > + vdev->config->reset(vdev); > + virtio_fs_cleanup_vqs(vdev, fs); > + > +out: > + vdev->priv = NULL; > + return ret; > +} > + > +static void virtio_fs_remove(struct virtio_device *vdev) > +{ > + struct virtio_fs *fs = vdev->priv; > + > + virtio_fs_free_devs(fs); > + > + vdev->config->reset(vdev); > + virtio_fs_cleanup_vqs(vdev, fs); > + > + mutex_lock(&virtio_fs_mutex); > + list_del(&fs->list); > + mutex_unlock(&virtio_fs_mutex); > + > + vdev->priv = NULL; > +} > + > +#ifdef CONFIG_PM_SLEEP > +static int virtio_fs_freeze(struct virtio_device *vdev) > +{ > + return 0; /* TODO */ > +} > + > +static int virtio_fs_restore(struct virtio_device *vdev) > +{ > + return 0; /* TODO */ > +} Is this really a good idea? I'd rather it was implemented, but if not possible at all disabling PM seems better than just keep going. > +#endif /* CONFIG_PM_SLEEP */ > + > +const static struct virtio_device_id id_table[] = { > + { VIRTIO_ID_FS, VIRTIO_DEV_ANY_ID }, > + {}, > +}; > + > +const static unsigned int feature_table[] = {}; > + > +static struct virtio_driver virtio_fs_driver = { > + .driver.name = KBUILD_MODNAME, > + .driver.owner = THIS_MODULE, > + .id_table = id_table, > + .feature_table = feature_table, > + .feature_table_size = ARRAY_SIZE(feature_table), > + /* TODO validate config_get != NULL */ Why? > + .probe = virtio_fs_probe, > + .remove = virtio_fs_remove, > +#ifdef CONFIG_PM_SLEEP > + .freeze = virtio_fs_freeze, > + .restore = virtio_fs_restore, > +#endif > +}; > + > +static void virtio_fs_wake_forget_and_unlock(struct fuse_iqueue *fiq) > +__releases(fiq->waitq.lock) > +{ > + struct fuse_forget_link *link; > + struct virtio_fs_forget *forget; > + struct scatterlist sg; > + struct scatterlist *sgs[] = {&sg}; > + struct virtio_fs *fs; > + struct virtqueue *vq; > + struct virtio_fs_vq *fsvq; > + bool notify; > + u64 unique; > + int ret; > + > + link = fuse_dequeue_forget(fiq, 1, NULL); > + unique = fuse_get_unique(fiq); > + > + fs = fiq->priv; > + fsvq = &fs->vqs[VQ_HIPRIO]; > + spin_unlock(&fiq->waitq.lock); > + > + /* Allocate a buffer for the request */ > + forget = kmalloc(sizeof(*forget), GFP_NOFS | __GFP_NOFAIL); > + > + forget->ih = (struct fuse_in_header){ > + .opcode = FUSE_FORGET, > + .nodeid = link->forget_one.nodeid, > + .unique = unique, > + .len = sizeof(*forget), > + }; > + forget->arg = (struct fuse_forget_in){ > + .nlookup = link->forget_one.nlookup, > + }; > + > + sg_init_one(&sg, forget, sizeof(*forget)); > + > + /* Enqueue the request */ > + vq = fsvq->vq; > + dev_dbg(&vq->vdev->dev, "%s\n", __func__); > + spin_lock(&fsvq->lock); > + > + ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC); > + if (ret < 0) { > + if (ret == -ENOMEM || ret == -ENOSPC) { > + pr_debug("virtio-fs: Could not queue FORGET: err=%d. Will try later.\n", > + ret); > + list_add_tail(&forget->list, &fsvq->queued_reqs); > + schedule_delayed_work(&fsvq->dispatch_work, > + msecs_to_jiffies(1)); > + } else { > + pr_debug("virtio-fs: Could not queue FORGET: err=%d. Dropping it.\n", > + ret); > + kfree(forget); > + } > + spin_unlock(&fsvq->lock); > + goto out; > + } code duplicated from virtio_fs_hiprio_dispatch_work > + > + fsvq->in_flight++; > + notify = virtqueue_kick_prepare(vq); > + > + spin_unlock(&fsvq->lock); > + > + if (notify) > + virtqueue_notify(vq); > +out: > + kfree(link); > +} > + > +static void virtio_fs_wake_interrupt_and_unlock(struct fuse_iqueue *fiq) > +__releases(fiq->waitq.lock) > +{ > + /* TODO */ what's needed? > + spin_unlock(&fiq->waitq.lock); > +} > + > +/* Return the number of scatter-gather list elements required */ > +static unsigned int sg_count_fuse_req(struct fuse_req *req) > +{ > + unsigned int total_sgs = 1 /* fuse_in_header */; > + > + if (req->in.numargs - req->in.argpages) > + total_sgs += 1; > + > + if (req->in.argpages) > + total_sgs += req->num_pages; > + > + if (!test_bit(FR_ISREPLY, &req->flags)) > + return total_sgs; > + > + total_sgs += 1 /* fuse_out_header */; > + > + if (req->out.numargs - req->out.argpages) > + total_sgs += 1; > + > + if (req->out.argpages) > + total_sgs += req->num_pages; > + > + return total_sgs; > +} > + > +/* Add pages to scatter-gather list and return number of elements used */ > +static unsigned int sg_init_fuse_pages(struct scatterlist *sg, > + struct page **pages, > + struct fuse_page_desc *page_descs, > + unsigned int num_pages, > + unsigned int total_len) > +{ > + unsigned int i; > + unsigned int this_len; > + > + for (i = 0; i < num_pages && total_len; i++) { > + sg_init_table(&sg[i], 1); > + this_len = min(page_descs[i].length, total_len); > + sg_set_page(&sg[i], pages[i], this_len, page_descs[i].offset); > + total_len -= this_len; > + } > + > + return i; > +} > + > +/* Add args to scatter-gather list and return number of elements used */ > +static unsigned int sg_init_fuse_args(struct scatterlist *sg, > + struct fuse_req *req, > + struct fuse_arg *args, > + unsigned int numargs, > + bool argpages, > + void *argbuf, > + unsigned int *len_used) > +{ > + unsigned int total_sgs = 0; > + unsigned int len; > + > + len = fuse_len_args(numargs - argpages, args); > + if (len) > + sg_init_one(&sg[total_sgs++], argbuf, len); > + > + if (argpages) > + total_sgs += sg_init_fuse_pages(&sg[total_sgs], > + req->pages, > + req->page_descs, > + req->num_pages, > + args[numargs - 1].size); > + > + if (len_used) > + *len_used = len; > + > + return total_sgs; > +} > + > +/* Add a request to a virtqueue and kick the device */ > +static int virtio_fs_enqueue_req(struct virtqueue *vq, struct fuse_req *req) > +{ > + /* requests need at least 4 elements */ > + struct scatterlist *stack_sgs[6]; > + struct scatterlist stack_sg[ARRAY_SIZE(stack_sgs)]; > + struct scatterlist **sgs = stack_sgs; > + struct scatterlist *sg = stack_sg; > + struct virtio_fs_vq *fsvq; > + unsigned int argbuf_used = 0; > + unsigned int out_sgs = 0; > + unsigned int in_sgs = 0; > + unsigned int total_sgs; > + unsigned int i; > + int ret; > + bool notify; > + > + /* Does the sglist fit on the stack? */ > + total_sgs = sg_count_fuse_req(req); > + if (total_sgs > ARRAY_SIZE(stack_sgs)) { > + sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC); > + sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC); > + if (!sgs || !sg) { > + ret = -ENOMEM; > + goto out; > + } > + } > + > + /* Use a bounce buffer since stack args cannot be mapped */ > + ret = copy_args_to_argbuf(req); > + if (ret < 0) > + goto out; > + > + /* Request elements */ > + sg_init_one(&sg[out_sgs++], &req->in.h, sizeof(req->in.h)); > + out_sgs += sg_init_fuse_args(&sg[out_sgs], req, > + (struct fuse_arg *)req->in.args, > + req->in.numargs, req->in.argpages, > + req->argbuf, &argbuf_used); > + > + /* Reply elements */ > + if (test_bit(FR_ISREPLY, &req->flags)) { > + sg_init_one(&sg[out_sgs + in_sgs++], > + &req->out.h, sizeof(req->out.h)); > + in_sgs += sg_init_fuse_args(&sg[out_sgs + in_sgs], req, > + req->out.args, req->out.numargs, > + req->out.argpages, > + req->argbuf + argbuf_used, NULL); > + } > + > + WARN_ON(out_sgs + in_sgs != total_sgs); > + > + for (i = 0; i < total_sgs; i++) > + sgs[i] = &sg[i]; > + > + fsvq = vq_to_fsvq(vq); > + spin_lock(&fsvq->lock); > + > + ret = virtqueue_add_sgs(vq, sgs, out_sgs, in_sgs, req, GFP_ATOMIC); > + if (ret < 0) { > + /* TODO handle full virtqueue */ Indeed. What prevents this? > + spin_unlock(&fsvq->lock); > + goto out; > + } > + > + notify = virtqueue_kick_prepare(vq); > + > + spin_unlock(&fsvq->lock); > + > + if (notify) > + virtqueue_notify(vq); > + > +out: > + if (ret < 0 && req->argbuf) { > + kfree(req->argbuf); > + req->argbuf = NULL; > + } > + if (sgs != stack_sgs) { > + kfree(sgs); > + kfree(sg); > + } > + > + return ret; > +} > + > +static void virtio_fs_wake_pending_and_unlock(struct fuse_iqueue *fiq) > +__releases(fiq->waitq.lock) > +{ > + unsigned int queue_id = VQ_REQUEST; /* TODO multiqueue */ > + struct virtio_fs *fs; > + struct fuse_conn *fc; > + struct fuse_req *req; > + struct fuse_pqueue *fpq; > + int ret; > + > + WARN_ON(list_empty(&fiq->pending)); > + req = list_last_entry(&fiq->pending, struct fuse_req, list); > + clear_bit(FR_PENDING, &req->flags); > + list_del_init(&req->list); > + WARN_ON(!list_empty(&fiq->pending)); > + spin_unlock(&fiq->waitq.lock); > + > + fs = fiq->priv; > + fc = fs->vqs[queue_id].fud->fc; > + > + dev_dbg(&fs->vqs[queue_id].vq->vdev->dev, > + "%s: opcode %u unique %#llx nodeid %#llx in.len %u out.len %u\n", > + __func__, req->in.h.opcode, req->in.h.unique, req->in.h.nodeid, > + req->in.h.len, fuse_len_args(req->out.numargs, req->out.args)); > + > + fpq = &fs->vqs[queue_id].fud->pq; > + spin_lock(&fpq->lock); > + if (!fpq->connected) { > + spin_unlock(&fpq->lock); > + req->out.h.error = -ENODEV; > + pr_err("virtio-fs: %s disconnected\n", __func__); > + fuse_request_end(fc, req); > + return; > + } > + list_add_tail(&req->list, fpq->processing); > + spin_unlock(&fpq->lock); > + set_bit(FR_SENT, &req->flags); > + /* matches barrier in request_wait_answer() */ > + smp_mb__after_atomic(); > + /* TODO check for FR_INTERRUPTED? */ ? > + > +retry: > + ret = virtio_fs_enqueue_req(fs->vqs[queue_id].vq, req); > + if (ret < 0) { > + if (ret == -ENOMEM || ret == -ENOSPC) { > + /* Virtqueue full. Retry submission */ > + usleep_range(20, 30); again, why not respond to completion? > + goto retry; > + } > + req->out.h.error = ret; > + pr_err("virtio-fs: virtio_fs_enqueue_req() failed %d\n", ret); > + fuse_request_end(fc, req); > + return; > + } > +} > + > +static void virtio_fs_flush_hiprio_queue(struct virtio_fs_vq *fsvq) > +{ > + struct virtio_fs_forget *forget; > + > + WARN_ON(fsvq->in_flight < 0); > + > + /* Go through pending forget requests and free them */ > + spin_lock(&fsvq->lock); > + while (1) { > + forget = list_first_entry_or_null(&fsvq->queued_reqs, > + struct virtio_fs_forget, list); > + if (!forget) > + break; > + list_del(&forget->list); > + kfree(forget); > + } > + > + spin_unlock(&fsvq->lock); > + > + /* Wait for in flight requests to finish.*/ > + while (1) { > + spin_lock(&fsvq->lock); > + if (!fsvq->in_flight) { > + spin_unlock(&fsvq->lock); > + break; > + } > + spin_unlock(&fsvq->lock); > + usleep_range(1000, 2000); Same thing here. Can we use e.g. a completion and not usleep? > + } > +} > + > +const static struct fuse_iqueue_ops virtio_fs_fiq_ops = { > + .wake_forget_and_unlock = virtio_fs_wake_forget_and_unlock, > + .wake_interrupt_and_unlock = virtio_fs_wake_interrupt_and_unlock, > + .wake_pending_and_unlock = virtio_fs_wake_pending_and_unlock, > +}; > + > +static int virtio_fs_fill_super(struct super_block *sb) > +{ > + struct fuse_conn *fc = get_fuse_conn_super(sb); > + struct virtio_fs *fs = fc->iq.priv; > + unsigned int i; > + int err; > + struct fuse_req *init_req; > + struct fuse_fs_context ctx = { > + .rootmode = S_IFDIR, > + .default_permissions = 1, > + .allow_other = 1, > + .max_read = UINT_MAX, > + .blksize = 512, > + .destroy = true, > + .no_control = true, > + .no_force_umount = true, > + }; > + > + /* TODO lock */ what does this refer to? > + if (fs->vqs[VQ_REQUEST].fud) { > + pr_err("virtio-fs: device already in use\n"); > + err = -EBUSY; > + goto err; > + } > + > + err = -ENOMEM; > + /* Allocate fuse_dev for hiprio and notification queues */ > + for (i = 0; i < VQ_REQUEST; i++) { > + struct virtio_fs_vq *fsvq = &fs->vqs[i]; > + > + fsvq->fud = fuse_dev_alloc(); > + if (!fsvq->fud) > + goto err_free_fuse_devs; > + } > + > + init_req = fuse_request_alloc(0); > + if (!init_req) > + goto err_free_fuse_devs; > + __set_bit(FR_BACKGROUND, &init_req->flags); > + > + ctx.fudptr = (void **)&fs->vqs[VQ_REQUEST].fud; > + err = fuse_fill_super_common(sb, &ctx); > + if (err < 0) > + goto err_free_init_req; > + > + fc = fs->vqs[VQ_REQUEST].fud->fc; > + > + /* TODO take fuse_mutex around this loop? */ Don't we need to figure this kind of thing out before this code lands upstream? > + for (i = 0; i < fs->nvqs; i++) { > + struct virtio_fs_vq *fsvq = &fs->vqs[i]; > + > + if (i == VQ_REQUEST) > + continue; /* already initialized */ > + fuse_dev_install(fsvq->fud, fc); > + atomic_inc(&fc->dev_count); > + } > + > + fuse_send_init(fc, init_req); > + return 0; > + > +err_free_init_req: > + fuse_request_free(init_req); > +err_free_fuse_devs: > + for (i = 0; i < fs->nvqs; i++) > + fuse_dev_free(fs->vqs[i].fud); > +err: > + return err; > +} > + > +static void virtio_kill_sb(struct super_block *sb) > +{ > + struct fuse_conn *fc = get_fuse_conn_super(sb); > + struct virtio_fs *vfs; > + struct virtio_fs_vq *fsvq; > + > + /* If mount failed, we can still be called without any fc */ > + if (!fc) > + return fuse_kill_sb_anon(sb); > + > + vfs = fc->iq.priv; > + fsvq = &vfs->vqs[VQ_HIPRIO]; > + > + /* Stop forget queue. Soon destroy will be sent */ > + spin_lock(&fsvq->lock); > + fsvq->connected = false; > + spin_unlock(&fsvq->lock); > + virtio_fs_flush_hiprio_queue(fsvq); > + > + fuse_kill_sb_anon(sb); > + virtio_fs_free_devs(vfs); > +} > + > +static int virtio_fs_test_super(struct super_block *sb, > + struct fs_context *fsc) > +{ > + struct fuse_conn *fc = fsc->s_fs_info; > + > + return fc->iq.priv == get_fuse_conn_super(sb)->iq.priv; > +} > + > +static int virtio_fs_set_super(struct super_block *sb, > + struct fs_context *fsc) > +{ > + int err; > + > + err = get_anon_bdev(&sb->s_dev); > + if (!err) > + fuse_conn_get(fsc->s_fs_info); > + > + return err; > +} > + > +static int virtio_fs_get_tree(struct fs_context *fsc) > +{ > + struct virtio_fs *fs; > + struct super_block *sb; > + struct fuse_conn *fc; > + int err; > + > + fs = virtio_fs_find_instance(fsc->source); > + if (!fs) { > + pr_info("virtio-fs: tag <%s> not found\n", fsc->source); > + return -EINVAL; > + } > + > + fc = kzalloc(sizeof(struct fuse_conn), GFP_KERNEL); > + if (!fc) > + return -ENOMEM; > + > + fuse_conn_init(fc, get_user_ns(current_user_ns()), &virtio_fs_fiq_ops, > + fs); > + fc->release = fuse_free_conn; > + fc->delete_stale = true; > + > + fsc->s_fs_info = fc; > + sb = sget_fc(fsc, virtio_fs_test_super, virtio_fs_set_super); > + fuse_conn_put(fc); > + if (IS_ERR(sb)) > + return PTR_ERR(sb); > + > + if (!sb->s_root) { > + err = virtio_fs_fill_super(sb); > + if (err) { > + deactivate_locked_super(sb); > + return err; > + } > + > + sb->s_flags |= SB_ACTIVE; > + } > + > + WARN_ON(fsc->root); > + fsc->root = dget(sb->s_root); > + return 0; > +} > + > +static const struct fs_context_operations virtio_fs_context_ops = { > + .get_tree = virtio_fs_get_tree, > +}; > + > +static int virtio_fs_init_fs_context(struct fs_context *fsc) > +{ > + fsc->ops = &virtio_fs_context_ops; > + return 0; > +} > + > +static struct file_system_type virtio_fs_type = { > + .owner = THIS_MODULE, > + .name = "virtiofs", > + .init_fs_context = virtio_fs_init_fs_context, > + .kill_sb = virtio_kill_sb, > +}; > + > +static int __init virtio_fs_init(void) > +{ > + int ret; > + > + ret = register_virtio_driver(&virtio_fs_driver); > + if (ret < 0) > + return ret; > + > + ret = register_filesystem(&virtio_fs_type); > + if (ret < 0) { > + unregister_virtio_driver(&virtio_fs_driver); > + return ret; > + } > + > + return 0; > +} > +module_init(virtio_fs_init); > + > +static void __exit virtio_fs_exit(void) > +{ > + unregister_filesystem(&virtio_fs_type); > + unregister_virtio_driver(&virtio_fs_driver); > +} > +module_exit(virtio_fs_exit); > + > +MODULE_AUTHOR("Stefan Hajnoczi <stefanha@redhat.com>"); > +MODULE_DESCRIPTION("Virtio Filesystem"); > +MODULE_LICENSE("GPL"); > +MODULE_ALIAS_FS(KBUILD_MODNAME); > +MODULE_DEVICE_TABLE(virtio, id_table); > diff --git a/include/uapi/linux/virtio_fs.h b/include/uapi/linux/virtio_fs.h > new file mode 100644 > index 000000000000..b5e99c217c86 > --- /dev/null > +++ b/include/uapi/linux/virtio_fs.h > @@ -0,0 +1,19 @@ > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */ > + > +#ifndef _UAPI_LINUX_VIRTIO_FS_H > +#define _UAPI_LINUX_VIRTIO_FS_H > + > +#include <linux/types.h> > +#include <linux/virtio_ids.h> > +#include <linux/virtio_config.h> > +#include <linux/virtio_types.h> > + > +struct virtio_fs_config { > + /* Filesystem name (UTF-8, not NUL-terminated, padded with NULs) */ > + __u8 tag[36]; > + > + /* Number of request queues */ > + __u32 num_queues; > +} __attribute__((packed)); > + > +#endif /* _UAPI_LINUX_VIRTIO_FS_H */ > diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h > index 348fd0176f75..585e07b27333 100644 > --- a/include/uapi/linux/virtio_ids.h > +++ b/include/uapi/linux/virtio_ids.h > @@ -44,6 +44,7 @@ > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */ > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */ > #define VIRTIO_ID_IOMMU 23 /* virtio IOMMU */ > +#define VIRTIO_ID_FS 26 /* virtio filesystem */ > #define VIRTIO_ID_PMEM 27 /* virtio pmem */ > > #endif /* _LINUX_VIRTIO_IDS_H */ > -- > 2.21.0
On Tue, Sep 03, 2019 at 09:55:49AM -0400, Michael S. Tsirkin wrote: > On Tue, Sep 03, 2019 at 01:42:02PM +0200, Miklos Szeredi wrote: > Endian-ness for fuse header also looks wrong. [...] > > +struct virtio_fs_forget { > > + struct fuse_in_header ih; > > + struct fuse_forget_in arg; > > These structures are all native endian. > > Passing them to host will make cross-endian setups painful to support, > and hardware implementations impossible. > > How about converting everything to LE? The driver dictates the endianness of the FUSE protocol session. The virtio-fs device specification states that the device looks at the first request's fuse_in_header::opcode field to detect the guest endianness. If it sees FUSE_INIT in its native endianness then no byte-swapping is necessary. If it sees FUSE_INIT in the opposite endianness then byte-swapping is necessary on the device side.
On Wed, Sep 04, 2019 at 07:16:30PM +0100, Stefan Hajnoczi wrote: > On Tue, Sep 03, 2019 at 09:55:49AM -0400, Michael S. Tsirkin wrote: > > On Tue, Sep 03, 2019 at 01:42:02PM +0200, Miklos Szeredi wrote: > > Endian-ness for fuse header also looks wrong. > [...] > > > +struct virtio_fs_forget { > > > + struct fuse_in_header ih; > > > + struct fuse_forget_in arg; > > > > These structures are all native endian. > > > > Passing them to host will make cross-endian setups painful to support, > > and hardware implementations impossible. > > > > How about converting everything to LE? > > The driver dictates the endianness of the FUSE protocol session. The > virtio-fs device specification states that the device looks at the first > request's fuse_in_header::opcode field to detect the guest endianness. > > If it sees FUSE_INIT in its native endianness then no byte-swapping is > necessary. If it sees FUSE_INIT in the opposite endianness then > byte-swapping is necessary on the device side. You are right. Pls ignore the comment. We need to reserve the byte-swapped FUSE_INIT to make sure future versions of fuse don't try to send that though. I sent a patch to that effect, let's see whether it gets accepted.
On Tue, Sep 03, 2019 at 09:55:49AM -0400, Michael S. Tsirkin wrote: [..] > What's with all of the TODOs? Some of these are really scary, > looks like they need to be figured out before this is merged. Hi Michael, One of the issue I noticed is races w.r.t device removal and super block initialization. I am about to post a set of patches which take care of these races and also get rid of some of the scary TODOs. Other TODOs like suspend/restore, multiqueue support etc are improvements which we can do over a period of time. [..] > > +/* Per-virtqueue state */ > > +struct virtio_fs_vq { > > + spinlock_t lock; > > + struct virtqueue *vq; /* protected by ->lock */ > > + struct work_struct done_work; > > + struct list_head queued_reqs; > > + struct delayed_work dispatch_work; > > + struct fuse_dev *fud; > > + bool connected; > > + long in_flight; > > + char name[24]; > > I'd keep names somewhere separate as they are not used on data path. Ok, this sounds like a nice to have. Will take care of this once base patch gets merged. [..] > > +struct virtio_fs_forget { > > + struct fuse_in_header ih; > > + struct fuse_forget_in arg; > > These structures are all native endian. > > Passing them to host will make cross-endian setups painful to support, > and hardware implementations impossible. > > How about converting everything to LE? So looks like endianness issue is now resolved (going by the other emails). So I will not worry about it. [..] > > +/* Add a new instance to the list or return -EEXIST if tag name exists*/ > > +static int virtio_fs_add_instance(struct virtio_fs *fs) > > +{ > > + struct virtio_fs *fs2; > > + bool duplicate = false; > > + > > + mutex_lock(&virtio_fs_mutex); > > + > > + list_for_each_entry(fs2, &virtio_fs_instances, list) { > > + if (strcmp(fs->tag, fs2->tag) == 0) > > + duplicate = true; > > + } > > + > > + if (!duplicate) > > + list_add_tail(&fs->list, &virtio_fs_instances); > > > This is O(N^2) as it's presumably called for each istance. > Doesn't scale - please switch to a tree or such. This is O(N) and not O(N^2) right? Addition of device is O(N), search during mount is O(N). This is not a frequent event at all and number of virtiofs instances per guest are expected to be fairly small (say less than 10). So I really don't think there is any value in converting this into a tree (instead of a list). [..] > > +static void virtio_fs_free_devs(struct virtio_fs *fs) > > +{ > > + unsigned int i; > > + > > + /* TODO lock */ > > Doesn't inspire confidence, does it? Agreed. Getting rid of this in set of fixes I am about to post. > > > + > > + for (i = 0; i < fs->nvqs; i++) { > > + struct virtio_fs_vq *fsvq = &fs->vqs[i]; > > + > > + if (!fsvq->fud) > > + continue; > > + > > + flush_work(&fsvq->done_work); > > + flush_delayed_work(&fsvq->dispatch_work); > > + > > + /* TODO need to quiesce/end_requests/decrement dev_count */ > > Indeed. Won't this crash if we don't? Took care of this as well. [..] > > +static void virtio_fs_hiprio_dispatch_work(struct work_struct *work) > > +{ > > + struct virtio_fs_forget *forget; > > + struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq, > > + dispatch_work.work); > > + struct virtqueue *vq = fsvq->vq; > > + struct scatterlist sg; > > + struct scatterlist *sgs[] = {&sg}; > > + bool notify; > > + int ret; > > + > > + pr_debug("virtio-fs: worker %s called.\n", __func__); > > + while (1) { > > + spin_lock(&fsvq->lock); > > + forget = list_first_entry_or_null(&fsvq->queued_reqs, > > + struct virtio_fs_forget, list); > > + if (!forget) { > > + spin_unlock(&fsvq->lock); > > + return; > > + } > > + > > + list_del(&forget->list); > > + if (!fsvq->connected) { > > + spin_unlock(&fsvq->lock); > > + kfree(forget); > > + continue; > > + } > > + > > + sg_init_one(&sg, forget, sizeof(*forget)); > > This passes to host a structure including "struct list_head list"; > > Not a good idea. Ok, host does not have to see "struct list_head list". Its needed for guest. Will look into getting rid of this. > > > > + > > + /* Enqueue the request */ > > + dev_dbg(&vq->vdev->dev, "%s\n", __func__); > > + ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC); > > > This is easier as add_outbuf. Will look into it. > > Also - why GFP_ATOMIC? Hmm..., may be it can be GFP_KERNEL. I don't see atomic context here. Will look into it. > > > + if (ret < 0) { > > + if (ret == -ENOMEM || ret == -ENOSPC) { > > + pr_debug("virtio-fs: Could not queue FORGET: err=%d. Will try later\n", > > + ret); > > + list_add_tail(&forget->list, > > + &fsvq->queued_reqs); > > + schedule_delayed_work(&fsvq->dispatch_work, > > + msecs_to_jiffies(1)); > > Can't we we requeue after some buffers get consumed? That's what dispatch work is doing. It tries to requeue the request after a while. [..] > > +static int virtio_fs_probe(struct virtio_device *vdev) > > +{ > > + struct virtio_fs *fs; > > + int ret; > > + > > + fs = devm_kzalloc(&vdev->dev, sizeof(*fs), GFP_KERNEL); > > + if (!fs) > > + return -ENOMEM; > > + vdev->priv = fs; > > + > > + ret = virtio_fs_read_tag(vdev, fs); > > + if (ret < 0) > > + goto out; > > + > > + ret = virtio_fs_setup_vqs(vdev, fs); > > + if (ret < 0) > > + goto out; > > + > > + /* TODO vq affinity */ > > + /* TODO populate notifications vq */ > > what's notifications vq? It has not been implemented yet. At some point of time we want to have a notion of notification queue so that host can send notifications to guest. Will get rid of this comment for now. [..] > > +#ifdef CONFIG_PM_SLEEP > > +static int virtio_fs_freeze(struct virtio_device *vdev) > > +{ > > + return 0; /* TODO */ > > +} > > + > > +static int virtio_fs_restore(struct virtio_device *vdev) > > +{ > > + return 0; /* TODO */ > > +} > > Is this really a good idea? I'd rather it was implemented, > but if not possible at all disabling PM seems better than just > keep going. I agree. Will look into disabling it. > > > +#endif /* CONFIG_PM_SLEEP */ > > + > > +const static struct virtio_device_id id_table[] = { > > + { VIRTIO_ID_FS, VIRTIO_DEV_ANY_ID }, > > + {}, > > +}; > > + > > +const static unsigned int feature_table[] = {}; > > + > > +static struct virtio_driver virtio_fs_driver = { > > + .driver.name = KBUILD_MODNAME, > > + .driver.owner = THIS_MODULE, > > + .id_table = id_table, > > + .feature_table = feature_table, > > + .feature_table_size = ARRAY_SIZE(feature_table), > > + /* TODO validate config_get != NULL */ > > Why? Don't know. Stefan, do you remember why did you put this comment? If not, I will get rid of it. > > > + .probe = virtio_fs_probe, > > + .remove = virtio_fs_remove, > > +#ifdef CONFIG_PM_SLEEP > > + .freeze = virtio_fs_freeze, > > + .restore = virtio_fs_restore, > > +#endif > > +}; > > + > > +static void virtio_fs_wake_forget_and_unlock(struct fuse_iqueue *fiq) > > +__releases(fiq->waitq.lock) > > +{ > > + struct fuse_forget_link *link; > > + struct virtio_fs_forget *forget; > > + struct scatterlist sg; > > + struct scatterlist *sgs[] = {&sg}; > > + struct virtio_fs *fs; > > + struct virtqueue *vq; > > + struct virtio_fs_vq *fsvq; > > + bool notify; > > + u64 unique; > > + int ret; > > + > > + link = fuse_dequeue_forget(fiq, 1, NULL); > > + unique = fuse_get_unique(fiq); > > + > > + fs = fiq->priv; > > + fsvq = &fs->vqs[VQ_HIPRIO]; > > + spin_unlock(&fiq->waitq.lock); > > + > > + /* Allocate a buffer for the request */ > > + forget = kmalloc(sizeof(*forget), GFP_NOFS | __GFP_NOFAIL); > > + > > + forget->ih = (struct fuse_in_header){ > > + .opcode = FUSE_FORGET, > > + .nodeid = link->forget_one.nodeid, > > + .unique = unique, > > + .len = sizeof(*forget), > > + }; > > + forget->arg = (struct fuse_forget_in){ > > + .nlookup = link->forget_one.nlookup, > > + }; > > + > > + sg_init_one(&sg, forget, sizeof(*forget)); > > + > > + /* Enqueue the request */ > > + vq = fsvq->vq; > > + dev_dbg(&vq->vdev->dev, "%s\n", __func__); > > + spin_lock(&fsvq->lock); > > + > > + ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC); > > + if (ret < 0) { > > + if (ret == -ENOMEM || ret == -ENOSPC) { > > + pr_debug("virtio-fs: Could not queue FORGET: err=%d. Will try later.\n", > > + ret); > > + list_add_tail(&forget->list, &fsvq->queued_reqs); > > + schedule_delayed_work(&fsvq->dispatch_work, > > + msecs_to_jiffies(1)); > > + } else { > > + pr_debug("virtio-fs: Could not queue FORGET: err=%d. Dropping it.\n", > > + ret); > > + kfree(forget); > > + } > > + spin_unlock(&fsvq->lock); > > + goto out; > > + } > > > code duplicated from virtio_fs_hiprio_dispatch_work Will look into reusing the code. > > > + > > + fsvq->in_flight++; > > + notify = virtqueue_kick_prepare(vq); > > + > > + spin_unlock(&fsvq->lock); > > + > > + if (notify) > > + virtqueue_notify(vq); > > +out: > > + kfree(link); > > +} > > + > > +static void virtio_fs_wake_interrupt_and_unlock(struct fuse_iqueue *fiq) > > +__releases(fiq->waitq.lock) > > +{ > > + /* TODO */ > > what's needed? We have not implemented this interrupt thing. It interrupts the existing pending requests. Its not a must to have. [..] > > +static int virtio_fs_enqueue_req(struct virtqueue *vq, struct fuse_req *req) > > +{ > > + /* requests need at least 4 elements */ > > + struct scatterlist *stack_sgs[6]; > > + struct scatterlist stack_sg[ARRAY_SIZE(stack_sgs)]; > > + struct scatterlist **sgs = stack_sgs; > > + struct scatterlist *sg = stack_sg; > > + struct virtio_fs_vq *fsvq; > > + unsigned int argbuf_used = 0; > > + unsigned int out_sgs = 0; > > + unsigned int in_sgs = 0; > > + unsigned int total_sgs; > > + unsigned int i; > > + int ret; > > + bool notify; > > + > > + /* Does the sglist fit on the stack? */ > > + total_sgs = sg_count_fuse_req(req); > > + if (total_sgs > ARRAY_SIZE(stack_sgs)) { > > + sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC); > > + sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC); > > + if (!sgs || !sg) { > > + ret = -ENOMEM; > > + goto out; > > + } > > + } > > + > > + /* Use a bounce buffer since stack args cannot be mapped */ > > + ret = copy_args_to_argbuf(req); > > + if (ret < 0) > > + goto out; > > + > > + /* Request elements */ > > + sg_init_one(&sg[out_sgs++], &req->in.h, sizeof(req->in.h)); > > + out_sgs += sg_init_fuse_args(&sg[out_sgs], req, > > + (struct fuse_arg *)req->in.args, > > + req->in.numargs, req->in.argpages, > > + req->argbuf, &argbuf_used); > > + > > + /* Reply elements */ > > + if (test_bit(FR_ISREPLY, &req->flags)) { > > + sg_init_one(&sg[out_sgs + in_sgs++], > > + &req->out.h, sizeof(req->out.h)); > > + in_sgs += sg_init_fuse_args(&sg[out_sgs + in_sgs], req, > > + req->out.args, req->out.numargs, > > + req->out.argpages, > > + req->argbuf + argbuf_used, NULL); > > + } > > + > > + WARN_ON(out_sgs + in_sgs != total_sgs); > > + > > + for (i = 0; i < total_sgs; i++) > > + sgs[i] = &sg[i]; > > + > > + fsvq = vq_to_fsvq(vq); > > + spin_lock(&fsvq->lock); > > + > > + ret = virtqueue_add_sgs(vq, sgs, out_sgs, in_sgs, req, GFP_ATOMIC); > > + if (ret < 0) { > > + /* TODO handle full virtqueue */ > > Indeed. What prevents this? So for forget requests (VQ_HIPRIO), I already handled this by retrying submitting the request with the help of a worker. For regular requests (VQ_REQUESTS), I have never run into virt queue being full so far. That's why never worried about it. So nothing prevents this. But we have not noticed it yet. So its a TODO item. It will be nice to retry if virtuqueue gets full (instead of returning error to caller). [..] > > +static void virtio_fs_wake_pending_and_unlock(struct fuse_iqueue *fiq) > > +__releases(fiq->waitq.lock) > > +{ > > + unsigned int queue_id = VQ_REQUEST; /* TODO multiqueue */ > > + struct virtio_fs *fs; > > + struct fuse_conn *fc; > > + struct fuse_req *req; > > + struct fuse_pqueue *fpq; > > + int ret; > > + > > + WARN_ON(list_empty(&fiq->pending)); > > + req = list_last_entry(&fiq->pending, struct fuse_req, list); > > + clear_bit(FR_PENDING, &req->flags); > > + list_del_init(&req->list); > > + WARN_ON(!list_empty(&fiq->pending)); > > + spin_unlock(&fiq->waitq.lock); > > + > > + fs = fiq->priv; > > + fc = fs->vqs[queue_id].fud->fc; > > + > > + dev_dbg(&fs->vqs[queue_id].vq->vdev->dev, > > + "%s: opcode %u unique %#llx nodeid %#llx in.len %u out.len %u\n", > > + __func__, req->in.h.opcode, req->in.h.unique, req->in.h.nodeid, > > + req->in.h.len, fuse_len_args(req->out.numargs, req->out.args)); > > + > > + fpq = &fs->vqs[queue_id].fud->pq; > > + spin_lock(&fpq->lock); > > + if (!fpq->connected) { > > + spin_unlock(&fpq->lock); > > + req->out.h.error = -ENODEV; > > + pr_err("virtio-fs: %s disconnected\n", __func__); > > + fuse_request_end(fc, req); > > + return; > > + } > > + list_add_tail(&req->list, fpq->processing); > > + spin_unlock(&fpq->lock); > > + set_bit(FR_SENT, &req->flags); > > + /* matches barrier in request_wait_answer() */ > > + smp_mb__after_atomic(); > > + /* TODO check for FR_INTERRUPTED? */ > > > ? hmm... we don't support FR_INTERRUPTED. Stefan, do you remember why this TODO is here. If not, I will get rid of it. > > > + > > +retry: > > + ret = virtio_fs_enqueue_req(fs->vqs[queue_id].vq, req); > > + if (ret < 0) { > > + if (ret == -ENOMEM || ret == -ENOSPC) { > > + /* Virtqueue full. Retry submission */ > > + usleep_range(20, 30); > > again, why not respond to completion? Using usleep() was a quick fix. Will look into using completion in second round of cleanup. In first round I am taking care of more scary TODOs which deal with races w.r.t device removal. > > > + goto retry; > > + } > > + req->out.h.error = ret; > > + pr_err("virtio-fs: virtio_fs_enqueue_req() failed %d\n", ret); > > + fuse_request_end(fc, req); > > + return; > > + } > > +} > > + > > +static void virtio_fs_flush_hiprio_queue(struct virtio_fs_vq *fsvq) > > +{ > > + struct virtio_fs_forget *forget; > > + > > + WARN_ON(fsvq->in_flight < 0); > > + > > + /* Go through pending forget requests and free them */ > > + spin_lock(&fsvq->lock); > > + while (1) { > > + forget = list_first_entry_or_null(&fsvq->queued_reqs, > > + struct virtio_fs_forget, list); > > + if (!forget) > > + break; > > + list_del(&forget->list); > > + kfree(forget); > > + } > > + > > + spin_unlock(&fsvq->lock); > > + > > + /* Wait for in flight requests to finish.*/ > > + while (1) { > > + spin_lock(&fsvq->lock); > > + if (!fsvq->in_flight) { > > + spin_unlock(&fsvq->lock); > > + break; > > + } > > + spin_unlock(&fsvq->lock); > > + usleep_range(1000, 2000); > > Same thing here. Can we use e.g. a completion and not usleep? Second round cleanup. > > > + } > > +} > > + > > +const static struct fuse_iqueue_ops virtio_fs_fiq_ops = { > > + .wake_forget_and_unlock = virtio_fs_wake_forget_and_unlock, > > + .wake_interrupt_and_unlock = virtio_fs_wake_interrupt_and_unlock, > > + .wake_pending_and_unlock = virtio_fs_wake_pending_and_unlock, > > +}; > > + > > +static int virtio_fs_fill_super(struct super_block *sb) > > +{ > > + struct fuse_conn *fc = get_fuse_conn_super(sb); > > + struct virtio_fs *fs = fc->iq.priv; > > + unsigned int i; > > + int err; > > + struct fuse_req *init_req; > > + struct fuse_fs_context ctx = { > > + .rootmode = S_IFDIR, > > + .default_permissions = 1, > > + .allow_other = 1, > > + .max_read = UINT_MAX, > > + .blksize = 512, > > + .destroy = true, > > + .no_control = true, > > + .no_force_umount = true, > > + }; > > + > > + /* TODO lock */ > > what does this refer to? Took care of in first round of cleanup. > > > + if (fs->vqs[VQ_REQUEST].fud) { > > + pr_err("virtio-fs: device already in use\n"); > > + err = -EBUSY; > > + goto err; > > + } > > + > > + err = -ENOMEM; > > + /* Allocate fuse_dev for hiprio and notification queues */ > > + for (i = 0; i < VQ_REQUEST; i++) { > > + struct virtio_fs_vq *fsvq = &fs->vqs[i]; > > + > > + fsvq->fud = fuse_dev_alloc(); > > + if (!fsvq->fud) > > + goto err_free_fuse_devs; > > + } > > + > > + init_req = fuse_request_alloc(0); > > + if (!init_req) > > + goto err_free_fuse_devs; > > + __set_bit(FR_BACKGROUND, &init_req->flags); > > + > > + ctx.fudptr = (void **)&fs->vqs[VQ_REQUEST].fud; > > + err = fuse_fill_super_common(sb, &ctx); > > + if (err < 0) > > + goto err_free_init_req; > > + > > + fc = fs->vqs[VQ_REQUEST].fud->fc; > > + > > + /* TODO take fuse_mutex around this loop? */ > > Don't we need to figure this kind of thing out before this > code lands upstream? I think we don't need this TODO. fuse_connection object and associated super block are still being formed. And my cleanup has taken care of the additional locking. Thanks Vivek
On Thu, Sep 05, 2019 at 03:15:15PM -0400, Vivek Goyal wrote: > On Tue, Sep 03, 2019 at 09:55:49AM -0400, Michael S. Tsirkin wrote: > [..] > > What's with all of the TODOs? Some of these are really scary, > > looks like they need to be figured out before this is merged. > > Hi Michael, > > One of the issue I noticed is races w.r.t device removal and super > block initialization. I am about to post a set of patches which > take care of these races and also get rid of some of the scary > TODOs. Other TODOs like suspend/restore, multiqueue support etc > are improvements which we can do over a period of time. > > [..] > > > +/* Per-virtqueue state */ > > > +struct virtio_fs_vq { > > > + spinlock_t lock; > > > + struct virtqueue *vq; /* protected by ->lock */ > > > + struct work_struct done_work; > > > + struct list_head queued_reqs; > > > + struct delayed_work dispatch_work; > > > + struct fuse_dev *fud; > > > + bool connected; > > > + long in_flight; > > > + char name[24]; > > > > I'd keep names somewhere separate as they are not used on data path. > > Ok, this sounds like a nice to have. Will take care of this once base > patch gets merged. > > [..] > > > +struct virtio_fs_forget { > > > + struct fuse_in_header ih; > > > + struct fuse_forget_in arg; > > > > These structures are all native endian. > > > > Passing them to host will make cross-endian setups painful to support, > > and hardware implementations impossible. > > > > How about converting everything to LE? > > So looks like endianness issue is now resolved (going by the other > emails). So I will not worry about it. > > [..] > > > +/* Add a new instance to the list or return -EEXIST if tag name exists*/ > > > +static int virtio_fs_add_instance(struct virtio_fs *fs) > > > +{ > > > + struct virtio_fs *fs2; > > > + bool duplicate = false; > > > + > > > + mutex_lock(&virtio_fs_mutex); > > > + > > > + list_for_each_entry(fs2, &virtio_fs_instances, list) { > > > + if (strcmp(fs->tag, fs2->tag) == 0) > > > + duplicate = true; > > > + } > > > + > > > + if (!duplicate) > > > + list_add_tail(&fs->list, &virtio_fs_instances); > > > > > > This is O(N^2) as it's presumably called for each istance. > > Doesn't scale - please switch to a tree or such. > > This is O(N) and not O(N^2) right? Addition of device is O(N), search > during mount is O(N). > > This is not a frequent event at all and number of virtiofs instances > per guest are expected to be fairly small (say less than 10). So I > really don't think there is any value in converting this into a tree > (instead of a list). > > [..] > > > +static void virtio_fs_free_devs(struct virtio_fs *fs) > > > +{ > > > + unsigned int i; > > > + > > > + /* TODO lock */ > > > > Doesn't inspire confidence, does it? > > Agreed. Getting rid of this in set of fixes I am about to post. > > > > > > + > > > + for (i = 0; i < fs->nvqs; i++) { > > > + struct virtio_fs_vq *fsvq = &fs->vqs[i]; > > > + > > > + if (!fsvq->fud) > > > + continue; > > > + > > > + flush_work(&fsvq->done_work); > > > + flush_delayed_work(&fsvq->dispatch_work); > > > + > > > + /* TODO need to quiesce/end_requests/decrement dev_count */ > > > > Indeed. Won't this crash if we don't? > > Took care of this as well. > > [..] > > > +static void virtio_fs_hiprio_dispatch_work(struct work_struct *work) > > > +{ > > > + struct virtio_fs_forget *forget; > > > + struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq, > > > + dispatch_work.work); > > > + struct virtqueue *vq = fsvq->vq; > > > + struct scatterlist sg; > > > + struct scatterlist *sgs[] = {&sg}; > > > + bool notify; > > > + int ret; > > > + > > > + pr_debug("virtio-fs: worker %s called.\n", __func__); > > > + while (1) { > > > + spin_lock(&fsvq->lock); > > > + forget = list_first_entry_or_null(&fsvq->queued_reqs, > > > + struct virtio_fs_forget, list); > > > + if (!forget) { > > > + spin_unlock(&fsvq->lock); > > > + return; > > > + } > > > + > > > + list_del(&forget->list); > > > + if (!fsvq->connected) { > > > + spin_unlock(&fsvq->lock); > > > + kfree(forget); > > > + continue; > > > + } > > > + > > > + sg_init_one(&sg, forget, sizeof(*forget)); > > > > This passes to host a structure including "struct list_head list"; > > > > Not a good idea. > > Ok, host does not have to see "struct list_head list". Its needed for > guest. Will look into getting rid of this. > > > > > > > > + > > > + /* Enqueue the request */ > > > + dev_dbg(&vq->vdev->dev, "%s\n", __func__); > > > + ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC); > > > > > > This is easier as add_outbuf. > > Will look into it. > > > > > Also - why GFP_ATOMIC? > > Hmm..., may be it can be GFP_KERNEL. I don't see atomic context here. Will > look into it. > > > > > > + if (ret < 0) { > > > + if (ret == -ENOMEM || ret == -ENOSPC) { > > > + pr_debug("virtio-fs: Could not queue FORGET: err=%d. Will try later\n", > > > + ret); > > > + list_add_tail(&forget->list, > > > + &fsvq->queued_reqs); > > > + schedule_delayed_work(&fsvq->dispatch_work, > > > + msecs_to_jiffies(1)); > > > > Can't we we requeue after some buffers get consumed? > > That's what dispatch work is doing. It tries to requeue the request after > a while. > > [..] > > > +static int virtio_fs_probe(struct virtio_device *vdev) > > > +{ > > > + struct virtio_fs *fs; > > > + int ret; > > > + > > > + fs = devm_kzalloc(&vdev->dev, sizeof(*fs), GFP_KERNEL); > > > + if (!fs) > > > + return -ENOMEM; > > > + vdev->priv = fs; > > > + > > > + ret = virtio_fs_read_tag(vdev, fs); > > > + if (ret < 0) > > > + goto out; > > > + > > > + ret = virtio_fs_setup_vqs(vdev, fs); > > > + if (ret < 0) > > > + goto out; > > > + > > > + /* TODO vq affinity */ > > > + /* TODO populate notifications vq */ > > > > what's notifications vq? > > It has not been implemented yet. At some point of time we want to have > a notion of notification queue so that host can send notifications to > guest. Will get rid of this comment for now. > > [..] > > > +#ifdef CONFIG_PM_SLEEP > > > +static int virtio_fs_freeze(struct virtio_device *vdev) > > > +{ > > > + return 0; /* TODO */ > > > +} > > > + > > > +static int virtio_fs_restore(struct virtio_device *vdev) > > > +{ > > > + return 0; /* TODO */ > > > +} > > > > Is this really a good idea? I'd rather it was implemented, > > but if not possible at all disabling PM seems better than just > > keep going. > > I agree. Will look into disabling it. > > > > > > +#endif /* CONFIG_PM_SLEEP */ > > > + > > > +const static struct virtio_device_id id_table[] = { > > > + { VIRTIO_ID_FS, VIRTIO_DEV_ANY_ID }, > > > + {}, > > > +}; > > > + > > > +const static unsigned int feature_table[] = {}; > > > + > > > +static struct virtio_driver virtio_fs_driver = { > > > + .driver.name = KBUILD_MODNAME, > > > + .driver.owner = THIS_MODULE, > > > + .id_table = id_table, > > > + .feature_table = feature_table, > > > + .feature_table_size = ARRAY_SIZE(feature_table), > > > + /* TODO validate config_get != NULL */ > > > > Why? > > Don't know. Stefan, do you remember why did you put this comment? If not, > I will get rid of it. This comment can be removed. > > > +static void virtio_fs_wake_pending_and_unlock(struct fuse_iqueue *fiq) > > > +__releases(fiq->waitq.lock) > > > +{ > > > + unsigned int queue_id = VQ_REQUEST; /* TODO multiqueue */ > > > + struct virtio_fs *fs; > > > + struct fuse_conn *fc; > > > + struct fuse_req *req; > > > + struct fuse_pqueue *fpq; > > > + int ret; > > > + > > > + WARN_ON(list_empty(&fiq->pending)); > > > + req = list_last_entry(&fiq->pending, struct fuse_req, list); > > > + clear_bit(FR_PENDING, &req->flags); > > > + list_del_init(&req->list); > > > + WARN_ON(!list_empty(&fiq->pending)); > > > + spin_unlock(&fiq->waitq.lock); > > > + > > > + fs = fiq->priv; > > > + fc = fs->vqs[queue_id].fud->fc; > > > + > > > + dev_dbg(&fs->vqs[queue_id].vq->vdev->dev, > > > + "%s: opcode %u unique %#llx nodeid %#llx in.len %u out.len %u\n", > > > + __func__, req->in.h.opcode, req->in.h.unique, req->in.h.nodeid, > > > + req->in.h.len, fuse_len_args(req->out.numargs, req->out.args)); > > > + > > > + fpq = &fs->vqs[queue_id].fud->pq; > > > + spin_lock(&fpq->lock); > > > + if (!fpq->connected) { > > > + spin_unlock(&fpq->lock); > > > + req->out.h.error = -ENODEV; > > > + pr_err("virtio-fs: %s disconnected\n", __func__); > > > + fuse_request_end(fc, req); > > > + return; > > > + } > > > + list_add_tail(&req->list, fpq->processing); > > > + spin_unlock(&fpq->lock); > > > + set_bit(FR_SENT, &req->flags); > > > + /* matches barrier in request_wait_answer() */ > > > + smp_mb__after_atomic(); > > > + /* TODO check for FR_INTERRUPTED? */ > > > > > > ? > > hmm... we don't support FR_INTERRUPTED. Stefan, do you remember why > this TODO is here. If not, I will get rid of it. We don't support FUSE_INTERRUPT yet. The purpose of this comment is that when we do support FUSE_INTERRUPT we'll need to follow fuse_dev_do_read() in queuing a FUSE_INTERRUPT here. Stefan
On Fri, Sep 06, 2019 at 11:22:09AM +0100, Stefan Hajnoczi wrote: > On Thu, Sep 05, 2019 at 03:15:15PM -0400, Vivek Goyal wrote: > > On Tue, Sep 03, 2019 at 09:55:49AM -0400, Michael S. Tsirkin wrote: > > [..] > > > What's with all of the TODOs? Some of these are really scary, > > > looks like they need to be figured out before this is merged. > > > > Hi Michael, > > > > One of the issue I noticed is races w.r.t device removal and super > > block initialization. I am about to post a set of patches which > > take care of these races and also get rid of some of the scary > > TODOs. Other TODOs like suspend/restore, multiqueue support etc > > are improvements which we can do over a period of time. > > > > [..] > > > > +/* Per-virtqueue state */ > > > > +struct virtio_fs_vq { > > > > + spinlock_t lock; > > > > + struct virtqueue *vq; /* protected by ->lock */ > > > > + struct work_struct done_work; > > > > + struct list_head queued_reqs; > > > > + struct delayed_work dispatch_work; > > > > + struct fuse_dev *fud; > > > > + bool connected; > > > > + long in_flight; > > > > + char name[24]; > > > > > > I'd keep names somewhere separate as they are not used on data path. > > > > Ok, this sounds like a nice to have. Will take care of this once base > > patch gets merged. > > > > [..] > > > > +struct virtio_fs_forget { > > > > + struct fuse_in_header ih; > > > > + struct fuse_forget_in arg; > > > > > > These structures are all native endian. > > > > > > Passing them to host will make cross-endian setups painful to support, > > > and hardware implementations impossible. > > > > > > How about converting everything to LE? > > > > So looks like endianness issue is now resolved (going by the other > > emails). So I will not worry about it. > > > > [..] > > > > +/* Add a new instance to the list or return -EEXIST if tag name exists*/ > > > > +static int virtio_fs_add_instance(struct virtio_fs *fs) > > > > +{ > > > > + struct virtio_fs *fs2; > > > > + bool duplicate = false; > > > > + > > > > + mutex_lock(&virtio_fs_mutex); > > > > + > > > > + list_for_each_entry(fs2, &virtio_fs_instances, list) { > > > > + if (strcmp(fs->tag, fs2->tag) == 0) > > > > + duplicate = true; > > > > + } > > > > + > > > > + if (!duplicate) > > > > + list_add_tail(&fs->list, &virtio_fs_instances); > > > > > > > > > This is O(N^2) as it's presumably called for each istance. > > > Doesn't scale - please switch to a tree or such. > > > > This is O(N) and not O(N^2) right? Addition of device is O(N), search > > during mount is O(N). > > > > This is not a frequent event at all and number of virtiofs instances > > per guest are expected to be fairly small (say less than 10). So I > > really don't think there is any value in converting this into a tree > > (instead of a list). > > > > [..] > > > > +static void virtio_fs_free_devs(struct virtio_fs *fs) > > > > +{ > > > > + unsigned int i; > > > > + > > > > + /* TODO lock */ > > > > > > Doesn't inspire confidence, does it? > > > > Agreed. Getting rid of this in set of fixes I am about to post. > > > > > > > > > + > > > > + for (i = 0; i < fs->nvqs; i++) { > > > > + struct virtio_fs_vq *fsvq = &fs->vqs[i]; > > > > + > > > > + if (!fsvq->fud) > > > > + continue; > > > > + > > > > + flush_work(&fsvq->done_work); > > > > + flush_delayed_work(&fsvq->dispatch_work); > > > > + > > > > + /* TODO need to quiesce/end_requests/decrement dev_count */ > > > > > > Indeed. Won't this crash if we don't? > > > > Took care of this as well. > > > > [..] > > > > +static void virtio_fs_hiprio_dispatch_work(struct work_struct *work) > > > > +{ > > > > + struct virtio_fs_forget *forget; > > > > + struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq, > > > > + dispatch_work.work); > > > > + struct virtqueue *vq = fsvq->vq; > > > > + struct scatterlist sg; > > > > + struct scatterlist *sgs[] = {&sg}; > > > > + bool notify; > > > > + int ret; > > > > + > > > > + pr_debug("virtio-fs: worker %s called.\n", __func__); > > > > + while (1) { > > > > + spin_lock(&fsvq->lock); > > > > + forget = list_first_entry_or_null(&fsvq->queued_reqs, > > > > + struct virtio_fs_forget, list); > > > > + if (!forget) { > > > > + spin_unlock(&fsvq->lock); > > > > + return; > > > > + } > > > > + > > > > + list_del(&forget->list); > > > > + if (!fsvq->connected) { > > > > + spin_unlock(&fsvq->lock); > > > > + kfree(forget); > > > > + continue; > > > > + } > > > > + > > > > + sg_init_one(&sg, forget, sizeof(*forget)); > > > > > > This passes to host a structure including "struct list_head list"; > > > > > > Not a good idea. > > > > Ok, host does not have to see "struct list_head list". Its needed for > > guest. Will look into getting rid of this. > > > > > > > > > > > > + > > > > + /* Enqueue the request */ > > > > + dev_dbg(&vq->vdev->dev, "%s\n", __func__); > > > > + ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC); > > > > > > > > > This is easier as add_outbuf. > > > > Will look into it. > > > > > > > > Also - why GFP_ATOMIC? > > > > Hmm..., may be it can be GFP_KERNEL. I don't see atomic context here. Will > > look into it. > > > > > > > > > + if (ret < 0) { > > > > + if (ret == -ENOMEM || ret == -ENOSPC) { > > > > + pr_debug("virtio-fs: Could not queue FORGET: err=%d. Will try later\n", > > > > + ret); > > > > + list_add_tail(&forget->list, > > > > + &fsvq->queued_reqs); > > > > + schedule_delayed_work(&fsvq->dispatch_work, > > > > + msecs_to_jiffies(1)); > > > > > > Can't we we requeue after some buffers get consumed? > > > > That's what dispatch work is doing. It tries to requeue the request after > > a while. > > > > [..] > > > > +static int virtio_fs_probe(struct virtio_device *vdev) > > > > +{ > > > > + struct virtio_fs *fs; > > > > + int ret; > > > > + > > > > + fs = devm_kzalloc(&vdev->dev, sizeof(*fs), GFP_KERNEL); > > > > + if (!fs) > > > > + return -ENOMEM; > > > > + vdev->priv = fs; > > > > + > > > > + ret = virtio_fs_read_tag(vdev, fs); > > > > + if (ret < 0) > > > > + goto out; > > > > + > > > > + ret = virtio_fs_setup_vqs(vdev, fs); > > > > + if (ret < 0) > > > > + goto out; > > > > + > > > > + /* TODO vq affinity */ > > > > + /* TODO populate notifications vq */ > > > > > > what's notifications vq? > > > > It has not been implemented yet. At some point of time we want to have > > a notion of notification queue so that host can send notifications to > > guest. Will get rid of this comment for now. > > > > [..] > > > > +#ifdef CONFIG_PM_SLEEP > > > > +static int virtio_fs_freeze(struct virtio_device *vdev) > > > > +{ > > > > + return 0; /* TODO */ > > > > +} > > > > + > > > > +static int virtio_fs_restore(struct virtio_device *vdev) > > > > +{ > > > > + return 0; /* TODO */ > > > > +} > > > > > > Is this really a good idea? I'd rather it was implemented, > > > but if not possible at all disabling PM seems better than just > > > keep going. > > > > I agree. Will look into disabling it. > > > > > > > > > +#endif /* CONFIG_PM_SLEEP */ > > > > + > > > > +const static struct virtio_device_id id_table[] = { > > > > + { VIRTIO_ID_FS, VIRTIO_DEV_ANY_ID }, > > > > + {}, > > > > +}; > > > > + > > > > +const static unsigned int feature_table[] = {}; > > > > + > > > > +static struct virtio_driver virtio_fs_driver = { > > > > + .driver.name = KBUILD_MODNAME, > > > > + .driver.owner = THIS_MODULE, > > > > + .id_table = id_table, > > > > + .feature_table = feature_table, > > > > + .feature_table_size = ARRAY_SIZE(feature_table), > > > > + /* TODO validate config_get != NULL */ > > > > > > Why? > > > > Don't know. Stefan, do you remember why did you put this comment? If not, > > I will get rid of it. > > This comment can be removed. > > > > > +static void virtio_fs_wake_pending_and_unlock(struct fuse_iqueue *fiq) > > > > +__releases(fiq->waitq.lock) > > > > +{ > > > > + unsigned int queue_id = VQ_REQUEST; /* TODO multiqueue */ > > > > + struct virtio_fs *fs; > > > > + struct fuse_conn *fc; > > > > + struct fuse_req *req; > > > > + struct fuse_pqueue *fpq; > > > > + int ret; > > > > + > > > > + WARN_ON(list_empty(&fiq->pending)); > > > > + req = list_last_entry(&fiq->pending, struct fuse_req, list); > > > > + clear_bit(FR_PENDING, &req->flags); > > > > + list_del_init(&req->list); > > > > + WARN_ON(!list_empty(&fiq->pending)); > > > > + spin_unlock(&fiq->waitq.lock); > > > > + > > > > + fs = fiq->priv; > > > > + fc = fs->vqs[queue_id].fud->fc; > > > > + > > > > + dev_dbg(&fs->vqs[queue_id].vq->vdev->dev, > > > > + "%s: opcode %u unique %#llx nodeid %#llx in.len %u out.len %u\n", > > > > + __func__, req->in.h.opcode, req->in.h.unique, req->in.h.nodeid, > > > > + req->in.h.len, fuse_len_args(req->out.numargs, req->out.args)); > > > > + > > > > + fpq = &fs->vqs[queue_id].fud->pq; > > > > + spin_lock(&fpq->lock); > > > > + if (!fpq->connected) { > > > > + spin_unlock(&fpq->lock); > > > > + req->out.h.error = -ENODEV; > > > > + pr_err("virtio-fs: %s disconnected\n", __func__); > > > > + fuse_request_end(fc, req); > > > > + return; > > > > + } > > > > + list_add_tail(&req->list, fpq->processing); > > > > + spin_unlock(&fpq->lock); > > > > + set_bit(FR_SENT, &req->flags); > > > > + /* matches barrier in request_wait_answer() */ > > > > + smp_mb__after_atomic(); > > > > + /* TODO check for FR_INTERRUPTED? */ > > > > > > > > > ? > > > > hmm... we don't support FR_INTERRUPTED. Stefan, do you remember why > > this TODO is here. If not, I will get rid of it. > > We don't support FUSE_INTERRUPT yet. The purpose of this comment is > that when we do support FUSE_INTERRUPT we'll need to follow > fuse_dev_do_read() in queuing a FUSE_INTERRUPT here. > > Stefan OK so pls write this explicitly in the comment.
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig index 24fc5a5c1b97..0635cba19971 100644 --- a/fs/fuse/Kconfig +++ b/fs/fuse/Kconfig @@ -27,3 +27,14 @@ config CUSE If you want to develop or use a userspace character device based on CUSE, answer Y or M. + +config VIRTIO_FS + tristate "Virtio Filesystem" + depends on FUSE_FS + select VIRTIO + help + The Virtio Filesystem allows guests to mount file systems from the + host. + + If you want to share files between guests or with the host, answer Y + or M. diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile index 9485019c2a14..6419a2b3510d 100644 --- a/fs/fuse/Makefile +++ b/fs/fuse/Makefile @@ -5,5 +5,6 @@ obj-$(CONFIG_FUSE_FS) += fuse.o obj-$(CONFIG_CUSE) += cuse.o +obj-$(CONFIG_VIRTIO_FS) += virtio_fs.o fuse-objs := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index dbf73e5d5b38..85e2dcad68c1 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -444,6 +444,11 @@ struct fuse_req { /** Request is stolen from fuse_file->reserved_req */ struct file *stolen_file; + +#if IS_ENABLED(CONFIG_VIRTIO_FS) + /** virtio-fs's physically contiguous buffer for in and out args */ + void *argbuf; +#endif }; struct fuse_iqueue; diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c new file mode 100644 index 000000000000..197e79e536f9 --- /dev/null +++ b/fs/fuse/virtio_fs.c @@ -0,0 +1,1072 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * virtio-fs: Virtio Filesystem + * Copyright (C) 2018 Red Hat, Inc. + */ + +#include <linux/fs.h> +#include <linux/module.h> +#include <linux/virtio.h> +#include <linux/virtio_fs.h> +#include <linux/delay.h> +#include <linux/fs_context.h> +#include <linux/highmem.h> +#include "fuse_i.h" + +/* List of virtio-fs device instances and a lock for the list */ +static DEFINE_MUTEX(virtio_fs_mutex); +static LIST_HEAD(virtio_fs_instances); + +enum { + VQ_HIPRIO, + VQ_REQUEST +}; + +/* Per-virtqueue state */ +struct virtio_fs_vq { + spinlock_t lock; + struct virtqueue *vq; /* protected by ->lock */ + struct work_struct done_work; + struct list_head queued_reqs; + struct delayed_work dispatch_work; + struct fuse_dev *fud; + bool connected; + long in_flight; + char name[24]; +} ____cacheline_aligned_in_smp; + +/* A virtio-fs device instance */ +struct virtio_fs { + struct list_head list; /* on virtio_fs_instances */ + char *tag; + struct virtio_fs_vq *vqs; + unsigned int nvqs; /* number of virtqueues */ + unsigned int num_queues; /* number of request queues */ +}; + +struct virtio_fs_forget { + struct fuse_in_header ih; + struct fuse_forget_in arg; + /* This request can be temporarily queued on virt queue */ + struct list_head list; +}; + +static inline struct virtio_fs_vq *vq_to_fsvq(struct virtqueue *vq) +{ + struct virtio_fs *fs = vq->vdev->priv; + + return &fs->vqs[vq->index]; +} + +static inline struct fuse_pqueue *vq_to_fpq(struct virtqueue *vq) +{ + return &vq_to_fsvq(vq)->fud->pq; +} + +/* Add a new instance to the list or return -EEXIST if tag name exists*/ +static int virtio_fs_add_instance(struct virtio_fs *fs) +{ + struct virtio_fs *fs2; + bool duplicate = false; + + mutex_lock(&virtio_fs_mutex); + + list_for_each_entry(fs2, &virtio_fs_instances, list) { + if (strcmp(fs->tag, fs2->tag) == 0) + duplicate = true; + } + + if (!duplicate) + list_add_tail(&fs->list, &virtio_fs_instances); + + mutex_unlock(&virtio_fs_mutex); + + if (duplicate) + return -EEXIST; + return 0; +} + +/* Return the virtio_fs with a given tag, or NULL */ +static struct virtio_fs *virtio_fs_find_instance(const char *tag) +{ + struct virtio_fs *fs; + + mutex_lock(&virtio_fs_mutex); + + list_for_each_entry(fs, &virtio_fs_instances, list) { + if (strcmp(fs->tag, tag) == 0) + goto found; + } + + fs = NULL; /* not found */ + +found: + mutex_unlock(&virtio_fs_mutex); + + return fs; +} + +static void virtio_fs_free_devs(struct virtio_fs *fs) +{ + unsigned int i; + + /* TODO lock */ + + for (i = 0; i < fs->nvqs; i++) { + struct virtio_fs_vq *fsvq = &fs->vqs[i]; + + if (!fsvq->fud) + continue; + + flush_work(&fsvq->done_work); + flush_delayed_work(&fsvq->dispatch_work); + + /* TODO need to quiesce/end_requests/decrement dev_count */ + fuse_dev_free(fsvq->fud); + fsvq->fud = NULL; + } +} + +/* Read filesystem name from virtio config into fs->tag (must kfree()). */ +static int virtio_fs_read_tag(struct virtio_device *vdev, struct virtio_fs *fs) +{ + char tag_buf[sizeof_field(struct virtio_fs_config, tag)]; + char *end; + size_t len; + + virtio_cread_bytes(vdev, offsetof(struct virtio_fs_config, tag), + &tag_buf, sizeof(tag_buf)); + end = memchr(tag_buf, '\0', sizeof(tag_buf)); + if (end == tag_buf) + return -EINVAL; /* empty tag */ + if (!end) + end = &tag_buf[sizeof(tag_buf)]; + + len = end - tag_buf; + fs->tag = devm_kmalloc(&vdev->dev, len + 1, GFP_KERNEL); + if (!fs->tag) + return -ENOMEM; + memcpy(fs->tag, tag_buf, len); + fs->tag[len] = '\0'; + return 0; +} + +/* Work function for hiprio completion */ +static void virtio_fs_hiprio_done_work(struct work_struct *work) +{ + struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq, + done_work); + struct virtqueue *vq = fsvq->vq; + + /* Free completed FUSE_FORGET requests */ + spin_lock(&fsvq->lock); + do { + unsigned int len; + void *req; + + virtqueue_disable_cb(vq); + + while ((req = virtqueue_get_buf(vq, &len)) != NULL) { + kfree(req); + fsvq->in_flight--; + } + } while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq))); + spin_unlock(&fsvq->lock); +} + +static void virtio_fs_dummy_dispatch_work(struct work_struct *work) +{ +} + +static void virtio_fs_hiprio_dispatch_work(struct work_struct *work) +{ + struct virtio_fs_forget *forget; + struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq, + dispatch_work.work); + struct virtqueue *vq = fsvq->vq; + struct scatterlist sg; + struct scatterlist *sgs[] = {&sg}; + bool notify; + int ret; + + pr_debug("virtio-fs: worker %s called.\n", __func__); + while (1) { + spin_lock(&fsvq->lock); + forget = list_first_entry_or_null(&fsvq->queued_reqs, + struct virtio_fs_forget, list); + if (!forget) { + spin_unlock(&fsvq->lock); + return; + } + + list_del(&forget->list); + if (!fsvq->connected) { + spin_unlock(&fsvq->lock); + kfree(forget); + continue; + } + + sg_init_one(&sg, forget, sizeof(*forget)); + + /* Enqueue the request */ + dev_dbg(&vq->vdev->dev, "%s\n", __func__); + ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC); + if (ret < 0) { + if (ret == -ENOMEM || ret == -ENOSPC) { + pr_debug("virtio-fs: Could not queue FORGET: err=%d. Will try later\n", + ret); + list_add_tail(&forget->list, + &fsvq->queued_reqs); + schedule_delayed_work(&fsvq->dispatch_work, + msecs_to_jiffies(1)); + } else { + pr_debug("virtio-fs: Could not queue FORGET: err=%d. Dropping it.\n", + ret); + kfree(forget); + } + spin_unlock(&fsvq->lock); + return; + } + + fsvq->in_flight++; + notify = virtqueue_kick_prepare(vq); + spin_unlock(&fsvq->lock); + + if (notify) + virtqueue_notify(vq); + pr_debug("virtio-fs: worker %s dispatched one forget request.\n", + __func__); + } +} + +/* Allocate and copy args into req->argbuf */ +static int copy_args_to_argbuf(struct fuse_req *req) +{ + unsigned int offset = 0; + unsigned int num_in; + unsigned int num_out; + unsigned int len; + unsigned int i; + + num_in = req->in.numargs - req->in.argpages; + num_out = req->out.numargs - req->out.argpages; + len = fuse_len_args(num_in, (struct fuse_arg *)req->in.args) + + fuse_len_args(num_out, req->out.args); + + req->argbuf = kmalloc(len, GFP_ATOMIC); + if (!req->argbuf) + return -ENOMEM; + + for (i = 0; i < num_in; i++) { + memcpy(req->argbuf + offset, + req->in.args[i].value, + req->in.args[i].size); + offset += req->in.args[i].size; + } + + return 0; +} + +/* Copy args out of and free req->argbuf */ +static void copy_args_from_argbuf(struct fuse_req *req) +{ + unsigned int remaining; + unsigned int offset; + unsigned int num_in; + unsigned int num_out; + unsigned int i; + + remaining = req->out.h.len - sizeof(req->out.h); + num_in = req->in.numargs - req->in.argpages; + num_out = req->out.numargs - req->out.argpages; + offset = fuse_len_args(num_in, (struct fuse_arg *)req->in.args); + + for (i = 0; i < num_out; i++) { + unsigned int argsize = req->out.args[i].size; + + if (req->out.argvar && + i == req->out.numargs - 1 && + argsize > remaining) { + argsize = remaining; + } + + memcpy(req->out.args[i].value, req->argbuf + offset, argsize); + offset += argsize; + + if (i != req->out.numargs - 1) + remaining -= argsize; + } + + /* Store the actual size of the variable-length arg */ + if (req->out.argvar) + req->out.args[req->out.numargs - 1].size = remaining; + + kfree(req->argbuf); + req->argbuf = NULL; +} + +/* Work function for request completion */ +static void virtio_fs_requests_done_work(struct work_struct *work) +{ + struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq, + done_work); + struct fuse_pqueue *fpq = &fsvq->fud->pq; + struct fuse_conn *fc = fsvq->fud->fc; + struct virtqueue *vq = fsvq->vq; + struct fuse_req *req; + struct fuse_req *next; + unsigned int len, i, thislen; + struct page *page; + LIST_HEAD(reqs); + + /* Collect completed requests off the virtqueue */ + spin_lock(&fsvq->lock); + do { + virtqueue_disable_cb(vq); + + while ((req = virtqueue_get_buf(vq, &len)) != NULL) { + spin_lock(&fpq->lock); + list_move_tail(&req->list, &reqs); + spin_unlock(&fpq->lock); + } + } while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq))); + spin_unlock(&fsvq->lock); + + /* End requests */ + list_for_each_entry_safe(req, next, &reqs, list) { + /* TODO check unique */ + /* TODO fuse_len_args(out) against oh.len */ + + copy_args_from_argbuf(req); + + if (req->out.argpages && req->out.page_zeroing) { + len = req->out.args[req->out.numargs - 1].size; + for (i = 0; i < req->num_pages; i++) { + thislen = req->page_descs[i].length; + if (len < thislen) { + WARN_ON(req->page_descs[i].offset); + page = req->pages[i]; + zero_user_segment(page, len, thislen); + len = 0; + } else { + len -= thislen; + } + } + } + + spin_lock(&fpq->lock); + clear_bit(FR_SENT, &req->flags); + list_del_init(&req->list); + spin_unlock(&fpq->lock); + + fuse_request_end(fc, req); + } +} + +/* Virtqueue interrupt handler */ +static void virtio_fs_vq_done(struct virtqueue *vq) +{ + struct virtio_fs_vq *fsvq = vq_to_fsvq(vq); + + dev_dbg(&vq->vdev->dev, "%s %s\n", __func__, fsvq->name); + + schedule_work(&fsvq->done_work); +} + +/* Initialize virtqueues */ +static int virtio_fs_setup_vqs(struct virtio_device *vdev, + struct virtio_fs *fs) +{ + struct virtqueue **vqs; + vq_callback_t **callbacks; + const char **names; + unsigned int i; + int ret; + + virtio_cread(vdev, struct virtio_fs_config, num_queues, + &fs->num_queues); + if (fs->num_queues == 0) + return -EINVAL; + + fs->nvqs = 1 + fs->num_queues; + + fs->vqs = devm_kcalloc(&vdev->dev, fs->nvqs, + sizeof(fs->vqs[VQ_HIPRIO]), GFP_KERNEL); + if (!fs->vqs) + return -ENOMEM; + + vqs = kmalloc_array(fs->nvqs, sizeof(vqs[VQ_HIPRIO]), GFP_KERNEL); + callbacks = kmalloc_array(fs->nvqs, sizeof(callbacks[VQ_HIPRIO]), + GFP_KERNEL); + names = kmalloc_array(fs->nvqs, sizeof(names[VQ_HIPRIO]), GFP_KERNEL); + if (!vqs || !callbacks || !names) { + ret = -ENOMEM; + goto out; + } + + callbacks[VQ_HIPRIO] = virtio_fs_vq_done; + snprintf(fs->vqs[VQ_HIPRIO].name, sizeof(fs->vqs[VQ_HIPRIO].name), + "hiprio"); + names[VQ_HIPRIO] = fs->vqs[VQ_HIPRIO].name; + INIT_WORK(&fs->vqs[VQ_HIPRIO].done_work, virtio_fs_hiprio_done_work); + INIT_LIST_HEAD(&fs->vqs[VQ_HIPRIO].queued_reqs); + INIT_DELAYED_WORK(&fs->vqs[VQ_HIPRIO].dispatch_work, + virtio_fs_hiprio_dispatch_work); + spin_lock_init(&fs->vqs[VQ_HIPRIO].lock); + + /* Initialize the requests virtqueues */ + for (i = VQ_REQUEST; i < fs->nvqs; i++) { + spin_lock_init(&fs->vqs[i].lock); + INIT_WORK(&fs->vqs[i].done_work, virtio_fs_requests_done_work); + INIT_DELAYED_WORK(&fs->vqs[i].dispatch_work, + virtio_fs_dummy_dispatch_work); + INIT_LIST_HEAD(&fs->vqs[i].queued_reqs); + snprintf(fs->vqs[i].name, sizeof(fs->vqs[i].name), + "requests.%u", i - VQ_REQUEST); + callbacks[i] = virtio_fs_vq_done; + names[i] = fs->vqs[i].name; + } + + ret = virtio_find_vqs(vdev, fs->nvqs, vqs, callbacks, names, NULL); + if (ret < 0) + goto out; + + for (i = 0; i < fs->nvqs; i++) { + fs->vqs[i].vq = vqs[i]; + fs->vqs[i].connected = true; + } +out: + kfree(names); + kfree(callbacks); + kfree(vqs); + return ret; +} + +/* Free virtqueues (device must already be reset) */ +static void virtio_fs_cleanup_vqs(struct virtio_device *vdev, + struct virtio_fs *fs) +{ + vdev->config->del_vqs(vdev); +} + +static int virtio_fs_probe(struct virtio_device *vdev) +{ + struct virtio_fs *fs; + int ret; + + fs = devm_kzalloc(&vdev->dev, sizeof(*fs), GFP_KERNEL); + if (!fs) + return -ENOMEM; + vdev->priv = fs; + + ret = virtio_fs_read_tag(vdev, fs); + if (ret < 0) + goto out; + + ret = virtio_fs_setup_vqs(vdev, fs); + if (ret < 0) + goto out; + + /* TODO vq affinity */ + /* TODO populate notifications vq */ + + /* Bring the device online in case the filesystem is mounted and + * requests need to be sent before we return. + */ + virtio_device_ready(vdev); + + ret = virtio_fs_add_instance(fs); + if (ret < 0) + goto out_vqs; + + return 0; + +out_vqs: + vdev->config->reset(vdev); + virtio_fs_cleanup_vqs(vdev, fs); + +out: + vdev->priv = NULL; + return ret; +} + +static void virtio_fs_remove(struct virtio_device *vdev) +{ + struct virtio_fs *fs = vdev->priv; + + virtio_fs_free_devs(fs); + + vdev->config->reset(vdev); + virtio_fs_cleanup_vqs(vdev, fs); + + mutex_lock(&virtio_fs_mutex); + list_del(&fs->list); + mutex_unlock(&virtio_fs_mutex); + + vdev->priv = NULL; +} + +#ifdef CONFIG_PM_SLEEP +static int virtio_fs_freeze(struct virtio_device *vdev) +{ + return 0; /* TODO */ +} + +static int virtio_fs_restore(struct virtio_device *vdev) +{ + return 0; /* TODO */ +} +#endif /* CONFIG_PM_SLEEP */ + +const static struct virtio_device_id id_table[] = { + { VIRTIO_ID_FS, VIRTIO_DEV_ANY_ID }, + {}, +}; + +const static unsigned int feature_table[] = {}; + +static struct virtio_driver virtio_fs_driver = { + .driver.name = KBUILD_MODNAME, + .driver.owner = THIS_MODULE, + .id_table = id_table, + .feature_table = feature_table, + .feature_table_size = ARRAY_SIZE(feature_table), + /* TODO validate config_get != NULL */ + .probe = virtio_fs_probe, + .remove = virtio_fs_remove, +#ifdef CONFIG_PM_SLEEP + .freeze = virtio_fs_freeze, + .restore = virtio_fs_restore, +#endif +}; + +static void virtio_fs_wake_forget_and_unlock(struct fuse_iqueue *fiq) +__releases(fiq->waitq.lock) +{ + struct fuse_forget_link *link; + struct virtio_fs_forget *forget; + struct scatterlist sg; + struct scatterlist *sgs[] = {&sg}; + struct virtio_fs *fs; + struct virtqueue *vq; + struct virtio_fs_vq *fsvq; + bool notify; + u64 unique; + int ret; + + link = fuse_dequeue_forget(fiq, 1, NULL); + unique = fuse_get_unique(fiq); + + fs = fiq->priv; + fsvq = &fs->vqs[VQ_HIPRIO]; + spin_unlock(&fiq->waitq.lock); + + /* Allocate a buffer for the request */ + forget = kmalloc(sizeof(*forget), GFP_NOFS | __GFP_NOFAIL); + + forget->ih = (struct fuse_in_header){ + .opcode = FUSE_FORGET, + .nodeid = link->forget_one.nodeid, + .unique = unique, + .len = sizeof(*forget), + }; + forget->arg = (struct fuse_forget_in){ + .nlookup = link->forget_one.nlookup, + }; + + sg_init_one(&sg, forget, sizeof(*forget)); + + /* Enqueue the request */ + vq = fsvq->vq; + dev_dbg(&vq->vdev->dev, "%s\n", __func__); + spin_lock(&fsvq->lock); + + ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC); + if (ret < 0) { + if (ret == -ENOMEM || ret == -ENOSPC) { + pr_debug("virtio-fs: Could not queue FORGET: err=%d. Will try later.\n", + ret); + list_add_tail(&forget->list, &fsvq->queued_reqs); + schedule_delayed_work(&fsvq->dispatch_work, + msecs_to_jiffies(1)); + } else { + pr_debug("virtio-fs: Could not queue FORGET: err=%d. Dropping it.\n", + ret); + kfree(forget); + } + spin_unlock(&fsvq->lock); + goto out; + } + + fsvq->in_flight++; + notify = virtqueue_kick_prepare(vq); + + spin_unlock(&fsvq->lock); + + if (notify) + virtqueue_notify(vq); +out: + kfree(link); +} + +static void virtio_fs_wake_interrupt_and_unlock(struct fuse_iqueue *fiq) +__releases(fiq->waitq.lock) +{ + /* TODO */ + spin_unlock(&fiq->waitq.lock); +} + +/* Return the number of scatter-gather list elements required */ +static unsigned int sg_count_fuse_req(struct fuse_req *req) +{ + unsigned int total_sgs = 1 /* fuse_in_header */; + + if (req->in.numargs - req->in.argpages) + total_sgs += 1; + + if (req->in.argpages) + total_sgs += req->num_pages; + + if (!test_bit(FR_ISREPLY, &req->flags)) + return total_sgs; + + total_sgs += 1 /* fuse_out_header */; + + if (req->out.numargs - req->out.argpages) + total_sgs += 1; + + if (req->out.argpages) + total_sgs += req->num_pages; + + return total_sgs; +} + +/* Add pages to scatter-gather list and return number of elements used */ +static unsigned int sg_init_fuse_pages(struct scatterlist *sg, + struct page **pages, + struct fuse_page_desc *page_descs, + unsigned int num_pages, + unsigned int total_len) +{ + unsigned int i; + unsigned int this_len; + + for (i = 0; i < num_pages && total_len; i++) { + sg_init_table(&sg[i], 1); + this_len = min(page_descs[i].length, total_len); + sg_set_page(&sg[i], pages[i], this_len, page_descs[i].offset); + total_len -= this_len; + } + + return i; +} + +/* Add args to scatter-gather list and return number of elements used */ +static unsigned int sg_init_fuse_args(struct scatterlist *sg, + struct fuse_req *req, + struct fuse_arg *args, + unsigned int numargs, + bool argpages, + void *argbuf, + unsigned int *len_used) +{ + unsigned int total_sgs = 0; + unsigned int len; + + len = fuse_len_args(numargs - argpages, args); + if (len) + sg_init_one(&sg[total_sgs++], argbuf, len); + + if (argpages) + total_sgs += sg_init_fuse_pages(&sg[total_sgs], + req->pages, + req->page_descs, + req->num_pages, + args[numargs - 1].size); + + if (len_used) + *len_used = len; + + return total_sgs; +} + +/* Add a request to a virtqueue and kick the device */ +static int virtio_fs_enqueue_req(struct virtqueue *vq, struct fuse_req *req) +{ + /* requests need at least 4 elements */ + struct scatterlist *stack_sgs[6]; + struct scatterlist stack_sg[ARRAY_SIZE(stack_sgs)]; + struct scatterlist **sgs = stack_sgs; + struct scatterlist *sg = stack_sg; + struct virtio_fs_vq *fsvq; + unsigned int argbuf_used = 0; + unsigned int out_sgs = 0; + unsigned int in_sgs = 0; + unsigned int total_sgs; + unsigned int i; + int ret; + bool notify; + + /* Does the sglist fit on the stack? */ + total_sgs = sg_count_fuse_req(req); + if (total_sgs > ARRAY_SIZE(stack_sgs)) { + sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC); + sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC); + if (!sgs || !sg) { + ret = -ENOMEM; + goto out; + } + } + + /* Use a bounce buffer since stack args cannot be mapped */ + ret = copy_args_to_argbuf(req); + if (ret < 0) + goto out; + + /* Request elements */ + sg_init_one(&sg[out_sgs++], &req->in.h, sizeof(req->in.h)); + out_sgs += sg_init_fuse_args(&sg[out_sgs], req, + (struct fuse_arg *)req->in.args, + req->in.numargs, req->in.argpages, + req->argbuf, &argbuf_used); + + /* Reply elements */ + if (test_bit(FR_ISREPLY, &req->flags)) { + sg_init_one(&sg[out_sgs + in_sgs++], + &req->out.h, sizeof(req->out.h)); + in_sgs += sg_init_fuse_args(&sg[out_sgs + in_sgs], req, + req->out.args, req->out.numargs, + req->out.argpages, + req->argbuf + argbuf_used, NULL); + } + + WARN_ON(out_sgs + in_sgs != total_sgs); + + for (i = 0; i < total_sgs; i++) + sgs[i] = &sg[i]; + + fsvq = vq_to_fsvq(vq); + spin_lock(&fsvq->lock); + + ret = virtqueue_add_sgs(vq, sgs, out_sgs, in_sgs, req, GFP_ATOMIC); + if (ret < 0) { + /* TODO handle full virtqueue */ + spin_unlock(&fsvq->lock); + goto out; + } + + notify = virtqueue_kick_prepare(vq); + + spin_unlock(&fsvq->lock); + + if (notify) + virtqueue_notify(vq); + +out: + if (ret < 0 && req->argbuf) { + kfree(req->argbuf); + req->argbuf = NULL; + } + if (sgs != stack_sgs) { + kfree(sgs); + kfree(sg); + } + + return ret; +} + +static void virtio_fs_wake_pending_and_unlock(struct fuse_iqueue *fiq) +__releases(fiq->waitq.lock) +{ + unsigned int queue_id = VQ_REQUEST; /* TODO multiqueue */ + struct virtio_fs *fs; + struct fuse_conn *fc; + struct fuse_req *req; + struct fuse_pqueue *fpq; + int ret; + + WARN_ON(list_empty(&fiq->pending)); + req = list_last_entry(&fiq->pending, struct fuse_req, list); + clear_bit(FR_PENDING, &req->flags); + list_del_init(&req->list); + WARN_ON(!list_empty(&fiq->pending)); + spin_unlock(&fiq->waitq.lock); + + fs = fiq->priv; + fc = fs->vqs[queue_id].fud->fc; + + dev_dbg(&fs->vqs[queue_id].vq->vdev->dev, + "%s: opcode %u unique %#llx nodeid %#llx in.len %u out.len %u\n", + __func__, req->in.h.opcode, req->in.h.unique, req->in.h.nodeid, + req->in.h.len, fuse_len_args(req->out.numargs, req->out.args)); + + fpq = &fs->vqs[queue_id].fud->pq; + spin_lock(&fpq->lock); + if (!fpq->connected) { + spin_unlock(&fpq->lock); + req->out.h.error = -ENODEV; + pr_err("virtio-fs: %s disconnected\n", __func__); + fuse_request_end(fc, req); + return; + } + list_add_tail(&req->list, fpq->processing); + spin_unlock(&fpq->lock); + set_bit(FR_SENT, &req->flags); + /* matches barrier in request_wait_answer() */ + smp_mb__after_atomic(); + /* TODO check for FR_INTERRUPTED? */ + +retry: + ret = virtio_fs_enqueue_req(fs->vqs[queue_id].vq, req); + if (ret < 0) { + if (ret == -ENOMEM || ret == -ENOSPC) { + /* Virtqueue full. Retry submission */ + usleep_range(20, 30); + goto retry; + } + req->out.h.error = ret; + pr_err("virtio-fs: virtio_fs_enqueue_req() failed %d\n", ret); + fuse_request_end(fc, req); + return; + } +} + +static void virtio_fs_flush_hiprio_queue(struct virtio_fs_vq *fsvq) +{ + struct virtio_fs_forget *forget; + + WARN_ON(fsvq->in_flight < 0); + + /* Go through pending forget requests and free them */ + spin_lock(&fsvq->lock); + while (1) { + forget = list_first_entry_or_null(&fsvq->queued_reqs, + struct virtio_fs_forget, list); + if (!forget) + break; + list_del(&forget->list); + kfree(forget); + } + + spin_unlock(&fsvq->lock); + + /* Wait for in flight requests to finish.*/ + while (1) { + spin_lock(&fsvq->lock); + if (!fsvq->in_flight) { + spin_unlock(&fsvq->lock); + break; + } + spin_unlock(&fsvq->lock); + usleep_range(1000, 2000); + } +} + +const static struct fuse_iqueue_ops virtio_fs_fiq_ops = { + .wake_forget_and_unlock = virtio_fs_wake_forget_and_unlock, + .wake_interrupt_and_unlock = virtio_fs_wake_interrupt_and_unlock, + .wake_pending_and_unlock = virtio_fs_wake_pending_and_unlock, +}; + +static int virtio_fs_fill_super(struct super_block *sb) +{ + struct fuse_conn *fc = get_fuse_conn_super(sb); + struct virtio_fs *fs = fc->iq.priv; + unsigned int i; + int err; + struct fuse_req *init_req; + struct fuse_fs_context ctx = { + .rootmode = S_IFDIR, + .default_permissions = 1, + .allow_other = 1, + .max_read = UINT_MAX, + .blksize = 512, + .destroy = true, + .no_control = true, + .no_force_umount = true, + }; + + /* TODO lock */ + if (fs->vqs[VQ_REQUEST].fud) { + pr_err("virtio-fs: device already in use\n"); + err = -EBUSY; + goto err; + } + + err = -ENOMEM; + /* Allocate fuse_dev for hiprio and notification queues */ + for (i = 0; i < VQ_REQUEST; i++) { + struct virtio_fs_vq *fsvq = &fs->vqs[i]; + + fsvq->fud = fuse_dev_alloc(); + if (!fsvq->fud) + goto err_free_fuse_devs; + } + + init_req = fuse_request_alloc(0); + if (!init_req) + goto err_free_fuse_devs; + __set_bit(FR_BACKGROUND, &init_req->flags); + + ctx.fudptr = (void **)&fs->vqs[VQ_REQUEST].fud; + err = fuse_fill_super_common(sb, &ctx); + if (err < 0) + goto err_free_init_req; + + fc = fs->vqs[VQ_REQUEST].fud->fc; + + /* TODO take fuse_mutex around this loop? */ + for (i = 0; i < fs->nvqs; i++) { + struct virtio_fs_vq *fsvq = &fs->vqs[i]; + + if (i == VQ_REQUEST) + continue; /* already initialized */ + fuse_dev_install(fsvq->fud, fc); + atomic_inc(&fc->dev_count); + } + + fuse_send_init(fc, init_req); + return 0; + +err_free_init_req: + fuse_request_free(init_req); +err_free_fuse_devs: + for (i = 0; i < fs->nvqs; i++) + fuse_dev_free(fs->vqs[i].fud); +err: + return err; +} + +static void virtio_kill_sb(struct super_block *sb) +{ + struct fuse_conn *fc = get_fuse_conn_super(sb); + struct virtio_fs *vfs; + struct virtio_fs_vq *fsvq; + + /* If mount failed, we can still be called without any fc */ + if (!fc) + return fuse_kill_sb_anon(sb); + + vfs = fc->iq.priv; + fsvq = &vfs->vqs[VQ_HIPRIO]; + + /* Stop forget queue. Soon destroy will be sent */ + spin_lock(&fsvq->lock); + fsvq->connected = false; + spin_unlock(&fsvq->lock); + virtio_fs_flush_hiprio_queue(fsvq); + + fuse_kill_sb_anon(sb); + virtio_fs_free_devs(vfs); +} + +static int virtio_fs_test_super(struct super_block *sb, + struct fs_context *fsc) +{ + struct fuse_conn *fc = fsc->s_fs_info; + + return fc->iq.priv == get_fuse_conn_super(sb)->iq.priv; +} + +static int virtio_fs_set_super(struct super_block *sb, + struct fs_context *fsc) +{ + int err; + + err = get_anon_bdev(&sb->s_dev); + if (!err) + fuse_conn_get(fsc->s_fs_info); + + return err; +} + +static int virtio_fs_get_tree(struct fs_context *fsc) +{ + struct virtio_fs *fs; + struct super_block *sb; + struct fuse_conn *fc; + int err; + + fs = virtio_fs_find_instance(fsc->source); + if (!fs) { + pr_info("virtio-fs: tag <%s> not found\n", fsc->source); + return -EINVAL; + } + + fc = kzalloc(sizeof(struct fuse_conn), GFP_KERNEL); + if (!fc) + return -ENOMEM; + + fuse_conn_init(fc, get_user_ns(current_user_ns()), &virtio_fs_fiq_ops, + fs); + fc->release = fuse_free_conn; + fc->delete_stale = true; + + fsc->s_fs_info = fc; + sb = sget_fc(fsc, virtio_fs_test_super, virtio_fs_set_super); + fuse_conn_put(fc); + if (IS_ERR(sb)) + return PTR_ERR(sb); + + if (!sb->s_root) { + err = virtio_fs_fill_super(sb); + if (err) { + deactivate_locked_super(sb); + return err; + } + + sb->s_flags |= SB_ACTIVE; + } + + WARN_ON(fsc->root); + fsc->root = dget(sb->s_root); + return 0; +} + +static const struct fs_context_operations virtio_fs_context_ops = { + .get_tree = virtio_fs_get_tree, +}; + +static int virtio_fs_init_fs_context(struct fs_context *fsc) +{ + fsc->ops = &virtio_fs_context_ops; + return 0; +} + +static struct file_system_type virtio_fs_type = { + .owner = THIS_MODULE, + .name = "virtiofs", + .init_fs_context = virtio_fs_init_fs_context, + .kill_sb = virtio_kill_sb, +}; + +static int __init virtio_fs_init(void) +{ + int ret; + + ret = register_virtio_driver(&virtio_fs_driver); + if (ret < 0) + return ret; + + ret = register_filesystem(&virtio_fs_type); + if (ret < 0) { + unregister_virtio_driver(&virtio_fs_driver); + return ret; + } + + return 0; +} +module_init(virtio_fs_init); + +static void __exit virtio_fs_exit(void) +{ + unregister_filesystem(&virtio_fs_type); + unregister_virtio_driver(&virtio_fs_driver); +} +module_exit(virtio_fs_exit); + +MODULE_AUTHOR("Stefan Hajnoczi <stefanha@redhat.com>"); +MODULE_DESCRIPTION("Virtio Filesystem"); +MODULE_LICENSE("GPL"); +MODULE_ALIAS_FS(KBUILD_MODNAME); +MODULE_DEVICE_TABLE(virtio, id_table); diff --git a/include/uapi/linux/virtio_fs.h b/include/uapi/linux/virtio_fs.h new file mode 100644 index 000000000000..b5e99c217c86 --- /dev/null +++ b/include/uapi/linux/virtio_fs.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */ + +#ifndef _UAPI_LINUX_VIRTIO_FS_H +#define _UAPI_LINUX_VIRTIO_FS_H + +#include <linux/types.h> +#include <linux/virtio_ids.h> +#include <linux/virtio_config.h> +#include <linux/virtio_types.h> + +struct virtio_fs_config { + /* Filesystem name (UTF-8, not NUL-terminated, padded with NULs) */ + __u8 tag[36]; + + /* Number of request queues */ + __u32 num_queues; +} __attribute__((packed)); + +#endif /* _UAPI_LINUX_VIRTIO_FS_H */ diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h index 348fd0176f75..585e07b27333 100644 --- a/include/uapi/linux/virtio_ids.h +++ b/include/uapi/linux/virtio_ids.h @@ -44,6 +44,7 @@ #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */ #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */ #define VIRTIO_ID_IOMMU 23 /* virtio IOMMU */ +#define VIRTIO_ID_FS 26 /* virtio filesystem */ #define VIRTIO_ID_PMEM 27 /* virtio pmem */ #endif /* _LINUX_VIRTIO_IDS_H */