Message ID | 20210520154654.1791183-6-groug@kaod.org (mailing list archive) |
---|---|
State | New, archived |
Series | virtiofs: propagate sync() to file server |
On Thu, 20 May 2021 17:46:54 +0200
Greg Kurz <groug@kaod.org> wrote:

> Even if POSIX doesn't mandate it, linux users legitimately expect
> sync() to flush all data and metadata to physical storage when it
> is located on the same system. This isn't happening with virtiofs
> though : sync() inside the guest returns right away even though
> data still needs to be flushed from the host page cache.
>
> This is easily demonstrated by doing the following in the guest:
>
> $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync
> 5120+0 records in
> 5120+0 records out
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s
> sync() = 0 <0.024068>
> +++ exited with 0 +++
>
> and start the following in the host when the 'dd' command completes
> in the guest:
>
> $ strace -T -e fsync /usr/bin/sync virtiofs/foo
> fsync(3) = 0 <10.371640>
> +++ exited with 0 +++
>
> There are no good reasons not to honor the expected behavior of
> sync() actually : it gives an unrealistic impression that virtiofs
> is super fast and that data has safely landed on HW, which isn't
> the case obviously.
>
> Implement a ->sync_fs() superblock operation that sends a new
> FUSE_SYNCFS request type for this purpose. Provision a 64-bit
> placeholder for possible future extensions. Since the file
> server cannot handle the wait == 0 case, we skip it to avoid a
> gratuitous roundtrip. Note that this is per-superblock : a
> FUSE_SYNCFS is send for the root mount and for each submount.
>

s/send/sent

Miklos,

Great thanks for the quick feedback on these patches ! :)

Apart from the fact that nothing is sent for submounts as long as we don't set SB_BORN on them, this patch doesn't really depend on the previous ones. If it looks good to you, maybe you can just merge it and I'll re-post the fixes separately ?

Cheers,

--
Greg

> Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for
> FUSE_SYNCFS in the file server is treated as permanent success.
> This ensures compatibility with older file servers : the client > will get the current behavior of sync() not being propagated to > the file server. > > Note that such an operation allows the file server to DoS sync(). > Since a typical FUSE file server is an untrusted piece of software > running in userspace, this is disabled by default. Only enable it > with virtiofs for now since virtiofsd is supposedly trusted by the > guest kernel. > > Reported-by: Robert Krawitz <rlk@redhat.com> > Signed-off-by: Greg Kurz <groug@kaod.org> > --- > fs/fuse/fuse_i.h | 3 +++ > fs/fuse/inode.c | 40 +++++++++++++++++++++++++++++++++++++++ > fs/fuse/virtio_fs.c | 1 + > include/uapi/linux/fuse.h | 10 +++++++++- > 4 files changed, 53 insertions(+), 1 deletion(-) > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > index e2f5c8617e0d..01d9283261af 100644 > --- a/fs/fuse/fuse_i.h > +++ b/fs/fuse/fuse_i.h > @@ -761,6 +761,9 @@ struct fuse_conn { > /* Auto-mount submounts announced by the server */ > unsigned int auto_submounts:1; > > + /* Propagate syncfs() to server */ > + unsigned int sync_fs:1; > + > /** The number of requests waiting for completion */ > atomic_t num_waiting; > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > index 123b53d1c3c6..96b00253f766 100644 > --- a/fs/fuse/inode.c > +++ b/fs/fuse/inode.c > @@ -506,6 +506,45 @@ static int fuse_statfs(struct dentry *dentry, struct kstatfs *buf) > return err; > } > > +static int fuse_sync_fs(struct super_block *sb, int wait) > +{ > + struct fuse_mount *fm = get_fuse_mount_super(sb); > + struct fuse_conn *fc = fm->fc; > + struct fuse_syncfs_in inarg; > + FUSE_ARGS(args); > + int err; > + > + /* > + * Userspace cannot handle the wait == 0 case. Avoid a > + * gratuitous roundtrip. > + */ > + if (!wait) > + return 0; > + > + /* The filesystem is being unmounted. Nothing to do. 
*/ > + if (!sb->s_root) > + return 0; > + > + if (!fc->sync_fs) > + return 0; > + > + memset(&inarg, 0, sizeof(inarg)); > + args.in_numargs = 1; > + args.in_args[0].size = sizeof(inarg); > + args.in_args[0].value = &inarg; > + args.opcode = FUSE_SYNCFS; > + args.nodeid = get_node_id(sb->s_root->d_inode); > + args.out_numargs = 0; > + > + err = fuse_simple_request(fm, &args); > + if (err == -ENOSYS) { > + fc->sync_fs = 0; > + err = 0; > + } > + > + return err; > +} > + > enum { > OPT_SOURCE, > OPT_SUBTYPE, > @@ -909,6 +948,7 @@ static const struct super_operations fuse_super_operations = { > .put_super = fuse_put_super, > .umount_begin = fuse_umount_begin, > .statfs = fuse_statfs, > + .sync_fs = fuse_sync_fs, > .show_options = fuse_show_options, > }; > > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c > index 8962cd033016..f649a47efb68 100644 > --- a/fs/fuse/virtio_fs.c > +++ b/fs/fuse/virtio_fs.c > @@ -1455,6 +1455,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc) > fc->release = fuse_free_conn; > fc->delete_stale = true; > fc->auto_submounts = true; > + fc->sync_fs = true; > > /* Tell FUSE to split requests that exceed the virtqueue's size */ > fc->max_pages_limit = min_t(unsigned int, fc->max_pages_limit, > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h > index 271ae90a9bb7..36ed092227fa 100644 > --- a/include/uapi/linux/fuse.h > +++ b/include/uapi/linux/fuse.h > @@ -181,6 +181,9 @@ > * - add FUSE_OPEN_KILL_SUIDGID > * - extend fuse_setxattr_in, add FUSE_SETXATTR_EXT > * - add FUSE_SETXATTR_ACL_KILL_SGID > + * > + * 7.34 > + * - add FUSE_SYNCFS > */ > > #ifndef _LINUX_FUSE_H > @@ -216,7 +219,7 @@ > #define FUSE_KERNEL_VERSION 7 > > /** Minor version number of this interface */ > -#define FUSE_KERNEL_MINOR_VERSION 33 > +#define FUSE_KERNEL_MINOR_VERSION 34 > > /** The node ID of the root inode */ > #define FUSE_ROOT_ID 1 > @@ -509,6 +512,7 @@ enum fuse_opcode { > FUSE_COPY_FILE_RANGE = 47, > FUSE_SETUPMAPPING = 48, > 
FUSE_REMOVEMAPPING = 49, > + FUSE_SYNCFS = 50, > > /* CUSE specific operations */ > CUSE_INIT = 4096, > @@ -971,4 +975,8 @@ struct fuse_removemapping_one { > #define FUSE_REMOVEMAPPING_MAX_ENTRY \ > (PAGE_SIZE / sizeof(struct fuse_removemapping_one)) > > +struct fuse_syncfs_in { > + uint64_t padding; > +}; > + > #endif /* _LINUX_FUSE_H */
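The control flow of fuse_sync_fs() in the patch above can be modeled in userspace to make the "permanent success" fallback easier to follow. This is only a sketch under stated assumptions: `struct conn_model`, `send_syncfs()` and `sync_fs_model()` are hypothetical stand-ins for `struct fuse_conn`, `fuse_simple_request()` and `fuse_sync_fs()`, and the unmount check on `sb->s_root` is left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct fuse_conn: only what the sketch needs. */
struct conn_model {
	bool sync_fs;       /* propagate syncfs() to the server */
	int requests_sent;  /* bookkeeping for the demonstration */
	int server_reply;   /* what the modeled server answers, e.g. 0 or -ENOSYS */
};

/* Hypothetical stand-in for fuse_simple_request() with a FUSE_SYNCFS opcode. */
static int send_syncfs(struct conn_model *c)
{
	c->requests_sent++;
	return c->server_reply;
}

/* Mirrors the decision logic of fuse_sync_fs() from the patch. */
static int sync_fs_model(struct conn_model *c, int wait)
{
	int err;

	/* The server cannot handle wait == 0: skip the roundtrip. */
	if (!wait)
		return 0;

	/* Disabled by default, or disabled after an earlier -ENOSYS. */
	if (!c->sync_fs)
		return 0;

	err = send_syncfs(c);
	if (err == -ENOSYS) {
		/* Old server: remember that and report success from now on. */
		c->sync_fs = false;
		err = 0;
	}
	return err;
}
```

With a modeled server that answers -ENOSYS, the first waiting call sends one request and clears the flag; later calls return 0 without sending anything, which matches the compatibility behavior described in the commit message.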
On Fri, 21 May 2021 at 12:09, Greg Kurz <groug@kaod.org> wrote:

> If it looks good to you, maybe you can just merge it and
> I'll re-post the fixes separately ?

Looks good, applied.

Thanks,
Miklos
Hi Greg, Sorry for the late reply, I have some questions about this change... On Fri, May 21, 2021 at 9:12 AM Greg Kurz <groug@kaod.org> wrote: > > Even if POSIX doesn't mandate it, linux users legitimately expect > sync() to flush all data and metadata to physical storage when it > is located on the same system. This isn't happening with virtiofs > though : sync() inside the guest returns right away even though > data still needs to be flushed from the host page cache. > > This is easily demonstrated by doing the following in the guest: > > $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync > 5120+0 records in > 5120+0 records out > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s > sync() = 0 <0.024068> > +++ exited with 0 +++ > > and start the following in the host when the 'dd' command completes > in the guest: > > $ strace -T -e fsync /usr/bin/sync virtiofs/foo > fsync(3) = 0 <10.371640> > +++ exited with 0 +++ > > There are no good reasons not to honor the expected behavior of > sync() actually : it gives an unrealistic impression that virtiofs > is super fast and that data has safely landed on HW, which isn't > the case obviously. > > Implement a ->sync_fs() superblock operation that sends a new > FUSE_SYNCFS request type for this purpose. Provision a 64-bit > placeholder for possible future extensions. Since the file > server cannot handle the wait == 0 case, we skip it to avoid a > gratuitous roundtrip. Note that this is per-superblock : a > FUSE_SYNCFS is send for the root mount and for each submount. > > Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for > FUSE_SYNCFS in the file server is treated as permanent success. > This ensures compatibility with older file servers : the client > will get the current behavior of sync() not being propagated to > the file server. 
I wonder - even if the server does not support SYNCFS or if the kernel does not trust the server with SYNCFS, fuse_sync_fs() can wait until all pending requests up to this call have been completed, either before or after submitting the SYNCFS request. No?

Does virtiofsd track all requests prior to SYNCFS request to make sure that they were executed on the host filesystem before calling syncfs() on the host filesystem?

I am not familiar enough with FUSE internals so there may already be a mechanism to track/wait for all pending requests?

>
> Note that such an operation allows the file server to DoS sync().
> Since a typical FUSE file server is an untrusted piece of software
> running in userspace, this is disabled by default. Only enable it
> with virtiofs for now since virtiofsd is supposedly trusted by the
> guest kernel.

Isn't there already a similar risk of DoS to sync() from the ability of any untrusted (or malfunctioning) server to block writes?

Thanks,
Amir.
On Sun, Aug 15, 2021 at 05:14:06PM +0300, Amir Goldstein wrote: > Hi Greg, > > Sorry for the late reply, I have some questions about this change... > > On Fri, May 21, 2021 at 9:12 AM Greg Kurz <groug@kaod.org> wrote: > > > > Even if POSIX doesn't mandate it, linux users legitimately expect > > sync() to flush all data and metadata to physical storage when it > > is located on the same system. This isn't happening with virtiofs > > though : sync() inside the guest returns right away even though > > data still needs to be flushed from the host page cache. > > > > This is easily demonstrated by doing the following in the guest: > > > > $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync > > 5120+0 records in > > 5120+0 records out > > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s > > sync() = 0 <0.024068> > > +++ exited with 0 +++ > > > > and start the following in the host when the 'dd' command completes > > in the guest: > > > > $ strace -T -e fsync /usr/bin/sync virtiofs/foo > > fsync(3) = 0 <10.371640> > > +++ exited with 0 +++ > > > > There are no good reasons not to honor the expected behavior of > > sync() actually : it gives an unrealistic impression that virtiofs > > is super fast and that data has safely landed on HW, which isn't > > the case obviously. > > > > Implement a ->sync_fs() superblock operation that sends a new > > FUSE_SYNCFS request type for this purpose. Provision a 64-bit > > placeholder for possible future extensions. Since the file > > server cannot handle the wait == 0 case, we skip it to avoid a > > gratuitous roundtrip. Note that this is per-superblock : a > > FUSE_SYNCFS is send for the root mount and for each submount. > > > > Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for > > FUSE_SYNCFS in the file server is treated as permanent success. 
> > This ensures compatibility with older file servers : the client
> > will get the current behavior of sync() not being propagated to
> > the file server.
>
> I wonder - even if the server does not support SYNCFS or if the kernel
> does not trust the server with SYNCFS, fuse_sync_fs() can wait
> until all pending requests up to this call have been completed, either
> before or after submitting the SYNCFS request. No?
>
> > Does virtiofsd track all requests prior to SYNCFS request to make
> sure that they were executed on the host filesystem before calling
> syncfs() on the host filesystem?

Hi Amir,

I don't think virtiofsd has any such notion. I would think that the client should make sure all pending writes have completed and then send the SYNCFS request.

Looking at sync_filesystem(), I am assuming vfs will take care of flushing out all dirty pages and then call ->sync_fs.

Having said that, I think fuse queues the writeback request internally and signals completion of writeback to mm (end_page_writeback()). And that's why fuse_fsync() has a notion of waiting for all pending writes to finish on an inode (fuse_sync_writes()).

So I think you have raised a good point. That is, if there are pending writes at the time of syncfs(), we don't seem to have a notion of first waiting for all these writes to finish before we send the FUSE_SYNCFS request to the server.

In case of virtiofs, we could probably move away from the notion of ending writeback immediately. IIUC, this was needed for regular fuse where we wanted to make sure a rogue/malfunctioning fuse server could not affect processes on the system which are not dealing with fuse. But in case of virtiofs, the guest is trusting the file server. I had tried to get rid of this for virtiofs but ran into some other issues which I could not resolve easily at the time, and then I got distracted by other things.

Anyway, irrespective of that, we probably need a way to flush out all pending writes with fuse and then send FUSE_SYNCFS.
(And also make sure writes coming after the call to fuse_sync_fs() continue to be queued and we don't livelock.)

BTW, in the context of virtiofs, this probably is a problem only with mmapped writes. Otherwise cache=auto and cache=none are basically writethrough caches, so a write is sent to the server immediately and there is nothing to be written back when syncfs() comes along. But mmapped writes are different, and even with cache=auto there can be dirty pages. (cache=none does not support mmap() at all.)

>
> I am not familiar enough with FUSE internals so there may already
> be a mechanism to track/wait for all pending requests?

fuse_sync_writes() does it for an inode. I am not aware of anything which can do it for the whole filesystem (all the inodes).

>
> >
> > Note that such an operation allows the file server to DoS sync().
> > Since a typical FUSE file server is an untrusted piece of software
> > running in userspace, this is disabled by default. Only enable it
> > with virtiofs for now since virtiofsd is supposedly trusted by the
> > guest kernel.
>
> Isn't there already a similar risk of DoS to sync() from the ability of any
> untrusted (or malfunctioning) server to block writes?

I think fuse has some safeguards for this. Fuse signals completion of writeback immediately so that vfs/mm/fs does not block trying to write back, and if the server is not finishing WRITEs fast enough, then there will be enough dirty pages in the bdi that it will create back pressure and block the process dirtying pages.

Thanks
Vivek
On Mon, Aug 16, 2021 at 6:29 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Sun, Aug 15, 2021 at 05:14:06PM +0300, Amir Goldstein wrote: > > Hi Greg, > > > > Sorry for the late reply, I have some questions about this change... > > > > On Fri, May 21, 2021 at 9:12 AM Greg Kurz <groug@kaod.org> wrote: > > > > > > Even if POSIX doesn't mandate it, linux users legitimately expect > > > sync() to flush all data and metadata to physical storage when it > > > is located on the same system. This isn't happening with virtiofs > > > though : sync() inside the guest returns right away even though > > > data still needs to be flushed from the host page cache. > > > > > > This is easily demonstrated by doing the following in the guest: > > > > > > $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync > > > 5120+0 records in > > > 5120+0 records out > > > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s > > > sync() = 0 <0.024068> > > > +++ exited with 0 +++ > > > > > > and start the following in the host when the 'dd' command completes > > > in the guest: > > > > > > $ strace -T -e fsync /usr/bin/sync virtiofs/foo > > > fsync(3) = 0 <10.371640> > > > +++ exited with 0 +++ > > > > > > There are no good reasons not to honor the expected behavior of > > > sync() actually : it gives an unrealistic impression that virtiofs > > > is super fast and that data has safely landed on HW, which isn't > > > the case obviously. > > > > > > Implement a ->sync_fs() superblock operation that sends a new > > > FUSE_SYNCFS request type for this purpose. Provision a 64-bit > > > placeholder for possible future extensions. Since the file > > > server cannot handle the wait == 0 case, we skip it to avoid a > > > gratuitous roundtrip. Note that this is per-superblock : a > > > FUSE_SYNCFS is send for the root mount and for each submount. 
> > > > > > Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for > > > FUSE_SYNCFS in the file server is treated as permanent success. > > > This ensures compatibility with older file servers : the client > > > will get the current behavior of sync() not being propagated to > > > the file server. > > > > I wonder - even if the server does not support SYNCFS or if the kernel > > does not trust the server with SYNCFS, fuse_sync_fs() can wait > > until all pending requests up to this call have been completed, either > > before or after submitting the SYNCFS request. No? > > > > > Does virtiofsd track all requests prior to SYNCFS request to make > > sure that they were executed on the host filesystem before calling > > syncfs() on the host filesystem? > > Hi Amir, > > I don't think virtiofsd has any such notion. I would think, that > client should make sure all pending writes have completed and > then send SYNCFS request. > > Looking at the sync_filesystem(), I am assuming vfs will take care > of flushing out all dirty pages and then call ->sync_fs. > > Having said that, I think fuse queues the writeback request internally > and signals completion of writeback to mm(end_page_writeback()). And > that's why fuse_fsync() has notion of waiting for all pending > writes to finish on an inode (fuse_sync_writes()). > > So I think you have raised a good point. That is if there are pending > writes at the time of syncfs(), we don't seem to have a notion of > first waiting for all these writes to finish before we send > FUSE_SYNCFS request to server. Maybe, but I was not referring to inode writeback requests. I had assumed that those were handled correctly. I was referring to pending metadata requests. ->sync_fs() in local fs also takes care of flushing metadata (e.g. journal). 
I assumed that virtiofsd implements the FUSE_SYNCFS request by calling syncfs() on the host fs, but if it does that then there is no guarantee that all metadata requests have reached the host fs from virtiofs unless the client or the server takes care of waiting for all pending metadata requests before issuing FUSE_SYNCFS.

But maybe I am missing something.

It might be worth mentioning that I did not find any sync_fs() commands that request to flush metadata caches on the server in NFS or SMB protocols either.

Thanks,
Amir.
On Mon, Aug 16, 2021 at 09:57:08PM +0300, Amir Goldstein wrote: > On Mon, Aug 16, 2021 at 6:29 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Sun, Aug 15, 2021 at 05:14:06PM +0300, Amir Goldstein wrote: > > > Hi Greg, > > > > > > Sorry for the late reply, I have some questions about this change... > > > > > > On Fri, May 21, 2021 at 9:12 AM Greg Kurz <groug@kaod.org> wrote: > > > > > > > > Even if POSIX doesn't mandate it, linux users legitimately expect > > > > sync() to flush all data and metadata to physical storage when it > > > > is located on the same system. This isn't happening with virtiofs > > > > though : sync() inside the guest returns right away even though > > > > data still needs to be flushed from the host page cache. > > > > > > > > This is easily demonstrated by doing the following in the guest: > > > > > > > > $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync > > > > 5120+0 records in > > > > 5120+0 records out > > > > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s > > > > sync() = 0 <0.024068> > > > > +++ exited with 0 +++ > > > > > > > > and start the following in the host when the 'dd' command completes > > > > in the guest: > > > > > > > > $ strace -T -e fsync /usr/bin/sync virtiofs/foo > > > > fsync(3) = 0 <10.371640> > > > > +++ exited with 0 +++ > > > > > > > > There are no good reasons not to honor the expected behavior of > > > > sync() actually : it gives an unrealistic impression that virtiofs > > > > is super fast and that data has safely landed on HW, which isn't > > > > the case obviously. > > > > > > > > Implement a ->sync_fs() superblock operation that sends a new > > > > FUSE_SYNCFS request type for this purpose. Provision a 64-bit > > > > placeholder for possible future extensions. Since the file > > > > server cannot handle the wait == 0 case, we skip it to avoid a > > > > gratuitous roundtrip. 
Note that this is per-superblock : a > > > > FUSE_SYNCFS is send for the root mount and for each submount. > > > > > > > > Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for > > > > FUSE_SYNCFS in the file server is treated as permanent success. > > > > This ensures compatibility with older file servers : the client > > > > will get the current behavior of sync() not being propagated to > > > > the file server. > > > > > > I wonder - even if the server does not support SYNCFS or if the kernel > > > does not trust the server with SYNCFS, fuse_sync_fs() can wait > > > until all pending requests up to this call have been completed, either > > > before or after submitting the SYNCFS request. No? > > > > > > > > Does virtiofsd track all requests prior to SYNCFS request to make > > > sure that they were executed on the host filesystem before calling > > > syncfs() on the host filesystem? > > > > Hi Amir, > > > > I don't think virtiofsd has any such notion. I would think, that > > client should make sure all pending writes have completed and > > then send SYNCFS request. > > > > Looking at the sync_filesystem(), I am assuming vfs will take care > > of flushing out all dirty pages and then call ->sync_fs. > > > > Having said that, I think fuse queues the writeback request internally > > and signals completion of writeback to mm(end_page_writeback()). And > > that's why fuse_fsync() has notion of waiting for all pending > > writes to finish on an inode (fuse_sync_writes()). > > > > So I think you have raised a good point. That is if there are pending > > writes at the time of syncfs(), we don't seem to have a notion of > > first waiting for all these writes to finish before we send > > FUSE_SYNCFS request to server. > > Maybe, but I was not referring to inode writeback requests. > I had assumed that those were handled correctly. > I was referring to pending metadata requests. > > ->sync_fs() in local fs also takes care of flushing metadata > (e.g. journal). 
I assumed that virtiofsd implements FUSE_SYNCFS
> request by calling syncfs() on host fs,

Yes, virtiofsd calls syncfs() on the host fs.

> but if it does that then
> there is no guarantee that all metadata requests have reached the
> host fs from virtiofs unless client or server take care of waiting
> for all pending metadata requests before issuing FUSE_SYNCFS.

We don't have any journal in virtiofs. In fact we don't seem to cache any metadata, except probably in the "-o writeback" case, where we can trust local time stamps.

If "-o writeback" is not enabled, I am not sure what metadata we will be caching that we will need to worry about. Do you have something specific in mind? (At least from the virtiofs point of view, I can't think of any metadata we are caching which we need to worry about.)

Thanks
Vivek

>
> But maybe I am missing something.
>
> It might be worth mentioning that I did not find any sync_fs()
> commands that request to flush metadata caches on the server in
> NFS or SMB protocols either.
>
> Thanks,
> Amir.
>
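For readers unfamiliar with the host side, the statement that the file server calls syncfs() on the host filesystem can be illustrated with a small userspace sketch. This is not virtiofsd code: `sync_shared_dir()` is a hypothetical helper, and the assumption is simply that the server holds (or can open) a directory inside the shared tree and flushes the filesystem containing it with the Linux-specific syncfs(2) call.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* syncfs() is a Linux/glibc extension */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush the host filesystem that contains `path`.
 * Returns 0 on success or a negative errno value, the convention a FUSE
 * server would typically use when building its reply.
 */
static int sync_shared_dir(const char *path)
{
	int fd, ret = 0;

	assert(path != NULL);

	fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* Blocks until dirty data and metadata of that filesystem are written. */
	if (syncfs(fd) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

Note that syncfs() only covers what has already reached the host filesystem, which is why the thread focuses on whether guest writes are still in flight when FUSE_SYNCFS is sent.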
On Mon, Aug 16, 2021 at 10:11 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Mon, Aug 16, 2021 at 09:57:08PM +0300, Amir Goldstein wrote: > > On Mon, Aug 16, 2021 at 6:29 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > On Sun, Aug 15, 2021 at 05:14:06PM +0300, Amir Goldstein wrote: > > > > Hi Greg, > > > > > > > > Sorry for the late reply, I have some questions about this change... > > > > > > > > On Fri, May 21, 2021 at 9:12 AM Greg Kurz <groug@kaod.org> wrote: > > > > > > > > > > Even if POSIX doesn't mandate it, linux users legitimately expect > > > > > sync() to flush all data and metadata to physical storage when it > > > > > is located on the same system. This isn't happening with virtiofs > > > > > though : sync() inside the guest returns right away even though > > > > > data still needs to be flushed from the host page cache. > > > > > > > > > > This is easily demonstrated by doing the following in the guest: > > > > > > > > > > $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync > > > > > 5120+0 records in > > > > > 5120+0 records out > > > > > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s > > > > > sync() = 0 <0.024068> > > > > > +++ exited with 0 +++ > > > > > > > > > > and start the following in the host when the 'dd' command completes > > > > > in the guest: > > > > > > > > > > $ strace -T -e fsync /usr/bin/sync virtiofs/foo > > > > > fsync(3) = 0 <10.371640> > > > > > +++ exited with 0 +++ > > > > > > > > > > There are no good reasons not to honor the expected behavior of > > > > > sync() actually : it gives an unrealistic impression that virtiofs > > > > > is super fast and that data has safely landed on HW, which isn't > > > > > the case obviously. > > > > > > > > > > Implement a ->sync_fs() superblock operation that sends a new > > > > > FUSE_SYNCFS request type for this purpose. Provision a 64-bit > > > > > placeholder for possible future extensions. 
Since the file > > > > > server cannot handle the wait == 0 case, we skip it to avoid a > > > > > gratuitous roundtrip. Note that this is per-superblock : a > > > > > FUSE_SYNCFS is send for the root mount and for each submount. > > > > > > > > > > Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for > > > > > FUSE_SYNCFS in the file server is treated as permanent success. > > > > > This ensures compatibility with older file servers : the client > > > > > will get the current behavior of sync() not being propagated to > > > > > the file server. > > > > > > > > I wonder - even if the server does not support SYNCFS or if the kernel > > > > does not trust the server with SYNCFS, fuse_sync_fs() can wait > > > > until all pending requests up to this call have been completed, either > > > > before or after submitting the SYNCFS request. No? > > > > > > > > > > > Does virtiofsd track all requests prior to SYNCFS request to make > > > > sure that they were executed on the host filesystem before calling > > > > syncfs() on the host filesystem? > > > > > > Hi Amir, > > > > > > I don't think virtiofsd has any such notion. I would think, that > > > client should make sure all pending writes have completed and > > > then send SYNCFS request. > > > > > > Looking at the sync_filesystem(), I am assuming vfs will take care > > > of flushing out all dirty pages and then call ->sync_fs. > > > > > > Having said that, I think fuse queues the writeback request internally > > > and signals completion of writeback to mm(end_page_writeback()). And > > > that's why fuse_fsync() has notion of waiting for all pending > > > writes to finish on an inode (fuse_sync_writes()). > > > > > > So I think you have raised a good point. That is if there are pending > > > writes at the time of syncfs(), we don't seem to have a notion of > > > first waiting for all these writes to finish before we send > > > FUSE_SYNCFS request to server. 
> > > > Maybe, but I was not referring to inode writeback requests. > > I had assumed that those were handled correctly. > > I was referring to pending metadata requests. > > > > ->sync_fs() in local fs also takes care of flushing metadata > > (e.g. journal). I assumed that virtiofsd implements FUSE_SYNCFS > > request by calling syncfs() on host fs, > > Yes virtiofsd calls syncfs() on host fs. > > > but it is does that than > > there is no guarantee that all metadata requests have reached the > > host fs from virtiofs unless client or server take care of waiting > > for all pending metadata requests before issuing FUSE_SYNCFS. > > We don't have any journal in virtiofs. In fact we don't seem to > cache any metadta. Except probably the case when "-o writeback" > where we can trust local time stamps. > > If "-o writeback" is not enabled, i am not sure what metadata > we will be caching that we will need to worry about. Do you have > something specific in mind. (Atleast from virtiofs point of view, > I can't seem to think what metadata we are caching which we need > to worry about). No, I don't see a problem. I guess I was confused by the semantics. Thanks for clarifying. Amir.
On Mon, Aug 16, 2021 at 11:29:02AM -0400, Vivek Goyal wrote:
> On Sun, Aug 15, 2021 at 05:14:06PM +0300, Amir Goldstein wrote:
> > I wonder - even if the server does not support SYNCFS or if the kernel
> > does not trust the server with SYNCFS, fuse_sync_fs() can wait
> > until all pending requests up to this call have been completed, either
> > before or after submitting the SYNCFS request. No?
> >
> > > Does virtiofsd track all requests prior to SYNCFS request to make
> > sure that they were executed on the host filesystem before calling
> > syncfs() on the host filesystem?
>
> Hi Amir,
>
> I don't think virtiofsd has any such notion. I would think, that
> client should make sure all pending writes have completed and
> then send SYNCFS request.
>
> Looking at the sync_filesystem(), I am assuming vfs will take care
> of flushing out all dirty pages and then call ->sync_fs.
>
> Having said that, I think fuse queues the writeback request internally
> and signals completion of writeback to mm(end_page_writeback()). And
> that's why fuse_fsync() has notion of waiting for all pending
> writes to finish on an inode (fuse_sync_writes()).
>
> So I think you have raised a good point. That is if there are pending
> writes at the time of syncfs(), we don't seem to have a notion of
> first waiting for all these writes to finish before we send
> FUSE_SYNCFS request to server.

So here is a proposed patch for fixing this. It works by counting write requests initiated up till the syncfs call. Since more than one syncfs can be in progress, counts are kept in "buckets" in order to wait for the correct write requests in each instance.

I tried to make this lightweight, but the cacheline bounce due to the counter is still there, unfortunately. fc->num_waiting also causes cacheline bounce, so I'm not going to optimize this (percpu counter?) until that one is also optimized.

Not yet tested, and I'm not sure how to test this.

Comments?
Thanks,
Miklos

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 97f860cfc195..8d1d6e895534 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -389,6 +389,7 @@ struct fuse_writepage_args {
 	struct list_head queue_entry;
 	struct fuse_writepage_args *next;
 	struct inode *inode;
+	struct fuse_sync_bucket *bucket;
 };
 
 static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
@@ -1608,6 +1609,9 @@ static void fuse_writepage_free(struct fuse_writepage_args *wpa)
 	struct fuse_args_pages *ap = &wpa->ia.ap;
 	int i;
 
+	if (wpa->bucket && atomic_dec_and_test(&wpa->bucket->num_writepages))
+		wake_up(&wpa->bucket->waitq);
+
 	for (i = 0; i < ap->num_pages; i++)
 		__free_page(ap->pages[i]);
 
@@ -1871,6 +1875,19 @@ static struct fuse_writepage_args *fuse_writepage_args_alloc(void)
 
 }
 
+static void fuse_writepage_add_to_bucket(struct fuse_conn *fc,
+					 struct fuse_writepage_args *wpa)
+{
+	if (!fc->sync_fs)
+		return;
+
+	rcu_read_lock();
+	do {
+		wpa->bucket = rcu_dereference(fc->curr_bucket);
+	} while (unlikely(!atomic_inc_not_zero(&wpa->bucket->num_writepages)));
+	rcu_read_unlock();
+}
+
 static int fuse_writepage_locked(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
@@ -1898,6 +1915,7 @@ static int fuse_writepage_locked(struct page *page)
 	if (!wpa->ia.ff)
 		goto err_nofile;
 
+	fuse_writepage_add_to_bucket(fc, wpa);
 	fuse_write_args_fill(&wpa->ia, wpa->ia.ff, page_offset(page), 0);
 
 	copy_highpage(tmp_page, page);
@@ -2148,6 +2166,8 @@ static int fuse_writepages_fill(struct page *page,
 			__free_page(tmp_page);
 			goto out_unlock;
 		}
+		fuse_writepage_add_to_bucket(fc, wpa);
+
 		data->max_pages = 1;
 
 		ap = &wpa->ia.ap;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 07829ce78695..ee638e227bb3 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -515,6 +515,14 @@ struct fuse_fs_context {
 	void **fudptr;
 };
 
+struct fuse_sync_bucket {
+	atomic_t num_writepages;
+	union {
+		wait_queue_head_t waitq;
+		struct rcu_head rcu;
+	};
+};
+
 /**
  * A Fuse connection.
  *
@@ -807,6 +815,9 @@ struct fuse_conn {
 
 	/** List of filesystems using this connection */
 	struct list_head mounts;
+
+	/* New writepages go into this bucket */
+	struct fuse_sync_bucket *curr_bucket;
 };
 
 /*
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b9beb39a4a18..524b2d128985 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -506,10 +506,24 @@ static int fuse_statfs(struct dentry *dentry, struct kstatfs *buf)
 	return err;
 }
 
+static struct fuse_sync_bucket *fuse_sync_bucket_alloc(void)
+{
+	struct fuse_sync_bucket *bucket;
+
+	bucket = kzalloc(sizeof(*bucket), GFP_KERNEL | __GFP_NOFAIL);
+	if (bucket) {
+		init_waitqueue_head(&bucket->waitq);
+		/* Initial active count */
+		atomic_set(&bucket->num_writepages, 1);
+	}
+	return bucket;
+}
+
 static int fuse_sync_fs(struct super_block *sb, int wait)
 {
 	struct fuse_mount *fm = get_fuse_mount_super(sb);
 	struct fuse_conn *fc = fm->fc;
+	struct fuse_sync_bucket *bucket, *new_bucket;
 	struct fuse_syncfs_in inarg;
 	FUSE_ARGS(args);
 	int err;
@@ -528,6 +542,31 @@ static int fuse_sync_fs(struct super_block *sb, int wait)
 	if (!fc->sync_fs)
 		return 0;
 
+	new_bucket = fuse_sync_bucket_alloc();
+	spin_lock(&fc->lock);
+	bucket = fc->curr_bucket;
+	if (atomic_read(&bucket->num_writepages) != 0) {
+		/* One more for count completion of old bucket */
+		atomic_inc(&new_bucket->num_writepages);
+		rcu_assign_pointer(fc->curr_bucket, new_bucket);
+		/* Drop initially added active count */
+		atomic_dec(&bucket->num_writepages);
+		spin_unlock(&fc->lock);
+
+		wait_event(bucket->waitq, atomic_read(&bucket->num_writepages) == 0);
+		/*
+		 * Drop count on new bucket, possibly resulting in a completion
+		 * if more than one syncfs is going on
+		 */
+		if (atomic_dec_and_test(&new_bucket->num_writepages))
+			wake_up(&new_bucket->waitq);
+		kfree_rcu(bucket, rcu);
+	} else {
+		spin_unlock(&fc->lock);
+		/* Free unused */
+		kfree(new_bucket);
+	}
+
 	memset(&inarg, 0, sizeof(inarg));
 	args.in_numargs = 1;
 	args.in_args[0].size = sizeof(inarg);
@@ -770,6 +809,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 			fiq->ops->release(fiq);
 		put_pid_ns(fc->pid_ns);
 		put_user_ns(fc->user_ns);
+		kfree_rcu(fc->curr_bucket, rcu);
 		fc->release(fc);
 	}
 }
@@ -1418,6 +1458,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 	if (sb->s_flags & SB_MANDLOCK)
 		goto err;
 
+	fc->curr_bucket = fuse_sync_bucket_alloc();
 	fuse_sb_defaults(sb);
 
 	if (ctx->is_bdev) {
On Sat, Aug 28, 2021 at 05:21:39PM +0200, Miklos Szeredi wrote:
> On Mon, Aug 16, 2021 at 11:29:02AM -0400, Vivek Goyal wrote:
> > On Sun, Aug 15, 2021 at 05:14:06PM +0300, Amir Goldstein wrote:
>
> > > I wonder - even if the server does not support SYNCFS or if the kernel
> > > does not trust the server with SYNCFS, fuse_sync_fs() can wait
> > > until all pending requests up to this call have been completed, either
> > > before or after submitting the SYNCFS request. No?
> > >
> >
> > > Does virtiofsd track all requests prior to SYNCFS request to make
> > > sure that they were executed on the host filesystem before calling
> > > syncfs() on the host filesystem?
> >
> > Hi Amir,
> >
> > I don't think virtiofsd has any such notion. I would think, that
> > client should make sure all pending writes have completed and
> > then send SYNCFS request.
> >
> > Looking at the sync_filesystem(), I am assuming vfs will take care
> > of flushing out all dirty pages and then call ->sync_fs.
> >
> > Having said that, I think fuse queues the writeback request internally
> > and signals completion of writeback to mm(end_page_writeback()). And
> > that's why fuse_fsync() has notion of waiting for all pending
> > writes to finish on an inode (fuse_sync_writes()).
> >
> > So I think you have raised a good point. That is if there are pending
> > writes at the time of syncfs(), we don't seem to have a notion of
> > first waiting for all these writes to finish before we send
> > FUSE_SYNCFS request to server.
>
> So here's a proposed patch for fixing this. It works by counting write requests
> initiated up until the syncfs call. Since more than one syncfs can be in
> progress, counts are kept in "buckets" in order to wait for the correct write
> requests in each instance.
>
> I tried to make this lightweight, but the cacheline bounce due to the counter is
> still there, unfortunately. fc->num_waiting also causes cacheline bounce, so I'm
> not going to optimize this (percpu counter?) until that one is also optimized.
>
> Not yet tested, and I'm not sure how to test this.
>
> Comments?
>
> Thanks,
> Miklos
>
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 97f860cfc195..8d1d6e895534 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -389,6 +389,7 @@ struct fuse_writepage_args {
>  	struct list_head queue_entry;
>  	struct fuse_writepage_args *next;
>  	struct inode *inode;
> +	struct fuse_sync_bucket *bucket;
>  };
>
>  static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
> @@ -1608,6 +1609,9 @@ static void fuse_writepage_free(struct fuse_writepage_args *wpa)
>  	struct fuse_args_pages *ap = &wpa->ia.ap;
>  	int i;
>
> +	if (wpa->bucket && atomic_dec_and_test(&wpa->bucket->num_writepages))

Hi Miklos,

Wondering why this wpa->bucket check is there. Isn't every wpa associated
with a bucket? So when do we run into the situation where wpa->bucket == NULL?

> +		wake_up(&wpa->bucket->waitq);
> +
>  	for (i = 0; i < ap->num_pages; i++)
>  		__free_page(ap->pages[i]);
>
> @@ -1871,6 +1875,19 @@ static struct fuse_writepage_args *fuse_writepage_args_alloc(void)
>
>  }
>
> +static void fuse_writepage_add_to_bucket(struct fuse_conn *fc,
> +					 struct fuse_writepage_args *wpa)
> +{
> +	if (!fc->sync_fs)
> +		return;
> +
> +	rcu_read_lock();
> +	do {
> +		wpa->bucket = rcu_dereference(fc->curr_bucket);
> +	} while (unlikely(!atomic_inc_not_zero(&wpa->bucket->num_writepages)));

So this loop is there because fuse_sync_fs() might be replacing
fc->curr_bucket. And we are fetching this pointer under RCU. So it is
possible that fuse_sync_fs() dropped its reference and that led to
->num_writepages being 0, and we don't want to use this bucket.

What if fuse_sync_fs() dropped its reference but there is still another
wpa in progress and hence ->num_writepages is not zero? We still don't
want to use this bucket for a new wpa, right?

> +	rcu_read_unlock();
> +}
> +
>  static int fuse_writepage_locked(struct page *page)
>  {
>  	struct address_space *mapping = page->mapping;
> @@ -1898,6 +1915,7 @@ static int fuse_writepage_locked(struct page *page)
>  	if (!wpa->ia.ff)
>  		goto err_nofile;
>
> +	fuse_writepage_add_to_bucket(fc, wpa);
>  	fuse_write_args_fill(&wpa->ia, wpa->ia.ff, page_offset(page), 0);
>
>  	copy_highpage(tmp_page, page);
> @@ -2148,6 +2166,8 @@ static int fuse_writepages_fill(struct page *page,
>  			__free_page(tmp_page);
>  			goto out_unlock;
>  		}
> +		fuse_writepage_add_to_bucket(fc, wpa);
> +
>  		data->max_pages = 1;
>
>  		ap = &wpa->ia.ap;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 07829ce78695..ee638e227bb3 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -515,6 +515,14 @@ struct fuse_fs_context {
>  	void **fudptr;
>  };
>
> +struct fuse_sync_bucket {
> +	atomic_t num_writepages;
> +	union {
> +		wait_queue_head_t waitq;
> +		struct rcu_head rcu;
> +	};
> +};
> +
>  /**
>   * A Fuse connection.
>   *
> @@ -807,6 +815,9 @@ struct fuse_conn {
>
>  	/** List of filesystems using this connection */
>  	struct list_head mounts;
> +
> +	/* New writepages go into this bucket */
> +	struct fuse_sync_bucket *curr_bucket;
>  };
>
>  /*
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index b9beb39a4a18..524b2d128985 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -506,10 +506,24 @@ static int fuse_statfs(struct dentry *dentry, struct kstatfs *buf)
>  	return err;
>  }
>
> +static struct fuse_sync_bucket *fuse_sync_bucket_alloc(void)
> +{
> +	struct fuse_sync_bucket *bucket;
> +
> +	bucket = kzalloc(sizeof(*bucket), GFP_KERNEL | __GFP_NOFAIL);
> +	if (bucket) {
> +		init_waitqueue_head(&bucket->waitq);
> +		/* Initial active count */
> +		atomic_set(&bucket->num_writepages, 1);
> +	}
> +	return bucket;
> +}
> +
>  static int fuse_sync_fs(struct super_block *sb, int wait)
>  {
>  	struct fuse_mount *fm = get_fuse_mount_super(sb);
>  	struct fuse_conn *fc = fm->fc;
> +	struct fuse_sync_bucket *bucket, *new_bucket;
>  	struct fuse_syncfs_in inarg;
>  	FUSE_ARGS(args);
>  	int err;
> @@ -528,6 +542,31 @@ static int fuse_sync_fs(struct super_block *sb, int wait)
>  	if (!fc->sync_fs)
>  		return 0;
>
> +	new_bucket = fuse_sync_bucket_alloc();
> +	spin_lock(&fc->lock);
> +	bucket = fc->curr_bucket;
> +	if (atomic_read(&bucket->num_writepages) != 0) {
> +		/* One more for count completion of old bucket */
> +		atomic_inc(&new_bucket->num_writepages);
> +		rcu_assign_pointer(fc->curr_bucket, new_bucket);
> +		/* Drop initially added active count */
> +		atomic_dec(&bucket->num_writepages);
> +		spin_unlock(&fc->lock);
> +
> +		wait_event(bucket->waitq, atomic_read(&bucket->num_writepages) == 0);
> +		/*
> +		 * Drop count on new bucket, possibly resulting in a completion
> +		 * if more than one syncfs is going on
> +		 */
> +		if (atomic_dec_and_test(&new_bucket->num_writepages))
> +			wake_up(&new_bucket->waitq);
> +		kfree_rcu(bucket, rcu);
> +	} else {
> +		spin_unlock(&fc->lock);
> +		/* Free unused */
> +		kfree(new_bucket);

When can we run into the situation where fc->curr_bucket has num_writepages
== 0? When we install a bucket it has count 1. And the only time it can go to
0 is when we have dropped the initial reference. And the initial reference
can be dropped only after removing the bucket from fc->curr_bucket.

IOW, we don't drop the initial reference on a bucket if it is in
fc->curr_bucket. And that means anything installed in fc->curr_bucket should
never have a reference count of 0. What am I missing?

Thanks
Vivek

> +	}
> +
>  	memset(&inarg, 0, sizeof(inarg));
>  	args.in_numargs = 1;
>  	args.in_args[0].size = sizeof(inarg);
> @@ -770,6 +809,7 @@ void fuse_conn_put(struct fuse_conn *fc)
>  			fiq->ops->release(fiq);
>  		put_pid_ns(fc->pid_ns);
>  		put_user_ns(fc->user_ns);
> +		kfree_rcu(fc->curr_bucket, rcu);
>  		fc->release(fc);
>  	}
>  }
> @@ -1418,6 +1458,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
>  	if (sb->s_flags & SB_MANDLOCK)
>  		goto err;
>
> +	fc->curr_bucket = fuse_sync_bucket_alloc();
>  	fuse_sb_defaults(sb);
>
>  	if (ctx->is_bdev) {
>
On Mon, 30 Aug 2021 at 19:01, Vivek Goyal <vgoyal@redhat.com> wrote:

> >  static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
> > @@ -1608,6 +1609,9 @@ static void fuse_writepage_free(struct fuse_writepage_args *wpa)
> >  	struct fuse_args_pages *ap = &wpa->ia.ap;
> >  	int i;
> >
> > +	if (wpa->bucket && atomic_dec_and_test(&wpa->bucket->num_writepages))
>
> Hi Miklos,
>
> Wondering why this wpa->bucket check is there. Isn't every wpa associated
> with a bucket? So when do we run into the situation where wpa->bucket == NULL?

In case fc->sync_fs is false.

> > @@ -1871,6 +1875,19 @@ static struct fuse_writepage_args *fuse_writepage_args_alloc(void)
> >
> >  }
> >
> > +static void fuse_writepage_add_to_bucket(struct fuse_conn *fc,
> > +					 struct fuse_writepage_args *wpa)
> > +{
> > +	if (!fc->sync_fs)
> > +		return;
> > +
> > +	rcu_read_lock();
> > +	do {
> > +		wpa->bucket = rcu_dereference(fc->curr_bucket);
> > +	} while (unlikely(!atomic_inc_not_zero(&wpa->bucket->num_writepages)));
>
> So this loop is there because fuse_sync_fs() might be replacing
> fc->curr_bucket. And we are fetching this pointer under RCU. So it is
> possible that fuse_sync_fs() dropped its reference and that led to
> ->num_writepages being 0, and we don't want to use this bucket.
>
> What if fuse_sync_fs() dropped its reference but there is still another
> wpa in progress and hence ->num_writepages is not zero? We still don't
> want to use this bucket for a new wpa, right?

That is an unlikely race. If it happens, the write will go into the old
bucket and will be waited for, but that definitely should not be a problem.

> > @@ -528,6 +542,31 @@ static int fuse_sync_fs(struct super_block *sb, int wait)
> >  	if (!fc->sync_fs)
> >  		return 0;
> >
> > +	new_bucket = fuse_sync_bucket_alloc();
> > +	spin_lock(&fc->lock);
> > +	bucket = fc->curr_bucket;
> > +	if (atomic_read(&bucket->num_writepages) != 0) {
> > +		/* One more for count completion of old bucket */
> > +		atomic_inc(&new_bucket->num_writepages);
> > +		rcu_assign_pointer(fc->curr_bucket, new_bucket);
> > +		/* Drop initially added active count */
> > +		atomic_dec(&bucket->num_writepages);
> > +		spin_unlock(&fc->lock);
> > +
> > +		wait_event(bucket->waitq, atomic_read(&bucket->num_writepages) == 0);
> > +		/*
> > +		 * Drop count on new bucket, possibly resulting in a completion
> > +		 * if more than one syncfs is going on
> > +		 */
> > +		if (atomic_dec_and_test(&new_bucket->num_writepages))
> > +			wake_up(&new_bucket->waitq);
> > +		kfree_rcu(bucket, rcu);
> > +	} else {
> > +		spin_unlock(&fc->lock);
> > +		/* Free unused */
> > +		kfree(new_bucket);
> When can we run into the situation where fc->curr_bucket has num_writepages
> == 0? When we install a bucket it has count 1. And the only time it can go to
> 0 is when we have dropped the initial reference. And the initial reference
> can be dropped only after removing the bucket from fc->curr_bucket.
>
> IOW, we don't drop the initial reference on a bucket if it is in
> fc->curr_bucket. And that means anything installed in fc->curr_bucket should
> never have a reference count of 0. What am I missing?

You are correct. I fixed it by warning on zero count and checking for
count != 1.

I have other fixes as well, will send v2.

Thanks,
Miklos
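
The bucket accounting discussed in this exchange can be modelled in userspace. What follows is a minimal single-threaded sketch under stated assumptions, not the kernel code: `struct bucket`, `write_start()`, `write_end()` and `sync_begin()` are hypothetical names, C11 atomics stand in for the kernel's atomic_t, and RCU, fc->lock, the wait queue and the extra chaining reference taken on the new bucket are all left out. The `count == 1` test mirrors the fix described above (a count other than 1 means writes are in flight); the actual v2 code may well differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct fuse_sync_bucket. */
struct bucket {
	atomic_int count;	/* 1 active reference + 1 per in-flight write */
};

/* Models fc->curr_bucket; plain pointer because this model is single-threaded. */
static struct bucket *curr_bucket;

static struct bucket *bucket_alloc(void)
{
	struct bucket *b = malloc(sizeof(*b));

	if (!b)
		abort();
	atomic_init(&b->count, 1);	/* initial active count */
	return b;
}

/* Same contract as the kernel's atomic_inc_not_zero(). */
static bool inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* A new write joins whatever bucket is current and remembers it. */
static struct bucket *write_start(void)
{
	struct bucket *b;

	do {
		b = curr_bucket;
	} while (!inc_not_zero(&b->count));
	return b;
}

/* Write completion; returns true when the last reference was dropped. */
static bool write_end(struct bucket *b)
{
	return atomic_fetch_sub(&b->count, 1) == 1;
}

/*
 * First half of syncfs: if writes are in flight, install a fresh bucket
 * and drop the active reference on the old one. Returns the bucket the
 * caller would then wait on, or NULL when nothing is pending.
 */
static struct bucket *sync_begin(void)
{
	struct bucket *old = curr_bucket;
	int count = atomic_load(&old->count);

	assert(count != 0);	/* an installed bucket keeps its active ref */
	if (count == 1)
		return NULL;

	curr_bucket = bucket_alloc();
	atomic_fetch_sub(&old->count, 1);
	return old;
}
```

In this model a syncfs caller would block until the count of the bucket returned by `sync_begin()` reaches zero, and only then send the request to the server. Writes started after the swap land in the new bucket and do not delay that syncfs.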
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e2f5c8617e0d..01d9283261af 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -761,6 +761,9 @@ struct fuse_conn {
 	/* Auto-mount submounts announced by the server */
 	unsigned int auto_submounts:1;
 
+	/* Propagate syncfs() to server */
+	unsigned int sync_fs:1;
+
 	/** The number of requests waiting for completion */
 	atomic_t num_waiting;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 123b53d1c3c6..96b00253f766 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -506,6 +506,45 @@ static int fuse_statfs(struct dentry *dentry, struct kstatfs *buf)
 	return err;
 }
 
+static int fuse_sync_fs(struct super_block *sb, int wait)
+{
+	struct fuse_mount *fm = get_fuse_mount_super(sb);
+	struct fuse_conn *fc = fm->fc;
+	struct fuse_syncfs_in inarg;
+	FUSE_ARGS(args);
+	int err;
+
+	/*
+	 * Userspace cannot handle the wait == 0 case. Avoid a
+	 * gratuitous roundtrip.
+	 */
+	if (!wait)
+		return 0;
+
+	/* The filesystem is being unmounted. Nothing to do. */
+	if (!sb->s_root)
+		return 0;
+
+	if (!fc->sync_fs)
+		return 0;
+
+	memset(&inarg, 0, sizeof(inarg));
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+	args.opcode = FUSE_SYNCFS;
+	args.nodeid = get_node_id(sb->s_root->d_inode);
+	args.out_numargs = 0;
+
+	err = fuse_simple_request(fm, &args);
+	if (err == -ENOSYS) {
+		fc->sync_fs = 0;
+		err = 0;
+	}
+
+	return err;
+}
+
 enum {
 	OPT_SOURCE,
 	OPT_SUBTYPE,
@@ -909,6 +948,7 @@ static const struct super_operations fuse_super_operations = {
 	.put_super	= fuse_put_super,
 	.umount_begin	= fuse_umount_begin,
 	.statfs		= fuse_statfs,
+	.sync_fs	= fuse_sync_fs,
 	.show_options	= fuse_show_options,
 };
 
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 8962cd033016..f649a47efb68 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1455,6 +1455,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
 	fc->release = fuse_free_conn;
 	fc->delete_stale = true;
 	fc->auto_submounts = true;
+	fc->sync_fs = true;
 
 	/* Tell FUSE to split requests that exceed the virtqueue's size */
 	fc->max_pages_limit = min_t(unsigned int, fc->max_pages_limit,
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 271ae90a9bb7..36ed092227fa 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -181,6 +181,9 @@
  *  - add FUSE_OPEN_KILL_SUIDGID
  *  - extend fuse_setxattr_in, add FUSE_SETXATTR_EXT
  *  - add FUSE_SETXATTR_ACL_KILL_SGID
+ *
+ *  7.34
+ *  - add FUSE_SYNCFS
  */
 
 #ifndef _LINUX_FUSE_H
@@ -216,7 +219,7 @@
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 33
+#define FUSE_KERNEL_MINOR_VERSION 34
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
@@ -509,6 +512,7 @@ enum fuse_opcode {
 	FUSE_COPY_FILE_RANGE	= 47,
 	FUSE_SETUPMAPPING	= 48,
 	FUSE_REMOVEMAPPING	= 49,
+	FUSE_SYNCFS		= 50,
 
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,
@@ -971,4 +975,8 @@ struct fuse_removemapping_one {
 #define FUSE_REMOVEMAPPING_MAX_ENTRY   \
 		(PAGE_SIZE / sizeof(struct fuse_removemapping_one))
 
+struct fuse_syncfs_in {
+	uint64_t	padding;
+};
+
 #endif /* _LINUX_FUSE_H */
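
The control flow of fuse_sync_fs() in the diff above (skip wait == 0, skip when the feature is off, and turn -ENOSYS into permanent success) can be tried out in userspace. This is only a rough model under stated assumptions: `struct conn`, `model_sync_fs()` and `send_unsupported()` are hypothetical names that stand in for struct fuse_conn, fuse_sync_fs() and a file server lacking FUSE_SYNCFS support. The unmount check on sb->s_root has no equivalent here and is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of struct fuse_conn. */
struct conn {
	bool sync_fs;		/* propagate syncfs() to the server? */
	int requests_sent;	/* round trips actually made */
};

/* Hypothetical server that does not implement the request. */
static int send_unsupported(struct conn *c)
{
	c->requests_sent++;
	return -ENOSYS;
}

/* Models the decision logic of fuse_sync_fs(), not its request setup. */
static int model_sync_fs(struct conn *c, int wait,
			 int (*send)(struct conn *))
{
	int err;

	/* The server cannot handle wait == 0: avoid a useless round trip. */
	if (!wait)
		return 0;

	if (!c->sync_fs)
		return 0;

	err = send(c);
	if (err == -ENOSYS) {
		/* Remember the lack of support and report success from now on. */
		c->sync_fs = false;
		err = 0;
	}
	return err;
}
```

With such a server the first call costs one round trip and every later call returns 0 without contacting the server, which is the behavior the patch description calls permanent success.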
Even if POSIX doesn't mandate it, Linux users legitimately expect
sync() to flush all data and metadata to physical storage when it
is located on the same system. This isn't happening with virtiofs
though: sync() inside the guest returns right away even though
data still needs to be flushed from the host page cache.

This is easily demonstrated by doing the following in the guest:

$ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s
sync() = 0 <0.024068>
+++ exited with 0 +++

and start the following in the host when the 'dd' command completes
in the guest:

$ strace -T -e fsync /usr/bin/sync virtiofs/foo
fsync(3) = 0 <10.371640>
+++ exited with 0 +++

There are no good reasons not to honor the expected behavior of
sync() actually: it gives an unrealistic impression that virtiofs
is super fast and that data has safely landed on HW, which isn't
the case obviously.

Implement a ->sync_fs() superblock operation that sends a new
FUSE_SYNCFS request type for this purpose. Provision a 64-bit
placeholder for possible future extensions. Since the file
server cannot handle the wait == 0 case, we skip it to avoid a
gratuitous roundtrip. Note that this is per-superblock: a
FUSE_SYNCFS is sent for the root mount and for each submount.

Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for
FUSE_SYNCFS in the file server is treated as permanent success.
This ensures compatibility with older file servers: the client
will get the current behavior of sync() not being propagated to
the file server.

Note that such an operation allows the file server to DoS sync().
Since a typical FUSE file server is an untrusted piece of software
running in userspace, this is disabled by default. Only enable it
with virtiofs for now since virtiofsd is supposedly trusted by the
guest kernel.

Reported-by: Robert Krawitz <rlk@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
---
 fs/fuse/fuse_i.h          |  3 +++
 fs/fuse/inode.c           | 40 +++++++++++++++++++++++++++++++++++++++
 fs/fuse/virtio_fs.c       |  1 +
 include/uapi/linux/fuse.h | 10 +++++++++-
 4 files changed, 53 insertions(+), 1 deletion(-)