diff mbox series

[RFC] fuse: update attributes on read() only on timeout

Message ID 20200929185015.GG220516@redhat.com (mailing list archive)
State New, archived
Headers show
Series [RFC] fuse: update attributes on read() only on timeout | expand

Commit Message

Vivek Goyal Sept. 29, 2020, 6:50 p.m. UTC
Following commit added a flag to invalidate guest page cache automatically.

72d0d248ca823 fuse: add FUSE_AUTO_INVAL_DATA init flag

Idea seemed to be that for network file systmes if client A modifies
the file, then client B should be able to detect that mtime of file
change and invalidate its own cache and fetch new data from server.

There are few questions/issues with this method.

How soon client B able to detect that file has changed. Should it
first GETATTR from server for every READ and compare mtime. That
will be much stronger cache coherency but very slow because every
READ will first be preceeded by a GETATTR.

Or should this be driven by inode timeout. That is if inode cached attrs
(including mtime) have timed out, we fetch new mtime from server and
invalidate cache based on that.

Current logic calls fuse_update_attr() on every READ. But that method
will result in GETATTR only if either attrs have timedout or if cached
attrs have been invalidated.

If client B is only doing READs (and not WRITEs), then attrs should be
valid for inode timeout interval. And that means client B will detect
mtime change only after timeout interval.

But if client B is also doing WRITE, then once WRITE completes, we
invalidate cached attrs. That means next READ will force GETATTR()
and invalidate page cache. In this case client B will detect the
change by client A much sooner but it can't differentiate between
its own WRITEs and by another client WRITE. So every WRITE followed
by READ will result in GETATTR, followed by page cache invalidation
and performance suffers in mixed read/write workloads.

I am assuming that intent of auto_inval_data is to detect changes
by another client but it can take up to "inode timeout" seconds
to detect that change. (And it does not guarantee an immidiate change
detection).

If above assumption is acceptable, then I am proposing this patch
which will update attrs on READ only if attrs have timed out. This
means every second we will do a GETATTR and invalidate page cache.

This is also suboptimal because only if client B is writing, our
cache is still valid but we will still invalidate it after 1 second.
But we don't have a good mechanism to differentiate between our own
changes and another client's changes. So this is probably second
best method to reduce the extent of issue.

I am running equivalent of following fio workload on virtiofs (cache=auto)
and there I see a performance improvement of roughly 12%.

fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio
+--bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
+--output=/output/fio.txt

NAME                    WORKLOAD                Bandwidth       IOPS
vtfs-auto-sh		randrw-psync            43.3mb/14.4mb   10.8k/3709
vtfs-auto-sh-invaltime  randrw-psync            48.9mb/16.3mb   12.2k/4197

Signee-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/dir.c    |  6 ++++++
 fs/fuse/file.c   | 21 +++++++++++++++------
 fs/fuse/fuse_i.h |  1 +
 3 files changed, 22 insertions(+), 6 deletions(-)

Comments

Vivek Goyal Sept. 29, 2020, 7:58 p.m. UTC | #1
On Tue, Sep 29, 2020 at 02:50:15PM -0400, Vivek Goyal wrote:
> Following commit added a flag to invalidate guest page cache automatically.
> 
> 72d0d248ca823 fuse: add FUSE_AUTO_INVAL_DATA init flag
> 
> Idea seemed to be that for network file systmes if client A modifies
> the file, then client B should be able to detect that mtime of file
> change and invalidate its own cache and fetch new data from server.
> 
> There are few questions/issues with this method.
> 
> How soon client B able to detect that file has changed. Should it
> first GETATTR from server for every READ and compare mtime. That
> will be much stronger cache coherency but very slow because every
> READ will first be preceeded by a GETATTR.
> 
> Or should this be driven by inode timeout. That is if inode cached attrs
> (including mtime) have timed out, we fetch new mtime from server and
> invalidate cache based on that.
> 
> Current logic calls fuse_update_attr() on every READ. But that method
> will result in GETATTR only if either attrs have timedout or if cached
> attrs have been invalidated.
> 
> If client B is only doing READs (and not WRITEs), then attrs should be
> valid for inode timeout interval. And that means client B will detect
> mtime change only after timeout interval.
> 
> But if client B is also doing WRITE, then once WRITE completes, we
> invalidate cached attrs. That means next READ will force GETATTR()
> and invalidate page cache. In this case client B will detect the
> change by client A much sooner but it can't differentiate between
> its own WRITEs and by another client WRITE. So every WRITE followed
> by READ will result in GETATTR, followed by page cache invalidation
> and performance suffers in mixed read/write workloads.
> 
> I am assuming that intent of auto_inval_data is to detect changes
> by another client but it can take up to "inode timeout" seconds
> to detect that change. (And it does not guarantee an immidiate change
> detection).
> 
> If above assumption is acceptable, then I am proposing this patch
> which will update attrs on READ only if attrs have timed out. This
> means every second we will do a GETATTR and invalidate page cache.
> 
> This is also suboptimal because only if client B is writing, our
> cache is still valid but we will still invalidate it after 1 second.
> But we don't have a good mechanism to differentiate between our own
> changes and another client's changes. So this is probably second
> best method to reduce the extent of issue.
> 
> I am running equivalent of following fio workload on virtiofs (cache=auto)
> and there I see a performance improvement of roughly 12%.
> 
> fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio
> +--bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
> +--output=/output/fio.txt
> 
> NAME                    WORKLOAD                Bandwidth       IOPS
> vtfs-auto-sh		randrw-psync            43.3mb/14.4mb   10.8k/3709
> vtfs-auto-sh-invaltime  randrw-psync            48.9mb/16.3mb   12.2k/4197

Also ran a kernel compilation test. With this change, there is roughly
10% improvement in build time.

- git clone linux
- cd linux
- sync; echo 3 > /proc/sys/vm/drop_cache
- make defconfig
- time make -j32

Without patch
-------------
real    4m57.819s
user    23m2.432s
sys     18m39.004s

With patch
----------
real    4m33.549s
user    23m4.168s
sys     18m36.515s

Thanks
Vivek

> 
> Signee-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  fs/fuse/dir.c    |  6 ++++++
>  fs/fuse/file.c   | 21 +++++++++++++++------
>  fs/fuse/fuse_i.h |  1 +
>  3 files changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 26f028bc760b..5c4eda0edd95 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1002,6 +1002,12 @@ int fuse_update_attributes(struct inode *inode, struct file *file)
>  				    STATX_BASIC_STATS & ~STATX_ATIME, 0);
>  }
>  
> +int fuse_update_attributes_timeout(struct inode *inode, struct file *file)
> +{
> +	/* Only refresh attrs if timeout has happened */
> +	return fuse_update_get_attr(inode, file, NULL, 0, 0);
> +}
> +
>  int fuse_reverse_inval_entry(struct super_block *sb, u64 parent_nodeid,
>  			     u64 child_nodeid, struct qstr *name)
>  {
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 6611ef3269a8..dea001f5f4e9 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -972,19 +972,28 @@ static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  {
>  	struct inode *inode = iocb->ki_filp->f_mapping->host;
>  	struct fuse_conn *fc = get_fuse_conn(inode);
> +	int err = 0;
>  
>  	/*
> -	 * In auto invalidate mode, always update attributes on read.
> +	 * In auto invalidate mode, update attributes on read if previously
> +	 * stored attributes timed out. This is primarily done to detect
> +	 * writes by other clients and invalidate local cache. But we don't
> +	 * have a way to differentiate between our own writes and writes
> +	 * by other clients. So we end up updating attrs and invalidating
> +	 * cache on our own write. Stick to the theme of detecting changes
> +	 * after timeout. This is what happens already if we are not doing
> +	 * writes but other client is doing.
> +	 *
>  	 * Otherwise, only update if we attempt to read past EOF (to ensure
>  	 * i_size is up to date).
>  	 */
> -	if (fc->auto_inval_data ||
> -	    (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
> -		int err;
> +	if (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode)) {
>  		err = fuse_update_attributes(inode, iocb->ki_filp);
> -		if (err)
> -			return err;
> +	} else if (fc->auto_inval_data) {
> +		err = fuse_update_attributes_timeout(inode, iocb->ki_filp);
>  	}
> +	if (err)
> +		return err;
>  
>  	return generic_file_read_iter(iocb, to);
>  }
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 740a8a7d7ae6..78f93129b60e 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1004,6 +1004,7 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
>  void fuse_update_ctime(struct inode *inode);
>  
>  int fuse_update_attributes(struct inode *inode, struct file *file);
> +int fuse_update_attributes_timeout(struct inode *inode, struct file *file);
>  
>  void fuse_flush_writepages(struct inode *inode);
>  
> -- 
> 2.25.4
>
Amir Goldstein Sept. 30, 2020, 4:35 a.m. UTC | #2
On Tue, Sep 29, 2020 at 9:52 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> Following commit added a flag to invalidate guest page cache automatically.
>
> 72d0d248ca823 fuse: add FUSE_AUTO_INVAL_DATA init flag
>
> Idea seemed to be that for network file systmes if client A modifies
> the file, then client B should be able to detect that mtime of file
> change and invalidate its own cache and fetch new data from server.
>
> There are few questions/issues with this method.
>
> How soon client B able to detect that file has changed. Should it
> first GETATTR from server for every READ and compare mtime. That
> will be much stronger cache coherency but very slow because every
> READ will first be preceeded by a GETATTR.
>
> Or should this be driven by inode timeout. That is if inode cached attrs
> (including mtime) have timed out, we fetch new mtime from server and
> invalidate cache based on that.
>
> Current logic calls fuse_update_attr() on every READ. But that method
> will result in GETATTR only if either attrs have timedout or if cached
> attrs have been invalidated.
>
> If client B is only doing READs (and not WRITEs), then attrs should be
> valid for inode timeout interval. And that means client B will detect
> mtime change only after timeout interval.
>
> But if client B is also doing WRITE, then once WRITE completes, we
> invalidate cached attrs. That means next READ will force GETATTR()
> and invalidate page cache. In this case client B will detect the
> change by client A much sooner but it can't differentiate between
> its own WRITEs and by another client WRITE. So every WRITE followed
> by READ will result in GETATTR, followed by page cache invalidation
> and performance suffers in mixed read/write workloads.
>
> I am assuming that intent of auto_inval_data is to detect changes
> by another client but it can take up to "inode timeout" seconds
> to detect that change. (And it does not guarantee an immidiate change
> detection).
>
> If above assumption is acceptable, then I am proposing this patch
> which will update attrs on READ only if attrs have timed out. This
> means every second we will do a GETATTR and invalidate page cache.
>
> This is also suboptimal because only if client B is writing, our
> cache is still valid but we will still invalidate it after 1 second.
> But we don't have a good mechanism to differentiate between our own
> changes and another client's changes. So this is probably second
> best method to reduce the extent of issue.
>

I was under the impression that virtiofs in now in the stage of stabilizing the
"all changes are from this client and no local changes on server" use case.
Is that the case? I remember you also had an idea to communicate that this
is the use case on connection setup time for SB_NOSEC which did not happen.

If that is the case, why use auto_inval_data at all for virtiofs and not
explicit_inval_data?
Is that because you do want to allow local changes on the server?

I wonder out loud if this change of behavior you proposed is a good opportunity
to introduce some of the verbs from SMB oplocks / NFS delegations into the
FUSE protocol in order to allow finer grained control over per-file
(and later also
per-directory) caching behavior.

Thanks,
Amir.
Miklos Szeredi Sept. 30, 2020, 6:40 a.m. UTC | #3
On Wed, Sep 30, 2020 at 6:36 AM Amir Goldstein <amir73il@gmail.com> wrote:

> I wonder out loud if this change of behavior you proposed is a good opportunity
> to introduce some of the verbs from SMB oplocks / NFS delegations into the
> FUSE protocol in order to allow finer grained control over per-file
> (and later also per-directory) caching behavior.

That would be really nice.  Let me find a recent discussion on this...
ah it was private.   Copying the thread below.  Thoughts?

Thanks,
Miklos

On Tue, Aug 11, 2020 at 8:56 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Wed, Aug 5, 2020 at 5:53 AM Zhi Zhang <zhang.david2011@gmail.com> wrote:
> > On Tue, Aug 4, 2020 at 11:36 AM Zhi Zhang <zhang.david2011@gmail.com> wrote:
> > > On Mon, Aug 3, 2020 at 6:37 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > On Thu, Jul 23, 2020 at 12:40 PM Zhi Zhang <zhang.david2011@gmail.com> wrote:
>
> > > > > We are using distributed filesystem ceph-fuse and we enabled writeback
> > > > > cache on fuse which improves the write performance, but the file's
> > > > > size attribute can't be updated on another client even if the users on
> > > > > this client only read this file.
> > > > >
> > > > > From my understanding, if the file is not opened in write mode and
> > > > > already writes its buffered data to userspace filesystem like
> > > > > ceph-fuse, then its state should be clean. The upper userspace and
> > > > > remote server should be responsible for the data and consistency. So
> > > > > at this moment fuse could trust the attributes from the server which
> > > > > has the most authoritative information about this file.
> > > > >
> > > > > Please let me know your thoughts, then I can work on this patch. Thanks.
> > > >
> > > > Hi,
> > > >
> > > > Something like this makes sense, but I think we should be adding an
> > > > explicit state (a lease to read/write the data) to the fuse inode.
> > > >
> > > > Opening for write would automatically acquire the WRITE lease,
> > > > similarly opening for read would acquire the READ lease.  Then we need
> > > > a new notification for revoking a lease (FUSE_NOTIFY_REVOKE).  And we
> > > > need a new request for re-acquiring a lease (FUSE_REACQUIRE).
> > > >
> > > > Does that make sense?
> > > >
> > > > Would you mind discussing this on the linux-fsdevel mailing lists?
> > > >
> > > > Thanks,
> > > > Miklos>
> > Hi Miklos,
> >
> > Thanks for the comments. I thought about it but I still have a couple
> > of questions about the lease.
> >
> > 1. After acquiring a WRITE lease, when should we release(revoke) it?
> > Before I assumed the file would be clean once we wrote buffered data
> > to the userspace file system. Now if we introduce the lease, should we
> > release the WRITE lease once we write the buffered data or we need to
> > wait for the revoking notification from userspace file system?
>
> I think it's easier to wait for the notification, instead of trying to
> guess.   When the file is closed (released) then the lease is also
> implicitly released.
>
> > 2. What is the purpose of READ lease?
> > Once we hold the READ lease, we could trust cached attributes and data
> > until revoking notification from userspace file system?
>
> Yes.
>
> > 3. What is the purpose of re-acquiring a lease and why do we need a new request?
> > From my understanding, the lease mechanism is only known by kernel
> > fuse, not for libfuse.
>
> We don't necessarily need a new request, it could be implicit in the
> first uncached write.
>
> > To re-acquire a lease here is actually for READ
> > lease by sending a sync getattr request.
>
> Yes.
>
> Thanks,
> Miklos
Stefan Hajnoczi Sept. 30, 2020, 9:03 a.m. UTC | #4
On Tue, Sep 29, 2020 at 02:50:15PM -0400, Vivek Goyal wrote:
> Following commit added a flag to invalidate guest page cache automatically.
> 
> 72d0d248ca823 fuse: add FUSE_AUTO_INVAL_DATA init flag
> 
> Idea seemed to be that for network file systmes if client A modifies
> the file, then client B should be able to detect that mtime of file
> change and invalidate its own cache and fetch new data from server.
> 
> There are few questions/issues with this method.
> 
> How soon client B able to detect that file has changed. Should it
> first GETATTR from server for every READ and compare mtime. That
> will be much stronger cache coherency but very slow because every
> READ will first be preceeded by a GETATTR.
> 
> Or should this be driven by inode timeout. That is if inode cached attrs
> (including mtime) have timed out, we fetch new mtime from server and
> invalidate cache based on that.
> 
> Current logic calls fuse_update_attr() on every READ. But that method
> will result in GETATTR only if either attrs have timedout or if cached
> attrs have been invalidated.
> 
> If client B is only doing READs (and not WRITEs), then attrs should be
> valid for inode timeout interval. And that means client B will detect
> mtime change only after timeout interval.
> 
> But if client B is also doing WRITE, then once WRITE completes, we
> invalidate cached attrs. That means next READ will force GETATTR()
> and invalidate page cache. In this case client B will detect the
> change by client A much sooner but it can't differentiate between
> its own WRITEs and by another client WRITE. So every WRITE followed
> by READ will result in GETATTR, followed by page cache invalidation
> and performance suffers in mixed read/write workloads.
> 
> I am assuming that intent of auto_inval_data is to detect changes
> by another client but it can take up to "inode timeout" seconds
> to detect that change. (And it does not guarantee an immidiate change
> detection).
> 
> If above assumption is acceptable, then I am proposing this patch
> which will update attrs on READ only if attrs have timed out. This
> means every second we will do a GETATTR and invalidate page cache.
> 
> This is also suboptimal because only if client B is writing, our
> cache is still valid but we will still invalidate it after 1 second.
> But we don't have a good mechanism to differentiate between our own
> changes and another client's changes. So this is probably second
> best method to reduce the extent of issue.
> 
> I am running equivalent of following fio workload on virtiofs (cache=auto)
> and there I see a performance improvement of roughly 12%.
> 
> fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio
> +--bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
> +--output=/output/fio.txt
> 
> NAME                    WORKLOAD                Bandwidth       IOPS
> vtfs-auto-sh		randrw-psync            43.3mb/14.4mb   10.8k/3709
> vtfs-auto-sh-invaltime  randrw-psync            48.9mb/16.3mb   12.2k/4197
> 
> Signee-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  fs/fuse/dir.c    |  6 ++++++
>  fs/fuse/file.c   | 21 +++++++++++++++------
>  fs/fuse/fuse_i.h |  1 +
>  3 files changed, 22 insertions(+), 6 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Amir Goldstein Sept. 30, 2020, 11:57 a.m. UTC | #5
On Wed, Sep 30, 2020 at 9:40 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Wed, Sep 30, 2020 at 6:36 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > I wonder out loud if this change of behavior you proposed is a good opportunity
> > to introduce some of the verbs from SMB oplocks / NFS delegations into the
> > FUSE protocol in order to allow finer grained control over per-file
> > (and later also per-directory) caching behavior.
>
> That would be really nice.  Let me find a recent discussion on this...
> ah it was private.   Copying the thread below.  Thoughts?
>
> Thanks,
> Miklos
>
> On Tue, Aug 11, 2020 at 8:56 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Wed, Aug 5, 2020 at 5:53 AM Zhi Zhang <zhang.david2011@gmail.com> wrote:
> > > On Tue, Aug 4, 2020 at 11:36 AM Zhi Zhang <zhang.david2011@gmail.com> wrote:
> > > > On Mon, Aug 3, 2020 at 6:37 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > >
> > > > > On Thu, Jul 23, 2020 at 12:40 PM Zhi Zhang <zhang.david2011@gmail.com> wrote:
> >
> > > > > > We are using distributed filesystem ceph-fuse and we enabled writeback
> > > > > > cache on fuse which improves the write performance, but the file's
> > > > > > size attribute can't be updated on another client even if the users on
> > > > > > this client only read this file.
> > > > > >
> > > > > > From my understanding, if the file is not opened in write mode and
> > > > > > already writes its buffered data to userspace filesystem like
> > > > > > ceph-fuse, then its state should be clean. The upper userspace and
> > > > > > remote server should be responsible for the data and consistency. So
> > > > > > at this moment fuse could trust the attributes from the server which
> > > > > > has the most authoritative information about this file.
> > > > > >
> > > > > > Please let me know your thoughts, then I can work on this patch. Thanks.
> > > > >
> > > > > Hi,
> > > > >
> > > > > Something like this makes sense, but I think we should be adding an
> > > > > explicit state (a lease to read/write the data) to the fuse inode.
> > > > >
> > > > > Opening for write would automatically acquire the WRITE lease,
> > > > > similarly opening for read would acquire the READ lease.  Then we need
> > > > > a new notification for revoking a lease (FUSE_NOTIFY_REVOKE).  And we
> > > > > need a new request for re-acquiring a lease (FUSE_REACQUIRE).
> > > > >
> > > > > Does that make sense?

Sounds right.

Linking to MS docs for reference, because I find their documentation
most comprehensive and SMBv3.1 semantics is much richer than NFSv4,
so maybe useful examples can be found there:
https://docs.microsoft.com/en-us/windows/win32/fileio/breaking-opportunistic-locks

I would consider adopting the downgrade/upgrade terminology, because
REVOKE/REACQUIRE sounds like losing the old any taking a new lease
when in fact downgrade from WRITE to READ and vice versa is a common
case. But the name is not what matters, it's to get the functionality right.

Thanks,
Amir.

> > > > >
> > > > > Would you mind discussing this on the linux-fsdevel mailing lists?
> > > > >
> > > > > Thanks,
> > > > > Miklos>
> > > Hi Miklos,
> > >
> > > Thanks for the comments. I thought about it but I still have a couple
> > > of questions about the lease.
> > >
> > > 1. After acquiring a WRITE lease, when should we release(revoke) it?
> > > Before I assumed the file would be clean once we wrote buffered data
> > > to the userspace file system. Now if we introduce the lease, should we
> > > release the WRITE lease once we write the buffered data or we need to
> > > wait for the revoking notification from userspace file system?
> >
> > I think it's easier to wait for the notification, instead of trying to
> > guess.   When the file is closed (released) then the lease is also
> > implicitly released.
> >
> > > 2. What is the purpose of READ lease?
> > > Once we hold the READ lease, we could trust cached attributes and data
> > > until revoking notification from userspace file system?
> >
> > Yes.
> >
> > > 3. What is the purpose of re-acquiring a lease and why do we need a new request?
> > > From my understanding, the lease mechanism is only known by kernel
> > > fuse, not for libfuse.
> >
> > We don't necessarily need a new request, it could be implicit in the
> > first uncached write.
> >
> > > To re-acquire a lease here is actually for READ
> > > lease by sending a sync getattr request.
> >
> > Yes.
> >
> > Thanks,
> > Miklos
Vivek Goyal Sept. 30, 2020, 1:02 p.m. UTC | #6
On Wed, Sep 30, 2020 at 07:35:57AM +0300, Amir Goldstein wrote:
> On Tue, Sep 29, 2020 at 9:52 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > Following commit added a flag to invalidate guest page cache automatically.
> >
> > 72d0d248ca823 fuse: add FUSE_AUTO_INVAL_DATA init flag
> >
> > Idea seemed to be that for network file systmes if client A modifies
> > the file, then client B should be able to detect that mtime of file
> > change and invalidate its own cache and fetch new data from server.
> >
> > There are few questions/issues with this method.
> >
> > How soon client B able to detect that file has changed. Should it
> > first GETATTR from server for every READ and compare mtime. That
> > will be much stronger cache coherency but very slow because every
> > READ will first be preceeded by a GETATTR.
> >
> > Or should this be driven by inode timeout. That is if inode cached attrs
> > (including mtime) have timed out, we fetch new mtime from server and
> > invalidate cache based on that.
> >
> > Current logic calls fuse_update_attr() on every READ. But that method
> > will result in GETATTR only if either attrs have timedout or if cached
> > attrs have been invalidated.
> >
> > If client B is only doing READs (and not WRITEs), then attrs should be
> > valid for inode timeout interval. And that means client B will detect
> > mtime change only after timeout interval.
> >
> > But if client B is also doing WRITE, then once WRITE completes, we
> > invalidate cached attrs. That means next READ will force GETATTR()
> > and invalidate page cache. In this case client B will detect the
> > change by client A much sooner but it can't differentiate between
> > its own WRITEs and by another client WRITE. So every WRITE followed
> > by READ will result in GETATTR, followed by page cache invalidation
> > and performance suffers in mixed read/write workloads.
> >
> > I am assuming that intent of auto_inval_data is to detect changes
> > by another client but it can take up to "inode timeout" seconds
> > to detect that change. (And it does not guarantee an immidiate change
> > detection).
> >
> > If above assumption is acceptable, then I am proposing this patch
> > which will update attrs on READ only if attrs have timed out. This
> > means every second we will do a GETATTR and invalidate page cache.
> >
> > This is also suboptimal because only if client B is writing, our
> > cache is still valid but we will still invalidate it after 1 second.
> > But we don't have a good mechanism to differentiate between our own
> > changes and another client's changes. So this is probably second
> > best method to reduce the extent of issue.
> >
> 
> I was under the impression that virtiofs in now in the stage of stabilizing the
> "all changes are from this client and no local changes on server" use case.

Looks like that kubernetes is allowed to drop some files in host directory
while it is being shared with guest. And I will not be surprised that if
kata is already doing some very limited amount of modification on
directory on host.

For virtiofs we have both the use cases. For container images, "no
sharing" assumption should work probably reasonably fine. But then
we also need to address other use case of sharing volumes between
containers and there other clients can modify shared directory.

> Is that the case? I remember you also had an idea to communicate that this
> is the use case on connection setup time for SB_NOSEC which did not happen.

Given we have both the use cases and I am not 100% sure if kata is not
doing any modifications on host, I thought not to pursue this line of
thought that no modifications are allowed on host. It will be very
limiting if kata/kubernetes decide to drop small files or make other
small changes on host.

> 
> If that is the case, why use auto_inval_data at all for virtiofs and not
> explicit_inval_data?
> Is that because you do want to allow local changes on the server?

Yes. Atleast want to keep that possibility open. We know that there is
a demand for this other mode too.

If it ever becomes clear that for container image case we don't need
any modifications on server, then I can easily add an option to virtiofsd
and disable auto_inval_data for that use case. Having said that, we
still need to optimize auto_inval_data case. Its inconsistent with
itself. A client's own WRITE will invalidate its cache.

> 
> I wonder out loud if this change of behavior you proposed is a good opportunity
> to introduce some of the verbs from SMB oplocks / NFS delegations into the
> FUSE protocol in order to allow finer grained control over per-file
> (and later also
> per-directory) caching behavior.

May be. How will NFS delegation help with cache invalidation issue. I
mean if client B has the lease and modifying file, then client A will
still need to know when client B has modified the file and invalidate
its own caches.

I don't know anything about SMB oplocks and know very little about NFS
delegation.

Thanks
Vivek
Amir Goldstein Sept. 30, 2020, 1:19 p.m. UTC | #7
On Wed, Sep 30, 2020 at 4:02 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Wed, Sep 30, 2020 at 07:35:57AM +0300, Amir Goldstein wrote:
> > On Tue, Sep 29, 2020 at 9:52 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > Following commit added a flag to invalidate guest page cache automatically.
> > >
> > > 72d0d248ca823 fuse: add FUSE_AUTO_INVAL_DATA init flag
> > >
> > > Idea seemed to be that for network file systmes if client A modifies
> > > the file, then client B should be able to detect that mtime of file
> > > change and invalidate its own cache and fetch new data from server.
> > >
> > > There are few questions/issues with this method.
> > >
> > > How soon client B able to detect that file has changed. Should it
> > > first GETATTR from server for every READ and compare mtime. That
> > > will be much stronger cache coherency but very slow because every
> > > READ will first be preceeded by a GETATTR.
> > >
> > > Or should this be driven by inode timeout. That is if inode cached attrs
> > > (including mtime) have timed out, we fetch new mtime from server and
> > > invalidate cache based on that.
> > >
> > > Current logic calls fuse_update_attr() on every READ. But that method
> > > will result in GETATTR only if either attrs have timedout or if cached
> > > attrs have been invalidated.
> > >
> > > If client B is only doing READs (and not WRITEs), then attrs should be
> > > valid for inode timeout interval. And that means client B will detect
> > > mtime change only after timeout interval.
> > >
> > > But if client B is also doing WRITE, then once WRITE completes, we
> > > invalidate cached attrs. That means next READ will force GETATTR()
> > > and invalidate page cache. In this case client B will detect the
> > > change by client A much sooner but it can't differentiate between
> > > its own WRITEs and by another client WRITE. So every WRITE followed
> > > by READ will result in GETATTR, followed by page cache invalidation
> > > and performance suffers in mixed read/write workloads.
> > >
> > > I am assuming that intent of auto_inval_data is to detect changes
> > > by another client but it can take up to "inode timeout" seconds
> > > to detect that change. (And it does not guarantee an immidiate change
> > > detection).
> > >
> > > If above assumption is acceptable, then I am proposing this patch
> > > which will update attrs on READ only if attrs have timed out. This
> > > means every second we will do a GETATTR and invalidate page cache.
> > >
> > > This is also suboptimal because only if client B is writing, our
> > > cache is still valid but we will still invalidate it after 1 second.
> > > But we don't have a good mechanism to differentiate between our own
> > > changes and another client's changes. So this is probably second
> > > best method to reduce the extent of issue.
> > >
> >
> > I was under the impression that virtiofs in now in the stage of stabilizing the
> > "all changes are from this client and no local changes on server" use case.
>
> Looks like that kubernetes is allowed to drop some files in host directory
> while it is being shared with guest. And I will not be surprised that if
> kata is already doing some very limited amount of modification on
> directory on host.
>
> For virtiofs we have both the use cases. For container images, "no
> sharing" assumption should work probably reasonably fine. But then
> we also need to address other use case of sharing volumes between
> containers and there other clients can modify shared directory.
>
> > Is that the case? I remember you also had an idea to communicate that this
> > is the use case on connection setup time for SB_NOSEC which did not happen.
>
> Given we have both the use cases and I am not 100% sure if kata is not
> doing any modifications on host, I thought not to pursue this line of
> thought that no modifications are allowed on host. It will be very
> limiting if kata/kubernetes decide to drop small files or make other
> small changes on host.
>
> >
> > If that is the case, why use auto_inval_data at all for virtiofs and not
> > explicit_inval_data?
> > Is that because you do want to allow local changes on the server?
>
> Yes. Atleast want to keep that possibility open. We know that there is
> a demand for this other mode too.
>
> If it ever becomes clear that for container image case we don't need
> any modifications on server, then I can easily add an option to virtiofsd
> and disable auto_inval_data for that use case. Having said that, we
> still need to optimize auto_inval_data case. Its inconsistent with
> itself. A client's own WRITE will invalidate its cache.
>
> >
> > I wonder out loud if this change of behavior you proposed is a good opportunity
> > to introduce some of the verbs from SMB oplocks / NFS delegations into the
> > FUSE protocol in order to allow finer grained control over per-file
> > (and later also
> > per-directory) caching behavior.
>
> May be. How will NFS delegation help with cache invalidation issue. I
> mean if client B has the lease and modifying file, then client A will
> still need to know when client B has modified the file and invalidate
> its own caches.

I think it goes a bit something like this:

B can ask and get a WRITE lease, if no other client has a READ lease.
Then it can do writeback cache and read cache.

If A is opening a file for read, then client B lease needs to be "broken"
or "revoked" and acknowledged by client B *before* client A open returns,
which means B needs to flush all its cached writes and start doing uncached
writes.

Both clients A and B can be granted a READ lease for cached reads if
there are no writers. The first open for write will break the READ leases,
but there is no need to wait when breaking READ leases.

>
> I don't know anything about SMB oplocks and know very little about NFS
> delegation.
>

I don't know that much either, I just know they are meant to close
the "knowledge gap" that you describe in your use case.
See email from Miklos about more specific details.

Thanks,
Amir.
Miklos Szeredi Sept. 30, 2020, 1:38 p.m. UTC | #8
On Wed, Sep 30, 2020 at 3:02 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Wed, Sep 30, 2020 at 07:35:57AM +0300, Amir Goldstein wrote:

> > I wonder out loud if this change of behavior you proposed is a good opportunity
> > to introduce some of the verbs from SMB oplocks / NFS delegations into the
> > FUSE protocol in order to allow finer grained control over per-file
> > (and later also
> > per-directory) caching behavior.
>
> May be. How will NFS delegation help with cache invalidation issue. I
> mean if client B has the lease and modifying file, then client A will
> still need to know when client B has modified the file and invalidate
> its own caches.

Right.   So the way it would work in the mixed read-write workload is
that the client tries to obtain a WRITE lease for the file.

a) it could obtain the lease: it can do cached reads and writes
without having to go to the server until the cache is flushed
naturally or the lease is revoked.

b) it can't obtain the lease: client will disable cache and let server
manage coherency (cache=none basically).

For a read-only workload the same happens but only a READ lease is
requested.  The nice thing  is that multiple entities can have a READ
lease on the file as long as it is not being modified.

So this is best of both worlds, optimized for the non-contended case,
but still consistent for the case when the file is being read/modified
by more than one entity.

Thanks,
Miklos
diff mbox series

Patch

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 26f028bc760b..5c4eda0edd95 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1002,6 +1002,12 @@  int fuse_update_attributes(struct inode *inode, struct file *file)
 				    STATX_BASIC_STATS & ~STATX_ATIME, 0);
 }
 
+int fuse_update_attributes_timeout(struct inode *inode, struct file *file)
+{
+	/* Only refresh attrs if timeout has happened */
+	return fuse_update_get_attr(inode, file, NULL, 0, 0);
+}
+
 int fuse_reverse_inval_entry(struct super_block *sb, u64 parent_nodeid,
 			     u64 child_nodeid, struct qstr *name)
 {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6611ef3269a8..dea001f5f4e9 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -972,19 +972,28 @@  static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
+	int err = 0;
 
 	/*
-	 * In auto invalidate mode, always update attributes on read.
+	 * In auto invalidate mode, update attributes on read if previously
+	 * stored attributes timed out. This is primarily done to detect
+	 * writes by other clients and invalidate local cache. But we don't
+	 * have a way to differentiate between our own writes and writes
+	 * by other clients. So we end up updating attrs and invalidating
+	 * cache on our own write. Stick to the theme of detecting changes
+	 * after timeout. This is what happens already if we are not doing
+	 * writes but other client is doing.
+	 *
 	 * Otherwise, only update if we attempt to read past EOF (to ensure
 	 * i_size is up to date).
 	 */
-	if (fc->auto_inval_data ||
-	    (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
-		int err;
+	if (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode)) {
 		err = fuse_update_attributes(inode, iocb->ki_filp);
-		if (err)
-			return err;
+	} else if (fc->auto_inval_data) {
+		err = fuse_update_attributes_timeout(inode, iocb->ki_filp);
 	}
+	if (err)
+		return err;
 
 	return generic_file_read_iter(iocb, to);
 }
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 740a8a7d7ae6..78f93129b60e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1004,6 +1004,7 @@  u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
 void fuse_update_ctime(struct inode *inode);
 
 int fuse_update_attributes(struct inode *inode, struct file *file);
+int fuse_update_attributes_timeout(struct inode *inode, struct file *file);
 
 void fuse_flush_writepages(struct inode *inode);