Message ID | 20200929185015.GG220516@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] fuse: update attributes on read() only on timeout | expand |
On Tue, Sep 29, 2020 at 02:50:15PM -0400, Vivek Goyal wrote: > Following commit added a flag to invalidate guest page cache automatically. > > 72d0d248ca823 fuse: add FUSE_AUTO_INVAL_DATA init flag > > Idea seemed to be that for network file systmes if client A modifies > the file, then client B should be able to detect that mtime of file > change and invalidate its own cache and fetch new data from server. > > There are few questions/issues with this method. > > How soon client B able to detect that file has changed. Should it > first GETATTR from server for every READ and compare mtime. That > will be much stronger cache coherency but very slow because every > READ will first be preceeded by a GETATTR. > > Or should this be driven by inode timeout. That is if inode cached attrs > (including mtime) have timed out, we fetch new mtime from server and > invalidate cache based on that. > > Current logic calls fuse_update_attr() on every READ. But that method > will result in GETATTR only if either attrs have timedout or if cached > attrs have been invalidated. > > If client B is only doing READs (and not WRITEs), then attrs should be > valid for inode timeout interval. And that means client B will detect > mtime change only after timeout interval. > > But if client B is also doing WRITE, then once WRITE completes, we > invalidate cached attrs. That means next READ will force GETATTR() > and invalidate page cache. In this case client B will detect the > change by client A much sooner but it can't differentiate between > its own WRITEs and by another client WRITE. So every WRITE followed > by READ will result in GETATTR, followed by page cache invalidation > and performance suffers in mixed read/write workloads. > > I am assuming that intent of auto_inval_data is to detect changes > by another client but it can take up to "inode timeout" seconds > to detect that change. (And it does not guarantee an immidiate change > detection). > > If above assumption is acceptable, then I am proposing this patch > which will update attrs on READ only if attrs have timed out. This > means every second we will do a GETATTR and invalidate page cache. > > This is also suboptimal because only if client B is writing, our > cache is still valid but we will still invalidate it after 1 second. > But we don't have a good mechanism to differentiate between our own > changes and another client's changes. So this is probably second > best method to reduce the extent of issue. > > I am running equivalent of following fio workload on virtiofs (cache=auto) > and there I see a performance improvement of roughly 12%. > > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio > +--bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 > +--output=/output/fio.txt > > NAME WORKLOAD Bandwidth IOPS > vtfs-auto-sh randrw-psync 43.3mb/14.4mb 10.8k/3709 > vtfs-auto-sh-invaltime randrw-psync 48.9mb/16.3mb 12.2k/4197 Also ran a kernel compilation test. With this change, there is roughly 10% improvement in build time. - git clone linux - cd linux - sync; echo 3 > /proc/sys/vm/drop_cache - make defconfig - time make -j32 Without patch ------------- real 4m57.819s user 23m2.432s sys 18m39.004s With patch ---------- real 4m33.549s user 23m4.168s sys 18m36.515s Thanks Vivek > > Signee-off-by: Vivek Goyal <vgoyal@redhat.com> > --- > fs/fuse/dir.c | 6 ++++++ > fs/fuse/file.c | 21 +++++++++++++++------ > fs/fuse/fuse_i.h | 1 + > 3 files changed, 22 insertions(+), 6 deletions(-) > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > index 26f028bc760b..5c4eda0edd95 100644 > --- a/fs/fuse/dir.c > +++ b/fs/fuse/dir.c > @@ -1002,6 +1002,12 @@ int fuse_update_attributes(struct inode *inode, struct file *file) > STATX_BASIC_STATS & ~STATX_ATIME, 0); > } > > +int fuse_update_attributes_timeout(struct inode *inode, struct file *file) > +{ > + /* Only refresh attrs if timeout has happened */ > + return fuse_update_get_attr(inode, file, NULL, 0, 0); > +} > + > int fuse_reverse_inval_entry(struct super_block *sb, u64 parent_nodeid, > u64 child_nodeid, struct qstr *name) > { > diff --git a/fs/fuse/file.c b/fs/fuse/file.c > index 6611ef3269a8..dea001f5f4e9 100644 > --- a/fs/fuse/file.c > +++ b/fs/fuse/file.c > @@ -972,19 +972,28 @@ static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to) > { > struct inode *inode = iocb->ki_filp->f_mapping->host; > struct fuse_conn *fc = get_fuse_conn(inode); > + int err = 0; > > /* > - * In auto invalidate mode, always update attributes on read. > + * In auto invalidate mode, update attributes on read if previously > + * stored attributes timed out. This is primarily done to detect > + * writes by other clients and invalidate local cache. But we don't > + * have a way to differentiate between our own writes and writes > + * by other clients. So we end up updating attrs and invalidating > + * cache on our own write. Stick to the theme of detecting changes > + * after timeout. This is what happens already if we are not doing > + * writes but other client is doing. > + * > * Otherwise, only update if we attempt to read past EOF (to ensure > * i_size is up to date). > */ > - if (fc->auto_inval_data || > - (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) { > - int err; > + if (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode)) { > err = fuse_update_attributes(inode, iocb->ki_filp); > - if (err) > - return err; > + } else if (fc->auto_inval_data) { > + err = fuse_update_attributes_timeout(inode, iocb->ki_filp); > } > + if (err) > + return err; > > return generic_file_read_iter(iocb, to); > } > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > index 740a8a7d7ae6..78f93129b60e 100644 > --- a/fs/fuse/fuse_i.h > +++ b/fs/fuse/fuse_i.h > @@ -1004,6 +1004,7 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id); > void fuse_update_ctime(struct inode *inode); > > int fuse_update_attributes(struct inode *inode, struct file *file); > +int fuse_update_attributes_timeout(struct inode *inode, struct file *file); > > void fuse_flush_writepages(struct inode *inode); > > -- > 2.25.4 >
On Tue, Sep 29, 2020 at 9:52 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > Following commit added a flag to invalidate guest page cache automatically. > > 72d0d248ca823 fuse: add FUSE_AUTO_INVAL_DATA init flag > > Idea seemed to be that for network file systmes if client A modifies > the file, then client B should be able to detect that mtime of file > change and invalidate its own cache and fetch new data from server. > > There are few questions/issues with this method. > > How soon client B able to detect that file has changed. Should it > first GETATTR from server for every READ and compare mtime. That > will be much stronger cache coherency but very slow because every > READ will first be preceeded by a GETATTR. > > Or should this be driven by inode timeout. That is if inode cached attrs > (including mtime) have timed out, we fetch new mtime from server and > invalidate cache based on that. > > Current logic calls fuse_update_attr() on every READ. But that method > will result in GETATTR only if either attrs have timedout or if cached > attrs have been invalidated. > > If client B is only doing READs (and not WRITEs), then attrs should be > valid for inode timeout interval. And that means client B will detect > mtime change only after timeout interval. > > But if client B is also doing WRITE, then once WRITE completes, we > invalidate cached attrs. That means next READ will force GETATTR() > and invalidate page cache. In this case client B will detect the > change by client A much sooner but it can't differentiate between > its own WRITEs and by another client WRITE. So every WRITE followed > by READ will result in GETATTR, followed by page cache invalidation > and performance suffers in mixed read/write workloads. > > I am assuming that intent of auto_inval_data is to detect changes > by another client but it can take up to "inode timeout" seconds > to detect that change. (And it does not guarantee an immidiate change > detection). > > If above assumption is acceptable, then I am proposing this patch > which will update attrs on READ only if attrs have timed out. This > means every second we will do a GETATTR and invalidate page cache. > > This is also suboptimal because only if client B is writing, our > cache is still valid but we will still invalidate it after 1 second. > But we don't have a good mechanism to differentiate between our own > changes and another client's changes. So this is probably second > best method to reduce the extent of issue. > I was under the impression that virtiofs in now in the stage of stabilizing the "all changes are from this client and no local changes on server" use case. Is that the case? I remember you also had an idea to communicate that this is the use case on connection setup time for SB_NOSEC which did not happen. If that is the case, why use auto_inval_data at all for virtiofs and not explicit_inval_data? Is that because you do want to allow local changes on the server? I wonder out loud if this change of behavior you proposed is a good opportunity to introduce some of the verbs from SMB oplocks / NFS delegations into the FUSE protocol in order to allow finer grained control over per-file (and later also per-directory) caching behavior. Thanks, Amir.
On Wed, Sep 30, 2020 at 6:36 AM Amir Goldstein <amir73il@gmail.com> wrote: > I wonder out loud if this change of behavior you proposed is a good opportunity > to introduce some of the verbs from SMB oplocks / NFS delegations into the > FUSE protocol in order to allow finer grained control over per-file > (and later also per-directory) caching behavior. That would be really nice. Let me find a recent discussion on this... ah it was private. Copying the thread below. Thoughts? Thanks, Miklos On Tue, Aug 11, 2020 at 8:56 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Wed, Aug 5, 2020 at 5:53 AM Zhi Zhang <zhang.david2011@gmail.com> wrote: > > On Tue, Aug 4, 2020 at 11:36 AM Zhi Zhang <zhang.david2011@gmail.com> wrote: > > > On Mon, Aug 3, 2020 at 6:37 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > > > > On Thu, Jul 23, 2020 at 12:40 PM Zhi Zhang <zhang.david2011@gmail.com> wrote: > > > > > > We are using distributed filesystem ceph-fuse and we enabled writeback > > > > > cache on fuse which improves the write performance, but the file's > > > > > size attribute can't be updated on another client even if the users on > > > > > this client only read this file. > > > > > > > > > > From my understanding, if the file is not opened in write mode and > > > > > already writes its buffered data to userspace filesystem like > > > > > ceph-fuse, then its state should be clean. The upper userspace and > > > > > remote server should be responsible for the data and consistency. So > > > > > at this moment fuse could trust the attributes from the server which > > > > > has the most authoritative information about this file. > > > > > > > > > > Please let me know your thoughts, then I can work on this patch. Thanks. > > > > > > > > Hi, > > > > > > > > Something like this makes sense, but I think we should be adding an > > > > explicit state (a lease to read/write the data) to the fuse inode. > > > > > > > > Opening for write would automatically acquire the WRITE lease, > > > > similarly opening for read would acquire the READ lease. Then we need > > > > a new notification for revoking a lease (FUSE_NOTIFY_REVOKE). And we > > > > need a new request for re-acquiring a lease (FUSE_REACQUIRE). > > > > > > > > Does that make sense? > > > > > > > > Would you mind discussing this on the linux-fsdevel mailing lists? > > > > > > > > Thanks, > > > > Miklos> > > Hi Miklos, > > > > Thanks for the comments. I thought about it but I still have a couple > > of questions about the lease. > > > > 1. After acquiring a WRITE lease, when should we release(revoke) it? > > Before I assumed the file would be clean once we wrote buffered data > > to the userspace file system. Now if we introduce the lease, should we > > release the WRITE lease once we write the buffered data or we need to > > wait for the revoking notification from userspace file system? > > I think it's easier to wait for the notification, instead of trying to > guess. When the file is closed (released) then the lease is also > implicitly released. > > > 2. What is the purpose of READ lease? > > Once we hold the READ lease, we could trust cached attributes and data > > until revoking notification from userspace file system? > > Yes. > > > 3. What is the purpose of re-acquiring a lease and why do we need a new request? > > From my understanding, the lease mechanism is only known by kernel > > fuse, not for libfuse. > > We don't necessarily need a new request, it could be implicit in the > first uncached write. > > > To re-acquire a lease here is actually for READ > > lease by sending a sync getattr request. > > Yes. > > Thanks, > Miklos
On Tue, Sep 29, 2020 at 02:50:15PM -0400, Vivek Goyal wrote: > Following commit added a flag to invalidate guest page cache automatically. > > 72d0d248ca823 fuse: add FUSE_AUTO_INVAL_DATA init flag > > Idea seemed to be that for network file systmes if client A modifies > the file, then client B should be able to detect that mtime of file > change and invalidate its own cache and fetch new data from server. > > There are few questions/issues with this method. > > How soon client B able to detect that file has changed. Should it > first GETATTR from server for every READ and compare mtime. That > will be much stronger cache coherency but very slow because every > READ will first be preceeded by a GETATTR. > > Or should this be driven by inode timeout. That is if inode cached attrs > (including mtime) have timed out, we fetch new mtime from server and > invalidate cache based on that. > > Current logic calls fuse_update_attr() on every READ. But that method > will result in GETATTR only if either attrs have timedout or if cached > attrs have been invalidated. > > If client B is only doing READs (and not WRITEs), then attrs should be > valid for inode timeout interval. And that means client B will detect > mtime change only after timeout interval. > > But if client B is also doing WRITE, then once WRITE completes, we > invalidate cached attrs. That means next READ will force GETATTR() > and invalidate page cache. In this case client B will detect the > change by client A much sooner but it can't differentiate between > its own WRITEs and by another client WRITE. So every WRITE followed > by READ will result in GETATTR, followed by page cache invalidation > and performance suffers in mixed read/write workloads. > > I am assuming that intent of auto_inval_data is to detect changes > by another client but it can take up to "inode timeout" seconds > to detect that change. (And it does not guarantee an immidiate change > detection). > > If above assumption is acceptable, then I am proposing this patch > which will update attrs on READ only if attrs have timed out. This > means every second we will do a GETATTR and invalidate page cache. > > This is also suboptimal because only if client B is writing, our > cache is still valid but we will still invalidate it after 1 second. > But we don't have a good mechanism to differentiate between our own > changes and another client's changes. So this is probably second > best method to reduce the extent of issue. > > I am running equivalent of following fio workload on virtiofs (cache=auto) > and there I see a performance improvement of roughly 12%. > > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio > +--bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 > +--output=/output/fio.txt > > NAME WORKLOAD Bandwidth IOPS > vtfs-auto-sh randrw-psync 43.3mb/14.4mb 10.8k/3709 > vtfs-auto-sh-invaltime randrw-psync 48.9mb/16.3mb 12.2k/4197 > > Signee-off-by: Vivek Goyal <vgoyal@redhat.com> > --- > fs/fuse/dir.c | 6 ++++++ > fs/fuse/file.c | 21 +++++++++++++++------ > fs/fuse/fuse_i.h | 1 + > 3 files changed, 22 insertions(+), 6 deletions(-) Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
On Wed, Sep 30, 2020 at 9:40 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Wed, Sep 30, 2020 at 6:36 AM Amir Goldstein <amir73il@gmail.com> wrote: > > > I wonder out loud if this change of behavior you proposed is a good opportunity > > to introduce some of the verbs from SMB oplocks / NFS delegations into the > > FUSE protocol in order to allow finer grained control over per-file > > (and later also per-directory) caching behavior. > > That would be really nice. Let me find a recent discussion on this... > ah it was private. Copying the thread below. Thoughts? > > Thanks, > Miklos > > On Tue, Aug 11, 2020 at 8:56 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > On Wed, Aug 5, 2020 at 5:53 AM Zhi Zhang <zhang.david2011@gmail.com> wrote: > > > On Tue, Aug 4, 2020 at 11:36 AM Zhi Zhang <zhang.david2011@gmail.com> wrote: > > > > On Mon, Aug 3, 2020 at 6:37 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > > > > > > On Thu, Jul 23, 2020 at 12:40 PM Zhi Zhang <zhang.david2011@gmail.com> wrote: > > > > > > > > We are using distributed filesystem ceph-fuse and we enabled writeback > > > > > > cache on fuse which improves the write performance, but the file's > > > > > > size attribute can't be updated on another client even if the users on > > > > > > this client only read this file. > > > > > > > > > > > > From my understanding, if the file is not opened in write mode and > > > > > > already writes its buffered data to userspace filesystem like > > > > > > ceph-fuse, then its state should be clean. The upper userspace and > > > > > > remote server should be responsible for the data and consistency. So > > > > > > at this moment fuse could trust the attributes from the server which > > > > > > has the most authoritative information about this file. > > > > > > > > > > > > Please let me know your thoughts, then I can work on this patch. Thanks. > > > > > > > > > > Hi, > > > > > > > > > > Something like this makes sense, but I think we should be adding an > > > > > explicit state (a lease to read/write the data) to the fuse inode. > > > > > > > > > > Opening for write would automatically acquire the WRITE lease, > > > > > similarly opening for read would acquire the READ lease. Then we need > > > > > a new notification for revoking a lease (FUSE_NOTIFY_REVOKE). And we > > > > > need a new request for re-acquiring a lease (FUSE_REACQUIRE). > > > > > > > > > > Does that make sense? Sounds right. Linking to MS docs for reference, because I find their documentation most comprehensive and SMBv3.1 semantics is much richer than NFSv4, so maybe useful examples can be found there: https://docs.microsoft.com/en-us/windows/win32/fileio/breaking-opportunistic-locks I would consider adopting the downgrade/upgrade terminology, because REVOKE/REACQUIRE sounds like losing the old any taking a new lease when in fact downgrade from WRITE to READ and vice versa is a common case. But the name is not what matters, it's to get the functionality right. Thanks, Amir. > > > > > > > > > > Would you mind discussing this on the linux-fsdevel mailing lists? > > > > > > > > > > Thanks, > > > > > Miklos> > > > Hi Miklos, > > > > > > Thanks for the comments. I thought about it but I still have a couple > > > of questions about the lease. > > > > > > 1. After acquiring a WRITE lease, when should we release(revoke) it? > > > Before I assumed the file would be clean once we wrote buffered data > > > to the userspace file system. Now if we introduce the lease, should we > > > release the WRITE lease once we write the buffered data or we need to > > > wait for the revoking notification from userspace file system? > > > > I think it's easier to wait for the notification, instead of trying to > > guess. When the file is closed (released) then the lease is also > > implicitly released. > > > > > 2. What is the purpose of READ lease? > > > Once we hold the READ lease, we could trust cached attributes and data > > > until revoking notification from userspace file system? > > > > Yes. > > > > > 3. What is the purpose of re-acquiring a lease and why do we need a new request? > > > From my understanding, the lease mechanism is only known by kernel > > > fuse, not for libfuse. > > > > We don't necessarily need a new request, it could be implicit in the > > first uncached write. > > > > > To re-acquire a lease here is actually for READ > > > lease by sending a sync getattr request. > > > > Yes. > > > > Thanks, > > Miklos
On Wed, Sep 30, 2020 at 07:35:57AM +0300, Amir Goldstein wrote: > On Tue, Sep 29, 2020 at 9:52 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > Following commit added a flag to invalidate guest page cache automatically. > > > > 72d0d248ca823 fuse: add FUSE_AUTO_INVAL_DATA init flag > > > > Idea seemed to be that for network file systmes if client A modifies > > the file, then client B should be able to detect that mtime of file > > change and invalidate its own cache and fetch new data from server. > > > > There are few questions/issues with this method. > > > > How soon client B able to detect that file has changed. Should it > > first GETATTR from server for every READ and compare mtime. That > > will be much stronger cache coherency but very slow because every > > READ will first be preceeded by a GETATTR. > > > > Or should this be driven by inode timeout. That is if inode cached attrs > > (including mtime) have timed out, we fetch new mtime from server and > > invalidate cache based on that. > > > > Current logic calls fuse_update_attr() on every READ. But that method > > will result in GETATTR only if either attrs have timedout or if cached > > attrs have been invalidated. > > > > If client B is only doing READs (and not WRITEs), then attrs should be > > valid for inode timeout interval. And that means client B will detect > > mtime change only after timeout interval. > > > > But if client B is also doing WRITE, then once WRITE completes, we > > invalidate cached attrs. That means next READ will force GETATTR() > > and invalidate page cache. In this case client B will detect the > > change by client A much sooner but it can't differentiate between > > its own WRITEs and by another client WRITE. So every WRITE followed > > by READ will result in GETATTR, followed by page cache invalidation > > and performance suffers in mixed read/write workloads. > > > > I am assuming that intent of auto_inval_data is to detect changes > > by another client but it can take up to "inode timeout" seconds > > to detect that change. (And it does not guarantee an immidiate change > > detection). > > > > If above assumption is acceptable, then I am proposing this patch > > which will update attrs on READ only if attrs have timed out. This > > means every second we will do a GETATTR and invalidate page cache. > > > > This is also suboptimal because only if client B is writing, our > > cache is still valid but we will still invalidate it after 1 second. > > But we don't have a good mechanism to differentiate between our own > > changes and another client's changes. So this is probably second > > best method to reduce the extent of issue. > > > > I was under the impression that virtiofs in now in the stage of stabilizing the > "all changes are from this client and no local changes on server" use case. Looks like that kubernetes is allowed to drop some files in host directory while it is being shared with guest. And I will not be surprised that if kata is already doing some very limited amount of modification on directory on host. For virtiofs we have both the use cases. For container images, "no sharing" assumption should work probably reasonably fine. But then we also need to address other use case of sharing volumes between containers and there other clients can modify shared directory. > Is that the case? I remember you also had an idea to communicate that this > is the use case on connection setup time for SB_NOSEC which did not happen. Given we have both the use cases and I am not 100% sure if kata is not doing any modifications on host, I thought not to pursue this line of thought that no modifications are allowed on host. It will be very limiting if kata/kubernetes decide to drop small files or make other small changes on host. > > If that is the case, why use auto_inval_data at all for virtiofs and not > explicit_inval_data? > Is that because you do want to allow local changes on the server? Yes. Atleast want to keep that possibility open. We know that there is a demand for this other mode too. If it ever becomes clear that for container image case we don't need any modifications on server, then I can easily add an option to virtiofsd and disable auto_inval_data for that use case. Having said that, we still need to optimize auto_inval_data case. Its inconsistent with itself. A client's own WRITE will invalidate its cache. > > I wonder out loud if this change of behavior you proposed is a good opportunity > to introduce some of the verbs from SMB oplocks / NFS delegations into the > FUSE protocol in order to allow finer grained control over per-file > (and later also > per-directory) caching behavior. May be. How will NFS delegation help with cache invalidation issue. I mean if client B has the lease and modifying file, then client A will still need to know when client B has modified the file and invalidate its own caches. I don't know anything about SMB oplocks and know very little about NFS delegation. Thanks Vivek
On Wed, Sep 30, 2020 at 4:02 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Wed, Sep 30, 2020 at 07:35:57AM +0300, Amir Goldstein wrote: > > On Tue, Sep 29, 2020 at 9:52 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > Following commit added a flag to invalidate guest page cache automatically. > > > > > > 72d0d248ca823 fuse: add FUSE_AUTO_INVAL_DATA init flag > > > > > > Idea seemed to be that for network file systmes if client A modifies > > > the file, then client B should be able to detect that mtime of file > > > change and invalidate its own cache and fetch new data from server. > > > > > > There are few questions/issues with this method. > > > > > > How soon client B able to detect that file has changed. Should it > > > first GETATTR from server for every READ and compare mtime. That > > > will be much stronger cache coherency but very slow because every > > > READ will first be preceeded by a GETATTR. > > > > > > Or should this be driven by inode timeout. That is if inode cached attrs > > > (including mtime) have timed out, we fetch new mtime from server and > > > invalidate cache based on that. > > > > > > Current logic calls fuse_update_attr() on every READ. But that method > > > will result in GETATTR only if either attrs have timedout or if cached > > > attrs have been invalidated. > > > > > > If client B is only doing READs (and not WRITEs), then attrs should be > > > valid for inode timeout interval. And that means client B will detect > > > mtime change only after timeout interval. > > > > > > But if client B is also doing WRITE, then once WRITE completes, we > > > invalidate cached attrs. That means next READ will force GETATTR() > > > and invalidate page cache. In this case client B will detect the > > > change by client A much sooner but it can't differentiate between > > > its own WRITEs and by another client WRITE. So every WRITE followed > > > by READ will result in GETATTR, followed by page cache invalidation > > > and performance suffers in mixed read/write workloads. > > > > > > I am assuming that intent of auto_inval_data is to detect changes > > > by another client but it can take up to "inode timeout" seconds > > > to detect that change. (And it does not guarantee an immidiate change > > > detection). > > > > > > If above assumption is acceptable, then I am proposing this patch > > > which will update attrs on READ only if attrs have timed out. This > > > means every second we will do a GETATTR and invalidate page cache. > > > > > > This is also suboptimal because only if client B is writing, our > > > cache is still valid but we will still invalidate it after 1 second. > > > But we don't have a good mechanism to differentiate between our own > > > changes and another client's changes. So this is probably second > > > best method to reduce the extent of issue. > > > > > > > I was under the impression that virtiofs in now in the stage of stabilizing the > > "all changes are from this client and no local changes on server" use case. > > Looks like that kubernetes is allowed to drop some files in host directory > while it is being shared with guest. And I will not be surprised that if > kata is already doing some very limited amount of modification on > directory on host. > > For virtiofs we have both the use cases. For container images, "no > sharing" assumption should work probably reasonably fine. But then > we also need to address other use case of sharing volumes between > containers and there other clients can modify shared directory. > > > Is that the case? I remember you also had an idea to communicate that this > > is the use case on connection setup time for SB_NOSEC which did not happen. > > Given we have both the use cases and I am not 100% sure if kata is not > doing any modifications on host, I thought not to pursue this line of > thought that no modifications are allowed on host. It will be very > limiting if kata/kubernetes decide to drop small files or make other > small changes on host. > > > > > If that is the case, why use auto_inval_data at all for virtiofs and not > > explicit_inval_data? > > Is that because you do want to allow local changes on the server? > > Yes. Atleast want to keep that possibility open. We know that there is > a demand for this other mode too. > > If it ever becomes clear that for container image case we don't need > any modifications on server, then I can easily add an option to virtiofsd > and disable auto_inval_data for that use case. Having said that, we > still need to optimize auto_inval_data case. Its inconsistent with > itself. A client's own WRITE will invalidate its cache. > > > > > I wonder out loud if this change of behavior you proposed is a good opportunity > > to introduce some of the verbs from SMB oplocks / NFS delegations into the > > FUSE protocol in order to allow finer grained control over per-file > > (and later also > > per-directory) caching behavior. > > May be. How will NFS delegation help with cache invalidation issue. I > mean if client B has the lease and modifying file, then client A will > still need to know when client B has modified the file and invalidate > its own caches. I think it goes a bit something like this: B can ask and get a WRITE lease, if no other client has a READ lease. Then it can do writeback cache and read cache. If A is opening a file for read, then client B lease needs to be "broken" or "revoked" and acknowledged by client B *before* client A open returns, which means B needs to flush all its cached writes and start doing uncached writes. Both clients A and B can be granted a READ lease for cached reads if there are no writers. The first open for write will break the READ leases, but there is no need to wait when breaking READ leases. > > I don't know anything about SMB oplocks and know very little about NFS > delegation. > I don't know that much either, I just know they are meant to close the "knowledge gap" that you describe in your use case. See email from Miklos about more specific details. Thanks, Amir.
On Wed, Sep 30, 2020 at 3:02 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Wed, Sep 30, 2020 at 07:35:57AM +0300, Amir Goldstein wrote: > > I wonder out loud if this change of behavior you proposed is a good opportunity > > to introduce some of the verbs from SMB oplocks / NFS delegations into the > > FUSE protocol in order to allow finer grained control over per-file > > (and later also > > per-directory) caching behavior. > > May be. How will NFS delegation help with cache invalidation issue. I > mean if client B has the lease and modifying file, then client A will > still need to know when client B has modified the file and invalidate > its own caches. Right. So the way it would work in the mixed read-write workload is that the client tries to obtain a WRITE lease for the file. a) it could obtain the lease: it can do cached reads and writes without having to go to the server until the cache is flushed naturally or the lease is revoked. b) it can't obtain the lease: client will disable cache and let server manage coherency (cache=none basically). For a read-only workload the same happens but only a READ lease is requested. The nice thing is that multiple entities can have a READ lease on the file as long as it is not being modified. So this is best of both worlds, optimized for the non-contended case, but still consistent for the case when the file is being read/modified by more than one entity. Thanks, Miklos
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 26f028bc760b..5c4eda0edd95 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1002,6 +1002,12 @@ int fuse_update_attributes(struct inode *inode, struct file *file) STATX_BASIC_STATS & ~STATX_ATIME, 0); } +int fuse_update_attributes_timeout(struct inode *inode, struct file *file) +{ + /* Only refresh attrs if timeout has happened */ + return fuse_update_get_attr(inode, file, NULL, 0, 0); +} + int fuse_reverse_inval_entry(struct super_block *sb, u64 parent_nodeid, u64 child_nodeid, struct qstr *name) { diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 6611ef3269a8..dea001f5f4e9 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -972,19 +972,28 @@ static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to) { struct inode *inode = iocb->ki_filp->f_mapping->host; struct fuse_conn *fc = get_fuse_conn(inode); + int err = 0; /* - * In auto invalidate mode, always update attributes on read. + * In auto invalidate mode, update attributes on read if previously + * stored attributes timed out. This is primarily done to detect + * writes by other clients and invalidate local cache. But we don't + * have a way to differentiate between our own writes and writes + * by other clients. So we end up updating attrs and invalidating + * cache on our own write. Stick to the theme of detecting changes + * after timeout. This is what happens already if we are not doing + * writes but other client is doing. + * * Otherwise, only update if we attempt to read past EOF (to ensure * i_size is up to date). */ - if (fc->auto_inval_data || - (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) { - int err; + if (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode)) { err = fuse_update_attributes(inode, iocb->ki_filp); - if (err) - return err; + } else if (fc->auto_inval_data) { + err = fuse_update_attributes_timeout(inode, iocb->ki_filp); } + if (err) + return err; return generic_file_read_iter(iocb, to); } diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 740a8a7d7ae6..78f93129b60e 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -1004,6 +1004,7 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id); void fuse_update_ctime(struct inode *inode); int fuse_update_attributes(struct inode *inode, struct file *file); +int fuse_update_attributes_timeout(struct inode *inode, struct file *file); void fuse_flush_writepages(struct inode *inode);