
[v6,2/2] fuse: add new function to invalidate cache for all inodes

Message ID: 20250217133228.24405-3-luis@igalia.com
State: New
Series: fuse: allow notify_inval for all inodes

Commit Message

Luis Henriques Feb. 17, 2025, 1:32 p.m. UTC
Currently userspace is able to notify the kernel to invalidate the cache
for an inode.  This means that, if all the inodes in a filesystem need to
be invalidated, then userspace needs to iterate through all of them and do
this kernel notification separately.

This patch adds a new option that allows userspace to invalidate all the
inodes with a single notification operation.  In addition to invalidating
all the inodes, it also shrinks the sb dcache.
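
For illustration, the server side could then boil down to a single call
(sketch only; it assumes libfuse's low-level helper passes a zero nodeid
through unchanged):

	/* nodeid FUSE_INVAL_ALL_INODES (0) now means "all inodes" */
	fuse_lowlevel_notify_inval_inode(se, FUSE_INVAL_ALL_INODES, 0, 0);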

Signed-off-by: Luis Henriques <luis@igalia.com>
---
 fs/fuse/inode.c           | 34 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/fuse.h |  3 +++
 2 files changed, 37 insertions(+)

Comments

Dave Chinner Feb. 18, 2025, 12:55 a.m. UTC | #1
On Mon, Feb 17, 2025 at 01:32:28PM +0000, Luis Henriques wrote:
> Currently userspace is able to notify the kernel to invalidate the cache
> for an inode.  This means that, if all the inodes in a filesystem need to
> be invalidated, then userspace needs to iterate through all of them and do
> this kernel notification separately.
> 
> This patch adds a new option that allows userspace to invalidate all the
> inodes with a single notification operation.  In addition to invalidating
> all the inodes, it also shrinks the sb dcache.

You still haven't justified why we should be exposing this
functionality in a low level filesystem ioctl out of sight of the
VFS.

User driven VFS cache invalidation has long been considered a
DOS-in-waiting, hence we don't allow user APIs to invalidate caches
like this. This is one of the reasons that /proc/sys/vm/drop_caches
requires root access - it's system debug and problem triage
functionality, not a production system interface....

Every other situation where filesystems invalidate vfs caches is
during mount, remount or unmount operations.  Without actually
explaining how this functionality is controlled and how user abuse
is not possible (e.g. explain the permission model and/or how only
root can run this operation), it is not really possible to determine
whether we should unconditionally allow VFS cache invalidation outside
of the existing operation scope....

Finally, given that the VFS can only do best-effort invalidation
and won't provide FUSE (or any other filesystem) with any cache
invalidation guarantees outside of specific mount and unmount
contexts, I'm not convinced that this is actually worth anything...

-Dave.
Miklos Szeredi Feb. 18, 2025, 9:07 a.m. UTC | #2
On Mon, 17 Feb 2025 at 14:32, Luis Henriques <luis@igalia.com> wrote:
>
> Currently userspace is able to notify the kernel to invalidate the cache
> for an inode.  This means that, if all the inodes in a filesystem need to
> be invalidated, then userspace needs to iterate through all of them and do
> this kernel notification separately.
>
> This patch adds a new option that allows userspace to invalidate all the
> inodes with a single notification operation.  In addition to invalidating
> all the inodes, it also shrinks the sb dcache.

What is your use case for dropping all inodes?

There's a reason for doing cache shrinking that I know of, and that's
about resources held by the fuse server (e.g. open files) for each
cached inode.  In that case what's really needed is something that
tells the fuse client (the kernel or in case of virtiofs, the guest
kernel) to get rid of the N least recently used inodes.


> Signed-off-by: Luis Henriques <luis@igalia.com>
> ---
>  fs/fuse/inode.c           | 34 ++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fuse.h |  3 +++
>  2 files changed, 37 insertions(+)
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index e9db2cb8c150..64fa0806e97d 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -547,6 +547,37 @@ struct inode *fuse_ilookup(struct fuse_conn *fc, u64 nodeid,
>         return NULL;
>  }
>
> +static int fuse_reverse_inval_all(struct fuse_conn *fc)
> +{
> +       struct fuse_mount *fm;
> +       struct inode *inode;
> +
> +       inode = fuse_ilookup(fc, FUSE_ROOT_ID, &fm);
> +       if (!inode || !fm)
> +               return -ENOENT;
> +       iput(inode);
> +
> +       /* Remove all possible active references to cached inodes */
> +       shrink_dcache_sb(fm->sb);
> +
> +       /* Remove all unreferenced inodes from cache */
> +       invalidate_inodes(fm->sb);

After a dcache shrink, this is unnecessary.  See " .drop_inode =
generic_delete_inode," in fs/fuse/inode.c
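
For reference, that super operation is trivial:

	/* fs/inode.c: makes iput_final() evict the inode instead of caching it */
	int generic_delete_inode(struct inode *inode)
	{
		return 1;
	}

so once the dcache is shrunk there are no unreferenced fuse inodes left in
the cache for invalidate_inodes() to find.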

Thanks,
Miklos
Miklos Szeredi Feb. 18, 2025, 9:15 a.m. UTC | #3
On Tue, 18 Feb 2025 at 01:55, Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Feb 17, 2025 at 01:32:28PM +0000, Luis Henriques wrote:
> > Currently userspace is able to notify the kernel to invalidate the cache
> > for an inode.  This means that, if all the inodes in a filesystem need to
> > be invalidated, then userspace needs to iterate through all of them and do
> > this kernel notification separately.
> >
> > This patch adds a new option that allows userspace to invalidate all the
> > inodes with a single notification operation.  In addition to invalidating
> > all the inodes, it also shrinks the sb dcache.
>
> You still haven't justified why we should be exposing this
> functionality in a low level filesystem ioctl out of sight of the
> VFS.
>
> User driven VFS cache invalidation has long been considered a
> DOS-in-waiting, hence we don't allow user APIs to invalidate caches
> like this. This is one of the reasons that /proc/sys/vm/drop_caches
> requires root access - it's system debug and problem triage
> functionality, not a production system interface....
>
> Every other situation where filesystems invalidate vfs caches is
> during mount, remount or unmount operations.  Without actually
> explaining how this functionality is controlled and how user abuse
> is not possible (e.g. explain the permission model and/or how only
> root can run this operation), it is not really possible to determine
> whether we should unconditionally allow VFS cache invalidation outside
> of the existing operation scope....

I think you are grabbing the wrong end of the stick here.

This is not about an arbitrary user being able to control caching
behavior of a fuse filesystem.  It's about the filesystem itself being
able to control caching behavior.

I'm not arguing for the validity of this particular patch, just saying
that something like this could be valid.  And as explained in my other
reply there's actually a real problem out there waiting for a
solution.

Thanks,
Miklos


>
> Finally, given that the VFS can only do best-effort invalidation
> and won't provide FUSE (or any other filesystem) with any cache
> invalidation guarantees outside of specific mount and unmount
> contexts, I'm not convinced that this is actually worth anything...
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
Luis Henriques Feb. 18, 2025, 10:04 a.m. UTC | #4
On Tue, Feb 18 2025, Miklos Szeredi wrote:

> On Tue, 18 Feb 2025 at 01:55, Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Mon, Feb 17, 2025 at 01:32:28PM +0000, Luis Henriques wrote:
>> > Currently userspace is able to notify the kernel to invalidate the cache
>> > for an inode.  This means that, if all the inodes in a filesystem need to
>> > be invalidated, then userspace needs to iterate through all of them and do
>> > this kernel notification separately.
>> >
>> > This patch adds a new option that allows userspace to invalidate all the
>> > inodes with a single notification operation.  In addition to invalidating
>> > all the inodes, it also shrinks the sb dcache.
>>
>> You still haven't justified why we should be exposing this
>> functionality in a low level filesystem ioctl out of sight of the
>> VFS.
>>
>> User driven VFS cache invalidation has long been considered a
>> DOS-in-waiting, hence we don't allow user APIs to invalidate caches
>> like this. This is one of the reasons that /proc/sys/vm/drop_caches
>> requires root access - it's system debug and problem triage
>> functionality, not a production system interface....
>>
>> Every other situation where filesystems invalidate vfs caches is
>> during mount, remount or unmount operations.  Without actually
>> explaining how this functionality is controlled and how user abuse
>> is not possible (e.g. explain the permission model and/or how only
>> root can run this operation), it is not really possible to determine
>> whether we should unconditionally allow VFS cache invalidation outside
>> of the existing operation scope....
>
> I think you are grabbing the wrong end of the stick here.
>
> This is not about an arbitrary user being able to control caching
> behavior of a fuse filesystem.  It's about the filesystem itself being
> able to control caching behavior.
>
> I'm not arguing for the validity of this particular patch, just saying
> that something like this could be valid.  And as explained in my other
> reply there's actually a real problem out there waiting for a
> solution.

The problem I'm trying to solve is that, if a filesystem wants to ask the
kernel to get rid of all inodes, it has to request the kernel to forget
each one, individually.  The specific filesystem I'm looking at is CVMFS,
which is a read-only filesystem that needs to be able to update the full
set of filesystem objects when a new generation snapshot becomes
available.

The obvious problem with the current solution (i.e. walking through all
the inodes) is that it is slow.  And if new snapshot generations arrive
fast enough, memory usage becomes a major issue -- enough to justify a
helper daemon that monitors memory and does a full remount when usage
passes some predefined threshold.
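
For scale, the current workaround is one upcall per nodeid -- roughly the
loop below, assuming libfuse's low-level API ('tracked'/'ntracked' stand
for whatever bookkeeping of handed-out nodeids the server keeps):

	for (size_t i = 0; i < ntracked; i++)
		fuse_lowlevel_notify_inval_inode(se, tracked[i], 0, 0);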

Obviously, I'm open to other solutions, including the one suggested by
Miklos in his other reply -- to get rid of the N LRU inodes.  I'm not sure
how that could be implemented yet, but I can have a look into that if you
think that's the right interface.

Cheers,
Miklos Szeredi Feb. 18, 2025, 10:34 a.m. UTC | #5
On Tue, 18 Feb 2025 at 11:04, Luis Henriques <luis@igalia.com> wrote:

> The problem I'm trying to solve is that, if a filesystem wants to ask the
> kernel to get rid of all inodes, it has to request the kernel to forget
> each one, individually.  The specific filesystem I'm looking at is CVMFS,
> which is a read-only filesystem that needs to be able to update the full
> set of filesystem objects when a new generation snapshot becomes
> available.

Yeah, we talked about this use case.  As I remember there was a
proposal to set an epoch, marking all objects for "revalidate needed",
which I think is a better solution to the CVMFS problem, than just
getting rid of unused objects.

Thanks,
Miklos
Luis Henriques Feb. 18, 2025, 11:51 a.m. UTC | #6
On Tue, Feb 18 2025, Miklos Szeredi wrote:

> On Tue, 18 Feb 2025 at 11:04, Luis Henriques <luis@igalia.com> wrote:
>
>> The problem I'm trying to solve is that, if a filesystem wants to ask the
>> kernel to get rid of all inodes, it has to request the kernel to forget
>> each one, individually.  The specific filesystem I'm looking at is CVMFS,
>> which is a read-only filesystem that needs to be able to update the full
>> set of filesystem objects when a new generation snapshot becomes
>> available.
>
> Yeah, we talked about this use case.  As I remember there was a
> proposal to set an epoch, marking all objects for "revalidate needed",
> which I think is a better solution to the CVMFS problem, than just
> getting rid of unused objects.

OK, so I think I'm missing some context here.  And, obviously, I also miss
some more knowledge on the filesystem itself.  But, if I understand it
correctly, the concept of 'inode' in CVMFS is very loose: when a new
snapshot generation is available (you mentioned 'epoch', which is, I
guess, the same thing) the inodes are all renewed -- the inode numbers
aren't kept between generations/epochs.

Do you have any links for such discussions, or any details on how this
proposal is being implemented?  This would probably be done mostly in
user-space I guess, but it would still need a way to get rid of the unused
inodes from old snapshots, right?  (inodes from old snapshots still in use
would obviously be kept around).

Cheers,
Miklos Szeredi Feb. 18, 2025, 2:26 p.m. UTC | #7
On Tue, 18 Feb 2025 at 12:51, Luis Henriques <luis@igalia.com> wrote:
>
> On Tue, Feb 18 2025, Miklos Szeredi wrote:
>
> > On Tue, 18 Feb 2025 at 11:04, Luis Henriques <luis@igalia.com> wrote:
> >
> >> The problem I'm trying to solve is that, if a filesystem wants to ask the
> >> kernel to get rid of all inodes, it has to request the kernel to forget
> >> each one, individually.  The specific filesystem I'm looking at is CVMFS,
> >> which is a read-only filesystem that needs to be able to update the full
> >> set of filesystem objects when a new generation snapshot becomes
> >> available.
> >
> > Yeah, we talked about this use case.  As I remember there was a
> > proposal to set an epoch, marking all objects for "revalidate needed",
> > which I think is a better solution to the CVMFS problem, than just
> > getting rid of unused objects.
>
> OK, so I think I'm missing some context here.  And, obviously, I also miss
> some more knowledge on the filesystem itself.  But, if I understand it
> correctly, the concept of 'inode' in CVMFS is very loose: when a new
> snapshot generation is available (you mentioned 'epoch', which is, I
> guess, the same thing) the inodes are all renewed -- the inode numbers
> aren't kept between generations/epochs.
>
> Do you have any links for such discussions, or any details on how this
> proposal is being implemented?  This would probably be done mostly in
> user-space I guess, but it would still need a way to get rid of the unused
> inodes from old snapshots, right?  (inodes from old snapshots still in use
> would obviously be kept around).

I don't have links.  Adding Valentin Volkl and Laura Promberger to the
Cc list, maybe they can help with clarification.

As far as I understand it would work by incrementing fc->epoch on
FUSE_INVALIDATE_ALL. When an object is looked up/created the current
epoch is copied to e.g. dentry->d_time.  fuse_dentry_revalidate() then
compares d_time with fc->epoch and forces an invalidate on mismatch.
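
Roughly (untested sketch; fc->epoch is the proposed new counter):

	/* on FUSE_INVALIDATE_ALL */
	atomic_inc(&fc->epoch);

	/* at lookup/create time */
	entry->d_time = atomic_read(&fc->epoch);

	/* in fuse_dentry_revalidate() */
	if (entry->d_time < atomic_read(&fc->epoch))
		goto invalid;	/* force a fresh lookup */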

Only problem with this is that it seems very CVMFS specific, but I
guess so is your proposal.

Implementing the LRU purge is more generally useful, but I'm not sure
if that helps CVMFS, since it would only get rid of unused objects.

Thanks,
Miklos
Luis Henriques Feb. 18, 2025, 6:11 p.m. UTC | #8
On Tue, Feb 18 2025, Miklos Szeredi wrote:

> On Tue, 18 Feb 2025 at 12:51, Luis Henriques <luis@igalia.com> wrote:
>>
>> On Tue, Feb 18 2025, Miklos Szeredi wrote:
>>
>> > On Tue, 18 Feb 2025 at 11:04, Luis Henriques <luis@igalia.com> wrote:
>> >
>> >> The problem I'm trying to solve is that, if a filesystem wants to ask the
>> >> kernel to get rid of all inodes, it has to request the kernel to forget
>> >> each one, individually.  The specific filesystem I'm looking at is CVMFS,
>> >> which is a read-only filesystem that needs to be able to update the full
>> >> set of filesystem objects when a new generation snapshot becomes
>> >> available.
>> >
>> > Yeah, we talked about this use case.  As I remember there was a
>> > proposal to set an epoch, marking all objects for "revalidate needed",
>> > which I think is a better solution to the CVMFS problem, than just
>> > getting rid of unused objects.
>>
>> OK, so I think I'm missing some context here.  And, obviously, I also miss
>> some more knowledge on the filesystem itself.  But, if I understand it
>> correctly, the concept of 'inode' in CVMFS is very loose: when a new
>> snapshot generation is available (you mentioned 'epoch', which is, I
>> guess, the same thing) the inodes are all renewed -- the inode numbers
>> aren't kept between generations/epochs.
>>
>> Do you have any links for such discussions, or any details on how this
>> proposal is being implemented?  This would probably be done mostly in
>> user-space I guess, but it would still need a way to get rid of the unused
>> inodes from old snapshots, right?  (inodes from old snapshots still in use
>> would obviously be kept around).
>
> I don't have links.  Adding Valentin Volkl and Laura Promberger to the
> Cc list, maybe they can help with clarification.
>
> As far as I understand it would work by incrementing fc->epoch on
> FUSE_INVALIDATE_ALL. When an object is looked up/created the current
> epoch is copied to e.g. dentry->d_time.  fuse_dentry_revalidate() then
> compares d_time with fc->epoch and forces an invalidate on mismatch.

OK, so hopefully Valentin or Laura will be able to help providing some
more details.  But, from your description, we would still require this
FUSE_INVALIDATE_ALL operation to exist in order to increment the epoch.
And this new operation could do that *and* also already invalidate those
unused objects.

> Only problem with this is that it seems very CVMFS specific, but I
> guess so is your proposal.
>
> Implementing the LRU purge is more generally useful, but I'm not sure
> if that helps CVMFS, since it would only get rid of unused objects.

The LRU inodes purge can indeed work for me as well, because my patch is
also only getting rid of unused objects, right?  Any inode still being
referenced will be kept around.

So, based on your reply, let me try to summarize a possible alternative
solution, that I think would be useful for CVMFS but also generic enough
for other filesystems:

- Add a new operation FUSE_INVAL_LRU_INODES, which would get rid of, at
  most, 'N' unused inodes.
  
- This operation would have an argument 'N' with the maximum number of
  inodes to invalidate.

- In addition, it would also increment this new fuse_connection attribute
  'epoch', to be used in the dentry revalidation as you suggested above

- This 'N' could also be set to a pre-#define'ed value that would mean
  *all* (unused) inodes.  (A possible payload layout is sketched below.)
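
Concretely, the notify payload could be as small as this (a sketch; the
struct and macro names are invented):

	struct fuse_notify_inval_lru_out {
		uint64_t	count;		/* max unused inodes to drop */
		uint64_t	padding;
	};

	#define FUSE_INVAL_LRU_ALL	(~0ULL)	/* 'N' meaning "all unused" */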

Does this make sense?  Would something like this be acceptable?

Cheers,
Dave Chinner Feb. 18, 2025, 9:29 p.m. UTC | #9
On Tue, Feb 18, 2025 at 03:26:24PM +0100, Miklos Szeredi wrote:
> On Tue, 18 Feb 2025 at 12:51, Luis Henriques <luis@igalia.com> wrote:
> >
> > On Tue, Feb 18 2025, Miklos Szeredi wrote:
> >
> > > On Tue, 18 Feb 2025 at 11:04, Luis Henriques <luis@igalia.com> wrote:
> > >
> > >> The problem I'm trying to solve is that, if a filesystem wants to ask the
> > >> kernel to get rid of all inodes, it has to request the kernel to forget
> > >> each one, individually.  The specific filesystem I'm looking at is CVMFS,
> > >> which is a read-only filesystem that needs to be able to update the full
> > >> set of filesystem objects when a new generation snapshot becomes
> > >> available.
> > >
> > > Yeah, we talked about this use case.  As I remember there was a
> > > proposal to set an epoch, marking all objects for "revalidate needed",
> > > which I think is a better solution to the CVMFS problem, than just
> > > getting rid of unused objects.
> >
> > OK, so I think I'm missing some context here.  And, obviously, I also miss
> > some more knowledge on the filesystem itself.  But, if I understand it
> > correctly, the concept of 'inode' in CVMFS is very loose: when a new
> > snapshot generation is available (you mentioned 'epoch', which is, I
> > guess, the same thing) the inodes are all renewed -- the inode numbers
> > aren't kept between generations/epochs.
> >
> > Do you have any links for such discussions, or any details on how this
> > proposal is being implemented?  This would probably be done mostly in
> > user-space I guess, but it would still need a way to get rid of the unused
> > inodes from old snapshots, right?  (inodes from old snapshots still in use
> > would obviously be kept around).
> 
> I don't have links.  Adding Valentin Volkl and Laura Promberger to the
> Cc list, maybe they can help with clarification.
> 
> As far as I understand it would work by incrementing fc->epoch on
> FUSE_INVALIDATE_ALL. When an object is looked up/created the current
> epoch is copied to e.g. dentry->d_time.  fuse_dentry_revalidate() then
> compares d_time with fc->epoch and forces an invalidate on mismatch.

That was what I was going to suggest before I got this far in the
thread. i.e. the fs itself keeps a current snapshot ID, and it keeps
track of what snapshot ID each inode was instantiated for. On access
to the inode it checks the IDs and if they don't match, the inode
needs revalidation and then the fs can take appropriate action.

This also takes care of the active inode reference problem that
invalidation has, in that it can revoke access to inodes that are
no longer valid as soon as they are accessed by userspace.

It is also entirely internal to the filesystem (as I suggested such
functionality should be earlier in the thread) and does not require
walking the VFS to find all cached inodes...

-Dave.
Dave Chinner Feb. 18, 2025, 9:44 p.m. UTC | #10
On Tue, Feb 18, 2025 at 10:15:26AM +0100, Miklos Szeredi wrote:
> On Tue, 18 Feb 2025 at 01:55, Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Feb 17, 2025 at 01:32:28PM +0000, Luis Henriques wrote:
> > > Currently userspace is able to notify the kernel to invalidate the cache
> > > for an inode.  This means that, if all the inodes in a filesystem need to
> > > be invalidated, then userspace needs to iterate through all of them and do
> > > this kernel notification separately.
> > >
> > > This patch adds a new option that allows userspace to invalidate all the
> > > inodes with a single notification operation.  In addition to invalidating
> > > all the inodes, it also shrinks the sb dcache.
> >
> > You still haven't justified why we should be exposing this
> > functionality in a low level filesystem ioctl out of sight of the
> > VFS.
> >
> > User driven VFS cache invalidation has long been considered a
> > DOS-in-waiting, hence we don't allow user APIs to invalidate caches
> > like this. This is one of the reasons that /proc/sys/vm/drop_caches
> > requires root access - it's system debug and problem triage
> > functionality, not a production system interface....
> >
> > Every other situation where filesystems invalidate vfs caches is
> > during mount, remount or unmount operations.  Without actually
> > explaining how this functionality is controlled and how user abuse
> > is not possible (e.g. explain the permission model and/or how only
> > root can run this operation), it is not really possible to determine
> > whether we should unconditionally allow VFS cache invalidation outside
> > of the existing operation scope....
> 
> I think you are grabbing the wrong end of the stick here.
> 
> This is not about an arbitrary user being able to control caching
> behavior of a fuse filesystem.

Except the filesystem is under control of some user, yes? Nobody has
explained why such functionality is actually safe to expose to user
controlled filesystems from a security and permissions perspective.
At minimum, this needs to be explained in the commit message so when
it gets abused years down the track we have a record of why we
thought this was a safe thing to do...

> It's about the filesystem itself being
> able to control caching behavior.

I'll give the same response regardless of the filesystem - walking
the list of all VFS cached inodes to invalidate them -does not work-
to invalidate the entire VFS inode cache. That guarantee is only
provided in unmount and mount contexts once we have guaranteed that
there are no remaining active references to any cached inode.

i.e. this is functionality that -does not work- as the filesystem
might want it to work. revoke() is what is needed to do full VFS
cache invalidations, but that does not exist....

> I'm not arguing for the validity of this particular patch, just saying
> that something like this could be valid.  And as explained in my other
> reply there's actually a real problem out there waiting for a
> solution.

And I've already pointed out that the solution lies within that
filesystem implementation, not the VFS.

i.e. If a filesystem needs to invalidate all existing inode objects
it has instantiated regardless of whether they are actively
referenced or not by other VFS objects, then the filesystem itself
needs to implement the invalidation mechanism itself to ensure
persistently referenced VFS inodes are correctly handled....

-Dave.
Dave Chinner Feb. 18, 2025, 10:05 p.m. UTC | #11
On Tue, Feb 18, 2025 at 06:11:17PM +0000, Luis Henriques wrote:
> On Tue, Feb 18 2025, Miklos Szeredi wrote:
> 
> > On Tue, 18 Feb 2025 at 12:51, Luis Henriques <luis@igalia.com> wrote:
> >>
> >> On Tue, Feb 18 2025, Miklos Szeredi wrote:
> >>
> >> > On Tue, 18 Feb 2025 at 11:04, Luis Henriques <luis@igalia.com> wrote:
> >> >
> >> >> The problem I'm trying to solve is that, if a filesystem wants to ask the
> >> >> kernel to get rid of all inodes, it has to request the kernel to forget
> >> >> each one, individually.  The specific filesystem I'm looking at is CVMFS,
> >> >> which is a read-only filesystem that needs to be able to update the full
> >> >> set of filesystem objects when a new generation snapshot becomes
> >> >> available.
> >> >
> >> > Yeah, we talked about this use case.  As I remember there was a
> >> > proposal to set an epoch, marking all objects for "revalidate needed",
> >> > which I think is a better solution to the CVMFS problem, than just
> >> > getting rid of unused objects.
> >>
> >> OK, so I think I'm missing some context here.  And, obviously, I also miss
> >> some more knowledge on the filesystem itself.  But, if I understand it
> >> correctly, the concept of 'inode' in CVMFS is very loose: when a new
> >> snapshot generation is available (you mentioned 'epoch', which is, I
> >> guess, the same thing) the inodes are all renewed -- the inode numbers
> >> aren't kept between generations/epochs.
> >>
> >> Do you have any links for such discussions, or any details on how this
> >> proposal is being implemented?  This would probably be done mostly in
> >> user-space I guess, but it would still need a way to get rid of the unused
> >> inodes from old snapshots, right?  (inodes from old snapshots still in use
> >> would obviously be kept around).
> >
> > I don't have links.  Adding Valentin Volkl and Laura Promberger to the
> > Cc list, maybe they can help with clarification.
> >
> > As far as I understand it would work by incrementing fc->epoch on
> > FUSE_INVALIDATE_ALL. When an object is looked up/created the current
> > epoch is copied to e.g. dentry->d_time.  fuse_dentry_revalidate() then
> > compares d_time with fc->epoch and forces an invalidate on mismatch.
> 
> OK, so hopefully Valentin or Laura will be able to help providing some
> more details.  But, from your description, we would still require this
> FUSE_INVALIDATE_ALL operation to exist in order to increment the epoch.
> And this new operation could do that *and* also already invalidate those
> unused objects.

I think you are still looking at this from the wrong direction.

Invalidation is -not the operation- that is being requested. The
CVMFS fuse server needs to update some global state in the kernel
side fuse mount (i.e. the snapshot ID/epoch), and the need to evict
cached inodes from previous IDs is a CVMFS implementation
optimisation related to changing the global state.

> > Only problem with this is that it seems very CVMFS specific, but I
> > guess so is your proposal.
> >
> > Implementing the LRU purge is more generally useful, but I'm not sure
> > if that helps CVMFS, since it would only get rid of unused objects.
> 
> The LRU inodes purge can indeed work for me as well, because my patch is
> also only getting rid of unused objects, right?  Any inode still being
> referenced will be kept around.
> 
> So, based on your reply, let me try to summarize a possible alternative
> solution, that I think would be useful for CVMFS but also generic enough
> for other filesystems:
> 
> - Add a new operation FUSE_INVAL_LRU_INODES, which would get rid of, at
>   most, 'N' unused inodes.
>
> - This operation would have an argument 'N' with the maximum number of
>   inodes to invalidate.
>
> - In addition, it would also increment this new fuse_connection attribute
>   'epoch', to be used in the dentry revalidation as you suggested above

As per above: invalidation is an implementation optimisation for the
CVMFS epoch update. Invalidation, OTOH, does not imply that any fuse
mount/connection global state (e.g. the epoch) needs to change...

i.e. the operation should be FUSE_UPDATE_EPOCH, not
FUSE_INVAL_LRU_INODES...
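
Conceptually that is nothing more than this (sketch; the notify handler
case is invented):

	case FUSE_NOTIFY_UPDATE_EPOCH:
		atomic_inc(&fc->epoch);		/* no eviction implied */
		return 0;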

> 
> - This 'N' could also be set to a pre-#define'ed value that would mean
>   *all* (unused) inodes.

Saying "only invalidate N inodes" makes no sense to me - it is
fundamentally impossible for userspace to get right. Either the
epoch update should evict all unreferenced inodes immediately, or it
should leave them all behind to be purged by memory pressure or
other periodic garbage collection mechanisms.

-Dave.
Luis Henriques Feb. 19, 2025, 11:23 a.m. UTC | #12
On Wed, Feb 19 2025, Dave Chinner wrote:

> On Tue, Feb 18, 2025 at 06:11:17PM +0000, Luis Henriques wrote:
>> On Tue, Feb 18 2025, Miklos Szeredi wrote:
>> 
>> > On Tue, 18 Feb 2025 at 12:51, Luis Henriques <luis@igalia.com> wrote:
>> >>
>> >> On Tue, Feb 18 2025, Miklos Szeredi wrote:
>> >>
>> >> > On Tue, 18 Feb 2025 at 11:04, Luis Henriques <luis@igalia.com> wrote:
>> >> >
>> >> >> The problem I'm trying to solve is that, if a filesystem wants to ask the
>> >> >> kernel to get rid of all inodes, it has to request the kernel to forget
>> >> >> each one, individually.  The specific filesystem I'm looking at is CVMFS,
>> >> >> which is a read-only filesystem that needs to be able to update the full
>> >> >> set of filesystem objects when a new generation snapshot becomes
>> >> >> available.
>> >> >
>> >> > Yeah, we talked about this use case.  As I remember there was a
>> >> > proposal to set an epoch, marking all objects for "revalidate needed",
>> >> > which I think is a better solution to the CVMFS problem, than just
>> >> > getting rid of unused objects.
>> >>
>> >> OK, so I think I'm missing some context here.  And, obviously, I also miss
>> >> some more knowledge on the filesystem itself.  But, if I understand it
>> >> correctly, the concept of 'inode' in CVMFS is very loose: when a new
>> >> snapshot generation is available (you mentioned 'epoch', which is, I
>> >> guess, the same thing) the inodes are all renewed -- the inode numbers
>> >> aren't kept between generations/epochs.
>> >>
>> >> Do you have any links for such discussions, or any details on how this
>> >> proposal is being implemented?  This would probably be done mostly in
>> >> user-space I guess, but it would still need a way to get rid of the unused
>> >> inodes from old snapshots, right?  (inodes from old snapshots still in use
>> >> would obviously be kept around).
>> >
>> > I don't have links.  Adding Valentin Volkl and Laura Promberger to the
>> > Cc list, maybe they can help with clarification.
>> >
>> > As far as I understand it would work by incrementing fc->epoch on
>> > FUSE_INVALIDATE_ALL. When an object is looked up/created the current
>> > epoch is copied to e.g. dentry->d_time.  fuse_dentry_revalidate() then
>> > compares d_time with fc->epoch and forces an invalidate on mismatch.
>> 
>> OK, so hopefully Valentin or Laura will be able to help providing some
>> more details.  But, from your description, we would still require this
>> FUSE_INVALIDATE_ALL operation to exist in order to increment the epoch.
>> And this new operation could do that *and* also already invalidate those
>> unused objects.
>
> I think you are still looking at this from the wrong direction.
>
> Invalidation is -not the operation- that is being requested. The
> CVMFS fuse server needs to update some global state in the kernel
> side fuse mount (i.e. the snapshot ID/epoch), and the need to evict
> cached inodes from previous IDs is a CVMFS implementation
> optimisation related to changing the global state.
>
>> > Only problem with this is that it seems very CVMFS specific, but I
>> > guess so is your proposal.
>> >
>> > Implementing the LRU purge is more generally useful, but I'm not sure
>> > if that helps CVMFS, since it would only get rid of unused objects.
>> 
>> The LRU inodes purge can indeed work for me as well, because my patch is
>> also only getting rid of unused objects, right?  Any inode still being
>> referenced will be kept around.
>> 
>> So, based on your reply, let me try to summarize a possible alternative
>> solution, that I think would be useful for CVMFS but also generic enough
>> for other filesystems:
>> 
>> - Add a new operation FUSE_INVAL_LRU_INODES, which would get rid of, at
>>   most, 'N' unused inodes.
>>
>> - This operation would have an argument 'N' with the maximum number of
>>   inodes to invalidate.
>>
>> - In addition, it would also increment this new fuse_connection attribute
>>   'epoch', to be used in the dentry revalidation as you suggested above
>
> As per above: invalidation is an implementation optimisation for the
> CVMFS epoch update. Invalidation, OTOH, does not imply that any fuse
> mount/connection global state (e.g. the epoch) needs to change...
>
> i.e. the operation should be FUSE_UPDATE_EPOCH, not
> FUSE_INVAL_LRU_INODES...
>
>> 
>> - This 'N' could also be set to a pre-#define'ed value that would mean
>>   *all* (unused) inodes.
>
> Saying "only invalidate N inodes" makes no sense to me - it is
> fundamentally impossible for userspace to get right. Either the
> epoch update should evict all unreferenced inodes immediately, or it
> should leave them all behind to be purged by memory pressure or
> other periodic garbage collection mechanisms.

So, below I've a patch that is totally untested (not even compile-tested).
It's unlikely to be fully correct, but I just wanted to make sure I got
the main idea right.

What I'm trying to do there is to initialize this new 'epoch'
counter, both in the fuse connection and in every new dentry.  Then, in
the ->d_revalidate() it simply invalidates a dentry if the epochs don't
match.  Then, there's the new fuse notify operation to increment the
epoch and shrink dcache (dropped the call to {evict,invalidate}_inodes()
as Miklos suggested elsewhere).

Does this look reasonable?

(I may be missing other places where epoch should be checked or
initialized.)

Cheers,
Miklos Szeredi Feb. 19, 2025, 3:39 p.m. UTC | #13
On Wed, 19 Feb 2025 at 12:23, Luis Henriques <luis@igalia.com> wrote:

> +static int fuse_notify_update_epoch(struct fuse_conn *fc)
> +{
> +       struct fuse_mount *fm;
> +       struct inode *inode;
> +
> +       inode = fuse_ilookup(fc, FUSE_ROOT_ID, &fm);
> +       if (!inode || !fm)
> +               return -ENOENT;
> +
> +       iput(inode);
> +       atomic_inc(&fc->epoch);
> +       shrink_dcache_sb(fm->sb);

This is just an optimization and could be racy, kicking out valid
cache (harmlessly of course).  I'd leave it out of the first version.

There could be more than one fuse_mount instance.  Wondering if epoch
should be per-fm not per-fc...

> @@ -204,6 +204,12 @@ static int fuse_dentry_revalidate(struct inode *dir, const struct qstr *name,
>         int ret;
>
>         inode = d_inode_rcu(entry);
> +       if (inode) {
> +               fm = get_fuse_mount(inode);
> +               if (entry->d_time < atomic_read(&fm->fc->epoch))
> +                       goto invalid;
> +       }

Negative dentries need to be invalidated too.

> @@ -446,6 +452,12 @@ static struct dentry *fuse_lookup(struct inode *dir, struct dentry *entry,
>                 goto out_err;
>
>         entry = newent ? newent : entry;
> +       if (inode) {
> +               struct fuse_mount *fm = get_fuse_mount(inode);
> +               entry->d_time = atomic_read(&fm->fc->epoch);
> +       } else {
> +               entry->d_time = 0;
> +       }

Again, should do the same for positive and negative dentries.

Need to read out fc->epoch before sending the request to the server,
otherwise might get a stale dentry with an updated epoch.
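
I.e. roughly (sketch):

	epoch = atomic_read(&fm->fc->epoch);	/* sample before the request */
	err = fuse_simple_request(fm, &args);	/* FUSE_LOOKUP round trip */
	if (!err)
		entry->d_time = epoch;		/* store the sampled value */

so an epoch bump that races with the lookup leaves d_time behind fc->epoch
and the dentry is invalidated on the next revalidation.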

This also needs to be done in fuse_create_open(), create_new_entry()
and fuse_direntplus_link().

Thanks,
Miklos
Luis Henriques Feb. 19, 2025, 4:31 p.m. UTC | #14
On Wed, Feb 19 2025, Miklos Szeredi wrote:

> On Wed, 19 Feb 2025 at 12:23, Luis Henriques <luis@igalia.com> wrote:
>
>> +static int fuse_notify_update_epoch(struct fuse_conn *fc)
>> +{
>> +       struct fuse_mount *fm;
>> +       struct inode *inode;
>> +
>> +       inode = fuse_ilookup(fc, FUSE_ROOT_ID, &fm);
>> +       if (!inode || !fm)
>> +               return -ENOENT;
>> +
>> +       iput(inode);
>> +       atomic_inc(&fc->epoch);
>> +       shrink_dcache_sb(fm->sb);
>
> This is just an optimization and could be racy, kicking out valid
> cache (harmlessly of course).  I'd leave it out of the first version.

OK, will do.

> There could be more than one fuse_mount instance.  Wondering if epoch
> should be per-fm not per-fc...

Good question.  Because the cache is shared among the several fuse_mount
instances, the epoch may eventually affect all of them even if it's a
per-fm attribute.  But on the other hand, different mounts could focus on
a different set of filesystem subtrees so... yeah, I'll probably leave it
in fc for now while thinking about it some more.

>> @@ -204,6 +204,12 @@ static int fuse_dentry_revalidate(struct inode *dir, const struct qstr *name,
>>         int ret;
>>
>>         inode = d_inode_rcu(entry);
>> +       if (inode) {
>> +               fm = get_fuse_mount(inode);
>> +               if (entry->d_time < atomic_read(&fm->fc->epoch))
>> +                       goto invalid;
>> +       }
>
> Negative dentries need to be invalidated too.

Ack.

>> @@ -446,6 +452,12 @@ static struct dentry *fuse_lookup(struct inode *dir, struct dentry *entry,
>>                 goto out_err;
>>
>>         entry = newent ? newent : entry;
>> +       if (inode) {
>> +               struct fuse_mount *fm = get_fuse_mount(inode);
>> +               entry->d_time = atomic_read(&fm->fc->epoch);
>> +       } else {
>> +               entry->d_time = 0;
>> +       }
>
> Again, should do the same for positive and negative dentries.
>
> Need to read out fc->epoch before sending the request to the server,
> otherwise might get a stale dentry with an updated epoch.

Ah, good point.

> This also needs to be done in fuse_create_open(), create_new_entry()
> and fuse_direntplus_link().

Yeah I suspected there were a few other places where this would be
required.  I'll look closer into that.

Thanks a lot for your feedback, Miklos.  I'll work on this new approach,
so that I can send a real patch soon.

Cheers,

Patch

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index e9db2cb8c150..64fa0806e97d 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -547,6 +547,37 @@  struct inode *fuse_ilookup(struct fuse_conn *fc, u64 nodeid,
 	return NULL;
 }
 
+static int fuse_reverse_inval_all(struct fuse_conn *fc)
+{
+	struct fuse_mount *fm;
+	struct inode *inode;
+
+	inode = fuse_ilookup(fc, FUSE_ROOT_ID, &fm);
+	if (!inode || !fm)
+		return -ENOENT;
+	iput(inode);
+
+	/* Remove all possible active references to cached inodes */
+	shrink_dcache_sb(fm->sb);
+
+	/* Remove all unreferenced inodes from cache */
+	invalidate_inodes(fm->sb);
+
+	return 0;
+}
+
+/*
+ * Notify to invalidate inodes cache.  It can be called with @nodeid set to
+ * either:
+ *
+ * - An inode number - Any pending writebacks within the range [@offset,
+ *   @offset + @len] will be triggered and the inode will be invalidated.  To
+ *   invalidate the whole cache, @offset must be set to '0' and @len must be
+ *   <= '0'; if @offset is negative, only the inode attributes are invalidated.
+ *
+ * - FUSE_INVAL_ALL_INODES - All the inodes in the superblock are invalidated
+ *   and the whole dcache is shrunk.
+ */
 int fuse_reverse_inval_inode(struct fuse_conn *fc, u64 nodeid,
 			     loff_t offset, loff_t len)
 {
@@ -555,6 +586,9 @@  int fuse_reverse_inval_inode(struct fuse_conn *fc, u64 nodeid,
 	pgoff_t pg_start;
 	pgoff_t pg_end;
 
+	if (nodeid == FUSE_INVAL_ALL_INODES)
+		return fuse_reverse_inval_all(fc);
+
 	inode = fuse_ilookup(fc, nodeid, NULL);
 	if (!inode)
 		return -ENOENT;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5e0eb41d967e..e5852b63f99f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -669,6 +669,9 @@  enum fuse_notify_code {
 	FUSE_NOTIFY_CODE_MAX,
 };
 
+/* The nodeid used to request invalidation of all inodes */
+#define FUSE_INVAL_ALL_INODES 0
+
 /* The read buffer is required to be at least 8k, but may be much larger */
 #define FUSE_MIN_READ_BUFFER 8192