diff mbox series

[3/4] fsnotify: assert that file_start_write() is not held in permission hooks

Message ID 20231207123825.4011620-4-amir73il@gmail.com (mailing list archive)
State New, archived
Headers show
Series Prepare for fsnotify pre-content permission events | expand

Commit Message

Amir Goldstein Dec. 7, 2023, 12:38 p.m. UTC
filesystem may be modified in the context of fanotify permission events
(e.g. by HSM service), so assert that sb freeze protection is not held.

If the assertion fails, then the following deadlock would be possible:

CPU0				CPU1			CPU2
-------------------------------------------------------------------------
file_start_write()#0
...
  fsnotify_perm()
    fanotify_get_response() =>	(read event and fill file)
				...
				...			freeze_super()
				...			  sb_wait_write()
				...
				vfs_write()
				  file_start_write()#1

This example demonstrates a use case of an hierarchical storage management
(HSM) service that uses fanotify permission events to fill the content of
a file before access, while a 3rd process starts fsfreeze.

This creates a circular dependeny:
  file_start_write()#0 => fanotify_get_response =>
    file_start_write()#1 =>
      sb_wait_write() =>
        file_end_write()#0

Where file_end_write()#0 can never be called and none of the threads can
make progress.

The assertion is checked for both MAY_READ and MAY_WRITE permission
hooks in preparation for a pre-modify permission event.

The assertion is not checked for an open permission event, because
do_open() takes mnt_want_write() in O_TRUNC case, meaning that it is not
safe to write to filesystem in the content of an open permission event.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 include/linux/fsnotify.h | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Jan Kara Dec. 8, 2023, 6:46 p.m. UTC | #1
On Thu 07-12-23 14:38:24, Amir Goldstein wrote:
> filesystem may be modified in the context of fanotify permission events
> (e.g. by HSM service), so assert that sb freeze protection is not held.
> 
> If the assertion fails, then the following deadlock would be possible:
> 
> CPU0				CPU1			CPU2
> -------------------------------------------------------------------------
> file_start_write()#0
> ...
>   fsnotify_perm()
>     fanotify_get_response() =>	(read event and fill file)
> 				...
> 				...			freeze_super()
> 				...			  sb_wait_write()
> 				...
> 				vfs_write()
> 				  file_start_write()#1
> 
> This example demonstrates a use case of an hierarchical storage management
> (HSM) service that uses fanotify permission events to fill the content of
> a file before access, while a 3rd process starts fsfreeze.
> 
> This creates a circular dependeny:
>   file_start_write()#0 => fanotify_get_response =>
>     file_start_write()#1 =>
>       sb_wait_write() =>
>         file_end_write()#0
> 
> Where file_end_write()#0 can never be called and none of the threads can
> make progress.
> 
> The assertion is checked for both MAY_READ and MAY_WRITE permission
> hooks in preparation for a pre-modify permission event.
> 
> The assertion is not checked for an open permission event, because
> do_open() takes mnt_want_write() in O_TRUNC case, meaning that it is not
> safe to write to filesystem in the content of an open permission event.
				     ^^^^^ context

BTW, isn't this a bit inconvenient? I mean filling file contents on open
looks quite natural... Do you plan to fill files only on individual read /
write events? I was under the impression simple HSM handlers would be doing
it on open.
 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

Anyway this particular change looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/fsnotify.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index 926bb4461b9e..0a9d6a8a747a 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -107,6 +107,13 @@ static inline int fsnotify_file_perm(struct file *file, int perm_mask)
>  {
>  	__u32 fsnotify_mask = FS_ACCESS_PERM;
>  
> +	/*
> +	 * filesystem may be modified in the context of permission events
> +	 * (e.g. by HSM filling a file on access), so sb freeze protection
> +	 * must not be held.
> +	 */
> +	lockdep_assert_once(file_write_not_started(file));
> +
>  	if (!(perm_mask & MAY_READ))
>  		return 0;
>  
> -- 
> 2.34.1
>
Amir Goldstein Dec. 8, 2023, 9:02 p.m. UTC | #2
On Fri, Dec 8, 2023 at 8:46 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 07-12-23 14:38:24, Amir Goldstein wrote:
> > filesystem may be modified in the context of fanotify permission events
> > (e.g. by HSM service), so assert that sb freeze protection is not held.
> >
> > If the assertion fails, then the following deadlock would be possible:
> >
> > CPU0                          CPU1                    CPU2
> > -------------------------------------------------------------------------
> > file_start_write()#0
> > ...
> >   fsnotify_perm()
> >     fanotify_get_response() =>        (read event and fill file)
> >                               ...
> >                               ...                     freeze_super()
> >                               ...                       sb_wait_write()
> >                               ...
> >                               vfs_write()
> >                                 file_start_write()#1
> >
> > This example demonstrates a use case of an hierarchical storage management
> > (HSM) service that uses fanotify permission events to fill the content of
> > a file before access, while a 3rd process starts fsfreeze.
> >
> > This creates a circular dependeny:
> >   file_start_write()#0 => fanotify_get_response =>
> >     file_start_write()#1 =>
> >       sb_wait_write() =>
> >         file_end_write()#0
> >
> > Where file_end_write()#0 can never be called and none of the threads can
> > make progress.
> >
> > The assertion is checked for both MAY_READ and MAY_WRITE permission
> > hooks in preparation for a pre-modify permission event.
> >
> > The assertion is not checked for an open permission event, because
> > do_open() takes mnt_want_write() in O_TRUNC case, meaning that it is not
> > safe to write to filesystem in the content of an open permission event.
>                                      ^^^^^ context
>
> BTW, isn't this a bit inconvenient? I mean filling file contents on open
> looks quite natural... Do you plan to fill files only on individual read /
> write events? I was under the impression simple HSM handlers would be doing
> it on open.
>

Naive HSMs perhaps... The problem with filling on open is that the pattern
open();fstat();close() is quite common with applications and we found open()
to be a sub-optimal predicate for near future read().

Filling the file on first read() access or directory on first
readdir() access does
a better job in "swapping in" the correct files.
A simple HSM would just fill the entire file/dir on the first PRE_ACCESS event.
that's not any more or less simple than filling it on an OPEN_PERM event.

Another point that could get lost when reading to above deadlock is that
filling the file content before open(O_TRUNC) would be really dumb,
because swap in is costly and you are going to throw away the data.
If we really wanted to provide HSM with a safe way to fill files on open,
we would probably need to report the open flags with the open event.
I actually think that reporting the open flags would be nice even with
an async open event.

Thanks,
Amir.
Jan Kara Dec. 11, 2023, 10:30 a.m. UTC | #3
On Fri 08-12-23 23:02:35, Amir Goldstein wrote:
> On Fri, Dec 8, 2023 at 8:46 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 07-12-23 14:38:24, Amir Goldstein wrote:
> > > filesystem may be modified in the context of fanotify permission events
> > > (e.g. by HSM service), so assert that sb freeze protection is not held.
> > >
> > > If the assertion fails, then the following deadlock would be possible:
> > >
> > > CPU0                          CPU1                    CPU2
> > > -------------------------------------------------------------------------
> > > file_start_write()#0
> > > ...
> > >   fsnotify_perm()
> > >     fanotify_get_response() =>        (read event and fill file)
> > >                               ...
> > >                               ...                     freeze_super()
> > >                               ...                       sb_wait_write()
> > >                               ...
> > >                               vfs_write()
> > >                                 file_start_write()#1
> > >
> > > This example demonstrates a use case of an hierarchical storage management
> > > (HSM) service that uses fanotify permission events to fill the content of
> > > a file before access, while a 3rd process starts fsfreeze.
> > >
> > > This creates a circular dependeny:
> > >   file_start_write()#0 => fanotify_get_response =>
> > >     file_start_write()#1 =>
> > >       sb_wait_write() =>
> > >         file_end_write()#0
> > >
> > > Where file_end_write()#0 can never be called and none of the threads can
> > > make progress.
> > >
> > > The assertion is checked for both MAY_READ and MAY_WRITE permission
> > > hooks in preparation for a pre-modify permission event.
> > >
> > > The assertion is not checked for an open permission event, because
> > > do_open() takes mnt_want_write() in O_TRUNC case, meaning that it is not
> > > safe to write to filesystem in the content of an open permission event.
> >                                      ^^^^^ context
> >
> > BTW, isn't this a bit inconvenient? I mean filling file contents on open
> > looks quite natural... Do you plan to fill files only on individual read /
> > write events? I was under the impression simple HSM handlers would be doing
> > it on open.
> >
> 
> Naive HSMs perhaps... The problem with filling on open is that the pattern
> open();fstat();close() is quite common with applications and we found open()
> to be a sub-optimal predicate for near future read().
> 
> Filling the file on first read() access or directory on first
> readdir() access does
> a better job in "swapping in" the correct files.
> A simple HSM would just fill the entire file/dir on the first PRE_ACCESS event.
> that's not any more or less simple than filling it on an OPEN_PERM event.
> 
> Another point that could get lost when reading to above deadlock is that
> filling the file content before open(O_TRUNC) would be really dumb,
> because swap in is costly and you are going to throw away the data.
> If we really wanted to provide HSM with a safe way to fill files on open,
> we would probably need to report the open flags with the open event.
> I actually think that reporting the open flags would be nice even with
> an async open event.

OK, thanks for explanation!

								Honza
Amir Goldstein Dec. 11, 2023, 10:57 a.m. UTC | #4
On Mon, Dec 11, 2023 at 12:30 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 08-12-23 23:02:35, Amir Goldstein wrote:
> > On Fri, Dec 8, 2023 at 8:46 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Thu 07-12-23 14:38:24, Amir Goldstein wrote:
> > > > filesystem may be modified in the context of fanotify permission events
> > > > (e.g. by HSM service), so assert that sb freeze protection is not held.
> > > >
> > > > If the assertion fails, then the following deadlock would be possible:
> > > >
> > > > CPU0                          CPU1                    CPU2
> > > > -------------------------------------------------------------------------
> > > > file_start_write()#0
> > > > ...
> > > >   fsnotify_perm()
> > > >     fanotify_get_response() =>        (read event and fill file)
> > > >                               ...
> > > >                               ...                     freeze_super()
> > > >                               ...                       sb_wait_write()
> > > >                               ...
> > > >                               vfs_write()
> > > >                                 file_start_write()#1
> > > >
> > > > This example demonstrates a use case of an hierarchical storage management
> > > > (HSM) service that uses fanotify permission events to fill the content of
> > > > a file before access, while a 3rd process starts fsfreeze.
> > > >
> > > > This creates a circular dependeny:
> > > >   file_start_write()#0 => fanotify_get_response =>
> > > >     file_start_write()#1 =>
> > > >       sb_wait_write() =>
> > > >         file_end_write()#0
> > > >
> > > > Where file_end_write()#0 can never be called and none of the threads can
> > > > make progress.
> > > >
> > > > The assertion is checked for both MAY_READ and MAY_WRITE permission
> > > > hooks in preparation for a pre-modify permission event.
> > > >
> > > > The assertion is not checked for an open permission event, because
> > > > do_open() takes mnt_want_write() in O_TRUNC case, meaning that it is not
> > > > safe to write to filesystem in the content of an open permission event.
> > >                                      ^^^^^ context
> > >
> > > BTW, isn't this a bit inconvenient? I mean filling file contents on open
> > > looks quite natural... Do you plan to fill files only on individual read /
> > > write events? I was under the impression simple HSM handlers would be doing
> > > it on open.
> > >
> >
> > Naive HSMs perhaps... The problem with filling on open is that the pattern
> > open();fstat();close() is quite common with applications and we found open()
> > to be a sub-optimal predicate for near future read().
> >
> > Filling the file on first read() access or directory on first
> > readdir() access does
> > a better job in "swapping in" the correct files.
> > A simple HSM would just fill the entire file/dir on the first PRE_ACCESS event.
> > that's not any more or less simple than filling it on an OPEN_PERM event.
> >
> > Another point that could get lost when reading to above deadlock is that
> > filling the file content before open(O_TRUNC) would be really dumb,
> > because swap in is costly and you are going to throw away the data.
> > If we really wanted to provide HSM with a safe way to fill files on open,
> > we would probably need to report the open flags with the open event.
> > I actually think that reporting the open flags would be nice even with
> > an async open event.
>
> OK, thanks for explanation!

Anyway, I will try to find a solution also for the OPEN_PERM deadlock.
I have some sketches, but ->atomic_open() complicates things...

Thanks,
Amir.
diff mbox series

Patch

diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 926bb4461b9e..0a9d6a8a747a 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -107,6 +107,13 @@  static inline int fsnotify_file_perm(struct file *file, int perm_mask)
 {
 	__u32 fsnotify_mask = FS_ACCESS_PERM;
 
+	/*
+	 * filesystem may be modified in the context of permission events
+	 * (e.g. by HSM filling a file on access), so sb freeze protection
+	 * must not be held.
+	 */
+	lockdep_assert_once(file_write_not_started(file));
+
 	if (!(perm_mask & MAY_READ))
 		return 0;