Message ID | 20230615113848.8439-1-jack@suse.cz (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fs: Protect reconfiguration of sb read-write from racing writes | expand |
On Thu, Jun 15, 2023 at 01:38:48PM +0200, Jan Kara wrote: > The reconfigure / remount code takes a lot of effort to protect > filesystem's reconfiguration code from racing writes on remounting > read-only. However during remounting read-only filesystem to read-write > mode userspace writes can start immediately once we clear SB_RDONLY > flag. This is inconvenient for example for ext4 because we need to do > some writes to the filesystem (such as preparation of quota files) > before we can take userspace writes so we are clearing SB_RDONLY flag > before we are fully ready to accept userpace writes and syzbot has found > a way to exploit this [1]. Also as far as I'm reading the code > the filesystem remount code was protected from racing writes in the > legacy mount path by the mount's MNT_READONLY flag so this is relatively > new problem. It is actually fairly easy to protect remount read-write > from racing writes using sb->s_readonly_remount flag so let's just do > that instead of having to workaround these races in the filesystem code. > > [1] https://lore.kernel.org/all/00000000000006a0df05f6667499@google.com/T/ > Signed-off-by: Jan Kara <jack@suse.cz> > --- So looking at the ext4 code this can only happen when you clear SB_RDONLY in ->reconfigure() too early (and the mount isn't MNT_READONLY). Afaict, this was fixed in: a44be64bbecb ("ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled") by clearing SB_RDONLY late, right before returning from ->reconfigure() when everything's ready. So your change is not about fixing that bug in [1] it's about making the vfs give the guarantee that an fs is free to clear SB_RDONLY because any ro<->rw transitions are protected via s_readonly_remount. Correct? It seems ok to me just making sure.
On Thu, Jun 15, 2023 at 02:53:53PM +0200, Christian Brauner wrote: > > So looking at the ext4 code this can only happen when you clear > SB_RDONLY in ->reconfigure() too early (and the mount isn't > MNT_READONLY). Afaict, this was fixed in: > > a44be64bbecb ("ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled") > > by clearing SB_RDONLY late, right before returning from ->reconfigure() > when everything's ready. So your change is not about fixing that bug in > [1] it's about making the vfs give the guarantee that an fs is free to > clear SB_RDONLY because any ro<->rw transitions are protected via > s_readonly_remount. Correct? It seems ok to me just making sure. Unfortunately we had to revert that commit because that broke r/o->r/w writes when quota was enabled. The problem is we need a way of enabling file system writes for internal purposes (e.g., because quota needs to set up quota inodes) but *not* allow userspace file system writes to occur until we are fully done with the remount process. See the discussion here: https://lore.kernel.org/all/20230608044056.GA1418535@mit.edu/ The problem with the current state of the tree is commit dea9d8f7643f ("ext4: only check dquot_initialize_needed() when debugging") has caught real bugs in the past where the caller of ext4_xattr_block_set() failed to call dquot_initialize(inode). In addition, shutting up the warning doesn't fix the problem that while we hit this race where we have started remounting r/w, quota hasn't been initialized, quota tracking will get silently dropped, leading to the quota usage tracking no longer reflecting reality. Jan's patch will fix this problem. Cheers, - Ted
On Thu 15-06-23 10:10:40, Theodore Ts'o wrote: > On Thu, Jun 15, 2023 at 02:53:53PM +0200, Christian Brauner wrote: > > > > So looking at the ext4 code this can only happen when you clear > > SB_RDONLY in ->reconfigure() too early (and the mount isn't > > MNT_READONLY). Afaict, this was fixed in: > > > > a44be64bbecb ("ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled") > > > > by clearing SB_RDONLY late, right before returning from ->reconfigure() > > when everything's ready. So your change is not about fixing that bug in > > [1] it's about making the vfs give the guarantee that an fs is free to > > clear SB_RDONLY because any ro<->rw transitions are protected via > > s_readonly_remount. Correct? It seems ok to me just making sure. > > Unfortunately we had to revert that commit because that broke > r/o->r/w writes when quota was enabled. The problem is we need a way > of enabling file system writes for internal purposes (e.g., because > quota needs to set up quota inodes) but *not* allow userspace file > system writes to occur until we are fully done with the remount process. > > See the discussion here: > > https://lore.kernel.org/all/20230608044056.GA1418535@mit.edu/ > > The problem with the current state of the tree is commit dea9d8f7643f > ("ext4: only check dquot_initialize_needed() when debugging") has > caught real bugs in the past where the caller of > ext4_xattr_block_set() failed to call dquot_initialize(inode). In > addition, shutting up the warning doesn't fix the problem that while > we hit this race where we have started remounting r/w, quota hasn't > been initialized, quota tracking will get silently dropped, leading to > the quota usage tracking no longer reflecting reality. > > Jan's patch will fix this problem. Yes, and to fully reveal the situation, we can indeed introduce ext4-private superblock state "only internal writes allowed" which can be used for quota setup. It just requires a bit of tidying and helper adjustment (in fact I have such patches on my private branch). But it has occured to me that generally it is easier if the filesystem's remount code doesn't have to care about racing writers on rw<->ro transitions since we have many filesystems and making sure all are getting this correct is ... tedious. Honza
On Thu, 15 Jun 2023 13:38:48 +0200, Jan Kara wrote: > The reconfigure / remount code takes a lot of effort to protect > filesystem's reconfiguration code from racing writes on remounting > read-only. However during remounting read-only filesystem to read-write > mode userspace writes can start immediately once we clear SB_RDONLY > flag. This is inconvenient for example for ext4 because we need to do > some writes to the filesystem (such as preparation of quota files) > before we can take userspace writes so we are clearing SB_RDONLY flag > before we are fully ready to accept userpace writes and syzbot has found > a way to exploit this [1]. Also as far as I'm reading the code > the filesystem remount code was protected from racing writes in the > legacy mount path by the mount's MNT_READONLY flag so this is relatively > new problem. It is actually fairly easy to protect remount read-write > from racing writes using sb->s_readonly_remount flag so let's just do > that instead of having to workaround these races in the filesystem code. > > [...] Applied to the vfs.misc branch of the vfs/vfs.git tree. Patches in the vfs.misc branch should appear in linux-next soon. Please report any outstanding bugs that were missed during review in a new review to the original patch series allowing us to drop it. It's encouraged to provide Acked-bys and Reviewed-bys even though the patch has now been applied. If possible patch trailers will be updated. tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git branch: vfs.misc [1/1] fs: Protect reconfiguration of sb read-write from racing writes https://git.kernel.org/vfs/vfs/c/496de0b41695
On Thu, Jun 15, 2023 at 01:38:48PM +0200, Jan Kara wrote: > The reconfigure / remount code takes a lot of effort to protect > filesystem's reconfiguration code from racing writes on remounting > read-only. However during remounting read-only filesystem to read-write > mode userspace writes can start immediately once we clear SB_RDONLY > flag. This is inconvenient for example for ext4 because we need to do > some writes to the filesystem (such as preparation of quota files) > before we can take userspace writes so we are clearing SB_RDONLY flag > before we are fully ready to accept userpace writes and syzbot has found > a way to exploit this [1]. Also as far as I'm reading the code > the filesystem remount code was protected from racing writes in the > legacy mount path by the mount's MNT_READONLY flag so this is relatively > new problem. It is actually fairly easy to protect remount read-write > from racing writes using sb->s_readonly_remount flag so let's just do > that instead of having to workaround these races in the filesystem code. > > [1] https://lore.kernel.org/all/00000000000006a0df05f6667499@google.com/T/ > Signed-off-by: Jan Kara <jack@suse.cz> > --- > fs/super.c | 11 ++++++++++- > 1 file changed, 10 insertions(+), 1 deletion(-) > > diff --git a/fs/super.c b/fs/super.c > index 34afe411cf2b..6cd64961aa07 100644 > --- a/fs/super.c > +++ b/fs/super.c > @@ -903,6 +903,7 @@ int reconfigure_super(struct fs_context *fc) > struct super_block *sb = fc->root->d_sb; > int retval; > bool remount_ro = false; > + bool remount_rw = false; > bool force = fc->sb_flags & SB_FORCE; > > if (fc->sb_flags_mask & ~MS_RMT_MASK) > @@ -920,7 +921,7 @@ int reconfigure_super(struct fs_context *fc) > bdev_read_only(sb->s_bdev)) > return -EACCES; > #endif > - > + remount_rw = !(fc->sb_flags & SB_RDONLY) && sb_rdonly(sb); > remount_ro = (fc->sb_flags & SB_RDONLY) && !sb_rdonly(sb); > } > > @@ -950,6 +951,14 @@ int reconfigure_super(struct fs_context *fc) > if (retval) > return retval; > } > + } else if (remount_rw) { > + /* > + * We set s_readonly_remount here to protect filesystem's > + * reconfigure code from writes from userspace until > + * reconfigure finishes. > + */ > + sb->s_readonly_remount = 1; > + smp_wmb(); What does the magic random memory barrier do? What is it ordering, and what is it paired with? This sort of thing is much better done with small helpers that encapsulate the necessary memory barriers: sb_set_readonly_remount() sb_clear_readonly_remount() alongside the helper that provides the read-side check and memory barrier the write barrier is associated with. I don't often ask for code to be cleaned up before a bug fix can be added, but I think this is one of the important cases where it does actually matter - we should never add undocumented memory barriers in the code like this... Cheers, Dave.
On Fri 16-06-23 08:36:53, Dave Chinner wrote: > On Thu, Jun 15, 2023 at 01:38:48PM +0200, Jan Kara wrote: > > The reconfigure / remount code takes a lot of effort to protect > > filesystem's reconfiguration code from racing writes on remounting > > read-only. However during remounting read-only filesystem to read-write > > mode userspace writes can start immediately once we clear SB_RDONLY > > flag. This is inconvenient for example for ext4 because we need to do > > some writes to the filesystem (such as preparation of quota files) > > before we can take userspace writes so we are clearing SB_RDONLY flag > > before we are fully ready to accept userpace writes and syzbot has found > > a way to exploit this [1]. Also as far as I'm reading the code > > the filesystem remount code was protected from racing writes in the > > legacy mount path by the mount's MNT_READONLY flag so this is relatively > > new problem. It is actually fairly easy to protect remount read-write > > from racing writes using sb->s_readonly_remount flag so let's just do > > that instead of having to workaround these races in the filesystem code. ... > > + } else if (remount_rw) { > > + /* > > + * We set s_readonly_remount here to protect filesystem's > > + * reconfigure code from writes from userspace until > > + * reconfigure finishes. > > + */ > > + sb->s_readonly_remount = 1; > > + smp_wmb(); > > What does the magic random memory barrier do? What is it ordering, > and what is it paired with? > > This sort of thing is much better done with small helpers that > encapsulate the necessary memory barriers: > > sb_set_readonly_remount() > sb_clear_readonly_remount() > > alongside the helper that provides the read-side check and memory > barrier the write barrier is associated with. Fair remark. The new code including barrier just copies what happens a few lines above for remount read-only case (and what happens ib several other places throughout VFS code). I agree having helpers for this and actually documenting how the barriers are matching there is a good cleanup. > I don't often ask for code to be cleaned up before a bug fix can be > added, but I think this is one of the important cases where it does > actually matter - we should never add undocumented memory barriers > in the code like this... I've talked to Christian and we'll queue this as a separate cleanup. I'll post it shortly. Honza
On Fri, Jun 16, 2023 at 06:37:00PM +0200, Jan Kara wrote: > On Fri 16-06-23 08:36:53, Dave Chinner wrote: > > On Thu, Jun 15, 2023 at 01:38:48PM +0200, Jan Kara wrote: > > > The reconfigure / remount code takes a lot of effort to protect > > > filesystem's reconfiguration code from racing writes on remounting > > > read-only. However during remounting read-only filesystem to read-write > > > mode userspace writes can start immediately once we clear SB_RDONLY > > > flag. This is inconvenient for example for ext4 because we need to do > > > some writes to the filesystem (such as preparation of quota files) > > > before we can take userspace writes so we are clearing SB_RDONLY flag > > > before we are fully ready to accept userpace writes and syzbot has found > > > a way to exploit this [1]. Also as far as I'm reading the code > > > the filesystem remount code was protected from racing writes in the > > > legacy mount path by the mount's MNT_READONLY flag so this is relatively > > > new problem. It is actually fairly easy to protect remount read-write > > > from racing writes using sb->s_readonly_remount flag so let's just do > > > that instead of having to workaround these races in the filesystem code. > ... > > > + } else if (remount_rw) { > > > + /* > > > + * We set s_readonly_remount here to protect filesystem's > > > + * reconfigure code from writes from userspace until > > > + * reconfigure finishes. > > > + */ > > > + sb->s_readonly_remount = 1; > > > + smp_wmb(); > > > > What does the magic random memory barrier do? What is it ordering, > > and what is it paired with? > > > > This sort of thing is much better done with small helpers that > > encapsulate the necessary memory barriers: > > > > sb_set_readonly_remount() > > sb_clear_readonly_remount() > > > > alongside the helper that provides the read-side check and memory > > barrier the write barrier is associated with. > > Fair remark. The new code including barrier just copies what happens a few > lines above for remount read-only case (and what happens ib several other > places throughout VFS code). Yes, I saw that you'd copied that magic memory barrier. Good for consistency, but it doesn't answer any of the above questions either, so.... > I agree having helpers for this and actually > documenting how the barriers are matching there is a good cleanup. > > > I don't often ask for code to be cleaned up before a bug fix can be > > added, but I think this is one of the important cases where it does > > actually matter - we should never add undocumented memory barriers > > in the code like this... > > I've talked to Christian and we'll queue this as a separate cleanup. I'll > post it shortly. Thanks! -Dave.
diff --git a/fs/super.c b/fs/super.c index 34afe411cf2b..6cd64961aa07 100644 --- a/fs/super.c +++ b/fs/super.c @@ -903,6 +903,7 @@ int reconfigure_super(struct fs_context *fc) struct super_block *sb = fc->root->d_sb; int retval; bool remount_ro = false; + bool remount_rw = false; bool force = fc->sb_flags & SB_FORCE; if (fc->sb_flags_mask & ~MS_RMT_MASK) @@ -920,7 +921,7 @@ int reconfigure_super(struct fs_context *fc) bdev_read_only(sb->s_bdev)) return -EACCES; #endif - + remount_rw = !(fc->sb_flags & SB_RDONLY) && sb_rdonly(sb); remount_ro = (fc->sb_flags & SB_RDONLY) && !sb_rdonly(sb); } @@ -950,6 +951,14 @@ int reconfigure_super(struct fs_context *fc) if (retval) return retval; } + } else if (remount_rw) { + /* + * We set s_readonly_remount here to protect filesystem's + * reconfigure code from writes from userspace until + * reconfigure finishes. + */ + sb->s_readonly_remount = 1; + smp_wmb(); } if (fc->ops->reconfigure) {
The reconfigure / remount code takes a lot of effort to protect filesystem's reconfiguration code from racing writes on remounting read-only. However during remounting read-only filesystem to read-write mode userspace writes can start immediately once we clear SB_RDONLY flag. This is inconvenient for example for ext4 because we need to do some writes to the filesystem (such as preparation of quota files) before we can take userspace writes so we are clearing SB_RDONLY flag before we are fully ready to accept userpace writes and syzbot has found a way to exploit this [1]. Also as far as I'm reading the code the filesystem remount code was protected from racing writes in the legacy mount path by the mount's MNT_READONLY flag so this is relatively new problem. It is actually fairly easy to protect remount read-write from racing writes using sb->s_readonly_remount flag so let's just do that instead of having to workaround these races in the filesystem code. [1] https://lore.kernel.org/all/00000000000006a0df05f6667499@google.com/T/ Signed-off-by: Jan Kara <jack@suse.cz> --- fs/super.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-)