[3/3] overlayfs: Report writeback errors on upper

Message ID 20201221195055.35295-4-vgoyal@redhat.com (mailing list archive)
State New, archived
Series vfs, overlayfs: Fix syncfs() to return correct errors

Commit Message

Vivek Goyal Dec. 21, 2020, 7:50 p.m. UTC
Currently, syncfs() and fsync() are the two interfaces that check and
return writeback errors recorded on a superblock to user space. fsync()
works fine with overlayfs because it relies on the underlying filesystem
to do the check and return any error. For example, if ext4 is the upper
filesystem, then ext4_sync_file() calls file_check_and_advance_wb_err(file)
on the upper file and returns the error, so overlayfs does not have to do
anything special.

But with syncfs(), the error check happens in the VFS in syncfs() against
overlay_sb->s_wb_err. Given that overlayfs is a stacked filesystem, it
does not do the actual writeback, and all writeback errors are recorded
on the underlying filesystem. So sb->s_wb_err is never updated, and hence
syncfs() does not work with overlayfs.

Jeff suggested that instead of trying to propagate errors to the overlay
super block, we simply check for errors against the upper filesystem
super block. This patch implements that idea.

An overlay file has a "since" value which needs to be initialized at open
time. Overlayfs overrides the VFS initialization and re-initializes
f->f_sb_err w.r.t. the upper super block. Later, when
ovl_sb->errseq_check_advance() is called, f->f_sb_err is used as the
"since" value to figure out whether any error has happened on the upper
sb since then.

Note: right now this patch only deals with regular files and directories.
Special files such as device inodes, sockets and fifos are yet to be
dealt with.

Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/overlayfs/file.c      |  1 +
 fs/overlayfs/overlayfs.h |  1 +
 fs/overlayfs/readdir.c   |  1 +
 fs/overlayfs/super.c     | 23 +++++++++++++++++++++++
 fs/overlayfs/util.c      | 13 +++++++++++++
 5 files changed, 39 insertions(+)
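
For context, this is roughly how the VFS-side check has worked since v5.8
(Jeff Layton's per-sb writeback error tracking), and where a new
->errseq_check_advance super operation from patch 1/3 of this series would
slot in. A simplified sketch, not verbatim kernel code; the hook wiring is
an assumption based on the series description:

	/* fs/sync.c, simplified */
	SYSCALL_DEFINE1(syncfs, int, fd)
	{
		struct fd f = fdget(fd);
		struct super_block *sb;
		int ret, ret2;

		if (!f.file)
			return -EBADF;
		sb = f.file->f_path.dentry->d_sb;

		down_read(&sb->s_umount);
		ret = sync_filesystem(sb);
		up_read(&sb->s_umount);

		/*
		 * Default: check and advance this sb's error sequence. For a
		 * stacked fs like overlayfs, sb->s_wb_err never advances, so
		 * the assumed hook lets the fs redirect the check (overlayfs
		 * points it at the upper sb in this patch).
		 */
		if (sb->s_op->errseq_check_advance)
			ret2 = sb->s_op->errseq_check_advance(sb, f.file);
		else
			ret2 = errseq_check_and_advance(&sb->s_wb_err,
							&f.file->f_sb_err);

		fdput(f);
		return ret ? ret : ret2;
	}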

Comments

Matthew Wilcox Dec. 22, 2020, 4:20 p.m. UTC | #1
On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> +static int ovl_errseq_check_advance(struct super_block *sb, struct file *file)
> +{
> +	struct ovl_fs *ofs = sb->s_fs_info;
> +	struct super_block *upper_sb;
> +	int ret;
> +
> +	if (!ovl_upper_mnt(ofs))
> +		return 0;
> +
> +	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> +
> +	if (!errseq_check(&upper_sb->s_wb_err, file->f_sb_err))
> +		return 0;
> +
> +	/* Something changed, must use slow path */
> +	spin_lock(&file->f_lock);
> +	ret = errseq_check_and_advance(&upper_sb->s_wb_err, &file->f_sb_err);
> +	spin_unlock(&file->f_lock);

Why are you microoptimising syncfs()?  Are there really applications which
call syncfs() in a massively parallel manner on the same file descriptor?
Vivek Goyal Dec. 22, 2020, 4:29 p.m. UTC | #2
On Tue, Dec 22, 2020 at 04:20:27PM +0000, Matthew Wilcox wrote:
> > [...]
> 
> Why are you microoptimising syncfs()?  Are there really applications which
> call syncfs() in a massively parallel manner on the same file descriptor?

This is at least a theoretical race. I am not aware of any application
that can trigger it, but to me it makes sense to fix the race.

Jeff Layton also posted a fix for syncfs().

https://lore.kernel.org/linux-fsdevel/20201219134804.20034-1-jlayton@kernel.org/

To me it makes sense to fix the race irrespective of whether somebody has
hit it or not. People end up copying code into other parts of the kernel,
and this way they will at least copy race-free code.

Vivek
Matthew Wilcox Dec. 22, 2020, 5:46 p.m. UTC | #3
On Tue, Dec 22, 2020 at 11:29:25AM -0500, Vivek Goyal wrote:
> On Tue, Dec 22, 2020 at 04:20:27PM +0000, Matthew Wilcox wrote:
> > > [...]
> > 
> > Why are you microoptimising syncfs()?  Are there really applications which
> > call syncfs() in a massively parallel manner on the same file descriptor?
> 
> This is at least a theoretical race. I am not aware of any application
> that can trigger it, but to me it makes sense to fix the race.
> 
> Jeff Layton also posted a fix for syncfs().
> 
> https://lore.kernel.org/linux-fsdevel/20201219134804.20034-1-jlayton@kernel.org/
> 
> To me it makes sense to fix the race irrespective of whether somebody has
> hit it or not. People end up copying code into other parts of the kernel,
> and this way they will at least copy race-free code.

Let me try again.  "Why are you trying to avoid taking the spinlock?"
Vivek Goyal Dec. 22, 2020, 5:55 p.m. UTC | #4
On Tue, Dec 22, 2020 at 05:46:37PM +0000, Matthew Wilcox wrote:
> > [...]
> 
> Let me try again.  "Why are you trying to avoid taking the spinlock?"

Aha, sorry, I misunderstood your question. I don't have a good answer.
I just copied the code from Jeff Layton's patch.

Agreed, the cost of taking the spinlock will not be significant unless
syncfs() is called at high frequency. Having said that, most of the
time taking the spinlock will not be needed, so avoiding it with
a simple call to errseq_check() sounds reasonable too.

I don't have any strong opinions here. I am fine with whichever
implementation people like.

Vivek
Jeff Layton Dec. 23, 2020, 12:53 p.m. UTC | #5
On Tue, 2020-12-22 at 12:55 -0500, Vivek Goyal wrote:
> > [...]
> > Let me try again.  "Why are you trying to avoid taking the spinlock?"
> 
> Aha, sorry, I misunderstood your question. I don't have a good answer.
> I just copied the code from Jeff Layton's patch.
> 
> Agreed, the cost of taking the spinlock will not be significant unless
> syncfs() is called at high frequency. Having said that, most of the
> time taking the spinlock will not be needed, so avoiding it with
> a simple call to errseq_check() sounds reasonable too.
> 
> I don't have any strong opinions here. I am fine with whichever
> implementation people like.
> 

It is a micro-optimization, but we'll almost always be able to avoid
taking the lock altogether. Errors here should be very, very infrequent.

That said, I don't have strong feelings on this either.
Sargun Dhillon Dec. 23, 2020, 6:20 p.m. UTC | #6
On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> [...]
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index efccb7c1f9bc..7b58a44dcb71 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -163,6 +163,7 @@ static int ovl_open(struct inode *inode, struct file *file)
>  		return PTR_ERR(realfile);
>  
>  	file->private_data = realfile;
> +	ovl_init_file_errseq(file);
>  
>  	return 0;
>  }
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index f8880aa2ba0e..47838abbfb3d 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -322,6 +322,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct dentry *dentry);
>  bool ovl_is_metacopy_dentry(struct dentry *dentry);
>  char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
>  			     int padding);
> +void ovl_init_file_errseq(struct file *file);
>  
>  static inline bool ovl_is_impuredir(struct super_block *sb,
>  				    struct dentry *dentry)
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 01620ebae1bd..0c48f1545483 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -960,6 +960,7 @@ static int ovl_dir_open(struct inode *inode, struct file *file)
>  	od->is_real = ovl_dir_is_real(file->f_path.dentry);
>  	od->is_upper = OVL_TYPE_UPPER(type);
>  	file->private_data = od;
> +	ovl_init_file_errseq(file);
>  
>  	return 0;
>  }
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 290983bcfbb3..d99867983722 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -390,6 +390,28 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
>  	return ret;
>  }
>  
> +static int ovl_errseq_check_advance(struct super_block *sb, struct file *file)
> +{
> +	struct ovl_fs *ofs = sb->s_fs_info;
> +	struct super_block *upper_sb;
> +	int ret;
> +
> +	if (!ovl_upper_mnt(ofs))
> +		return 0;
> +
> +	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> +
> +	if (!errseq_check(&upper_sb->s_wb_err, file->f_sb_err))
> +		return 0;
> +
> +	/* Something changed, must use slow path */
> +	spin_lock(&file->f_lock);
> +	ret = errseq_check_and_advance(&upper_sb->s_wb_err, &file->f_sb_err);
> +	spin_unlock(&file->f_lock);
> +
> +	return ret;
> +}
> +
>  static const struct super_operations ovl_super_operations = {
>  	.alloc_inode	= ovl_alloc_inode,
>  	.free_inode	= ovl_free_inode,
> @@ -400,6 +422,7 @@ static const struct super_operations ovl_super_operations = {
>  	.statfs		= ovl_statfs,
>  	.show_options	= ovl_show_options,
>  	.remount_fs	= ovl_remount,
> +	.errseq_check_advance	= ovl_errseq_check_advance,
>  };
>  
>  enum {
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index 23f475627d07..a1742847f3a8 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -950,3 +950,16 @@ char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
>  	kfree(buf);
>  	return ERR_PTR(res);
>  }
> +
> +void ovl_init_file_errseq(struct file *file)
> +{
> +	struct super_block *sb = file_dentry(file)->d_sb;
> +	struct ovl_fs *ofs = sb->s_fs_info;
> +	struct super_block *upper_sb;
> +
> +	if (!ovl_upper_mnt(ofs))
> +		return;
> +
> +	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> +	file->f_sb_err = errseq_sample(&upper_sb->s_wb_err);
> +}
> -- 
> 2.25.4
> 

I fail to see why this is necessary if you incorporate error reporting into
the sync_fs callback. Why is this separate from that callback? If you pick up
Jeff's patch that adds a second "observed" flag to errseq, you should be able
to stash the first errseq seen in the ovl_fs struct, and do the check-and-return
there instead of adding this new infrastructure.

IMHO, if we're going to fix this, sync_fs should be replaced, and there should
be a generic_sync_fs wrapper which does the errseq handling, the callback, and
the blockdev sync, but filesystems should be able to override it and do the
requisite work.
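
A rough sketch of what such a wrapper could look like (generic_sync_fs is
hypothetical, as is its signature; names and the exact split between the
wrapper and per-fs code are assumptions, not an existing API):

	/* Hypothetical generic_sync_fs: filesystems could override it. */
	int generic_sync_fs(struct super_block *sb, struct file *file, int wait)
	{
		int ret = 0, ret2 = 0, err;

		if (sb->s_op->sync_fs)
			ret = sb->s_op->sync_fs(sb, wait);

		if (sb->s_bdev)
			ret2 = sync_blockdev(sb->s_bdev);

		/*
		 * Only scrape the error when there is a file to report it
		 * against, i.e. for syncfs(2) -- not for sync(2) or quota
		 * writeback, which have nowhere to return it.
		 */
		if (file) {
			err = errseq_check_and_advance(&sb->s_wb_err,
						       &file->f_sb_err);
			if (!ret)
				ret = err;
		}
		return ret ? ret : ret2;
	}
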
Matthew Wilcox Dec. 23, 2020, 6:50 p.m. UTC | #7
On Wed, Dec 23, 2020 at 06:20:27PM +0000, Sargun Dhillon wrote:
> I fail to see why this is necessary if you incorporate error reporting into
> the sync_fs callback. Why is this separate from that callback? If you pick up
> Jeff's patch that adds a second "observed" flag to errseq, you should be able
> to stash the first errseq seen in the ovl_fs struct, and do the check-and-return
> there instead of adding this new infrastructure.

You still haven't explained why you want to add the "observed" flag.
Jeff Layton Dec. 23, 2020, 7 p.m. UTC | #8
On Wed, 2020-12-23 at 18:20 +0000, Sargun Dhillon wrote:
> On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> > [...]
> 
> I fail to see why this is necessary if you incorporate error reporting into
> the sync_fs callback. Why is this separate from that callback? If you pick up
> Jeff's patch that adds a second "observed" flag to errseq, you should be able
> to stash the first errseq seen in the ovl_fs struct, and do the check-and-return
> there instead of adding this new infrastructure.
> 
> IMHO, if we're going to fix this, sync_fs should be replaced, and there should
> be a generic_sync_fs wrapper which does the errseq handling, the callback, and
> the blockdev sync, but filesystems should be able to override it and do the
> requisite work.

The big problem is that ->sync_fs is called in several different
contexts. For syncfs(), yes, but also for sync(), some quota handling,
etc.

In most of those, we don't want to do an errseq_check_and_advance
because we don't have a way to send that error back to userland at all
(e.g., sync()), or reporting a writeback error might not make sense
(e.g., quotactl()).

IOW, we need to be able to distinguish the context in which the sync_fs
is being performed before "scraping" the error.

Cheers,
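
For reference, a rough sketch of those call sites (simplified from fs/sync.c
and fs/quota/dquot.c of this era; details vary by kernel version):

	/* syncfs(2): the only caller holding a file to report errors against */
	sync_filesystem(sb);			/* -> sb->s_op->sync_fs(sb, 1) */

	/* sync(2): iterates all superblocks; there is no fd, so there is no
	 * place to return a writeback error to userland */
	iterate_supers(sync_fs_one_sb, &nowait);
	iterate_supers(sync_fs_one_sb, &wait);

	/* quota handling: dquot_quota_sync() also calls the callback directly */
	if (sb->s_op->sync_fs)
		sb->s_op->sync_fs(sb, 1);
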
Sargun Dhillon Dec. 23, 2020, 7:29 p.m. UTC | #9
On Wed, Dec 23, 2020 at 06:50:44PM +0000, Matthew Wilcox wrote:
> > [...]
> 
> You still haven't explained why you want to add the "observed" flag.


In the overlayfs model, many users may be using the same filesystem (super block)
for their upperdir. Let's say you have something like this:

/workdir [Mounted FS]
/workdir/upperdir1 [overlayfs upperdir]
/workdir/upperdir2 [overlayfs upperdir]
/workdir/userscratchspace

The user needs to be able to do something like:
sync -f ${overlayfs1}/file

which in turn will call sync on the underlying filesystem (the one mounted
on /workdir), and can check if the errseq has changed since the overlayfs was
mounted, and use that to return an error to the user.

If we do not advance the errseq on the upperdir to "mark it as seen", that means 
future errors will not be reported if the user calls sync -f ${overlayfs1}/file,
because errseq will not increment the value if the seen bit is unset.

On the other hand, if we mark it as seen, then if the user calls sync on
/workdir/userscratchspace/file, they won't see the error since we just set the
SEEN flag.

You need a new flag (observed) to differentiate between "Seen and reported to 
user" versus "seen by a second-order system, so should now increment".

One alternative is to always increment the errseq error counter, but I've
gotta imagine there's a reason that wasn't done in the first place.
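
To make the errseq mechanics concrete, here is that sharing problem as a
timeline against the lib/errseq.c API (a simplified reading of that code,
not new kernel code; the flag/counter behavior is paraphrased from its
comments):

	errseq_t *wb = &upper_sb->s_wb_err;	/* shared by every consumer */
	errseq_t since;
	int err;

	errseq_set(wb, -EIO);		/* writeback error; SEEN is not set,
					 * so the counter is not bumped */

	since = errseq_sample(wb);	/* overlay mount: recent kernels
					 * return 0 while the error is
					 * unSEEN, so it can still be
					 * reported once */

	err = errseq_check_and_advance(wb, &since);
					/* overlay syncfs: returns -EIO and
					 * sets the SEEN flag in *wb */

	/*
	 * A direct syncfs on the upper fs now reports nothing for that
	 * error: the overlay consumed its "unseen" state for everyone. And
	 * since SEEN is now set, the next errseq_set() will bump the counter
	 * where it would not have before. The proposed "observed" flag tries
	 * to separate "reported to a user" from "sampled by a stacked
	 * consumer".
	 */
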
Matthew Wilcox Dec. 23, 2020, 8:07 p.m. UTC | #10
On Wed, Dec 23, 2020 at 07:29:41PM +0000, Sargun Dhillon wrote:
> On Wed, Dec 23, 2020 at 06:50:44PM +0000, Matthew Wilcox wrote:
> > > [...]
> > 
> > You still haven't explained why you want to add the "observed" flag.
> 
> 
> In the overlayfs model, many users may be using the same filesystem (super block)
> for their upperdir. Let's say you have something like this:
> 
> /workdir [Mounted FS]
> /workdir/upperdir1 [overlayfs upperdir]
> /workdir/upperdir2 [overlayfs upperdir]
> /workdir/userscratchspace
> 
> The user needs to be able to do something like:
> sync -f ${overlayfs1}/file
> 
> which in turn will call sync on the underlying filesystem (the one mounted
> on /workdir), and can check if the errseq has changed since the overlayfs was
> mounted, and use that to return an error to the user.

OK, but I don't see why the current scheme doesn't work for this.  If
(each instance of) overlayfs samples the errseq at mount time and then
check_and_advances it at sync time, it will see any error that has occurred
since the mount happened (and possibly also an error which occurred before
the mount happened, but hadn't been reported to anybody before).

> If we do not advance the errseq on the upperdir to "mark it as seen", that means 
> future errors will not be reported if the user calls sync -f ${overlayfs1}/file,
> because errseq will not increment the value if the seen bit is unset.
> 
> On the other hand, if we mark it as seen, then if the user calls sync on 
> /workdir/userscratchspace/file, they wont see the error since we just set the 
> SEEN flag.

While we set the SEEN flag, if the file were opened before the error
occurred, we would still report the error because the sequence is higher
than it was when we sampled the error.
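
For readers following along, this is the errseq_t layout that makes that
work, paraphrased from the comments in lib/errseq.c:

	/*
	 * errseq_t packs an error, a SEEN flag, and a change counter into
	 * one 32-bit value:
	 *
	 *	bits 31..13   bit 12    bits 11..0
	 *	[ counter ]  [ SEEN ]  [ -errno  ]
	 *
	 * errseq_set() only bumps the counter when SEEN was already set, so
	 * an fd that sampled an older counter value still observes the error
	 * even after another consumer has set the SEEN flag.
	 */
	typedef u32 errseq_t;
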
Sargun Dhillon Dec. 23, 2020, 8:21 p.m. UTC | #11
On Wed, Dec 23, 2020 at 08:07:46PM +0000, Matthew Wilcox wrote:
> > [...]
> 
> OK, but I don't see why the current scheme doesn't work for this.  If
> (each instance of) overlayfs samples the errseq at mount time and then
> check_and_advances it at sync time, it will see any error that has occurred
> since the mount happened (and possibly also an error which occurred before
> the mount happened, but hadn't been reported to anybody before).
> 

If there is an outstanding error at mount time, and the SEEN flag is unset,
subsequent errors will not increment the counter until the user calls sync on
the upperdir's filesystem. If overlayfs calls check_and_advance on the upperdir's
super block at any point, it will then set the SEEN flag, and if the user calls
syncfs on the upperdir, it will not return that there is an outstanding error,
since overlayfs just cleared it.


> > [...]
> 
> While we set the SEEN flag, if the file were opened before the error
> occurred, we would still report the error because the sequence is higher
> than it was when we sampled the error.
> 

Right, this isn't a problem for people calling f(data)sync on a particular file,
because it takes its own snapshot of errseq. This is only problematic for folks
calling syncfs. In Jeff's other messages, it sounded like this behaviour is
pretty important, and the likes of PostgreSQL depend on it.
Matthew Wilcox Dec. 23, 2020, 8:44 p.m. UTC | #12
On Wed, Dec 23, 2020 at 08:21:41PM +0000, Sargun Dhillon wrote:
> > [...]
> 
> If there is an outstanding error at mount time, and the SEEN flag is unset,
> subsequent errors will not increment the counter until the user calls sync on
> the upperdir's filesystem. If overlayfs calls check_and_advance on the upperdir's
> super block at any point, it will then set the SEEN flag, and if the user calls
> syncfs on the upperdir, it will not return that there is an outstanding error,
> since overlayfs just cleared it.
Your concern is this case:

fs is mounted on /workdir
/workdir/A is written to and then closed.
writeback happens and -EIO happens, but there's nobody around to care.
/workdir/upperdir1 becomes part of an overlayfs mount
overlayfs samples the error
a user writes to /workdir/B, another -EIO occurs, but nothing happens
someone calls syncfs on /workdir/upperdir/A, gets the EIO.
a user opens /workdir/B and calls syncfs, but sees no error

Do I have that right?  Or is it something else?

> > > [...]
> > While we set the SEEN flag, if the file were opened before the error
> > occurred, we would still report the error because the sequence is higher
> > than it was when we sampled the error.
> > 
> 
> Right, this isn't a problem for people calling f(data)sync on a particular file,
> because it takes its own snapshot of errseq. This is only problematic for folks
> calling syncfs. In Jeff's other messages, it sounded like this behaviour is
> pretty important, and the likes of PostgreSQL depend on it.

I would suggest that in the example above, the error _didn't_ occur
while calling syncfs(); it occurred before we synced the filesystem,
and we don't have to report it in that case.
Amir Goldstein Dec. 24, 2020, 9:32 a.m. UTC | #13
On Wed, Dec 23, 2020 at 10:44 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> > [...]
>
> Your concern is this case:
>
> fs is mounted on /workdir
> /workdir/A is written to and then closed.
> writeback happens and -EIO happens, but there's nobody around to care.
> /workdir/upperdir1 becomes part of an overlayfs mount
> overlayfs samples the error
> a user writes to /workdir/B, another -EIO occurs, but nothing happens
> someone calls syncfs on /workdir/upperdir/A, gets the EIO.
> a user opens /workdir/B and calls syncfs, but sees no error
>
> Do I have that right?  Or is it something else?

IMO it is something else. Others may disagree.
IMO the level of interference between users accessing overlay and users
accessing upper fs directly is not well defined and it can stay this way.

Concurrent access to /workdir/upperdir/A via overlay and underlying fs
is explicitly warned against in Documentation/filesystems/overlayfs.rst#
Changes to underlying filesystems:
"Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed.  If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in
a crash or deadlock."

The question is whether syncfs(open(/workdir/B)) is considered
"Changes to the underlying filesystems". Regardless of the answer,
this is not an interesting case IMO.

The real issue is with interference between overlays that share the
same upper fs, because this is by and large the common use case
that is creating real problems for a lot of container users.

Workloads running inside containers (with the overlayfs storage driver)
will never be as isolated as workloads running inside VMs, but that
doesn't mean we cannot try to improve.

In current master, syncfs() on any file by any container user will
result in full syncfs() of the upperfs, which is very bad for container
isolation. This has been partly fixed by Chengguang Xu [1] and I expect
his work will be merged soon. Overlayfs still does not do the writeback
and syncfs() in overlay still waits for all upper fs writeback to complete,
but at least syncfs() in overlay only kicks writeback for upper fs files
dirtied by this overlay.

[1] https://lore.kernel.org/linux-unionfs/CAJfpegsbb4iTxW8ZyuRFVNc63zg7Ku7vzpSNuzHASYZH-d5wWA@mail.gmail.com/

Sharing the same SEEN flag among thousands of containers is also
far from ideal, because effectively this means that any given workload
in any single container has very little chance of observing the SEEN flag.

To this end, I do agree with Matthew that overlayfs should sample errseq
and the best patchset to implement it so far IMO is Jeff's patchset [2].
This patch set was written to cater only "volatile" overlayfs mount, but
there is no reason not to use the same mechanism for regular overlay
mount. The only difference being that "volatile" overlay only checks for
error since mount on syncfs() (because "volatile" overlay does NOT
syncfs upper fs) and regular overlay checks and advances the overlay's
errseq sample on syncfs (and does syncfs upper fs).

Matthew, I hope that my explanation of the use case and Jeff's answer
is sufficient to understand why the split of the SEEN flag is needed.

[2] https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlayton@kernel.org/

w.r.t Vivek's patchset (this one), I do not object to it at all, but it fixes
a problem that Jeff's patch had already solved with an ugly hack:

  /* Propagate errors from upper to overlayfs */
  ret = errseq_check(&upper_sb->s_wb_err, ofs->err_mark);
  errseq_set(&sb->s_wb_err, ret);
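
In context, that hack sits roughly like this in overlayfs' ->sync_fs (a
sketch of Jeff's approach, not the exact patch; ofs->err_mark is the errseq
sample taken from the upper sb at mount time in that patchset):

	static int ovl_sync_fs(struct super_block *sb, int wait)
	{
		struct ovl_fs *ofs = sb->s_fs_info;
		struct super_block *upper_sb;
		int ret;

		if (!ovl_upper_mnt(ofs))
			return 0;
		upper_sb = ovl_upper_mnt(ofs)->mnt_sb;

		/* ... sync the upper sb as before (omitted) ... */

		/* Propagate errors from upper to overlayfs */
		ret = errseq_check(&upper_sb->s_wb_err, ofs->err_mark);
		if (ret)
			errseq_set(&sb->s_wb_err, ret);
		return ret;
	}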

Since Jeff's patch is minimal, I think that it should be the fix applied
first and proposed for stable (with adaptations for non-volatile overlay).

I guess that Vivek's patch 1/3 from this series [3] is also needed to
complement the work that should go to stable.

Vivek, Sargun,

Do you understand my proposal?
Do you agree with it as a way forward to address the various syncfs
issues for volatile/non-volatile that both of you were trying to address?

Sargun, I know all this discussion has forked from your volatile re-use
patch set, but let's not confuse fsdevel folks more than we have to.
The way forward for volatile re-use from this proposal is straightforward.

Thanks,
Amir.
Sargun Dhillon Dec. 24, 2020, 10:12 a.m. UTC | #14
On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> [...]
> 
> Vivek, Sargun,
> 
> Do you understand my proposal?
Yes. I agree that Jeff's patch should be added to stable. The fact that we
don't bubble up writeback errors turns out to be a real problem that I never
knew was happening, but upon investigating, it looks like a real issue.

I think we can use Jeff's hacky approach in stable, as it's far more minimal
and has a much lower chance of causing issues, but if we make further
improvements, we won't be able to backport them to stable as easily. I have
nothing explicitly against Vivek's approach, though.

> Do you agree with it as a way forward to address the various syncfs
> issues for volatile/non-volatile that both of you were trying to address?
Yes. I think Vivek's patchset of introducing a new superblock callback is
the best approach.

> 
> Sargun, I know all this discussion has forked from your volatile re-use
> patch set, but let's not confuse fsdevel folks more than we have to.
> The way forward for volatile re-use from this proposal is straightforward.

I think that Vivek's patchset of adding a new callback, plus Jeff's new flag,
solves detection of errors for normal mounts, volatile mounts, and volatile
remounts.

I think I responded to Jeff's patch and it looked good, bar one of the loops.
My only suggestion is that someone add the intended behaviour as a comment
on the new super_ops callback.
> 
> Thanks,
> Amir.
Matthew Wilcox Dec. 24, 2020, 12:13 p.m. UTC | #15
On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> In current master, syncfs() on any file by any container user will
> result in full syncfs() of the upperfs, which is very bad for container
> isolation. This has been partly fixed by Chengguang Xu [1] and I expect
> his work will be merged soon. Overlayfs still does not do the writeback
> and syncfs() in overlay still waits for all upper fs writeback to complete,
> but at least syncfs() in overlay only kicks writeback for upper fs files
> dirtied by this overlay.
> 
> [1] https://lore.kernel.org/linux-unionfs/CAJfpegsbb4iTxW8ZyuRFVNc63zg7Ku7vzpSNuzHASYZH-d5wWA@mail.gmail.com/
> 
> Sharing the same SEEN flag among thousands of containers is also
> far from ideal, because effectively this means that any given workload
> in any single container has very little chance of observing the SEEN flag.

Perhaps you misunderstand how errseq works.  If each container samples
the errseq at startup, then they will all see any error which occurs
during their lifespan (and possibly an error which occurred before they
started up).

> To this end, I do agree with Matthew that overlayfs should sample errseq
> and the best patchset to implement it so far IMO is Jeff's patchset [2].
> This patch set was written to cater only "volatile" overlayfs mount, but
> there is no reason not to use the same mechanism for regular overlay
> mount. The only difference being that "volatile" overlay only checks for
> error since mount on syncfs() (because "volatile" overlay does NOT
> syncfs upper fs) and regular overlay checks and advances the overlay's
> errseq sample on syncfs (and does syncfs upper fs).
> 
> Matthew, I hope that my explanation of the use case and Jeff's answer
> is sufficient to understand why the split of the SEEN flag is needed.
> 
> [2] https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlayton@kernel.org/

No, it still feels weird and wrong.
Amir Goldstein Dec. 25, 2020, 6:50 a.m. UTC | #16
On Thu, Dec 24, 2020 at 2:13 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> > [...]
>
> Perhaps you misunderstand how errseq works.  If each container samples
> the errseq at startup, then they will all see any error which occurs
> during their lifespan

Meant to say "...very little chance of NOT observing the SEEN flag",
but we are not in disagreement.
My argument against sharing the SEEN flag refers to Vivek's patch of
stacked errseq_sample()/errseq_check_and_advance() which does NOT
sample errseq at overlayfs mount time. That is why my next sentence is:
"I do agree with Matthew that overlayfs should sample errseq...".

> (and possibly an error which occurred before they started up).
>

Right. And this is where the discussion of splitting the SEEN flag started.
Some of us want to treat overlayfs mount time as a true epoch for errseq.
The new container didn't write any files yet, so it should not care about
writeback errors from the past.

I agree that it may not be very critical, but as I wrote before, I think we
should do our best to try and isolate container workloads.

> > [...]
>
> No, it still feels weird and wrong.
>

All right. Considering your reservations, I think perhaps the split of the
SEEN flag can wait for a later time after more discussions, and maybe it is
not as suitable for stable as we thought.

I think that for stable, it would be sufficient to adapt Sargun's original
syncfs-for-volatile-mount patch [1] to cover the non-volatile case:
on mount:
- errseq_sample() upper fs
- on volatile mount, errseq_check() upper fs and fail mount on un-SEEN error
on syncfs:
- errseq_check() for volatile mount
- errseq_check_and_advance() for non-volatile mount
- errseq_set() overlay sb on upper fs error

Now errseq_set() is not only a hack around __sync_filesystem ignoring the
return value of ->sync_fs(). It is really needed for per-overlay SEEN
error isolation in the non-volatile case.
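
As a sketch, that proposal might look something like this in overlayfs (my
reading of the outline above; ovl_should_sync() and the ofs->errseq field
are naming assumptions, not settled code):

	/* mount time: ofs->errseq = errseq_sample(&upper_sb->s_wb_err); */

	static int ovl_sync_fs(struct super_block *sb, int wait)
	{
		struct ovl_fs *ofs = sb->s_fs_info;
		struct super_block *upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
		int err;

		if (ovl_should_sync(ofs)) {
			/* non-volatile: sync the upper fs (omitted), then
			 * report once per overlay and advance this
			 * overlay's sample */
			err = errseq_check_and_advance(&upper_sb->s_wb_err,
						       &ofs->errseq);
		} else {
			/* volatile: the upper fs is never synced; any error
			 * since mount is fatal, and the sample never
			 * advances */
			err = errseq_check(&upper_sb->s_wb_err, ofs->errseq);
		}

		/* errseq_set() gives each overlay sb its own SEEN state, so
		 * the generic syncfs() check against sb->s_wb_err sees it */
		if (err)
			errseq_set(&sb->s_wb_err, err);
		return err;
	}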

Unless I am missing something, I think we do not strictly need Vivek's
1/3 patch [2] for stable, but not sure.

Sargun,

Do you agree with the above proposal?
Will you make it into a patch?

Vivek, Jeff,

Do you agree that overlay syncfs observing writeback errors that predate
overlay mount time is an issue that can be deferred (maybe forever)?

BTW, in all the discussions we always assumed that stacked fsync() is correct
WRT errseq, but in fact, fsync() can also observe an unseen error that predates
overlay mount.
In order to fix that, we will probably need to split the SEEN flag and do
some more errseq acrobatics, but again, I am not sure it is worth the effort.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-unionfs/20201202092720.41522-1-sargun@sargun.me/
[2] https://lore.kernel.org/linux-unionfs/20201222151752.GA3248@redhat.com/
Jeff Layton Dec. 28, 2020, 1:25 p.m. UTC | #17
On Fri, 2020-12-25 at 08:50 +0200, Amir Goldstein wrote:
> On Thu, Dec 24, 2020 at 2:13 PM Matthew Wilcox <willy@infradead.org> wrote:
> > 
> > On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> > > In current master, syncfs() on any file by any container user will
> > > result in full syncfs() of the upperfs, which is very bad for container
> > > isolation. This has been partly fixed by Chengguang Xu [1] and I expect
> > > his work will be merged soon. Overlayfs still does not do the writeback
> > > and syncfs() in overlay still waits for all upper fs writeback to complete,
> > > but at least syncfs() in overlay only kicks writeback for upper fs files
> > > dirtied by this overlay.
> > > 
> > > [1] https://lore.kernel.org/linux-unionfs/CAJfpegsbb4iTxW8ZyuRFVNc63zg7Ku7vzpSNuzHASYZH-d5wWA@mail.gmail.com/
> > > 
> > > Sharing the same SEEN flag among thousands of containers is also
> > > far from ideal, because effectively this means that any given workload
> > > in any single container has very little chance of observing the SEEN flag.
> > 
> > Perhaps you misunderstand how errseq works.  If each container samples
> > the errseq at startup, then they will all see any error which occurs
> > during their lifespan
> 
> Meant to say "...very little chance of NOT observing the SEEN flag",
> but we are not in disagreement.
> My argument against sharing the SEEN flag refers to Vivek's patch of
> stacked errseq_sample()/errseq_check_and_advance() which does NOT
> sample errseq at overlayfs mount time. That is why my next sentence is:
> "I do agree with Matthew that overlayfs should sample errseq...".
> 
> > (and possibly an error which occurred before they started up).
> > 
> 
> Right. And this is where the discussion of splitting the SEEN flag started.
> Some of us want to treat overlayfs mount time as a true epoch for errseq.
> The new container didn't write any files yet, so it should not care about
> writeback errors from the past.
> 
> I agree that it may not be very critical, but as I wrote before, I think we
> should do our best to try and isolate container workloads.
> 
> > > To this end, I do agree with Matthew that overlayfs should sample errseq
> > > and the best patchset to implement it so far IMO is Jeff's patchset [2].
> > > This patch set was written to cater only "volatile" overlayfs mount, but
> > > there is no reason not to use the same mechanism for regular overlay
> > > mount. The only difference being that "volatile" overlay only checks for
> > > error since mount on syncfs() (because "volatile" overlay does NOT
> > > syncfs upper fs) and regular overlay checks and advances the overlay's
> > > errseq sample on syncfs (and does syncfs upper fs).
> > > 
> > > Matthew, I hope that my explanation of the use case and Jeff's answer
> > > is sufficient to understand why the split of the SEEN flag is needed.
> > > 
> > > [2] https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlayton@kernel.org/
> > 
> > No, it still feels weird and wrong.
> > 
> 
> All right. Considering your reservations, I think perhaps the split of the
> SEEN flag can wait for a later time after more discussions and maybe
> not as suitable for stable as we thought.
> 
> I think that for stable, it would be sufficient to adapt Sargun's original
> syncfs for volatile mount patch [1] to cover the non-volatile case:
> on mount:
> - errseq_sample() upper fs
> - on volatile mount, errseq_check() upper fs and fail mount on un-SEEN error
> on syncfs:
> - errseq_check() for volatile mount
> - errseq_check_and_advance() for non-volatile mount
> - errseq_set() overlay sb on upper fs error
> 
> Now errseq_set() is not only a hack around __sync_filesystem ignoring
> return value of ->sync_fs(). It is really needed for per-overlay SEEN
> error isolation in the non-volatile case.
> 
> Unless I am missing something, I think we do not strictly need Vivek's
> 1/3 patch [2] for stable, but not sure.
> 
> Sargun,
> 
> Do you agree with the above proposal?
> Will you make it into a patch?
> 
> Vivek, Jeff,
> 
> Do you agree that overlay syncfs observing writeback errors that predate
> overlay mount time is an issue that can be deferred (maybe forever)?
> 

That's very application dependent.

To be clear, the main thing you'll lose with the method above is the
ability to see an unseen error on a newly opened fd, if there was an
overlayfs mount using the same upper sb before your open occurred.

IOW, consider two overlayfs mounts using the same upper layer sb:

ovlfs1				ovlfs2
----------------------------------------------------------------------
mount
open fd1
write to fd1
<writeback fails>
				mount (upper errseq_t SEEN flag marked)
open fd2
syncfs(fd2)
syncfs(fd1)


On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
calls. The first one has a sample from before the error occurred, and
the second one has a sample of 0, due to the fact that the error was
unseen at open time.

On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
return an error and syncfs(fd2) will not. If we split the SEEN flag into
two, then we can ensure that they both still get an error in this
situation.
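
Spelled out with the raw errseq calls (kernel-internal API, reduced to a
sketch; wb_err stands in for the upper sb->s_wb_err):

	errseq_t wb_err = 0;			/* upper sb->s_wb_err */
	errseq_t sample1, sample2;

	sample1 = errseq_sample(&wb_err);	/* "open fd1" */
	errseq_set(&wb_err, -EIO);		/* writeback fails */
	sample2 = errseq_sample(&wb_err);	/* "open fd2": returns 0
						   while the error is un-SEEN */

	/* both report -EIO, in either order: */
	errseq_check_and_advance(&wb_err, &sample2);	/* 0 != current */
	errseq_check_and_advance(&wb_err, &sample1);	/* predates error */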

> BTW, in all the discussions we always assumed that stacked fsync() is correct
> WRT errseq, but in fact, fsync() can also observe an unseen error that predates
> overlay mount.
> In order to fix that, we will probably need to split the SEEN flag and some
> more errseq acrobatics, but again, not sure it is worth the effort.
> 
> Thanks,
> Amir.
> 
> [1] https://lore.kernel.org/linux-unionfs/20201202092720.41522-1-sargun@sargun.me/
> [2] https://lore.kernel.org/linux-unionfs/20201222151752.GA3248@redhat.com/
Amir Goldstein Dec. 28, 2020, 3:51 p.m. UTC | #18
On Mon, Dec 28, 2020 at 3:25 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Fri, 2020-12-25 at 08:50 +0200, Amir Goldstein wrote:
> > On Thu, Dec 24, 2020 at 2:13 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> > > > In current master, syncfs() on any file by any container user will
> > > > result in full syncfs() of the upperfs, which is very bad for container
> > > > isolation. This has been partly fixed by Chengguang Xu [1] and I expect
> > > > his work will be merged soon. Overlayfs still does not do the writeback
> > > > and syncfs() in overlay still waits for all upper fs writeback to complete,
> > > > but at least syncfs() in overlay only kicks writeback for upper fs files
> > > > dirtied by this overlay.
> > > >
> > > > [1] https://lore.kernel.org/linux-unionfs/CAJfpegsbb4iTxW8ZyuRFVNc63zg7Ku7vzpSNuzHASYZH-d5wWA@mail.gmail.com/
> > > >
> > > > Sharing the same SEEN flag among thousands of containers is also
> > > > far from ideal, because effectively this means that any given workload
> > > > in any single container has very little chance of observing the SEEN flag.
> > >
> > > Perhaps you misunderstand how errseq works.  If each container samples
> > > the errseq at startup, then they will all see any error which occurs
> > > during their lifespan
> >
> > Meant to say "...very little chance of NOT observing the SEEN flag",
> > but we are not in disagreement.
> > My argument against sharing the SEEN flag refers to Vivek's patch of
> > stacked errseq_sample()/errseq_check_and_advance() which does NOT
> > sample errseq at overlayfs mount time. That is why my next sentence is:
> > "I do agree with Matthew that overlayfs should sample errseq...".
> >
> > > (and possibly an error which occurred before they started up).
> > >
> >
> > Right. And this is where the discussion of splitting the SEEN flag started.
> > Some of us want to treat overlayfs mount time as a true epoch for errseq.
> > The new container didn't write any files yet, so it should not care about
> > writeback errors from the past.
> >
> > I agree that it may not be very critical, but as I wrote before, I think we
> > should do our best to try and isolate container workloads.
> >
> > > > To this end, I do agree with Matthew that overlayfs should sample errseq
> > > > and the best patchset to implement it so far IMO is Jeff's patchset [2].
> > > > This patch set was written to cater only "volatile" overlayfs mount, but
> > > > there is no reason not to use the same mechanism for regular overlay
> > > > mount. The only difference being that "volatile" overlay only checks for
> > > > error since mount on syncfs() (because "volatile" overlay does NOT
> > > > syncfs upper fs) and regular overlay checks and advances the overlay's
> > > > errseq sample on syncfs (and does syncfs upper fs).
> > > >
> > > > Matthew, I hope that my explanation of the use case and Jeff's answer
> > > > is sufficient to understand why the split of the SEEN flag is needed.
> > > >
> > > > [2] https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlayton@kernel.org/
> > >
> > > No, it still feels weird and wrong.
> > >
> >
> > All right. Considering your reservations, I think perhaps the split of the
> > SEEN flag can wait for a later time after more discussions and maybe
> > not as suitable for stable as we thought.
> >
> > I think that for stable, it would be sufficient to adapt Sargun's original
> > syncfs for volatile mount patch [1] to cover the non-volatile case:
> > on mount:
> > - errseq_sample() upper fs
> > - on volatile mount, errseq_check() upper fs and fail mount on un-SEEN error
> > on syncfs:
> > - errseq_check() for volatile mount
> > - errseq_check_and_advance() for non-volatile mount
> > - errseq_set() overlay sb on upper fs error
> >
> > Now errseq_set() is not only a hack around __sync_filesystem ignoring
> > return value of ->sync_fs(). It is really needed for per-overlay SEEN
> > error isolation in the non-volatile case.
> >
> > Unless I am missing something, I think we do not strictly need Vivek's
> > 1/3 patch [2] for stable, but not sure.
> >
> > Sargun,
> >
> > Do you agree with the above proposal?
> > Will you make it into a patch?
> >
> > Vivek, Jeff,
> >
> > Do you agree that overlay syncfs observing writeback errors that predate
> > overlay mount time is an issue that can be deferred (maybe forever)?
> >
>
> That's very application dependent.
>
> To be clear, the main thing you'll lose with the method above is the
> ability to see an unseen error on a newly opened fd, if there was an
> overlayfs mount using the same upper sb before your open occurred.
>
> IOW, consider two overlayfs mounts using the same upper layer sb:
>
> ovlfs1                          ovlfs2
> ----------------------------------------------------------------------
> mount
> open fd1
> write to fd1
> <writeback fails>
>                                 mount (upper errseq_t SEEN flag marked)
> open fd2
> syncfs(fd2)
> syncfs(fd1)
>
>
> On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> calls. The first one has a sample from before the error occurred, and
> the second one has a sample of 0, due to the fact that the error was
> unseen at open time.
>
> On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> return an error and syncfs(fd2) will not. If we split the SEEN flag into
> two, then we can ensure that they both still get an error in this
> situation.
>

Either I am not following or we are not talking about the same solution.
IMO it is perfectly fine that the ovlfs2 mount consumes the unseen error
on upper fs, because there is no conceptual difference between an overlay
mount and any application that does syncfs().

However, in the situation that you describe, open fd2 is NOT supposed
to sample upper fs errseq at all, it is supposed to sample ovlfs1's
sb->s_wb_err.

syncfs(fd2) reports an error because ovl_check_sync(ofs) sees that
ofs->upper_errseq sample is older than upper fs errseq.

If ovlfs1 is volatile, syncfs(fd1), returns an error for the same reason,
because we never advance ofs->upper_errseq.

If ovlfs1 is non-volatile, the first syncfs(fd2) calls ovl_check_sync(ofs)
that observes the upper fs error which is newer than ofs->upper_errseq
and advances ofs->upper_errseq.
But it also calls errseq_set(&sb->s_wb_err) (a.k.a. "the hack").
This will happen regardless of the upper fs SEEN flag set by the ovlfs2 mount.

The second syncfs(fd1) will also return an error because vfs had sampled
an old errseq value of ovlfs1 sb->s_wb_err, before the call to errseq_set().
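
Putting the flow I described into a rough sketch (ovl_check_sync(),
ofs->upper_errseq and ovl_is_volatile() follow the naming of Jeff's
patchset and are assumptions, not merged code):

static int ovl_check_sync(struct ovl_fs *ofs, struct super_block *upper_sb,
			  bool is_volatile)
{
	if (is_volatile)
		/* never advance: every syncfs keeps seeing the error */
		return errseq_check(&upper_sb->s_wb_err, ofs->upper_errseq);

	/* non-volatile: the first syncfs observes the error and advances */
	return errseq_check_and_advance(&upper_sb->s_wb_err,
					&ofs->upper_errseq);
}

static int ovl_sync_fs(struct super_block *sb, int wait)
{
	struct ovl_fs *ofs = sb->s_fs_info;
	struct super_block *upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
	int err = ovl_check_sync(ofs, upper_sb, ovl_is_volatile(ofs));

	if (err)
		/* "the hack": record the error on the overlay sb so that
		   vfs syncfs() reports it on every open overlay fd */
		errseq_set(&sb->s_wb_err, err);
	return err;
}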

What am I missing?

Thanks,
Amir.
Matthew Wilcox Dec. 28, 2020, 3:56 p.m. UTC | #19
On Mon, Dec 28, 2020 at 08:25:50AM -0500, Jeff Layton wrote:
> To be clear, the main thing you'll lose with the method above is the
> ability to see an unseen error on a newly opened fd, if there was an
> overlayfs mount using the same upper sb before your open occurred.
> 
> IOW, consider two overlayfs mounts using the same upper layer sb:
> 
> ovlfs1				ovlfs2
> ----------------------------------------------------------------------
> mount
> open fd1
> write to fd1
> <writeback fails>
> 				mount (upper errseq_t SEEN flag marked)
> open fd2
> syncfs(fd2)
> syncfs(fd1)
> 
> 
> On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> calls. The first one has a sample from before the error occurred, and
> the second one has a sample of 0, due to the fact that the error was
> unseen at open time.
> 
> On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> return an error and syncfs(fd2) will not. If we split the SEEN flag into
> two, then we can ensure that they both still get an error in this
> situation.

But do we need to?  If the inode has been evicted we also lose the errno.
The guarantee we provide is that a fd that was open before the error
occurred will see the error.  An fd that's opened after the error occurred
may or may not see the error.
Jeff Layton Dec. 28, 2020, 5:26 p.m. UTC | #20
On Mon, 2020-12-28 at 15:56 +0000, Matthew Wilcox wrote:
> On Mon, Dec 28, 2020 at 08:25:50AM -0500, Jeff Layton wrote:
> > To be clear, the main thing you'll lose with the method above is the
> > ability to see an unseen error on a newly opened fd, if there was an
> > overlayfs mount using the same upper sb before your open occurred.
> > 
> > IOW, consider two overlayfs mounts using the same upper layer sb:
> > 
> > ovlfs1				ovlfs2
> > ----------------------------------------------------------------------
> > mount
> > open fd1
> > write to fd1
> > <writeback fails>
> > 				mount (upper errseq_t SEEN flag marked)
> > open fd2
> > syncfs(fd2)
> > syncfs(fd1)
> > 
> > 
> > On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> > calls. The first one has a sample from before the error occurred, and
> > the second one has a sample of 0, due to the fact that the error was
> > unseen at open time.
> > 
> > On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> > return an error and syncfs(fd2) will not. If we split the SEEN flag into
> > two, then we can ensure that they both still get an error in this
> > situation.
> 
> But do we need to?  If the inode has been evicted we also lose the errno.
> The guarantee we provide is that a fd that was open before the error
> occurred will see the error.  An fd that's opened after the error occurred
> may or may not see the error.
> 

In principle, you can lose errors this way (which was the justification
for making errseq_sample return 0 when there are unseen errors). E.g.,
if you close fd1 instead of doing a syncfs on it, that error will be
lost forever.

As to whether that's OK, it's hard to say. It is a deviation from how
this works in a non-containerized situation, and I'd argue that it's
less than ideal. You may or may not see the error on fd2, but it's
dependent on events that take place outside the container and that
aren't observable from within it. That effectively makes the results
non-deterministic, which is usually a bad thing in computing...

--
Jeff Layton <jlayton@kernel.org>
Sargun Dhillon Dec. 28, 2020, 7:25 p.m. UTC | #21
On Mon, Dec 28, 2020 at 9:26 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Mon, 2020-12-28 at 15:56 +0000, Matthew Wilcox wrote:
> > On Mon, Dec 28, 2020 at 08:25:50AM -0500, Jeff Layton wrote:
> > > To be clear, the main thing you'll lose with the method above is the
> > > ability to see an unseen error on a newly opened fd, if there was an
> > > overlayfs mount using the same upper sb before your open occurred.
> > >
> > > IOW, consider two overlayfs mounts using the same upper layer sb:
> > >
> > > ovlfs1                              ovlfs2
> > > ----------------------------------------------------------------------
> > > mount
> > > open fd1
> > > write to fd1
> > > <writeback fails>
> > >                             mount (upper errseq_t SEEN flag marked)
> > > open fd2
> > > syncfs(fd2)
> > > syncfs(fd1)
> > >
> > >
> > > On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> > > calls. The first one has a sample from before the error occurred, and
> > > the second one has a sample of 0, due to the fact that the error was
> > > unseen at open time.
> > >
> > > On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> > > return an error and syncfs(fd2) will not. If we split the SEEN flag into
> > > two, then we can ensure that they both still get an error in this
> > > situation.
> >
> > But do we need to?  If the inode has been evicted we also lose the errno.
> > The guarantee we provide is that a fd that was open before the error
> > occurred will see the error.  An fd that's opened after the error occurred
> > may or may not see the error.
> >
>
> In principle, you can lose errors this way (which was the justification
> for making errseq_sample return 0 when there are unseen errors). E.g.,
> if you close fd1 instead of doing a syncfs on it, that error will be
> lost forever.
>
> As to whether that's OK, it's hard to say. It is a deviation from how
> this works in a non-containerized situation, and I'd argue that it's
> less than ideal. You may or may not see the error on fd2, but it's
> dependent on events that take place outside the container and that
> aren't observable from within it. That effectively makes the results
> non-deterministic, which is usually a bad thing in computing...
>
> --
> Jeff Layton <jlayton@kernel.org>
>

I agree that predictable behaviour outweighs any benefit of complexity
cutting we might do here.
Amir Goldstein Dec. 28, 2020, 7:37 p.m. UTC | #22
On Mon, Dec 28, 2020 at 7:26 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Mon, 2020-12-28 at 15:56 +0000, Matthew Wilcox wrote:
> > On Mon, Dec 28, 2020 at 08:25:50AM -0500, Jeff Layton wrote:
> > > To be clear, the main thing you'll lose with the method above is the
> > > ability to see an unseen error on a newly opened fd, if there was an
> > > overlayfs mount using the same upper sb before your open occurred.
> > >
> > > IOW, consider two overlayfs mounts using the same upper layer sb:
> > >
> > > ovlfs1                              ovlfs2
> > > ----------------------------------------------------------------------
> > > mount
> > > open fd1
> > > write to fd1
> > > <writeback fails>
> > >                             mount (upper errseq_t SEEN flag marked)
> > > open fd2
> > > syncfs(fd2)
> > > syncfs(fd1)
> > >
> > >
> > > On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> > > calls. The first one has a sample from before the error occurred, and
> > > the second one has a sample of 0, due to the fact that the error was
> > > unseen at open time.
> > >
> > > On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> > > return an error and syncfs(fd2) will not. If we split the SEEN flag into
> > > two, then we can ensure that they both still get an error in this
> > > situation.
> >
> > But do we need to?  If the inode has been evicted we also lose the errno.
> > The guarantee we provide is that a fd that was open before the error
> > occurred will see the error.  An fd that's opened after the error occurred
> > may or may not see the error.
> >
>
> In principle, you can lose errors this way (which was the justification
> for making errseq_sample return 0 when there are unseen errors). E.g.,
> if you close fd1 instead of doing a syncfs on it, that error will be
> lost forever.
>
> As to whether that's OK, it's hard to say. It is a deviation from how
> this works in a non-containerized situation, and I'd argue that it's
> less than ideal. You may or may not see the error on fd2, but it's
> dependent on events that take place outside the container and that
> aren't observable from within it. That effectively makes the results
> non-deterministic, which is usually a bad thing in computing...
>

I understand that user experience inside containers will deviate from
non-containerized use cases. I can't say that I fully understand all the
situations where it deviates.

Having said that, I never objected to the SEEN flag split.
To me, the split looks architecturally correct. If not for anything else,
then for not observing past errors inside the overlay mount.
I think you still need to convince Matthew though.

The question remains what, if anything, should be nominated for
stable. I was trying to propose the minimal patch that fixes the
most basic syncfs overlayfs issues. In that context, it seemed
that the issues that the SEEN flag split solves are not on the
MUST HAVE list, but maybe I am wrong.

Sargun,

How about sending another version of your patch, with or without the
SEEN flag split (up to you), but this time covering both the volatile and
non-volatile cases, following my proposal.

At least we can continue debating on a concrete patch instead of
an envisioned combination of pieces posted to the list.

If you can give some examples of use cases that the patch fixes
with and without the SEEN flag split, that could be useful for the
discussion.

Thanks,
Amir.
Matthew Wilcox Dec. 28, 2020, 8:48 p.m. UTC | #23
On Mon, Dec 28, 2020 at 09:37:37PM +0200, Amir Goldstein wrote:
> Having said that, I never objected to the SEEN flag split.

I STRONGLY object to the SEEN flag split.  I think it is completely
unnecessary and nobody's shown me a use-case that changes my mind.
Jeff Layton Jan. 2, 2021, 1:25 p.m. UTC | #24
On Mon, 2020-12-28 at 20:48 +0000, Matthew Wilcox wrote:
> On Mon, Dec 28, 2020 at 09:37:37PM +0200, Amir Goldstein wrote:
> > Having said that, I never objected to the SEEN flag split.
> 
> I STRONGLY object to the SEEN flag split.  I think it is completely
> unnecessary and nobody's shown me a use-case that changes my mind.

I think the flag split makes better sense conceptually, though the
existing callers don't really have a need for it. I have a use-case in
mind that doesn't really involve overlayfs:

We still have a lot of internal callers that ultimately call
filemap_check_errors() to check and clear the mapping's AS_EIO/AS_ENOSPC
flags.

Splitting the SEEN flag in two could allow those callers to instead
sample the errseq_t using errseq_peek for their own purposes, without
clearing the REPORTED flag. That means that the existing semantics for
seeing errors on newly opened files could be preserved while allowing
internal callers to use errseq_t-based error handling.

That said, I don't have any patches to do this right now. It's a fairly
significant project to convert all of the existing callers of
filemap_check_errors() to such a scheme wholesale. It could be done
piecemeal though, and we could start discouraging new callers of
filemap_check_errors and the like.
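
To sketch what such a conversion might look like (errseq_peek() is from
the proposed split, not mainline, so its name and exact semantics here are
assumptions, and the caller is hypothetical):

static int example_internal_wait(struct address_space *mapping)
{
	/* internal sample: must not consume the user-visible error */
	errseq_t since = errseq_peek(&mapping->wb_err);
	int err;

	filemap_fdatawait_keep_errors(mapping);

	/* report errors since the sample to this internal caller only;
	 * the REPORTED flag is untouched, so a later fsync/syncfs from
	 * user space still sees the error */
	err = errseq_check(&mapping->wb_err, since);
	if (err)
		pr_debug("internal writeback error: %d\n", err);
	return err;
}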
Vivek Goyal Jan. 4, 2021, 3:14 p.m. UTC | #25
On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> On Wed, Dec 23, 2020 at 10:44 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Dec 23, 2020 at 08:21:41PM +0000, Sargun Dhillon wrote:
> > > On Wed, Dec 23, 2020 at 08:07:46PM +0000, Matthew Wilcox wrote:
> > > > On Wed, Dec 23, 2020 at 07:29:41PM +0000, Sargun Dhillon wrote:
> > > > > On Wed, Dec 23, 2020 at 06:50:44PM +0000, Matthew Wilcox wrote:
> > > > > > On Wed, Dec 23, 2020 at 06:20:27PM +0000, Sargun Dhillon wrote:
> > > > > > > I fail to see why this is necessary if you incorporate error reporting into the
> > > > > > > sync_fs callback. Why is this separate from that callback? If you pick up Jeff's
> > > > > > > patch that adds the 2nd flag to errseq for "observed", you should be able to
> > > > > > > stash the first errseq seen in the ovl_fs struct, and do the check-and-return
> > > > > > > in there instead of adding this new infrastructure.
> > > > > >
> > > > > > You still haven't explained why you want to add the "observed" flag.
> > > > >
> > > > >
> > > > > In the overlayfs model, many users may be using the same filesystem (super block)
> > > > > for their upperdir. Let's say you have something like this:
> > > > >
> > > > > /workdir [Mounted FS]
> > > > > /workdir/upperdir1 [overlayfs upperdir]
> > > > > /workdir/upperdir2 [overlayfs upperdir]
> > > > > /workdir/userscratchspace
> > > > >
> > > > > The user needs to be able to do something like:
> > > > > sync -f ${overlayfs1}/file
> > > > >
> > > > > which in turn will call sync on the the underlying filesystem (the one mounted
> > > > > on /workdir), and can check if the errseq has changed since the overlayfs was
> > > > > mounted, and use that to return an error to the user.
> > > >
> > > > OK, but I don't see why the current scheme doesn't work for this.  If
> > > > (each instance of) overlayfs samples the errseq at mount time and then
> > > > check_and_advances it at sync time, it will see any error that has occurred
> > > > since the mount happened (and possibly also an error which occurred before
> > > > the mount happened, but hadn't been reported to anybody before).
> > > >
> > >
> > > If there is an outstanding error at mount time, and the SEEN flag is unset,
> > > subsequent errors will not increment the counter, until the user calls sync on
> > > the upperdir's filesystem. If overlayfs calls check_and_advance on the upperdir's
> > > super block at any point, it will then set the seen block, and if the user calls
> > > syncfs on the upperdir, it will not return that there is an outstanding error,
> > > since overlayfs just cleared it.
> >
> > Your concern is this case:
> >
> > fs is mounted on /workdir
> > /workdir/A is written to and then closed.
> > writeback happens and -EIO happens, but there's nobody around to care.
> > /workdir/upperdir1 becomes part of an overlayfs mount
> > overlayfs samples the error
> > a user writes to /workdir/B, another -EIO occurs, but nothing happens
> > someone calls syncfs on /workdir/upperdir/A, gets the EIO.
> > a user opens /workdir/B and calls syncfs, but sees no error
> >
> > do i have that right?  or is it something else?
> 
> IMO it is something else. Others may disagree.
> IMO the level of interference between users accessing overlay and users
> accessing upper fs directly is not well defined and it can stay this way.
> 
> Concurrent access to  /workdir/upperdir/A via overlay and underlying fs
> is explicitly warned against in Documentation/filesystems/overlayfs.rst#
> Changes to underlying filesystems:
> "Changes to the underlying filesystems while part of a mounted overlay
> filesystem are not allowed.  If the underlying filesystem is changed,
> the behavior of the overlay is undefined, though it will not result in
> a crash or deadlock."

I think people use the same underlying filesystem both as upper for multiple
overlayfs mounts and as the root filesystem. For example, when you
run podman (or docker), all containers share the same filesystem,
and other non-containerized apps use that same filesystem as well.

IIUC, what we meant to say is that lowerdir/workdir/upperdir being
used for overlayfs mount should be left untouched. Right?

What I am trying to say is that while discussing this problem and
solution, we should assume that a regular application might
be using the same upper fs that overlayfs is using. It seems to
be a very common operating model.

> 
> The question is whether syncfs(open(/workdir/B)) is considered
> "Changes to the underlying filesystems". Regardless of the answer,
> this is not an interesting case IMO.
> 
> The real issue is with interference between overlays that share the
> same upper fs, because this is by far and large the common use case
> that is creating real problems for a lot of container users.
> 
> Workloads running inside containers (with overlayfs storage driver)
> will never be as isolated as workloads running inside VMs, but it
> doesn't mean we cannot try to improve.
> 
> In current master, syncfs() on any file by any container user will
> result in full syncfs() of the upperfs, which is very bad for container
> isolation. This has been partly fixed by Chengguang Xu [1] and I expect
> his work will be merged soon. Overlayfs still does not do the writeback
> and syncfs() in overlay still waits for all upper fs writeback to complete,
> but at least syncfs() in overlay only kicks writeback for upper fs files
> dirtied by this overlay.
> 
> [1] https://lore.kernel.org/linux-unionfs/CAJfpegsbb4iTxW8ZyuRFVNc63zg7Ku7vzpSNuzHASYZH-d5wWA@mail.gmail.com/
> 
> Sharing the same SEEN flag among thousands of containers is also
> far from ideal, because effectively this means that any given workload
> in any single container has very little chance of observing the SEEN flag.
> 
> To this end, I do agree with Matthew that overlayfs should sample errseq
> and the best patchset to implement it so far IMO is Jeff's patchset [2].
> This patch set was written to cater only "volatile" overlayfs mount, but
> there is no reason not to use the same mechanism for regular overlay
> mount. The only difference being that "volatile" overlay only checks for
> error since mount on syncfs() (because "volatile" overlay does NOT
> syncfs upper fs) and regular overlay checks and advances the overlay's
> errseq sample on syncfs (and does syncfs upper fs).
> 
> Matthew, I hope that my explanation of the use case and Jeff's answer
> is sufficient to understand why the split of the SEEN flag is needed.
> 
> [2] https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlayton@kernel.org/
> 
> w.r.t Vivek's patchset (this one), I do not object to it at all, but it fixes
> a problem that Jeff's patch had already solved with an ugly hack:
> 
>   /* Propagate errors from upper to overlayfs */
>   ret = errseq_check(&upper_sb->s_wb_err, ofs->err_mark);
>   errseq_set(&sb->s_wb_err, ret);
> 
> Since Jeff's patch is minimal, I think that it should be the fix applied
> first and proposed for stable (with adaptations for non-volatile overlay).

Does the stable fix have to be the same as the mainline fix? IOW, I think
at least in mainline we should first fix it the right way and then think
about how to fix it for stable. If the fixes taken in mainline are not
realistic for stable, can we push a different small fix just for stable?

IOW, the fact that we have to push a fix to stable should not determine
what the solution for mainline should be, IMHO.

The problem I have with Jeff's fix is that it only works for volatile
mounts, while I prefer a solution where syncfs() is fixed for both
volatile and non-volatile mounts, so there is less confusion.

Thanks
Vivek

> 
> I guess that Vivek's patch 1/3 from this series [3] is also needed to
> complement the work that should go to stable.
> 
> Vivek, Sargun,
> 
> Do you understand my proposal?
> Do you agree with it as a way forward to address the various syncfs
> issues for volatile/non-volatile that both of you were trying to address?
> 
> Sargun, I know all this discussion has forked from your volatile re-use
> patch set, but let's not confuse fsdevel forks more than we have to.
> The way forward for volatile re-use from this proposal is straight forward.
> 
> Thanks,
> Amir.
>
Amir Goldstein Jan. 4, 2021, 3:22 p.m. UTC | #26
> > Since Jeff's patch is minimal, I think that it should be the fix applied
> > first and proposed for stable (with adaptations for non-volatile overlay).
>
> Does the stable fix have to be the same as the mainline fix? IOW, I think
> at least in mainline we should first fix it the right way and then think
> about how to fix it for stable. If the fixes taken in mainline are not
> realistic for stable, can we push a different small fix just for stable?

We can do a lot of things.
But if we are able to create a series with minimal (and most critical) fixes
followed by other fixes, it would be easier for everyone involved.

>
> IOW, the fact that we have to push a fix to stable should not determine
> what the solution for mainline should be, IMHO.
>

I find in this case there is a correlation between the simplest fix and the
most relevant fix for stable.

> The problem I have with Jeff's fix is that it only works for volatile
> mounts, while I prefer a solution where syncfs() is fixed for both
> volatile and non-volatile mounts, so there is less confusion.
>

I proposed a variation on Jeff's patch that covers both cases.
Sargun is going to work on it.

Thanks,
Amir.
Vivek Goyal Jan. 4, 2021, 3:40 p.m. UTC | #27
On Mon, Jan 04, 2021 at 05:22:07PM +0200, Amir Goldstein wrote:
> > > Since Jeff's patch is minimal, I think that it should be the fix applied
> > > first and proposed for stable (with adaptations for non-volatile overlay).
> >
> > Does the stable fix have to be the same as the mainline fix? IOW, I think
> > at least in mainline we should first fix it the right way and then think
> > about how to fix it for stable. If the fixes taken in mainline are not
> > realistic for stable, can we push a different small fix just for stable?
> 
> We can do a lot of things.
> But if we are able to create a series with minimal (and most critical) fixes
> followed by other fixes, it would be easier for everyone involved.

I am not sure this is really critical. Writeback error reporting for
overlayfs has been broken since the beginning for regular mounts. There is no
notion of these errors being reported to user space. If that did not
create a major issue, then why do volatile mounts suddenly make it
a critical issue?

To me, we should fix the issue properly, in a way that is easy to maintain
down the line, and then worry about doing a stable fix if need be.

> 
> >
> > IOW, the fact that we have to push a fix to stable should not determine
> > what the solution for mainline should be, IMHO.
> >
> 
> I find in this case there is a correlation between the simplest fix and the
> most relevant fix for stable.
> 
> > The problem I have with Jeff's fix is that it only works for volatile
> > mounts, while I prefer a solution where syncfs() is fixed for both
> > volatile and non-volatile mounts, so there is less confusion.
> >
> 
> I proposed a variation on Jeff's patch that covers both cases.
> Sargun is going to work on it.

What's the problem with my patches, which fix syncfs() error reporting
for overlayfs for both volatile and non-volatile mounts?

Thanks
Vivek
Vivek Goyal Jan. 4, 2021, 3:51 p.m. UTC | #28
On Mon, Dec 28, 2020 at 05:51:06PM +0200, Amir Goldstein wrote:
> On Mon, Dec 28, 2020 at 3:25 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Fri, 2020-12-25 at 08:50 +0200, Amir Goldstein wrote:
> > > On Thu, Dec 24, 2020 at 2:13 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> > > > > In current master, syncfs() on any file by any container user will
> > > > > result in full syncfs() of the upperfs, which is very bad for container
> > > > > isolation. This has been partly fixed by Chengguang Xu [1] and I expect
> > > > > his work will be merged soon. Overlayfs still does not do the writeback
> > > > > and syncfs() in overlay still waits for all upper fs writeback to complete,
> > > > > but at least syncfs() in overlay only kicks writeback for upper fs files
> > > > > dirtied by this overlay.
> > > > >
> > > > > [1] https://lore.kernel.org/linux-unionfs/CAJfpegsbb4iTxW8ZyuRFVNc63zg7Ku7vzpSNuzHASYZH-d5wWA@mail.gmail.com/
> > > > >
> > > > > Sharing the same SEEN flag among thousands of containers is also
> > > > > far from ideal, because effectively this means that any given workload
> > > > > in any single container has very little chance of observing the SEEN flag.
> > > >
> > > > Perhaps you misunderstand how errseq works.  If each container samples
> > > > the errseq at startup, then they will all see any error which occurs
> > > > during their lifespan
> > >
> > > Meant to say "...very little chance of NOT observing the SEEN flag",
> > > but we are not in disagreement.
> > > My argument against sharing the SEEN flag refers to Vivek's patch of
> > > stacked errseq_sample()/errseq_check_and_advance() which does NOT
> > > sample errseq at overlayfs mount time. That is why my next sentence is:
> > > "I do agree with Matthew that overlayfs should sample errseq...".
> > >
> > > > (and possibly an error which occurred before they started up).
> > > >
> > >
> > > Right. And this is where the discussion of splitting the SEEN flag started.
> > > Some of us want to treat overlayfs mount time as a true epoch for errseq.
> > > The new container didn't write any files yet, so it should not care about
> > > writeback errors from the past.
> > >
> > > I agree that it may not be very critical, but as I wrote before, I think we
> > > should do our best to try and isolate container workloads.
> > >
> > > > > To this end, I do agree with Matthew that overlayfs should sample errseq
> > > > > and the best patchset to implement it so far IMO is Jeff's patchset [2].
> > > > > This patch set was written to cater only "volatile" overlayfs mount, but
> > > > > there is no reason not to use the same mechanism for regular overlay
> > > > > mount. The only difference being that "volatile" overlay only checks for
> > > > > error since mount on syncfs() (because "volatile" overlay does NOT
> > > > > syncfs upper fs) and regular overlay checks and advances the overlay's
> > > > > errseq sample on syncfs (and does syncfs upper fs).
> > > > >
> > > > > Matthew, I hope that my explanation of the use case and Jeff's answer
> > > > > is sufficient to understand why the split of the SEEN flag is needed.
> > > > >
> > > > > [2] https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlayton@kernel.org/
> > > >
> > > > No, it still feels weird and wrong.
> > > >
> > >
> > > All right. Considering your reservations, I think perhaps the split of the
> > > SEEN flag can wait for a later time after more discussions and maybe
> > > not as suitable for stable as we thought.
> > >
> > > I think that for stable, it would be sufficient to adapt Sargun's original
> > > syncfs for volatile mount patch [1] to cover the non-volatile case:
> > > on mount:
> > > - errseq_sample() upper fs
> > > - on volatile mount, errseq_check() upper fs and fail mount on un-SEEN error
> > > on syncfs:
> > > - errseq_check() for volatile mount
> > > - errseq_check_and_advance() for non-volatile mount
> > > - errseq_set() overlay sb on upper fs error
> > >
> > > Now errseq_set() is not only a hack around __sync_filesystem ignoring
> > > return value of ->sync_fs(). It is really needed for per-overlay SEEN
> > > error isolation in the non-volatile case.
> > >
> > > Unless I am missing something, I think we do not strictly need Vivek's
> > > 1/3 patch [2] for stable, but not sure.
> > >
> > > Sargun,
> > >
> > > Do you agree with the above proposal?
> > > Will you make it into a patch?
> > >
> > > Vivek, Jeff,
> > >
> > > Do you agree that overlay syncfs observing writeback errors that predate
> > > overlay mount time is an issue that can be deferred (maybe forever)?
> > >
> >
> > That's very application dependent.
> >
> > To be clear, the main thing you'll lose with the method above is the
> > ability to see an unseen error on a newly opened fd, if there was an
> > overlayfs mount using the same upper sb before your open occurred.
> >
> > IOW, consider two overlayfs mounts using the same upper layer sb:
> >
> > ovlfs1                          ovlfs2
> > ----------------------------------------------------------------------
> > mount
> > open fd1
> > write to fd1
> > <writeback fails>
> >                                 mount (upper errseq_t SEEN flag marked)
> > open fd2
> > syncfs(fd2)
> > syncfs(fd1)
> >
> >
> > On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> > calls. The first one has a sample from before the error occurred, and
> > the second one has a sample of 0, due to the fact that the error was
> > unseen at open time.
> >
> > On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> > return an error and syncfs(fd2) will not. If we split the SEEN flag into
> > two, then we can ensure that they both still get an error in this
> > situation.
> >
> 
> Either I am not following or we are not talking about the same solution.
> IMO it is perfectly fine that the ovlfs2 mount consumes the unseen error
> on upper fs, because there is no conceptual difference between an overlay
> mount and any application that does syncfs().

I think currently mount() does not consume writeback errors on the upper
filesystem. So if mount() of overlayfs notices an unseen error on
upper, and fails the mount (without consuming the upper sb error), then it
should be fine. We discussed that in the past as well. You proposed it.

This will mean that the user will need to call syncfs() on upper to clear
the unseen error and then retry the overlay mount. And this will not
necessarily need splitting the SEEN flag.
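
A minimal sketch of that mount-time check (not from any posted patch; the
helper name is made up, and only the two errseq calls are the real API):

/* Detect an un-SEEN writeback error on the upper sb at mount time
 * without consuming it: errseq_sample() returns 0 while an un-SEEN
 * error is pending, and neither call sets the SEEN flag, so the error
 * is left for the upper fs user to collect via syncfs(). */
static int ovl_fail_on_unseen_err(struct super_block *upper_sb)
{
	errseq_t since = errseq_sample(&upper_sb->s_wb_err);

	return errseq_check(&upper_sb->s_wb_err, since);
}

If this returns non-zero, the overlay mount is failed with the error left
un-consumed on the upper sb.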

This requires calling syncfs() on upper in case of an unseen error. And
Sargun wanted to avoid syncfs() on upper completely.

Vivek

> 
> However, in the situation that you describe, open fd2 is NOT supposed
> to sample upper fs errseq at all, it is supposed to sample ovlfs1's
> sb->s_wb_err.
> 
> syncfs(fd2) reports an error because ovl_check_sync(ofs) sees that
> ofs->upper_errseq sample is older than upper fs errseq.
> 
> If ovlfs1 is volatile, syncfs(fd1), returns an error for the same reason,
> because we never advance ofs->upper_errseq.
> 
> If ovlfs1 is non-volatile, the first syncfs(fd2) calls ovl_check_sync(ofs)
> that observes the upper fs error which is newer than ofs->upper_errseq
> and advances ofs->upper_errseq.
> But it also calls errseq_set(&sb->s_wb_err) (a.k.a. "the hack").
> This will happen regardless of the upper fs SEEN flag set by the ovlfs2 mount.
> 
> The second syncfs(fd1) will also return an error because vfs had sampled
> an old errseq value of ovlfs1 sb->s_wb_err, before the call to errseq_set().
> 
> What am I missing?
> 
> Thanks,
> Amir.
>
Vivek Goyal Jan. 4, 2021, 4:59 p.m. UTC | #29
On Mon, Dec 28, 2020 at 03:56:18PM +0000, Matthew Wilcox wrote:
> On Mon, Dec 28, 2020 at 08:25:50AM -0500, Jeff Layton wrote:
> > To be clear, the main thing you'll lose with the method above is the
> > ability to see an unseen error on a newly opened fd, if there was an
> > overlayfs mount using the same upper sb before your open occurred.
> > 
> > IOW, consider two overlayfs mounts using the same upper layer sb:
> > 
> > ovlfs1				ovlfs2
> > ----------------------------------------------------------------------
> > mount
> > open fd1
> > write to fd1
> > <writeback fails>
> > 				mount (upper errseq_t SEEN flag marked)
> > open fd2
> > syncfs(fd2)
> > syncfs(fd1)
> > 
> > 
> > On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> > calls. The first one has a sample from before the error occurred, and
> > the second one has a sample of 0, due to the fact that the error was
> > unseen at open time.
> > 
> > On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> > return an error and syncfs(fd2) will not. If we split the SEEN flag into
> > two, then we can ensure that they both still get an error in this
> > situation.
> 
> But do we need to?  If the inode has been evicted we also lose the errno.

That's for the case of fsync(), right? For the case of syncfs() we will
not lose the error, as it's stored in the super_block.

Even for the case of fsync(), the inode can be evicted only if no other
fd is open for the file. So in the above example, fd1 is open, so the
inode can't be evicted, which means we will see the error on syncfs(fd2)
and not lose it.

So if we start consuming upper fs errors on overlay mount(), it will be a
change of behavior for applications using the same upper fs. So far
overlay mount() does not consume the unseen error, and even if an fd
is opened after the error, the application will see the error on the super
block. If we consume the error on mount(), we change that behavior.

I am not saying that's necessarily bad, I am just trying to point
out that it's a user-space-visible behavior change, and I worry
that somebody will start calling it a regression.

Anyway, it looks like two problems got mixed into the same thread. One
problem we need to solve is that syncfs() on overlayfs should
report back writeback errors (as well as other errors) to applications.
And that's what this patch series is solving.

The second issue is detecting writeback errors across remount
for volatile mounts. That's where the question comes up whether
we should split the SEEN flag or simply consume the error on
mount. This can be discussed further when patches for these
changes are posted again.

For now, I will focus on trying to fix the first issue and post patches
for that again after more testing.

Vivek


> The guarantee we provide is that a fd that was open before the error
> occurred will see the error.  An fd that's opened after the error occurred
> may or may not see the error.
Vivek Goyal Jan. 4, 2021, 8 p.m. UTC | #30
On Wed, Dec 23, 2020 at 06:20:27PM +0000, Sargun Dhillon wrote:
> On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> > Currently syncfs() and fsync() seem to be two interfaces which check and
> > return writeback errors on superblock to user space. fsync() should
> > work fine with overlayfs as it relies on underlying filesystem to
> > do the check and return error. For example, if ext4 is on upper filesystem,
> > then ext4_sync_file() calls file_check_and_advance_wb_err(file) on
> > upper file and returns error. So overlayfs does not have to do anything
> > special.
> > 
> > But with syncfs(), error check happens in vfs in syncfs() w.r.t
> > overlay_sb->s_wb_err. Given overlayfs is stacked filesystem, it
> > does not do actual writeback and all writeback errors are recorded
> > on underlying filesystem. So sb->s_wb_err is never updated hence
> > syncfs() does not work with overlay.
> > 
> > Jeff suggested that instead of trying to propagate errors to overlay
> > super block, why not simply check for errors against upper filesystem
> > super block. I implemented this idea.
> > 
> > Overlay file has "since" value which needs to be initialized at open
> > time. Overlay overrides VFS initialization and re-initializes
> > f->f_sb_err w.r.t upper super block. Later when
> > ovl_sb->errseq_check_advance() is called, f->f_sb_err is used as
> > since value to figure out if any error on upper sb has happened since
> > then.
> > 
> > Note, Right now this patch only deals with regular file and directories.
> > Yet to deal with special files like device inodes, socket, fifo etc.
> > 
> > Suggested-by: Jeff Layton <jlayton@kernel.org>
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  fs/overlayfs/file.c      |  1 +
> >  fs/overlayfs/overlayfs.h |  1 +
> >  fs/overlayfs/readdir.c   |  1 +
> >  fs/overlayfs/super.c     | 23 +++++++++++++++++++++++
> >  fs/overlayfs/util.c      | 13 +++++++++++++
> >  5 files changed, 39 insertions(+)
> > 
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index efccb7c1f9bc..7b58a44dcb71 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -163,6 +163,7 @@ static int ovl_open(struct inode *inode, struct file *file)
> >  		return PTR_ERR(realfile);
> >  
> >  	file->private_data = realfile;
> > +	ovl_init_file_errseq(file);
> >  
> >  	return 0;
> >  }
> > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > index f8880aa2ba0e..47838abbfb3d 100644
> > --- a/fs/overlayfs/overlayfs.h
> > +++ b/fs/overlayfs/overlayfs.h
> > @@ -322,6 +322,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct dentry *dentry);
> >  bool ovl_is_metacopy_dentry(struct dentry *dentry);
> >  char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
> >  			     int padding);
> > +void ovl_init_file_errseq(struct file *file);
> >  
> >  static inline bool ovl_is_impuredir(struct super_block *sb,
> >  				    struct dentry *dentry)
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index 01620ebae1bd..0c48f1545483 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -960,6 +960,7 @@ static int ovl_dir_open(struct inode *inode, struct file *file)
> >  	od->is_real = ovl_dir_is_real(file->f_path.dentry);
> >  	od->is_upper = OVL_TYPE_UPPER(type);
> >  	file->private_data = od;
> > +	ovl_init_file_errseq(file);
> >  
> >  	return 0;
> >  }
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index 290983bcfbb3..d99867983722 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -390,6 +390,28 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
> >  	return ret;
> >  }
> >  
> > +static int ovl_errseq_check_advance(struct super_block *sb, struct file *file)
> > +{
> > +	struct ovl_fs *ofs = sb->s_fs_info;
> > +	struct super_block *upper_sb;
> > +	int ret;
> > +
> > +	if (!ovl_upper_mnt(ofs))
> > +		return 0;
> > +
> > +	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > +
> > +	if (!errseq_check(&upper_sb->s_wb_err, file->f_sb_err))
> > +		return 0;
> > +
> > +	/* Something changed, must use slow path */
> > +	spin_lock(&file->f_lock);
> > +	ret = errseq_check_and_advance(&upper_sb->s_wb_err, &file->f_sb_err);
> > +	spin_unlock(&file->f_lock);
> > +
> > +	return ret;
> > +}
> > +
> >  static const struct super_operations ovl_super_operations = {
> >  	.alloc_inode	= ovl_alloc_inode,
> >  	.free_inode	= ovl_free_inode,
> > @@ -400,6 +422,7 @@ static const struct super_operations ovl_super_operations = {
> >  	.statfs		= ovl_statfs,
> >  	.show_options	= ovl_show_options,
> >  	.remount_fs	= ovl_remount,
> > +	.errseq_check_advance	= ovl_errseq_check_advance,
> >  };
> >  
> >  enum {
> > diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> > index 23f475627d07..a1742847f3a8 100644
> > --- a/fs/overlayfs/util.c
> > +++ b/fs/overlayfs/util.c
> > @@ -950,3 +950,16 @@ char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
> >  	kfree(buf);
> >  	return ERR_PTR(res);
> >  }
> > +
> > +void ovl_init_file_errseq(struct file *file)
> > +{
> > +	struct super_block *sb = file_dentry(file)->d_sb;
> > +	struct ovl_fs *ofs = sb->s_fs_info;
> > +	struct super_block *upper_sb;
> > +
> > +	if (!ovl_upper_mnt(ofs))
> > +		return;
> > +
> > +	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > +	file->f_sb_err = errseq_sample(&upper_sb->s_wb_err);
> > +}
> > -- 
> > 2.25.4
> > 
> 
> I fail to see why this is necessary if you incorporate error reporting into the
> sync_fs callback. Why is this separate from that callback?

- The writeback error should be checked after __sync_blockdev() has been
  called. And that's called by VFS and not in ->sync_fs. This can be
  changed by moving __sync_blockdev() inside ->sync_fs() for all
  filesystems which need it. And that's what Jeff had done in the
  past. That attempt did not make it upstream.

  https://lore.kernel.org/linux-fsdevel/20180518123415.28181-1-jlayton@kernel.org/ 

  Given this triggered changes all over the place, I started looking for
  alternatives, and thought that adding a super operation to check for
  errors solves the problem and keeps the changes to a minimum.
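
  To illustrate, the vfs side of that idea could slot into syncfs()
  roughly as below (a sketch only, not the literal patch 1/3; the
  file->f_lock locking that syncfs() takes is omitted):

/* called from syncfs() after __sync_blockdev() has run */
static int syncfs_check_errors(struct super_block *sb, struct file *file)
{
	if (sb->s_op->errseq_check_advance)
		/* stacked fs (e.g. overlayfs) checks the upper sb instead */
		return sb->s_op->errseq_check_advance(sb, file);

	/* default: check the sb's own s_wb_err, as syncfs() does today */
	return errseq_check_and_advance(&sb->s_wb_err, &file->f_sb_err);
}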

> If you pick up Jeff's
> patch that adds the 2nd flag to errseq for "observed", you should be able to
> stash the first errseq seen in the ovl_fs struct, and do the check-and-return
> in there instead of adding this new infrastructure.

Why stash errseq in the ovl_fs struct when we can directly compare it
with the upper sb? The "since" value is in "struct file" and the current
errseq value is in upper_sb. At least to solve this problem, it should not
be mandatory to stash the errseq value in ovl_fs.

> 
> IMHO, if we're going to fix this, sync_fs should be replaced, and there should 
> be a generic_sync_fs wrapper which does the errseq, callback, and sync blockdev, 
> but then filesystems should be able to override it and do the requisite work.

Not exactly sure what you mean. But I guess you are falling back to the
idea of moving some of the vfs functionality of calling __sync_blockdev()
into the filesystem ->sync_fs handler. That leads back to patches Jeff
Layton had posted in the past and is a much more invasive change.

I am not against that approach, but so far I have not seen any strong
interest from others in favor of that approach. So I find
this approach simpler and much less intrusive.

Vivek
Amir Goldstein Jan. 4, 2021, 9:42 p.m. UTC | #31
On Mon, Jan 4, 2021 at 5:40 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, Jan 04, 2021 at 05:22:07PM +0200, Amir Goldstein wrote:
> > > > Since Jeff's patch is minimal, I think that it should be the fix applied
> > > > first and proposed for stable (with adaptations for non-volatile overlay).
> > >
> > > Does the stable fix have to be the same as the mainline fix? IOW, I think
> > > at least in mainline we should first fix it the right way and then think
> > > about how to fix it for stable. If the fixes taken in mainline are not
> > > realistic for stable, can we push a different small fix just for stable?
> >
> > We can do a lot of things.
> > But if we are able to create a series with minimal (and most critical) fixes
> > followed by other fixes, it would be easier for everyone involved.
>
> I am not sure this is really critical. Writeback error reporting for
> overlayfs has been broken since the beginning for regular mounts. There is no
> notion of these errors being reported to user space. If that did not
> create a major issue, then why do volatile mounts suddenly make it
> a critical issue?
>

Volatile mounts didn't make this a critical issue.
But this discussion made us notice a mildly serious issue.
It is not surprising to me that users did not report this issue.
Do you know what it takes for a user to notice that writeback had failed,
but an application did fsync and the error did not get reported?
Filesystem durability guarantees are hard to prove, especially with so
many subsystem layers and with an fsync that does return an error correctly.
I once found a durability bug in fsync of xfs that existed for 12 years.
That fact does not at all make it any less critical.

> To me, we should fix the issue properly, in a way that is easy to maintain
> down the line, and then worry about doing a stable fix if need be.
>
> >
> > >
> > > IOW, the fact that we have to push a fix to stable should not determine
> > > what the solution for mainline should be, IMHO.
> > >
> >
> > I find in this case there is a correlation between the simplest fix and the
> > most relevant fix for stable.
> >
> > > The problem I have with Jeff's fix is that it only works for volatile
> > > mounts, while I prefer a solution where syncfs() is fixed for both
> > > volatile and non-volatile mounts, so there is less confusion.
> > >
> >
> > I proposed a variation on Jeff's patch that covers both cases.
> > Sargun is going to work on it.
>
> What's the problem with my patches, which fix syncfs() error reporting
> for overlayfs for both volatile and non-volatile mounts?
>

- mount 1000 overlays
- 1 writeback error recorded in upper sb
- syncfs (new fd) inside each of the 1000 containers

With your patch 3/3 only one syncfs will report an error for
both volatile and non-volatile cases. Right?

What I would rather see is:
- Non-volatile: first syncfs in every container gets an error (nice to have)
- Volatile: every syncfs and every fsync in every container gets an error
  (important IMO)

This is why I prefer to sample upper sb error on mount and propagate
new errors to overlayfs sb (Jeff's patch).

I am very much in favor of your patch 1/3 and I am not against the concept
of patches 2-3/3. I just think that ovl_errseq_check_advance() is not the
implementation that gives the most desirable result.

If people do accept my point of view that proxying the stacked error check
is preferred over "passthrough" to upper sb error check, then as a by-product,
the new ->check_error() method is not going to make much of a difference for
overlayfs. Maybe it can be used to fine tune some corner cases.
I am not sure.
If we do agree on the propagate error concept then IMO all other use
cases for not consuming the unseen error from upper fs are nice-to-have.

Before we continue to debate on the implementation, let's first try
to agree on the desired behavior, what is a must vs. what is nice to have.
Without consensus on this, it will be quite hard to converge.

Another thing, to help everyone: I think it is best that any patch with an
ovl_syncfs "solution" includes a detailed description of the use cases it
solves and the use cases that it leaves unsolved.

Thanks,
Amir.
Vivek Goyal Jan. 4, 2021, 10:44 p.m. UTC | #32
On Mon, Jan 04, 2021 at 11:42:51PM +0200, Amir Goldstein wrote:
> On Mon, Jan 4, 2021 at 5:40 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Mon, Jan 04, 2021 at 05:22:07PM +0200, Amir Goldstein wrote:
> > > > > Since Jeff's patch is minimal, I think that it should be the fix applied
> > > > > first and proposed for stable (with adaptations for non-volatile overlay).
> > > >
> > > > Does the stable fix have to be the same as the mainline fix? IOW, I think
> > > > at least in mainline we should first fix it the right way and then think
> > > > about how to fix it for stable. If the fixes taken in mainline are not
> > > > realistic for stable, can we push a different small fix just for stable?
> > >
> > > We can do a lot of things.
> > > But if we are able to create a series with minimal (and most critical) fixes
> > > followed by other fixes, it would be easier for everyone involved.
> >
> > I am not sure this is really critical. Writeback error reporting for
> > overlayfs has been broken since the beginning for regular mounts. There is
> > no notion of these errors being reported to user space. If that did not
> > create a major issue, then why do volatile mounts suddenly make it
> > a critical issue?
> >
> 
> Volatile mounts didn't make this a critical issue.
> But this discussion made us notice a mildly serious issue.
> It is not surprising to me that users did not report this issue.
> Do you know what it takes for a user to notice that writeback had failed,
> but an application did fsync and the error did not get reported?
> Filesystem durability guarantees are hard to prove, especially with so
> many subsystem layers and with an fsync that does not return an error correctly.
> I once found a durability bug in fsync of xfs that existed for 12 years.
> That fact does not at all make it any less critical.
> 
> > To me we should fix the issue properly, in a way that is easy to maintain
> > down the line, and then worry about doing a stable fix if need be.
> >
> > >
> > > >
> > > > IOW, the fact that we have to push a fix to stable should not determine
> > > > what the solution for mainline should be, IMHO.
> > > >
> > >
> > > I find in this case there is a correlation between the simplest fix and the
> > > most relevant fix for stable.
> > >
> > > > The problem I have with Jeff's fix is that it only works for volatile
> > > > mounts, while I prefer a solution where syncfs() is fixed both for
> > > > volatile as well as non-volatile mounts, and then there is less confusion.
> > > >
> > >
> > > I proposed a variation on Jeff's patch that covers both cases.
> > > Sargun is going to work on it.
> >
> > What's the problem with my patches, which fix syncfs() error reporting
> > for overlayfs both for volatile and non-volatile mounts?
> >
> 
> - mount 1000 overlays
> - 1 writeback error recorded in upper sb
> - syncfs (new fd) inside each of the 1000 containers
> 
> With your patch 3/3 only one syncfs will report an error for
> both volatile and non-volatile cases. Right?

Right. If you don't have an old fd open in each container, then only
one container will see the error. If you want to see the error in each
container, then one fd needs to be kept open in each container
before the error happens; call syncfs() on that fd, and then each
container should see the error.
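
In userspace the pattern looks roughly like this (illustration only;
the overlay mount point path is made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* open an fd right after mount, before any writeback error happens */
	int fd = open("/containers/c1/merged", O_RDONLY | O_DIRECTORY);

	if (fd < 0)
		return 1;

	/* ... container runs; writeback to the upper fs may fail here ... */

	if (syncfs(fd) < 0)	/* reports errors that happened since open */
		perror("syncfs");

	close(fd);
	return 0;
}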

> 
> What I would rather see is:
> - Non-volatile: first syncfs in every container gets an error (nice to have)

I am not sure why we are making this behavior per container. This should
be no different from the current semantics we have for syncfs() on a regular
filesystem. And that will provide what you are looking for. If you
want a single error to be reported in all overlay mounts, then make
sure you have one fd open in each mount after mount, then call syncfs()
on that fd.

Not sure why overlayfs behavior/semantics should be any different
than what regular filesystems like ext4/xfs are offering. Once we
get page cache sharing sorted out with xfs reflink, then people
will not even need overlayfs and be able to launch containers
just using xfs reflink and share a base image. In that case also
they will need to keep an fd open per container they want to
see an error in.

So my patches provide exactly that. syncfs() behavior is the same with
overlayfs as an application gets on other filesystems. And to me
it's important to keep the behavior the same.

> - Volatile: every syncfs and every fsync in every container gets an error
>   (important IMO)

For volatile mounts, I agree that we need to fail the overlayfs instance
as soon as the first error is detected since mount. And this applies
not only to syncfs()/fsync() but to read/write and other operations too.

For that we will need additional patches, which are floating around,
to keep an errseq sample in overlay and check for errors in all
paths (syncfs/fsync/read/write/...) and fail the fs. But these patches
build on top of my patches. My patches don't solve this problem of
failing the overlay mount for the volatile mount case.

> 
> This is why I prefer to sample upper sb error on mount and propagate
> new errors to overlayfs sb (Jeff's patch).

Ok, I think this is one of the key points of the whole discussion. What
mechanism should be used to propagate writeback errors through overlayfs?

A. Propagate errors from upper sb to overlay sb.
B. Leave overlay sb alone and use upper sb for error checks.

We don't have a good model to propagate errors between super blocks,
so Jeff preferred not to do error propagation between super blocks
for regular mounts.

https://lore.kernel.org/linux-fsdevel/bff90dfee3a3392d67a4f3516ab28989e87fa25f.camel@kernel.org/

If we are not defining new semantics for syncfs() for overlayfs, then
I can't see the advantage of coming up with a new mechanism to
propagate errors to the overlay sb. Approach B should work just fine and
provide the syncfs() semantics we want for overlayfs (same semantics
as other filesystems).

Having said that, I am open to the idea of propagating errors if that
makes the implementation better. It's just an implementation detail to
me, and user-visible behavior should remain the same.

> 
> I am very much in favor of your patch 1/3 and I am not against the concept
> of patches 2-3/3. I just think that ovl_errseq_check_advance() is not the
> implementation that gives the most desirable result.

I think this is the key point of contention. You seem to be expecting
a different syncfs() behavior only for overlayfs and tying it to
the notion of a container. And I am wondering why it should be any
different from any other filesystem. Those who want to see the
upper_sb error in each mounted overlay instance should keep
one fd open in the overlay after mount.

So let's sort out this syncfs() behavior part first before we
get to implementation details.

Thanks
Vivek

> 
> If people do accept my point of view that proxying the stacked error check
> is preferred over "passthrough" to upper sb error check, then as a by-product,
> the new ->check_error() method is not going to make much of a difference for
> overlayfs. Maybe it can be used to fine tune some corner cases.
> I am not sure.
> If we do agree on the propagate error concept then IMO all other use
> cases for not consuming the unseen error from upper fs are nice-to-have.
> 
> Before we continue to debate on the implementation, let's first try
> to agree on the desired behavior, what is a must vs. what is nice to have.
> Without consensus on this, it will be quite hard to converge.
> 
> Another thing, to help everyone, I think it is best that any patch on ovl_syncfs
> "solutions" will include a detailed description of the use cases it solves and
> the use cases that it leaves unsolved.
> 
> Thanks,
> Amir.
>
Amir Goldstein Jan. 5, 2021, 7:11 a.m. UTC | #33
> >
> > What I would rather see is:
> > - Non-volatile: first syncfs in every container gets an error (nice to have)
>
> I am not sure why we are making this behavior per container. This should
> be no different from the current semantics we have for syncfs() on a regular
> filesystem. And that will provide what you are looking for. If you
> want a single error to be reported in all overlay mounts, then make
> sure you have one fd open in each mount after mount, then call syncfs()
> on that fd.
>

Ok.

> Not sure why overlayfs behavior/semantics should be any different
> than what regular filesystems like ext4/xfs are offering. Once we
> get page cache sharing sorted out with xfs reflink, then people
> will not even need overlayfs and be able to launch containers
> just using xfs reflink and share a base image. In that case also
> they will need to keep an fd open per container they want to
> see an error in.
>
> So my patches provide exactly that. syncfs() behavior is the same with
> overlayfs as an application gets on other filesystems. And to me
> it's important to keep the behavior the same.
>
> > - Volatile: every syncfs and every fsync in every container gets an error
> >   (important IMO)
>
> For volatile mounts, I agree that we need to fail the overlayfs instance
> as soon as the first error is detected since mount. And this applies
> not only to syncfs()/fsync() but to read/write and other operations too.
>
> For that we will need additional patches, which are floating around,
> to keep an errseq sample in overlay and check for errors in all
> paths (syncfs/fsync/read/write/...) and fail the fs.

> But these patches build on top of my patches.

Here we disagree.

I don't see how Jeff's patch is "building on top of your patches"
seeing that it is perfectly well contained and does not in fact depend
on your patches.

And I do insist that the fix for volatile mounts syncfs/fsync error
reporting should be applied before your patches or at the very least
not heavily depend on them.

Volatile mounts were introduced in the fresh new v5.10, which is also an
LTS kernel. It would be inconsiderate toward volatile mount users and
developers to make backporting that fix to v5.10.y any harder than it
should be.

> My patches don't solve this problem of failing the overlay mount for
> the volatile mount case.
>

Here we agree.

> >
> > This is why I prefer to sample upper sb error on mount and propagate
> > new errors to overlayfs sb (Jeff's patch).
>
> Ok, I think this is one of the key points of the whole discussion. What
> mechanism should be used to propagate writeback errors through overlayfs?
>
> A. Propagate errors from upper sb to overlay sb.
> B. Leave overlay sb alone and use upper sb for error checks.
>
> We don't have a good model to propagate errors between super blocks,
> so Jeff preferred not to do error propagation between super blocks
> for regular mounts.
>
> https://lore.kernel.org/linux-fsdevel/bff90dfee3a3392d67a4f3516ab28989e87fa25f.camel@kernel.org/
>
> If we are not defining new semantics for syncfs() for overlayfs, then
> I can't see the advantage of coming up with a new mechanism to
> propagate errors to the overlay sb. Approach B should work just fine and
> provide the syncfs() semantics we want for overlayfs (same semantics
> as other filesystems).
>

Ok. I am on board with B.

Philosophically, the overlayfs model is somewhere between "passthrough"
and "proxy" when handling pure upper files, and as overlayfs evolves
it steadily moves towards the "proxy" model, with page cache and
writeback being the largest remaining piece to convert.

So I concede that as long as overlayfs writeback is mostly passthrough,
syncfs might as well be passthrough to upper fs as well.

Thanks,
Amir.
Vivek Goyal Jan. 5, 2021, 4:26 p.m. UTC | #34
On Tue, Jan 05, 2021 at 09:11:23AM +0200, Amir Goldstein wrote:
> > >
> > > What I would rather see is:
> > > - Non-volatile: first syncfs in every container gets an error (nice to have)
> >
> > I am not sure why we are making this behavior per container. This should
> > be no different from the current semantics we have for syncfs() on a regular
> > filesystem. And that will provide what you are looking for. If you
> > want a single error to be reported in all overlay mounts, then make
> > sure you have one fd open in each mount after mount, then call syncfs()
> > on that fd.
> >
> 
> Ok.
> 
> > Not sure why overlayfs behavior/semantics should be any different
> > than what regular filesystems like ext4/xfs are offering. Once we
> > get page cache sharing sorted out with xfs reflink, then people
> > will not even need overlayfs and be able to launch containers
> > just using xfs reflink and share a base image. In that case also
> > they will need to keep an fd open per container they want to
> > see an error in.
> >
> > So my patches provide exactly that. syncfs() behavior is the same with
> > overlayfs as an application gets on other filesystems. And to me
> > it's important to keep the behavior the same.
> >
> > > - Volatile: every syncfs and every fsync in every container gets an error
> > >   (important IMO)
> >
> > For volatile mounts, I agree that we need to fail the overlayfs instance
> > as soon as the first error is detected since mount. And this applies
> > not only to syncfs()/fsync() but to read/write and other operations too.
> >
> > For that we will need additional patches, which are floating around,
> > to keep an errseq sample in overlay and check for errors in all
> > paths (syncfs/fsync/read/write/...) and fail the fs.
> 
> > But these patches build on top of my patches.
> 
> Here we disagree.
> 
> I don't see how Jeff's patch is "building on top of your patches"
> seeing that it is perfectly well contained and does not in fact depend
> on your patches.

Jeff's patches solve the problem only for volatile mounts, and they
propagate the error to the overlayfs sb.

My patches solve the issue both for volatile as well as
non-volatile mounts, and solve it using the same method, so there is no
confusion.

So there are multiple pieces to this puzzle and IMHO, it probably
should be fixed in this order.

A. First fix the syncfs() path to return errors both for volatile
   as well as non-volatile mounts.

B. And then add patches to fail the filesystem for volatile mounts as soon
   as the first error is detected (either in the syncfs path or in other
   paths like read/write/...). This will probably require saving an errseq
   in ovl_fs, then comparing with upper_sb in critical paths and failing
   the filesystem as soon as an error is detected.

C. Finally fix the issues related to mount/remount error detection which
   Sargun wants to fix. This will be largely solved by B except for
   saving errseq on disk.

My patches should fix the first problem. And more patches can be
applied on top to fix issue B and issue C.

Now if we agree with this, in this context I see that fixing problems
B and C is building on top of my patches, which fix problem A.
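
For B, I am thinking of something along these lines (an untested
sketch; ofs->volatile_errseq is a field we would have to add, it is
not in this series):

/* at mount time, e.g. in ovl_fill_super():
 *	ofs->volatile_errseq = errseq_sample(&upper_sb->s_wb_err);
 */

static int ovl_check_volatile_error(struct ovl_fs *ofs)
{
	struct super_block *upper_sb = ovl_upper_mnt(ofs)->mnt_sb;

	/* peek without consuming the error; any new error since
	 * mount means the volatile upper data may be lost */
	if (errseq_check(&upper_sb->s_wb_err, ofs->volatile_errseq))
		return -EIO;

	return 0;
}

We would then call this at the top of the syncfs/fsync/read/write
paths to fail the instance on the first error.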

> 
> And I do insist that the fix for volatile mounts syncfs/fsync error
> reporting should be applied before your patches or at the very least
> not heavily depend on them.

I still don't understand why volatile syncfs() error reporting
is more important than non-volatile syncfs(). But I will stop harping
on this point now.

My issue with Jeff's patches is that syncfs() error reporting should
be dealt with in the same way both for volatile and non-volatile mounts.
That is, compare file->f_sb_err and upper_sb->s_wb_err to figure out if
there is an error to report to user space. Currently these patches
only solve the problem for volatile mounts and use propagation to the
overlay sb, which conflicts with non-volatile mounts.

IIUC, your primary concern with volatile mounts is that you want to
detect as soon as a writeback error happens, and flag it to the container
manager so that the container manager can stop the container, throw away
the upper layer and restart from scratch. If yes, what you want can
be solved by solving problem B and backporting it to the LTS kernel.
I think the patches for that will be well contained within overlayfs
(and no VFS changes) and should be relatively easy to backport.

IOW, backportability to the LTS kernel should not be a concern/blocker
for my patch series, which fixes the syncfs() issue for overlayfs.

Thanks
Vivek

> 
> Volatile mounts were introduced in the fresh new v5.10, which is also an
> LTS kernel. It would be inconsiderate toward volatile mount users and
> developers to make backporting that fix to v5.10.y any harder than it
> should be.

> 
> > My patches don't solve this problem of failing the overlay mount for
> > the volatile mount case.
> >
> 
> Here we agree.
> 
> > >
> > > This is why I prefer to sample upper sb error on mount and propagate
> > > new errors to overlayfs sb (Jeff's patch).
> >
> > Ok, I think this is one of the key points of the whole discussion. What
> > mechanism should be used to propagate writeback errors through overlayfs?
> >
> > A. Propagate errors from upper sb to overlay sb.
> > B. Leave overlay sb alone and use upper sb for error checks.
> >
> > We don't have a good model to propagate errors between super blocks,
> > so Jeff preferred not to do error propagation between super blocks
> > for regular mounts.
> >
> > https://lore.kernel.org/linux-fsdevel/bff90dfee3a3392d67a4f3516ab28989e87fa25f.camel@kernel.org/
> >
> > If we are not defining new semantics for syncfs() for overlayfs, then
> > I can't see the advantage of coming up with a new mechanism to
> > propagate errors to the overlay sb. Approach B should work just fine and
> > provide the syncfs() semantics we want for overlayfs (same semantics
> > as other filesystems).
> >
> 
> Ok. I am on board with B.
> 
> Philosophically, the overlayfs model is somewhere between "passthrough"
> and "proxy" when handling pure upper files, and as overlayfs evolves
> it steadily moves towards the "proxy" model, with page cache and
> writeback being the largest remaining piece to convert.
> 
> So I concede that as long as overlayfs writeback is mostly passthrough,
> syncfs might as well be passthrough to upper fs as well.
> 
> Thanks,
> Amir.
>
Amir Goldstein Jan. 5, 2021, 4:57 p.m. UTC | #35
On Tue, Jan 5, 2021 at 6:26 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Jan 05, 2021 at 09:11:23AM +0200, Amir Goldstein wrote:
> > > >
> > > > What I would rather see is:
> > > > - Non-volatile: first syncfs in every container gets an error (nice to have)
> > >
> > > I am not sure why we are making this behavior per container. This should
> > > be no different from the current semantics we have for syncfs() on a regular
> > > filesystem. And that will provide what you are looking for. If you
> > > want a single error to be reported in all overlay mounts, then make
> > > sure you have one fd open in each mount after mount, then call syncfs()
> > > on that fd.
> > >
> >
> > Ok.
> >
> > > Not sure why overlayfs behavior/semantics should be any different
> > > than what regular filesystems like ext4/xfs are offering. Once we
> > > get page cache sharing sorted out with xfs reflink, then people
> > > will not even need overlayfs and be able to launch containers
> > > just using xfs reflink and share a base image. In that case also
> > > they will need to keep an fd open per container they want to
> > > see an error in.
> > >
> > > So my patches provide exactly that. syncfs() behavior is the same with
> > > overlayfs as an application gets on other filesystems. And to me
> > > it's important to keep the behavior the same.
> > >
> > > > - Volatile: every syncfs and every fsync in every container gets an error
> > > >   (important IMO)
> > >
> > > For volatile mounts, I agree that we need to fail the overlayfs instance
> > > as soon as the first error is detected since mount. And this applies
> > > not only to syncfs()/fsync() but to read/write and other operations too.
> > >
> > > For that we will need additional patches, which are floating around,
> > > to keep an errseq sample in overlay and check for errors in all
> > > paths (syncfs/fsync/read/write/...) and fail the fs.
> >
> > > But these patches build on top of my patches.
> >
> > Here we disagree.
> >
> > I don't see how Jeff's patch is "building on top of your patches"
> > seeing that it is perfectly well contained and does not in fact depend
> > on your patches.
>
> Jeff's patches solve the problem only for volatile mounts, and they
> propagate the error to the overlayfs sb.
>
> My patches solve the issue both for volatile as well as
> non-volatile mounts, and solve it using the same method, so there is no
> confusion.
>
> So there are multiple pieces to this puzzle and IMHO, it probably
> should be fixed in this order.
>
> A. First fix the syncfs() path to return errors both for volatile
>    as well as non-volatile mounts.
>
> B. And then add patches to fail the filesystem for volatile mounts as soon
>    as the first error is detected (either in the syncfs path or in other
>    paths like read/write/...). This will probably require saving an errseq
>    in ovl_fs, then comparing with upper_sb in critical paths and failing
>    the filesystem as soon as an error is detected.
>
> C. Finally fix the issues related to mount/remount error detection which
>    Sargun wants to fix. This will be largely solved by B except for
>    saving errseq on disk.
>
> My patches should fix the first problem. And more patches can be
> applied on top to fix issue B and issue C.
>
> Now if we agree with this, in this context I see that fixing problems
> B and C is building on top of my patches, which fix problem A.
>

That order is fine by me.

> >
> > And I do insist that the fix for volatile mounts syncfs/fsync error
> > reporting should be applied before your patches or at the very least
> > not heavily depend on them.
>
> I still don't understand why volatile syncfs() error reporting
> is more important than non-volatile syncfs(). But I will stop harping
> on this point now.
>
> My issue with Jeff's patches is that syncfs() error reporting should
> be dealt with in the same way both for volatile and non-volatile mounts.
> That is, compare file->f_sb_err and upper_sb->s_wb_err to figure out if
> there is an error to report to user space. Currently these patches
> only solve the problem for volatile mounts and use propagation to the
> overlay sb, which conflicts with non-volatile mounts.
>
> IIUC, your primary concern with volatile mounts is that you want to
> detect as soon as a writeback error happens, and flag it to the container
> manager so that the container manager can stop the container, throw away
> the upper layer and restart from scratch. If yes, what you want can
> be solved by solving problem B and backporting it to the LTS kernel.
> I think the patches for that will be well contained within overlayfs
> (and no VFS changes) and should be relatively easy to backport.
>
> IOW, backportability to the LTS kernel should not be a concern/blocker
> for my patch series, which fixes the syncfs() issue for overlayfs.
>

That's all I wanted to know.

Thanks,
Amir.
diff mbox series

Patch

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index efccb7c1f9bc..7b58a44dcb71 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -163,6 +163,7 @@  static int ovl_open(struct inode *inode, struct file *file)
 		return PTR_ERR(realfile);
 
 	file->private_data = realfile;
+	ovl_init_file_errseq(file);
 
 	return 0;
 }
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index f8880aa2ba0e..47838abbfb3d 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -322,6 +322,7 @@  int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct dentry *dentry);
 bool ovl_is_metacopy_dentry(struct dentry *dentry);
 char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
 			     int padding);
+void ovl_init_file_errseq(struct file *file);
 
 static inline bool ovl_is_impuredir(struct super_block *sb,
 				    struct dentry *dentry)
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 01620ebae1bd..0c48f1545483 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -960,6 +960,7 @@  static int ovl_dir_open(struct inode *inode, struct file *file)
 	od->is_real = ovl_dir_is_real(file->f_path.dentry);
 	od->is_upper = OVL_TYPE_UPPER(type);
 	file->private_data = od;
+	ovl_init_file_errseq(file);
 
 	return 0;
 }
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..d99867983722 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -390,6 +390,28 @@  static int ovl_remount(struct super_block *sb, int *flags, char *data)
 	return ret;
 }
 
+static int ovl_errseq_check_advance(struct super_block *sb, struct file *file)
+{
+	struct ovl_fs *ofs = sb->s_fs_info;
+	struct super_block *upper_sb;
+	int ret;
+
+	if (!ovl_upper_mnt(ofs))
+		return 0;
+
+	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
+
+	if (!errseq_check(&upper_sb->s_wb_err, file->f_sb_err))
+		return 0;
+
+	/* Something changed, must use slow path */
+	spin_lock(&file->f_lock);
+	ret = errseq_check_and_advance(&upper_sb->s_wb_err, &file->f_sb_err);
+	spin_unlock(&file->f_lock);
+
+	return ret;
+}
+
 static const struct super_operations ovl_super_operations = {
 	.alloc_inode	= ovl_alloc_inode,
 	.free_inode	= ovl_free_inode,
@@ -400,6 +422,7 @@  static const struct super_operations ovl_super_operations = {
 	.statfs		= ovl_statfs,
 	.show_options	= ovl_show_options,
 	.remount_fs	= ovl_remount,
+	.errseq_check_advance	= ovl_errseq_check_advance,
 };
 
 enum {
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 23f475627d07..a1742847f3a8 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -950,3 +950,16 @@  char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
 	kfree(buf);
 	return ERR_PTR(res);
 }
+
+void ovl_init_file_errseq(struct file *file)
+{
+	struct super_block *sb = file_dentry(file)->d_sb;
+	struct ovl_fs *ofs = sb->s_fs_info;
+	struct super_block *upper_sb;
+
+	if (!ovl_upper_mnt(ofs))
+		return;
+
+	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
+	file->f_sb_err = errseq_sample(&upper_sb->s_wb_err);
+}