Message ID | 1488724854.2925.6.camel@redhat.com (mailing list archive)
---|---
State | New, archived
On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote:
> On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote:
> > I recently did some work to wire up -ENOSPC handling in ceph, and found
> > I could get back -EIO errors in some cases when I should have instead
> > gotten -ENOSPC. The problem was that the ceph writeback code would set
> > PG_error on a writeback error, and that error would clobber the mapping
> > error.
>
> I should also note that relying on PG_error to report writeback errors
> is inherently unreliable as well. If someone calls sync() before your
> fsync gets in there, then you'll likely lose it anyway.
>
> filemap_fdatawait_keep_errors will preserve the error in the mapping,
> but not the individual PG_error flags, so I think we do want to ensure
> that the mapping error is set when there is a writeback error, and not
> rely on the PG_error bit for that.
>
> > While I fixed that problem by simply not setting that bit on errors,
> > that led me down a rabbit hole of looking at how PG_error is being
> > handled in the kernel.
> >
> > This patch series is a few fixes for things that I noticed by
> > inspection. I don't have a great way to test these since they involve
> > error handling. I can certainly doctor up a kernel to inject errors
> > in this code and test by hand, however, if these look plausible up front.
> >
> > Jeff Layton (3):
> >   nilfs2: set the mapping error when calling SetPageError on writeback
> >   mm: don't TestClearPageError in __filemap_fdatawait_range
> >   mm: set mapping error when launder_pages fails
> >
> >  fs/nilfs2/segment.c |  1 +
> >  mm/filemap.c        | 19 ++++---------------
> >  mm/truncate.c       |  6 +++++-
> >  3 files changed, 10 insertions(+), 16 deletions(-)
>
> (cc'ing Ross...)
>
> Just when I thought that only NILFS2 needed a little work here, I see
> another spot...
>
> I think we also need to fix dax_writeback_mapping_range to set a
> mapping error on writeback failure as well. It looks like that's not
> happening today. Something like the patch below (obviously untested).
>
> I'll also plan to follow up with a patch to vfs.txt to outline how
> writeback errors should be handled by filesystems, assuming that this
> patchset isn't completely off base.
>
> -------------------8<-----------------------
>
> [PATCH] dax: set error in mapping when writeback fails
>
> In order to get proper error codes from fsync, we must set an error in
> the mapping range when writeback fails.
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/dax.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index c45598b912e1..9005d90deeda 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>
>  			ret = dax_writeback_one(bdev, mapping, indices[i],
>  					pvec.pages[i]);
> -			if (ret < 0)
> +			if (ret < 0) {
> +				mapping_set_error(mapping, ret);
>  				return ret;
> +			}

(Adding Jan)

I tested this a bit, and for the DAX case at least I don't think this does
what you want. The current code already returns -EIO if dax_writeback_one()
hits an error, which bubbles up through the call stack and makes the fsync()
call in userspace fail with EIO, as we want. With both ext4 and xfs, this
patch (applied to v4.10) makes it so that we fail the current fsync() due to
the return value of -EIO, then we fail the next fsync() as well, because only
then do we actually process the AS_EIO flag inside of filemap_check_errors().

I think maybe the missing piece is that our normal DAX fsync call stack
doesn't include a call to filemap_check_errors() if we return -EIO. Here's
our stack in xfs:

 dax_writeback_mapping_range+0x32/0x70
 xfs_vm_writepages+0x8c/0xf0
 do_writepages+0x21/0x30
 __filemap_fdatawrite_range+0xc6/0x100
 filemap_write_and_wait_range+0x44/0x90
 xfs_file_fsync+0x7a/0x2c0
 vfs_fsync_range+0x4b/0xb0
 ? trace_hardirqs_on_caller+0xf5/0x1b0
 do_fsync+0x3d/0x70
 SyS_fsync+0x10/0x20
 entry_SYSCALL_64_fastpath+0x1f/0xc2

On the subsequent fsync() call we *do* end up calling filemap_check_errors()
via filemap_fdatawait_range(), which tests & clears the AS_EIO flag in the
mapping:

 filemap_fdatawait_range+0x3b/0x80
 filemap_write_and_wait_range+0x5a/0x90
 xfs_file_fsync+0x7a/0x2c0
 vfs_fsync_range+0x4b/0xb0
 ? trace_hardirqs_on_caller+0xf5/0x1b0
 do_fsync+0x3d/0x70
 SyS_fsync+0x10/0x20
 entry_SYSCALL_64_fastpath+0x1f/0xc2

Was your concern just that you didn't think that fsync() was properly
returning an error when dax_writeback_one() hit an error? Or is there another
path by which we need to report the error, where it is actually important
that we set AS_EIO? If it's the latter, then I think we need to rework the
fsync call path so that we both generate and consume AS_EIO on the same
call, probably in filemap_write_and_wait_range().
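The test-and-clear semantics Ross describes can be sketched as a small user-space model. The names deliberately mirror mapping_set_error() and filemap_check_errors() in mm/filemap.c, but this is an illustrative sketch, not the kernel code:

```c
/* User-space model of the mapping error flags discussed in this thread.
 * Illustrative only: model_* names are invented, not kernel APIs. */
#include <errno.h>

#define AS_EIO    (1u << 0)
#define AS_ENOSPC (1u << 1)

struct address_space_model { unsigned flags; };

/* Record a writeback error in the mapping. Keeping -ENOSPC distinct
 * from -EIO is exactly what the ceph fix described above is about. */
void model_mapping_set_error(struct address_space_model *m, int error)
{
	if (error == -ENOSPC)
		m->flags |= AS_ENOSPC;
	else if (error)
		m->flags |= AS_EIO;
}

/* Test-and-clear semantics: the stored error is reported to exactly
 * one caller, which is why a stale AS_EIO fails the *next* fsync(). */
int model_filemap_check_errors(struct address_space_model *m)
{
	int ret = 0;

	if (m->flags & AS_ENOSPC) {
		m->flags &= ~AS_ENOSPC;
		ret = -ENOSPC;
	}
	if (m->flags & AS_EIO) {
		m->flags &= ~AS_EIO;
		ret = -EIO;
	}
	return ret;
}
```

Because the flag is cleared on first read, whichever caller runs the check first consumes the error; in the behavior Ross observed, the patched DAX path fails two consecutive fsync() calls, once via the return value and once via the stale AS_EIO.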
On Mon 06-03-17 16:08:01, Ross Zwisler wrote:
> On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote:
> > [...]
> > I think we also need to fix dax_writeback_mapping_range to set a
> > mapping error on writeback failure as well. It looks like that's not
> > happening today. Something like the patch below (obviously untested).
> > [...]
>
> (Adding Jan)
>
> I tested this a bit, and for the DAX case at least I don't think this does
> what you want. The current code already returns -EIO if dax_writeback_one()
> hits an error, which bubbles up through the call stack and makes the fsync()
> call in userspace fail with EIO, as we want. With both ext4 and xfs, this
> patch (applied to v4.10) makes it so that we fail the current fsync() due to
> the return value of -EIO, then we fail the next fsync() as well, because only
> then do we actually process the AS_EIO flag inside of filemap_check_errors().
>
> I think maybe the missing piece is that our normal DAX fsync call stack
> doesn't include a call to filemap_check_errors() if we return -EIO.
> [...]
>
> Was your concern just that you didn't think that fsync() was properly
> returning an error when dax_writeback_one() hit an error? Or is there another
> path by which we need to report the error, where it is actually important
> that we set AS_EIO? If it's the latter, then I think we need to rework the
> fsync call path so that we both generate and consume AS_EIO on the same
> call, probably in filemap_write_and_wait_range().

So I believe this is due to the special handling of EIO inside
filemap_write_and_wait(). Normally, filemap_check_errors() happens inside
filemap_fdatawait() there, however not for EIO returned from
filemap_fdatawrite(). In that case we bail out immediately. So I think
Jeff's patch is correct, but we need to change filemap_write_and_wait() to
also call filemap_check_errors() directly on EIO from filemap_fdatawrite().

On a more general note (DAX is actually fine here), I find the current
practice of clearing page dirty bits on error and reporting the error just
once problematic. It keeps the system running, but data is lost, possibly
without the error ever reaching anywhere it would be useful. We get away
with this because it is a rare event, but it seems like problematic
behavior. But this is more for the discussion at LSF.

								Honza
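Jan's bail-out explanation can be modeled in user space: if the write side returns -EIO without also consuming AS_EIO, the stale flag fails the following call too, whereas running the check on the error path reports the failure exactly once. This is a hypothetical sketch with invented model_* names, not the actual mm/filemap.c code:

```c
#include <errno.h>

static unsigned model_as_eio;	/* stands in for AS_EIO in mapping->flags */
static int model_inject_err;	/* error injected into the next writeback */

/* Write side: on failure, latch AS_EIO and return the error, as the DAX
 * patch in this thread makes dax_writeback_mapping_range() do. */
static int model_fdatawrite(void)
{
	int err = model_inject_err;

	model_inject_err = 0;
	if (err)
		model_as_eio = 1;
	return err;
}

/* Test-and-clear of AS_EIO, as filemap_check_errors() does. */
static int model_check_errors(void)
{
	int err = model_as_eio ? -EIO : 0;

	model_as_eio = 0;
	return err;
}

/* Current behavior: bail out on a write error without consuming AS_EIO,
 * so the stale flag fails the *next* call too. */
int model_write_and_wait_old(void)
{
	int err = model_fdatawrite();

	if (err)
		return err;
	return model_check_errors();	/* the fdatawait() side */
}

/* Proposed behavior: also consume AS_EIO on the error path. */
int model_write_and_wait_fixed(void)
{
	int err = model_fdatawrite();

	if (err) {
		model_check_errors();	/* swallow the duplicate report */
		return err;
	}
	return model_check_errors();
}

/* Arm a single injected writeback failure. */
void model_inject(int err)
{
	model_inject_err = err;
	model_as_eio = 0;
}
```

In this model the "old" path reports the same failure on two consecutive calls, which matches the double fsync() failure Ross measured on v4.10.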
On Tue, 2017-03-07 at 11:26 +0100, Jan Kara wrote:
> On Mon 06-03-17 16:08:01, Ross Zwisler wrote:
> > [...]
> > Was your concern just that you didn't think that fsync() was properly
> > returning an error when dax_writeback_one() hit an error? Or is there
> > another path by which we need to report the error, where it is actually
> > important that we set AS_EIO? [...]
>
> So I believe this is due to the special handling of EIO inside
> filemap_write_and_wait(). Normally, filemap_check_errors() happens inside
> filemap_fdatawait() there, however not for EIO returned from
> filemap_fdatawrite(). In that case we bail out immediately. So I think
> Jeff's patch is correct, but we need to change filemap_write_and_wait() to
> also call filemap_check_errors() directly on EIO from filemap_fdatawrite().

Yes, that makes total sense. I've got a filemap_write_and_wait patch in my
pile now that does this. I'll run what I have through an xfstests run, see
how it does, and plan to post a v2 set once I do.

> On a more general note (DAX is actually fine here), I find the current
> practice of clearing page dirty bits on error and reporting it just once
> problematic. It keeps the system running but data is lost and possibly
> without getting the error anywhere where it is useful. [...]

That really is the crux of the matter. Unfortunately, that's sort of how
the POSIX write/fsync model is designed. If we want to change that, then I
think we have to consider what a new interface for this would look like.
Maybe we can do something there with new sync_file_range flags?

I think this probably also dovetails with Kevin Wolf's proposed LSF topic,
so maybe we can discuss all of this together there.
On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> On Mon 06-03-17 16:08:01, Ross Zwisler wrote:
> > [...]
> > If it's the latter, then I think we need to rework the fsync call path
> > so that we both generate and consume AS_EIO on the same call, probably
> > in filemap_write_and_wait_range().
>
> So I believe this is due to the special handling of EIO inside
> filemap_write_and_wait(). Normally, filemap_check_errors() happens inside

s/filemap_write_and_wait/filemap_write_and_wait_range/ for this particular
case, but we definitely want to make changes that keep them consistent.

> filemap_fdatawait() there, however not for EIO returned from
> filemap_fdatawrite(). In that case we bail out immediately. So I think
> Jeff's patch is correct, but we need to change filemap_write_and_wait() to
> also call filemap_check_errors() directly on EIO from filemap_fdatawrite().

So I guess my question was: why is it important that we set AS_EIO, if the
EIO is already being reported correctly? Is it just for consistency with the
buffered fsync case, or is there currently a path where the -EIO from DAX
will be missed, and we actually need AS_EIO to be set in mapping->flags so
that we correctly report an error?

> On a more general note (DAX is actually fine here), I find the current
> practice of clearing page dirty bits on error and reporting it just once
> problematic. [...] But this is more for the discussion at LSF.
On Tue 07-03-17 08:59:16, Ross Zwisler wrote:
> On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > [...]
> > So I believe this is due to the special handling of EIO inside
> > filemap_write_and_wait(). Normally, filemap_check_errors() happens
> > inside filemap_fdatawait() there, however not for EIO returned from
> > filemap_fdatawrite(). In that case we bail out immediately. So I think
> > Jeff's patch is correct, but we need to change filemap_write_and_wait()
> > to also call filemap_check_errors() directly on EIO from
> > filemap_fdatawrite().
>
> So I guess my question was: why is it important that we set AS_EIO, if
> the EIO is already being reported correctly? Is it just for consistency
> with the buffered fsync case, or is there currently a path where the -EIO
> from DAX will be missed, and we actually need AS_EIO to be set in
> mapping->flags so that we correctly report an error?

It is just for consistency and future-proofing. E.g. if we ever decided to
do some background flushing of CPU caches, the current DAX behavior would
suddenly result in missed reports of EIO errors.

								Honza
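Jan's future-proofing point can be made concrete with a toy model: a background flusher has no caller to hand the return value to, so a writeback path that only returns -EIO (as DAX does today) loses the error, while one that also latches it in the mapping preserves it for the next fsync(). This is a user-space sketch with invented bg_* names, not kernel code:

```c
#include <errno.h>

static unsigned bg_as_eio;	/* AS_EIO stand-in in the mapping */

/* DAX today: the failure is visible only in the return value. */
int bg_writeback_return_only(void)
{
	return -EIO;
}

/* With the patch in this thread: the failure is also latched. */
int bg_writeback_set_mapping(void)
{
	bg_as_eio = 1;
	return -EIO;
}

/* fsync() side: test-and-clear the latched error. */
int bg_fsync_check(void)
{
	int err = bg_as_eio ? -EIO : 0;

	bg_as_eio = 0;
	return err;
}

/* A hypothetical background flusher has nobody to report to, so the
 * return value is dropped on the floor. */
void bg_flush(int (*writeback)(void))
{
	(void)writeback();
}
```

With the return-only variant the error vanishes entirely after a background flush; with the latching variant the next fsync() still sees it.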
On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote: > On a more general note (DAX is actually fine here), I find the current > practice of clearing page dirty bits on error and reporting it just once > problematic. It keeps the system running but data is lost and possibly > without getting the error anywhere where it is useful. We get away with > this because it is a rare event but it seems like a problematic behavior. > But this is more for the discussion at LSF. I'm actually running into this in the last day or two because some MM folks at $WORK have been trying to push hard for GFP_NOFS removal in ext4 (at least when we are holding some mutex/semaphore like i_data_sem) because otherwise it's possible for the OOM killer to be unable to kill processes because they are holding on to locks that ext4 is holding. I've done some initial investigation, and while it's not that hard to remove GFP_NOFS from certain parts of the writepages() codepath (which is where we had been are running into problems), a really, REALLY big problem is if any_filesystem->writepages() returns ENOMEM, it causes silent data loss, because the pages are marked clean, and so data written using buffered writeback goes *poof*. I confirmed this by creating a test kernel with a simple patch such that if the ext4 file system is mounted with -o debug, there was a 1 in 16 chance that ext4_writepages will immediately return with ENOMEM (and printk the inode number, so I knew which inodes had gotten the ENOMEM treatment). The result was **NOT** pretty. What I think we should strongly consider is at the very least, special case ENOMEM being returned by writepages() during background writeback, and *not* mark the pages clean, and make sure the inode stays on the dirty inode list, so we can retry the write later. This is especially important since the process that issued the write may have gone away, so there might not even be a userspace process to complain to. 
By converting certain page allocations (most notably in ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to release the i_data_sem lock and return an error. This should allow the OOM killer to do its dirty deed, and hopefully we can retry the writepages() for that inode later.

In the case of a data integrity sync issued by fsync() or umount(), we could allow ENOMEM to be returned to userspace there as well. I'm not convinced all userspace code will handle an ENOMEM correctly or sanely, but at least people will be less likely to blame file system developers. :-)

The real problem that's going on here, by the way, is that people are trying to run programs in insanely tight containers, and then when the kernel locks up, they blame the mm developers. But if there is silent data corruption, they will blame the fs developers instead. And while kernel lockups are temporary (all you have to do is let the watchdog reboot the system :-), silent data corruption is *forever*. So what we really need to do is let the OOM killer do its work, and if job owners are unhappy that their processes are getting OOM killed, maybe they will be suitably incentivized to pay for more memory in their containers....

					- Ted

P.S. Michael Hocko, apologies for not getting back to you with your GFP_NOFS removal patches. But the possibility of fs malfunctions that might lead to silent data corruption is why I'm being very cautious, and I now have rather strong confirmation that this is not just an irrational concern on my part. (This is also performance review season, FAST conference was last week, and Usenix ATC program committee reviews are due this week. So apologies for any reply latency.)
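To make the failure mode concrete, here is a small user-space model of the error injection Ted describes (the 1-in-16 ENOMEM from ext4_writepages() under -o debug). This is only an illustrative sketch, not the actual patch: the function name and the deterministic counter are made up here, and the real patch would inject inside ext4_writepages() itself, presumably with a random draw.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical model of the injection patch Ted describes: roughly
 * one call in sixteen to writepages() fails with -ENOMEM.  A counter
 * is used here (instead of randomness) so the behavior is easy to see.
 */
static unsigned int writepages_calls;

/* Stand-in for ext4_writepages() with the 1-in-16 failure injected. */
static int fake_ext4_writepages(void)
{
	if ((writepages_calls++ % 16) == 0)
		return -ENOMEM;	/* pages end up marked clean: silent loss */
	return 0;
}
```

The point of the exercise: every -ENOMEM return above corresponds to dirty data that the current code silently drops, because the caller has already marked the pages clean.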
On Wed 08-03-17 21:57:25, Ted Tso wrote:
> On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > On a more general note (DAX is actually fine here), I find the current
> > practice of clearing page dirty bits on error and reporting it just once
> > problematic. [...]
>
> [...]
>
> What I think we should strongly consider is at the very least, special
> case ENOMEM being returned by writepages() during background
> writeback, and *not* mark the pages clean, and make sure the inode
> stays on the dirty inode list, so we can retry the write later.
> This is especially important since the process that issued the write may
> have gone away, so there might not even be a userspace process to
> complain to. By converting certain page allocations (most notably in
> ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> release the i_data_sem lock and return an error. This should allow
> the OOM killer to do its dirty deed, and hopefully we can retry
> the writepages() for that inode later.

Yeah, so if we can hope the error is transient, keeping pages dirty and retrying the write is definitely the better option. For a start we can say that ENOMEM, EINTR, EAGAIN, and ENOSPC errors are transient, and anything else means there's no hope of getting the data to disk, so we just discard it. It will be a somewhat rough distinction, but probably better than what we have now.

								Honza
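The rough split Jan proposes could be expressed as a predicate like the following. This is a sketch, not an existing kernel helper; the function name is invented, and whether ENOSPC belongs in the transient set is exactly what the rest of the thread debates.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Illustrative helper (not an existing kernel function): transient
 * errors would keep the pages dirty and the inode on the dirty list
 * so writeback can be retried; anything else discards the data and
 * records the error in the mapping.
 */
static bool writeback_error_is_transient(int err)
{
	switch (err) {
	case -ENOMEM:
	case -EINTR:
	case -EAGAIN:
	case -ENOSPC:	/* debatable; see Jeff's follow-up */
		return true;
	default:
		return false;	/* e.g. -EIO: treat as a hard failure */
	}
}
```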
On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > [...]
> >
> > I confirmed this by creating a test kernel with a simple patch such
> > that if the ext4 file system is mounted with -o debug, there was a 1
> > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > (and printk the inode number, so I knew which inodes had gotten the
> > ENOMEM treatment). The result was **NOT** pretty.
> > What I think we should strongly consider is at the very least, special
> > case ENOMEM being returned by writepages() during background
> > writeback, and *not* mark the pages clean, and make sure the inode
> > stays on the dirty inode list, so we can retry the write later. [...]
>
> Yeah, so if we can hope the error is transient, keeping pages dirty and
> retrying the write is definitely better option. For start we can say that
> ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> there's no hope of getting data to disk and so we just discard them. It
> will be somewhat rough distinction but probably better than what we have
> now.
>
> Honza

I'm not sure about ENOSPC there. That's a return code that is specifically expected to be returned by fsync. It seems like that ought not to be considered a transient error?
On Thu 09-03-17 05:47:51, Jeff Layton wrote:
> On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> > On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > > [...]
> > > What I think we should strongly consider is at the very least, special
> > > case ENOMEM being returned by writepages() during background
> > > writeback, and *not* mark the pages clean, and make sure the inode
> > > stays on the dirty inode list, so we can retry the write later. [...]
> >
> > Yeah, so if we can hope the error is transient, keeping pages dirty and
> > retrying the write is definitely better option. [...]
>
> I'm not sure about ENOSPC there. That's a return code that is
> specifically expected to be returned by fsync. It seems like that ought
> not be considered a transient error?

Yeah, for a start we should probably keep ENOSPC as-is to prevent surprises. Long term, we may need to make at least some ENOSPC situations behave as transient to make thin-provisioned storage not lose data in case the admin does not supply additional space fast enough (i.e., before ENOSPC is actually hit).

EIO is actually in a similar bucket, although probably more on the "hard failure" side - I can imagine there can be types of storage and situations where the loss of connectivity to the storage is only transient. But for a start I would not bother with this.

								Honza
On Thu, 2017-03-09 at 12:02 +0100, Jan Kara wrote:
> On Thu 09-03-17 05:47:51, Jeff Layton wrote:
> > [...]
> > > > What I think we should strongly consider is at the very least, special
> > > > case ENOMEM being returned by writepages() during background
> > > > writeback, and *not* mark the pages clean, and make sure the inode
> > > > stays on the dirty inode list, so we can retry the write later. [...]
> > >
> > > Yeah, so if we can hope the error is transient, keeping pages dirty and
> > > retrying the write is definitely better option. [...]
> >
> > I'm not sure about ENOSPC there. [...]
>
> Yeah, for start we should probably keep ENOSPC as is to prevent surprises.
> Long term, we may need to make at least some ENOSPC situations behave as
> transient to make thin provisioned storage not lose data in case admin
> does not supply additional space fast enough (i.e., before ENOSPC is
> actually hit).

Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a transient error? Just have it hang until we get enough space when that tunable is enabled?
> EIO is actually in a similar bucket although probably more on the "hard
> failure" side - I can imagine there can be types of storage and situations
> where the loss of connectivity to the storage is only transient. But for
> start I would not bother with this.
>
> Honza

I don't see what we can reasonably do with -EIO other than return a hard error. If we want to deal with loss of connectivity to storage as a transient failure, I think that we'd need to ensure that the lower layers return more distinct error codes in those cases (ENODEV or ENXIO maybe? Or declare a new kernel-internal code -- EDEVGONE?).

In any case, I think that the basic idea of marking certain writepage/writepages/launder_page errors as transient might be a reasonable approach to handling this sanely.

The problem with all of this, though, is that we have a pile of existing code that will likely need to be reworked for the new error handling. I expect that we'll have to walk all of the writepage/writepages/launder_page implementations and fix them up one by one once we sort out the rules for this.
On Thu, Mar 09, 2017 at 07:43:12AM -0500, Jeff Layton wrote:
> On Thu, 2017-03-09 at 12:02 +0100, Jan Kara wrote:
> > [...]
> >
> > Yeah, for start we should probably keep ENOSPC as is to prevent surprises.
> > Long term, we may need to make at least some ENOSPC situations behave as
> > transient to make thin provisioned storage not lose data in case admin
> > does not supply additional space fast enough (i.e., before ENOSPC is
> > actually hit).
>
> Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a
> transient error? Just have it hang until we get enough space when that
> tunable is enabled?

Just FYI, XFS has a similar error configuration mechanism that we use for essentially this purpose when dealing with metadata buffer I/O errors. The original motivation was to help us deal with the varying requirements for thin provisioning. I.e., whether a particular error is permanent or transient depends on the admin's preference, and affects whether the filesystem continuously retries failed I/Os in anticipation of future success, or gives up quickly and shuts down (or something in between).

See roughly commits 192852be8b5 through e6b3bb7896 for the initial code, /sys/fs/xfs/<dev>/error on an XFS filesystem for the user interface, and the "Error handling" section of Documentation/filesystems/xfs.txt for information on how the interface works.

Brian

> > EIO is actually in a similar bucket although probably more on the "hard
> > failure" side - I can imagine there can be types of storage and situations
> > where the loss of connectivity to the storage is only transient. But for
> > start I would not bother with this.
> >
> > Honza
>
> I don't see what we can reasonably do with -EIO other than return a hard
> error. If we want to deal with loss of connectivity to storage as a
> transient failure, I think that we'd need to ensure that the lower
> layers return more distinct error codes in those cases (ENODEV or ENXIO
> maybe? Or declare a new kernel-internal code -- EDEVGONE?).
>
> In any case, I think that the basic idea of marking certain
> writepage/writepages/launder_page errors as transient might be a
> reasonable approach to handling this sanely.
> The problem with all of this though is that we have a pile of existing
> code that will likely need to be reworked for the new error handling. I
> expect that we'll have to walk all of the
> writepage/writepages/launder_page implementations and fix them up one by
> one once we sort out the rules for this.
>
> --
> Jeff Layton <jlayton@redhat.com>
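For reference, the XFS knobs Brian mentions can be tuned per filesystem through the sysfs interface he cites. The snippet below is only a sketch: the device name is hypothetical, and the paths and value semantics follow the "Error handling" section of Documentation/filesystems/xfs.txt (-1 means retry forever, 0 means fail immediately).

```shell
# Hypothetical device name; paths follow Documentation/filesystems/xfs.txt.
dev=sdb1

# Treat metadata EIO as transient: retry failed buffer writes forever.
echo -1 > /sys/fs/xfs/$dev/error/metadata/EIO/max_retries

# Treat metadata ENOSPC as permanent: give up immediately.
echo 0 > /sys/fs/xfs/$dev/error/metadata/ENOSPC/max_retries

# Don't let unmount hang forever retrying failed I/Os.
echo 1 > /sys/fs/xfs/$dev/error/fail_at_unmount
```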
On Thu, Mar 09, 2017 at 07:43:12AM -0500, Jeff Layton wrote:
> Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a
> transient error? Just have it hang until we get enough space when that
> tunable is enabled?

Or maybe we need a new kernel-internal errno (a la ERESTARTSYS) which means it's a "soft ENOSPC"? It would get translated to ENOSPC if it gets propagated to userspace, but that way devices like dm-thin or other storage arrays with thin volumes could send back a soft ENOSPC, while for file systems where "ENOSPC means ENOSPC", we can treat those as a hard ENOSPC.

					- Ted
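A minimal sketch of that "soft ENOSPC" idea might look like the following. Everything here is hypothetical: the ENOSPC_SOFT name and value are invented for the sketch (real kernel-internal errnos like ERESTARTSYS live above 512), and the translation function is just an illustration of where the internal code would be folded back to plain ENOSPC.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical kernel-internal "soft ENOSPC": thin-provisioned devices
 * could return it so writeback treats the error as transient, while
 * userspace only ever sees plain ENOSPC.  Value invented for this sketch.
 */
#define ENOSPC_SOFT	530

/* Translate internal codes at the boundary to userspace. */
static int errno_to_user(int err)
{
	if (err == -ENOSPC_SOFT)
		return -ENOSPC;	/* userspace never sees the internal code */
	return err;
}
```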
diff --git a/fs/dax.c b/fs/dax.c
index c45598b912e1..9005d90deeda 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 			ret = dax_writeback_one(bdev, mapping, indices[i],
 					pvec.pages[i]);
-			if (ret < 0)
+			if (ret < 0) {
+				mapping_set_error(mapping, ret);
 				return ret;
+			}
 		}
 	}
 	return 0;