Message ID | 20200921144353.31319-5-rgoldwyn@suse.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | BTRFS DIO inode locking/D_SYNC fix | expand |
On 21/09/2020 16:44, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > iomap complete routine can deadlock with btrfs_fallocate because of the > call to generic_write_sync(). > > P0 P1 > inode_lock() fallocate(FALLOC_FL_ZERO_RANGE) > __iomap_dio_rw() inode_lock() > <block> > <submits IO> > <completes IO> > inode_unlock() > <gets inode_lock()> > inode_dio_wait() > iomap_dio_complete() > generic_write_sync() > btrfs_file_fsync() > inode_lock() > <deadlock> > > inode_dio_end() is used to notify the end of DIO data in order > to synchronize with truncate. Call inode_dio_end() before calling > generic_write_sync(), so filesystems can lock i_rwsem during a sync. > > --- > fs/iomap/direct-io.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Missing S-o-b as well.
On Mon, Sep 21, 2020 at 09:43:42AM -0500, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > iomap complete routine can deadlock with btrfs_fallocate because of the > call to generic_write_sync(). > > P0 P1 > inode_lock() fallocate(FALLOC_FL_ZERO_RANGE) > __iomap_dio_rw() inode_lock() > <block> > <submits IO> > <completes IO> > inode_unlock() > <gets inode_lock()> > inode_dio_wait() > iomap_dio_complete() > generic_write_sync() > btrfs_file_fsync() > inode_lock() > <deadlock> > > inode_dio_end() is used to notify the end of DIO data in order > to synchronize with truncate. Call inode_dio_end() before calling > generic_write_sync(), so filesystems can lock i_rwsem during a sync. Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> You might want to mention in the commit log that this matches the sequence in the old fs/direct-io.c implementation.
On 9/21/20 10:43 AM, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > iomap complete routine can deadlock with btrfs_fallocate because of the > call to generic_write_sync(). > > P0 P1 > inode_lock() fallocate(FALLOC_FL_ZERO_RANGE) > __iomap_dio_rw() inode_lock() > <block> > <submits IO> > <completes IO> > inode_unlock() > <gets inode_lock()> > inode_dio_wait() > iomap_dio_complete() > generic_write_sync() > btrfs_file_fsync() > inode_lock() > <deadlock> > > inode_dio_end() is used to notify the end of DIO data in order > to synchronize with truncate. Call inode_dio_end() before calling > generic_write_sync(), so filesystems can lock i_rwsem during a sync. > > --- > fs/iomap/direct-io.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > index d970c6bbbe11..e01f81e7b76f 100644 > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -118,6 +118,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio) > dio_warn_stale_pagecache(iocb->ki_filp); > } > > + inode_dio_end(file_inode(iocb->ki_filp)); > /* > * If this is a DSYNC write, make sure we push it to stable storage now > * that we've written data. > @@ -125,7 +126,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio) > if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC)) > ret = generic_write_sync(iocb, ret); > > - inode_dio_end(file_inode(iocb->ki_filp)); > kfree(dio); > > return ret; > Did you verify that xfs or ext4 don't rely on the inode_dio_end() happening before the generic_write_sync()? I wouldn't expect that they would, but we've already run into problems making those kind of assumptions. If it's fine you can add Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Tue, Sep 22, 2020 at 10:20:11AM -0400, Josef Bacik wrote: > On 9/21/20 10:43 AM, Goldwyn Rodrigues wrote: > > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > > > iomap complete routine can deadlock with btrfs_fallocate because of the > > call to generic_write_sync(). > > > > P0 P1 > > inode_lock() fallocate(FALLOC_FL_ZERO_RANGE) > > __iomap_dio_rw() inode_lock() > > <block> > > <submits IO> > > <completes IO> > > inode_unlock() > > <gets inode_lock()> > > inode_dio_wait() > > iomap_dio_complete() > > generic_write_sync() > > btrfs_file_fsync() > > inode_lock() > > <deadlock> > > > > inode_dio_end() is used to notify the end of DIO data in order > > to synchronize with truncate. Call inode_dio_end() before calling > > generic_write_sync(), so filesystems can lock i_rwsem during a sync. > > > > --- > > fs/iomap/direct-io.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > > index d970c6bbbe11..e01f81e7b76f 100644 > > --- a/fs/iomap/direct-io.c > > +++ b/fs/iomap/direct-io.c > > @@ -118,6 +118,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio) > > dio_warn_stale_pagecache(iocb->ki_filp); > > } > > + inode_dio_end(file_inode(iocb->ki_filp)); > > /* > > * If this is a DSYNC write, make sure we push it to stable storage now > > * that we've written data. > > @@ -125,7 +126,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio) > > if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC)) > > ret = generic_write_sync(iocb, ret); > > - inode_dio_end(file_inode(iocb->ki_filp)); > > kfree(dio); > > return ret; > > > > Did you verify that xfs or ext4 don't rely on the inode_dio_end() happening > before the generic_write_sync()? I wouldn't expect that they would, but > we've already run into problems making those kind of assumptions. If it's > fine you can add I was gonna ask the same question, but as there's no SoB on this patch I hadn't really looked at it yet. ;) Operations that rely on inode_dio_wait to have blocked until all the directios are complete could get tripped up by iomap not having done the generic_write_sync to stabilise the metadata, but I /think/ most operations that do that also themselves flush the file. But I don't really know if there's a subtlety there if the inode_dio_wait thread manages to grab the ILOCK before the generic_write_sync thread does. --D > Reviewed-by: Josef Bacik <josef@toxicpanda.com> > > Thanks, > > Josef
On 9:31 22/09, Darrick J. Wong wrote: > On Tue, Sep 22, 2020 at 10:20:11AM -0400, Josef Bacik wrote: > > On 9/21/20 10:43 AM, Goldwyn Rodrigues wrote: > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > > > > > iomap complete routine can deadlock with btrfs_fallocate because of the > > > call to generic_write_sync(). > > > > > > P0 P1 > > > inode_lock() fallocate(FALLOC_FL_ZERO_RANGE) > > > __iomap_dio_rw() inode_lock() > > > <block> > > > <submits IO> > > > <completes IO> > > > inode_unlock() > > > <gets inode_lock()> > > > inode_dio_wait() > > > iomap_dio_complete() > > > generic_write_sync() > > > btrfs_file_fsync() > > > inode_lock() > > > <deadlock> > > > > > > inode_dio_end() is used to notify the end of DIO data in order > > > to synchronize with truncate. Call inode_dio_end() before calling > > > generic_write_sync(), so filesystems can lock i_rwsem during a sync. > > > > > > --- > > > fs/iomap/direct-io.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > > > index d970c6bbbe11..e01f81e7b76f 100644 > > > --- a/fs/iomap/direct-io.c > > > +++ b/fs/iomap/direct-io.c > > > @@ -118,6 +118,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio) > > > dio_warn_stale_pagecache(iocb->ki_filp); > > > } > > > + inode_dio_end(file_inode(iocb->ki_filp)); > > > /* > > > * If this is a DSYNC write, make sure we push it to stable storage now > > > * that we've written data. > > > @@ -125,7 +126,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio) > > > if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC)) > > > ret = generic_write_sync(iocb, ret); > > > - inode_dio_end(file_inode(iocb->ki_filp)); > > > kfree(dio); > > > return ret; > > > > > > > Did you verify that xfs or ext4 don't rely on the inode_dio_end() happening > > before the generic_write_sync()? I wouldn't expect that they would, but > > we've already run into problems making those kind of assumptions. If it's > > fine you can add > > I was gonna ask the same question, but as there's no SoB on this patch I > hadn't really looked at it yet. ;) > > Operations that rely on inode_dio_wait to have blocked until all the > directios are complete could get tripped up by iomap not having done the > generic_write_sync to stabilise the metadata, but I /think/ most > operations that do that also themselves flush the file. But I don't > really know if there's a subtlety there if the inode_dio_wait thread > manages to grab the ILOCK before the generic_write_sync thread does. > I ran xfstests and it was successful. I am testing ext4 now.
On Tue, Sep 22, 2020 at 09:31:56AM -0700, Darrick J. Wong wrote: > On Tue, Sep 22, 2020 at 10:20:11AM -0400, Josef Bacik wrote: > > On 9/21/20 10:43 AM, Goldwyn Rodrigues wrote: > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > > > > > iomap complete routine can deadlock with btrfs_fallocate because of the > > > call to generic_write_sync(). > > > > > > P0 P1 > > > inode_lock() fallocate(FALLOC_FL_ZERO_RANGE) > > > __iomap_dio_rw() inode_lock() > > > <block> > > > <submits IO> > > > <completes IO> > > > inode_unlock() > > > <gets inode_lock()> > > > inode_dio_wait() > > > iomap_dio_complete() > > > generic_write_sync() > > > btrfs_file_fsync() > > > inode_lock() > > > <deadlock> > > > > > > inode_dio_end() is used to notify the end of DIO data in order > > > to synchronize with truncate. Call inode_dio_end() before calling > > > generic_write_sync(), so filesystems can lock i_rwsem during a sync. > > > > > > --- > > > fs/iomap/direct-io.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > > > index d970c6bbbe11..e01f81e7b76f 100644 > > > --- a/fs/iomap/direct-io.c > > > +++ b/fs/iomap/direct-io.c > > > @@ -118,6 +118,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio) > > > dio_warn_stale_pagecache(iocb->ki_filp); > > > } > > > + inode_dio_end(file_inode(iocb->ki_filp)); > > > /* > > > * If this is a DSYNC write, make sure we push it to stable storage now > > > * that we've written data. > > > @@ -125,7 +126,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio) > > > if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC)) > > > ret = generic_write_sync(iocb, ret); > > > - inode_dio_end(file_inode(iocb->ki_filp)); > > > kfree(dio); > > > return ret; > > > > > > > Did you verify that xfs or ext4 don't rely on the inode_dio_end() happening > > before the generic_write_sync()? I wouldn't expect that they would, but > > we've already run into problems making those kind of assumptions. If it's > > fine you can add > > I was gonna ask the same question, but as there's no SoB on this patch I > hadn't really looked at it yet. ;) > > Operations that rely on inode_dio_wait to have blocked until all the > directios are complete could get tripped up by iomap not having done the > generic_write_sync to stabilise the metadata, but I /think/ most > operations that do that also themselves flush the file. But I don't > really know if there's a subtlety there if the inode_dio_wait thread > manages to grab the ILOCK before the generic_write_sync thread does. I did point out in the previous thread that this actually means that inode_dio_wait() now has inconsistent wait semantics for O_DSYNC writes. If it's a pure overwrite and we hit the FUA path, the O_DSYNC write will be complete and guaranteed to be on stable storage before the IO completes. If the inode is metadata dirty, then the IO will now be signalled complete *before* the data and metadata are flushed to stable storage. Hence, from the perspective of writes to *stable* storage, this makes the ordering of O_DSYNC DIO against anything waiting for it to complete to be potentially inconsistent at the stable storage level. That's an extremely subtle change of behaviour, and something that would be largely impossible to test or reproduce. And, really, I don't like having this sort of "oh, it should be fine" handwavy justification when we are talking about data integrity operations... Cheers, Dave.
On Wed, Sep 23, 2020 at 07:49:34AM +1000, Dave Chinner wrote: > I did point out in the previous thread that this actually means that > inode_dio_wait() now has inconsistent wait semantics for O_DSYNC > writes. If it's a pure overwrite and we hit the FUA path, the > O_DSYNC write will be complete and guaranteed to be on stable storage > before the IO completes. If the inode is metadata dirty, then the IO > will now be signalled complete *before* the data and metadata are > flushed to stable storage. > > Hence, from the perspective of writes to *stable* storage, this > makes the ordering of O_DSYNC DIO against anything waiting for it to > complete to be potentially inconsistent at the stable storage level. > > That's an extremely subtle change of behaviour, and something that > would be largely impossible to test or reproduce. And, really, I > don't like having this sort of "oh, it should be fine" handwavy > justification when we are talking about data integrity operations... ... and I replied with a detailed analysis of what it is fine, and how this just restores the behavior we historically had before switching to the iomap direct I/O code. Although if we want to go into the fine details we did not have the REQ_FUA path back then, but that does not change the analysis.
On Wed, Sep 23, 2020 at 07:16:58AM +0200, Christoph Hellwig wrote: > On Wed, Sep 23, 2020 at 07:49:34AM +1000, Dave Chinner wrote: > > I did point out in the previous thread that this actually means that > > inode_dio_wait() now has inconsistent wait semantics for O_DSYNC > > writes. If it's a pure overwrite and we hit the FUA path, the > > O_DSYNC write will be complete and guaranteed to be on stable storage > > before the IO completes. If the inode is metadata dirty, then the IO > > will now be signalled complete *before* the data and metadata are > > flushed to stable storage. > > > > Hence, from the perspective of writes to *stable* storage, this > > makes the ordering of O_DSYNC DIO against anything waiting for it to > > complete to be potentially inconsistent at the stable storage level. > > > > That's an extremely subtle change of behaviour, and something that > > would be largely impossible to test or reproduce. And, really, I > > don't like having this sort of "oh, it should be fine" handwavy > > justification when we are talking about data integrity operations... > > ... and I replied with a detailed analysis of what it is fine, and > how this just restores the behavior we historically had before > switching to the iomap direct I/O code. Although if we want to go > into the fine details we did not have the REQ_FUA path back then, > but that does not change the analysis. You did? Got a link? Not sure if vger/oraclemail are still delaying messages for me.... :/ --D
On Tue, Sep 22, 2020 at 10:31:49PM -0700, Darrick J. Wong wrote: > > ... and I replied with a detailed analysis of what it is fine, and > > how this just restores the behavior we historically had before > > switching to the iomap direct I/O code. Although if we want to go > > into the fine details we did not have the REQ_FUA path back then, > > but that does not change the analysis. > > You did? Got a link? Not sure if vger/oraclemail are still delaying > messages for me.... :/ Two replies from September 17 to the "Re: [RFC PATCH] btrfs: don't call btrfs_sync_file from iomap context" thread. Msg IDs: 20200917055232.GA31646@lst.de and 20200917064238.GA32441@lst.de
On Wed, Sep 23, 2020 at 07:49:25AM +0200, Christoph Hellwig wrote: > On Tue, Sep 22, 2020 at 10:31:49PM -0700, Darrick J. Wong wrote: > > > ... and I replied with a detailed analysis of what it is fine, and > > > how this just restores the behavior we historically had before > > > switching to the iomap direct I/O code. Although if we want to go > > > into the fine details we did not have the REQ_FUA path back then, > > > but that does not change the analysis. > > > > You did? Got a link? Not sure if vger/oraclemail are still delaying > > messages for me.... :/ > > Two replies from September 17 to the > "Re: [RFC PATCH] btrfs: don't call btrfs_sync_file from iomap context" > > thread. > > Msg IDs: > > 20200917055232.GA31646@lst.de > > and > > 20200917064238.GA32441@lst.de <sigh> That last one is not in my local archive - vger has been on the blink lately, so I guess I'm not really that surprised that mail has gone missing and not just delayed for a day or two.... Cheers, Dave.
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c index d970c6bbbe11..e01f81e7b76f 100644 --- a/fs/iomap/direct-io.c +++ b/fs/iomap/direct-io.c @@ -118,6 +118,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio) dio_warn_stale_pagecache(iocb->ki_filp); } + inode_dio_end(file_inode(iocb->ki_filp)); /* * If this is a DSYNC write, make sure we push it to stable storage now * that we've written data. @@ -125,7 +126,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio) if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC)) ret = generic_write_sync(iocb, ret); - inode_dio_end(file_inode(iocb->ki_filp)); kfree(dio); return ret;
From: Goldwyn Rodrigues <rgoldwyn@suse.com> iomap complete routine can deadlock with btrfs_fallocate because of the call to generic_write_sync(). P0 P1 inode_lock() fallocate(FALLOC_FL_ZERO_RANGE) __iomap_dio_rw() inode_lock() <block> <submits IO> <completes IO> inode_unlock() <gets inode_lock()> inode_dio_wait() iomap_dio_complete() generic_write_sync() btrfs_file_fsync() inode_lock() <deadlock> inode_dio_end() is used to notify the end of DIO data in order to synchronize with truncate. Call inode_dio_end() before calling generic_write_sync(), so filesystems can lock i_rwsem during a sync. --- fs/iomap/direct-io.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)