Message ID | 1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > dax_writeback_mapping_range() needs a struct block_device, and it used to > get that from inode->i_sb->s_bdev. This is correct for normal inodes > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time files. > > Instead, call dax_writeback_mapping_range() directly from the filesystem > ->writepages function so that it can supply us with a valid block > device. This also fixes DAX code to properly flush caches in response to > sync(2). > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > Signed-off-by: Jan Kara <jack@suse.cz> > --- > fs/block_dev.c | 16 +++++++++++++++- > fs/dax.c | 13 ++++++++----- > fs/ext2/inode.c | 11 +++++++++++ > fs/ext4/inode.c | 7 +++++++ > fs/xfs/xfs_aops.c | 9 +++++++++ > include/linux/dax.h | 6 ++++-- > mm/filemap.c | 12 ++++-------- > 7 files changed, 58 insertions(+), 16 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 39b3a17..fc01e43 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > return try_to_free_buffers(page); > } > > +static int blkdev_writepages(struct address_space *mapping, > + struct writeback_control *wbc) > +{ > + if (dax_mapping(mapping)) { > + struct block_device *bdev = I_BDEV(mapping->host); > + int error; > + > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > + if (error) > + return error; > + } > + return generic_writepages(mapping, wbc); > +} Can you remind of the reason for calling generic_writepages() on DAX enabled address spaces? Cheers, Dave.
On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time files. > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem > > ->writepages function so that it can supply us with a valid block > > device. This also fixes DAX code to properly flush caches in response to > > sync(2). > > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > > Signed-off-by: Jan Kara <jack@suse.cz> > > --- > > fs/block_dev.c | 16 +++++++++++++++- > > fs/dax.c | 13 ++++++++----- > > fs/ext2/inode.c | 11 +++++++++++ > > fs/ext4/inode.c | 7 +++++++ > > fs/xfs/xfs_aops.c | 9 +++++++++ > > include/linux/dax.h | 6 ++++-- > > mm/filemap.c | 12 ++++-------- > > 7 files changed, 58 insertions(+), 16 deletions(-) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index 39b3a17..fc01e43 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > > return try_to_free_buffers(page); > > } > > > > +static int blkdev_writepages(struct address_space *mapping, > > + struct writeback_control *wbc) > > +{ > > + if (dax_mapping(mapping)) { > > + struct block_device *bdev = I_BDEV(mapping->host); > > + int error; > > + > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > > + if (error) > > + return error; > > + } > > + return generic_writepages(mapping, wbc); > > +} > > Can you remind of the reason for calling generic_writepages() on DAX > enabled address spaces? Sure. The initial version of this patch didn't do this, and during testing I hit a bunch of xfstests failures. In ext2 at least I believe these were happening because we were skipping the call into generic_writepages() for DAX inodes. Without a lot of data to back this up, my guess is that this is due to metadata inodes or something being marked as DAX (so dax_mapping(mapping) returns true), but having dirty page cache pages that need to be written back as part of the writeback. Changing this so we always call generic_writepages() even in the DAX case solved the xfstest failures. If this sounds incorrect, please let me know and I'll go and gather more data. - Ross
On Wed, Feb 10, 2016 at 03:43:40PM -0700, Ross Zwisler wrote: > On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > > block devices and for XFS real-time files. > > > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem > > > ->writepages function so that it can supply us with a valid block > > > device. This also fixes DAX code to properly flush caches in response to > > > sync(2). > > > > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > > > Signed-off-by: Jan Kara <jack@suse.cz> > > > --- > > > fs/block_dev.c | 16 +++++++++++++++- > > > fs/dax.c | 13 ++++++++----- > > > fs/ext2/inode.c | 11 +++++++++++ > > > fs/ext4/inode.c | 7 +++++++ > > > fs/xfs/xfs_aops.c | 9 +++++++++ > > > include/linux/dax.h | 6 ++++-- > > > mm/filemap.c | 12 ++++-------- > > > 7 files changed, 58 insertions(+), 16 deletions(-) > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > index 39b3a17..fc01e43 100644 > > > --- a/fs/block_dev.c > > > +++ b/fs/block_dev.c > > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > > > return try_to_free_buffers(page); > > > } > > > > > > +static int blkdev_writepages(struct address_space *mapping, > > > + struct writeback_control *wbc) > > > +{ > > > + if (dax_mapping(mapping)) { > > > + struct block_device *bdev = I_BDEV(mapping->host); > > > + int error; > > > + > > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > > > + if (error) > > > + return error; > > > + } > > > + return generic_writepages(mapping, wbc); > > > +} > > > > Can you remind of the reason for calling generic_writepages() on DAX > > enabled address spaces? > > Sure. The initial version of this patch didn't do this, and during testing I > hit a bunch of xfstests failures. In ext2 at least I believe these were > happening because we were skipping the call into generic_writepages() for DAX > inodes. Without a lot of data to back this up, my guess is that this is due > to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > returns true), but having dirty page cache pages that need to be written back > as part of the writeback. Hmmm - the ext2 filesystem metadata uses the block device page cache to buffer inode writeback, and so writeback doesn't occur until sync_blockdev() is called. But the data access should be through the ext2 inode address space, not the block device address space, so DAX flushing occurs in ext2_writepages. So how is the block device inode being marked as a DAX inode? If it is being marked as a DAX inode, how is this valid when the filesystem metadata uses bufferheads and requires struct pages to be found in the block device mapping tree? e.g. mkfs writes the metadata into the bdev via DAX, resulting in an DAX exceptional entry in the bdev radix tree, then __bread_gfp() comes along to read the same metadata after mount and expects to find pages in the blockdev radix tree? FWIW, this seems to be specifically a block device inode issue, though, not something that affects regular files in a filesystem. i.e. filesystem inodes can only be either DAX or non-DAX, and so there is no mixed mode flushing required, right? > Changing this so we always call generic_writepages() even in the > DAX case solved the xfstest failures. > > If this sounds incorrect, please let me know and I'll go and > gather more data. It seems to me that there's a problem here with DAX on block device inodes, but not for the filesystem mappings. At minimum, the block device needs a bloody big comment explaining this landmine so people don't forget why it is a special snowflake... Cheers, Dave.
On Wed 10-02-16 15:43:40, Ross Zwisler wrote: > On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > > block devices and for XFS real-time files. > > > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem > > > ->writepages function so that it can supply us with a valid block > > > device. This also fixes DAX code to properly flush caches in response to > > > sync(2). > > > > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > > > Signed-off-by: Jan Kara <jack@suse.cz> > > > --- > > > fs/block_dev.c | 16 +++++++++++++++- > > > fs/dax.c | 13 ++++++++----- > > > fs/ext2/inode.c | 11 +++++++++++ > > > fs/ext4/inode.c | 7 +++++++ > > > fs/xfs/xfs_aops.c | 9 +++++++++ > > > include/linux/dax.h | 6 ++++-- > > > mm/filemap.c | 12 ++++-------- > > > 7 files changed, 58 insertions(+), 16 deletions(-) > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > index 39b3a17..fc01e43 100644 > > > --- a/fs/block_dev.c > > > +++ b/fs/block_dev.c > > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > > > return try_to_free_buffers(page); > > > } > > > > > > +static int blkdev_writepages(struct address_space *mapping, > > > + struct writeback_control *wbc) > > > +{ > > > + if (dax_mapping(mapping)) { > > > + struct block_device *bdev = I_BDEV(mapping->host); > > > + int error; > > > + > > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > > > + if (error) > > > + return error; > > > + } > > > + return generic_writepages(mapping, wbc); > > > +} > > > > Can you remind of the reason for calling generic_writepages() on DAX > > enabled address spaces? > > Sure. The initial version of this patch didn't do this, and during testing I > hit a bunch of xfstests failures. In ext2 at least I believe these were > happening because we were skipping the call into generic_writepages() for DAX > inodes. Without a lot of data to back this up, my guess is that this is due > to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > returns true), but having dirty page cache pages that need to be written back > as part of the writeback. > > Changing this so we always call generic_writepages() even in the DAX case > solved the xfstest failures. > > If this sounds incorrect, please let me know and I'll go and gather more data. So I think a more correct fix it to not set S_DAX for inodes that will have any pagecache pages - e.g. don't set S_DAX for block device inodes when filesystem is mounted on it (probably the easiest is to just refuse to mount filesystem on block device which has S_DAX set). Honza
On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara <jack@suse.cz> wrote: > On Wed 10-02-16 15:43:40, Ross Zwisler wrote: >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw >> > > block devices and for XFS real-time files. >> > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem >> > > ->writepages function so that it can supply us with a valid block >> > > device. This also fixes DAX code to properly flush caches in response to >> > > sync(2). >> > > >> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> >> > > Signed-off-by: Jan Kara <jack@suse.cz> >> > > --- >> > > fs/block_dev.c | 16 +++++++++++++++- >> > > fs/dax.c | 13 ++++++++----- >> > > fs/ext2/inode.c | 11 +++++++++++ >> > > fs/ext4/inode.c | 7 +++++++ >> > > fs/xfs/xfs_aops.c | 9 +++++++++ >> > > include/linux/dax.h | 6 ++++-- >> > > mm/filemap.c | 12 ++++-------- >> > > 7 files changed, 58 insertions(+), 16 deletions(-) >> > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c >> > > index 39b3a17..fc01e43 100644 >> > > --- a/fs/block_dev.c >> > > +++ b/fs/block_dev.c >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) >> > > return try_to_free_buffers(page); >> > > } >> > > >> > > +static int blkdev_writepages(struct address_space *mapping, >> > > + struct writeback_control *wbc) >> > > +{ >> > > + if (dax_mapping(mapping)) { >> > > + struct block_device *bdev = I_BDEV(mapping->host); >> > > + int error; >> > > + >> > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); >> > > + if (error) >> > > + return error; >> > > + } >> > > + return generic_writepages(mapping, wbc); >> > > +} >> > >> > Can you remind of the reason for calling generic_writepages() on DAX >> > enabled address spaces? >> >> Sure. The initial version of this patch didn't do this, and during testing I >> hit a bunch of xfstests failures. In ext2 at least I believe these were >> happening because we were skipping the call into generic_writepages() for DAX >> inodes. Without a lot of data to back this up, my guess is that this is due >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping) >> returns true), but having dirty page cache pages that need to be written back >> as part of the writeback. >> >> Changing this so we always call generic_writepages() even in the DAX case >> solved the xfstest failures. >> >> If this sounds incorrect, please let me know and I'll go and gather more data. > > So I think a more correct fix it to not set S_DAX for inodes that will have > any pagecache pages - e.g. don't set S_DAX for block device inodes when > filesystem is mounted on it (probably the easiest is to just refuse to > mount filesystem on block device which has S_DAX set). I think we have a wider problem here. See __blkdev_get, we set S_DAX on all block devices that have ->direct_access() and have a page-aligned starting address. It seems to me we need to modify the metadata i/o paths to bypass the page cache, or teach the fsync code how to flush populated data pages out of the radix.
On Thu 11-02-16 07:22:00, Dan Williams wrote: > On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara <jack@suse.cz> wrote: > > On Wed 10-02-16 15:43:40, Ross Zwisler wrote: > >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to > >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > >> > > block devices and for XFS real-time files. > >> > > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem > >> > > ->writepages function so that it can supply us with a valid block > >> > > device. This also fixes DAX code to properly flush caches in response to > >> > > sync(2). > >> > > > >> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > >> > > Signed-off-by: Jan Kara <jack@suse.cz> > >> > > --- > >> > > fs/block_dev.c | 16 +++++++++++++++- > >> > > fs/dax.c | 13 ++++++++----- > >> > > fs/ext2/inode.c | 11 +++++++++++ > >> > > fs/ext4/inode.c | 7 +++++++ > >> > > fs/xfs/xfs_aops.c | 9 +++++++++ > >> > > include/linux/dax.h | 6 ++++-- > >> > > mm/filemap.c | 12 ++++-------- > >> > > 7 files changed, 58 insertions(+), 16 deletions(-) > >> > > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c > >> > > index 39b3a17..fc01e43 100644 > >> > > --- a/fs/block_dev.c > >> > > +++ b/fs/block_dev.c > >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > >> > > return try_to_free_buffers(page); > >> > > } > >> > > > >> > > +static int blkdev_writepages(struct address_space *mapping, > >> > > + struct writeback_control *wbc) > >> > > +{ > >> > > + if (dax_mapping(mapping)) { > >> > > + struct block_device *bdev = I_BDEV(mapping->host); > >> > > + int error; > >> > > + > >> > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > >> > > + if (error) > >> > > + return error; > >> > > + } > >> > > + return generic_writepages(mapping, wbc); > >> > > +} > >> > > >> > Can you remind of the reason for calling generic_writepages() on DAX > >> > enabled address spaces? > >> > >> Sure. The initial version of this patch didn't do this, and during testing I > >> hit a bunch of xfstests failures. In ext2 at least I believe these were > >> happening because we were skipping the call into generic_writepages() for DAX > >> inodes. Without a lot of data to back this up, my guess is that this is due > >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > >> returns true), but having dirty page cache pages that need to be written back > >> as part of the writeback. > >> > >> Changing this so we always call generic_writepages() even in the DAX case > >> solved the xfstest failures. > >> > >> If this sounds incorrect, please let me know and I'll go and gather more data. > > > > So I think a more correct fix it to not set S_DAX for inodes that will have > > any pagecache pages - e.g. don't set S_DAX for block device inodes when > > filesystem is mounted on it (probably the easiest is to just refuse to > > mount filesystem on block device which has S_DAX set). > > I think we have a wider problem here. See __blkdev_get, we set S_DAX > on all block devices that have ->direct_access() and have a > page-aligned starting address. It seems to me we need to modify the > metadata i/o paths to bypass the page cache Heh, no way to do that easily. All the journalling machinery depends on buffers and pages... >, or teach the fsync code > how to flush populated data pages out of the radix. This might be doable but it will be difficult to avoid aliasing issues and data corruption. And mainly I don't see the point: When you mount a filesystem on top of block device, you do not want to mess with the block device directly, even less using DAX. So we just have to find a way how to set S_DAX for normal open but clear it from fs path. At worst, we could clear S_DAX on the block device in mount_bdev() or something like that... Honza
On Thu, Feb 11, 2016 at 07:22:00AM -0800, Dan Williams wrote: > On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara <jack@suse.cz> wrote: > > On Wed 10-02-16 15:43:40, Ross Zwisler wrote: > >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to > >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > >> > > block devices and for XFS real-time files. > >> > > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem > >> > > ->writepages function so that it can supply us with a valid block > >> > > device. This also fixes DAX code to properly flush caches in response to > >> > > sync(2). > >> > > > >> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > >> > > Signed-off-by: Jan Kara <jack@suse.cz> > >> > > --- > >> > > fs/block_dev.c | 16 +++++++++++++++- > >> > > fs/dax.c | 13 ++++++++----- > >> > > fs/ext2/inode.c | 11 +++++++++++ > >> > > fs/ext4/inode.c | 7 +++++++ > >> > > fs/xfs/xfs_aops.c | 9 +++++++++ > >> > > include/linux/dax.h | 6 ++++-- > >> > > mm/filemap.c | 12 ++++-------- > >> > > 7 files changed, 58 insertions(+), 16 deletions(-) > >> > > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c > >> > > index 39b3a17..fc01e43 100644 > >> > > --- a/fs/block_dev.c > >> > > +++ b/fs/block_dev.c > >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > >> > > return try_to_free_buffers(page); > >> > > } > >> > > > >> > > +static int blkdev_writepages(struct address_space *mapping, > >> > > + struct writeback_control *wbc) > >> > > +{ > >> > > + if (dax_mapping(mapping)) { > >> > > + struct block_device *bdev = I_BDEV(mapping->host); > >> > > + int error; > >> > > + > >> > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > >> > > + if (error) > >> > > + return error; > >> > > + } > >> > > + return generic_writepages(mapping, wbc); > >> > > +} > >> > > >> > Can you remind of the reason for calling generic_writepages() on DAX > >> > enabled address spaces? > >> > >> Sure. The initial version of this patch didn't do this, and during testing I > >> hit a bunch of xfstests failures. In ext2 at least I believe these were > >> happening because we were skipping the call into generic_writepages() for DAX > >> inodes. Without a lot of data to back this up, my guess is that this is due > >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > >> returns true), but having dirty page cache pages that need to be written back > >> as part of the writeback. > >> > >> Changing this so we always call generic_writepages() even in the DAX case > >> solved the xfstest failures. > >> > >> If this sounds incorrect, please let me know and I'll go and gather more data. > > > > So I think a more correct fix it to not set S_DAX for inodes that will have > > any pagecache pages - e.g. don't set S_DAX for block device inodes when > > filesystem is mounted on it (probably the easiest is to just refuse to > > mount filesystem on block device which has S_DAX set). > > I think we have a wider problem here. See __blkdev_get, we set S_DAX > on all block devices that have ->direct_access() and have a > page-aligned starting address. That's seeming like a premature optimisation to me now. I didn't say anything at the time because I was busy with other things and it didn't affect XFS. > It seems to me we need to modify the > metadata i/o paths to bypass the page cache, XFS doesn't use the block device page cache for it's metadata - it has it's own internal metadata cache structures and uses get_pages or heap memory to back it's metadata. But that doesn't make mixing DAX and pages in the block device mapping tree sane. What you are missing here is that the underlying architecture of journalling filesystems mean they can't use DAX for their metadata. Modifications have to be buffered, because they have to be written to the journal first before they are written back in place. IOWs, we need to buffer changes in volatile memory for some time, and that means we can't use DAX during transactional modifications. And to put the final nail in that coffin, metadata in XFS can be discontiguous multi-block objects - in those situations we vmap the underlying pages so they appear to the code to be a contiguous buffer, and that's something we can't do with DAX.... > or teach the fsync code > how to flush populated data pages out of the radix. That doesn't solve the problem. Filesystems free and reallocate filesystem blocks without intermediate block device mapping invalidation calls, so what is one minute a data block accessed by DAX may become a metadata block that accessed via buffered IO. It all goes to crap very quickly.... However, I'd say fsync is not the place to address this. This block device cache aliasing issue is supposed to be what unmap_underlying_metadata() solves, right? Cheers, Dave.
On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@fromorbit.com> wrote: [..] >> It seems to me we need to modify the >> metadata i/o paths to bypass the page cache, > > XFS doesn't use the block device page cache for it's metadata - it > has it's own internal metadata cache structures and uses get_pages > or heap memory to back it's metadata. But that doesn't make mixing > DAX and pages in the block device mapping tree sane. > > What you are missing here is that the underlying architecture of > journalling filesystems mean they can't use DAX for their metadata. > Modifications have to be buffered, because they have to be written > to the journal first before they are written back in place. IOWs, we > need to buffer changes in volatile memory for some time, and that > means we can't use DAX during transactional modifications. > > And to put the final nail in that coffin, metadata in XFS can be > discontiguous multi-block objects - in those situations we vmap the > underlying pages so they appear to the code to be a contiguous > buffer, and that's something we can't do with DAX.... Sorry, I wasn't clear when I said "bypass page cache" I meant a solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition table reads". However, I suspect that is broken if the filesystem is not ready to see a new page allocated for every I/O. I assume one thread will want to insert a page in the radix for another thread to find/manipulate before metadata gets written back to storage. >> or teach the fsync code >> how to flush populated data pages out of the radix. > > That doesn't solve the problem. Filesystems free and reallocate > filesystem blocks without intermediate block device mapping > invalidation calls, so what is one minute a data block accessed by > DAX may become a metadata block that accessed via buffered IO. It > all goes to crap very quickly.... > > However, I'd say fsync is not the place to address this. This block > device cache aliasing issue is supposed to be what > unmap_underlying_metadata() solves, right? I'll take a look at this. Right now I'm trying to implement the "clear block-device-inode S_DAX on fs mount" approach. My concern though is that we need to disable block device mmap while a filesystem is mounted... Maybe I don't need to worry because it's already the case that a mmap of the raw device may not see the most up to date data for a file that has dirty fs-page-cache data.
On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote: > On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@fromorbit.com> wrote: > [..] > >> It seems to me we need to modify the > >> metadata i/o paths to bypass the page cache, > > > > XFS doesn't use the block device page cache for it's metadata - it > > has it's own internal metadata cache structures and uses get_pages > > or heap memory to back it's metadata. But that doesn't make mixing > > DAX and pages in the block device mapping tree sane. > > > > What you are missing here is that the underlying architecture of > > journalling filesystems mean they can't use DAX for their metadata. > > Modifications have to be buffered, because they have to be written > > to the journal first before they are written back in place. IOWs, we > > need to buffer changes in volatile memory for some time, and that > > means we can't use DAX during transactional modifications. > > > > And to put the final nail in that coffin, metadata in XFS can be > > discontiguous multi-block objects - in those situations we vmap the > > underlying pages so they appear to the code to be a contiguous > > buffer, and that's something we can't do with DAX.... > > Sorry, I wasn't clear when I said "bypass page cache" I meant a > solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition > table reads". So there's already bandaids to prevent bad shit from happening in the block layer, let alone when we consider all the ways that userspace can screw this all up. > However, I suspect that is broken if the filesystem is not ready > to see a new page allocated for every I/O. I assume one > thread will want to insert a page in the radix for another thread > to find/manipulate before metadata gets written back to storage. Right, you can't do that, especially as the struct page has a 1-1 relationship with the bufferhead that is attached to it as the bufferhead carries the filesystem state for the given cached page. > >> or teach the fsync code how to flush populated data pages out > >> of the radix. > > > > That doesn't solve the problem. Filesystems free and reallocate > > filesystem blocks without intermediate block device mapping > > invalidation calls, so what is one minute a data block accessed > > by DAX may become a metadata block that accessed via buffered > > IO. It all goes to crap very quickly.... > > > > However, I'd say fsync is not the place to address this. This > > block device cache aliasing issue is supposed to be what > > unmap_underlying_metadata() solves, right? > > I'll take a look at this. Right now I'm trying to implement the > "clear block-device-inode S_DAX on fs mount" approach. My concern > though is that we need to disable block device mmap while a > filesystem is mounted... /me chokes on his coffee. When did mmaping the block device behind the back of a mounted fileystem become a valid use case? It's not supported for normal block devices and for the same reasons it won't be supported for DAX enabled block devices, either. i.e. I'm going to tell anyone who has an application that does this to go and take a hike when (not if!) they report filesystem corruption problems. > Maybe I don't need to worry because it's already the case that a > mmap of the raw device may not see the most up to date data for a > file that has dirty fs-page-cache data. It goes both ways. What happens if mkfs or fsck modifies the block device via mmap+DAX and then the filesystem mounts the block device and tries to read that metadata via the block device page cache? Quite frankly, DAX on the block device is a can of worms we really don't need to deal with right now. IMO it's a solution looking for a problem to solve, the "default to on" policy is wrong (DAX is opt-in, not opt-out) and given this we should turn it off until we've solved the more important problems we need to solve. i.e. We need to concentrate on getting data integrity working correctly first, then address the cache aliasing issues, then address the "safe access" issues, and then we can re-introduce block device DAX access... Cheers, Dave.
On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner <david@fromorbit.com> wrote: > On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote: >> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@fromorbit.com> wrote: >> [..] >> >> It seems to me we need to modify the >> >> metadata i/o paths to bypass the page cache, >> > >> > XFS doesn't use the block device page cache for it's metadata - it >> > has it's own internal metadata cache structures and uses get_pages >> > or heap memory to back it's metadata. But that doesn't make mixing >> > DAX and pages in the block device mapping tree sane. >> > >> > What you are missing here is that the underlying architecture of >> > journalling filesystems mean they can't use DAX for their metadata. >> > Modifications have to be buffered, because they have to be written >> > to the journal first before they are written back in place. IOWs, we >> > need to buffer changes in volatile memory for some time, and that >> > means we can't use DAX during transactional modifications. >> > >> > And to put the final nail in that coffin, metadata in XFS can be >> > discontiguous multi-block objects - in those situations we vmap the >> > underlying pages so they appear to the code to be a contiguous >> > buffer, and that's something we can't do with DAX.... >> >> Sorry, I wasn't clear when I said "bypass page cache" I meant a >> solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition >> table reads". > > So there's already bandaids to prevent bad shit from happening in > the block layer, let alone when we consider all the ways that > userspace can screw this all up. > >> However, I suspect that is broken if the filesystem is not ready >> to see a new page allocated for every I/O. I assume one >> thread will want to insert a page in the radix for another thread >> to find/manipulate before metadata gets written back to storage. > > Right, you can't do that, especially as the struct page has a 1-1 > relationship with the bufferhead that is attached to it as the > bufferhead carries the filesystem state for the given cached page. > >> >> or teach the fsync code how to flush populated data pages out >> >> of the radix. >> > >> > That doesn't solve the problem. Filesystems free and reallocate >> > filesystem blocks without intermediate block device mapping >> > invalidation calls, so what is one minute a data block accessed >> > by DAX may become a metadata block that accessed via buffered >> > IO. It all goes to crap very quickly.... >> > >> > However, I'd say fsync is not the place to address this. This >> > block device cache aliasing issue is supposed to be what >> > unmap_underlying_metadata() solves, right? >> >> I'll take a look at this. Right now I'm trying to implement the >> "clear block-device-inode S_DAX on fs mount" approach. My concern >> though is that we need to disable block device mmap while a >> filesystem is mounted... > > /me chokes on his coffee. > > When did mmaping the block device behind the back of a mounted > fileystem become a valid use case? It's not supported for normal > block devices and for the same reasons it won't be supported for DAX > enabled block devices, either. i.e. I'm going to tell anyone who has > an application that does this to go and take a hike when (not if!) > they report filesystem corruption problems. Right, but we need to not confuse the fsync code regardless of how bad of an idea this is ::-). >> Maybe I don't need to worry because it's already the case that a >> mmap of the raw device may not see the most up to date data for a >> file that has dirty fs-page-cache data. > > It goes both ways. What happens if mkfs or fsck modifies the > block device via mmap+DAX and then the filesystem mounts the block > device and tries to read that metadata via the block device page > cache? > > Quite frankly, DAX on the block device is a can of worms we really > don't need to deal with right now. IMO it's a solution looking for a > problem to solve, Virtualization use cases want to give large ranges to guest-VMs, and it is currently the only way to reliably get 1GiB mappings. > the "default to on" policy is wrong (DAX is > opt-in, not opt-out) and given this we should turn it off until > we've solved the more important problems we need to solve. i.e. We > need to concentrate on getting data integrity working correctly > first, then address the cache aliasing issues, then address the > "safe access" issues, and then we can re-introduce block device DAX > access... Agreed. Note that the "default-on policy" came from commit bbab37ddc20b "block: Add support for DAX reads/writes to block devices" way back in 4.2. We're just now noticing. Credit Ross for good sanity checking.
On Thu, Feb 11, 2016 at 02:59:14PM -0800, Dan Williams wrote: > On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote: > >> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@fromorbit.com> wrote: > >> Maybe I don't need to worry because it's already the case that a > >> mmap of the raw device may not see the most up to date data for a > >> file that has dirty fs-page-cache data. > > > > It goes both ways. What happens if mkfs or fsck modifies the > > block device via mmap+DAX and then the filesystem mounts the block > > device and tries to read that metadata via the block device page > > cache? > > > > Quite frankly, DAX on the block device is a can of worms we really > > don't need to deal with right now. IMO it's a solution looking for a > > problem to solve, > > Virtualization use cases want to give large ranges to guest-VMs, and > it is currently the only way to reliably get 1GiB mappings. Precisely my point - block devices are not the best way to solve this problem. A file, on XFS, with a 1GB extent size hint and preallocated to be aligned to 1GB addresses (i.e. mkfs.xfs -d su=1G,sw=1 on the host filesystem) will give reliable 1GB aligned blocks for DAX mappings, just like a block device will. Peformance wise it's little different to using the block device directly. Management wise it's way more flexible, especially as such image files can be recycled for new VMs almost instantly via FALLOC_FL_FLAG_ZERO_RANGE. Cheers, Dave.
diff --git a/fs/block_dev.c b/fs/block_dev.c index 39b3a17..fc01e43 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) return try_to_free_buffers(page); } +static int blkdev_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + if (dax_mapping(mapping)) { + struct block_device *bdev = I_BDEV(mapping->host); + int error; + + error = dax_writeback_mapping_range(mapping, bdev, wbc); + if (error) + return error; + } + return generic_writepages(mapping, wbc); +} + static const struct address_space_operations def_blk_aops = { .readpage = blkdev_readpage, .readpages = blkdev_readpages, .writepage = blkdev_writepage, .write_begin = blkdev_write_begin, .write_end = blkdev_write_end, - .writepages = generic_writepages, + .writepages = blkdev_writepages, .releasepage = blkdev_releasepage, .direct_IO = blkdev_direct_IO, .is_dirty_writeback = buffer_check_dirty_writeback, diff --git a/fs/dax.c b/fs/dax.c index 9a173dd..034dd02 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev, * end]. This is required by data integrity operations to ensure file data is * on persistent storage prior to completion of the operation. */ -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end) +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, struct writeback_control *wbc) { struct inode *inode = mapping->host; - struct block_device *bdev = inode->i_sb->s_bdev; pgoff_t start_index, end_index, pmd_index; pgoff_t indices[PAGEVEC_SIZE]; struct pagevec pvec; @@ -496,11 +495,15 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, int i, ret = 0; void *entry; + + if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL) + return 0; + if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT)) return -EIO; - start_index = start >> PAGE_CACHE_SHIFT; - end_index = end >> PAGE_CACHE_SHIFT; + start_index = wbc->range_start >> PAGE_CACHE_SHIFT; + end_index = wbc->range_end >> PAGE_CACHE_SHIFT; pmd_index = DAX_PMD_INDEX(start_index); rcu_read_lock(); diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index b6b965b..7e44fc3 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -876,6 +876,17 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset) static int ext2_writepages(struct address_space *mapping, struct writeback_control *wbc) { +#ifdef CONFIG_FS_DAX + if (dax_mapping(mapping)) { + int error; + + error = dax_writeback_mapping_range(mapping, + mapping->host->i_sb->s_bdev, wbc); + if (error) + return error; + } +#endif + return mpage_writepages(mapping, wbc, ext2_get_block); } diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 83bc8bf..8c42020 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2450,6 +2450,13 @@ static int ext4_writepages(struct address_space *mapping, trace_ext4_writepages(inode, wbc); + if (dax_mapping(mapping)) { + ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, + wbc); + if (ret) + goto out_writepages; + } + /* * No pages to write? This is mainly a kludge to avoid starting * a transaction for special inodes like journal inode on last iput() diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index fc20518..1139ecd 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -1208,6 +1208,15 @@ xfs_vm_writepages( struct writeback_control *wbc) { xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED); + if (dax_mapping(mapping)) { + int error; + + error = dax_writeback_mapping_range(mapping, + xfs_find_bdev_for_inode(mapping->host), wbc); + if (error) + return error; + } + return generic_writepages(mapping, wbc); } diff --git a/include/linux/dax.h b/include/linux/dax.h index 7b6bced..636dd59 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -52,6 +52,8 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end); + +struct writeback_control; +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, struct writeback_control *wbc); #endif diff --git a/mm/filemap.c b/mm/filemap.c index bc94386..a829779 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -446,7 +446,8 @@ int filemap_write_and_wait(struct address_space *mapping) { int err = 0; - if (mapping->nrpages) { + if (mapping->nrpages || + (dax_mapping(mapping) && mapping->nrexceptional)) { err = filemap_fdatawrite(mapping); /* * Even if the above returned error, the pages may be @@ -482,13 +483,8 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; - if (dax_mapping(mapping) && mapping->nrexceptional) { - err = dax_writeback_mapping_range(mapping, lstart, lend); - if (err) - return err; - } - - if (mapping->nrpages) { + if (mapping->nrpages || + (dax_mapping(mapping) && mapping->nrexceptional)) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); /* See comment of filemap_write_and_wait() */