[v2,2/2] dax: move writeback calls into the filesystems

Message ID	1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-nvdimm-bounces@lists.01.org> From: Ross Zwisler <ross.zwisler@linux.intel.com> To: linux-kernel@vger.kernel.org Subject: [PATCH v2 2/2] dax: move writeback calls into the filesystems Date: Wed, 10 Feb 2016 13:48:56 -0700 Message-Id: <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu>, Andrew Morton <akpm@linux-foundation.org>, linux-nvdimm@lists.01.org, Jan Kara <jack@suse.cz>, Dave Chinner <david@fromorbit.com>, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger <adilger.kernel@dilger.ca>, Alexander Viro <viro@zeniv.linux.org.uk>, Jan Kara <jack@suse.com>, linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: linux-nvdimm-bounces@lists.01.org Sender: "Linux-nvdimm" <linux-nvdimm-bounces@lists.01.org>

Ross Zwisler Feb. 10, 2016, 8:48 p.m. UTC

Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
dax_writeback_mapping_range() needs a struct block_device, and it used to
get that from inode->i_sb->s_bdev.  This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device. This also fixes DAX code to properly flush caches in response to
sync(2).

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c      | 16 +++++++++++++++-
 fs/dax.c            | 13 ++++++++-----
 fs/ext2/inode.c     | 11 +++++++++++
 fs/ext4/inode.c     |  7 +++++++
 fs/xfs/xfs_aops.c   |  9 +++++++++
 include/linux/dax.h |  6 ++++--
 mm/filemap.c        | 12 ++++--------
 7 files changed, 58 insertions(+), 16 deletions(-)

Dave Chinner Feb. 10, 2016, 10:03 p.m. UTC | #1

On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
> Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> dax_writeback_mapping_range() needs a struct block_device, and it used to
> get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> block devices and for XFS real-time files.
> 
> Instead, call dax_writeback_mapping_range() directly from the filesystem
> ->writepages function so that it can supply us with a valid block
> device. This also fixes DAX code to properly flush caches in response to
> sync(2).
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/block_dev.c      | 16 +++++++++++++++-
>  fs/dax.c            | 13 ++++++++-----
>  fs/ext2/inode.c     | 11 +++++++++++
>  fs/ext4/inode.c     |  7 +++++++
>  fs/xfs/xfs_aops.c   |  9 +++++++++
>  include/linux/dax.h |  6 ++++--
>  mm/filemap.c        | 12 ++++--------
>  7 files changed, 58 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 39b3a17..fc01e43 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
>  	return try_to_free_buffers(page);
>  }
>  
> +static int blkdev_writepages(struct address_space *mapping,
> +			     struct writeback_control *wbc)
> +{
> +	if (dax_mapping(mapping)) {
> +		struct block_device *bdev = I_BDEV(mapping->host);
> +		int error;
> +
> +		error = dax_writeback_mapping_range(mapping, bdev, wbc);
> +		if (error)
> +			return error;
> +	}
> +	return generic_writepages(mapping, wbc);
> +}

Can you remind of the reason for calling generic_writepages() on DAX
enabled address spaces?

Cheers,

Dave.

Ross Zwisler Feb. 10, 2016, 10:43 p.m. UTC | #2

On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
> On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> > dax_writeback_mapping_range() needs a struct block_device, and it used to
> > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> > block devices and for XFS real-time files.
> > 
> > Instead, call dax_writeback_mapping_range() directly from the filesystem
> > ->writepages function so that it can supply us with a valid block
> > device. This also fixes DAX code to properly flush caches in response to
> > sync(2).
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/block_dev.c      | 16 +++++++++++++++-
> >  fs/dax.c            | 13 ++++++++-----
> >  fs/ext2/inode.c     | 11 +++++++++++
> >  fs/ext4/inode.c     |  7 +++++++
> >  fs/xfs/xfs_aops.c   |  9 +++++++++
> >  include/linux/dax.h |  6 ++++--
> >  mm/filemap.c        | 12 ++++--------
> >  7 files changed, 58 insertions(+), 16 deletions(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 39b3a17..fc01e43 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
> >  	return try_to_free_buffers(page);
> >  }
> >  
> > +static int blkdev_writepages(struct address_space *mapping,
> > +			     struct writeback_control *wbc)
> > +{
> > +	if (dax_mapping(mapping)) {
> > +		struct block_device *bdev = I_BDEV(mapping->host);
> > +		int error;
> > +
> > +		error = dax_writeback_mapping_range(mapping, bdev, wbc);
> > +		if (error)
> > +			return error;
> > +	}
> > +	return generic_writepages(mapping, wbc);
> > +}
> 
> Can you remind of the reason for calling generic_writepages() on DAX
> enabled address spaces?

Sure.  The initial version of this patch didn't do this, and during testing I
hit a bunch of xfstests failures.  In ext2 at least I believe these were
happening because we were skipping the call into generic_writepages() for DAX
inodes. Without a lot of data to back this up, my guess is that this is due
to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
returns true), but having dirty page cache pages that need to be written back
as part of the writeback.

Changing this so we always call generic_writepages() even in the DAX case
solved the xfstest failures. 

If this sounds incorrect, please let me know and I'll go and gather more data.

- Ross

Dave Chinner Feb. 10, 2016, 11:44 p.m. UTC | #3

On Wed, Feb 10, 2016 at 03:43:40PM -0700, Ross Zwisler wrote:
> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> > > dax_writeback_mapping_range() needs a struct block_device, and it used to
> > > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> > > block devices and for XFS real-time files.
> > > 
> > > Instead, call dax_writeback_mapping_range() directly from the filesystem
> > > ->writepages function so that it can supply us with a valid block
> > > device. This also fixes DAX code to properly flush caches in response to
> > > sync(2).
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  fs/block_dev.c      | 16 +++++++++++++++-
> > >  fs/dax.c            | 13 ++++++++-----
> > >  fs/ext2/inode.c     | 11 +++++++++++
> > >  fs/ext4/inode.c     |  7 +++++++
> > >  fs/xfs/xfs_aops.c   |  9 +++++++++
> > >  include/linux/dax.h |  6 ++++--
> > >  mm/filemap.c        | 12 ++++--------
> > >  7 files changed, 58 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 39b3a17..fc01e43 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
> > >  	return try_to_free_buffers(page);
> > >  }
> > >  
> > > +static int blkdev_writepages(struct address_space *mapping,
> > > +			     struct writeback_control *wbc)
> > > +{
> > > +	if (dax_mapping(mapping)) {
> > > +		struct block_device *bdev = I_BDEV(mapping->host);
> > > +		int error;
> > > +
> > > +		error = dax_writeback_mapping_range(mapping, bdev, wbc);
> > > +		if (error)
> > > +			return error;
> > > +	}
> > > +	return generic_writepages(mapping, wbc);
> > > +}
> > 
> > Can you remind of the reason for calling generic_writepages() on DAX
> > enabled address spaces?
> 
> Sure.  The initial version of this patch didn't do this, and during testing I
> hit a bunch of xfstests failures.  In ext2 at least I believe these were
> happening because we were skipping the call into generic_writepages() for DAX
> inodes. Without a lot of data to back this up, my guess is that this is due
> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
> returns true), but having dirty page cache pages that need to be written back
> as part of the writeback.

Hmmm - the ext2 filesystem metadata uses the block device page cache
to buffer inode writeback, and so writeback doesn't occur until
sync_blockdev() is called.

But the data access should be through the ext2 inode address space,
not the block device address space, so DAX flushing occurs in
ext2_writepages. So how is the block device inode being marked as
a DAX inode?

If it is being marked as a DAX inode, how is this valid when the
filesystem metadata uses bufferheads and requires struct pages to be
found in the block device mapping tree?  e.g. mkfs writes the
metadata into the bdev via DAX, resulting in an DAX exceptional
entry in the bdev radix tree, then __bread_gfp() comes along to read
the same metadata after mount and expects to find pages in the
blockdev radix tree?

FWIW, this seems to be specifically a block device inode issue,
though, not something that affects regular files in a filesystem.
i.e. filesystem inodes can only be either DAX or non-DAX, and so
there is no mixed mode flushing required, right?

> Changing this so we always call generic_writepages() even in the
> DAX case solved the xfstest failures. 
> 
> If this sounds incorrect, please let me know and I'll go and
> gather more data.

It seems to me that there's a problem here with DAX on block device
inodes, but not for the filesystem mappings. At minimum, the block
device needs a bloody big comment explaining this landmine so people
don't forget why it is a special snowflake...

Cheers,

Dave.

Jan Kara Feb. 11, 2016, 12:50 p.m. UTC | #4

On Wed 10-02-16 15:43:40, Ross Zwisler wrote:
> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> > > dax_writeback_mapping_range() needs a struct block_device, and it used to
> > > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> > > block devices and for XFS real-time files.
> > > 
> > > Instead, call dax_writeback_mapping_range() directly from the filesystem
> > > ->writepages function so that it can supply us with a valid block
> > > device. This also fixes DAX code to properly flush caches in response to
> > > sync(2).
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  fs/block_dev.c      | 16 +++++++++++++++-
> > >  fs/dax.c            | 13 ++++++++-----
> > >  fs/ext2/inode.c     | 11 +++++++++++
> > >  fs/ext4/inode.c     |  7 +++++++
> > >  fs/xfs/xfs_aops.c   |  9 +++++++++
> > >  include/linux/dax.h |  6 ++++--
> > >  mm/filemap.c        | 12 ++++--------
> > >  7 files changed, 58 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 39b3a17..fc01e43 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
> > >  	return try_to_free_buffers(page);
> > >  }
> > >  
> > > +static int blkdev_writepages(struct address_space *mapping,
> > > +			     struct writeback_control *wbc)
> > > +{
> > > +	if (dax_mapping(mapping)) {
> > > +		struct block_device *bdev = I_BDEV(mapping->host);
> > > +		int error;
> > > +
> > > +		error = dax_writeback_mapping_range(mapping, bdev, wbc);
> > > +		if (error)
> > > +			return error;
> > > +	}
> > > +	return generic_writepages(mapping, wbc);
> > > +}
> > 
> > Can you remind of the reason for calling generic_writepages() on DAX
> > enabled address spaces?
> 
> Sure.  The initial version of this patch didn't do this, and during testing I
> hit a bunch of xfstests failures.  In ext2 at least I believe these were
> happening because we were skipping the call into generic_writepages() for DAX
> inodes. Without a lot of data to back this up, my guess is that this is due
> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
> returns true), but having dirty page cache pages that need to be written back
> as part of the writeback.
> 
> Changing this so we always call generic_writepages() even in the DAX case
> solved the xfstest failures. 
> 
> If this sounds incorrect, please let me know and I'll go and gather more data.

So I think a more correct fix it to not set S_DAX for inodes that will have
any pagecache pages - e.g. don't set S_DAX for block device inodes when
filesystem is mounted on it (probably the easiest is to just refuse to
mount filesystem on block device which has S_DAX set).

								Honza

Dan Williams Feb. 11, 2016, 3:22 p.m. UTC | #5

On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 10-02-16 15:43:40, Ross Zwisler wrote:
>> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
>> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
>> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
>> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
>> > > dax_writeback_mapping_range() needs a struct block_device, and it used to
>> > > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
>> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
>> > > block devices and for XFS real-time files.
>> > >
>> > > Instead, call dax_writeback_mapping_range() directly from the filesystem
>> > > ->writepages function so that it can supply us with a valid block
>> > > device. This also fixes DAX code to properly flush caches in response to
>> > > sync(2).
>> > >
>> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> > > Signed-off-by: Jan Kara <jack@suse.cz>
>> > > ---
>> > >  fs/block_dev.c      | 16 +++++++++++++++-
>> > >  fs/dax.c            | 13 ++++++++-----
>> > >  fs/ext2/inode.c     | 11 +++++++++++
>> > >  fs/ext4/inode.c     |  7 +++++++
>> > >  fs/xfs/xfs_aops.c   |  9 +++++++++
>> > >  include/linux/dax.h |  6 ++++--
>> > >  mm/filemap.c        | 12 ++++--------
>> > >  7 files changed, 58 insertions(+), 16 deletions(-)
>> > >
>> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
>> > > index 39b3a17..fc01e43 100644
>> > > --- a/fs/block_dev.c
>> > > +++ b/fs/block_dev.c
>> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
>> > >   return try_to_free_buffers(page);
>> > >  }
>> > >
>> > > +static int blkdev_writepages(struct address_space *mapping,
>> > > +                      struct writeback_control *wbc)
>> > > +{
>> > > + if (dax_mapping(mapping)) {
>> > > +         struct block_device *bdev = I_BDEV(mapping->host);
>> > > +         int error;
>> > > +
>> > > +         error = dax_writeback_mapping_range(mapping, bdev, wbc);
>> > > +         if (error)
>> > > +                 return error;
>> > > + }
>> > > + return generic_writepages(mapping, wbc);
>> > > +}
>> >
>> > Can you remind of the reason for calling generic_writepages() on DAX
>> > enabled address spaces?
>>
>> Sure.  The initial version of this patch didn't do this, and during testing I
>> hit a bunch of xfstests failures.  In ext2 at least I believe these were
>> happening because we were skipping the call into generic_writepages() for DAX
>> inodes. Without a lot of data to back this up, my guess is that this is due
>> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
>> returns true), but having dirty page cache pages that need to be written back
>> as part of the writeback.
>>
>> Changing this so we always call generic_writepages() even in the DAX case
>> solved the xfstest failures.
>>
>> If this sounds incorrect, please let me know and I'll go and gather more data.
>
> So I think a more correct fix it to not set S_DAX for inodes that will have
> any pagecache pages - e.g. don't set S_DAX for block device inodes when
> filesystem is mounted on it (probably the easiest is to just refuse to
> mount filesystem on block device which has S_DAX set).

I think we have a wider problem here.  See __blkdev_get, we set S_DAX
on all block devices that have ->direct_access() and have a
page-aligned starting address.  It seems to me we need to modify the
metadata i/o paths to bypass the page cache, or teach the fsync code
how to flush populated data pages out of the radix.

Jan Kara Feb. 11, 2016, 4:22 p.m. UTC | #6

On Thu 11-02-16 07:22:00, Dan Williams wrote:
> On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 10-02-16 15:43:40, Ross Zwisler wrote:
> >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
> >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
> >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to
> >> > > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> >> > > block devices and for XFS real-time files.
> >> > >
> >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem
> >> > > ->writepages function so that it can supply us with a valid block
> >> > > device. This also fixes DAX code to properly flush caches in response to
> >> > > sync(2).
> >> > >
> >> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> > > Signed-off-by: Jan Kara <jack@suse.cz>
> >> > > ---
> >> > >  fs/block_dev.c      | 16 +++++++++++++++-
> >> > >  fs/dax.c            | 13 ++++++++-----
> >> > >  fs/ext2/inode.c     | 11 +++++++++++
> >> > >  fs/ext4/inode.c     |  7 +++++++
> >> > >  fs/xfs/xfs_aops.c   |  9 +++++++++
> >> > >  include/linux/dax.h |  6 ++++--
> >> > >  mm/filemap.c        | 12 ++++--------
> >> > >  7 files changed, 58 insertions(+), 16 deletions(-)
> >> > >
> >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> >> > > index 39b3a17..fc01e43 100644
> >> > > --- a/fs/block_dev.c
> >> > > +++ b/fs/block_dev.c
> >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
> >> > >   return try_to_free_buffers(page);
> >> > >  }
> >> > >
> >> > > +static int blkdev_writepages(struct address_space *mapping,
> >> > > +                      struct writeback_control *wbc)
> >> > > +{
> >> > > + if (dax_mapping(mapping)) {
> >> > > +         struct block_device *bdev = I_BDEV(mapping->host);
> >> > > +         int error;
> >> > > +
> >> > > +         error = dax_writeback_mapping_range(mapping, bdev, wbc);
> >> > > +         if (error)
> >> > > +                 return error;
> >> > > + }
> >> > > + return generic_writepages(mapping, wbc);
> >> > > +}
> >> >
> >> > Can you remind of the reason for calling generic_writepages() on DAX
> >> > enabled address spaces?
> >>
> >> Sure.  The initial version of this patch didn't do this, and during testing I
> >> hit a bunch of xfstests failures.  In ext2 at least I believe these were
> >> happening because we were skipping the call into generic_writepages() for DAX
> >> inodes. Without a lot of data to back this up, my guess is that this is due
> >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
> >> returns true), but having dirty page cache pages that need to be written back
> >> as part of the writeback.
> >>
> >> Changing this so we always call generic_writepages() even in the DAX case
> >> solved the xfstest failures.
> >>
> >> If this sounds incorrect, please let me know and I'll go and gather more data.
> >
> > So I think a more correct fix it to not set S_DAX for inodes that will have
> > any pagecache pages - e.g. don't set S_DAX for block device inodes when
> > filesystem is mounted on it (probably the easiest is to just refuse to
> > mount filesystem on block device which has S_DAX set).
> 
> I think we have a wider problem here.  See __blkdev_get, we set S_DAX
> on all block devices that have ->direct_access() and have a
> page-aligned starting address.  It seems to me we need to modify the
> metadata i/o paths to bypass the page cache

Heh, no way to do that easily. All the journalling machinery depends on
buffers and pages...

>, or teach the fsync code
> how to flush populated data pages out of the radix.

This might be doable but it will be difficult to avoid aliasing issues and
data corruption. And mainly I don't see the point: When you mount a
filesystem on top of block device, you do not want to mess with the block
device directly, even less using DAX. So we just have to find a way how to
set S_DAX for normal open but clear it from fs path. At worst, we could
clear S_DAX on the block device in mount_bdev() or something like that...

								Honza

Dave Chinner Feb. 11, 2016, 8:46 p.m. UTC | #7

On Thu, Feb 11, 2016 at 07:22:00AM -0800, Dan Williams wrote:
> On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 10-02-16 15:43:40, Ross Zwisler wrote:
> >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
> >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
> >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to
> >> > > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> >> > > block devices and for XFS real-time files.
> >> > >
> >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem
> >> > > ->writepages function so that it can supply us with a valid block
> >> > > device. This also fixes DAX code to properly flush caches in response to
> >> > > sync(2).
> >> > >
> >> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> > > Signed-off-by: Jan Kara <jack@suse.cz>
> >> > > ---
> >> > >  fs/block_dev.c      | 16 +++++++++++++++-
> >> > >  fs/dax.c            | 13 ++++++++-----
> >> > >  fs/ext2/inode.c     | 11 +++++++++++
> >> > >  fs/ext4/inode.c     |  7 +++++++
> >> > >  fs/xfs/xfs_aops.c   |  9 +++++++++
> >> > >  include/linux/dax.h |  6 ++++--
> >> > >  mm/filemap.c        | 12 ++++--------
> >> > >  7 files changed, 58 insertions(+), 16 deletions(-)
> >> > >
> >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> >> > > index 39b3a17..fc01e43 100644
> >> > > --- a/fs/block_dev.c
> >> > > +++ b/fs/block_dev.c
> >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
> >> > >   return try_to_free_buffers(page);
> >> > >  }
> >> > >
> >> > > +static int blkdev_writepages(struct address_space *mapping,
> >> > > +                      struct writeback_control *wbc)
> >> > > +{
> >> > > + if (dax_mapping(mapping)) {
> >> > > +         struct block_device *bdev = I_BDEV(mapping->host);
> >> > > +         int error;
> >> > > +
> >> > > +         error = dax_writeback_mapping_range(mapping, bdev, wbc);
> >> > > +         if (error)
> >> > > +                 return error;
> >> > > + }
> >> > > + return generic_writepages(mapping, wbc);
> >> > > +}
> >> >
> >> > Can you remind of the reason for calling generic_writepages() on DAX
> >> > enabled address spaces?
> >>
> >> Sure.  The initial version of this patch didn't do this, and during testing I
> >> hit a bunch of xfstests failures.  In ext2 at least I believe these were
> >> happening because we were skipping the call into generic_writepages() for DAX
> >> inodes. Without a lot of data to back this up, my guess is that this is due
> >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
> >> returns true), but having dirty page cache pages that need to be written back
> >> as part of the writeback.
> >>
> >> Changing this so we always call generic_writepages() even in the DAX case
> >> solved the xfstest failures.
> >>
> >> If this sounds incorrect, please let me know and I'll go and gather more data.
> >
> > So I think a more correct fix it to not set S_DAX for inodes that will have
> > any pagecache pages - e.g. don't set S_DAX for block device inodes when
> > filesystem is mounted on it (probably the easiest is to just refuse to
> > mount filesystem on block device which has S_DAX set).
> 
> I think we have a wider problem here.  See __blkdev_get, we set S_DAX
> on all block devices that have ->direct_access() and have a
> page-aligned starting address.

That's seeming like a premature optimisation to me now. I didn't
say anything at the time because I was busy with other things and it
didn't affect XFS.

> It seems to me we need to modify the
> metadata i/o paths to bypass the page cache,

XFS doesn't use the block device page cache for it's metadata - it
has it's own internal metadata cache structures and uses get_pages
or heap memory to back it's metadata. But that doesn't make mixing
DAX and pages in the block device mapping tree sane.

What you are missing here is that the underlying architecture of
journalling filesystems mean they can't use DAX for their metadata.
Modifications have to be buffered, because they have to be written
to the journal first before they are written back in place. IOWs, we
need to buffer changes in volatile memory for some time, and that
means we can't use DAX during transactional modifications.

And to put the final nail in that coffin, metadata in XFS can be
discontiguous multi-block objects - in those situations we vmap the
underlying pages so they appear to the code to be a contiguous
buffer, and that's something we can't do with DAX....

> or teach the fsync code
> how to flush populated data pages out of the radix.

That doesn't solve the problem. Filesystems free and reallocate
filesystem blocks without intermediate block device mapping
invalidation calls, so what is one minute a data block accessed by
DAX may become a metadata block that accessed via buffered IO.  It
all goes to crap very quickly....

However, I'd say fsync is not the place to address this. This block
device cache aliasing issue is supposed to be what
unmap_underlying_metadata() solves, right?

Cheers,

Dave.

Dan Williams Feb. 11, 2016, 8:58 p.m. UTC | #8

On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@fromorbit.com> wrote:
[..]
>> It seems to me we need to modify the
>> metadata i/o paths to bypass the page cache,
>
> XFS doesn't use the block device page cache for it's metadata - it
> has it's own internal metadata cache structures and uses get_pages
> or heap memory to back it's metadata. But that doesn't make mixing
> DAX and pages in the block device mapping tree sane.
>
> What you are missing here is that the underlying architecture of
> journalling filesystems mean they can't use DAX for their metadata.
> Modifications have to be buffered, because they have to be written
> to the journal first before they are written back in place. IOWs, we
> need to buffer changes in volatile memory for some time, and that
> means we can't use DAX during transactional modifications.
>
> And to put the final nail in that coffin, metadata in XFS can be
> discontiguous multi-block objects - in those situations we vmap the
> underlying pages so they appear to the code to be a contiguous
> buffer, and that's something we can't do with DAX....

Sorry, I wasn't clear when I said "bypass page cache" I meant a
solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition
table reads".  However, I suspect that is broken if the filesystem is
not ready to see a new page allocated for every I/O.  I assume one
thread will want to insert a page in the radix for another thread to
find/manipulate before metadata gets written back to storage.

>> or teach the fsync code
>> how to flush populated data pages out of the radix.
>
> That doesn't solve the problem. Filesystems free and reallocate
> filesystem blocks without intermediate block device mapping
> invalidation calls, so what is one minute a data block accessed by
> DAX may become a metadata block that accessed via buffered IO.  It
> all goes to crap very quickly....
>
> However, I'd say fsync is not the place to address this. This block
> device cache aliasing issue is supposed to be what
> unmap_underlying_metadata() solves, right?

I'll take a look at this.  Right now I'm trying to implement the
"clear block-device-inode S_DAX on fs mount" approach.  My concern
though is that  we need to disable block device mmap while a
filesystem is mounted...

Maybe I don't need to worry because it's already the case that a mmap
of the raw device may not see the most up to date data for a file that
has dirty fs-page-cache data.

Dave Chinner Feb. 11, 2016, 10:46 p.m. UTC | #9

On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote:
> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> [..]
> >> It seems to me we need to modify the
> >> metadata i/o paths to bypass the page cache,
> >
> > XFS doesn't use the block device page cache for it's metadata - it
> > has it's own internal metadata cache structures and uses get_pages
> > or heap memory to back it's metadata. But that doesn't make mixing
> > DAX and pages in the block device mapping tree sane.
> >
> > What you are missing here is that the underlying architecture of
> > journalling filesystems mean they can't use DAX for their metadata.
> > Modifications have to be buffered, because they have to be written
> > to the journal first before they are written back in place. IOWs, we
> > need to buffer changes in volatile memory for some time, and that
> > means we can't use DAX during transactional modifications.
> >
> > And to put the final nail in that coffin, metadata in XFS can be
> > discontiguous multi-block objects - in those situations we vmap the
> > underlying pages so they appear to the code to be a contiguous
> > buffer, and that's something we can't do with DAX....
> 
> Sorry, I wasn't clear when I said "bypass page cache" I meant a
> solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition
> table reads".

So there's already bandaids to prevent bad shit from happening in
the block layer, let alone when we consider all the ways that
userspace can screw this all up.

> However, I suspect that is broken if the filesystem is not ready
> to see a new page allocated for every I/O.  I assume one
> thread will want to insert a page in the radix for another thread
> to find/manipulate before metadata gets written back to storage.

Right, you can't do that, especially as the struct page has a 1-1
relationship with the bufferhead that is attached to it as the
bufferhead carries the filesystem state for the given cached page.

> >> or teach the fsync code how to flush populated data pages out
> >> of the radix.
> >
> > That doesn't solve the problem. Filesystems free and reallocate
> > filesystem blocks without intermediate block device mapping
> > invalidation calls, so what is one minute a data block accessed
> > by DAX may become a metadata block that accessed via buffered
> > IO.  It all goes to crap very quickly....
> >
> > However, I'd say fsync is not the place to address this. This
> > block device cache aliasing issue is supposed to be what
> > unmap_underlying_metadata() solves, right?
> 
> I'll take a look at this.  Right now I'm trying to implement the
> "clear block-device-inode S_DAX on fs mount" approach.  My concern
> though is that  we need to disable block device mmap while a
> filesystem is mounted...

/me chokes on his coffee.

When did mmaping the block device behind the back of a mounted
fileystem become a valid use case? It's not supported for normal
block devices and for the same reasons it won't be supported for DAX
enabled block devices, either. i.e. I'm going to tell anyone who has
an application that does this to go and take a hike when (not if!)
they report filesystem corruption problems.

> Maybe I don't need to worry because it's already the case that a
> mmap of the raw device may not see the most up to date data for a
> file that has dirty fs-page-cache data.

It goes both ways. What happens if mkfs or fsck modifies the
block device via mmap+DAX and then the filesystem mounts the block
device and tries to read that metadata via the block device page
cache?

Quite frankly, DAX on the block device is a can of worms we really
don't need to deal with right now. IMO it's a solution looking for a
problem to solve, the "default to on" policy is wrong (DAX is
opt-in, not opt-out) and given this we should turn it off until
we've solved the more important problems we need to solve. i.e. We
need to concentrate on getting data integrity working correctly
first, then address the cache aliasing issues, then address the
"safe access" issues, and then we can re-introduce block device DAX
access...

Cheers,

Dave.

Dan Williams Feb. 11, 2016, 10:59 p.m. UTC | #10

On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote:
>> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@fromorbit.com> wrote:
>> [..]
>> >> It seems to me we need to modify the
>> >> metadata i/o paths to bypass the page cache,
>> >
>> > XFS doesn't use the block device page cache for it's metadata - it
>> > has it's own internal metadata cache structures and uses get_pages
>> > or heap memory to back it's metadata. But that doesn't make mixing
>> > DAX and pages in the block device mapping tree sane.
>> >
>> > What you are missing here is that the underlying architecture of
>> > journalling filesystems mean they can't use DAX for their metadata.
>> > Modifications have to be buffered, because they have to be written
>> > to the journal first before they are written back in place. IOWs, we
>> > need to buffer changes in volatile memory for some time, and that
>> > means we can't use DAX during transactional modifications.
>> >
>> > And to put the final nail in that coffin, metadata in XFS can be
>> > discontiguous multi-block objects - in those situations we vmap the
>> > underlying pages so they appear to the code to be a contiguous
>> > buffer, and that's something we can't do with DAX....
>>
>> Sorry, I wasn't clear when I said "bypass page cache" I meant a
>> solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition
>> table reads".
>
> So there's already bandaids to prevent bad shit from happening in
> the block layer, let alone when we consider all the ways that
> userspace can screw this all up.
>
>> However, I suspect that is broken if the filesystem is not ready
>> to see a new page allocated for every I/O.  I assume one
>> thread will want to insert a page in the radix for another thread
>> to find/manipulate before metadata gets written back to storage.
>
> Right, you can't do that, especially as the struct page has a 1-1
> relationship with the bufferhead that is attached to it as the
> bufferhead carries the filesystem state for the given cached page.
>
>> >> or teach the fsync code how to flush populated data pages out
>> >> of the radix.
>> >
>> > That doesn't solve the problem. Filesystems free and reallocate
>> > filesystem blocks without intermediate block device mapping
>> > invalidation calls, so what is one minute a data block accessed
>> > by DAX may become a metadata block that accessed via buffered
>> > IO.  It all goes to crap very quickly....
>> >
>> > However, I'd say fsync is not the place to address this. This
>> > block device cache aliasing issue is supposed to be what
>> > unmap_underlying_metadata() solves, right?
>>
>> I'll take a look at this.  Right now I'm trying to implement the
>> "clear block-device-inode S_DAX on fs mount" approach.  My concern
>> though is that  we need to disable block device mmap while a
>> filesystem is mounted...
>
> /me chokes on his coffee.
>
> When did mmaping the block device behind the back of a mounted
> fileystem become a valid use case? It's not supported for normal
> block devices and for the same reasons it won't be supported for DAX
> enabled block devices, either. i.e. I'm going to tell anyone who has
> an application that does this to go and take a hike when (not if!)
> they report filesystem corruption problems.

Right, but we need to not confuse the fsync code regardless of how bad
of an idea this is ::-).

>> Maybe I don't need to worry because it's already the case that a
>> mmap of the raw device may not see the most up to date data for a
>> file that has dirty fs-page-cache data.
>
> It goes both ways. What happens if mkfs or fsck modifies the
> block device via mmap+DAX and then the filesystem mounts the block
> device and tries to read that metadata via the block device page
> cache?
>
> Quite frankly, DAX on the block device is a can of worms we really
> don't need to deal with right now. IMO it's a solution looking for a
> problem to solve,

Virtualization use cases want to give large ranges to guest-VMs, and
it is currently the only way to reliably get 1GiB mappings.

> the "default to on" policy is wrong (DAX is
> opt-in, not opt-out) and given this we should turn it off until
> we've solved the more important problems we need to solve. i.e. We
> need to concentrate on getting data integrity working correctly
> first, then address the cache aliasing issues, then address the
> "safe access" issues, and then we can re-introduce block device DAX
> access...

Agreed.

Note that the "default-on policy" came from commit bbab37ddc20b
"block: Add support for DAX reads/writes to block devices" way back in
4.2.  We're just now noticing.  Credit Ross for good sanity checking.

Dave Chinner Feb. 11, 2016, 11:44 p.m. UTC | #11

On Thu, Feb 11, 2016 at 02:59:14PM -0800, Dan Williams wrote:
> On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote:
> >> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> Maybe I don't need to worry because it's already the case that a
> >> mmap of the raw device may not see the most up to date data for a
> >> file that has dirty fs-page-cache data.
> >
> > It goes both ways. What happens if mkfs or fsck modifies the
> > block device via mmap+DAX and then the filesystem mounts the block
> > device and tries to read that metadata via the block device page
> > cache?
> >
> > Quite frankly, DAX on the block device is a can of worms we really
> > don't need to deal with right now. IMO it's a solution looking for a
> > problem to solve,
> 
> Virtualization use cases want to give large ranges to guest-VMs, and
> it is currently the only way to reliably get 1GiB mappings.

Precisely my point - block devices are not the best way to solve
this problem.

A file, on XFS, with a 1GB extent size hint and preallocated to be
aligned to 1GB addresses (i.e. mkfs.xfs -d su=1G,sw=1 on the host
filesystem) will give reliable 1GB aligned blocks for DAX mappings,
just like a block device will. Peformance wise it's little different
to using the block device directly. Management wise it's way more
flexible, especially as such image files can be recycled for new VMs
almost instantly via FALLOC_FL_FLAG_ZERO_RANGE.

Cheers,

Dave.

[v2,2/2] dax: move writeback calls into the filesystems

Commit Message

Comments

Patch