[17/17] xfs: support for synchronous DAX faults

Message ID	20171024152415.22864-18-jack@suse.cz (mailing list archive)
State	Accepted
Commit	a39e596baa07
Headers	show Return-Path: <linux-nvdimm-bounces@lists.01.org> Received-SPF: Pass (sender SPF authorized) identity=mailfrom; client-ip=195.135.220.15; helo=mx2.suse.de; envelope-from=jack@suse.cz; receiver=linux-nvdimm@lists.01.org From: Jan Kara <jack@suse.cz> To: Dan Williams <dan.j.williams@intel.com> Subject: [PATCH 17/17] xfs: support for synchronous DAX faults Date: Tue, 24 Oct 2017 17:24:14 +0200 Message-Id: <20171024152415.22864-18-jack@suse.cz> In-Reply-To: <20171024152415.22864-1-jack@suse.cz> References: <20171024152415.22864-1-jack@suse.cz> Precedence: list Cc: linux-xfs@vger.kernel.org, Jan Kara <jack@suse.cz>, linux-nvdimm@lists.01.org, linux-api@vger.kernel.org, Christoph Hellwig <hch@infradead.org>, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, Christoph Hellwig <hch@lst.de> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: linux-nvdimm-bounces@lists.01.org Sender: "Linux-nvdimm" <linux-nvdimm-bounces@lists.01.org>

Jan Kara Oct. 24, 2017, 3:24 p.m. UTC

From: Christoph Hellwig <hch@lst.de>

Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
blocks for writing and the inode is pinned, and has dirty fields other
than the timestamps.  In __xfs_filemap_fault() we then detect this case
and call dax_finish_sync_fault() to make sure all metadata is committed,
and to insert the page table entry.

Note that this will also dirty corresponding radix tree entry which is
what we want - fsync(2) will still provide data integrity guarantees for
applications not using userspace flushing. And applications using
userspace flushing can avoid calling fsync(2) and thus avoid the
performance overhead.

[JK: Added VM_SYNC flag handling]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_file.c  | 15 ++++++++++++++-
 fs/xfs/xfs_iomap.c |  5 +++++
 2 files changed, 19 insertions(+), 1 deletion(-)

Ross Zwisler Oct. 24, 2017, 9:29 p.m. UTC | #1

On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> blocks for writing and the inode is pinned, and has dirty fields other
> than the timestamps.  In __xfs_filemap_fault() we then detect this case
> and call dax_finish_sync_fault() to make sure all metadata is committed,
> and to insert the page table entry.
> 
> Note that this will also dirty corresponding radix tree entry which is
> what we want - fsync(2) will still provide data integrity guarantees for
> applications not using userspace flushing. And applications using
> userspace flushing can avoid calling fsync(2) and thus avoid the
> performance overhead.
> 
> [JK: Added VM_SYNC flag handling]
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Jan Kara <jack@suse.cz>

I don't know enough about XFS dirty inode handling to be able to comment on
the changes in xfs_file_iomap_begin(), but the rest looks good.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Dave Chinner Oct. 24, 2017, 10:23 p.m. UTC | #2

On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> blocks for writing and the inode is pinned, and has dirty fields other
> than the timestamps.

That's "fdatasync dirty", not "fsync dirty".

IOMAP_F_DIRTY needs a far better description of it's semantics than
"/* block mapping is not yet on persistent storage */" so we know
exactly what filesystems are supposed to be implementing here. I
suspect that what it really is meant to say is:

/*
 * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to
 * written data and requires fdatasync to commit to persistent storage.
 */

[....]

> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index f179bdf1644d..b43be199fbdf 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -33,6 +33,7 @@
>  #include "xfs_error.h"
>  #include "xfs_trans.h"
>  #include "xfs_trans_space.h"
> +#include "xfs_inode_item.h"
>  #include "xfs_iomap.h"
>  #include "xfs_trace.h"
>  #include "xfs_icache.h"
> @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
>  		trace_xfs_iomap_found(ip, offset, length, 0, &imap);
>  	}
>  
> +	if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> +	    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> +		iomap->flags |= IOMAP_F_DIRTY;

This is the very definition of an inode that is "fdatasync dirty".

Hmmmm, shouldn't this also be set for read faults, too?

Cheers,

Dave.

Jan Kara Oct. 26, 2017, 3:48 p.m. UTC | #3

On Wed 25-10-17 09:23:22, Dave Chinner wrote:
> On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> > From: Christoph Hellwig <hch@lst.de>
> > 
> > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> > blocks for writing and the inode is pinned, and has dirty fields other
> > than the timestamps.
> 
> That's "fdatasync dirty", not "fsync dirty".

Correct.

> IOMAP_F_DIRTY needs a far better description of it's semantics than
> "/* block mapping is not yet on persistent storage */" so we know
> exactly what filesystems are supposed to be implementing here. I
> suspect that what it really is meant to say is:
> 
> /*
>  * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to
>  * written data and requires fdatasync to commit to persistent storage.
>  */

I'll update the comment. Thanks!

> [....]
> 
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index f179bdf1644d..b43be199fbdf 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -33,6 +33,7 @@
> >  #include "xfs_error.h"
> >  #include "xfs_trans.h"
> >  #include "xfs_trans_space.h"
> > +#include "xfs_inode_item.h"
> >  #include "xfs_iomap.h"
> >  #include "xfs_trace.h"
> >  #include "xfs_icache.h"
> > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> >  		trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> >  	}
> >  
> > +	if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> > +	    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > +		iomap->flags |= IOMAP_F_DIRTY;
> 
> This is the very definition of an inode that is "fdatasync dirty".
> 
> Hmmmm, shouldn't this also be set for read faults, too?

No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any
data to the page which he'd then like to be persistent. The only reason why
I thought it could be useful for a while was that it would be nice to make
MAP_SYNC mapping provide the guarantee that data you see now is the data
you'll see after a crash but we cannot provide that guarantee for RO
mapping anyway if someone else has the page mapped as well. So I just
decided not to return IOMAP_F_DIRTY for read faults.

But now that I look at XFS implementation again, it misses handling
of VM_FAULT_NEEDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right).
I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite()
as well since it mostly duplicates it anyway... Thanks for inquiring!

								Honza

Dave Chinner Oct. 26, 2017, 9:16 p.m. UTC | #4

On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> On Wed 25-10-17 09:23:22, Dave Chinner wrote:
> > On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> > > From: Christoph Hellwig <hch@lst.de>
> > > 
> > > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> > > blocks for writing and the inode is pinned, and has dirty fields other
> > > than the timestamps.
> > 
> > That's "fdatasync dirty", not "fsync dirty".
> 
> Correct.
> 
> > IOMAP_F_DIRTY needs a far better description of it's semantics than
> > "/* block mapping is not yet on persistent storage */" so we know
> > exactly what filesystems are supposed to be implementing here. I
> > suspect that what it really is meant to say is:
> > 
> > /*
> >  * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to
> >  * written data and requires fdatasync to commit to persistent storage.
> >  */
> 
> I'll update the comment. Thanks!
> 
> > [....]
> > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index f179bdf1644d..b43be199fbdf 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -33,6 +33,7 @@
> > >  #include "xfs_error.h"
> > >  #include "xfs_trans.h"
> > >  #include "xfs_trans_space.h"
> > > +#include "xfs_inode_item.h"
> > >  #include "xfs_iomap.h"
> > >  #include "xfs_trace.h"
> > >  #include "xfs_icache.h"
> > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> > >  		trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> > >  	}
> > >  
> > > +	if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> > > +	    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > > +		iomap->flags |= IOMAP_F_DIRTY;
> > 
> > This is the very definition of an inode that is "fdatasync dirty".
> > 
> > Hmmmm, shouldn't this also be set for read faults, too?
> 
> No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any
> data to the page which he'd then like to be persistent. The only reason why
> I thought it could be useful for a while was that it would be nice to make
> MAP_SYNC mapping provide the guarantee that data you see now is the data
> you'll see after a crash

Isn't that the entire point of MAP_SYNC? i.e. That when we return
from a page fault, the app knows that the data and it's underlying
extent is on persistent storage?

> but we cannot provide that guarantee for RO
> mapping anyway if someone else has the page mapped as well. So I just
> decided not to return IOMAP_F_DIRTY for read faults.

If there are multiple MAP_SYNC mappings to the inode, I would have
expected that they all sync all of the data/metadata on every page
fault, regardless of who dirtied the inode. An RO mapping doesn't
mean the data/metadata on the inode can't change, it just means it
can't change through that mapping.  Running fsync() to guarantee the
persistence of that data/metadata doesn't actually changing any
data....

IOWs, if read faults don't guarantee the mapped range has stable
extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
because it's not giving consistent guarantees to userspace. Yes, it
works fine when only one MAP_SYNC mapping is modifying the inode,
but the moment we have concurrent operations on the inode that
aren't MAP_SYNC or O_SYNC this goes out the window....

Cheers,

Dave.

Christoph Hellwig Oct. 27, 2017, 6:43 a.m. UTC | #5

On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> But now that I look at XFS implementation again, it misses handling
> of VM_FAULT_NEEDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right).
> I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite()
> as well since it mostly duplicates it anyway... Thanks for inquiring!

My first patches move xfs_filemap_pfn_mkwrite to use __xfs_filemap_fault,
but that didn't work.  Wish I'd remember why, though.

Jan Kara Oct. 27, 2017, 9:13 a.m. UTC | #6

On Fri 27-10-17 08:43:01, Christoph Hellwig wrote:
> On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> > But now that I look at XFS implementation again, it misses handling
> > of VM_FAULT_NEEDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right).
> > I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite()
> > as well since it mostly duplicates it anyway... Thanks for inquiring!
> 
> My first patches move xfs_filemap_pfn_mkwrite to use __xfs_filemap_fault,
> but that didn't work.  Wish I'd remember why, though.

Maybe due to the additional check on IS_DAX(inode) in __xfs_filemap_fault()
which could do the wrong thing if per-inode DAX flag is switched? Because
otherwise __xfs_filemap_fault(vmf, PE_SIZE_PTE, true) does exactly the same
thing as xfs_filemap_pfn_mkwrite() did.

If we do care about per-inode DAX flag switching, I can just fixup
xfs_filemap_pfn_mkwrite() but my understanding was that we ditched the idea
at least until someone comes up with a reliable way to implement that...

							Honza

Jan Kara Oct. 27, 2017, 10:08 a.m. UTC | #7

On Fri 27-10-17 08:16:11, Dave Chinner wrote:
> On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > index f179bdf1644d..b43be199fbdf 100644
> > > > --- a/fs/xfs/xfs_iomap.c
> > > > +++ b/fs/xfs/xfs_iomap.c
> > > > @@ -33,6 +33,7 @@
> > > >  #include "xfs_error.h"
> > > >  #include "xfs_trans.h"
> > > >  #include "xfs_trans_space.h"
> > > > +#include "xfs_inode_item.h"
> > > >  #include "xfs_iomap.h"
> > > >  #include "xfs_trace.h"
> > > >  #include "xfs_icache.h"
> > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> > > >  		trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> > > >  	}
> > > >  
> > > > +	if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> > > > +	    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > > > +		iomap->flags |= IOMAP_F_DIRTY;
> > > 
> > > This is the very definition of an inode that is "fdatasync dirty".
> > > 
> > > Hmmmm, shouldn't this also be set for read faults, too?
> > 
> > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any
> > data to the page which he'd then like to be persistent. The only reason why
> > I thought it could be useful for a while was that it would be nice to make
> > MAP_SYNC mapping provide the guarantee that data you see now is the data
> > you'll see after a crash
> 
> Isn't that the entire point of MAP_SYNC? i.e. That when we return
> from a page fault, the app knows that the data and it's underlying
> extent is on persistent storage?
> 
> > but we cannot provide that guarantee for RO
> > mapping anyway if someone else has the page mapped as well. So I just
> > decided not to return IOMAP_F_DIRTY for read faults.
> 
> If there are multiple MAP_SYNC mappings to the inode, I would have
> expected that they all sync all of the data/metadata on every page
> fault, regardless of who dirtied the inode. An RO mapping doesn't

Well, they all do sync regardless of who dirtied the inode on every *write*
fault.

> mean the data/metadata on the inode can't change, it just means it
> can't change through that mapping.  Running fsync() to guarantee the
> persistence of that data/metadata doesn't actually changing any
> data....
> 
> IOWs, if read faults don't guarantee the mapped range has stable
> extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
> because it's not giving consistent guarantees to userspace. Yes, it
> works fine when only one MAP_SYNC mapping is modifying the inode,
> but the moment we have concurrent operations on the inode that
> aren't MAP_SYNC or O_SYNC this goes out the window....

MAP_SYNC as I've implemented it provides guarantees only for data the
process has actually written. I agree with that and it was a conscious
decision. In my opinion that covers most usecases, provides reasonably
simple semantics (i.e., if you write data through MAP_SYNC mapping, you can
persist it just using CPU instructions), and reasonable performance.

Now you seem to suggest the semantics should be: "Data you have read from or
written to a MAP_SYNC mapping can be persisted using CPU instructions." And
from implementation POV we can do that rather easily (just rip out the
IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would
be useful enough to justify the slowdown of read faults? I was not able to
come up with a good usecase and so I've decided for current semantics. What
do other people think?

And now that I've spelled out exact semantics I don't think your comparison
that you can fsync() data you didn't write quite matches - with MAP_SYNC
you will have to at least read the data to be able to persist it and you
don't have that requirement for fsync() either...

								Honza

Jan Kara Oct. 31, 2017, 3:19 p.m. UTC | #8

On Fri 27-10-17 12:08:34, Jan Kara wrote:
> On Fri 27-10-17 08:16:11, Dave Chinner wrote:
> > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > > index f179bdf1644d..b43be199fbdf 100644
> > > > > --- a/fs/xfs/xfs_iomap.c
> > > > > +++ b/fs/xfs/xfs_iomap.c
> > > > > @@ -33,6 +33,7 @@
> > > > >  #include "xfs_error.h"
> > > > >  #include "xfs_trans.h"
> > > > >  #include "xfs_trans_space.h"
> > > > > +#include "xfs_inode_item.h"
> > > > >  #include "xfs_iomap.h"
> > > > >  #include "xfs_trace.h"
> > > > >  #include "xfs_icache.h"
> > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> > > > >  		trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> > > > >  	}
> > > > >  
> > > > > +	if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> > > > > +	    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > > > > +		iomap->flags |= IOMAP_F_DIRTY;
> > > > 
> > > > This is the very definition of an inode that is "fdatasync dirty".
> > > > 
> > > > Hmmmm, shouldn't this also be set for read faults, too?
> > > 
> > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any
> > > data to the page which he'd then like to be persistent. The only reason why
> > > I thought it could be useful for a while was that it would be nice to make
> > > MAP_SYNC mapping provide the guarantee that data you see now is the data
> > > you'll see after a crash
> > 
> > Isn't that the entire point of MAP_SYNC? i.e. That when we return
> > from a page fault, the app knows that the data and it's underlying
> > extent is on persistent storage?
> > 
> > > but we cannot provide that guarantee for RO
> > > mapping anyway if someone else has the page mapped as well. So I just
> > > decided not to return IOMAP_F_DIRTY for read faults.
> > 
> > If there are multiple MAP_SYNC mappings to the inode, I would have
> > expected that they all sync all of the data/metadata on every page
> > fault, regardless of who dirtied the inode. An RO mapping doesn't
> 
> Well, they all do sync regardless of who dirtied the inode on every *write*
> fault.
> 
> > mean the data/metadata on the inode can't change, it just means it
> > can't change through that mapping.  Running fsync() to guarantee the
> > persistence of that data/metadata doesn't actually changing any
> > data....
> > 
> > IOWs, if read faults don't guarantee the mapped range has stable
> > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
> > because it's not giving consistent guarantees to userspace. Yes, it
> > works fine when only one MAP_SYNC mapping is modifying the inode,
> > but the moment we have concurrent operations on the inode that
> > aren't MAP_SYNC or O_SYNC this goes out the window....
> 
> MAP_SYNC as I've implemented it provides guarantees only for data the
> process has actually written. I agree with that and it was a conscious
> decision. In my opinion that covers most usecases, provides reasonably
> simple semantics (i.e., if you write data through MAP_SYNC mapping, you can
> persist it just using CPU instructions), and reasonable performance.
> 
> Now you seem to suggest the semantics should be: "Data you have read from or
> written to a MAP_SYNC mapping can be persisted using CPU instructions." And
> from implementation POV we can do that rather easily (just rip out the
> IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would
> be useful enough to justify the slowdown of read faults? I was not able to
> come up with a good usecase and so I've decided for current semantics. What
> do other people think?

Nobody commented on this for couple of days so how do we proceed? I would
prefer to go just with a guarantee for data written and we can always make
the guarantee stronger (i.e. apply it also for read data) when some user
comes with a good usecase?

								Honza

Dan Williams Oct. 31, 2017, 9:50 p.m. UTC | #9

On Tue, Oct 31, 2017 at 8:19 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 27-10-17 12:08:34, Jan Kara wrote:
>> On Fri 27-10-17 08:16:11, Dave Chinner wrote:
>> > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
>> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> > > > > index f179bdf1644d..b43be199fbdf 100644
>> > > > > --- a/fs/xfs/xfs_iomap.c
>> > > > > +++ b/fs/xfs/xfs_iomap.c
>> > > > > @@ -33,6 +33,7 @@
>> > > > >  #include "xfs_error.h"
>> > > > >  #include "xfs_trans.h"
>> > > > >  #include "xfs_trans_space.h"
>> > > > > +#include "xfs_inode_item.h"
>> > > > >  #include "xfs_iomap.h"
>> > > > >  #include "xfs_trace.h"
>> > > > >  #include "xfs_icache.h"
>> > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
>> > > > >               trace_xfs_iomap_found(ip, offset, length, 0, &imap);
>> > > > >       }
>> > > > >
>> > > > > +     if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
>> > > > > +         (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
>> > > > > +             iomap->flags |= IOMAP_F_DIRTY;
>> > > >
>> > > > This is the very definition of an inode that is "fdatasync dirty".
>> > > >
>> > > > Hmmmm, shouldn't this also be set for read faults, too?
>> > >
>> > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any
>> > > data to the page which he'd then like to be persistent. The only reason why
>> > > I thought it could be useful for a while was that it would be nice to make
>> > > MAP_SYNC mapping provide the guarantee that data you see now is the data
>> > > you'll see after a crash
>> >
>> > Isn't that the entire point of MAP_SYNC? i.e. That when we return
>> > from a page fault, the app knows that the data and it's underlying
>> > extent is on persistent storage?
>> >
>> > > but we cannot provide that guarantee for RO
>> > > mapping anyway if someone else has the page mapped as well. So I just
>> > > decided not to return IOMAP_F_DIRTY for read faults.
>> >
>> > If there are multiple MAP_SYNC mappings to the inode, I would have
>> > expected that they all sync all of the data/metadata on every page
>> > fault, regardless of who dirtied the inode. An RO mapping doesn't
>>
>> Well, they all do sync regardless of who dirtied the inode on every *write*
>> fault.
>>
>> > mean the data/metadata on the inode can't change, it just means it
>> > can't change through that mapping.  Running fsync() to guarantee the
>> > persistence of that data/metadata doesn't actually changing any
>> > data....
>> >
>> > IOWs, if read faults don't guarantee the mapped range has stable
>> > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
>> > because it's not giving consistent guarantees to userspace. Yes, it
>> > works fine when only one MAP_SYNC mapping is modifying the inode,
>> > but the moment we have concurrent operations on the inode that
>> > aren't MAP_SYNC or O_SYNC this goes out the window....
>>
>> MAP_SYNC as I've implemented it provides guarantees only for data the
>> process has actually written. I agree with that and it was a conscious
>> decision. In my opinion that covers most usecases, provides reasonably
>> simple semantics (i.e., if you write data through MAP_SYNC mapping, you can
>> persist it just using CPU instructions), and reasonable performance.
>>
>> Now you seem to suggest the semantics should be: "Data you have read from or
>> written to a MAP_SYNC mapping can be persisted using CPU instructions." And
>> from implementation POV we can do that rather easily (just rip out the
>> IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would
>> be useful enough to justify the slowdown of read faults? I was not able to
>> come up with a good usecase and so I've decided for current semantics. What
>> do other people think?
>
> Nobody commented on this for couple of days so how do we proceed? I would
> prefer to go just with a guarantee for data written and we can always make
> the guarantee stronger (i.e. apply it also for read data) when some user
> comes with a good usecase?

I think it is easier to strengthen the guarantee than loosen it later
especially since it is not yet clear that we have a use case for the
stronger semantic. At least the initial motivation for MAP_SYNC was
for writers.

Ross Zwisler Nov. 1, 2017, 3:47 a.m. UTC | #10

On Tue, Oct 31, 2017 at 02:50:01PM -0700, Dan Williams wrote:
> On Tue, Oct 31, 2017 at 8:19 AM, Jan Kara <jack@suse.cz> wrote:
> > On Fri 27-10-17 12:08:34, Jan Kara wrote:
> >> On Fri 27-10-17 08:16:11, Dave Chinner wrote:
> >> > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> >> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> >> > > > > index f179bdf1644d..b43be199fbdf 100644
> >> > > > > --- a/fs/xfs/xfs_iomap.c
> >> > > > > +++ b/fs/xfs/xfs_iomap.c
> >> > > > > @@ -33,6 +33,7 @@
> >> > > > >  #include "xfs_error.h"
> >> > > > >  #include "xfs_trans.h"
> >> > > > >  #include "xfs_trans_space.h"
> >> > > > > +#include "xfs_inode_item.h"
> >> > > > >  #include "xfs_iomap.h"
> >> > > > >  #include "xfs_trace.h"
> >> > > > >  #include "xfs_icache.h"
> >> > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> >> > > > >               trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> >> > > > >       }
> >> > > > >
> >> > > > > +     if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> >> > > > > +         (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> >> > > > > +             iomap->flags |= IOMAP_F_DIRTY;
> >> > > >
> >> > > > This is the very definition of an inode that is "fdatasync dirty".
> >> > > >
> >> > > > Hmmmm, shouldn't this also be set for read faults, too?
> >> > >
> >> > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any
> >> > > data to the page which he'd then like to be persistent. The only reason why
> >> > > I thought it could be useful for a while was that it would be nice to make
> >> > > MAP_SYNC mapping provide the guarantee that data you see now is the data
> >> > > you'll see after a crash
> >> >
> >> > Isn't that the entire point of MAP_SYNC? i.e. That when we return
> >> > from a page fault, the app knows that the data and it's underlying
> >> > extent is on persistent storage?
> >> >
> >> > > but we cannot provide that guarantee for RO
> >> > > mapping anyway if someone else has the page mapped as well. So I just
> >> > > decided not to return IOMAP_F_DIRTY for read faults.
> >> >
> >> > If there are multiple MAP_SYNC mappings to the inode, I would have
> >> > expected that they all sync all of the data/metadata on every page
> >> > fault, regardless of who dirtied the inode. An RO mapping doesn't
> >>
> >> Well, they all do sync regardless of who dirtied the inode on every *write*
> >> fault.
> >>
> >> > mean the data/metadata on the inode can't change, it just means it
> >> > can't change through that mapping.  Running fsync() to guarantee the
> >> > persistence of that data/metadata doesn't actually changing any
> >> > data....
> >> >
> >> > IOWs, if read faults don't guarantee the mapped range has stable
> >> > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
> >> > because it's not giving consistent guarantees to userspace. Yes, it
> >> > works fine when only one MAP_SYNC mapping is modifying the inode,
> >> > but the moment we have concurrent operations on the inode that
> >> > aren't MAP_SYNC or O_SYNC this goes out the window....
> >>
> >> MAP_SYNC as I've implemented it provides guarantees only for data the
> >> process has actually written. I agree with that and it was a conscious
> >> decision. In my opinion that covers most usecases, provides reasonably
> >> simple semantics (i.e., if you write data through MAP_SYNC mapping, you can
> >> persist it just using CPU instructions), and reasonable performance.
> >>
> >> Now you seem to suggest the semantics should be: "Data you have read from or
> >> written to a MAP_SYNC mapping can be persisted using CPU instructions." And
> >> from implementation POV we can do that rather easily (just rip out the
> >> IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would
> >> be useful enough to justify the slowdown of read faults? I was not able to
> >> come up with a good usecase and so I've decided for current semantics. What
> >> do other people think?
> >
> > Nobody commented on this for couple of days so how do we proceed? I would
> > prefer to go just with a guarantee for data written and we can always make
> > the guarantee stronger (i.e. apply it also for read data) when some user
> > comes with a good usecase?
> 
> I think it is easier to strengthen the guarantee than loosen it later
> especially since it is not yet clear that we have a use case for the
> stronger semantic. At least the initial motivation for MAP_SYNC was
> for writers.

I agree.  It seems like all threads/processes in a given application need to
use MAP_SYNC consistently so they can be sure that data that is written (and
then possibly read) will be durable on media.  I think what you have is a good
starting point, and we can adjust later if necessary.

[17/17] xfs: support for synchronous DAX faults

Commit Message

Comments

Patch