Message ID | 20171024152415.22864-18-jack@suse.cz (mailing list archive) |
---|---|
State | Accepted |
Commit | a39e596baa07 |
Headers | show |
On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote: > From: Christoph Hellwig <hch@lst.de> > > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare > blocks for writing and the inode is pinned, and has dirty fields other > than the timestamps. In __xfs_filemap_fault() we then detect this case > and call dax_finish_sync_fault() to make sure all metadata is committed, > and to insert the page table entry. > > Note that this will also dirty corresponding radix tree entry which is > what we want - fsync(2) will still provide data integrity guarantees for > applications not using userspace flushing. And applications using > userspace flushing can avoid calling fsync(2) and thus avoid the > performance overhead. > > [JK: Added VM_SYNC flag handling] > > Signed-off-by: Christoph Hellwig <hch@lst.de> > Signed-off-by: Jan Kara <jack@suse.cz> I don't know enough about XFS dirty inode handling to be able to comment on the changes in xfs_file_iomap_begin(), but the rest looks good. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote: > From: Christoph Hellwig <hch@lst.de> > > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare > blocks for writing and the inode is pinned, and has dirty fields other > than the timestamps. That's "fdatasync dirty", not "fsync dirty". IOMAP_F_DIRTY needs a far better description of it's semantics than "/* block mapping is not yet on persistent storage */" so we know exactly what filesystems are supposed to be implementing here. I suspect that what it really is meant to say is: /* * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to * written data and requires fdatasync to commit to persistent storage. */ [....] > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index f179bdf1644d..b43be199fbdf 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -33,6 +33,7 @@ > #include "xfs_error.h" > #include "xfs_trans.h" > #include "xfs_trans_space.h" > +#include "xfs_inode_item.h" > #include "xfs_iomap.h" > #include "xfs_trace.h" > #include "xfs_icache.h" > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( > trace_xfs_iomap_found(ip, offset, length, 0, &imap); > } > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > + iomap->flags |= IOMAP_F_DIRTY; This is the very definition of an inode that is "fdatasync dirty". Hmmmm, shouldn't this also be set for read faults, too? Cheers, Dave.
On Wed 25-10-17 09:23:22, Dave Chinner wrote: > On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote: > > From: Christoph Hellwig <hch@lst.de> > > > > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare > > blocks for writing and the inode is pinned, and has dirty fields other > > than the timestamps. > > That's "fdatasync dirty", not "fsync dirty". Correct. > IOMAP_F_DIRTY needs a far better description of it's semantics than > "/* block mapping is not yet on persistent storage */" so we know > exactly what filesystems are supposed to be implementing here. I > suspect that what it really is meant to say is: > > /* > * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to > * written data and requires fdatasync to commit to persistent storage. > */ I'll update the comment. Thanks! > [....] > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > index f179bdf1644d..b43be199fbdf 100644 > > --- a/fs/xfs/xfs_iomap.c > > +++ b/fs/xfs/xfs_iomap.c > > @@ -33,6 +33,7 @@ > > #include "xfs_error.h" > > #include "xfs_trans.h" > > #include "xfs_trans_space.h" > > +#include "xfs_inode_item.h" > > #include "xfs_iomap.h" > > #include "xfs_trace.h" > > #include "xfs_icache.h" > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( > > trace_xfs_iomap_found(ip, offset, length, 0, &imap); > > } > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > > + iomap->flags |= IOMAP_F_DIRTY; > > This is the very definition of an inode that is "fdatasync dirty". > > Hmmmm, shouldn't this also be set for read faults, too? No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any data to the page which he'd then like to be persistent. The only reason why I thought it could be useful for a while was that it would be nice to make MAP_SYNC mapping provide the guarantee that data you see now is the data you'll see after a crash but we cannot provide that guarantee for RO mapping anyway if someone else has the page mapped as well. So I just decided not to return IOMAP_F_DIRTY for read faults. But now that I look at XFS implementation again, it misses handling of VM_FAULT_NEEDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right). I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite() as well since it mostly duplicates it anyway... Thanks for inquiring! Honza
On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: > On Wed 25-10-17 09:23:22, Dave Chinner wrote: > > On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote: > > > From: Christoph Hellwig <hch@lst.de> > > > > > > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare > > > blocks for writing and the inode is pinned, and has dirty fields other > > > than the timestamps. > > > > That's "fdatasync dirty", not "fsync dirty". > > Correct. > > > IOMAP_F_DIRTY needs a far better description of it's semantics than > > "/* block mapping is not yet on persistent storage */" so we know > > exactly what filesystems are supposed to be implementing here. I > > suspect that what it really is meant to say is: > > > > /* > > * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to > > * written data and requires fdatasync to commit to persistent storage. > > */ > > I'll update the comment. Thanks! > > > [....] > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > > index f179bdf1644d..b43be199fbdf 100644 > > > --- a/fs/xfs/xfs_iomap.c > > > +++ b/fs/xfs/xfs_iomap.c > > > @@ -33,6 +33,7 @@ > > > #include "xfs_error.h" > > > #include "xfs_trans.h" > > > #include "xfs_trans_space.h" > > > +#include "xfs_inode_item.h" > > > #include "xfs_iomap.h" > > > #include "xfs_trace.h" > > > #include "xfs_icache.h" > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( > > > trace_xfs_iomap_found(ip, offset, length, 0, &imap); > > > } > > > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > > > + iomap->flags |= IOMAP_F_DIRTY; > > > > This is the very definition of an inode that is "fdatasync dirty". > > > > Hmmmm, shouldn't this also be set for read faults, too? > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any > data to the page which he'd then like to be persistent. The only reason why > I thought it could be useful for a while was that it would be nice to make > MAP_SYNC mapping provide the guarantee that data you see now is the data > you'll see after a crash Isn't that the entire point of MAP_SYNC? i.e. That when we return from a page fault, the app knows that the data and it's underlying extent is on persistent storage? > but we cannot provide that guarantee for RO > mapping anyway if someone else has the page mapped as well. So I just > decided not to return IOMAP_F_DIRTY for read faults. If there are multiple MAP_SYNC mappings to the inode, I would have expected that they all sync all of the data/metadata on every page fault, regardless of who dirtied the inode. An RO mapping doesn't mean the data/metadata on the inode can't change, it just means it can't change through that mapping. Running fsync() to guarantee the persistence of that data/metadata doesn't actually changing any data.... IOWs, if read faults don't guarantee the mapped range has stable extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken because it's not giving consistent guarantees to userspace. Yes, it works fine when only one MAP_SYNC mapping is modifying the inode, but the moment we have concurrent operations on the inode that aren't MAP_SYNC or O_SYNC this goes out the window.... Cheers, Dave.
On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: > But now that I look at XFS implementation again, it misses handling > of VM_FAULT_NEEDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right). > I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite() > as well since it mostly duplicates it anyway... Thanks for inquiring! My first patches move xfs_filemap_pfn_mkwrite to use __xfs_filemap_fault, but that didn't work. Wish I'd remember why, though.
On Fri 27-10-17 08:43:01, Christoph Hellwig wrote: > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: > > But now that I look at XFS implementation again, it misses handling > > of VM_FAULT_NEEDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right). > > I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite() > > as well since it mostly duplicates it anyway... Thanks for inquiring! > > My first patches move xfs_filemap_pfn_mkwrite to use __xfs_filemap_fault, > but that didn't work. Wish I'd remember why, though. Maybe due to the additional check on IS_DAX(inode) in __xfs_filemap_fault() which could do the wrong thing if per-inode DAX flag is switched? Because otherwise __xfs_filemap_fault(vmf, PE_SIZE_PTE, true) does exactly the same thing as xfs_filemap_pfn_mkwrite() did. If we do care about per-inode DAX flag switching, I can just fixup xfs_filemap_pfn_mkwrite() but my understanding was that we ditched the idea at least until someone comes up with a reliable way to implement that... Honza
On Fri 27-10-17 08:16:11, Dave Chinner wrote: > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > > > index f179bdf1644d..b43be199fbdf 100644 > > > > --- a/fs/xfs/xfs_iomap.c > > > > +++ b/fs/xfs/xfs_iomap.c > > > > @@ -33,6 +33,7 @@ > > > > #include "xfs_error.h" > > > > #include "xfs_trans.h" > > > > #include "xfs_trans_space.h" > > > > +#include "xfs_inode_item.h" > > > > #include "xfs_iomap.h" > > > > #include "xfs_trace.h" > > > > #include "xfs_icache.h" > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( > > > > trace_xfs_iomap_found(ip, offset, length, 0, &imap); > > > > } > > > > > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > > > > + iomap->flags |= IOMAP_F_DIRTY; > > > > > > This is the very definition of an inode that is "fdatasync dirty". > > > > > > Hmmmm, shouldn't this also be set for read faults, too? > > > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any > > data to the page which he'd then like to be persistent. The only reason why > > I thought it could be useful for a while was that it would be nice to make > > MAP_SYNC mapping provide the guarantee that data you see now is the data > > you'll see after a crash > > Isn't that the entire point of MAP_SYNC? i.e. That when we return > from a page fault, the app knows that the data and it's underlying > extent is on persistent storage? > > > but we cannot provide that guarantee for RO > > mapping anyway if someone else has the page mapped as well. So I just > > decided not to return IOMAP_F_DIRTY for read faults. > > If there are multiple MAP_SYNC mappings to the inode, I would have > expected that they all sync all of the data/metadata on every page > fault, regardless of who dirtied the inode. An RO mapping doesn't Well, they all do sync regardless of who dirtied the inode on every *write* fault. > mean the data/metadata on the inode can't change, it just means it > can't change through that mapping. Running fsync() to guarantee the > persistence of that data/metadata doesn't actually changing any > data.... > > IOWs, if read faults don't guarantee the mapped range has stable > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken > because it's not giving consistent guarantees to userspace. Yes, it > works fine when only one MAP_SYNC mapping is modifying the inode, > but the moment we have concurrent operations on the inode that > aren't MAP_SYNC or O_SYNC this goes out the window.... MAP_SYNC as I've implemented it provides guarantees only for data the process has actually written. I agree with that and it was a conscious decision. In my opinion that covers most usecases, provides reasonably simple semantics (i.e., if you write data through MAP_SYNC mapping, you can persist it just using CPU instructions), and reasonable performance. Now you seem to suggest the semantics should be: "Data you have read from or written to a MAP_SYNC mapping can be persisted using CPU instructions." And from implementation POV we can do that rather easily (just rip out the IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would be useful enough to justify the slowdown of read faults? I was not able to come up with a good usecase and so I've decided for current semantics. What do other people think? And now that I've spelled out exact semantics I don't think your comparison that you can fsync() data you didn't write quite matches - with MAP_SYNC you will have to at least read the data to be able to persist it and you don't have that requirement for fsync() either... Honza
On Fri 27-10-17 12:08:34, Jan Kara wrote: > On Fri 27-10-17 08:16:11, Dave Chinner wrote: > > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > > > > index f179bdf1644d..b43be199fbdf 100644 > > > > > --- a/fs/xfs/xfs_iomap.c > > > > > +++ b/fs/xfs/xfs_iomap.c > > > > > @@ -33,6 +33,7 @@ > > > > > #include "xfs_error.h" > > > > > #include "xfs_trans.h" > > > > > #include "xfs_trans_space.h" > > > > > +#include "xfs_inode_item.h" > > > > > #include "xfs_iomap.h" > > > > > #include "xfs_trace.h" > > > > > #include "xfs_icache.h" > > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( > > > > > trace_xfs_iomap_found(ip, offset, length, 0, &imap); > > > > > } > > > > > > > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > > > > > + iomap->flags |= IOMAP_F_DIRTY; > > > > > > > > This is the very definition of an inode that is "fdatasync dirty". > > > > > > > > Hmmmm, shouldn't this also be set for read faults, too? > > > > > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any > > > data to the page which he'd then like to be persistent. The only reason why > > > I thought it could be useful for a while was that it would be nice to make > > > MAP_SYNC mapping provide the guarantee that data you see now is the data > > > you'll see after a crash > > > > Isn't that the entire point of MAP_SYNC? i.e. That when we return > > from a page fault, the app knows that the data and it's underlying > > extent is on persistent storage? > > > > > but we cannot provide that guarantee for RO > > > mapping anyway if someone else has the page mapped as well. So I just > > > decided not to return IOMAP_F_DIRTY for read faults. > > > > If there are multiple MAP_SYNC mappings to the inode, I would have > > expected that they all sync all of the data/metadata on every page > > fault, regardless of who dirtied the inode. An RO mapping doesn't > > Well, they all do sync regardless of who dirtied the inode on every *write* > fault. > > > mean the data/metadata on the inode can't change, it just means it > > can't change through that mapping. Running fsync() to guarantee the > > persistence of that data/metadata doesn't actually changing any > > data.... > > > > IOWs, if read faults don't guarantee the mapped range has stable > > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken > > because it's not giving consistent guarantees to userspace. Yes, it > > works fine when only one MAP_SYNC mapping is modifying the inode, > > but the moment we have concurrent operations on the inode that > > aren't MAP_SYNC or O_SYNC this goes out the window.... > > MAP_SYNC as I've implemented it provides guarantees only for data the > process has actually written. I agree with that and it was a conscious > decision. In my opinion that covers most usecases, provides reasonably > simple semantics (i.e., if you write data through MAP_SYNC mapping, you can > persist it just using CPU instructions), and reasonable performance. > > Now you seem to suggest the semantics should be: "Data you have read from or > written to a MAP_SYNC mapping can be persisted using CPU instructions." And > from implementation POV we can do that rather easily (just rip out the > IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would > be useful enough to justify the slowdown of read faults? I was not able to > come up with a good usecase and so I've decided for current semantics. What > do other people think? Nobody commented on this for couple of days so how do we proceed? I would prefer to go just with a guarantee for data written and we can always make the guarantee stronger (i.e. apply it also for read data) when some user comes with a good usecase? Honza
On Tue, Oct 31, 2017 at 8:19 AM, Jan Kara <jack@suse.cz> wrote: > On Fri 27-10-17 12:08:34, Jan Kara wrote: >> On Fri 27-10-17 08:16:11, Dave Chinner wrote: >> > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: >> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c >> > > > > index f179bdf1644d..b43be199fbdf 100644 >> > > > > --- a/fs/xfs/xfs_iomap.c >> > > > > +++ b/fs/xfs/xfs_iomap.c >> > > > > @@ -33,6 +33,7 @@ >> > > > > #include "xfs_error.h" >> > > > > #include "xfs_trans.h" >> > > > > #include "xfs_trans_space.h" >> > > > > +#include "xfs_inode_item.h" >> > > > > #include "xfs_iomap.h" >> > > > > #include "xfs_trace.h" >> > > > > #include "xfs_icache.h" >> > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( >> > > > > trace_xfs_iomap_found(ip, offset, length, 0, &imap); >> > > > > } >> > > > > >> > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && >> > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) >> > > > > + iomap->flags |= IOMAP_F_DIRTY; >> > > > >> > > > This is the very definition of an inode that is "fdatasync dirty". >> > > > >> > > > Hmmmm, shouldn't this also be set for read faults, too? >> > > >> > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any >> > > data to the page which he'd then like to be persistent. The only reason why >> > > I thought it could be useful for a while was that it would be nice to make >> > > MAP_SYNC mapping provide the guarantee that data you see now is the data >> > > you'll see after a crash >> > >> > Isn't that the entire point of MAP_SYNC? i.e. That when we return >> > from a page fault, the app knows that the data and it's underlying >> > extent is on persistent storage? >> > >> > > but we cannot provide that guarantee for RO >> > > mapping anyway if someone else has the page mapped as well. So I just >> > > decided not to return IOMAP_F_DIRTY for read faults. >> > >> > If there are multiple MAP_SYNC mappings to the inode, I would have >> > expected that they all sync all of the data/metadata on every page >> > fault, regardless of who dirtied the inode. An RO mapping doesn't >> >> Well, they all do sync regardless of who dirtied the inode on every *write* >> fault. >> >> > mean the data/metadata on the inode can't change, it just means it >> > can't change through that mapping. Running fsync() to guarantee the >> > persistence of that data/metadata doesn't actually changing any >> > data.... >> > >> > IOWs, if read faults don't guarantee the mapped range has stable >> > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken >> > because it's not giving consistent guarantees to userspace. Yes, it >> > works fine when only one MAP_SYNC mapping is modifying the inode, >> > but the moment we have concurrent operations on the inode that >> > aren't MAP_SYNC or O_SYNC this goes out the window.... >> >> MAP_SYNC as I've implemented it provides guarantees only for data the >> process has actually written. I agree with that and it was a conscious >> decision. In my opinion that covers most usecases, provides reasonably >> simple semantics (i.e., if you write data through MAP_SYNC mapping, you can >> persist it just using CPU instructions), and reasonable performance. >> >> Now you seem to suggest the semantics should be: "Data you have read from or >> written to a MAP_SYNC mapping can be persisted using CPU instructions." And >> from implementation POV we can do that rather easily (just rip out the >> IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would >> be useful enough to justify the slowdown of read faults? I was not able to >> come up with a good usecase and so I've decided for current semantics. What >> do other people think? > > Nobody commented on this for couple of days so how do we proceed? I would > prefer to go just with a guarantee for data written and we can always make > the guarantee stronger (i.e. apply it also for read data) when some user > comes with a good usecase? I think it is easier to strengthen the guarantee than loosen it later especially since it is not yet clear that we have a use case for the stronger semantic. At least the initial motivation for MAP_SYNC was for writers.
On Tue, Oct 31, 2017 at 02:50:01PM -0700, Dan Williams wrote: > On Tue, Oct 31, 2017 at 8:19 AM, Jan Kara <jack@suse.cz> wrote: > > On Fri 27-10-17 12:08:34, Jan Kara wrote: > >> On Fri 27-10-17 08:16:11, Dave Chinner wrote: > >> > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: > >> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > >> > > > > index f179bdf1644d..b43be199fbdf 100644 > >> > > > > --- a/fs/xfs/xfs_iomap.c > >> > > > > +++ b/fs/xfs/xfs_iomap.c > >> > > > > @@ -33,6 +33,7 @@ > >> > > > > #include "xfs_error.h" > >> > > > > #include "xfs_trans.h" > >> > > > > #include "xfs_trans_space.h" > >> > > > > +#include "xfs_inode_item.h" > >> > > > > #include "xfs_iomap.h" > >> > > > > #include "xfs_trace.h" > >> > > > > #include "xfs_icache.h" > >> > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( > >> > > > > trace_xfs_iomap_found(ip, offset, length, 0, &imap); > >> > > > > } > >> > > > > > >> > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > >> > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > >> > > > > + iomap->flags |= IOMAP_F_DIRTY; > >> > > > > >> > > > This is the very definition of an inode that is "fdatasync dirty". > >> > > > > >> > > > Hmmmm, shouldn't this also be set for read faults, too? > >> > > > >> > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any > >> > > data to the page which he'd then like to be persistent. The only reason why > >> > > I thought it could be useful for a while was that it would be nice to make > >> > > MAP_SYNC mapping provide the guarantee that data you see now is the data > >> > > you'll see after a crash > >> > > >> > Isn't that the entire point of MAP_SYNC? i.e. That when we return > >> > from a page fault, the app knows that the data and it's underlying > >> > extent is on persistent storage? > >> > > >> > > but we cannot provide that guarantee for RO > >> > > mapping anyway if someone else has the page mapped as well. So I just > >> > > decided not to return IOMAP_F_DIRTY for read faults. > >> > > >> > If there are multiple MAP_SYNC mappings to the inode, I would have > >> > expected that they all sync all of the data/metadata on every page > >> > fault, regardless of who dirtied the inode. An RO mapping doesn't > >> > >> Well, they all do sync regardless of who dirtied the inode on every *write* > >> fault. > >> > >> > mean the data/metadata on the inode can't change, it just means it > >> > can't change through that mapping. Running fsync() to guarantee the > >> > persistence of that data/metadata doesn't actually changing any > >> > data.... > >> > > >> > IOWs, if read faults don't guarantee the mapped range has stable > >> > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken > >> > because it's not giving consistent guarantees to userspace. Yes, it > >> > works fine when only one MAP_SYNC mapping is modifying the inode, > >> > but the moment we have concurrent operations on the inode that > >> > aren't MAP_SYNC or O_SYNC this goes out the window.... > >> > >> MAP_SYNC as I've implemented it provides guarantees only for data the > >> process has actually written. I agree with that and it was a conscious > >> decision. In my opinion that covers most usecases, provides reasonably > >> simple semantics (i.e., if you write data through MAP_SYNC mapping, you can > >> persist it just using CPU instructions), and reasonable performance. > >> > >> Now you seem to suggest the semantics should be: "Data you have read from or > >> written to a MAP_SYNC mapping can be persisted using CPU instructions." And > >> from implementation POV we can do that rather easily (just rip out the > >> IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would > >> be useful enough to justify the slowdown of read faults? I was not able to > >> come up with a good usecase and so I've decided for current semantics. What > >> do other people think? > > > > Nobody commented on this for couple of days so how do we proceed? I would > > prefer to go just with a guarantee for data written and we can always make > > the guarantee stronger (i.e. apply it also for read data) when some user > > comes with a good usecase? > > I think it is easier to strengthen the guarantee than loosen it later > especially since it is not yet clear that we have a use case for the > stronger semantic. At least the initial motivation for MAP_SYNC was > for writers. I agree. It seems like all threads/processes in a given application need to use MAP_SYNC consistently so they can be sure that data that is written (and then possibly read) will be durable on media. I think what you have is a good starting point, and we can adjust later if necessary.
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 7c6b8def6eed..02093df4b314 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -44,6 +44,7 @@ #include <linux/falloc.h> #include <linux/pagevec.h> #include <linux/backing-dev.h> +#include <linux/mman.h> static const struct vm_operations_struct xfs_file_vm_ops; @@ -1040,7 +1041,11 @@ __xfs_filemap_fault( xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); if (IS_DAX(inode)) { - ret = dax_iomap_fault(vmf, pe_size, NULL, &xfs_iomap_ops); + pfn_t pfn; + + ret = dax_iomap_fault(vmf, pe_size, &pfn, &xfs_iomap_ops); + if (ret & VM_FAULT_NEEDDSYNC) + ret = dax_finish_sync_fault(vmf, pe_size, pfn); } else { if (write_fault) ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops); @@ -1131,6 +1136,13 @@ xfs_file_mmap( struct file *filp, struct vm_area_struct *vma) { + /* + * We don't support synchronous mappings for non-DAX files. At least + * until someone comes with a sensible use case. + */ + if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC)) + return -EOPNOTSUPP; + file_accessed(filp); vma->vm_ops = &xfs_file_vm_ops; if (IS_DAX(file_inode(filp))) @@ -1149,6 +1161,7 @@ const struct file_operations xfs_file_operations = { .compat_ioctl = xfs_file_compat_ioctl, #endif .mmap = xfs_file_mmap, + .mmap_supported_flags = MAP_SYNC, .open = xfs_file_open, .release = xfs_file_release, .fsync = xfs_file_fsync, diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index f179bdf1644d..b43be199fbdf 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -33,6 +33,7 @@ #include "xfs_error.h" #include "xfs_trans.h" #include "xfs_trans_space.h" +#include "xfs_inode_item.h" #include "xfs_iomap.h" #include "xfs_trace.h" #include "xfs_icache.h" @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( trace_xfs_iomap_found(ip, offset, length, 0, &imap); } + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) + iomap->flags |= IOMAP_F_DIRTY; + xfs_bmbt_to_iomap(ip, iomap, &imap); if (shared)