[RFC] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA

Message ID 20190527172655.9287-1-amir73il@gmail.com (mailing list archive)
State New, archived
Series [RFC] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA

Commit Message

Amir Goldstein May 27, 2019, 5:26 p.m. UTC
New link flags to request "atomic" link.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Hi Guys,

Following our discussions on LSF/MM and beyond [1][2], here is
an RFC documentation patch.

Ted, I know we discussed limiting the API for linking an O_TMPFILE
to avert the hardlinks issue, but I decided it would be better to
document the hardlinks non-guarantee instead. This will allow me to
replicate the same semantics and documentation to renameat(2).
Let me know how that works out for you.

I also decided to try out two separate flags for data and metadata.
I do not find any of those flags very useful without the other, but
documenting them separately was easier, because of the fsync/fdatasync
reference.  In the end, we are trying to solve a social engineering
problem, so this is the least confusing way I could think of to describe
the new API.

First implementation of AT_ATOMIC_METADATA is expected to be
noop for xfs/ext4 and probably fsync for btrfs.

First implementation of AT_ATOMIC_DATA is expected to be
filemap_write_and_wait() for xfs/ext4 and probably fdatasync for btrfs.
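
To make the intended usage concrete, here is a rough sketch of the
"write a tmpfile, then link it into place" pattern with the proposed
flags (the flags do not exist in any released kernel and the helper
below is only illustrative):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch only: AT_ATOMIC_DATA/AT_ATOMIC_METADATA are the proposed flags. */
static int write_file_atomic(int dirfd, const char *name,
			     const void *buf, size_t len)
{
	char proc_path[64];
	int fd = openat(dirfd, ".", O_TMPFILE | O_WRONLY, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len)
		goto fail;
	/* No fsync() here - not flushing volatile caches is the point. */
	snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
	if (linkat(AT_FDCWD, proc_path, dirfd, name,
		   AT_SYMLINK_FOLLOW | AT_ATOMIC_DATA | AT_ATOMIC_METADATA))
		goto fail;
	close(fd);
	return 0;
fail:
	close(fd);
	return -1;
}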

Thoughts?

Amir.

[1] https://lwn.net/Articles/789038/
[2] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@mail.gmail.com/

 man2/link.2 | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

Comments

Darrick J. Wong May 28, 2019, 8:06 p.m. UTC | #1
On Mon, May 27, 2019 at 08:26:55PM +0300, Amir Goldstein wrote:
> New link flags to request "atomic" link.
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
> 
> Hi Guys,
> 
> Following our discussions on LSF/MM and beyond [1][2], here is
> an RFC documentation patch.
> 
> Ted, I know we discussed limiting the API for linking an O_TMPFILE
> to avert the hardlinks issue, but I decided it would be better to
> document the hardlinks non-guaranty instead. This will allow me to
> replicate the same semantics and documentation to renameat(2).
> Let me know how that works out for you.
> 
> I also decided to try out two separate flags for data and metadata.
> I do not find any of those flags very useful without the other, but
> documenting them seprately was easier, because of the fsync/fdatasync
> reference.  In the end, we are trying to solve a social engineering
> problem, so this is the least confusing way I could think of to describe
> the new API.
> 
> First implementation of AT_ATOMIC_METADATA is expected to be
> noop for xfs/ext4 and probably fsync for btrfs.
> 
> First implementation of AT_ATOMIC_DATA is expected to be
> filemap_write_and_wait() for xfs/ext4 and probably fdatasync for btrfs.
> 
> Thoughts?
> 
> Amir.
> 
> [1] https://lwn.net/Articles/789038/
> [2] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@mail.gmail.com/
> 
>  man2/link.2 | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/man2/link.2 b/man2/link.2
> index 649ba00c7..15c24703e 100644
> --- a/man2/link.2
> +++ b/man2/link.2
> @@ -184,6 +184,57 @@ See
>  .BR openat (2)
>  for an explanation of the need for
>  .BR linkat ().
> +.TP
> +.BR AT_ATOMIC_METADATA " (since Linux 5.x)"
> +By default, a link operation followed by a system crash, may result in the
> +new file name being linked with old inode metadata, such as out dated time
> +stamps or missing extended attributes.
> +One way to prevent this is to call
> +.BR fsync (2)
> +before linking the inode, but that involves flushing of volatile disk caches.
> +
> +A filesystem that accepts this flag will guaranty, that old inode metadata
> +will not be exposed in the new linked name.
> +Some filesystems may internally perform
> +.BR fsync (2)
> +before linking the inode to provide this guaranty,
> +but often, filesystems will have a more efficient method to provide this
> +guaranty without flushing volatile disk caches.
> +
> +A filesystem that accepts this flag does
> +.BR NOT
> +guaranty that the new file name will exist after a system crash, nor that the
> +current inode metadata is persisted to disk.

Hmmm.  I think it would be much clearer to state the two expectations in
the same place, e.g.:

"A filesystem that accepts this flag guarantees that after a successful
call completion, the filesystem will return either (a) the version of
the metadata that was on disk at the time the call completed; (b) a
newer version of that metadata; or (c) -ENOENT.  In other words, a
subsequent access of the file path will never return metadata that was
obsolete at the time that the call completed, even if the system crashes
immediately afterwards."

Did I get that right?  I /think/ this means that one could implement Ye
Olde Write And Rename as:

fd = open(".", O_TMPFILE...);
write(fd);
fsync(fd);
snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, path, AT_FDCWD, "file.txt",
	AT_EMPTY_PATH | AT_ATOMIC_DATA | AT_ATOMIC_METADATA);

(Still struggling to figure out what userspace programs would use this
for...)

--D

> +Specifically, if a file has hardlinks, the existance of the linked name after
> +a system crash does
> +.BR NOT
> +guaranty that any of the other file names exist, nor that the last observed
> +value of
> +.I st_nlink
> +(see
> +.BR stat (2))
> +has persisted.
> +.TP
> +.BR AT_ATOMIC_DATA " (since Linux 5.x)"
> +By default, a link operation followed by a system crash, may result in the
> +new file name being linked with old data or missing data.
> +One way to prevent this is to call
> +.BR fdatasync (2)
> +before linking the inode, but that involves flushing of volatile disk caches.
> +
> +A filesystem that accepts this flag will guaranty, that old data
> +will not be exposed in the new linked name.
> +Some filesystems may internally perform
> +.BR fsync (2)
> +before linking the inode to provide this guaranty,
> +but often, filesystems will have a more efficient method to provide this
> +guaranty without flushing volatile disk caches.
> +
> +A filesystem that accepts this flag does
> +.BR NOT
> +guaranty that the new file name will exist after a system crash, nor that the
> +current inode data is persisted to disk.
> +.TP
>  .SH RETURN VALUE
>  On success, zero is returned.
>  On error, \-1 is returned, and
> -- 
> 2.17.1
>
Theodore Ts'o May 28, 2019, 8:26 p.m. UTC | #2
On Mon, May 27, 2019 at 08:26:55PM +0300, Amir Goldstein wrote:
> 
> Following our discussions on LSF/MM and beyond [1][2], here is
> an RFC documentation patch.
> 
> Ted, I know we discussed limiting the API for linking an O_TMPFILE
> to avert the hardlinks issue, but I decided it would be better to
> document the hardlinks non-guaranty instead. This will allow me to
> replicate the same semantics and documentation to renameat(2).
> Let me know how that works out for you.
> 
> I also decided to try out two separate flags for data and metadata.
> I do not find any of those flags very useful without the other, but
> documenting them seprately was easier, because of the fsync/fdatasync
> reference.  In the end, we are trying to solve a social engineering
> problem, so this is the least confusing way I could think of to describe
> the new API.

The way you have stated things is very confusing, and prone to be
misunderstood.  I think it would be helpful to state things in the
positive, instead of the negative.

Let's review what you had wanted:

	*If* the filename is visible in the directory after the crash,
	*then* all of the metadata/data that had been written to the file
	before the linkat(2) would be visible.

	HOWEVER, you did not want to necessarily force an fsync(2) in
	order to provide that guarantee.  That is, the filename would
	not necessarily be guaranteed to be visible after a crash when
	linkat(2) returns, but if the existence of the filename is
	persisted, then the data would be too.

	Also, at least initially we talked about this only making
	sense for O_TMPFILE file descriptors.  I believe you were
	trying to generalize things so it wouldn't necessarily have to
	be a file created using O_TMPFILE.  Is that correct?

So instead of saying "A filesystem that accepts this flag will
guaranty, that old inode data will not be exposed in the new linked
name."  It's much clearer to state this in the affirmative:

	A filesystem which accepts this flag will guarantee that if
	the new pathname exists after a crash, all of the data written
	to the file at the time of the linkat(2) call will be visible.

I would think it's much simpler to say what *will* happen, instead of
what will not be visible.  (After all, technically speaking, returning
all zeros or random garbage data fulfills the requirement "old data
will not be exposed", but that's probably not what you had in mind.  :-)

Also please note that it's spelled "guarantee".

Cheers,

						- Ted
Amir Goldstein May 29, 2019, 5:38 a.m. UTC | #3
On Tue, May 28, 2019 at 11:27 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Mon, May 27, 2019 at 08:26:55PM +0300, Amir Goldstein wrote:
> >
> > Following our discussions on LSF/MM and beyond [1][2], here is
> > an RFC documentation patch.
> >
> > Ted, I know we discussed limiting the API for linking an O_TMPFILE
> > to avert the hardlinks issue, but I decided it would be better to
> > document the hardlinks non-guaranty instead. This will allow me to
> > replicate the same semantics and documentation to renameat(2).
> > Let me know how that works out for you.
> >
> > I also decided to try out two separate flags for data and metadata.
> > I do not find any of those flags very useful without the other, but
> > documenting them seprately was easier, because of the fsync/fdatasync
> > reference.  In the end, we are trying to solve a social engineering
> > problem, so this is the least confusing way I could think of to describe
> > the new API.
>
> The way you have stated thigs is very confusing, and prone to be
> misunderstood.  I think it would be helpful to state things in the
> positive, instead of the negative.
>
> Let's review what you had wanted:
>
>         *If* the filename is visible in the directory after the crash,
>         *then* all of the metadata/data that had been written to the file
>         before the linkat(2) would be visible.
>
>         HOWEVER, you did not want to necessarily force an fsync(2) in
>         order to provide that guarantee.  That is, the filename would
>         not necessarily be guaranteed to be visible after a crash when
>         linkat(2) returns, but if the existence of the filename is
>         persisted, then the data would be too.
>
>         Also, at least initially we talked about this only making
>         sense for O_TMPFILE file desacriptors.  I believe you were
>         trying to generalize things so it wouldn't necessarily have to
>         be a file created using O_TMPFILE.  Is that correct?

That is correct. I felt we were limiting ourselves only to avert the
hardlinks issue, so I decided it's better to explain that "nlink is not
part of the inode metadata" that this guarantee refers to.
I would be happy to get your feedback about the hardlink disclaimer
since you brought up the concern. Is the disclaimer enough?
Not needed at all?

>
> So instead of saying "A filesystem that accepts this flag will
> guaranty, that old inode data will not be exposed in the new linked
> name."  It's much clearer to state this in the affirmative:
>
>         A filesystem which accepts this flag will guarantee that if
>         the new pathname exists after a crash, all of the data written
>         to the file at the time of the linkat(2) call will be visible.
>

Sounds good to me. I will take a swing at another patch.

Thanks for the review!
Amir.
Amir Goldstein May 29, 2019, 5:58 a.m. UTC | #4
On Tue, May 28, 2019 at 11:07 PM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
>
> On Mon, May 27, 2019 at 08:26:55PM +0300, Amir Goldstein wrote:
> > New link flags to request "atomic" link.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >
> > Hi Guys,
> >
> > Following our discussions on LSF/MM and beyond [1][2], here is
> > an RFC documentation patch.
> >
> > Ted, I know we discussed limiting the API for linking an O_TMPFILE
> > to avert the hardlinks issue, but I decided it would be better to
> > document the hardlinks non-guaranty instead. This will allow me to
> > replicate the same semantics and documentation to renameat(2).
> > Let me know how that works out for you.
> >
> > I also decided to try out two separate flags for data and metadata.
> > I do not find any of those flags very useful without the other, but
> > documenting them seprately was easier, because of the fsync/fdatasync
> > reference.  In the end, we are trying to solve a social engineering
> > problem, so this is the least confusing way I could think of to describe
> > the new API.
> >
> > First implementation of AT_ATOMIC_METADATA is expected to be
> > noop for xfs/ext4 and probably fsync for btrfs.
> >
> > First implementation of AT_ATOMIC_DATA is expected to be
> > filemap_write_and_wait() for xfs/ext4 and probably fdatasync for btrfs.
> >
> > Thoughts?
> >
> > Amir.
> >
> > [1] https://lwn.net/Articles/789038/
> > [2] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@mail.gmail.com/
> >
> >  man2/link.2 | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 51 insertions(+)
> >
> > diff --git a/man2/link.2 b/man2/link.2
> > index 649ba00c7..15c24703e 100644
> > --- a/man2/link.2
> > +++ b/man2/link.2
> > @@ -184,6 +184,57 @@ See
> >  .BR openat (2)
> >  for an explanation of the need for
> >  .BR linkat ().
> > +.TP
> > +.BR AT_ATOMIC_METADATA " (since Linux 5.x)"
> > +By default, a link operation followed by a system crash, may result in the
> > +new file name being linked with old inode metadata, such as out dated time
> > +stamps or missing extended attributes.
> > +One way to prevent this is to call
> > +.BR fsync (2)
> > +before linking the inode, but that involves flushing of volatile disk caches.
> > +
> > +A filesystem that accepts this flag will guaranty, that old inode metadata
> > +will not be exposed in the new linked name.
> > +Some filesystems may internally perform
> > +.BR fsync (2)
> > +before linking the inode to provide this guaranty,
> > +but often, filesystems will have a more efficient method to provide this
> > +guaranty without flushing volatile disk caches.
> > +
> > +A filesystem that accepts this flag does
> > +.BR NOT
> > +guaranty that the new file name will exist after a system crash, nor that the
> > +current inode metadata is persisted to disk.
>
> Hmmm.  I think it would be much clearer to state the two expectations in
> the same place, e.g.:
>
> "A filesystem that accepts this flag guarantees that after a successful
> call completion, the filesystem will return either (a) the version of
> the metadata that was on disk at the time the call completed; (b) a
> newer version of that metadata; or (c) -ENOENT.  In other words, a
> subsequent access of the file path will never return metadata that was
> obsolete at the time that the call completed, even if the system crashes
> immediately afterwards."

Your feedback is along the same line as Ted's feedback.
I will try out the phrasing that Ted proposed, will see how that goes...

>
> Did I get that right?  I /think/ this means that one could implement Ye
> Olde Write And Rename as:
>
> fd = open(".", O_TMPFILE...);
> write(fd);
> fsync(fd);

Certainly not fsync(); that is what my "counter-fsync()" phrasing was trying to
convey. The flags are meant as a "cheaper" replacement for that fsync().

> snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
> linkat(AT_FDCWD, path, AT_FDCWD, "file.txt",
>         AT_EMPTY_PATH | AT_ATOMIC_DATA | AT_ATOMIC_METADATA);
>
> (Still struggling to figure out what userspace programs would use this
> for...)
>

There are generally two use cases:

1. Flushing volatile disk caches is very expensive.
In this case, sync_file_range(SYNC_FILE_RANGE_WRITE_AND_WAIT)
is cheaper than fdatasync() for a preallocated file, and obviously a noop
for AT_ATOMIC_METADATA on an "S.O.M.C" filesystem is cheaper than
a journal commit (see the sketch below).

2. Highly concurrent metadata workloads
Many users running AT_ATOMIC_METADATA concurrently are much
less likely to interfere with each other than many users running fsync concurrently.

IOW, when I'm using fsync() to provide the AT_ATOMIC_METADATA guarantee
on a single-journal fs, other users pay the penalty for a guarantee that I didn't
ask for (i.e. persistence).
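
To illustrate use case 1, this is roughly the pattern I have in mind
today on xfs/ext4 (a sketch only - it relies on filesystem-specific
crash ordering behavior, which is exactly the thing I am trying to
turn into an explicit API):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/*
 * Sketch: write back the file's data without forcing a journal commit
 * or a disk cache flush, then link the file into place.  The crash
 * ordering this relies on is filesystem specific and undocumented.
 */
static int link_after_writeback(int fd, int dirfd, const char *name)
{
	char proc_path[64];

	if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
				      SYNC_FILE_RANGE_WRITE |
				      SYNC_FILE_RANGE_WAIT_AFTER))
		return -1;

	snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
	return linkat(AT_FDCWD, proc_path, dirfd, name, AT_SYMLINK_FOLLOW);
}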

Thanks,
Amir.
Amir Goldstein May 31, 2019, 3:21 p.m. UTC | #5
> >
> > So instead of saying "A filesystem that accepts this flag will
> > guaranty, that old inode data will not be exposed in the new linked
> > name."  It's much clearer to state this in the affirmative:
> >
> >         A filesystem which accepts this flag will guarantee that if
> >         the new pathname exists after a crash, all of the data written
> >         to the file at the time of the linkat(2) call will be visible.
> >
>
> Sounds good to me. I will take a swing at another patch.
>

So I am down to a single flag documented with 3 tweets ;-)

What do you think of:

"AT_ATOMIC_DATA (since Linux 5.x)
A filesystem which accepts this flag will guarantee that if the linked file
name exists after a system crash, then all of the data written to the file
and all of the file's metadata at the time of the linkat(2) call will be
visible.

The way to achieve this guarantee on old kernels is to call fsync (2)
before linking the file, but doing so will also result in flushing of
volatile disk caches.

A filesystem which accepts this flag does NOT
guarantee that any of the file hardlinks will exist after a system crash,
nor that the last observed value of st_nlink (see stat (2)) will persist."


Thanks,
Amir.
Theodore Ts'o May 31, 2019, 4:41 p.m. UTC | #6
On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> What do you think of:
> 
> "AT_ATOMIC_DATA (since Linux 5.x)
> A filesystem which accepts this flag will guarantee that if the linked file
> name exists after a system crash, then all of the data written to the file
> and all of the file's metadata at the time of the linkat(2) call will be
> visible.

".... will be visible after the the file system is remounted".  (Never
hurts to be explicit.)

> The way to achieve this guarantee on old kernels is to call fsync (2)
> before linking the file, but doing so will also results in flushing of
> volatile disk caches.
>
> A filesystem which accepts this flag does NOT
> guarantee that any of the file hardlinks will exist after a system crash,
> nor that the last observed value of st_nlink (see stat (2)) will persist."
> 

This is I think more precise:

    This guarantee can be achieved by calling fsync(2) before linking
    the file, but there may be more performant ways to provide these
    semantics.  In particular, note that the use of the AT_ATOMIC_DATA
    flag does *not* guarantee that the new link created by linkat(2)
    will be persisted after a crash.

We should also document that a file system which does not implement
this flag MUST return EINVAL when it is passed to linkat(2).

     	       	      	     	      - Ted
Amir Goldstein May 31, 2019, 5:22 p.m. UTC | #7
On Fri, May 31, 2019 at 7:41 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> > What do you think of:
> >
> > "AT_ATOMIC_DATA (since Linux 5.x)
> > A filesystem which accepts this flag will guarantee that if the linked file
> > name exists after a system crash, then all of the data written to the file
> > and all of the file's metadata at the time of the linkat(2) call will be
> > visible.
>
> ".... will be visible after the the file system is remounted".  (Never
> hurts to be explicit.)
>
> > The way to achieve this guarantee on old kernels is to call fsync (2)
> > before linking the file, but doing so will also results in flushing of
> > volatile disk caches.
> >
> > A filesystem which accepts this flag does NOT
> > guarantee that any of the file hardlinks will exist after a system crash,
> > nor that the last observed value of st_nlink (see stat (2)) will persist."
> >
>
> This is I think more precise:
>
>     This guarantee can be achieved by calling fsync(2) before linking
>     the file, but there may be more performant ways to provide these
>     semantics.  In particular, note that the use of the AT_ATOMIC_DATA
>     flag does *not* guarantee that the new link created by linkat(2)
>     will be persisted after a crash.

OK. Just to be clear, mentioning hardlinks and st_nlink is not needed
in your opinion?

>
> We should also document that a file system which does not implement
> this flag MUST return EINVAL if it is passed this flag to linkat(2).
>

OK. I think this part can be documented as possible reason for EINVAL
As in renameat(2) man page:
       EINVAL The filesystem does not support one of the flags in flags.
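
Applications would then carry a fallback along these lines (a sketch;
the flag name is still only the proposed one):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch: use the proposed flag if supported, else fall back to fsync(). */
static int link_atomic_data(int fd, const char *proc_path,
			    int dirfd, const char *name)
{
	if (!linkat(AT_FDCWD, proc_path, dirfd, name,
		    AT_SYMLINK_FOLLOW | AT_ATOMIC_DATA))
		return 0;
	if (errno != EINVAL)
		return -1;
	/* Flag not supported by this kernel/filesystem: pay for the flush. */
	if (fsync(fd))
		return -1;
	return linkat(AT_FDCWD, proc_path, dirfd, name, AT_SYMLINK_FOLLOW);
}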

Thanks,
Amir.
Theodore Ts'o May 31, 2019, 7:21 p.m. UTC | #8
On Fri, May 31, 2019 at 08:22:06PM +0300, Amir Goldstein wrote:
> >
> > This is I think more precise:
> >
> >     This guarantee can be achieved by calling fsync(2) before linking
> >     the file, but there may be more performant ways to provide these
> >     semantics.  In particular, note that the use of the AT_ATOMIC_DATA
> >     flag does *not* guarantee that the new link created by linkat(2)
> >     will be persisted after a crash.
> 
> OK. Just to be clear, mentioning hardlinks and st_link is not needed
> in your opinion?

Your previous text stated that it was undefined what would happen to
all hardlinks belonging to the file, and that would imply that if a
file had N hard links, some in the directory which we are modifying,
and some in other directories, that somehow any of them might not be
present after the crash.  And that's not the case.  Suppose the file
currently has hardlinks test1/foo, test1/quux, and test2/baz --- and
we've called syncfs(2) on the file system so everything is persisted,
and then linkat(2) is used to create a new hardlink, test1/bar.

After a crash, the existence of test1/foo, test1/quux, and test2/baz
are not in question.  It's only unclear whether or not test1/bar
exists after the crash.

As far as st_nlink is concerned, the presumption is that the file
system itself will be consistent after the crash.  So if the hard link
has been persisted, then st_nlink will be incremented, if it has not,
it won't be.

Finally, one thing that makes it hard to state these sorts of things
as guarantees: sometimes, the file system won't *know*
whether or not it can make these guarantees.  For example, what should
we do if the file system is mounted with nobarrier?  If the overall
hardware design includes UPS's or some other kind of battery backup,
the guarantee may very well exist.  But the file system code can't
know whether or not that is the case.  So my inclination is to allow
the file system to accept the flag even if the mount option nobarrier
is in play --- but in that case, the guarantee is only if the rest of
the system is designed appropriately.

(For that matter, it used to be that there existed hard drives that
lied about whether they had a writeback cache, and/or made the CACHE
FLUSH a no-op so they could win the Winbench benchmarketing wars,
which was worth millions and millions of dollars in sales.  So we can
only assume that the hardware isn't lying to us when we use words like
"guarantee".)

						- Ted
Dave Chinner May 31, 2019, 10:45 p.m. UTC | #9
On Fri, May 31, 2019 at 12:41:36PM -0400, Theodore Ts'o wrote:
> On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> > What do you think of:
> > 
> > "AT_ATOMIC_DATA (since Linux 5.x)
> > A filesystem which accepts this flag will guarantee that if the linked file
> > name exists after a system crash, then all of the data written to the file
> > and all of the file's metadata at the time of the linkat(2) call will be
> > visible.
> 
> ".... will be visible after the the file system is remounted".  (Never
> hurts to be explicit.)
> 
> > The way to achieve this guarantee on old kernels is to call fsync (2)
> > before linking the file, but doing so will also results in flushing of
> > volatile disk caches.
> >
> > A filesystem which accepts this flag does NOT
> > guarantee that any of the file hardlinks will exist after a system crash,
> > nor that the last observed value of st_nlink (see stat (2)) will persist."
> > 
> 
> This is I think more precise:
> 
>     This guarantee can be achieved by calling fsync(2) before linking
>     the file, but there may be more performant ways to provide these
>     semantics.  In particular, note that the use of the AT_ATOMIC_DATA
>     flag does *not* guarantee that the new link created by linkat(2)
>     will be persisted after a crash.

So here's the *implementation* problem I see with this definition of
AT_ATOMIC_DATA. After linkat(dirfd, name, AT_ATOMIC_DATA), there is
no guarantee that the data is on disk or that the link is present.

However:

	linkat(dirfd, name, AT_ATOMIC_DATA);
	fsync(dirfd);

Suddenly changes all that.

i.e. when we fsync(dirfd) we guarantee that "name" is present in the
directory and because we used AT_ATOMIC_DATA it implies that the
data pointed to by "name" must be present on disk. IOWs, what was
once a pure directory sync operation now *must* fsync all the child
inodes that have been linkat(AT_ATOMIC_DATA) since the last time the
directory has been made stable.

IOWs, the described AT_ATOMIC_DATA "we don't have to write the data
during linkat() go-fast-get-out-of-gaol-free" behaviour isn't worth
the pixels it is written on - it just moves all the complexity to
directory fsync, and that's /already/ a behavioural minefield.

IMO, the "work-around" of forcing filesystems to write back
destination inodes during a link() operation is just nasty and will
just end up with really weird performance anomalies occurring in
production systems. That's not really a solution, either, especially
as it is far, far faster for applications to use AIO_FSYNC and then
on the completion callback run a normal linkat() operation...

Hence, if I've understood these correctly, then I'll be recommending
that XFS follows this:

> We should also document that a file system which does not implement
> this flag MUST return EINVAL if it is passed this flag to linkat(2).

and returns -EINVAL to these flags because we do not have the change
tracking infrastructure to handle these directory fsync semantics.
I also suspect that, even if we could track this efficiently, we
can't do the flushing atomically because of locking order
constraints between directories, regular files, pages in the page
cache, etc.

Given that we can already use AIO to provide this sort of ordering,
and AIO is vastly faster than synchronous IO, I don't see any point
in adding complex barrier interfaces that can be /easily implemented
in userspace/ using existing AIO primitives. You should start
thinking about expanding libaio with stuff like
"link_after_fdatasync()" and suddenly the whole problem of
filesystem data vs metadata ordering goes away because the
application directly controls all ordering without blocking and
doesn't need to care what the filesystem under it does....

Cheers,

Dave.
Dave Chinner May 31, 2019, 11:28 p.m. UTC | #10
On Sat, Jun 01, 2019 at 08:45:49AM +1000, Dave Chinner wrote:
> Given that we can already use AIO to provide this sort of ordering,
> and AIO is vastly faster than synchronous IO, I don't see any point
> in adding complex barrier interfaces that can be /easily implemented
> in userspace/ using existing AIO primitives. You should start
> thinking about expanding libaio with stuff like
> "link_after_fdatasync()" and suddenly the whole problem of
> filesystem data vs metadata ordering goes away because the
> application directly controls all ordering without blocking and
> doesn't need to care what the filesystem under it does....

And let me point out that this is also how userspace can do an
efficient atomic rename - rename_after_fdatasync(). i.e. on
completion of the AIO_FSYNC, run the rename. This guarantees that
the application will see either the old file or the complete new
file, and it *doesn't have to wait for the operation to complete*.
Once it is in flight, the file will contain the old data until some
point in the near future when it will contain the new data....
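
In sketch form with libaio (assuming the io_context has been set up
with io_setup(), the new data has already been written to
"file.txt.tmp", and tmpfd is open on it - "rename_after_fdatasync()"
is the pattern, not an existing API):

	/* needs <libaio.h> and linking with -laio */
	struct iocb ioc, *iocp = &ioc;
	struct io_event ev;

	/* queue the data flush for the already-written temp file */
	io_prep_fdsync(&ioc, tmpfd);
	ioc.data = "file.txt.tmp";	/* what to rename on completion */
	io_submit(ctx, 1, &iocp);

	/* ... the application keeps doing useful work, then reaps the
	 * completion from its event loop and runs the rename as the
	 * "callback" ... */
	io_getevents(ctx, 1, 1, &ev, NULL);
	rename((char *)ev.data, "file.txt");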

Seriously, sit down and work out all the "atomic" data vs metadata
behaviours you want, and then tell me how many of them cannot be
implemented as "AIO_FSYNC w/ completion callback function" in
userspace. This mechanism /guarantees ordering/ at the application
level, the application does not block waiting for these data
integrity operations to complete, and you don't need any new kernel
side functionality to implement this.

Fundamentally, the assertion that disk cache flushes are not what
causes fsync "to be slow" is incorrect. It's the synchronous
"waiting for IO completion" that makes fsync "slow". AIO_FSYNC
avoids needing to wait for IO completion, allowing the application
to do useful work (like issue more DI ops) while data integrity
operations are in flight. At this point, fsync is no longer a "slow"
operation - it's just another background async data flush operation
like the BDI flusher thread...

Cheers,

Dave.
Amir Goldstein June 1, 2019, 7:21 a.m. UTC | #11
On Sat, Jun 1, 2019 at 1:46 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, May 31, 2019 at 12:41:36PM -0400, Theodore Ts'o wrote:
> > On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> > > What do you think of:
> > >
> > > "AT_ATOMIC_DATA (since Linux 5.x)
> > > A filesystem which accepts this flag will guarantee that if the linked file
> > > name exists after a system crash, then all of the data written to the file
> > > and all of the file's metadata at the time of the linkat(2) call will be
> > > visible.
> >
> > ".... will be visible after the the file system is remounted".  (Never
> > hurts to be explicit.)
> >
> > > The way to achieve this guarantee on old kernels is to call fsync (2)
> > > before linking the file, but doing so will also results in flushing of
> > > volatile disk caches.
> > >
> > > A filesystem which accepts this flag does NOT
> > > guarantee that any of the file hardlinks will exist after a system crash,
> > > nor that the last observed value of st_nlink (see stat (2)) will persist."
> > >
> >
> > This is I think more precise:
> >
> >     This guarantee can be achieved by calling fsync(2) before linking
> >     the file, but there may be more performant ways to provide these
> >     semantics.  In particular, note that the use of the AT_ATOMIC_DATA
> >     flag does *not* guarantee that the new link created by linkat(2)
> >     will be persisted after a crash.
>
> So here's the *implementation* problem I see with this definition of
> AT_ATOMIC_DATA. After linkat(dirfd, name, AT_ATOMIC_DATA), there is
> no guarantee that the data is on disk or that the link is present.
>
> However:
>
>         linkat(dirfd, name, AT_ATOMIC_DATA);
>         fsync(dirfd);
>
> Suddenly changes all that.
>
> i.e. when we fsync(dirfd) we guarantee that "name" is present in the
> directory and because we used AT_ATOMIC_DATA it implies that the
> data pointed to by "name" must be present on disk. IOWs, what was
> once a pure directory sync operation now *must* fsync all the child
> inodes that have been linkat(AT_ATOMIC_DATA) since the last time the
> direct has been made stable.
>
> IOWs, the described AT_ATOMIC_DATA "we don't have to write the data
> during linkat() go-fast-get-out-of-gaol-free" behaviour isn't worth
> the pixels it is written on - it just moves all the complexity to
> directory fsync, and that's /already/ a behavioural minefield.

Where does it say we don't have to write the data during linkat()?
I was only talking about avoid FLUSH/FUA caused by fsync().
I wrote in commit message:
"First implementation of AT_ATOMIC_DATA is expected to be
filemap_write_and_wait() for xfs/ext4 and probably fdatasync for btrfs."

I failed to convey the reasoning for this flag to you.
It is *not* about the latency of the "atomic link" for the calling thread.
It is about not interfering with other workloads running at the same time.

>
> IMO, the "work-around" of forcing filesystems to write back
> destination inodes during a link() operation is just nasty and will
> just end up with really weird performance anomalies occurring in
> production systems. That's not really a solution, either, especially
> as it is far, far faster for applications to use AIO_FSYNC and then
> on the completion callback run a normal linkat() operation...
>
> Hence, if I've understood these correctly, then I'll be recommending
> that XFS follows this:
>
> > We should also document that a file system which does not implement
> > this flag MUST return EINVAL if it is passed this flag to linkat(2).
>
> and returns -EINVAL to these flags because we do not have the change
> tracking infrastructure to handle these directory fsync semantics.
> I also suspect that, even if we could track this efficiently, we
> can't do the flushing atomically because of locking order
> constraints between directories, regular files, pages in the page
> cache, etc.

That is not at all what I had in mind for XFS with the flag.

>
> Given that we can already use AIO to provide this sort of ordering,
> and AIO is vastly faster than synchronous IO, I don't see any point
> in adding complex barrier interfaces that can be /easily implemented
> in userspace/ using existing AIO primitives. You should start
> thinking about expanding libaio with stuff like
> "link_after_fdatasync()" and suddenly the whole problem of
> filesystem data vs metadata ordering goes away because the
> application directly controls all ordering without blocking and
> doesn't need to care what the filesystem under it does....
>

OK. I can work with that. It's not that simple, but I will reply to
your next email, where you wrote more about this alternative.

Thanks,
Amir.
Amir Goldstein June 1, 2019, 8:01 a.m. UTC | #12
On Sat, Jun 1, 2019 at 2:28 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sat, Jun 01, 2019 at 08:45:49AM +1000, Dave Chinner wrote:
> > Given that we can already use AIO to provide this sort of ordering,
> > and AIO is vastly faster than synchronous IO, I don't see any point
> > in adding complex barrier interfaces that can be /easily implemented
> > in userspace/ using existing AIO primitives. You should start
> > thinking about expanding libaio with stuff like
> > "link_after_fdatasync()" and suddenly the whole problem of
> > filesystem data vs metadata ordering goes away because the
> > application directly controls all ordering without blocking and
> > doesn't need to care what the filesystem under it does....
>
> And let me point out that this is also how userspace can do an
> efficient atomic rename - rename_after_fdatasync(). i.e. on
> completion of the AIO_FSYNC, run the rename. This guarantees that
> the application will see either the old file of the complete new
> file, and it *doesn't have to wait for the operation to complete*.
> Once it is in flight, the file will contain the old data until some
> point in the near future when will it contain the new data....

What I am looking for is a way to isolate the effects of "atomic rename/link"
from the rest of the users. Sure, there is I/O bandwidth and queued
bios, but at least it would isolate other threads working on other files or metadata
from contending with the "atomic rename" thread over journal flushes and
the like. Actually, one of my use cases is "atomic rename" of files with
no data (looking for atomicity w.r.t xattr and mtime), so this "atomic rename"
thread should not be interfering with other workloads at all.

>
> Seriously, sit down and work out all the "atomic" data vs metadata
> behaviours you want, and then tell me how many of them cannot be
> implemented as "AIO_FSYNC w/ completion callback function" in
> userspace. This mechanism /guarantees ordering/ at the application
> level, the application does not block waiting for these data
> integrity operations to complete, and you don't need any new kernel
> side functionality to implement this.

So I think what I could have used is AIO_BATCH_FSYNC, an interface
that was proposed by Ric Wheeler and discussed on LSF:
https://lwn.net/Articles/789024/
Ric was looking for a way to efficiently fsync a "bunch of files".
Submitting several AIO_FSYNC calls is not the efficient way of doing that.
So it is either a new AIO_BATCH_FSYNC and a kernel implementation
that flushes the inodes and then calls ->sync_fs(), or a new AIO operation
that just does the ->sync_fs() bit, with sync_file_range() used for the inodes.

To be more accurate, the AIO operation that would emulate my
proposed API more closely is AIO_WAIT_FOR_SYNCFS, as I do not wish
to impose excessive journal flushes; I just need a completion callback
when they happen, to perform the rename/link.

>
> Fundamentally, the assertion that disk cache flushes are not what
> causes fsync "to be slow" is incorrect. It's the synchronous

Too many double negatives. I am not sure I parsed this correctly.
But I think by now you understand that I don't care that fsync is "slow".
I care about frequent fsyncs making the entire system slow down.

Heck, xfs even has a mitigation in place to improve performance
of too frequent fsyncs, but that mitigation is partly gone since
commit 47c7d0b19502 ("xfs: fix incorrect log_flushed on fsync").

The situation with frequent fsync on ext4 at the moment is probably
worse.

I am trying to reduce the number of fsyncs from applications
and converting fsync to AIO_FSYNC is not going to help with that.

Thanks,
Amir.
Dave Chinner June 3, 2019, 4:25 a.m. UTC | #13
On Sat, Jun 01, 2019 at 11:01:42AM +0300, Amir Goldstein wrote:
> On Sat, Jun 1, 2019 at 2:28 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sat, Jun 01, 2019 at 08:45:49AM +1000, Dave Chinner wrote:
> > > Given that we can already use AIO to provide this sort of ordering,
> > > and AIO is vastly faster than synchronous IO, I don't see any point
> > > in adding complex barrier interfaces that can be /easily implemented
> > > in userspace/ using existing AIO primitives. You should start
> > > thinking about expanding libaio with stuff like
> > > "link_after_fdatasync()" and suddenly the whole problem of
> > > filesystem data vs metadata ordering goes away because the
> > > application directly controls all ordering without blocking and
> > > doesn't need to care what the filesystem under it does....
> >
> > And let me point out that this is also how userspace can do an
> > efficient atomic rename - rename_after_fdatasync(). i.e. on
> > completion of the AIO_FSYNC, run the rename. This guarantees that
> > the application will see either the old file of the complete new
> > file, and it *doesn't have to wait for the operation to complete*.
> > Once it is in flight, the file will contain the old data until some
> > point in the near future when will it contain the new data....
> 
> What I am looking for is a way to isolate the effects of "atomic rename/link"
> from the rest of the users.  Sure there is I/O bandwidth and queued
> bios, but at least isolate other threads working on other files or metadata
> from contending with the "atomic rename" thread of journal flushes and
> the like.

That's not a function of the kernel API. That's a function of the
implementation behind the kernel API. i.e. The API requires data to
be written before the rename/link is committed, how that is achieved
is up to the filesystem. And some filesystems will not be able to
isolate the API behavioural requirement from other users....

> Actually, one of my use cases is "atomic rename" of files with
> no data (looking for atomicity w.r.t xattr and mtime), so this "atomic rename"
> thread should not be interfering with other workloads at all.

Which should already be guaranteed because a) rename is supposed to be
atomic, and b) metadata ordering requirements in journalled
filesystems. If they lose xattrs across rename, there's something
seriously wrong with the filesystem implementation.  I'm really not
sure what you think filesystems are actually doing with metadata
across rename operations....

> > Seriously, sit down and work out all the "atomic" data vs metadata
> > behaviours you want, and then tell me how many of them cannot be
> > implemented as "AIO_FSYNC w/ completion callback function" in
> > userspace. This mechanism /guarantees ordering/ at the application
> > level, the application does not block waiting for these data
> > integrity operations to complete, and you don't need any new kernel
> > side functionality to implement this.
> 
> So I think what I could have used is AIO_BATCH_FSYNC, an interface
> that was proposed by Ric Wheeler and discussed on LSF:
> https://lwn.net/Articles/789024/
> Ric was looking for a way to efficiently fsync a "bunch of files".
> Submitting several AIO_FSYNC calls is not the efficient way of doing that.

/me sighs.

That's not what I just suggested, and I've already addressed this
"AIO_FSYNC sucks" FUD in multiple separate threads.  You do realise
you can submit multiple AIO operations with a single io_submit()
call, right?

	struct iocb	ioc[10];
	struct iocb	*iocp[10];
	struct io_event	ev[10];

	for (i = 0; i < 10; i++) {
		io_prep_fsync(&ioc[i], fd[i]);
		ioc[i].data = callback_arg[i];
		iocp[i] = &ioc[i];
	}

	io_submit(aio_ctx, 10, iocp);
	io_getevents(aio_ctx, 10, 10, ev, NULL);

	for (i = 0; i < 10; i++)
		post_fsync_callback(&ev[i]);


There's your single syscall AIO_BATCH_FSYNC functionality, and it
implements a per-fd post-fsync callback function. This isn't rocket
science....

[snip]

> I am trying to reduce the number of fsyncs from applications
> and converting fsync to AIO_FSYNC is not going to help with that.

Your whole argument is "fsync is inefficient because cache flushes,
therefore AIO_FSYNC must be inefficient." IOWs, you've already
decided what is wrong, how it can and can't be fixed and the
solution you want regardless of whether your assertions are correct
or not. You haven't provided any evidence that a new kernel API is
the only viable solution, nor that the existing ones cannot provide
the functionality you require.

So, in the interests of /informed debate/, please implement what you
want using batched AIO_FSYNC + rename/linkat completion callback and
measure what it achieves. Then implement a sync_file_range/linkat
thread pool that provides the same functionality to the application
(i.e. writeback concurrency in userspace) and measure it. Then we
can discuss what the relative overhead is with numbers and can
perform analysis to determine what the cause of the performance
differential actually is.

Neither of these things require kernel modifications, but you need
to provide the evidence that existing APIs are insufficient.
Indeed, we now have the new async ioring stuff that can run async
sync_file_range calls, so you probably need to benchmark replacing
AIO_FSYNC with that interface as well. This new API likely does
exactly what you want without the journal/device cache flush
overhead of AIO_FSYNC....

Cheers,

Dave.
Amir Goldstein June 3, 2019, 6:17 a.m. UTC | #14
> > Actually, one of my use cases is "atomic rename" of files with
> > no data (looking for atomicity w.r.t xattr and mtime), so this "atomic rename"
> > thread should not be interfering with other workloads at all.
>
> Which should already guaranteed because a) rename is supposed to be
> atomic, and b) metadata ordering requirements in journalled
> filesystems. If they lose xattrs across rename, there's something
> seriously wrong with the filesystem implementation.  I'm really not
> sure what you think filesystems are actually doing with metadata
> across rename operations....
>

Dave,

We are going in circles so much that my head is spinning.
I don't blame anyone for having a hard time keeping up with the plot, because
it spans many threads and subjects, so let me reiterate:

- I *do* know that rename provides me the needed "metadata barrier"
  w.r.t. xattr on xfs/ext4 today.
- I *do* know the sync_file_range()+rename() callback provides the
  "data barrier" I need on xfs/ext4 today.
- I *do* use this internal fs knowledge in my applications
- I even fixed up sync_file_range() per your suggestion, so I won't need to use
  the FIEMAP_FLAG_SYNC hack
- An attempt by the CrashMonkey developers to document this behavior was
  "shot down" for many justified reasons
- Without any documentation nor explicit API with a clean guarantee, users
  cannot write efficient applications without being aware of the filesystem
  underneath and follow that filesystem development to make sure behavior
  has not changed
- The most recent proposal I made at LSF, based on Jan's suggestion, is
  to change nothing in the filesystem implementation, but use a new *explicit*
  verb to communicate the expectation of the application, so that filesystems
  are free to change behavior in the future in the absence of the new verb

Once again, ATOMIC_METADATA is a noop in present xfs/ext4.
ATOMIC_DATA is sync_file_range() in present xfs/ext4.
The APIs I *need* from the kernel *do* exist, but the filesystem developers
(except xfs) are not willing to document the guarantee that the existing
interfaces provide in the present.

[...]
> So, in the interests of /informed debate/, please implement what you
> want using batched AIO_FSYNC + rename/linkat completion callback and
> measure what it acheives. Then implement a sync_file_range/linkat
> thread pool that provides the same functionality to the application
> (i.e. writeback concurrency in userspace) and measure it. Then we
> can discuss what the relative overhead is with numbers and can
> perform analysis to determine what the cause of the performance
> differential actually is.
>

Fair enough.

> Neither of these things require kernel modifications, but you need
> to provide the evidence that existing APIs are insufficient.

APIs are sufficient if I know which filesystem I am running on.
btrfs needs a different set of syscalls to get the same thing done.

> Indeed, we now have the new async ioring stuff that can run async
> sync_file_range calls, so you probably need to benchmark replacing
> AIO_FSYNC with that interface as well. This new API likely does
> exactly what you want without the journal/device cache flush
> overhead of AIO_FSYNC....
>

Indeed, I am keeping a close watch on io_uring.

Thanks,
Amir.

Patch

diff --git a/man2/link.2 b/man2/link.2
index 649ba00c7..15c24703e 100644
--- a/man2/link.2
+++ b/man2/link.2
@@ -184,6 +184,57 @@  See
 .BR openat (2)
 for an explanation of the need for
 .BR linkat ().
+.TP
+.BR AT_ATOMIC_METADATA " (since Linux 5.x)"
+By default, a link operation followed by a system crash, may result in the
+new file name being linked with old inode metadata, such as out dated time
+stamps or missing extended attributes.
+One way to prevent this is to call
+.BR fsync (2)
+before linking the inode, but that involves flushing of volatile disk caches.
+
+A filesystem that accepts this flag will guaranty, that old inode metadata
+will not be exposed in the new linked name.
+Some filesystems may internally perform
+.BR fsync (2)
+before linking the inode to provide this guaranty,
+but often, filesystems will have a more efficient method to provide this
+guaranty without flushing volatile disk caches.
+
+A filesystem that accepts this flag does
+.BR NOT
+guaranty that the new file name will exist after a system crash, nor that the
+current inode metadata is persisted to disk.
+Specifically, if a file has hardlinks, the existance of the linked name after
+a system crash does
+.BR NOT
+guaranty that any of the other file names exist, nor that the last observed
+value of
+.I st_nlink
+(see
+.BR stat (2))
+has persisted.
+.TP
+.BR AT_ATOMIC_DATA " (since Linux 5.x)"
+By default, a link operation followed by a system crash, may result in the
+new file name being linked with old data or missing data.
+One way to prevent this is to call
+.BR fdatasync (2)
+before linking the inode, but that involves flushing of volatile disk caches.
+
+A filesystem that accepts this flag will guaranty, that old data
+will not be exposed in the new linked name.
+Some filesystems may internally perform
+.BR fsync (2)
+before linking the inode to provide this guaranty,
+but often, filesystems will have a more efficient method to provide this
+guaranty without flushing volatile disk caches.
+
+A filesystem that accepts this flag does
+.BR NOT
+guaranty that the new file name will exist after a system crash, nor that the
+current inode data is persisted to disk.
+.TP
 .SH RETURN VALUE
 On success, zero is returned.
 On error, \-1 is returned, and