Message ID | 20190527172655.9287-1-amir73il@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | [RFC] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA |
On Mon, May 27, 2019 at 08:26:55PM +0300, Amir Goldstein wrote:
> New link flags to request "atomic" link.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>
> Hi Guys,
>
> Following our discussions on LSF/MM and beyond [1][2], here is
> an RFC documentation patch.
>
> Ted, I know we discussed limiting the API for linking an O_TMPFILE
> to avert the hardlinks issue, but I decided it would be better to
> document the hardlinks non-guaranty instead. This will allow me to
> replicate the same semantics and documentation to renameat(2).
> Let me know how that works out for you.
>
> I also decided to try out two separate flags for data and metadata.
> I do not find any of those flags very useful without the other, but
> documenting them separately was easier, because of the fsync/fdatasync
> reference. In the end, we are trying to solve a social engineering
> problem, so this is the least confusing way I could think of to describe
> the new API.
>
> First implementation of AT_ATOMIC_METADATA is expected to be
> noop for xfs/ext4 and probably fsync for btrfs.
>
> First implementation of AT_ATOMIC_DATA is expected to be
> filemap_write_and_wait() for xfs/ext4 and probably fdatasync for btrfs.
>
> Thoughts?
>
> Amir.
>
> [1] https://lwn.net/Articles/789038/
> [2] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@mail.gmail.com/
>
>  man2/link.2 | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
>
> diff --git a/man2/link.2 b/man2/link.2
> index 649ba00c7..15c24703e 100644
> --- a/man2/link.2
> +++ b/man2/link.2
> @@ -184,6 +184,57 @@ See
>  .BR openat (2)
>  for an explanation of the need for
>  .BR linkat ().
> +.TP
> +.BR AT_ATOMIC_METADATA " (since Linux 5.x)"
> +By default, a link operation followed by a system crash, may result in the
> +new file name being linked with old inode metadata, such as out dated time
> +stamps or missing extended attributes.
> +One way to prevent this is to call
> +.BR fsync (2)
> +before linking the inode, but that involves flushing of volatile disk caches.
> +
> +A filesystem that accepts this flag will guaranty, that old inode metadata
> +will not be exposed in the new linked name.
> +Some filesystems may internally perform
> +.BR fsync (2)
> +before linking the inode to provide this guaranty,
> +but often, filesystems will have a more efficient method to provide this
> +guaranty without flushing volatile disk caches.
> +
> +A filesystem that accepts this flag does
> +.BR NOT
> +guaranty that the new file name will exist after a system crash, nor that the
> +current inode metadata is persisted to disk.

Hmmm.  I think it would be much clearer to state the two expectations in
the same place, e.g.:

"A filesystem that accepts this flag guarantees that after a successful
call completion, the filesystem will return either (a) the version of
the metadata that was on disk at the time the call completed; (b) a
newer version of that metadata; or (c) -ENOENT.  In other words, a
subsequent access of the file path will never return metadata that was
obsolete at the time that the call completed, even if the system
crashes immediately afterwards."

Did I get that right?  I /think/ this means that one could implement Ye
Olde Write And Rename as:

	fd = open(".", O_TMPFILE...);
	write(fd);
	fsync(fd);
	snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
	linkat(AT_FDCWD, path, AT_FDCWD, "file.txt",
			AT_EMPTY_PATH | AT_ATOMIC_DATA | AT_ATOMIC_METADATA);

(Still struggling to figure out what userspace programs would use this
for...)
--D

> +Specifically, if a file has hardlinks, the existance of the linked name after
> +a system crash does
> +.BR NOT
> +guaranty that any of the other file names exist, nor that the last observed
> +value of
> +.I st_nlink
> +(see
> +.BR stat (2))
> +has persisted.
> +.TP
> +.BR AT_ATOMIC_DATA " (since Linux 5.x)"
> +By default, a link operation followed by a system crash, may result in the
> +new file name being linked with old data or missing data.
> +One way to prevent this is to call
> +.BR fdatasync (2)
> +before linking the inode, but that involves flushing of volatile disk caches.
> +
> +A filesystem that accepts this flag will guaranty, that old data
> +will not be exposed in the new linked name.
> +Some filesystems may internally perform
> +.BR fsync (2)
> +before linking the inode to provide this guaranty,
> +but often, filesystems will have a more efficient method to provide this
> +guaranty without flushing volatile disk caches.
> +
> +A filesystem that accepts this flag does
> +.BR NOT
> +guaranty that the new file name will exist after a system crash, nor that the
> +current inode data is persisted to disk.
> +.TP
>  .SH RETURN VALUE
>  On success, zero is returned.
>  On error, \-1 is returned, and
> --
> 2.17.1
>
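[Editor's note: a minimal, self-contained sketch of the write-a-tempfile-then-link
pattern Darrick describes above. AT_ATOMIC_DATA and AT_ATOMIC_METADATA are only
proposed in this RFC and exist in no released kernel, so they are stubbed to 0
here and the sketch falls back to a plain fsync(2); the /proc/self/fd +
AT_SYMLINK_FOLLOW trick is the unprivileged way to give an O_TMPFILE inode a
name.]

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #ifndef AT_ATOMIC_DATA              /* proposed in this RFC only */
    #define AT_ATOMIC_DATA 0
    #endif
    #ifndef AT_ATOMIC_METADATA          /* proposed in this RFC only */
    #define AT_ATOMIC_METADATA 0
    #endif

    /* Write a new file and only then give it a name, so a crash never
     * exposes a half-written "name".  Without the proposed AT_ATOMIC_*
     * flags, the data must be made durable by an explicit fsync()
     * before the link, which is the cost this RFC is trying to avoid. */
    int write_file_then_link(int dirfd, const char *name,
                             const void *buf, size_t len)
    {
        char fdpath[64];
        int fd = openat(dirfd, ".", O_TMPFILE | O_WRONLY, 0600);

        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len)
            goto fail;
        if (!(AT_ATOMIC_DATA | AT_ATOMIC_METADATA) && fsync(fd) != 0)
            goto fail;

        /* Unprivileged variant of Darrick's sketch: AT_SYMLINK_FOLLOW on
         * the /proc/self/fd path avoids the CAP_DAC_READ_SEARCH that
         * linkat() with AT_EMPTY_PATH would require.  Fails with EEXIST
         * if "name" already exists; replacing a name needs renameat(2). */
        snprintf(fdpath, sizeof(fdpath), "/proc/self/fd/%d", fd);
        if (linkat(AT_FDCWD, fdpath, dirfd, name,
                   AT_SYMLINK_FOLLOW | AT_ATOMIC_DATA | AT_ATOMIC_METADATA))
            goto fail;
        close(fd);
        return 0;
    fail:
        close(fd);
        return -1;
    }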
On Mon, May 27, 2019 at 08:26:55PM +0300, Amir Goldstein wrote:
>
> Following our discussions on LSF/MM and beyond [1][2], here is
> an RFC documentation patch.
>
> Ted, I know we discussed limiting the API for linking an O_TMPFILE
> to avert the hardlinks issue, but I decided it would be better to
> document the hardlinks non-guaranty instead. This will allow me to
> replicate the same semantics and documentation to renameat(2).
> Let me know how that works out for you.
>
> I also decided to try out two separate flags for data and metadata.
> I do not find any of those flags very useful without the other, but
> documenting them separately was easier, because of the fsync/fdatasync
> reference. In the end, we are trying to solve a social engineering
> problem, so this is the least confusing way I could think of to describe
> the new API.

The way you have stated things is very confusing, and prone to be
misunderstood.  I think it would be helpful to state things in the
positive, instead of the negative.

Let's review what you had wanted:

    *If* the filename is visible in the directory after the crash,
    *then* all of the metadata/data that had been written to the file
    before the linkat(2) would be visible.

    HOWEVER, you did not want to necessarily force an fsync(2) in
    order to provide that guarantee.  That is, the filename would
    not necessarily be guaranteed to be visible after a crash when
    linkat(2) returns, but if the existence of the filename is
    persisted, then the data would be too.

    Also, at least initially we talked about this only making
    sense for O_TMPFILE file descriptors.  I believe you were
    trying to generalize things so it wouldn't necessarily have to
    be a file created using O_TMPFILE.  Is that correct?

So instead of saying "A filesystem that accepts this flag will
guaranty, that old inode data will not be exposed in the new linked
name," it's much clearer to state this in the affirmative:

    A filesystem which accepts this flag will guarantee that if
    the new pathname exists after a crash, all of the data written
    to the file at the time of the linkat(2) call will be visible.

I would think it's much simpler to say what *will* happen, instead of
what will not be visible.  (After all, technically speaking, returning
all zeros or random garbage data fulfills the requirement "old data
will not be exposed", but that's probably not what you had in mind. :-)

Also please note that it's spelled "guarantee".

Cheers,

					- Ted
On Tue, May 28, 2019 at 11:27 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Mon, May 27, 2019 at 08:26:55PM +0300, Amir Goldstein wrote:
> >
> > Following our discussions on LSF/MM and beyond [1][2], here is
> > an RFC documentation patch.
> >
> > Ted, I know we discussed limiting the API for linking an O_TMPFILE
> > to avert the hardlinks issue, but I decided it would be better to
> > document the hardlinks non-guaranty instead. This will allow me to
> > replicate the same semantics and documentation to renameat(2).
> > Let me know how that works out for you.
> >
> > I also decided to try out two separate flags for data and metadata.
> > I do not find any of those flags very useful without the other, but
> > documenting them separately was easier, because of the fsync/fdatasync
> > reference. In the end, we are trying to solve a social engineering
> > problem, so this is the least confusing way I could think of to describe
> > the new API.
>
> The way you have stated things is very confusing, and prone to be
> misunderstood.  I think it would be helpful to state things in the
> positive, instead of the negative.
>
> Let's review what you had wanted:
>
>     *If* the filename is visible in the directory after the crash,
>     *then* all of the metadata/data that had been written to the file
>     before the linkat(2) would be visible.
>
>     HOWEVER, you did not want to necessarily force an fsync(2) in
>     order to provide that guarantee.  That is, the filename would
>     not necessarily be guaranteed to be visible after a crash when
>     linkat(2) returns, but if the existence of the filename is
>     persisted, then the data would be too.
>
>     Also, at least initially we talked about this only making
>     sense for O_TMPFILE file descriptors.  I believe you were
>     trying to generalize things so it wouldn't necessarily have to
>     be a file created using O_TMPFILE.  Is that correct?

That is correct. I felt we were limiting ourselves only to avert the
hardlinks issue, so I decided it's better to explain that "nlink is not
part of the inode metadata" that this guarantee refers to.
I would be happy to get your feedback about the hardlink disclaimer,
since you brought up the concern. Is the disclaimer enough? Not needed
at all?

> So instead of saying "A filesystem that accepts this flag will
> guaranty, that old inode data will not be exposed in the new linked
> name," it's much clearer to state this in the affirmative:
>
>     A filesystem which accepts this flag will guarantee that if
>     the new pathname exists after a crash, all of the data written
>     to the file at the time of the linkat(2) call will be visible.

Sounds good to me. I will take a swing at another patch.

Thanks for the review!

Amir.
On Tue, May 28, 2019 at 11:07 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Mon, May 27, 2019 at 08:26:55PM +0300, Amir Goldstein wrote:
> > New link flags to request "atomic" link.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >
> > Hi Guys,
> >
> > Following our discussions on LSF/MM and beyond [1][2], here is
> > an RFC documentation patch.
> >
> > Ted, I know we discussed limiting the API for linking an O_TMPFILE
> > to avert the hardlinks issue, but I decided it would be better to
> > document the hardlinks non-guaranty instead. This will allow me to
> > replicate the same semantics and documentation to renameat(2).
> > Let me know how that works out for you.
> >
> > I also decided to try out two separate flags for data and metadata.
> > I do not find any of those flags very useful without the other, but
> > documenting them separately was easier, because of the fsync/fdatasync
> > reference. In the end, we are trying to solve a social engineering
> > problem, so this is the least confusing way I could think of to describe
> > the new API.
> >
> > First implementation of AT_ATOMIC_METADATA is expected to be
> > noop for xfs/ext4 and probably fsync for btrfs.
> >
> > First implementation of AT_ATOMIC_DATA is expected to be
> > filemap_write_and_wait() for xfs/ext4 and probably fdatasync for btrfs.
> >
> > Thoughts?
> >
> > Amir.
> >
> > [1] https://lwn.net/Articles/789038/
> > [2] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@mail.gmail.com/
> >
> >  man2/link.2 | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 51 insertions(+)
> >
> > diff --git a/man2/link.2 b/man2/link.2
> > index 649ba00c7..15c24703e 100644
> > --- a/man2/link.2
> > +++ b/man2/link.2
> > @@ -184,6 +184,57 @@ See
> >  .BR openat (2)
> >  for an explanation of the need for
> >  .BR linkat ().
> > +.TP
> > +.BR AT_ATOMIC_METADATA " (since Linux 5.x)"
> > +By default, a link operation followed by a system crash, may result in the
> > +new file name being linked with old inode metadata, such as out dated time
> > +stamps or missing extended attributes.
> > +One way to prevent this is to call
> > +.BR fsync (2)
> > +before linking the inode, but that involves flushing of volatile disk caches.
> > +
> > +A filesystem that accepts this flag will guaranty, that old inode metadata
> > +will not be exposed in the new linked name.
> > +Some filesystems may internally perform
> > +.BR fsync (2)
> > +before linking the inode to provide this guaranty,
> > +but often, filesystems will have a more efficient method to provide this
> > +guaranty without flushing volatile disk caches.
> > +
> > +A filesystem that accepts this flag does
> > +.BR NOT
> > +guaranty that the new file name will exist after a system crash, nor that the
> > +current inode metadata is persisted to disk.
>
> Hmmm.  I think it would be much clearer to state the two expectations in
> the same place, e.g.:
>
> "A filesystem that accepts this flag guarantees that after a successful
> call completion, the filesystem will return either (a) the version of
> the metadata that was on disk at the time the call completed; (b) a
> newer version of that metadata; or (c) -ENOENT.  In other words, a
> subsequent access of the file path will never return metadata that was
> obsolete at the time that the call completed, even if the system
> crashes immediately afterwards."

Your feedback is along the same lines as Ted's feedback.
I will try out the phrasing that Ted proposed, will see how that goes...
>
> Did I get that right?  I /think/ this means that one could implement Ye
> Olde Write And Rename as:
>
>	fd = open(".", O_TMPFILE...);
>	write(fd);
>	fsync(fd);

Certainly not fsync(), that's what my "counter-fsync()" phrasing was
trying to convey. The flags are meant as a "cheaper" replacement for
that fsync().

>	snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
>	linkat(AT_FDCWD, path, AT_FDCWD, "file.txt",
>			AT_EMPTY_PATH | AT_ATOMIC_DATA | AT_ATOMIC_METADATA);
>
> (Still struggling to figure out what userspace programs would use this
> for...)
>

There are generally two use cases:

1. Flushing volatile disk caches is very expensive.
   In this case sync_file_range(SYNC_FILE_RANGE_WRITE_AND_WAIT) is
   cheaper than fdatasync() for a preallocated file, and obviously a
   noop for AT_ATOMIC_METADATA in a "S.O.M.C" filesystem is cheaper
   than a journal commit.

2. Highly concurrent metadata workloads.
   Many users running AT_ATOMIC_METADATA concurrently are much less
   likely to interfere with each other than many users running fsync
   concurrently.
   IOW, when I'm using fsync() to provide the AT_ATOMIC_METADATA
   guarantee on a single journal fs, other users pay the penalty for a
   guarantee that I didn't ask for (i.e. persistency).

Thanks,
Amir.
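[Editor's note: a sketch of use case 1 above, under the assumption, argued
later in the thread, that on xfs/ext4 writing out the file's dirty pages with
sync_file_range(2) (no disk cache flush) is enough to keep stale data from
appearing behind the new name. SYNC_FILE_RANGE_WRITE_AND_WAIT only exists
since Linux 5.2, so it is spelled out here; sync_file_range(2) by itself gives
no durability guarantee, which is exactly the trade-off under discussion.]

    /* Sketch, not a portable guarantee: relies on filesystem-specific
     * crash-ordering behaviour ("SOMC") discussed in this thread. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #ifndef SYNC_FILE_RANGE_WRITE_AND_WAIT      /* added in Linux 5.2 */
    #define SYNC_FILE_RANGE_WRITE_AND_WAIT \
            (SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | \
             SYNC_FILE_RANGE_WAIT_AFTER)
    #endif

    /* Write out the file's dirty pages (no FLUSH/FUA to the device),
     * then link the O_TMPFILE fd into place via /proc/self/fd. */
    static int link_tmpfile_no_flush(int fd, int dirfd, const char *name)
    {
        char fdpath[64];

        if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE_AND_WAIT) != 0)
            return -1;

        snprintf(fdpath, sizeof(fdpath), "/proc/self/fd/%d", fd);
        return linkat(AT_FDCWD, fdpath, dirfd, name, AT_SYMLINK_FOLLOW);
    }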
> >
> > So instead of saying "A filesystem that accepts this flag will
> > guaranty, that old inode data will not be exposed in the new linked
> > name," it's much clearer to state this in the affirmative:
> >
> >     A filesystem which accepts this flag will guarantee that if
> >     the new pathname exists after a crash, all of the data written
> >     to the file at the time of the linkat(2) call will be visible.
> >
>
> Sounds good to me. I will take a swing at another patch.
>

So I am down to a single flag documented with 3 tweets ;-)

What do you think of:

"AT_ATOMIC_DATA (since Linux 5.x)
A filesystem which accepts this flag will guarantee that if the linked file
name exists after a system crash, then all of the data written to the file
and all of the file's metadata at the time of the linkat(2) call will be
visible.

The way to achieve this guarantee on old kernels is to call fsync (2)
before linking the file, but doing so will also results in flushing of
volatile disk caches.

A filesystem which accepts this flag does NOT
guarantee that any of the file hardlinks will exist after a system crash,
nor that the last observed value of st_nlink (see stat (2)) will persist."

Thanks,
Amir.
On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> What do you think of:
>
> "AT_ATOMIC_DATA (since Linux 5.x)
> A filesystem which accepts this flag will guarantee that if the linked file
> name exists after a system crash, then all of the data written to the file
> and all of the file's metadata at the time of the linkat(2) call will be
> visible.

".... will be visible after the file system is remounted".  (Never
hurts to be explicit.)

> The way to achieve this guarantee on old kernels is to call fsync (2)
> before linking the file, but doing so will also results in flushing of
> volatile disk caches.
>
> A filesystem which accepts this flag does NOT
> guarantee that any of the file hardlinks will exist after a system crash,
> nor that the last observed value of st_nlink (see stat (2)) will persist."

This is I think more precise:

    This guarantee can be achieved by calling fsync(2) before linking
    the file, but there may be more performant ways to provide these
    semantics.  In particular, note that the use of the AT_ATOMIC_DATA
    flag does *not* guarantee that the new link created by linkat(2)
    will be persisted after a crash.

We should also document that a file system which does not implement
this flag MUST return EINVAL if it is passed this flag to linkat(2).

						- Ted
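[Editor's note: the EINVAL rule Ted asks for is what lets applications probe
for the feature and fall back. A sketch; AT_ATOMIC_DATA is given a placeholder
value here because the flag was never part of the real ABI, so current kernels
take the fallback path.]

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    #ifndef AT_ATOMIC_DATA
    #define AT_ATOMIC_DATA 0x10000000   /* placeholder, not a real ABI flag */
    #endif

    static int link_with_data_barrier(int fd, const char *fdpath,
                                      int dirfd, const char *name)
    {
        if (linkat(AT_FDCWD, fdpath, dirfd, name,
                   AT_SYMLINK_FOLLOW | AT_ATOMIC_DATA) == 0)
            return 0;

        if (errno != EINVAL)
            return -1;

        /* The filesystem (or kernel) does not know the flag: obtain the
         * same crash-ordering guarantee the expensive way. */
        if (fdatasync(fd) != 0)
            return -1;
        return linkat(AT_FDCWD, fdpath, dirfd, name, AT_SYMLINK_FOLLOW);
    }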
On Fri, May 31, 2019 at 7:41 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> > What do you think of:
> >
> > "AT_ATOMIC_DATA (since Linux 5.x)
> > A filesystem which accepts this flag will guarantee that if the linked file
> > name exists after a system crash, then all of the data written to the file
> > and all of the file's metadata at the time of the linkat(2) call will be
> > visible.
>
> ".... will be visible after the file system is remounted".  (Never
> hurts to be explicit.)
>
> > The way to achieve this guarantee on old kernels is to call fsync (2)
> > before linking the file, but doing so will also results in flushing of
> > volatile disk caches.
> >
> > A filesystem which accepts this flag does NOT
> > guarantee that any of the file hardlinks will exist after a system crash,
> > nor that the last observed value of st_nlink (see stat (2)) will persist."
>
> This is I think more precise:
>
>     This guarantee can be achieved by calling fsync(2) before linking
>     the file, but there may be more performant ways to provide these
>     semantics.  In particular, note that the use of the AT_ATOMIC_DATA
>     flag does *not* guarantee that the new link created by linkat(2)
>     will be persisted after a crash.

OK. Just to be clear, mentioning hardlinks and st_nlink is not needed
in your opinion?

> We should also document that a file system which does not implement
> this flag MUST return EINVAL if it is passed this flag to linkat(2).

OK. I think this part can be documented as a possible reason for EINVAL,
as in the renameat(2) man page:

       EINVAL The filesystem does not support one of the flags in flags.

Thanks,
Amir.
On Fri, May 31, 2019 at 08:22:06PM +0300, Amir Goldstein wrote:
> >
> > This is I think more precise:
> >
> >     This guarantee can be achieved by calling fsync(2) before linking
> >     the file, but there may be more performant ways to provide these
> >     semantics.  In particular, note that the use of the AT_ATOMIC_DATA
> >     flag does *not* guarantee that the new link created by linkat(2)
> >     will be persisted after a crash.
>
> OK. Just to be clear, mentioning hardlinks and st_nlink is not needed
> in your opinion?

Your previous text stated that it was undefined what would happen to
all hardlinks belonging to the file, and that would imply that if a
file had N hard links, some in the directory which we are modifying,
and some in other directories, that somehow any of them might not be
present after the crash.  And that's not the case.

Suppose the file currently has hardlinks test1/foo, test1/quux, and
test2/baz --- and we've called syncfs(2) on the file system so
everything is persisted, and then linkat(2) is used to create a new
hardlink, test1/bar.  After a crash, the existence of test1/foo,
test1/quux, and test2/baz are not in question.  It's only unclear
whether or not test1/bar exists after the crash.

As far as st_nlink is concerned, the presumption is that the file
system itself will be consistent after the crash.  So if the hard
link has been persisted, then st_nlink will be incremented; if it has
not, it won't be.

Finally, one thing which makes it hard to state these sorts of things
as guarantees: sometimes, the file system won't *know* whether or not
it can make these guarantees.  For example, what should we do if the
file system is mounted with nobarrier?  If the overall hardware design
includes UPS's or some other kind of battery backup, the guarantee may
very well exist.  But the file system code can't know whether or not
that is the case.  So my inclination is to allow the file system to
accept the flag even if the mount option nobarrier is in play --- but
in that case, the guarantee is only if the rest of the system is
designed appropriately.

(For that matter, it used to be that there existed hard drives that
lied about whether they had a writeback cache, and/or made the CACHE
FLUSH a no-op so they could win the Winbench benchmarketing wars,
which was worth millions and millions of dollars in sales.  So we can
only assume that the hardware isn't lying to us when we use words
like "guarantee".)

					- Ted
On Fri, May 31, 2019 at 12:41:36PM -0400, Theodore Ts'o wrote:
> On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> > What do you think of:
> >
> > "AT_ATOMIC_DATA (since Linux 5.x)
> > A filesystem which accepts this flag will guarantee that if the linked file
> > name exists after a system crash, then all of the data written to the file
> > and all of the file's metadata at the time of the linkat(2) call will be
> > visible.
>
> ".... will be visible after the file system is remounted".  (Never
> hurts to be explicit.)
>
> > The way to achieve this guarantee on old kernels is to call fsync (2)
> > before linking the file, but doing so will also results in flushing of
> > volatile disk caches.
> >
> > A filesystem which accepts this flag does NOT
> > guarantee that any of the file hardlinks will exist after a system crash,
> > nor that the last observed value of st_nlink (see stat (2)) will persist."
>
> This is I think more precise:
>
>     This guarantee can be achieved by calling fsync(2) before linking
>     the file, but there may be more performant ways to provide these
>     semantics.  In particular, note that the use of the AT_ATOMIC_DATA
>     flag does *not* guarantee that the new link created by linkat(2)
>     will be persisted after a crash.

So here's the *implementation* problem I see with this definition of
AT_ATOMIC_DATA.  After linkat(dirfd, name, AT_ATOMIC_DATA), there is
no guarantee that the data is on disk or that the link is present.

However:

	linkat(dirfd, name, AT_ATOMIC_DATA);
	fsync(dirfd);

Suddenly changes all that.

i.e. when we fsync(dirfd) we guarantee that "name" is present in the
directory, and because we used AT_ATOMIC_DATA it implies that the
data pointed to by "name" must be present on disk.  IOWs, what was
once a pure directory sync operation now *must* fsync all the child
inodes that have been linkat(AT_ATOMIC_DATA) since the last time the
directory has been made stable.

IOWs, the described AT_ATOMIC_DATA "we don't have to write the data
during linkat() go-fast-get-out-of-gaol-free" behaviour isn't worth
the pixels it is written on - it just moves all the complexity to
directory fsync, and that's /already/ a behavioural minefield.

IMO, the "work-around" of forcing filesystems to write back
destination inodes during a link() operation is just nasty and will
just end up with really weird performance anomalies occurring in
production systems.  That's not really a solution, either, especially
as it is far, far faster for applications to use AIO_FSYNC and then
on the completion callback run a normal linkat() operation...

Hence, if I've understood these correctly, then I'll be recommending
that XFS follows this:

> We should also document that a file system which does not implement
> this flag MUST return EINVAL if it is passed this flag to linkat(2).

and returns -EINVAL to these flags because we do not have the change
tracking infrastructure to handle these directory fsync semantics.
I also suspect that, even if we could track this efficiently, we
can't do the flushing atomically because of locking order
constraints between directories, regular files, pages in the page
cache, etc.

Given that we can already use AIO to provide this sort of ordering,
and AIO is vastly faster than synchronous IO, I don't see any point
in adding complex barrier interfaces that can be /easily implemented
in userspace/ using existing AIO primitives.
You should start thinking about expanding libaio with stuff like
"link_after_fdatasync()" and suddenly the whole problem of
filesystem data vs metadata ordering goes away because the
application directly controls all ordering without blocking and
doesn't need to care what the filesystem under it does....

Cheers,

Dave.
On Sat, Jun 01, 2019 at 08:45:49AM +1000, Dave Chinner wrote:
> Given that we can already use AIO to provide this sort of ordering,
> and AIO is vastly faster than synchronous IO, I don't see any point
> in adding complex barrier interfaces that can be /easily implemented
> in userspace/ using existing AIO primitives.  You should start
> thinking about expanding libaio with stuff like
> "link_after_fdatasync()" and suddenly the whole problem of
> filesystem data vs metadata ordering goes away because the
> application directly controls all ordering without blocking and
> doesn't need to care what the filesystem under it does....

And let me point out that this is also how userspace can do an
efficient atomic rename - rename_after_fdatasync().  i.e. on
completion of the AIO_FSYNC, run the rename.  This guarantees that
the application will see either the old file or the complete new
file, and it *doesn't have to wait for the operation to complete*.
Once it is in flight, the file will contain the old data until some
point in the near future when it will contain the new data....

Seriously, sit down and work out all the "atomic" data vs metadata
behaviours you want, and then tell me how many of them cannot be
implemented as "AIO_FSYNC w/ completion callback function" in
userspace.  This mechanism /guarantees ordering/ at the application
level, the application does not block waiting for these data
integrity operations to complete, and you don't need any new kernel
side functionality to implement this.

Fundamentally, the assertion that disk cache flushes are not what
causes fsync "to be slow" is incorrect.  It's the synchronous
"waiting for IO completion" that makes fsync "slow".  AIO_FSYNC
avoids needing to wait for IO completion, allowing the application
to do useful work (like issue more DI ops) while data integrity
operations are in flight.  At this point, fsync is no longer a
"slow" operation - it's just another background async data flush
operation like the BDI flusher thread...

Cheers,

Dave.
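[Editor's note: a sketch of the "rename_after_fdatasync()" pattern Dave
describes, built from existing primitives: libaio's async fdatasync
(implemented in the kernel since Linux 4.18) plus ordinary userspace code run
on completion. The io_context_t is assumed to have been set up with
io_setup(); the helper name and error handling are illustrative only.]

    /* Requires libaio (-laio).  Nothing here blocks except the final
     * io_getevents() wait, which a real program would fold into its
     * normal completion-reaping loop. */
    #include <libaio.h>
    #include <stdio.h>

    static int rename_after_fdatasync(io_context_t ctx, int fd,
                                      const char *tmpname, const char *name)
    {
        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        struct io_event ev;

        io_prep_fdsync(&cb, fd);            /* async fdatasync of fd */
        if (io_submit(ctx, 1, cbs) != 1)
            return -1;

        /* ... the application does other useful work here ... */

        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
            return -1;
        if (ev.res != 0)                    /* the fdatasync failed */
            return -1;

        /* Completion "callback": data is written, now expose the name. */
        return rename(tmpname, name);
    }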
On Sat, Jun 1, 2019 at 1:46 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, May 31, 2019 at 12:41:36PM -0400, Theodore Ts'o wrote:
> > On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> > > What do you think of:
> > >
> > > "AT_ATOMIC_DATA (since Linux 5.x)
> > > A filesystem which accepts this flag will guarantee that if the linked file
> > > name exists after a system crash, then all of the data written to the file
> > > and all of the file's metadata at the time of the linkat(2) call will be
> > > visible.
> >
> > ".... will be visible after the file system is remounted".  (Never
> > hurts to be explicit.)
> >
> > > The way to achieve this guarantee on old kernels is to call fsync (2)
> > > before linking the file, but doing so will also results in flushing of
> > > volatile disk caches.
> > >
> > > A filesystem which accepts this flag does NOT
> > > guarantee that any of the file hardlinks will exist after a system crash,
> > > nor that the last observed value of st_nlink (see stat (2)) will persist."
> >
> > This is I think more precise:
> >
> >     This guarantee can be achieved by calling fsync(2) before linking
> >     the file, but there may be more performant ways to provide these
> >     semantics.  In particular, note that the use of the AT_ATOMIC_DATA
> >     flag does *not* guarantee that the new link created by linkat(2)
> >     will be persisted after a crash.
>
> So here's the *implementation* problem I see with this definition of
> AT_ATOMIC_DATA.  After linkat(dirfd, name, AT_ATOMIC_DATA), there is
> no guarantee that the data is on disk or that the link is present.
>
> However:
>
>	linkat(dirfd, name, AT_ATOMIC_DATA);
>	fsync(dirfd);
>
> Suddenly changes all that.
>
> i.e. when we fsync(dirfd) we guarantee that "name" is present in the
> directory, and because we used AT_ATOMIC_DATA it implies that the
> data pointed to by "name" must be present on disk.  IOWs, what was
> once a pure directory sync operation now *must* fsync all the child
> inodes that have been linkat(AT_ATOMIC_DATA) since the last time the
> directory has been made stable.
>
> IOWs, the described AT_ATOMIC_DATA "we don't have to write the data
> during linkat() go-fast-get-out-of-gaol-free" behaviour isn't worth
> the pixels it is written on - it just moves all the complexity to
> directory fsync, and that's /already/ a behavioural minefield.

Where does it say we don't have to write the data during linkat()?
I was only talking about avoiding the FLUSH/FUA caused by fsync().
I wrote in the commit message:
"First implementation of AT_ATOMIC_DATA is expected to be
filemap_write_and_wait() for xfs/ext4 and probably fdatasync for btrfs."

I failed to convey the reasoning for this flag to you.
It is *not* about the latency of the "atomic link" for the calling thread.
It is about not interfering with other workloads running at the same time.

> IMO, the "work-around" of forcing filesystems to write back
> destination inodes during a link() operation is just nasty and will
> just end up with really weird performance anomalies occurring in
> production systems.  That's not really a solution, either, especially
> as it is far, far faster for applications to use AIO_FSYNC and then
> on the completion callback run a normal linkat() operation...
>
> Hence, if I've understood these correctly, then I'll be recommending
> that XFS follows this:
>
> > We should also document that a file system which does not implement
> > this flag MUST return EINVAL if it is passed this flag to linkat(2).
>
> and returns -EINVAL to these flags because we do not have the change
> tracking infrastructure to handle these directory fsync semantics.
> I also suspect that, even if we could track this efficiently, we
> can't do the flushing atomically because of locking order
> constraints between directories, regular files, pages in the page
> cache, etc.

That is not at all what I had in mind for XFS with the flag.

> Given that we can already use AIO to provide this sort of ordering,
> and AIO is vastly faster than synchronous IO, I don't see any point
> in adding complex barrier interfaces that can be /easily implemented
> in userspace/ using existing AIO primitives.  You should start
> thinking about expanding libaio with stuff like
> "link_after_fdatasync()" and suddenly the whole problem of
> filesystem data vs metadata ordering goes away because the
> application directly controls all ordering without blocking and
> doesn't need to care what the filesystem under it does....

OK. I can work with that. It's not that simple, but I will reply to
your next email, where you wrote more about this alternative.

Thanks,
Amir.
On Sat, Jun 1, 2019 at 2:28 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sat, Jun 01, 2019 at 08:45:49AM +1000, Dave Chinner wrote:
> > Given that we can already use AIO to provide this sort of ordering,
> > and AIO is vastly faster than synchronous IO, I don't see any point
> > in adding complex barrier interfaces that can be /easily implemented
> > in userspace/ using existing AIO primitives.  You should start
> > thinking about expanding libaio with stuff like
> > "link_after_fdatasync()" and suddenly the whole problem of
> > filesystem data vs metadata ordering goes away because the
> > application directly controls all ordering without blocking and
> > doesn't need to care what the filesystem under it does....
>
> And let me point out that this is also how userspace can do an
> efficient atomic rename - rename_after_fdatasync().  i.e. on
> completion of the AIO_FSYNC, run the rename.  This guarantees that
> the application will see either the old file or the complete new
> file, and it *doesn't have to wait for the operation to complete*.
> Once it is in flight, the file will contain the old data until some
> point in the near future when it will contain the new data....

What I am looking for is a way to isolate the effects of "atomic
rename/link" from the rest of the users. Sure, there is I/O bandwidth
and queued bios, but at least isolate other threads working on other
files or metadata from contending with the "atomic rename" thread
over journal flushes and the like.

Actually, one of my use cases is "atomic rename" of files with no data
(looking for atomicity w.r.t. xattr and mtime), so this "atomic rename"
thread should not be interfering with other workloads at all.

> Seriously, sit down and work out all the "atomic" data vs metadata
> behaviours you want, and then tell me how many of them cannot be
> implemented as "AIO_FSYNC w/ completion callback function" in
> userspace.  This mechanism /guarantees ordering/ at the application
> level, the application does not block waiting for these data
> integrity operations to complete, and you don't need any new kernel
> side functionality to implement this.

So I think what I could have used is AIO_BATCH_FSYNC, an interface
that was proposed by Ric Wheeler and discussed on LSF:
https://lwn.net/Articles/789024/

Ric was looking for a way to efficiently fsync a "bunch of files".
Submitting several AIO_FSYNC calls is not the efficient way of doing
that. So it is either a new AIO_BATCH_FSYNC and a kernel
implementation that flushes the inodes and then calls ->sync_fs(), or
a new AIO operation that just does the ->sync_fs() bit, using
sync_file_range() for the inodes.

To be more accurate, the AIO operation that would emulate my proposed
API more closely is AIO_WAIT_FOR_SYNCFS, as I do not wish to impose
excessive journal flushes, I just need a completion callback when they
happen, to perform the rename/link.

> Fundamentally, the assertion that disk cache flushes are not what
> causes fsync "to be slow" is incorrect.  It's the synchronous

Too many double negatives. I am not sure I parsed this correctly.
But I think by now you understand that I don't care that fsync is
"slow". I care about frequent fsyncs making the entire system slow
down.

Heck, xfs even has a mitigation in place to improve performance of
too frequent fsyncs, but that mitigation is partly gone since

  47c7d0b19502 xfs: fix incorrect log_flushed on fsync

The situation with frequent fsync on ext4 at the moment is probably
worse.
I am trying to reduce the number of fsyncs from applications, and
converting fsync to AIO_FSYNC is not going to help with that.

Thanks,
Amir.
On Sat, Jun 01, 2019 at 11:01:42AM +0300, Amir Goldstein wrote:
> On Sat, Jun 1, 2019 at 2:28 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sat, Jun 01, 2019 at 08:45:49AM +1000, Dave Chinner wrote:
> > > Given that we can already use AIO to provide this sort of ordering,
> > > and AIO is vastly faster than synchronous IO, I don't see any point
> > > in adding complex barrier interfaces that can be /easily implemented
> > > in userspace/ using existing AIO primitives.  You should start
> > > thinking about expanding libaio with stuff like
> > > "link_after_fdatasync()" and suddenly the whole problem of
> > > filesystem data vs metadata ordering goes away because the
> > > application directly controls all ordering without blocking and
> > > doesn't need to care what the filesystem under it does....
> >
> > And let me point out that this is also how userspace can do an
> > efficient atomic rename - rename_after_fdatasync().  i.e. on
> > completion of the AIO_FSYNC, run the rename.  This guarantees that
> > the application will see either the old file or the complete new
> > file, and it *doesn't have to wait for the operation to complete*.
> > Once it is in flight, the file will contain the old data until some
> > point in the near future when it will contain the new data....
>
> What I am looking for is a way to isolate the effects of "atomic
> rename/link" from the rest of the users. Sure, there is I/O bandwidth
> and queued bios, but at least isolate other threads working on other
> files or metadata from contending with the "atomic rename" thread
> over journal flushes and the like.

That's not a function of the kernel API.  That's a function of the
implementation behind the kernel API.  i.e. the API requires data to
be written before the rename/link is committed, and how that is
achieved is up to the filesystem.  And some filesystems will not be
able to isolate the API behavioural requirement from other users....

> Actually, one of my use cases is "atomic rename" of files with no data
> (looking for atomicity w.r.t. xattr and mtime), so this "atomic rename"
> thread should not be interfering with other workloads at all.

Which should already be guaranteed because a) rename is supposed to
be atomic, and b) metadata ordering requirements in journalled
filesystems.  If they lose xattrs across rename, there's something
seriously wrong with the filesystem implementation.  I'm really not
sure what you think filesystems are actually doing with metadata
across rename operations....

> > Seriously, sit down and work out all the "atomic" data vs metadata
> > behaviours you want, and then tell me how many of them cannot be
> > implemented as "AIO_FSYNC w/ completion callback function" in
> > userspace.  This mechanism /guarantees ordering/ at the application
> > level, the application does not block waiting for these data
> > integrity operations to complete, and you don't need any new kernel
> > side functionality to implement this.
>
> So I think what I could have used is AIO_BATCH_FSYNC, an interface
> that was proposed by Ric Wheeler and discussed on LSF:
> https://lwn.net/Articles/789024/
>
> Ric was looking for a way to efficiently fsync a "bunch of files".
> Submitting several AIO_FSYNC calls is not the efficient way of doing
> that.

/me sighs.

That's not what I just suggested, and I've already addressed this
"AIO_FSYNC sucks" FUD in multiple separate threads.  You do realise
you can submit multiple AIO operations with a single io_submit()
call, right?
	struct iocb	ioc[10];
	struct io_event	ev[10];

	for (i = 0; i < 10; i++) {
		io_prep_fsync(&ioc[i], fd[i]);
		ioc[i]->data = callback_arg[i];
	}

	io_submit(aio_ctx, 10, &ioc);
	io_getevents(aio_ctx, 10, 10, ev, NULL);

	for (i = 0; i < 10; i++)
		post_fsync_callback(&ev[i]);

There's your single syscall AIO_BATCH_FSYNC functionality, and it
implements a per-fd post-fsync callback function.  This isn't rocket
science....

[snip]

> I am trying to reduce the number of fsyncs from applications
> and converting fsync to AIO_FSYNC is not going to help with that.

Your whole argument is "fsync is inefficient because cache flushes,
therefore AIO_FSYNC must be inefficient."  IOWs, you've already
decided what is wrong, how it can and can't be fixed and the
solution you want regardless of whether your assertions are correct
or not.  You haven't provided any evidence that a new kernel API is
the only viable solution, nor that the existing ones cannot provide
the functionality you require.

So, in the interests of /informed debate/, please implement what you
want using batched AIO_FSYNC + rename/linkat completion callback and
measure what it achieves.  Then implement a sync_file_range/linkat
thread pool that provides the same functionality to the application
(i.e. writeback concurrency in userspace) and measure it.  Then we
can discuss what the relative overhead is with numbers and can
perform analysis to determine what the cause of the performance
differential actually is.

Neither of these things require kernel modifications, but you need
to provide the evidence that existing APIs are insufficient.

Indeed, we now have the new async ioring stuff that can run async
sync_file_range calls, so you probably need to benchmark replacing
AIO_FSYNC with that interface as well.  This new API likely does
exactly what you want without the journal/device cache flush
overhead of AIO_FSYNC....

Cheers,

Dave.
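[Editor's note: Dave's sketch above, filled in as a compilable unit. One
detail the sketch glosses over: io_submit(2) takes an array of iocb
*pointers*, and the per-fd cookie travels in iocb.data and comes back in
io_event.data. Assumes libaio (-laio) and a kernel with async fsync support
(Linux 4.18+); post_fsync_callback() stands in for whatever the application
wants to run on completion, e.g. the linkat()/rename() of this thread.]

    #include <libaio.h>
    #include <stdio.h>

    #define NR_FILES 10

    /* Stand-in for the application's completion work. */
    static void post_fsync_callback(struct io_event *ev)
    {
        /* ev->data is the cookie stored in iocb.data at submission time,
         * ev->res is the fsync result (0 or a negative errno). */
        printf("fsync of %s: %ld\n", (const char *)ev->data, (long)ev->res);
    }

    /* Batched AIO fsync with a per-fd completion callback. */
    static int batch_fsync(io_context_t ctx, int fd[], const char *name[])
    {
        struct iocb ioc[NR_FILES];
        struct iocb *iocp[NR_FILES];        /* io_submit() wants pointers */
        struct io_event ev[NR_FILES];
        int i, n;

        for (i = 0; i < NR_FILES; i++) {
            io_prep_fsync(&ioc[i], fd[i]);
            ioc[i].data = (void *)name[i];  /* per-fd callback argument */
            iocp[i] = &ioc[i];
        }

        if (io_submit(ctx, NR_FILES, iocp) != NR_FILES)
            return -1;

        n = io_getevents(ctx, NR_FILES, NR_FILES, ev, NULL);
        for (i = 0; i < n; i++)
            post_fsync_callback(&ev[i]);

        return n == NR_FILES ? 0 : -1;
    }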
> > Actually, one of my use cases is "atomic rename" of files with no data
> > (looking for atomicity w.r.t. xattr and mtime), so this "atomic rename"
> > thread should not be interfering with other workloads at all.
>
> Which should already be guaranteed because a) rename is supposed to
> be atomic, and b) metadata ordering requirements in journalled
> filesystems.  If they lose xattrs across rename, there's something
> seriously wrong with the filesystem implementation.  I'm really not
> sure what you think filesystems are actually doing with metadata
> across rename operations....

Dave,

We are going in circles so much that my head is spinning.
I don't blame anyone for having a hard time keeping up with the plot,
because it spans many threads and subjects, so let me re-iterate:

- I *do* know that rename provides me the needed "metadata barrier"
  w.r.t. xattr on xfs/ext4 today.
- I *do* know that the sync_file_range()+rename() sequence provides the
  "data barrier" I need on xfs/ext4 today.
- I *do* use this internal fs knowledge in my applications.
- I even fixed up sync_file_range() per your suggestion, so I won't
  need to use the FIEMAP_FLAG_SYNC hack.
- An attempt by the CrashMonkey developers to document this behavior
  was "shot down" for many justified reasons.
- Without any documentation nor an explicit API with a clean guarantee,
  users cannot write efficient applications without being aware of the
  filesystem underneath and following that filesystem's development to
  make sure behavior has not changed.
- The most recent proposal I have made at LSF, based on Jan's
  suggestion, is to change nothing in filesystem implementations, but
  use a new *explicit* verb to communicate the expectation of the
  application, so that filesystems are free to change behavior in the
  future in the absence of the new verb.

Once again, ATOMIC_METADATA is a noop in present xfs/ext4.
ATOMIC_DATA is sync_file_range() in present xfs/ext4.
The APIs I *need* from the kernel *do* exist, but the filesystem
developers (except xfs) are not willing to document the guarantee that
the existing interfaces provide in the present.

[...]

> So, in the interests of /informed debate/, please implement what you
> want using batched AIO_FSYNC + rename/linkat completion callback and
> measure what it achieves.  Then implement a sync_file_range/linkat
> thread pool that provides the same functionality to the application
> (i.e. writeback concurrency in userspace) and measure it.  Then we
> can discuss what the relative overhead is with numbers and can
> perform analysis to determine what the cause of the performance
> differential actually is.

Fair enough.

> Neither of these things require kernel modifications, but you need
> to provide the evidence that existing APIs are insufficient.

APIs are sufficient if I know which filesystem I am running on.
btrfs needs a different set of syscalls to get the same thing done.

> Indeed, we now have the new async ioring stuff that can run async
> sync_file_range calls, so you probably need to benchmark replacing
> AIO_FSYNC with that interface as well.  This new API likely does
> exactly what you want without the journal/device cache flush
> overhead of AIO_FSYNC....

Indeed, I am keeping a close watch on io_uring.

Thanks,
Amir.
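[Editor's note: since both sides end up pointing at io_uring, here is a hedged
sketch of the same "link after fdatasync" ordering using liburing (Linux 5.1+
for the fsync opcode; later kernels can also queue sync_file_range this way).
The submitting thread is free to do other work between submission and
completion; names and error handling are illustrative only.]

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <unistd.h>

    static int link_after_fdatasync_uring(int fd, int dirfd, const char *name)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char fdpath[64];
        int ret;

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return -1;

        /* Queue an asynchronous fdatasync of fd. */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
        io_uring_submit(&ring);

        /* ... the application keeps working while the flush is in flight ... */

        ret = io_uring_wait_cqe(&ring, &cqe);
        if (ret == 0 && cqe->res == 0) {
            /* Completion action: expose the name only after the data flush. */
            snprintf(fdpath, sizeof(fdpath), "/proc/self/fd/%d", fd);
            ret = linkat(AT_FDCWD, fdpath, dirfd, name, AT_SYMLINK_FOLLOW);
        } else {
            ret = -1;
        }
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return ret;
    }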
diff --git a/man2/link.2 b/man2/link.2
index 649ba00c7..15c24703e 100644
--- a/man2/link.2
+++ b/man2/link.2
@@ -184,6 +184,57 @@ See
 .BR openat (2)
 for an explanation of the need for
 .BR linkat ().
+.TP
+.BR AT_ATOMIC_METADATA " (since Linux 5.x)"
+By default, a link operation followed by a system crash, may result in the
+new file name being linked with old inode metadata, such as out dated time
+stamps or missing extended attributes.
+One way to prevent this is to call
+.BR fsync (2)
+before linking the inode, but that involves flushing of volatile disk caches.
+
+A filesystem that accepts this flag will guaranty, that old inode metadata
+will not be exposed in the new linked name.
+Some filesystems may internally perform
+.BR fsync (2)
+before linking the inode to provide this guaranty,
+but often, filesystems will have a more efficient method to provide this
+guaranty without flushing volatile disk caches.
+
+A filesystem that accepts this flag does
+.BR NOT
+guaranty that the new file name will exist after a system crash, nor that the
+current inode metadata is persisted to disk.
+Specifically, if a file has hardlinks, the existance of the linked name after
+a system crash does
+.BR NOT
+guaranty that any of the other file names exist, nor that the last observed
+value of
+.I st_nlink
+(see
+.BR stat (2))
+has persisted.
+.TP
+.BR AT_ATOMIC_DATA " (since Linux 5.x)"
+By default, a link operation followed by a system crash, may result in the
+new file name being linked with old data or missing data.
+One way to prevent this is to call
+.BR fdatasync (2)
+before linking the inode, but that involves flushing of volatile disk caches.
+
+A filesystem that accepts this flag will guaranty, that old data
+will not be exposed in the new linked name.
+Some filesystems may internally perform
+.BR fsync (2)
+before linking the inode to provide this guaranty,
+but often, filesystems will have a more efficient method to provide this
+guaranty without flushing volatile disk caches.
+
+A filesystem that accepts this flag does
+.BR NOT
+guaranty that the new file name will exist after a system crash, nor that the
+current inode data is persisted to disk.
+.TP
 .SH RETURN VALUE
 On success, zero is returned.
 On error, \-1 is returned, and
New link flags to request "atomic" link.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Hi Guys,

Following our discussions on LSF/MM and beyond [1][2], here is
an RFC documentation patch.

Ted, I know we discussed limiting the API for linking an O_TMPFILE
to avert the hardlinks issue, but I decided it would be better to
document the hardlinks non-guaranty instead. This will allow me to
replicate the same semantics and documentation to renameat(2).
Let me know how that works out for you.

I also decided to try out two separate flags for data and metadata.
I do not find any of those flags very useful without the other, but
documenting them separately was easier, because of the fsync/fdatasync
reference. In the end, we are trying to solve a social engineering
problem, so this is the least confusing way I could think of to describe
the new API.

First implementation of AT_ATOMIC_METADATA is expected to be
noop for xfs/ext4 and probably fsync for btrfs.

First implementation of AT_ATOMIC_DATA is expected to be
filemap_write_and_wait() for xfs/ext4 and probably fdatasync for btrfs.

Thoughts?

Amir.

[1] https://lwn.net/Articles/789038/
[2] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@mail.gmail.com/

 man2/link.2 | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)