Message ID | 1430949612-21356-1-git-send-email-zab@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Hi Zach,

On Wed, May 6, 2015 at 6:00 PM, Zach Brown <zab@redhat.com> wrote:
>
> Add the O_NOMTIME flag which prevents mtime from being updated which can
> greatly reduce the IO overhead of writes to allocated and initialized
> regions of files.
>
> ceph servers can have loads where they perform O_DIRECT overwrites of
> allocated file data and then sync to make sure that the O_DIRECT writes
> are flushed from write caches. If the writes dirty the inode with mtime
> updates then the syncs also write out the metadata needed to track the
> inodes which can add significant iop and latency overhead.
>
> The ceph servers don't use mtime at all. They're using the local file
> system as a backing store and any backups would be driven by their upper
> level ceph metadata. For ceph, slow IO from mtime updates in the file
> system is as daft as if we had block devices slowing down IO for
> per-block write timestamps that file systems never use.
>
> In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
> sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> IO round trips to 1 in ext4.
>
> file_update_time() checks for O_NOMTIME and aborts the update if it's
> set, just like the current check for the in-kernel inode flag
> S_NOCMTIME. I didn't update any other mtime update sites. They could be
> added as we decide that it's appropriate to do so.
>
> I opted not to name the flag O_NOCMTIME because I didn't want the name
> to imply that ctime updates would be prevented for other inode changes
> like updating i_size in truncate. Not updating ctime is a side-effect
> of removing mtime updates when it's the only thing changing in the
> inode.
>
> The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> owning the file or having the CAP_FOWNER capability. If we're not
> comfortable allowing owners to prevent mtime/ctime updates then we
> should add a tunable to allow O_NOMTIME. Maybe a mount option?
>

Just out of curiosity, if you need to modify the application anyway,
why wouldn't use of fdatasync() when flushing be able to offer a
similar performance boost?

Cheers
  Trond
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

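A minimal Python sketch of the pattern the question above refers to: overwrite already-allocated bytes in place, then flush with either fdatasync() or fsync(). The function name `overwrite_and_flush` and its parameters are illustrative assumptions, not anything from the patch, and O_DIRECT is left out because it needs aligned buffers and file system support.

```python
import os


def overwrite_and_flush(path, offset, data, data_only=True):
    """Overwrite existing bytes at `offset`, then flush to stable storage.

    data_only=True  calls os.fdatasync(), which may skip metadata that is
                    not needed to read the data back (such as mtime).
    data_only=False calls os.fsync(), which flushes data and all metadata.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        written = os.pwrite(fd, data, offset)  # in-place overwrite, no size change
        if data_only:
            os.fdatasync(fd)
        else:
            os.fsync(fd)
        return written
    finally:
        os.close(fd)
```

With `data_only=True` the timestamp update is only deferred, not avoided: the dirty inode still has to be written back at some later point.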
On Wed, 6 May 2015, Trond Myklebust wrote:
> Hi Zach,
>
> On Wed, May 6, 2015 at 6:00 PM, Zach Brown <zab@redhat.com> wrote:
> >
> > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > greatly reduce the IO overhead of writes to allocated and initialized
> > regions of files.
> >
> > ceph servers can have loads where they perform O_DIRECT overwrites of
> > allocated file data and then sync to make sure that the O_DIRECT writes
> > are flushed from write caches. If the writes dirty the inode with mtime
> > updates then the syncs also write out the metadata needed to track the
> > inodes which can add significant iop and latency overhead.
> >
> > The ceph servers don't use mtime at all. They're using the local file
> > system as a backing store and any backups would be driven by their upper
> > level ceph metadata. For ceph, slow IO from mtime updates in the file
> > system is as daft as if we had block devices slowing down IO for
> > per-block write timestamps that file systems never use.
> >
> > In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
> > sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> > IO round trips to 1 in ext4.
> >
> > file_update_time() checks for O_NOMTIME and aborts the update if it's
> > set, just like the current check for the in-kernel inode flag
> > S_NOCMTIME. I didn't update any other mtime update sites. They could be
> > added as we decide that it's appropriate to do so.
> >
> > I opted not to name the flag O_NOCMTIME because I didn't want the name
> > to imply that ctime updates would be prevented for other inode changes
> > like updating i_size in truncate. Not updating ctime is a side-effect
> > of removing mtime updates when it's the only thing changing in the
> > inode.
> >
> > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > owning the file or having the CAP_FOWNER capability. If we're not
> > comfortable allowing owners to prevent mtime/ctime updates then we
> > should add a tunable to allow O_NOMTIME. Maybe a mount option?
> >
>
> Just out of curiosity, if you need to modify the application anyway,
> why wouldn't use of fdatasync() when flushing be able to offer a
> similar performance boost?

Although fdatasync(2) doesn't have to update synchronously, it does
eventually get written, and that can trigger lots of unwanted IO.

In practice we fsync(2) to avoid deferred IO that we can't
control/bound, but that's a long and sad story. O_NOMTIME would make
for a much better ending!

sage

On Wed, May 06, 2015 at 03:19:13PM -0700, Sage Weil wrote:
> On Wed, 6 May 2015, Trond Myklebust wrote:
> > Hi Zach,
> >
> > On Wed, May 6, 2015 at 6:00 PM, Zach Brown <zab@redhat.com> wrote:
> > >
> > > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > > greatly reduce the IO overhead of writes to allocated and initialized
> > > regions of files.
> > >
> > > ceph servers can have loads where they perform O_DIRECT overwrites of
> > > allocated file data and then sync to make sure that the O_DIRECT writes
> > > are flushed from write caches. If the writes dirty the inode with mtime
> > > updates then the syncs also write out the metadata needed to track the
> > > inodes which can add significant iop and latency overhead.
> > >
> > > The ceph servers don't use mtime at all. They're using the local file
> > > system as a backing store and any backups would be driven by their upper
> > > level ceph metadata. For ceph, slow IO from mtime updates in the file
> > > system is as daft as if we had block devices slowing down IO for
> > > per-block write timestamps that file systems never use.
> > >
> > > In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
> > > sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> > > IO round trips to 1 in ext4.
> > >
> > > file_update_time() checks for O_NOMTIME and aborts the update if it's
> > > set, just like the current check for the in-kernel inode flag
> > > S_NOCMTIME. I didn't update any other mtime update sites. They could be
> > > added as we decide that it's appropriate to do so.
> > >
> > > I opted not to name the flag O_NOCMTIME because I didn't want the name
> > > to imply that ctime updates would be prevented for other inode changes
> > > like updating i_size in truncate. Not updating ctime is a side-effect
> > > of removing mtime updates when it's the only thing changing in the
> > > inode.
> > >
> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > owning the file or having the CAP_FOWNER capability. If we're not
> > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > should add a tunable to allow O_NOMTIME. Maybe a mount option?
> > >
> >
> > Just out of curiosity, if you need to modify the application anyway,
> > why wouldn't use of fdatasync() when flushing be able to offer a
> > similar performance boost?
>
> Although fdatasync(2) doesn't have to update synchronously, it does
> eventually get written, and that can trigger lots of unwanted IO.

And the unwanted IO is per file. Are there circumstances where the
write:file ratio is small enough that dirty inode writes could start to
add up to meaningful write amplification?

- z

On Wed, 6 May 2015, Zach Brown wrote:
> On Wed, May 06, 2015 at 03:19:13PM -0700, Sage Weil wrote:
> > On Wed, 6 May 2015, Trond Myklebust wrote:
> > > Hi Zach,
> > >
> > > On Wed, May 6, 2015 at 6:00 PM, Zach Brown <zab@redhat.com> wrote:
> > > >
> > > > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > > > greatly reduce the IO overhead of writes to allocated and initialized
> > > > regions of files.
> > > >
> > > > ceph servers can have loads where they perform O_DIRECT overwrites of
> > > > allocated file data and then sync to make sure that the O_DIRECT writes
> > > > are flushed from write caches. If the writes dirty the inode with mtime
> > > > updates then the syncs also write out the metadata needed to track the
> > > > inodes which can add significant iop and latency overhead.
> > > >
> > > > The ceph servers don't use mtime at all. They're using the local file
> > > > system as a backing store and any backups would be driven by their upper
> > > > level ceph metadata. For ceph, slow IO from mtime updates in the file
> > > > system is as daft as if we had block devices slowing down IO for
> > > > per-block write timestamps that file systems never use.
> > > >
> > > > In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
> > > > sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> > > > IO round trips to 1 in ext4.
> > > >
> > > > file_update_time() checks for O_NOMTIME and aborts the update if it's
> > > > set, just like the current check for the in-kernel inode flag
> > > > S_NOCMTIME. I didn't update any other mtime update sites. They could be
> > > > added as we decide that it's appropriate to do so.
> > > >
> > > > I opted not to name the flag O_NOCMTIME because I didn't want the name
> > > > to imply that ctime updates would be prevented for other inode changes
> > > > like updating i_size in truncate. Not updating ctime is a side-effect
> > > > of removing mtime updates when it's the only thing changing in the
> > > > inode.
> > > >
> > > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > > owning the file or having the CAP_FOWNER capability. If we're not
> > > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > > should add a tunable to allow O_NOMTIME. Maybe a mount option?
> > > >
> > >
> > > Just out of curiosity, if you need to modify the application anyway,
> > > why wouldn't use of fdatasync() when flushing be able to offer a
> > > similar performance boost?
> >
> > Although fdatasync(2) doesn't have to update synchronously, it does
> > eventually get written, and that can trigger lots of unwanted IO.
>
> And the unwanted IO is per file. Are there circumstances where the
> write:file ratio is small enough that dirty inode writes could start to
> add up to meaningful write amplification?

Yeah, exactly: in some not-so-uncommon workloads it's approaching 1:1.

sage

On Wed, May 06, 2015 at 03:19:13PM -0700, Sage Weil wrote:
> > Just out of curiosity, if you need to modify the application anyway,
> > why wouldn't use of fdatasync() when flushing be able to offer a
> > similar performance boost?
>
> Although fdatasync(2) doesn't have to update synchronously, it does
> eventually get written, and that can trigger lots of unwanted IO.

Something that might be worth trying out is using MS_LAZYTIME plus
fdatasync(2). That should significantly reduce the unwanted IO, while
eventually letting the mtimes get updated, plus allowing updates of
adjacent inodes in the same inode table block update the mtime "for
free".

Regards,

- Ted

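Before relying on the MS_LAZYTIME suggestion above, an application can check whether the file system holding its data is actually mounted with the `lazytime` option. The following is a rough sketch that parses the text of `/proc/self/mounts`; the helper names `mount_options` and `uses_lazytime` are made up for this example, and escaped characters in mount points (such as `\040` for a space) are not handled.

```python
import os


def mount_options(path, mounts_text):
    """Return the option set of the deepest mount point containing `path`."""
    path = os.path.realpath(path)
    best_point, best_opts = "", set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed lines
        point, opts = fields[1], fields[3]
        prefix = point.rstrip("/") + "/"
        inside = path == point or path.startswith(prefix)
        # ">=" lets a later mount on the same mount point win.
        if inside and len(point) >= len(best_point):
            best_point, best_opts = point, set(opts.split(","))
    return best_opts


def uses_lazytime(path, mounts_text=None):
    """True if the mount holding `path` lists the lazytime option."""
    if mounts_text is None:
        with open("/proc/self/mounts") as f:
            mounts_text = f.read()
    return "lazytime" in mount_options(path, mounts_text)
```

Passing `mounts_text` explicitly makes the helper usable against a saved copy of the mount table; leaving it out reads the live one.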
On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> Add the O_NOMTIME flag which prevents mtime from being updated which can
> greatly reduce the IO overhead of writes to allocated and initialized
> regions of files.

Hmmm. How do backup programs now work out if the file has changed
and hence needs copying again? ie. applications using this will
break other critical infrastructure in subtle ways.

> ceph servers can have loads where they perform O_DIRECT overwrites of
> allocated file data and then sync to make sure that the O_DIRECT writes
> are flushed from write caches. If the writes dirty the inode with mtime
> updates then the syncs also write out the metadata needed to track the
> inodes which can add significant iop and latency overhead.
>
> The ceph servers don't use mtime at all. They're using the local file
> system as a backing store and any backups would be driven by their upper
> level ceph metadata. For ceph, slow IO from mtime updates in the file
> system is as daft as if we had block devices slowing down IO for
> per-block write timestamps that file systems never use.
>
> In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
> sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> IO round trips to 1 in ext4.
>
> file_update_time() checks for O_NOMTIME and aborts the update if it's
> set, just like the current check for the in-kernel inode flag
> S_NOCMTIME. I didn't update any other mtime update sites. They could be
> added as we decide that it's appropriate to do so.
>
> I opted not to name the flag O_NOCMTIME because I didn't want the name
> to imply that ctime updates would be prevented for other inode changes
> like updating i_size in truncate. Not updating ctime is a side-effect
> of removing mtime updates when it's the only thing changing in the
> inode.

If adding this, wouldn't we want to unify O_NOMTIME and
FMODE_NOCMTIME at the same time? i.e. it makes no sense to add
O_NOMTIME and not add O_NOCMTIME, likewise it makes no sense to
have two different "no mtime" detection mechanisms. i.e.
file_is_nomtime(file) should return true for files opened with
O_NOMTIME, files that have had FMODE_NOCMTIME added to them and
inodes with the S_NOCMTIME flag set on them.

> The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> owning the file or having the CAP_FOWNER capability. If we're not
> comfortable allowing owners to prevent mtime/ctime updates then we
> should add a tunable to allow O_NOMTIME. Maybe a mount option?

I dislike "turn off safety for performance" options because Joe
SpeedRacer will always select performance over safety.

Cheers,

Dave.

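The unified check suggested above can be modelled in a few lines. This is a toy userspace model in Python, not kernel code: the `File` and `Inode` classes and every bit value below are assumptions made up for the illustration, and only the names `O_NOMTIME`, `FMODE_NOCMTIME`, `S_NOCMTIME`, `file_is_nomtime` and `file_update_time` come from the thread.

```python
from dataclasses import dataclass

# Hypothetical bit values chosen for this model only; the kernel's differ.
O_NOMTIME = 0x1       # per-open flag proposed by the patch
FMODE_NOCMTIME = 0x2  # per-file mode bit
S_NOCMTIME = 0x4      # per-inode flag


@dataclass
class Inode:
    i_flags: int = 0
    mtime: int = 0


@dataclass
class File:
    inode: Inode
    f_flags: int = 0
    f_mode: int = 0


def file_is_nomtime(file):
    """One predicate covering all three 'no mtime' mechanisms."""
    return bool(
        file.f_flags & O_NOMTIME
        or file.f_mode & FMODE_NOCMTIME
        or file.inode.i_flags & S_NOCMTIME
    )


def file_update_time(file, now):
    """Skip the timestamp update when any 'no mtime' marker is set."""
    if file_is_nomtime(file):
        return False
    file.inode.mtime = now
    return True
```

The point of the single predicate is that every caller answers "should this write touch mtime?" the same way, whichever of the three mechanisms was used.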
On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > greatly reduce the IO overhead of writes to allocated and initialized
> > regions of files.
>
> Hmmm. How do backup programs now work out if the file has changed
> and hence needs copying again? ie. applications using this will
> break other critical infrastructure in subtle ways.

By using backup infrastructure that doesn't use cmtime. Like btrfs
send/recv. Or application level backups that know how to do
incrementals from metadata in giant database files, say, without
walking, comparing, and copying the entire thing.

> > I opted not to name the flag O_NOCMTIME because I didn't want the name
> > to imply that ctime updates would be prevented for other inode changes
> > like updating i_size in truncate. Not updating ctime is a side-effect
> > of removing mtime updates when it's the only thing changing in the
> > inode.
>
> If adding this, wouldn't we want to unify O_NOMTIME and
> FMODE_NOCMTIME at the same time?

I could see that, sure.

> > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > owning the file or having the CAP_FOWNER capability. If we're not
> > comfortable allowing owners to prevent mtime/ctime updates then we
> > should add a tunable to allow O_NOMTIME. Maybe a mount option?
>
> I dislike "turn off safety for performance" options because Joe
> SpeedRacer will always select performance over safety.

Well, for ceph there's no safety concern. They never use cmtime in
these files.

So are you suggesting not implementing this and making them rework their
IO paths to avoid the fs maintaining mtime so that we don't give Joe
Speedracer more rope? Or are we talking about adding some speed bumps
that ceph can flip on that might give Joe Speedracer pause?

- z

> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > owning the file or having the CAP_FOWNER capability. If we're not
> > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > should add a tunable to allow O_NOMTIME. Maybe a mount option?
> >
> > I dislike "turn off safety for performance" options because Joe
> > SpeedRacer will always select performance over safety.
>
> Well, for ceph there's no safety concern. They never use cmtime in
> these files.
>
> So are you suggesting not implementing this and making them rework their
> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> Speedracer more rope? Or are we talking about adding some speed bumps
> that ceph can flip on that might give Joe Speedracer pause?

Maybe one way to make it less of an attractive nuisance would be to hide
it under open_by_handle_at(). Like xfs_open_by_handle() does today but
we probably don't want to unconditionally add it to the generic path so
we'd have a flag.

They want to move to opening by handles anyway to avoid dirent lookups
when opening cold files.

- z

On Thu, May 7, 2015 at 7:20 PM, Zach Brown <zab@redhat.com> wrote:
> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>> > Add the O_NOMTIME flag which prevents mtime from being updated which can
>> > greatly reduce the IO overhead of writes to allocated and initialized
>> > regions of files.
>>
>> Hmmm. How do backup programs now work out if the file has changed
>> and hence needs copying again? ie. applications using this will
>> break other critical infrastructure in subtle ways.
>
> By using backup infrastructure that doesn't use cmtime. Like btrfs
> send/recv. Or application level backups that know how to do
> incrementals from metadata in giant database files, say, without
> walking, comparing, and copying the entire thing.

But how can Joey random user know that some of his
applications are using O_NOMTIME and his KISS backup
program does no longer function as expected?

On Thu, May 7, 2015 at 12:09 PM, Richard Weinberger
<richard.weinberger@gmail.com> wrote:
> On Thu, May 7, 2015 at 7:20 PM, Zach Brown <zab@redhat.com> wrote:
>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>>> > Add the O_NOMTIME flag which prevents mtime from being updated which can
>>> > greatly reduce the IO overhead of writes to allocated and initialized
>>> > regions of files.
>>>
>>> Hmmm. How do backup programs now work out if the file has changed
>>> and hence needs copying again? ie. applications using this will
>>> break other critical infrastructure in subtle ways.
>>
>> By using backup infrastructure that doesn't use cmtime. Like btrfs
>> send/recv. Or application level backups that know how to do
>> incrementals from metadata in giant database files, say, without
>> walking, comparing, and copying the entire thing.
>
> But how can Joey random user know that some of his
> applications are using O_NOMTIME and his KISS backup
> program does no longer function as expected?
>

Joey random user can't have a working KISS backup anyway, though,
because we screw up mtime updates on mmap writes. I have patches
gathering dust that fix that, though.

--Andy

On Thu, May 7, 2015 at 1:02 PM, Richard Weinberger <richard@nod.at> wrote:
> On 07.05.2015 at 21:53, Andy Lutomirski wrote:
>> On Thu, May 7, 2015 at 12:09 PM, Richard Weinberger
>> <richard.weinberger@gmail.com> wrote:
>>> On Thu, May 7, 2015 at 7:20 PM, Zach Brown <zab@redhat.com> wrote:
>>>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>>>>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>>>>>> Add the O_NOMTIME flag which prevents mtime from being updated which can
>>>>>> greatly reduce the IO overhead of writes to allocated and initialized
>>>>>> regions of files.
>>>>>
>>>>> Hmmm. How do backup programs now work out if the file has changed
>>>>> and hence needs copying again? ie. applications using this will
>>>>> break other critical infrastructure in subtle ways.
>>>>
>>>> By using backup infrastructure that doesn't use cmtime. Like btrfs
>>>> send/recv. Or application level backups that know how to do
>>>> incrementals from metadata in giant database files, say, without
>>>> walking, comparing, and copying the entire thing.
>>>
>>> But how can Joey random user know that some of his
>>> applications are using O_NOMTIME and his KISS backup
>>> program does no longer function as expected?
>>>
>>
>> Joey random user can't have a working KISS backup anyway, though,
>> because we screw up mtime updates on mmap writes. I have patches
>> gathering dust that fix that, though.
>
> Hmmm, I thought mtime will be updated upon msync()?
> Assuming a sane application is using msync()...
>

So would I. Unfortunately, mtime is updated on the page fault that
makes an mmapped page writeable, thus guaranteeing that the resulting
mtime is stale if you mmap a file, write to it, unmap it, and close
it. It's much more stale if you mmap it, write, wait for a while but
not long enough that the page is automatically written back, write
again, unmap, and close.

--Andy

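The mmap behaviour described above can be observed from userspace. A small sketch, assuming a Linux system and an existing non-empty file; `mmap_write_and_report` is a name invented for this example. It writes through a shared mapping, calls `flush()` (which issues msync(2)), and returns the mtime before and after so the caller can see when the timestamp moved. What it reports depends on the kernel and file system, so no particular result is claimed here.

```python
import mmap
import os


def mmap_write_and_report(path, payload):
    """Store `payload` via a shared mapping; return (mtime_before, mtime_after) in ns."""
    before = os.stat(path).st_mtime_ns
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, 0, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            m[:len(payload)] = payload  # first store faults the page writable
            m.flush()                   # msync(2) on the whole mapping
    finally:
        os.close(fd)
    return before, os.stat(path).st_mtime_ns
```

The payload must fit inside the existing file, because a mapping of length 0 covers exactly the current file size.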
On Thu, 7 May 2015, Zach Brown wrote:
> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > owning the file or having the CAP_FOWNER capability. If we're not
> > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > should add a tunable to allow O_NOMTIME. Maybe a mount option?
> >
> > I dislike "turn off safety for performance" options because Joe
> > SpeedRacer will always select performance over safety.
>
> Well, for ceph there's no safety concern. They never use cmtime in
> these files.
>
> So are you suggesting not implementing this and making them rework their
> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> Speedracer more rope? Or are we talking about adding some speed bumps
> that ceph can flip on that might give Joe Speedracer pause?

I think this is the fundamental question: who do we give the ammunition
to, the user or app writer, or the sysadmin?

One might argue that we gave the user a similar power with O_NOATIME (the
power to break applications that assume atime is accurate). Here we give
developers/users the power to not update mtime and suffer the consequences
(like, obviously, breaking mtime-based backups). It should be pretty
obvious to anyone using the flag what the consequences are.

Note that we can suffer similar lapses in mtime with fdatasync followed by
a system crash. And as Andy points out it's semi-broken for writable
mmap. The crash case is obviously a slightly different thing, but the
idea that mtime can't always be trusted certainly isn't crazy talk.

Or, we can be conservative and require a mount option so that the admin
has to explicitly allow behavior that might break some existing
assumptions about mtime/ctime ('-o user_noatime' I guess?).

I'm happy either way, so long as in the end an unprivileged ceph daemon
avoids the useless work. In our case we always own the entire mount/disk,
so a mount option is just fine.

Thanks!
sage

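For comparison, the existing O_NOATIME flag mentioned above already follows the "file owner or CAP_FOWNER" rule: open(2) fails with EPERM when the caller qualifies for neither. A short sketch of how an application typically copes with that, assuming Linux; the helper name `open_noatime` is invented for the example.

```python
import os


def open_noatime(path):
    """Open read-only, preferring O_NOATIME.

    Returns (fd, used_noatime). The kernel refuses O_NOATIME with EPERM
    unless the caller owns the file or has CAP_FOWNER, so fall back to a
    plain open in that case.
    """
    try:
        return os.open(path, os.O_RDONLY | os.O_NOATIME), True
    except PermissionError:
        return os.open(path, os.O_RDONLY), False
```

An O_NOMTIME flag gated the same way would presumably need the same kind of fallback in callers, but that is speculation about a flag that exists only in this patch.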
On Thu, May 7, 2015 at 9:01 PM, Sage Weil <sage@newdream.net> wrote:
> On Thu, 7 May 2015, Zach Brown wrote:
>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
>> > > owning the file or having the CAP_FOWNER capability. If we're not
>> > > comfortable allowing owners to prevent mtime/ctime updates then we
>> > > should add a tunable to allow O_NOMTIME. Maybe a mount option?
>> >
>> > I dislike "turn off safety for performance" options because Joe
>> > SpeedRacer will always select performance over safety.
>>
>> Well, for ceph there's no safety concern. They never use cmtime in
>> these files.
>>
>> So are you suggesting not implementing this and making them rework their
>> IO paths to avoid the fs maintaining mtime so that we don't give Joe
>> Speedracer more rope? Or are we talking about adding some speed bumps
>> that ceph can flip on that might give Joe Speedracer pause?
>
> I think this is the fundamental question: who do we give the ammunition
> to, the user or app writer, or the sysadmin?
>
> One might argue that we gave the user a similar power with O_NOATIME (the
> power to break applications that assume atime is accurate). Here we give
> developers/users the power to not update mtime and suffer the consequences
> (like, obviously, breaking mtime-based backups). It should be pretty
> obvious to anyone using the flag what the consequences are.
>
> Note that we can suffer similar lapses in mtime with fdatasync followed by
> a system crash. And as Andy points out it's semi-broken for writable
> mmap. The crash case is obviously a slightly different thing, but the
> idea that mtime can't always be trusted certainly isn't crazy talk.
>
> Or, we can be conservative and require a mount option so that the admin
> has to explicitly allow behavior that might break some existing
> assumptions about mtime/ctime ('-o user_noatime' I guess?).
>
> I'm happy either way, so long as in the end an unprivileged ceph daemon
> avoids the useless work. In our case we always own the entire mount/disk,
> so a mount option is just fine.
>

So, what is the expectation here for filesystems that cannot support
this flag? NFSv3 in particular would break pretty catastrophically if
someone decided on a whim to turn off mtime: they will have turned off
the client's ability to detect cache incoherencies.

Cheers
  Trond

On Thu, May 07, 2015 at 10:20:53AM -0700, Zach Brown wrote:
> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> > > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > > greatly reduce the IO overhead of writes to allocated and initialized
> > > regions of files.
> >
> > Hmmm. How do backup programs now work out if the file has changed
> > and hence needs copying again? ie. applications using this will
> > break other critical infrastructure in subtle ways.
>
> By using backup infrastructure that doesn't use cmtime. Like btrfs
> send/recv. Or application level backups that know how to do
> incrementals from metadata in giant database files, say, without
> walking, comparing, and copying the entire thing.

"Use magical thing that doesn't exist"? Really?

e.g. you can't do incremental backups with tools like xfsdump if
mtime is not being updated. The last thing an admin wants when
doing disaster recovery is to find out that the app started using
O_NOMTIME as a result of the upgrade they did 6 months ago. Hence
the last 6 months of production data isn't in the backups despite
the backup procedure having been extensively tested and verified
when it was first put in place.

> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > owning the file or having the CAP_FOWNER capability. If we're not
> > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > should add a tunable to allow O_NOMTIME. Maybe a mount option?
> >
> > I dislike "turn off safety for performance" options because Joe
> > SpeedRacer will always select performance over safety.
>
> Well, for ceph there's no safety concern. They never use cmtime in
> these files.

Understood.

> So are you suggesting not implementing this

No.

> Or are we talking about adding some speed bumps
> that ceph can flip on that might give Joe Speedracer pause?

Yes, but not just Joe Speedracer - if it can be turned on silently
by apps then it's a great big landmine that most users and sysadmins
will not know about until it is too late.

Cheers,

Dave.

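To make the backup concern above concrete, here is a sketch of the kind of mtime-driven incremental scan being discussed. It is a simplified stand-in written for this illustration, not how xfsdump or any particular tool is implemented, and `changed_since` is a hypothetical helper name.

```python
import os


def changed_since(root, cutoff_ns):
    """List regular files under `root` whose mtime is newer than `cutoff_ns`."""
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            if os.lstat(full).st_mtime_ns > cutoff_ns:
                changed.append(full)
    return sorted(changed)
```

A scan like this only copies what it is told has changed: a file whose contents were rewritten while its mtime stayed at or before the cutoff is silently skipped, which is the failure mode described above.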
On Thu, May 07, 2015 at 12:53:46PM -0700, Andy Lutomirski wrote:
> On Thu, May 7, 2015 at 12:09 PM, Richard Weinberger
> <richard.weinberger@gmail.com> wrote:
> > On Thu, May 7, 2015 at 7:20 PM, Zach Brown <zab@redhat.com> wrote:
> >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> >>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> >>> > Add the O_NOMTIME flag which prevents mtime from being updated which can
> >>> > greatly reduce the IO overhead of writes to allocated and initialized
> >>> > regions of files.
> >>>
> >>> Hmmm. How do backup programs now work out if the file has changed
> >>> and hence needs copying again? ie. applications using this will
> >>> break other critical infrastructure in subtle ways.
> >>
> >> By using backup infrastructure that doesn't use cmtime. Like btrfs
> >> send/recv. Or application level backups that know how to do
> >> incrementals from metadata in giant database files, say, without
> >> walking, comparing, and copying the entire thing.
> >
> > But how can Joey random user know that some of his
> > applications are using O_NOMTIME and his KISS backup
> > program does no longer function as expected?
> >
>
> Joey random user can't have a working KISS backup anyway, though,
> because we screw up mtime updates on mmap writes. I have patches
> gathering dust that fix that, though.

They are close enough to be good for backup purposes. The mtime only
need change once per backup period - it doesn't need to be
millisecond accurate. Yes, I know you needed that changed for
different reasons (avoid variable page fault latency), but it
doesn't matter for once-a-day or even once-an-hour incremental
backup scans.

Besides, anyone who cares about accurate backups is doing a backup
from a snapshot so the data and metadata are consistent across the
entire backup. And that makes worries about mmap and mtime
completely irrelevant because a snapshot freezes the filesystem and
hence cleans all the mapped pages. Once the snapshot is taken the
next mmap write will trigger a page fault and so change the mtime
and it will be picked up in the next backup scan...

Cheers,

Dave.

On May 8, 2015 8:11 AM, "Dave Chinner" <david@fromorbit.com> wrote: > > On Thu, May 07, 2015 at 10:20:53AM -0700, Zach Brown wrote: > > On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote: > > > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote: > > > > Add the O_NOMTIME flag which prevents mtime from being updated which can > > > > greatly reduce the IO overhead of writes to allocated and initialized > > > > regions of files. > > > > > > Hmmm. How do backup programs now work out if the file has changed > > > and hence needs copying again? ie. applications using this will > > > break other critical infrastructure in subtle ways. > > > > By using backup infrastructure that doesn't use cmtime. Like btrfs > > send/recv. Or application level backups that know how to do > > incrementals from metadata in giant database files, say, without > > walking, comparing, and copying the entire thing. > > "Use magical thing that doesn't exist"? Really? > > e.g. you can't do incremental backups with tools like xfsdump if > mtime is not being updated. The last thing an admin wants when > doing disaster recovery is to find out that the app started using > O_NOMTIME as a result of the upgrade they did 6 months ago. Hence > the last 6 months of production data isn't in the backups despite > the backup procedure having been extensively tested and verified > when it was first put in place. > > > > > The criteria for using O_NOMTIME is the same as for using O_NOATIME: > > > > owning the file or having the CAP_FOWNER capability. If we're not > > > > comfortable allowing owners to prevent mtime/ctime updates then we > > > > should add a tunable to allow O_NOMTIME. Maybe a mount option? > > > > > > I dislike "turn off safety for performance" options because Joe > > > SpeedRacer will always select performance over safety. > > > > Well, for ceph there's no safety concern. They never use cmtime in > > these files. > > Understood. 
> > > So are you suggesting not implementing this > > No. > > > Or are we talking about adding some speed bumps > > that ceph can flip on that might give Joe Speedracer pause? > > Yes, but not just Joe Speedracer - if it can be turned on silently > by apps then it's a great big landmine that most users and sysadmins > will not know about until it is too late. What about programs like tar that explicitly override mtime? No admin buy-in is required for that. Admittedly, that doesn't affect ctime, nor is it as likely to bite unexpectedly as a nomtime flag. I think it would be reasonably safe if a mount option had to be set to allow O_NOCMTIME or such. --Andy > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> "Sage" == Sage Weil <sage@newdream.net> writes:

Sage> On Thu, 7 May 2015, Zach Brown wrote:
>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
>> > > owning the file or having the CAP_FOWNER capability. If we're not
>> > > comfortable allowing owners to prevent mtime/ctime updates then we
>> > > should add a tunable to allow O_NOMTIME. Maybe a mount option?
>> >
>> > I dislike "turn off safety for performance" options because Joe
>> > SpeedRacer will always select performance over safety.
>>
>> Well, for ceph there's no safety concern. They never use cmtime in
>> these files.
>>
>> So are you suggesting not implementing this and making them rework their
>> IO paths to avoid the fs maintaining mtime so that we don't give Joe
>> Speedracer more rope? Or are we talking about adding some speed bumps
>> that ceph can flip on that might give Joe Speedracer pause?

Sage> I think this is the fundamental question: who do we give the
Sage> ammunition to, the user or app writer, or the sysadmin?

Sage> One might argue that we gave the user a similar power with
Sage> O_NOATIME (the power to break applications that assume atime is
Sage> accurate). Here we give developers/users the power to not
Sage> update mtime and suffer the consequences (like, obviously,
Sage> breaking mtime-based backups). It should be pretty obvious to
Sage> anyone using the flag what the consequences are.

Not modifying atime doesn't really break anything except people who think they can tell when a file was last accessed. Which isn't critical (unless you're in a paranoid security conscious place...) but MTIME is another beast entirely. Turning that off is going to break lots of hidden assumptions.

Sage> Note that we can suffer similar lapses in mtime with fdatasync
Sage> followed by a system crash.
And as Andy points out it's
Sage> semi-broken for writable mmap. The crash case is obviously a
Sage> slightly different thing, but the idea that mtime can't always
Sage> be trusted certainly isn't crazy talk.

True, but after a crash... people expect and understand there might be corruption in a filesystem.

Sage> Or, we can be conservative and require a mount option so that
Sage> the admin has to explicitly allow behavior that might break some
Sage> existing assumptions about mtime/ctime ('-o user_noatime' I
Sage> guess?).

Sage> I'm happy either way, so long as in the end an unprivileged ceph
Sage> daemon avoids the useless work. In our case we always own the
Sage> entire mount/disk, so a mount option is just fine.

I agree with the mount option, makes it crystal clear. And then it's on the sysadmin/owner of the system to understand (ha!) the problems.

This is all me speaking with my Sysadmin hat firmly on my head.
On 2015-05-07 21:01, Sage Weil wrote: > On Thu, 7 May 2015, Zach Brown wrote: >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote: >>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote: >>>> The criteria for using O_NOMTIME is the same as for using O_NOATIME: >>>> owning the file or having the CAP_FOWNER capability. If we're not >>>> comfortable allowing owners to prevent mtime/ctime updates then we >>>> should add a tunable to allow O_NOMTIME. Maybe a mount option? >>> >>> I dislike "turn off safety for performance" options because Joe >>> SpeedRacer will always select performance over safety. >> >> Well, for ceph there's no safety concern. They never use cmtime in >> these files. >> >> So are you suggesting not implementing this and making them rework their >> IO paths to avoid the fs maintaining mtime so that we don't give Joe >> Speedracer more rope? Or are we talking about adding some speed bumps >> that ceph can flip on that might give Joe Speedracer pause? > > I think this is the fundamental question: who do we give the ammunition > to, the user or app writer, or the sysadmin? > > One might argue that we gave the user a similar power with O_NOATIME (the > power to break applications that assume atime is accurate). Here we give > developers/users the power to not update mtime and suffer the consequences > (like, obviously, breaking mtime-based backups). It should be pretty > obvious to anyone using the flag what the consequences are. The difference is that the only widely used program that uses atime for anything is Mutt (and many people who don't use Mutt just disable updating it altogether to improve performance), whereas mtime is used at the very least by many backup tools, and pretty much all NFSv{3,2} clients, as well as a number of other pieces of software. > > Note that we can suffer similar lapses in mtime with fdatasync followed by > a system crash. And as Andy points out it's semi-broken for writable > mmap. 
The crash case is obviously a slightly different thing, but the > idea that mtime can't always be trusted certainly isn't crazy talk. > > Or, we can be conservative and require a mount option so that the admin > has to explicitly allow behavior that might break some existing > assumptions about mtime/ctime ('-o user_noatime' I guess?). Personally, I agree that there should be a mount option. We should make sure to put a big fat warning about it in the manpage however, irrespective of how it is controlled. > > I'm happy either way, so long as in the end an unprivileged ceph daemon > avoids the useless work. In our case we always own the entire mount/disk, > so a mount option is just fine. > Thanks! > sage > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ >
On 5/7/15 10:24 PM, Andy Lutomirski wrote: > On May 8, 2015 8:11 AM, "Dave Chinner" <david@fromorbit.com> wrote: >> >> On Thu, May 07, 2015 at 10:20:53AM -0700, Zach Brown wrote: >>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote: >>>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote: >>>>> Add the O_NOMTIME flag which prevents mtime from being updated which can >>>>> greatly reduce the IO overhead of writes to allocated and initialized >>>>> regions of files. >>>> >>>> Hmmm. How do backup programs now work out if the file has changed >>>> and hence needs copying again? ie. applications using this will >>>> break other critical infrastructure in subtle ways. >>> >>> By using backup infrastructure that doesn't use cmtime. Like btrfs >>> send/recv. Or application level backups that know how to do >>> incrementals from metadata in giant database files, say, without >>> walking, comparing, and copying the entire thing. >> >> "Use magical thing that doesn't exist"? Really? >> >> e.g. you can't do incremental backups with tools like xfsdump if >> mtime is not being updated. The last thing an admin wants when >> doing disaster recovery is to find out that the app started using >> O_NOMTIME as a result of the upgrade they did 6 months ago. Hence >> the last 6 months of production data isn't in the backups despite >> the backup procedure having been extensively tested and verified >> when it was first put in place. >> >>>>> The criteria for using O_NOMTIME is the same as for using O_NOATIME: >>>>> owning the file or having the CAP_FOWNER capability. If we're not >>>>> comfortable allowing owners to prevent mtime/ctime updates then we >>>>> should add a tunable to allow O_NOMTIME. Maybe a mount option? >>>> >>>> I dislike "turn off safety for performance" options because Joe >>>> SpeedRacer will always select performance over safety. >>> >>> Well, for ceph there's no safety concern. They never use cmtime in >>> these files. >> >> Understood. 
>> >>> So are you suggesting not implementing this >> >> No. >> >>> Or are we talking about adding some speed bumps >>> that ceph can flip on that might give Joe Speedracer pause? >> >> Yes, but not just Joe Speedracer - if it can be turned on silently >> by apps then it's a great big landmine that most users and sysadmins >> will not know about until it is too late. > > What about programs like tar that explicitly override mtime? No admin > buy-in is required for that. Admittedly, that doesn't affect ctime, > nor is it as likely to bite unexpectedly as a nomtime flag. > > I think it would be reasonably safe if a mount option had to be set to > allow O_NOCMTIME or such. I was going to suggest the same. Make infrastructure available for an app to request O_NOMTIME, but a mount option must be set to allow it, so the administrator doesn't get an unhappy surprise at backup-restore time. (Not a big fan of more twiddly knobs, but that seems to put the control in all the right places). -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 7 May 2015, Trond Myklebust wrote:
> On Thu, May 7, 2015 at 9:01 PM, Sage Weil <sage@newdream.net> wrote:
> > On Thu, 7 May 2015, Zach Brown wrote:
> >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> >> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> >> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> >> > > owning the file or having the CAP_FOWNER capability. If we're not
> >> > > comfortable allowing owners to prevent mtime/ctime updates then we
> >> > > should add a tunable to allow O_NOMTIME. Maybe a mount option?
> >> >
> >> > I dislike "turn off safety for performance" options because Joe
> >> > SpeedRacer will always select performance over safety.
> >>
> >> Well, for ceph there's no safety concern. They never use cmtime in
> >> these files.
> >>
> >> So are you suggesting not implementing this and making them rework their
> >> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> >> Speedracer more rope? Or are we talking about adding some speed bumps
> >> that ceph can flip on that might give Joe Speedracer pause?
> >
> > I think this is the fundamental question: who do we give the ammunition
> > to, the user or app writer, or the sysadmin?
> >
> > One might argue that we gave the user a similar power with O_NOATIME (the
> > power to break applications that assume atime is accurate). Here we give
> > developers/users the power to not update mtime and suffer the consequences
> > (like, obviously, breaking mtime-based backups). It should be pretty
> > obvious to anyone using the flag what the consequences are.
> >
> > Note that we can suffer similar lapses in mtime with fdatasync followed by
> > a system crash. And as Andy points out it's semi-broken for writable
> > mmap. The crash case is obviously a slightly different thing, but the
> > idea that mtime can't always be trusted certainly isn't crazy talk.
> >
> > Or, we can be conservative and require a mount option so that the admin
> > has to explicitly allow behavior that might break some existing
> > assumptions about mtime/ctime ('-o user_noatime' I guess?).
> >
> > I'm happy either way, so long as in the end an unprivileged ceph daemon
> > avoids the useless work. In our case we always own the entire mount/disk,
> > so a mount option is just fine.
> >
>
> So, what is the expectation here for filesystems that cannot support
> this flag? NFSv3 in particular would break pretty catastrophically if
> someone decided on a whim to turn off mtime: they will have turned off
> the client's ability to detect cache incoherencies.

Is this based on mtime or ctime? If the former, would things also break if a user does, say, some stat(2), write(2), utimes(2) shenanigans?

So, my assumption is that if the mount option isn't there allowing this then O_NOMTIME would be a no-op (as opposed to EPERM or something)... but maybe that's not the right thing to do. Whatever we do there, though, I suppose NFS would do the same thing?

sage
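The stat(2), write(2), utimes(2) sequence mentioned above can be sketched with existing POSIX calls, no new flag needed. This is a minimal illustration, not code from ceph or from the patch; the function name `overwrite_keep_mtime` is made up for the example, and it uses `futimens(2)` as the modern equivalent of utimes(2).

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative only: overwrite len bytes at offset off, then put the
 * original atime and mtime back. Returns 0 on success, -1 on error.
 *
 * The kernel still updates ctime; userspace cannot set it. The inode is
 * also still dirtied, so this hides the change from mtime-based tools
 * without saving any metadata IO.
 */
int overwrite_keep_mtime(const char *path, const void *buf, size_t len, off_t off)
{
    struct stat before;
    struct timespec times[2];
    int fd, ret = -1;

    assert(path != NULL && buf != NULL);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &before) != 0)
        goto out;
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        goto out;
    times[0] = before.st_atim;   /* restore access time */
    times[1] = before.st_mtim;   /* restore modification time */
    if (futimens(fd, times) != 0)
        goto out;
    ret = 0;
out:
    close(fd);
    return ret;
}
```

A consumer that compares only mtime would miss such a write, while one that also looks at ctime would still see it.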
On Thu, May 07, 2015 at 06:01:23PM -0700, Sage Weil wrote: > On Thu, 7 May 2015, Zach Brown wrote: > > On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote: > > > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote: > > > > The criteria for using O_NOMTIME is the same as for using O_NOATIME: > > > > owning the file or having the CAP_FOWNER capability. If we're not > > > > comfortable allowing owners to prevent mtime/ctime updates then we > > > > should add a tunable to allow O_NOMTIME. Maybe a mount option? > > > > > > I dislike "turn off safety for performance" options because Joe > > > SpeedRacer will always select performance over safety. > > > > Well, for ceph there's no safety concern. They never use cmtime in > > these files. > > > > So are you suggesting not implementing this and making them rework their > > IO paths to avoid the fs maintaining mtime so that we don't give Joe > > Speedracer more rope? Or are we talking about adding some speed bumps > > that ceph can flip on that might give Joe Speedracer pause? > > I think this is the fundamental question: who do we give the ammunition > to, the user or app writer, or the sysadmin? Yeah, I think this is right. Dave doesn't want the possibility of it bleeding in to installations through irresponsible default use in apps without explicit buy-in from the people responsible for the backups. > [...] > > Or, we can be conservative and require a mount option so that the admin > has to explicitly allow behavior that might break some existing > assumptions about mtime/ctime ('-o user_noatime' I guess?). > > I'm happy either way, so long as in the end an unprivileged ceph daemon > avoids the useless work. In our case we always own the entire mount/disk, > so a mount option is just fine. It seems that the thread has headed towards responding to my suggestion of a possible mount option with an enthusiastic "yes, please, no surprises." So I'll try that. 
- z
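The policy the thread converges on (the flag takes effect only if the administrator allowed it on the mount, and only for the file owner or a CAP_FOWNER holder) can be written down as a small decision function. This is a userspace toy model under stated assumptions, not kernel code: `struct toy_open_ctx`, its fields, and `toy_nomtime_honoured` are all hypothetical names, and no such mount option existed at the time of the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs to the decision; none of these are kernel types. */
struct toy_open_ctx {
    bool caller_owns_file;     /* caller's uid matches the file owner */
    bool caller_has_fowner;    /* caller holds CAP_FOWNER */
    bool mount_allows_nomtime; /* admin set the (hypothetical) mount option */
};

/*
 * Returns true if a request to suppress mtime updates would be honoured.
 * Without the mount option the request is simply not honoured; whether
 * that should be a silent no-op or an error is left open in the thread.
 */
bool toy_nomtime_honoured(const struct toy_open_ctx *ctx, bool requested)
{
    assert(ctx != NULL);
    if (!requested)
        return false;
    if (!ctx->mount_allows_nomtime)
        return false;
    return ctx->caller_owns_file || ctx->caller_has_fowner;
}
```

The owner-or-CAP_FOWNER part mirrors the O_NOATIME criteria quoted from the patch description; the mount option check is the extra gate that puts the decision in the administrator's hands.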
On Thu, May 07, 2015 at 09:23:24PM -0400, Trond Myklebust wrote:
> On Thu, May 7, 2015 at 9:01 PM, Sage Weil <sage@newdream.net> wrote:
> > On Thu, 7 May 2015, Zach Brown wrote:
> >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> >> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> >> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> >> > > owning the file or having the CAP_FOWNER capability. If we're not
> >> > > comfortable allowing owners to prevent mtime/ctime updates then we
> >> > > should add a tunable to allow O_NOMTIME. Maybe a mount option?
> >> >
> >> > I dislike "turn off safety for performance" options because Joe
> >> > SpeedRacer will always select performance over safety.
> >>
> >> Well, for ceph there's no safety concern. They never use cmtime in
> >> these files.
> >>
> >> So are you suggesting not implementing this and making them rework their
> >> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> >> Speedracer more rope? Or are we talking about adding some speed bumps
> >> that ceph can flip on that might give Joe Speedracer pause?
> >
> > I think this is the fundamental question: who do we give the ammunition
> > to, the user or app writer, or the sysadmin?
> >
> > One might argue that we gave the user a similar power with O_NOATIME (the
> > power to break applications that assume atime is accurate). Here we give
> > developers/users the power to not update mtime and suffer the consequences
> > (like, obviously, breaking mtime-based backups). It should be pretty
> > obvious to anyone using the flag what the consequences are.
> >
> > Note that we can suffer similar lapses in mtime with fdatasync followed by
> > a system crash. And as Andy points out it's semi-broken for writable
> > mmap. The crash case is obviously a slightly different thing, but the
> > idea that mtime can't always be trusted certainly isn't crazy talk.
> >
> > Or, we can be conservative and require a mount option so that the admin
> > has to explicitly allow behavior that might break some existing
> > assumptions about mtime/ctime ('-o user_noatime' I guess?).
> >
> > I'm happy either way, so long as in the end an unprivileged ceph daemon
> > avoids the useless work. In our case we always own the entire mount/disk,
> > so a mount option is just fine.
> >
>
> So, what is the expectation here for filesystems that cannot support
> this flag? NFSv3 in particular would break pretty catastrophically if
> someone decided on a whim to turn off mtime: they will have turned off
> the client's ability to detect cache incoherencies.

It's worse than that, now that I think about it. I think nomtime
will break nfsv4 as the I_VERSION check is done *after* the
NO[C]MTIME checks. e.g. the atomic change count used to detect file
changes is only updated during the mtime update on write() calls in
XFS. i.e. when the timestamp is changed, a transaction to change
mtime is run, and that transaction commit bumps the change count.

So cutting out mtime updates at the VFS will prevent XFS and other
I_VERSION aware filesystems from updating the change count that
NFSv4 clients rely on to detect foreign data changes in a file.

Not sure what to do here, because the current NOCMTIME
implementation intentionally cuts out the timestamp update because
its usage is fully invisible IO. i.e. it is used by utilities like
xfs_fsr and HSMs to move data into and out of files without the
application being able to detect the data movement in any way. These
are not data modification operations, though - the file contents as
read by the application do not change despite the fact we are moving
data in and out of the file. In this case we don't want timestamps
or change counters to change on the data movement, so I think we've
actually got a difference in behaviour here between O_NOMTIME and
O_NOCMTIME, right?

i.e.
for nfsv4 sanity O_NOMTIME still needs to bump I_VERSION on write, just not modify the timestamp? In which case, not modifying the timestamps gains us nothing, because the inode is still dirtied? The list of caveats on O_NOMTIME seems to be growing... Cheers, Dave.
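The ordering problem described in this message can be made concrete with a small userspace model. It is a deliberately simplified sketch and not the kernel's file_update_time(): `struct toy_inode`, `toy_update_time` and `toy_update_time_keep_version` are hypothetical names, and the change counter stands in for the I_VERSION count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the inode state under discussion. */
struct toy_inode {
    int64_t  mtime;        /* modification time, in seconds */
    uint64_t change_count; /* models the I_VERSION change counter */
    bool     dirty;        /* inode needs writing back */
};

/*
 * Models the ordering described above: the "no mtime" check returns
 * before the change counter is looked at, so neither mtime nor the
 * counter is updated and the inode stays clean.
 */
void toy_update_time(struct toy_inode *inode, int64_t now, bool nomtime)
{
    assert(inode != NULL);
    if (nomtime)
        return;
    inode->mtime = now;
    inode->change_count++;
    inode->dirty = true;
}

/*
 * Models the alternative raised at the end of the message: skip only the
 * timestamp but still bump the counter. The inode is dirtied either way,
 * so the metadata write that O_NOMTIME wanted to avoid still happens.
 */
void toy_update_time_keep_version(struct toy_inode *inode, int64_t now, bool nomtime)
{
    assert(inode != NULL);
    if (!nomtime)
        inode->mtime = now;
    inode->change_count++;
    inode->dirty = true;
}
```

In the first variant an NFSv4-style consumer of the counter never sees the write; in the second it does, at the cost of dirtying the inode on every write.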
On Sat, 9 May 2015, Dave Chinner wrote: > On Thu, May 07, 2015 at 09:23:24PM -0400, Trond Myklebust wrote: > > On Thu, May 7, 2015 at 9:01 PM, Sage Weil <sage@newdream.net> wrote: > > > On Thu, 7 May 2015, Zach Brown wrote: > > >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote: > > >> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote: > > >> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME: > > >> > > owning the file or having the CAP_FOWNER capability. If we're not > > >> > > comfortable allowing owners to prevent mtime/ctime updates then we > > >> > > should add a tunable to allow O_NOMTIME. Maybe a mount option? > > >> > > > >> > I dislike "turn off safety for performance" options because Joe > > >> > SpeedRacer will always select performance over safety. > > >> > > >> Well, for ceph there's no safety concern. They never use cmtime in > > >> these files. > > >> > > >> So are you suggesting not implementing this and making them rework their > > >> IO paths to avoid the fs maintaining mtime so that we don't give Joe > > >> Speedracer more rope? Or are we talking about adding some speed bumps > > >> that ceph can flip on that might give Joe Speedracer pause? > > > > > > I think this is the fundamental question: who do we give the ammunition > > > to, the user or app writer, or the sysadmin? > > > > > > One might argue that we gave the user a similar power with O_NOATIME (the > > > power to break applications that assume atime is accurate). Here we give > > > developers/users the power to not update mtime and suffer the consequences > > > (like, obviously, breaking mtime-based backups). It should be pretty > > > obvious to anyone using the flag what the consequences are. > > > > > > Note that we can suffer similar lapses in mtime with fdatasync followed by > > > a system crash. And as Andy points out it's semi-broken for writable > > > mmap. 
The crash case is obviously a slightly different thing, but the > > > idea that mtime can't always be trusted certainly isn't crazy talk. > > > > > > Or, we can be conservative and require a mount option so that the admin > > > has to explicitly allow behavior that might break some existing > > > assumptions about mtime/ctime ('-o user_noatime' I guess?). > > > > > > I'm happy either way, so long as in the end an unprivileged ceph daemon > > > avoids the useless work. In our case we always own the entire mount/disk, > > > so a mount option is just fine. > > > > > > > So, what is the expectation here for filesystems that cannot support > > this flag? NFSv3 in particular would break pretty catastrophically if > > someone decided on a whim to turn off mtime: they will have turned off > > the client's ability to detect cache incoherencies. > > It's worse than that, now that I think about it. I think nomtime > will break nfsv4 as the I_VERSION check is done *after* the > NO[C]MTIME checks. e.g. the atomic change count used to detect file > changes is only updated during the mtime update on write() calls in > XFS. i.e. when the timestamp is changed, a transaction to change > mtime is run, and that transaction commit bumps the change count. > > So cutting out mtime updates at the VFS will prevent XFS and other > I_VERSION aware filesystems from updating the change count that > NFSv4 clients rely on to detect foreign data changes in a file. > > Not sure what to do here, because the current NOCMTIME > implementation intentionally cuts out the timestamp update because > it's usage is fully invisible IO. i.e. it is used by utilities like > xfs_fsr and HSMs to move data into and out of files without the > application being able to detect the data movement in any way. These > are not data modification operations, though - the file contents as > read by the application do not change despite the fact we are moving > data in and out of the file. 
In this case we don't want timestamps
> or change counters to change on the data movement, so I think we've
> actually got a difference in behaviour here between O_NOMTIME and
> O_NOCMTIME, right?
>
> i.e. for nfsv4 sanity O_NOMTIME still needs to bump I_VERSION on
> write, just not modify the timestamp? In which case, not modifying
> the timestamps gains us nothing, because the inode is still dirtied?

Right: if we dirty the inode we've defeated the purpose of the patch.

> The list of caveats on O_NOMTIME seems to be growing...

...and remain consistent with our goals. We couldn't care less if NFS or
backup software or anything else doesn't notice these changes. This is
private data that is wholly managed by the ceph daemon. The goal is to
derive *some* value from the file system and avoid reimplementing it in
userspace (without the bits we don't need).

I'm sure you realize what we're trying to achieve is the same "invisible IO"
that the XFS open by handle ioctls do by default. Would you be more
comfortable if this option were only available to the generic
open_by_handle syscall, and not to open(2)?

sage
On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote: > On Sat, 9 May 2015, Dave Chinner wrote: >> On Thu, May 07, 2015 at 09:23:24PM -0400, Trond Myklebust wrote: >> > On Thu, May 7, 2015 at 9:01 PM, Sage Weil <sage@newdream.net> wrote: >> > > On Thu, 7 May 2015, Zach Brown wrote: >> > >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote: >> > >> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote: >> > >> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME: >> > >> > > owning the file or having the CAP_FOWNER capability. If we're not >> > >> > > comfortable allowing owners to prevent mtime/ctime updates then we >> > >> > > should add a tunable to allow O_NOMTIME. Maybe a mount option? >> > >> > >> > >> > I dislike "turn off safety for performance" options because Joe >> > >> > SpeedRacer will always select performance over safety. >> > >> >> > >> Well, for ceph there's no safety concern. They never use cmtime in >> > >> these files. >> > >> >> > >> So are you suggesting not implementing this and making them rework their >> > >> IO paths to avoid the fs maintaining mtime so that we don't give Joe >> > >> Speedracer more rope? Or are we talking about adding some speed bumps >> > >> that ceph can flip on that might give Joe Speedracer pause? >> > > >> > > I think this is the fundamental question: who do we give the ammunition >> > > to, the user or app writer, or the sysadmin? >> > > >> > > One might argue that we gave the user a similar power with O_NOATIME (the >> > > power to break applications that assume atime is accurate). Here we give >> > > developers/users the power to not update mtime and suffer the consequences >> > > (like, obviously, breaking mtime-based backups). It should be pretty >> > > obvious to anyone using the flag what the consequences are. >> > > >> > > Note that we can suffer similar lapses in mtime with fdatasync followed by >> > > a system crash. 
And as Andy points out it's semi-broken for writable >> > > mmap. The crash case is obviously a slightly different thing, but the >> > > idea that mtime can't always be trusted certainly isn't crazy talk. >> > > >> > > Or, we can be conservative and require a mount option so that the admin >> > > has to explicitly allow behavior that might break some existing >> > > assumptions about mtime/ctime ('-o user_noatime' I guess?). >> > > >> > > I'm happy either way, so long as in the end an unprivileged ceph daemon >> > > avoids the useless work. In our case we always own the entire mount/disk, >> > > so a mount option is just fine. >> > > >> > >> > So, what is the expectation here for filesystems that cannot support >> > this flag? NFSv3 in particular would break pretty catastrophically if >> > someone decided on a whim to turn off mtime: they will have turned off >> > the client's ability to detect cache incoherencies. >> >> It's worse than that, now that I think about it. I think nomtime >> will break nfsv4 as the I_VERSION check is done *after* the >> NO[C]MTIME checks. e.g. the atomic change count used to detect file >> changes is only updated during the mtime update on write() calls in >> XFS. i.e. when the timestamp is changed, a transaction to change >> mtime is run, and that transaction commit bumps the change count. >> >> So cutting out mtime updates at the VFS will prevent XFS and other >> I_VERSION aware filesystems from updating the change count that >> NFSv4 clients rely on to detect foreign data changes in a file. >> >> Not sure what to do here, because the current NOCMTIME >> implementation intentionally cuts out the timestamp update because >> it's usage is fully invisible IO. i.e. it is used by utilities like >> xfs_fsr and HSMs to move data into and out of files without the >> application being able to detect the data movement in any way. 
These >> are not data modification operations, though - the file contents as >> read by the application do not change despite the fact we are moving >> data in and out of the file. In this case we don't want timestamps >> or change counters to change on the data movement, so I think we've >> actually got a difference in behaviour here between O_NOMTIME and >> O_NOCMTIME, right? >> >> i.e. for nfsv4 sanity O_NOMTIME still needs to bump I_VERSION on >> write, just not modify the timestamp? In which case, not modifying >> the timestamps gains us nothing, because the inode is still dirtied? > > Right: if we dirty the inode we've defeated the purpose of the patch. > >> The list of caveats on O_NOMTIME seems to be growing... > > ...and remain consistent with our goals. We couldn't care less if NFS or > backup software or anything else doesn't notice these changes. This is > private data that is wholly managed by the ceph daemon. The goal is to > derive *some* value from the file system and avoid reimplementing it in > userspace (without the bits we don't need). That makes it completely non-generic though. By putting this in the VFS, you are giving applications a loaded gun that is pointed straight at the application user's head. > I'm sure you realize what we're try to achieve is the same "invisible IO" > that the XFS open by handle ioctls do by default. Would you be more > comfortable if this option where only available to the generic > open_by_handle syscall, and not to open(2)? It should be an ioctl(). It has no business being part of open_by_handle either, since that is another generic interface. Cheers Trond
On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote:
> > I'm sure you realize what we're try to achieve is the same "invisible IO"
> > that the XFS open by handle ioctls do by default. Would you be more
> > comfortable if this option where only available to the generic
> > open_by_handle syscall, and not to open(2)?
>
> It should be an ioctl(). It has no business being part of
> open_by_handle either, since that is another generic interface.

I'm happy for it to be an ioctl interface - even an XFS-specific
interface if you want to go that route, Sage - and it probably should
emit a warning to syslog the first time it is used so there is a trace
for bug triage purposes. That way we know the app is not using mtime
updates, so bug reports that are the result of mtime mishandling don't
result in large amounts of wasted developer time trying to understand
them...

Cheers,

Dave.
On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> That makes it completely non-generic though. By putting this in the
> VFS, you are giving applications a loaded gun that is pointed straight
> at the application user's head.

Let me re-ask the question that I asked last week (and was apparently
ignored). Why not try to use the lazytime feature instead of pointing a
gun straight at the application's --- and system administrators' ---
heads?

- Ted
On Mon, 11 May 2015, Theodore Ts'o wrote:
> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> > That makes it completely non-generic though. By putting this in the
> > VFS, you are giving applications a loaded gun that is pointed straight
> > at the application user's head.
>
> Let me re-ask the question that I asked last week (and was apparently
> ignored). Why not trying to use the lazytime feature instead of
> pointing a head straight at the application's --- and system
> administrators' --- heads?

Sorry Ted, I thought I responded already.

The goal is to avoid inode writeout entirely when we can, and as I
understand it lazytime will still force writeout before the inode is
dropped from the cache. In systems like Ceph in particular, the IOs can
be spread across lots of files, so simply deferring writeout doesn't
always help.

sage
On Mon, 11 May 2015, Dave Chinner wrote: > On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote: > > On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote: > > > I'm sure you realize what we're try to achieve is the same "invisible IO" > > > that the XFS open by handle ioctls do by default. Would you be more > > > comfortable if this option where only available to the generic > > > open_by_handle syscall, and not to open(2)? > > > > It should be an ioctl(). It has no business being part of > > open_by_handle either, since that is another generic interface. Our use-case doesn't make sense on network file systems, but it does on any reasonably featureful local filesystem, and the goal is to be generic there. If mtime is critical to a network file system's consistency it seems pretty reasonable to disallow/ignore it for just that file system (e.g., by masking off the flag at open time), as others won't have that same problem (cephfs doesn't, for example). Perhaps making each fs opt-in instead of handling it in a generic path would alleviate this concern? > I'm happy for it to be an ioctl interface - even an XFS specific > interface if you want to go that route, Sage - and it probably > should emit a warning to syslog first time it is used so there is > trace for bug triage purposes. i.e. we know the app is not using > mtime updates, so bug reports that are the result of mtime > mishandling don't result in large amounts of wasted developer time > trying to understand them... A warning on using the interface (or when mounting with user_nomtime) sounds reasonable. I'd rather not make this XFS specific as other local filesystmes (ext4, f2fs, possibly btrfs) would similarly benefit. (And if we want to target XFS specifically the existing XFS open-by-handle ioctl is sufficient as it already does O_NOMTIME unconditionally.) 
sage
On Mon, May 11, 2015 at 12:39 PM, Sage Weil <sage@newdream.net> wrote: > On Mon, 11 May 2015, Dave Chinner wrote: >> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote: >> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote: >> > > I'm sure you realize what we're try to achieve is the same "invisible IO" >> > > that the XFS open by handle ioctls do by default. Would you be more >> > > comfortable if this option where only available to the generic >> > > open_by_handle syscall, and not to open(2)? >> > >> > It should be an ioctl(). It has no business being part of >> > open_by_handle either, since that is another generic interface. > > Our use-case doesn't make sense on network file systems, but it does on > any reasonably featureful local filesystem, and the goal is to be generic > there. If mtime is critical to a network file system's consistency it > seems pretty reasonable to disallow/ignore it for just that file system > (e.g., by masking off the flag at open time), as others won't have that > same problem (cephfs doesn't, for example). > > Perhaps making each fs opt-in instead of handling it in a generic path > would alleviate this concern? The issue isn't whether or not you have a network file system, it's whether or not you want users to be able to manage data. mtime isn't useful for the application (which knows whether or not it has changed the file) or for the filesystem (ditto). It exists, rather, in order to enable data management by users and other applications, letting them know whether or not the data contents of the file have changed, and when that change occurred. If you are able to guarantee that your users don't care about that, then fine, but that would be a very special case that doesn't fit the way that most data centres are run. Backups are one case where mtime matters, tiering and archiving is another. Neither of these examples cases are under the control of the application that calls open(O_NOMTIME). 
>> I'm happy for it to be an ioctl interface - even an XFS specific >> interface if you want to go that route, Sage - and it probably >> should emit a warning to syslog first time it is used so there is >> trace for bug triage purposes. i.e. we know the app is not using >> mtime updates, so bug reports that are the result of mtime >> mishandling don't result in large amounts of wasted developer time >> trying to understand them... > > A warning on using the interface (or when mounting with user_nomtime) > sounds reasonable. > > I'd rather not make this XFS specific as other local filesystmes (ext4, > f2fs, possibly btrfs) would similarly benefit. (And if we want to target > XFS specifically the existing XFS open-by-handle ioctl is sufficient as it > already does O_NOMTIME unconditionally.) Lack of a namespace, doesn't imply that you don't want to manage the data. The whole point of using object storage instead of plain old block storage is to be able to provide whatever metadata you still need in order to manage the object. Cheers Trond -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 11 May 2015, Trond Myklebust wrote: > On Mon, May 11, 2015 at 12:39 PM, Sage Weil <sage@newdream.net> wrote: > > On Mon, 11 May 2015, Dave Chinner wrote: > >> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote: > >> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote: > >> > > I'm sure you realize what we're try to achieve is the same "invisible IO" > >> > > that the XFS open by handle ioctls do by default. Would you be more > >> > > comfortable if this option where only available to the generic > >> > > open_by_handle syscall, and not to open(2)? > >> > > >> > It should be an ioctl(). It has no business being part of > >> > open_by_handle either, since that is another generic interface. > > > > Our use-case doesn't make sense on network file systems, but it does on > > any reasonably featureful local filesystem, and the goal is to be generic > > there. If mtime is critical to a network file system's consistency it > > seems pretty reasonable to disallow/ignore it for just that file system > > (e.g., by masking off the flag at open time), as others won't have that > > same problem (cephfs doesn't, for example). > > > > Perhaps making each fs opt-in instead of handling it in a generic path > > would alleviate this concern? > > The issue isn't whether or not you have a network file system, it's > whether or not you want users to be able to manage data. mtime isn't > useful for the application (which knows whether or not it has changed > the file) or for the filesystem (ditto). It exists, rather, in order > to enable data management by users and other applications, letting > them know whether or not the data contents of the file have changed, > and when that change occurred. Agreed. > If you are able to guarantee that your users don't care about that, > then fine, but that would be a very special case that doesn't fit the > way that most data centres are run. 
Backups are one case where mtime > matters, tiering and archiving is another. This is true, although I argue it is becoming increasingly common for the data management (including backups and so forth) to be layered not on top of the POSIX file system but on something higher up in the stack. This is true of pretty much any distributed system (ceph, cassandra, mongo, etc., and I assume commercial databases like Oracle, too) where backups, replication, and any other DR strategies need to be orchestrated across nodes to be consistent--simply copying files out from underneath them is already insufficient and a recipe for disaster. There is a growing category of applications that can benefit from this capability... > Neither of these examples > cases are under the control of the application that calls > open(O_NOMTIME). Wouldn't a mount option (e.g., allow_nomtime) address this concern? Only nodes provisioned explicitly to run these systems would be enable this option. > >> I'm happy for it to be an ioctl interface - even an XFS specific > >> interface if you want to go that route, Sage - and it probably > >> should emit a warning to syslog first time it is used so there is > >> trace for bug triage purposes. i.e. we know the app is not using > >> mtime updates, so bug reports that are the result of mtime > >> mishandling don't result in large amounts of wasted developer time > >> trying to understand them... > > > > A warning on using the interface (or when mounting with user_nomtime) > > sounds reasonable. > > > > I'd rather not make this XFS specific as other local filesystmes (ext4, > > f2fs, possibly btrfs) would similarly benefit. (And if we want to target > > XFS specifically the existing XFS open-by-handle ioctl is sufficient as it > > already does O_NOMTIME unconditionally.) > > Lack of a namespace, doesn't imply that you don't want to manage the > data. 
The whole point of using object storage instead of plain old > block storage is to be able to provide whatever metadata you still > need in order to manage the object. Yeah, agreed--this is presumably why open_by_handle(2) (which is what we'd like to use) doesn't assume O_NOMTIME. Thanks! sage -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 08, 2015 at 09:44:25AM -0500, Eric Sandeen wrote: > On 5/7/15 10:24 PM, Andy Lutomirski wrote: > > On May 8, 2015 8:11 AM, "Dave Chinner" <david@fromorbit.com> wrote: > >> > >> On Thu, May 07, 2015 at 10:20:53AM -0700, Zach Brown wrote: > >>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote: > >>>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote: > >>>>> Add the O_NOMTIME flag which prevents mtime from being updated which can > >>>>> greatly reduce the IO overhead of writes to allocated and initialized > >>>>> regions of files. > >>>> > >>>> Hmmm. How do backup programs now work out if the file has changed > >>>> and hence needs copying again? ie. applications using this will > >>>> break other critical infrastructure in subtle ways. > >>> > >>> By using backup infrastructure that doesn't use cmtime. Like btrfs > >>> send/recv. Or application level backups that know how to do > >>> incrementals from metadata in giant database files, say, without > >>> walking, comparing, and copying the entire thing. > >> > >> "Use magical thing that doesn't exist"? Really? > >> > >> e.g. you can't do incremental backups with tools like xfsdump if > >> mtime is not being updated. The last thing an admin wants when > >> doing disaster recovery is to find out that the app started using > >> O_NOMTIME as a result of the upgrade they did 6 months ago. Hence > >> the last 6 months of production data isn't in the backups despite > >> the backup procedure having been extensively tested and verified > >> when it was first put in place. > >> > >>>>> The criteria for using O_NOMTIME is the same as for using O_NOATIME: > >>>>> owning the file or having the CAP_FOWNER capability. If we're not > >>>>> comfortable allowing owners to prevent mtime/ctime updates then we > >>>>> should add a tunable to allow O_NOMTIME. Maybe a mount option? 
> >>>> > >>>> I dislike "turn off safety for performance" options because Joe > >>>> SpeedRacer will always select performance over safety. > >>> > >>> Well, for ceph there's no safety concern. They never use cmtime in > >>> these files. > >> > >> Understood. > >> > >>> So are you suggesting not implementing this > >> > >> No. > >> > >>> Or are we talking about adding some speed bumps > >>> that ceph can flip on that might give Joe Speedracer pause? > >> > >> Yes, but not just Joe Speedracer - if it can be turned on silently > >> by apps then it's a great big landmine that most users and sysadmins > >> will not know about until it is too late. > > > > What about programs like tar that explicitly override mtime? No admin > > buy-in is required for that. Admittedly, that doesn't affect ctime, > > nor is it as likely to bite unexpectedly as a nomtime flag. > > > > I think it would be reasonably safe if a mount option had to be set to > > allow O_NOCMTIME or such. > > I was going to suggest the same. Make infrastructure available for an app > to request O_NOMTIME, but a mount option must be set to allow it, so the > administrator doesn't get an unhappy surprise at backup-restore time. > > (Not a big fan of more twiddly knobs, but that seems to put the control > in all the right places). It seems more like a permanent feature of the filesystem than a per-mount option: once you've turned off mtime updates you lose information that can't be regained after remounting. A mkfs option might make more sense? But I guess those aren't very generic. (I do hope we can get an O_NOMTIME flag, it will make me smile every time I see it....) --b. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
> > Let me re-ask the question that I asked last week (and was apparently
> > ignored). Why not trying to use the lazytime feature instead of
> > pointing a head straight at the application's --- and system
> > administrators' --- heads?
>
> Sorry Ted, I thought I responded already.
>
> The goal is to avoid inode writeout entirely when we can, and
> as I understand it lazytime will still force writeout before the inode
> is dropped from the cache. In systems like Ceph in particular, the
> IOs can be spread across lots of files, so simply deferring writeout
> doesn't always help.

Sure, but it would reduce the writeout by orders of magnitude. I can
understand if you want to reduce it further, but it might be good
enough for your purposes.

I considered doing the equivalent of O_NOMTIME for our purposes at
$WORK, and our use case is actually not that different from Ceph's
(i.e., using a local disk file system to support a cluster file
system). lazytime was (a) something I figured I could upstream in good
conscience, and (b) more than good enough for us.

Cheers,

- Ted

P.S. I do agree that if we do need this upstream, requiring a mount
option to enable the feature is probably a good compromise.
On Mon, May 11, 2015 at 10:30:58AM -0700, Sage Weil wrote: > On Mon, 11 May 2015, Trond Myklebust wrote: > > On Mon, May 11, 2015 at 12:39 PM, Sage Weil <sage@newdream.net> wrote: > > > On Mon, 11 May 2015, Dave Chinner wrote: > > >> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote: > > >> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote: > > >> > > I'm sure you realize what we're try to achieve is the same "invisible IO" > > >> > > that the XFS open by handle ioctls do by default. Would you be more > > >> > > comfortable if this option where only available to the generic > > >> > > open_by_handle syscall, and not to open(2)? > > >> > > > >> > It should be an ioctl(). It has no business being part of > > >> > open_by_handle either, since that is another generic interface. > > > > > > Our use-case doesn't make sense on network file systems, but it does on > > > any reasonably featureful local filesystem, and the goal is to be generic > > > there. If mtime is critical to a network file system's consistency it > > > seems pretty reasonable to disallow/ignore it for just that file system > > > (e.g., by masking off the flag at open time), as others won't have that > > > same problem (cephfs doesn't, for example). > > > > > > Perhaps making each fs opt-in instead of handling it in a generic path > > > would alleviate this concern? > > > > The issue isn't whether or not you have a network file system, it's > > whether or not you want users to be able to manage data. mtime isn't > > useful for the application (which knows whether or not it has changed > > the file) or for the filesystem (ditto). It exists, rather, in order > > to enable data management by users and other applications, letting > > them know whether or not the data contents of the file have changed, > > and when that change occurred. > > Agreed. 
> > > If you are able to guarantee that your users don't care about that, > > then fine, but that would be a very special case that doesn't fit the > > way that most data centres are run. Backups are one case where mtime > > matters, tiering and archiving is another. > > This is true, although I argue it is becoming increasingly common for the > data management (including backups and so forth) to be layered not on top > of the POSIX file system but on something higher up in the stack. This is In the cloud storage world, yes. In the rest of the world, no. It's the rest of the world we are worried about here. :/ > > Neither of these examples > > cases are under the control of the application that calls > > open(O_NOMTIME). > > Wouldn't a mount option (e.g., allow_nomtime) address this concern? Only > nodes provisioned explicitly to run these systems would be enable this > option. Back to my Joe Speedracer comments..... I'm not sure what the right answer is - mount options are simply too easy to add without understanding the full implications of them. e.g. we didn't merge FALLOC_FL_NO_HIDE_STALE simply because it was too dangerous for unsuspecting users. This isn't at that same level or concern, but it's still a landmine we want to avoid users from arming without realising it... > > >> I'm happy for it to be an ioctl interface - even an XFS specific > > >> interface if you want to go that route, Sage - and it probably > > >> should emit a warning to syslog first time it is used so there is > > >> trace for bug triage purposes. i.e. we know the app is not using > > >> mtime updates, so bug reports that are the result of mtime > > >> mishandling don't result in large amounts of wasted developer time > > >> trying to understand them... > > > > > > A warning on using the interface (or when mounting with user_nomtime) > > > sounds reasonable. 
> > > > > > I'd rather not make this XFS specific as other local filesystmes (ext4, > > > f2fs, possibly btrfs) would similarly benefit. (And if we want to target > > > XFS specifically the existing XFS open-by-handle ioctl is sufficient as it > > > already does O_NOMTIME unconditionally.) > > > > Lack of a namespace, doesn't imply that you don't want to manage the > > data. The whole point of using object storage instead of plain old > > block storage is to be able to provide whatever metadata you still > > need in order to manage the object. > > Yeah, agreed--this is presumably why open_by_handle(2) (which is what we'd > like to use) doesn't assume O_NOMTIME. Right - the XFS ioctls were designed specifically for applications that interacted directly with the structure of XFS filesystems and so needed invisible IO (e.g. online defragmenter). IOWs, they are not interfaces intended for general usage. They are also only available to root, so a typical user application won't be making use of them, either. Cheers, Dave.
On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
> > > Let me re-ask the question that I asked last week (and was apparently
> > > ignored). Why not trying to use the lazytime feature instead of
> > > pointing a head straight at the application's --- and system
> > > administrators' --- heads?
> >
> > Sorry Ted, I thought I responded already.
> >
> > The goal is to avoid inode writeout entirely when we can, and
> > as I understand it lazytime will still force writeout before the inode
> > is dropped from the cache. In systems like Ceph in particular, the
> > IOs can be spread across lots of files, so simply deferring writeout
> > doesn't always help.
>
> Sure, but it would reduce the writeout by orders of magnitude. I can
> understand if you want to reduce it further, but it might be good
> enough for your purposes.
>
> I considered doing the equivalent of O_NOMTIME for our purposes at
> $WORK, and our use case is actually not that different from Ceph's
> (i.e., using a local disk file system to support a cluster file
> system), and lazytime was (a) something I figured was something I
> could upstream in good conscience, and (b) was more than good enough
> for us.

A safer alternative might be a chattr file attribute: if it is set, the
mtime is not updated on writes, and stat() on the file always shows the
mtime as "right now". At least that way, the file won't accidentally
get left out of backups that rely on the mtime.

(When the file attribute is cleared, you immediately update the mtime
as well, and from then on the file is back to normal.)

- Kevin
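Editorial sketch of the semantics Kevin proposes above: a small user-space model in C. The attribute does not exist, and every name here (attr_inode, nomtime_attr, attr_write, attr_stat_mtime, attr_clear) is hypothetical. The model assumes three rules taken from the message: writes skip the mtime update while the attribute is set, stat() reports the current time while it is set, and clearing the attribute stamps the mtime once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a per-file "no mtime updates" attribute. */
struct attr_inode {
    bool    nomtime_attr;   /* the proposed chattr-style attribute */
    int64_t stored_mtime;   /* mtime as it would be stored in the inode */
};

/* A write only touches the stored mtime when the attribute is clear. */
void attr_write(struct attr_inode *ino, int64_t now)
{
    assert(ino != NULL);
    if (!ino->nomtime_attr)
        ino->stored_mtime = now;
}

/* stat() reports "right now" while the attribute is set. */
int64_t attr_stat_mtime(const struct attr_inode *ino, int64_t now)
{
    assert(ino != NULL);
    return ino->nomtime_attr ? now : ino->stored_mtime;
}

/* Clearing the attribute stamps the mtime once; the file is normal again. */
void attr_clear(struct attr_inode *ino, int64_t now)
{
    assert(ino != NULL);
    if (ino->nomtime_attr) {
        ino->nomtime_attr = false;
        ino->stored_mtime = now;
    }
}
```

The point of the design is the failure direction: an mtime-based backup tool sees such a file as always modified, so it copies too much rather than too little.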
On 2015-05-12 01:08, Kevin Easton wrote:
> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
>> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
>>>> Let me re-ask the question that I asked last week (and was apparently
>>>> ignored). Why not trying to use the lazytime feature instead of
>>>> pointing a head straight at the application's --- and system
>>>> administrators' --- heads?
>>>
>>> Sorry Ted, I thought I responded already.
>>>
>>> The goal is to avoid inode writeout entirely when we can, and
>>> as I understand it lazytime will still force writeout before the inode
>>> is dropped from the cache. In systems like Ceph in particular, the
>>> IOs can be spread across lots of files, so simply deferring writeout
>>> doesn't always help.
>>
>> Sure, but it would reduce the writeout by orders of magnitude. I can
>> understand if you want to reduce it further, but it might be good
>> enough for your purposes.
>>
>> I considered doing the equivalent of O_NOMTIME for our purposes at
>> $WORK, and our use case is actually not that different from Ceph's
>> (i.e., using a local disk file system to support a cluster file
>> system), and lazytime was (a) something I figured was something I
>> could upstream in good conscience, and (b) was more than good enough
>> for us.
>
> A safer alternative might be a chattr file attribute that if set, the
> mtime is not updated on writes, and stat() on the file always shows the
> mtime as "right now". At least that way, the file won't accidentally
> get left out of backups that rely on the mtime.
>
> (If the file attribute is unset, you immediately update the mtime then
> too, and from then on the file is back to normal).
>

I like this even better than the flag suggestion: it provides better
control, means that you don't need to update applications to get the
benefits, and prevents backup software from breaking (although backups
would be bigger).
>>>>> "Sage" == Sage Weil <sage@newdream.net> writes: Sage> On Mon, 11 May 2015, Trond Myklebust wrote: >> On Mon, May 11, 2015 at 12:39 PM, Sage Weil <sage@newdream.net> wrote: >> > On Mon, 11 May 2015, Dave Chinner wrote: >> >> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote: >> >> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote: >> >> > > I'm sure you realize what we're try to achieve is the same "invisible IO" >> >> > > that the XFS open by handle ioctls do by default. Would you be more >> >> > > comfortable if this option where only available to the generic >> >> > > open_by_handle syscall, and not to open(2)? >> >> > >> >> > It should be an ioctl(). It has no business being part of >> >> > open_by_handle either, since that is another generic interface. >> > >> > Our use-case doesn't make sense on network file systems, but it does on >> > any reasonably featureful local filesystem, and the goal is to be generic >> > there. If mtime is critical to a network file system's consistency it >> > seems pretty reasonable to disallow/ignore it for just that file system >> > (e.g., by masking off the flag at open time), as others won't have that >> > same problem (cephfs doesn't, for example). >> > >> > Perhaps making each fs opt-in instead of handling it in a generic path >> > would alleviate this concern? >> >> The issue isn't whether or not you have a network file system, it's >> whether or not you want users to be able to manage data. mtime isn't >> useful for the application (which knows whether or not it has changed >> the file) or for the filesystem (ditto). It exists, rather, in order >> to enable data management by users and other applications, letting >> them know whether or not the data contents of the file have changed, >> and when that change occurred. Sage> Agreed. 
>> If you are able to guarantee that your users don't care about that, >> then fine, but that would be a very special case that doesn't fit the >> way that most data centres are run. Backups are one case where mtime >> matters, tiering and archiving is another. Sage> This is true, although I argue it is becoming increasingly Sage> common for the data management (including backups and so forth) Sage> to be layered not on top of the POSIX file system but on Sage> something higher up in the stack. This is true of pretty much Sage> any distributed system (ceph, cassandra, mongo, etc., and I Sage> assume commercial databases like Oracle, too) where backups, Sage> replication, and any other DR strategies need to be orchestrated Sage> across nodes to be consistent--simply copying files out from Sage> underneath them is already insufficient and a recipe for Sage> disaster. you're smoking crack here. Backups are not layered at higher layers unless absolutely necessary, such as for databases. Now Mongo, Hadoop and others might also fit this model, but for day to day backup of data, it's mtime all the way. I don't see why you insist that this is a good idea to implement for a very special corner case. Sage> There is a growing category of applications that can benefit Sage> from this capability... There is a perceived growing category of super special niche applications which might think they want this capability. Why are you even using a filesystem in the first place if you're so worried about writing out inodes being a performance problem? Just use raw partitions and do all the work yourself. Oracle and other DBs can do this when they want. >> Neither of these examples >> cases are under the control of the application that calls >> open(O_NOMTIME). Sage> Wouldn't a mount option (e.g., allow_nomtime) address this Sage> concern? Only nodes provisioned explicitly to run these systems Sage> would be enable this option. Why do you keep coming back to a mount option? 
What's wrong with a per-file ioctl option?

Making this a mount option means that you default to a fail-hard
setup. If someone screws up and mounts user home directories with this
option thinking that it's like the noatime option, then suddenly all
their backups will silently break unless they're aware of disk space
churn numbers and notice that they are only backing up tiny bits.

With an ioctl, it's up to the damn application to *request* this
change, and then the VFS/filesystem can *maybe* support this, but the
application shouldn't actually know or care what the result is; it's
just a performance hint/request.

We should default to sane semantics and not give out such a big
foot-gun if at all possible. I'm a sysadm by day (and night, evening,
early morning... :-) and I know my users don't think about things like
this. They don't even think about backups until they want to restore
something. Users only care about restores, not backups.

John
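Editorial sketch of the "ioctl as a hint" pattern John describes above, written in C for Linux. No per-file nomtime flag exists, so this uses the existing FS_IOC_GETFLAGS / FS_IOC_SETFLAGS ioctls with the existing FS_NOATIME_FL flag purely as a stand-in. The function name request_noatime_hint is hypothetical. The assumption being illustrated is that the caller treats refusal as a non-event and carries on with normal semantics.

```c
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Ask the filesystem to set a per-file flag, treating the request as a
 * hint. FS_NOATIME_FL is a stand-in for a nomtime flag that does not
 * exist. Returns 1 if the flag was applied, 0 if the filesystem (or
 * the kernel's permission checks) declined for any reason.
 */
int request_noatime_hint(int fd)
{
    int flags = 0;

    /* Read the current per-file flags; many filesystems return ENOTTY. */
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
        return 0;

    flags |= FS_NOATIME_FL;

    /* Write them back; failure just means the hint was not taken. */
    if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
        return 0;

    return 1;
}
```

A caller would invoke this once after open() and ignore the return value except for logging, which matches the "shouldn't actually know or care what the result is" framing in the message.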
>>>>> "Austin" == Austin S Hemmelgarn <ahferroin7@gmail.com> writes: Austin> On 2015-05-12 01:08, Kevin Easton wrote: >> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote: >>> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote: >>>>> Let me re-ask the question that I asked last week (and was apparently >>>>> ignored). Why not trying to use the lazytime feature instead of >>>>> pointing a head straight at the application's --- and system >>>>> administrators' --- heads? >>>> >>>> Sorry Ted, I thought I responded already. >>>> >>>> The goal is to avoid inode writeout entirely when we can, and >>>> as I understand it lazytime will still force writeout before the inode >>>> is dropped from the cache. In systems like Ceph in particular, the >>>> IOs can be spread across lots of files, so simply deferring writeout >>>> doesn't always help. >>> >>> Sure, but it would reduce the writeout by orders of magnitude. I can >>> understand if you want to reduce it further, but it might be good >>> enough for your purposes. >>> >>> I considered doing the equivalent of O_NOMTIME for our purposes at >>> $WORK, and our use case is actually not that different from Ceph's >>> (i.e., using a local disk file system to support a cluster file >>> system), and lazytime was (a) something I figured was something I >>> could upstream in good conscience, and (b) was more than good enough >>> for us. >> >> A safer alternative might be a chattr file attribute that if set, the >> mtime is not updated on writes, and stat() on the file always shows the >> mtime as "right now". At least that way, the file won't accidentally >> get left out of backups that rely on the mtime. >> >> (If the file attribute is unset, you immediately update the mtime then >> too, and from then on the file is back to normal). 
>> Austin> I like this even better than the flag suggestion, it provides Austin> better control, means that you don't need to update Austin> applications to get the benefits, and prevents backup software Austin> from breaking (although backups would be bigger). Me too, it fails in a safer mode, where you do more work on backups than strictly needed. I'm still against this as a mount option though, way way way too many bullets in the foot gun. And as someone else said, once you mount with O_NOMTIME, then unmount, then mount again without O_NOMTIME, you've lost information. Not good.
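The "report mtime as right now" semantics proposed above can be modelled in a few lines. This is a sketch of the idea only: `nomtime_attr` is a hypothetical per-file attribute, and nothing here corresponds to an existing kernel or chattr interface.

```python
import time


def reported_mtime_ns(stored_mtime_ns, nomtime_attr, now_ns=None):
    """mtime a stat()-like call would report under the proposed attribute.

    With the (hypothetical) attribute set, the stored value is ignored and
    the current time is reported instead.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    return now_ns if nomtime_attr else stored_mtime_ns


def needs_backup(stored_mtime_ns, nomtime_attr, last_backup_ns, now_ns):
    """An mtime-based backup decision using the reported value."""
    reported = reported_mtime_ns(stored_mtime_ns, nomtime_attr, now_ns)
    return reported > last_backup_ns
```

A flagged file is always selected, which is the "safer" failure: backups do extra work rather than miss data.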
On Tue, May 12, 2015 at 09:54:27AM -0400, John Stoffel wrote: > >>>>> "Austin" == Austin S Hemmelgarn <ahferroin7@gmail.com> writes: > > Austin> On 2015-05-12 01:08, Kevin Easton wrote: > >> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote: > >>> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote: > >>>>> Let me re-ask the question that I asked last week (and was apparently > >>>>> ignored). Why not trying to use the lazytime feature instead of > >>>>> pointing a head straight at the application's --- and system > >>>>> administrators' --- heads? > >>>> > >>>> Sorry Ted, I thought I responded already. > >>>> > >>>> The goal is to avoid inode writeout entirely when we can, and > >>>> as I understand it lazytime will still force writeout before the inode > >>>> is dropped from the cache. In systems like Ceph in particular, the > >>>> IOs can be spread across lots of files, so simply deferring writeout > >>>> doesn't always help. > >>> > >>> Sure, but it would reduce the writeout by orders of magnitude. I can > >>> understand if you want to reduce it further, but it might be good > >>> enough for your purposes. > >>> > >>> I considered doing the equivalent of O_NOMTIME for our purposes at > >>> $WORK, and our use case is actually not that different from Ceph's > >>> (i.e., using a local disk file system to support a cluster file > >>> system), and lazytime was (a) something I figured was something I > >>> could upstream in good conscience, and (b) was more than good enough > >>> for us. > >> > >> A safer alternative might be a chattr file attribute that if set, the > >> mtime is not updated on writes, and stat() on the file always shows the > >> mtime as "right now". At least that way, the file won't accidentally > >> get left out of backups that rely on the mtime. > >> > >> (If the file attribute is unset, you immediately update the mtime then > >> too, and from then on the file is back to normal). 
> >> > > Austin> I like this even better than the flag suggestion, it provides > Austin> better control, means that you don't need to update > Austin> applications to get the benefits, and prevents backup software > Austin> from breaking (although backups would be bigger). > > Me too, it fails in a safer mode, where you do more work on backups > than strictly needed. I'm still against this as a mount option > though, way way way too many bullets in the foot gun. And as someone > else said, once you mount with O_NOMTIME, then unmount, then mount > again without O_NOMTIME, you've lost information. Not good. That was me. Zach also pointed out to me that'd mean figuring out where to store that information on-disk for every filesystem you care about. I like the idea of something persistent, but maybe it's more trouble than it's worth--I honestly don't know. --b.
On 2015-05-12 10:36, J. Bruce Fields wrote: > On Tue, May 12, 2015 at 09:54:27AM -0400, John Stoffel wrote: >>>>>>> "Austin" == Austin S Hemmelgarn <ahferroin7@gmail.com> writes: >> >> Austin> On 2015-05-12 01:08, Kevin Easton wrote: >>>> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote: >>>>> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote: >>>>>>> Let me re-ask the question that I asked last week (and was apparently >>>>>>> ignored). Why not trying to use the lazytime feature instead of >>>>>>> pointing a head straight at the application's --- and system >>>>>>> administrators' --- heads? >>>>>> >>>>>> Sorry Ted, I thought I responded already. >>>>>> >>>>>> The goal is to avoid inode writeout entirely when we can, and >>>>>> as I understand it lazytime will still force writeout before the inode >>>>>> is dropped from the cache. In systems like Ceph in particular, the >>>>>> IOs can be spread across lots of files, so simply deferring writeout >>>>>> doesn't always help. >>>>> >>>>> Sure, but it would reduce the writeout by orders of magnitude. I can >>>>> understand if you want to reduce it further, but it might be good >>>>> enough for your purposes. >>>>> >>>>> I considered doing the equivalent of O_NOMTIME for our purposes at >>>>> $WORK, and our use case is actually not that different from Ceph's >>>>> (i.e., using a local disk file system to support a cluster file >>>>> system), and lazytime was (a) something I figured was something I >>>>> could upstream in good conscience, and (b) was more than good enough >>>>> for us. >>>> >>>> A safer alternative might be a chattr file attribute that if set, the >>>> mtime is not updated on writes, and stat() on the file always shows the >>>> mtime as "right now". At least that way, the file won't accidentally >>>> get left out of backups that rely on the mtime. >>>> >>>> (If the file attribute is unset, you immediately update the mtime then >>>> too, and from then on the file is back to normal). 
>>>> >> >> Austin> I like this even better than the flag suggestion, it provides >> Austin> better control, means that you don't need to update >> Austin> applications to get the benefits, and prevents backup software >> Austin> from breaking (although backups would be bigger). >> >> Me too, it fails in a safer mode, where you do more work on backups >> than strictly needed. I'm still against this as a mount option >> though, way way way too many bullets in the foot gun. And as someone >> else said, once you mount with O_NOMTIME, then unmount, then mount >> again without O_NOMTIME, you've lost information. Not good. > > That was me. Zach also pointed out to me that'd mean figuring out where > to store that information on-disk for every filesystem you care about. > I like the idea of something persistent, but maybe it's more trouble > than it's worth--I honestly don't know. > But if we do it as a flag controlled by the API used by chattr, it becomes the responsibility of the filesystems to deal with where to store the information, assuming they choose to support it; personally, I would be really surprised if XFS and BTRFS didn't add support for this relatively soon after the API gets merged upstream, and ext4 would likely follow soon afterwards. As far as support goes, I really think this will be easier to _safely_ implement (mount options are just too easy to arbitrarily change without knowing the consequences), although I think that reporting mtime as the current wall time for files under this effect is important regardless of what methodology gets implemented.
On Tue, 12 May 2015, Kevin Easton wrote: > On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote: > > On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote: > > > > Let me re-ask the question that I asked last week (and was apparently > > > > ignored). Why not trying to use the lazytime feature instead of > > > > pointing a head straight at the application's --- and system > > > > administrators' --- heads? > > > > > > Sorry Ted, I thought I responded already. > > > > > > The goal is to avoid inode writeout entirely when we can, and > > > as I understand it lazytime will still force writeout before the inode > > > is dropped from the cache. In systems like Ceph in particular, the > > > IOs can be spread across lots of files, so simply deferring writeout > > > doesn't always help. > > > > Sure, but it would reduce the writeout by orders of magnitude. I can > > understand if you want to reduce it further, but it might be good > > enough for your purposes. > > > > I considered doing the equivalent of O_NOMTIME for our purposes at > > $WORK, and our use case is actually not that different from Ceph's > > (i.e., using a local disk file system to support a cluster file > > system), and lazytime was (a) something I figured was something I > > could upstream in good conscience, and (b) was more than good enough > > for us. > > A safer alternative might be a chattr file attribute that if set, the > mtime is not updated on writes, and stat() on the file always shows the > mtime as "right now". At least that way, the file won't accidentally > get left out of backups that rely on the mtime. > > (If the file attribute is unset, you immediately update the mtime then > too, and from then on the file is back to normal). Interesting! I didn't realize there was already a chattr +A that disabled atime (although I suspect it doesn't do the "right now" for stat thing). 
This makes the nomtime-ness a bit more obscure (I don't think most users would think to check these file attributes), but it's a safer failure condition for backups at least. The fact that chattr +A (and hopefully +M) will work for non-root is a bonus, as we're also trying to get ceph daemons to drop most privileges. sage
On Tue, May 12, 2015 at 10:53:29AM -0400, Austin S Hemmelgarn wrote: > On 2015-05-12 10:36, J. Bruce Fields wrote: > >On Tue, May 12, 2015 at 09:54:27AM -0400, John Stoffel wrote: > >>>>>>>"Austin" == Austin S Hemmelgarn <ahferroin7@gmail.com> writes: > >> > >>Austin> On 2015-05-12 01:08, Kevin Easton wrote: > >>>>On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote: > >>>>>On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote: > >>>>>>>Let me re-ask the question that I asked last week (and was apparently > >>>>>>>ignored). Why not trying to use the lazytime feature instead of > >>>>>>>pointing a head straight at the application's --- and system > >>>>>>>administrators' --- heads? > >>>>>> > >>>>>>Sorry Ted, I thought I responded already. > >>>>>> > >>>>>>The goal is to avoid inode writeout entirely when we can, and > >>>>>>as I understand it lazytime will still force writeout before the inode > >>>>>>is dropped from the cache. In systems like Ceph in particular, the > >>>>>>IOs can be spread across lots of files, so simply deferring writeout > >>>>>>doesn't always help. > >>>>> > >>>>>Sure, but it would reduce the writeout by orders of magnitude. I can > >>>>>understand if you want to reduce it further, but it might be good > >>>>>enough for your purposes. > >>>>> > >>>>>I considered doing the equivalent of O_NOMTIME for our purposes at > >>>>>$WORK, and our use case is actually not that different from Ceph's > >>>>>(i.e., using a local disk file system to support a cluster file > >>>>>system), and lazytime was (a) something I figured was something I > >>>>>could upstream in good conscience, and (b) was more than good enough > >>>>>for us. > >>>> > >>>>A safer alternative might be a chattr file attribute that if set, the > >>>>mtime is not updated on writes, and stat() on the file always shows the > >>>>mtime as "right now". At least that way, the file won't accidentally > >>>>get left out of backups that rely on the mtime. 
> >>>> > >>>>(If the file attribute is unset, you immediately update the mtime then > >>>>too, and from then on the file is back to normal). > >>>> > >> > >>Austin> I like this even better than the flag suggestion, it provides > >>Austin> better control, means that you don't need to update > >>Austin> applications to get the benefits, and prevents backup software > >>Austin> from breaking (although backups would be bigger). > >> > >>Me too, it fails in a safer mode, where you do more work on backups > >>than strictly needed. I'm still against this as a mount option > >>though, way way way too many bullets in the foot gun. And as someone > >>else said, once you mount with O_NOMTIME, then unmount, then mount > >>again without O_NOMTIME, you've lost information. Not good. > > > >That was me. Zach also pointed out to me that'd mean figuring out where > >to store that information on-disk for every filesystem you care about. > >I like the idea of something persistent, but maybe it's more trouble > >than it's worth--I honestly don't know. > > > But if we do it as a flag controlled by the API used by chattr, it > becomes the responsibility of the filesystems to deal with where to > store the information, assuming they choose to support it; > personally, I would be really surprised if XFS and BTRFS didn't add > support for this relatively soon after the API getting merged > upstream, and ext4 would likely follow soon afterwards. It's an on-disk format change, which means that there are all sorts of compatibility issues to take into account, as well as all the work needed to teach the filesystem userspace tools about the new flag, e.g. xfs_repair, xfs_db, xfsdump/restore, xfs_io, test code in xfstests, etc. Keep in mind that the moment we make something persistent, the amount of work needed to implement and verify the new functionality goes up by an order of magnitude *for each filesystem*.
IOWs, support for new features that require persistence doesn't just magically appear overnight... Cheers, Dave.
On Tue, 12 May 2015 10:36:37 -0400 bfields@fieldses.org (J. Bruce Fields) wrote: > On Tue, May 12, 2015 at 09:54:27AM -0400, John Stoffel wrote: > > >>>>> "Austin" == Austin S Hemmelgarn <ahferroin7@gmail.com> writes: > > > > Austin> On 2015-05-12 01:08, Kevin Easton wrote: > > >> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote: > > >>> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote: > > >>>>> Let me re-ask the question that I asked last week (and was apparently > > >>>>> ignored). Why not trying to use the lazytime feature instead of > > >>>>> pointing a head straight at the application's --- and system > > >>>>> administrators' --- heads? > > >>>> > > >>>> Sorry Ted, I thought I responded already. > > >>>> > > >>>> The goal is to avoid inode writeout entirely when we can, and > > >>>> as I understand it lazytime will still force writeout before the inode > > >>>> is dropped from the cache. In systems like Ceph in particular, the > > >>>> IOs can be spread across lots of files, so simply deferring writeout > > >>>> doesn't always help. > > >>> > > >>> Sure, but it would reduce the writeout by orders of magnitude. I can > > >>> understand if you want to reduce it further, but it might be good > > >>> enough for your purposes. > > >>> > > >>> I considered doing the equivalent of O_NOMTIME for our purposes at > > >>> $WORK, and our use case is actually not that different from Ceph's > > >>> (i.e., using a local disk file system to support a cluster file > > >>> system), and lazytime was (a) something I figured was something I > > >>> could upstream in good conscience, and (b) was more than good enough > > >>> for us. > > >> > > >> A safer alternative might be a chattr file attribute that if set, the > > >> mtime is not updated on writes, and stat() on the file always shows the > > >> mtime as "right now". At least that way, the file won't accidentally > > >> get left out of backups that rely on the mtime. 
> > >> > > >> (If the file attribute is unset, you immediately update the mtime then > > >> too, and from then on the file is back to normal). > > >> > > > > Austin> I like this even better than the flag suggestion, it provides > > Austin> better control, means that you don't need to update > > Austin> applications to get the benefits, and prevents backup software > > Austin> from breaking (although backups would be bigger). > > > > Me too, it fails in a safer mode, where you do more work on backups > > than strictly needed. I'm still against this as a mount option > > though, way way way too many bullets in the foot gun. And as someone > > else said, once you mount with O_NOMTIME, then unmount, then mount > > again without O_NOMTIME, you've lost information. Not good. > > That was me. Zach also pointed out to me that'd mean figuring out where > to store that information on-disk for every filesystem you care about. > I like the idea of something persistent, but maybe it's more trouble > than it's worth--I honestly don't know. > When this persistent flag is in effect, the values stored in mtime and atime, and probably ctime, become irrelevant. Surely we can choose some magic value to store there that would never happen in practice. E.g., ctime is signed and so goes back to 1902 (is that right?). As ctime cannot be set (via POSIX) to anything but "now", and as there were no Unix systems in 1902, such values are impossible. So a specific large negative value in ctime could safely be taken to mean "don't update time stamps, and always report them as 'now'". Or do we need to keep ctime 'real'? BTW, when you "swap" to a file the mtime doesn't get updated. No one seems to complain about that. I guess it is a rather narrow use-case though. NeilBrown
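On the "(is that right?)" above: assuming the timestamp is a signed 32-bit count of seconds since the Unix epoch (an assumption; the thread does not say which on-disk format is meant), the earliest representable instant can be worked out directly. The function name below is just for illustration.

```python
from datetime import datetime, timedelta, timezone


def earliest_signed_32bit_time():
    """Earliest instant representable as signed 32-bit seconds since 1970-01-01 UTC."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(seconds=-(2**31))
```

Under that assumption the range reaches back to 13 December 1901, so "1902" is close, and any such value still predates every Unix system.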
On Tue, 12 May 2015, Dave Chinner wrote: > > > Neither of these examples cases are under the control of the > > > application that calls open(O_NOMTIME). > > > > Wouldn't a mount option (e.g., allow_nomtime) address this concern? Only > > nodes provisioned explicitly to run these systems would be enable this > > option. > > Back to my Joe Speedracer comments..... > > I'm not sure what the right answer is - mount options are simply too > easy to add without understanding the full implications of them. > e.g. we didn't merge FALLOC_FL_NO_HIDE_STALE simply because it was > too dangerous for unsuspecting users. This isn't at that same level > or concern, but it's still a landmine we want to avoid users from > arming without realising it... > > > > >> I'm happy for it to be an ioctl interface - even an XFS specific > > > >> interface if you want to go that route, Sage - and it probably > > > >> should emit a warning to syslog first time it is used so there is > > > >> trace for bug triage purposes. i.e. we know the app is not using > > > >> mtime updates, so bug reports that are the result of mtime > > > >> mishandling don't result in large amounts of wasted developer time > > > >> trying to understand them... > > > > > > > > A warning on using the interface (or when mounting with user_nomtime) > > > > sounds reasonable. > > > > > > > > I'd rather not make this XFS specific as other local filesystmes (ext4, > > > > f2fs, possibly btrfs) would similarly benefit. (And if we want to target > > > > XFS specifically the existing XFS open-by-handle ioctl is sufficient as it > > > > already does O_NOMTIME unconditionally.) > > > > > > Lack of a namespace, doesn't imply that you don't want to manage the > > > data. The whole point of using object storage instead of plain old > > > block storage is to be able to provide whatever metadata you still > > > need in order to manage the object. 
> > > > Yeah, agreed--this is presumably why open_by_handle(2) (which is what we'd > > like to use) doesn't assume O_NOMTIME. > > Right - the XFS ioctls were designed specifically for applications > that interacted directly with the structure of XFS filesystems and > so needed invisible IO (e.g. online defragmenter). IOWs, they are > not interfaces intended for general usage. They are also only > available to root, so a typical user application won't be making use > of them, either.
I understand that's what they're intended for, but I'm having a hard time parsing out the difference between what they *do* and what O_NOMTIME + -o allow_nomtime does. The open-by-handle ioctls have nothing to do with the online XFS format--they simply allow you to open a file via an opaque handle (albeit a differently formatted one than the generic open_by_handle_at(2)). They also force you into an O_NOMTIME-equivalent mode. AFAICS the only differences are that
1) the ioctl is XFS specific. (As open_by_handle_at(2) demonstrates, this needn't be the case.)
2) the NOMTIME mode is only available via the open-by-handle interface, not open(2).
3) it is an ioctl interface, and thus more obscure. (Well, there is a libhandle library, but it doesn't seem to be widely used.)
Would you object less if
1) the O_NOMTIME flag were only available via open_by_handle_at(2)?
2) an equivalent ioctl were implemented for each file system of interest that (say) called into open_by_handle_at(2) code, adding in the O_NOMTIME flag?
3) O_NOMTIME required root (vs a mount option that requires root and unprivileged O_NOMTIME)?
Just trying to tease apart which part is problematic... Thanks! sage
On Tue, May 12, 2015 at 04:12:46PM -0700, Sage Weil wrote: > On Tue, 12 May 2015, Dave Chinner wrote: > > > > > I'd rather not make this XFS specific as other local filesystmes (ext4, > > > > > f2fs, possibly btrfs) would similarly benefit. (And if we want to target > > > > > XFS specifically the existing XFS open-by-handle ioctl is sufficient as it > > > > > already does O_NOMTIME unconditionally.) > > > > > > > > Lack of a namespace, doesn't imply that you don't want to manage the > > > > data. The whole point of using object storage instead of plain old > > > > block storage is to be able to provide whatever metadata you still > > > > need in order to manage the object. > > > > > > Yeah, agreed--this is presumably why open_by_handle(2) (which is what we'd > > > like to use) doesn't assume O_NOMTIME. > > > > Right - the XFS ioctls were designed specifically for applications > > that interacted directly with the structure of XFS filesystems and > > so needed invisible IO (e.g. online defragmenter). IOWs, they are > > not interfaces intended for general usage. They are also only > > available to root, so a typical user application won't be making use > > of them, either. > > I understand that's what they're intended for, but I'm having a hard time > parsing out the difference between what they *do* and what O_NOMTIME + -o > allow_nomtime does. The open-by-handle ioctls have nothing to do with the > online XFS format--they simply allow you to open a file via an opaque > handle (albeit a differently formatted one than the generic > open_by_handle_at(2)). They also force you into an O_NOMTIME-equivalent > mode. Actually, the handle is derived from the information on disk.
We don't do directory lookups to build handles in many cases; we do a bulkstat to get *on-disk* inode information (inode number, generation, timestamps, etc) and then use that to build a handle in userspace *and* validate the file has not changed since the information was retrieved and the handle was built. > AFAICS the only difference that I see is that > > 1) the ioctl is XFS specific. (As open_by_handle_at(2) demonstrates, this > needn't be the case.) Of course - it's been in use for 15 years longer than the generic interface. :) > 2) the NOMTIME mode is only available via the open-by-handle interface, > not open(2). Right, because the XFS handle interfaces are intended for invisible IO, which is required by applications interacting directly with the XFS on-disk data layout. > 3) it is an ioctl interface, and thus more obscure. (Well, there is a > libhandle library, but it doesn't seem to be widely used.) The library only exists for xfsdump and the HSMs that interact directly with the XFS on-disk data. These are very constrained applications. > Would you object less if > > 1) the O_NOMTIME flag were only available via open_by_handle_at(2)? Which limits it to files that have already been created and written to disk, otherwise there is no handle.... > 2) an equivalent ioctl were implemented for each file system of interest > that (say) called into open_by_handle_at(2) code, adding in the O_NOMTIME > flag? Seems like a silly hoop to jump through. I was thinking of a root-only fcntl() style flag that could be set, but.... > 3) O_NOMTIME required root (vs a mount option that requires root and > unpriviledged O_NOMTIME)? > > Just trying to tease apart which part is problematic... ... its very existence as either an open or fcntl flag is still problematic. :/ The concept of it being an on-disk attribute flag is less prone to silent abuse - it's easily discoverable and is persistent.
And it's manageable if we make it an "inherit from parent" style flag, because then ceph can simply set it on the root dir, and every file it then creates will not do mtime updates. The other thing that is worth noting here is that we also have a NODUMP flag on disk (chattr +d). Hence we could define that the nomtime attribute also implies/sets the nodump attribute, which makes it clear up front that turning on the nomtime inode attribute will mean the files with this set will not get backed up by mtime-sensitive backup programs.... Cheers, Dave.
On Mon 11-05-15 09:24:09, Sage Weil wrote: > On Mon, 11 May 2015, Theodore Ts'o wrote: > > On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote: > > > That makes it completely non-generic though. By putting this in the > > > VFS, you are giving applications a loaded gun that is pointed straight > > > at the application user's head. > > > > Let me re-ask the question that I asked last week (and was apparently > > ignored). Why not trying to use the lazytime feature instead of > > pointing a head straight at the application's --- and system > > administrators' --- heads? > > Sorry Ted, I thought I responded already. > > The goal is to avoid inode writeout entirely when we can, and > as I understand it lazytime will still force writeout before the inode > is dropped from the cache. In systems like Ceph in particular, the > IOs can be spread across lots of files, so simply deferring writeout > doesn't always help. Can we get some numbers on this? Before we go on and implement new mount options, persistent inode flags, open flags, or whatever other crap (neither of which looks particularly appealing to me) I'd like to know how big is the performance difference between lazytime + fdatasync and not updating mtime at all for Ceph... Honza
On 2015-05-12 17:51, Dave Chinner wrote: > On Tue, May 12, 2015 at 10:53:29AM -0400, Austin S Hemmelgarn wrote: >> On 2015-05-12 10:36, J. Bruce Fields wrote: >>> On Tue, May 12, 2015 at 09:54:27AM -0400, John Stoffel wrote: >>>>>>>>> "Austin" == Austin S Hemmelgarn <ahferroin7@gmail.com> writes: >>>> >>>> Austin> On 2015-05-12 01:08, Kevin Easton wrote: >>>>>> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote: >>>>>>> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote: >>>>>>>>> Let me re-ask the question that I asked last week (and was apparently >>>>>>>>> ignored). Why not trying to use the lazytime feature instead of >>>>>>>>> pointing a head straight at the application's --- and system >>>>>>>>> administrators' --- heads? >>>>>>>> >>>>>>>> Sorry Ted, I thought I responded already. >>>>>>>> >>>>>>>> The goal is to avoid inode writeout entirely when we can, and >>>>>>>> as I understand it lazytime will still force writeout before the inode >>>>>>>> is dropped from the cache. In systems like Ceph in particular, the >>>>>>>> IOs can be spread across lots of files, so simply deferring writeout >>>>>>>> doesn't always help. >>>>>>> >>>>>>> Sure, but it would reduce the writeout by orders of magnitude. I can >>>>>>> understand if you want to reduce it further, but it might be good >>>>>>> enough for your purposes. >>>>>>> >>>>>>> I considered doing the equivalent of O_NOMTIME for our purposes at >>>>>>> $WORK, and our use case is actually not that different from Ceph's >>>>>>> (i.e., using a local disk file system to support a cluster file >>>>>>> system), and lazytime was (a) something I figured was something I >>>>>>> could upstream in good conscience, and (b) was more than good enough >>>>>>> for us. >>>>>> >>>>>> A safer alternative might be a chattr file attribute that if set, the >>>>>> mtime is not updated on writes, and stat() on the file always shows the >>>>>> mtime as "right now". 
At least that way, the file won't accidentally >>>>>> get left out of backups that rely on the mtime. >>>>>> >>>>>> (If the file attribute is unset, you immediately update the mtime then >>>>>> too, and from then on the file is back to normal). >>>>>> >>>> >>>> Austin> I like this even better than the flag suggestion, it provides >>>> Austin> better control, means that you don't need to update >>>> Austin> applications to get the benefits, and prevents backup software >>>> Austin> from breaking (although backups would be bigger). >>>> >>>> Me too, it fails in a safer mode, where you do more work on backups >>>> than strictly needed. I'm still against this as a mount option >>>> though, way way way too many bullets in the foot gun. And as someone >>>> else said, once you mount with O_NOMTIME, then unmount, then mount >>>> again without O_NOMTIME, you've lost information. Not good. >>> >>> That was me. Zach also pointed out to me that'd mean figuring out where >>> to store that information on-disk for every filesystem you care about. >>> I like the idea of something persistent, but maybe it's more trouble >>> than it's worth--I honestly don't know. >>> >> But if we do it as a flag controlled by the API used by chattr, it >> becomes the responsibility of the filesystems to deal with where to >> store the information, assuming they choose to support it; >> personally, I would be really surprised if XFS and BTRFS didn't add >> support for this relatively soon after the API getting merged >> upstream, and ext4 would likely follow soon afterwards. > > It's an on-disk format change, which means that there are all sorts > of compatibility issues to take into account, as well as all the > work needed to teach the filesystem userspace tools about the new > flag. e.g. xfs_repair, xfs_db, xfsdump/restore, xfs_io, test code in > xfstests, etc. 
> > Keep in mind that the moment we make something persistent, the > amount of work to implement and verify the new functionality > filesystem to implement it goes up by an order of magnitude *for > each filesystem*. IOWs, support of new features that require > persistence don't just magically appear overnight... > I'm not saying that it will, and any sane way of safely implementing this will _almost_ certainly need some kind of work done on the filesystems themselves. My only point was that it would be simpler on the VFS side of things than most of the other proposals so far. Also, BTRFS at least won't (theoretically) need a format change for this, as it could just be added to the property interface. As for the other filesystems, it would probably be possible to re-purpose one of the other bits for this: s (secure delete) and u (undeletion) are both not honored by any filesystem in the kernel, and also not honored by any other UNIX filesystem implementation that I know of; s would probably be the better of the 2 to use for this, as its currently assigned purpose is functionally impossible to implement properly on modern hardware.
On Thu 2015-05-07 12:53:46, Andy Lutomirski wrote: > On Thu, May 7, 2015 at 12:09 PM, Richard Weinberger > <richard.weinberger@gmail.com> wrote: > > On Thu, May 7, 2015 at 7:20 PM, Zach Brown <zab@redhat.com> wrote: > >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote: > >>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote: > >>> > Add the O_NOMTIME flag which prevents mtime from being updated which can > >>> > greatly reduce the IO overhead of writes to allocated and initialized > >>> > regions of files. > >>> > >>> Hmmm. How do backup programs now work out if the file has changed > >>> and hence needs copying again? ie. applications using this will > >>> break other critical infrastructure in subtle ways. > >> > >> By using backup infrastructure that doesn't use cmtime. Like btrfs > >> send/recv. Or application level backups that know how to do > >> incrementals from metadata in giant database files, say, without > >> walking, comparing, and copying the entire thing. > > > > But how can Joey random user know that some of his > > applications are using O_NOMTIME and his KISS backup > > program does no longer function as expected? > > > > Joey random user can't have a working KISS backup anyway, though, > because we screw up mtime updates on mmap writes. I have patches > gathering dust that fix that, though. I'm using unison, and yes, I believe I've already seen failures from mmap(). I'd like to see that fixed, and I can test the patches. Thanks, Pavel
> Sage> I think this is the fundamental question: who do we give the
> Sage> ammunition to, the user or app writer, or the sysadmin?
>
> Sage> One might argue that we gave the user a similar power with
> Sage> O_NOATIME (the power to break applications that assume atime is
> Sage> accurate). Here we give developers/users the power to not
> Sage> update mtime and suffer the consequences (like, obviously,
> Sage> breaking mtime-based backups). It should be pretty obvious to
> Sage> anyone using the flag what the consequences are.
>
> Not modifying atime doesn't really break anything except people who
> think they can tell when a file was last accessed. Which isn't
> critical (unless your in a paranoid security conscious place...) but
> MTIME is another beast entirely. Turning that off is going to break
> lots of hidden assumptions.
>
> Sage> Note that we can suffer similar lapses in mtime with fdatasync
> Sage> followed by a system crash. And as Andy points out it's
> Sage> semi-broken for writable mmap. The crash case is obviously a
> Sage> slightly different thing, but the idea that mtime can't always
> Sage> be trusted certainly isn't crazy talk.
>
> True, but after a crash... people expect and understand there might be
> corruption in a filesystem.

Umm. No; people do not expect anything newer than ext3 to get
corrupted, ever.

In fact, I did not know about fdatasync/crash. That's a rather nasty
surprise.

								Pavel
Hi!

> BTW When you "swap" to a file the mtime doesn't get updated. No one seems to
> complain about that. I guess it is a rather narrow use-case though.

Actually yes, I'd like to complain.

It was not swap, it was mount -o loop, but I guess that's the same
case. Then rsync refused to work on that file... and being on a slow ARM
system it took me a while to figure out WTF is going on.

So yes, we have problems with mtime, and yes, they matter.

								Pavel
On Tue, 14 Jul 2015 15:13:00 +0200 Pavel Machek <pavel@ucw.cz> wrote:

> Hi!
>
> > BTW When you "swap" to a file the mtime doesn't get updated. No one seems to
> > complain about that. I guess it is a rather narrow use-case though.
>
> Actually yes, I'd like to complain.
>
> It was not swap, it was mount -o loop, but I guess that's the same
> case. Then rsync refused to work on that file... and being on slow ARM
> system it took me a while to figure out WTF is going on.
>
> So yes, we have problems with mtime, and yes, they matter.
> 								Pavel

Odd...
I assume you mean
  mount -o loop /some/file /mountpoint

and then when you write to the filesystem on /mountpoint the mtime
of /some/file doesn't get updated?
I think it should.
drivers/block/loop.c uses vfs_iter_write() to write to a file.
That calls f_op->write_iter which will typically call
generic_file_write_iter() which will call file_update_time() to update
the time stamps.

What filesystem was /some/file on?

I just did some testing on ext4 and it seems to do the right thing:
mtime gets updated.

NeilBrown
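The call chain described above ends in file_update_time(), so a quick user-space probe can at least confirm that an ordinary write(2) moves mtime forward on a given filesystem. The sketch below is only an illustration: write_updates_mtime() is a hypothetical helper, it exercises a plain write() rather than the loop driver, and it assumes the filesystem accepts futimens() so the old timestamp can be backdated.

```c
#define _POSIX_C_SOURCE 200809L

#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical probe: backdate the file's timestamps, write one byte,
 * and report whether mtime moved forward.
 * Returns 1 if mtime advanced, 0 if it did not, -1 on any error.
 */
int write_updates_mtime(const char *path)
{
	struct stat before, after;
	struct timespec times[2];
	int ret = -1;
	int fd;

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	/* Set atime and mtime to a fixed point well in the past. */
	times[0].tv_sec = 1000000000;
	times[0].tv_nsec = 0;
	times[1] = times[0];
	if (futimens(fd, times) < 0)
		goto out;
	if (fstat(fd, &before) < 0)
		goto out;

	/* An ordinary buffered write; this should update mtime. */
	if (write(fd, "x", 1) != 1)
		goto out;
	if (fstat(fd, &after) < 0)
		goto out;

	ret = after.st_mtime > before.st_mtime;
out:
	close(fd);
	return ret;
}
```

Calling it on a scratch file that lives on the filesystem in question (VFAT, ext4, ...) gives a first data point; a loop-mounted image would still need the real mount -o loop test.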
On Wed 2015-07-15 14:54:56, NeilBrown wrote:
> On Tue, 14 Jul 2015 15:13:00 +0200 Pavel Machek <pavel@ucw.cz> wrote:
>
> > Hi!
> >
> > > BTW When you "swap" to a file the mtime doesn't get updated. No one seems to
> > > complain about that. I guess it is a rather narrow use-case though.
> >
> > Actually yes, I'd like to complain.
> >
> > It was not swap, it was mount -o loop, but I guess that's the same
> > case. Then rsync refused to work on that file... and being on slow ARM
> > system it took me a while to figure out WTF is going on.
> >
> > So yes, we have problems with mtime, and yes, they matter.
> > 								Pavel
>
> Odd...
> I assume you mean
>   mount -o loop /some/file /mountpoint
>
> and then when you write to the filesystem on /mountpoint the mtime
> of /some/file doesn't get updated?
> I think it should.
> drivers/block/loop.c uses vfs_iter_write() to write to a file.
> That calls f_op->write_iter which will typically call
> generic_file_write_iter() which will call file_update_time() to update
> the time stamps.

Yes, that. I'm pretty sure I have seen it, but it was probably on a
2.6.X kernel... Does it make sense to try to reproduce it on the old
kernel?

> What filesystem was /some/file on?

Very probably VFAT.

> I just did some testing on ext4 and it seems to do the right thing
> mtime gets updated.

Yes, I tried here, and it seems to be ok.

Thanks,
								Pavel
diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..9e48092 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -27,7 +27,8 @@
 #include <asm/siginfo.h>
 #include <asm/uaccess.h>
 
-#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
+#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
+		    O_NOMTIME)
 
 static int setfl(int fd, struct file * filp, unsigned long arg)
 {
@@ -41,8 +42,9 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
 	if (((arg ^ filp->f_flags) & O_APPEND) && IS_APPEND(inode))
 		return -EPERM;
 
-	/* O_NOATIME can only be set by the owner or superuser */
-	if ((arg & O_NOATIME) && !(filp->f_flags & O_NOATIME))
+	/* O_NOATIME and O_NOMTIME can only be set by the owner or superuser */
+	if (((arg & O_NOATIME) && !(filp->f_flags & O_NOATIME)) ||
+	    ((arg & O_NOMTIME) && !(filp->f_flags & O_NOMTIME)))
 		if (!inode_owner_or_capable(inode))
 			return -EPERM;
 
@@ -740,7 +742,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
 		O_RDONLY	| O_WRONLY	| O_RDWR	|
 		O_CREAT		| O_EXCL	| O_NOCTTY	|
 		O_TRUNC		| O_APPEND	| /* O_NONBLOCK	| */
@@ -748,7 +750,7 @@ static int __init fcntl_init(void)
 		O_DIRECT	| O_LARGEFILE	| O_DIRECTORY	|
 		O_NOFOLLOW	| O_NOATIME	| O_CLOEXEC	|
 		__FMODE_EXEC	| O_PATH	| __O_TMPFILE	|
-		__FMODE_NONOTIFY
+		__FMODE_NONOTIFY| O_NOMTIME
 		));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
diff --git a/fs/inode.c b/fs/inode.c
index ea37cd1..8976edc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1721,7 +1721,7 @@ int file_update_time(struct file *file)
 	int ret;
 
 	/* First try to exhaust all avenues to not sync */
-	if (IS_NOCMTIME(inode))
+	if (IS_NOCMTIME(inode) || (file->f_flags & O_NOMTIME))
 		return 0;
 
 	now = current_fs_time(inode->i_sb);
diff --git a/fs/namei.c b/fs/namei.c
index 4a8d998b..1a3ccb3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2609,8 +2609,8 @@ static int may_open(struct path *path, int acc_mode, int flag)
 			return -EPERM;
 	}
 
-	/* O_NOATIME can only be set by the owner or superuser */
-	if (flag & O_NOATIME && !inode_owner_or_capable(inode))
+	/* O_NOATIME and O_NOMTIME can only be set by the owner or superuser */
+	if (flag & (O_NOATIME|O_NOMTIME) && !inode_owner_or_capable(inode))
 		return -EPERM;
 
 	return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 35ec87e..34602f5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -110,12 +110,7 @@ typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* 64bit hashes as llseek() offset (for directories) */
 #define FMODE_64BITHASH         ((__force fmode_t)0x400)
 
-/*
- * Don't update ctime and mtime.
- *
- * Currently a special hack for the XFS open_by_handle ioctl, but we'll
- * hopefully graduate it to a proper O_CMTIME flag supported by open(2) soon.
- */
+/* Don't update ctime and mtime. */
 #define FMODE_NOCMTIME		((__force fmode_t)0x800)
 
 /* Expect random access pattern */
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index e063eff..8e484ae 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_NOMTIME
+#define O_NOMTIME	040000000
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
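The BUILD_BUG_ON change from 21 to 22 in fs/fcntl.c is a bit-counting check: OR all the open flags together and the number of set bits has to match the number of flags, so a new flag that reused an existing bit would break the build. A rough user-space sketch of the same counting argument follows. The X_ names and flags_are_distinct() are hypothetical, only a handful of flags are listed, and the values are the asm-generic ones (X_O_NOMTIME being the value proposed in this patch); other architectures may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative copies of asm-generic flag values (octal). */
#define X_O_NOATIME	01000000u
#define X_O_CLOEXEC	02000000u
#define X_O_PATH	010000000u
#define X_O_TMPFILE	020000000u	/* __O_TMPFILE */
#define X_O_NOMTIME	040000000u	/* value proposed by this patch */

/* Count set bits, in the spirit of the kernel's HWEIGHT32(). */
unsigned int hweight32(uint32_t w)
{
	unsigned int n = 0;

	while (w) {
		w &= w - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Return 1 if OR-ing the n flags yields exactly n set bits, which holds
 * when every flag is a single bit and no two flags share one.
 */
int flags_are_distinct(const uint32_t *flags, size_t n)
{
	uint32_t all = 0;
	size_t i;

	for (i = 0; i < n; i++)
		all |= flags[i];
	return hweight32(all) == n;
}
```

With the five values above the check passes; listing any value twice makes it fail, which is the situation the kernel's compile-time version is there to catch.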
Add the O_NOMTIME flag which prevents mtime from being updated which can
greatly reduce the IO overhead of writes to allocated and initialized
regions of files.

ceph servers can have loads where they perform O_DIRECT overwrites of
allocated file data and then sync to make sure that the O_DIRECT writes
are flushed from write caches. If the writes dirty the inode with mtime
updates then the syncs also write out the metadata needed to track the
inodes which can add significant iop and latency overhead.

The ceph servers don't use mtime at all. They're using the local file
system as a backing store and any backups would be driven by their upper
level ceph metadata. For ceph, slow IO from mtime updates in the file
system is as daft as if we had block devices slowing down IO for
per-block write timestamps that file systems never use.

In simple tests an O_DIRECT|O_NOMTIME overwriting write followed by a
sync went from 2 serial write round trips to 1 in XFS and from 4 serial
IO round trips to 1 in ext4.

file_update_time() checks for O_NOMTIME and aborts the update if it's
set, just like the current check for the in-kernel inode flag
S_NOCMTIME. I didn't update any other mtime update sites. They could be
added as we decide that it's appropriate to do so.

I opted not to name the flag O_NOCMTIME because I didn't want the name
to imply that ctime updates would be prevented for other inode changes
like updating i_size in truncate. Not updating ctime is a side-effect
of removing mtime updates when it's the only thing changing in the
inode.

The criteria for using O_NOMTIME are the same as for using O_NOATIME:
owning the file or having the CAP_FOWNER capability. If we're not
comfortable allowing owners to prevent mtime/ctime updates then we
should add a tunable to allow O_NOMTIME. Maybe a mount option?

Signed-off-by: Zach Brown <zab@redhat.com>
Cc: Sage Weil <sweil@redhat.com>
---
 fs/fcntl.c                       | 12 +++++++-----
 fs/inode.c                       |  2 +-
 fs/namei.c                       |  4 ++--
 include/linux/fs.h               |  7 +------
 include/uapi/asm-generic/fcntl.h |  4 ++++
 5 files changed, 15 insertions(+), 14 deletions(-)
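
The three rules in the description (who may set the flag at open time, when F_SETFL needs the ownership check, and when file_update_time() bails out) can be modelled as plain predicates. The following is a user-space sketch for reading along, not kernel code: the function names and the X_ constants are hypothetical, the boolean arguments stand in for inode_owner_or_capable() and IS_NOCMTIME(), and X_O_NOMTIME uses the asm-generic value proposed in this patch.

```c
#include <stdbool.h>

#define X_O_NOATIME	01000000u	/* asm-generic value */
#define X_O_NOMTIME	040000000u	/* value proposed by this patch */

/*
 * fcntl(F_SETFL) model: the ownership check is needed only when one of
 * the two flags is newly turned on, not when it stays set or is cleared.
 */
bool setfl_needs_owner(unsigned int old_flags, unsigned int new_flags)
{
	return ((new_flags & X_O_NOATIME) && !(old_flags & X_O_NOATIME)) ||
	       ((new_flags & X_O_NOMTIME) && !(old_flags & X_O_NOMTIME));
}

/*
 * open(2) model: either flag is refused unless the caller owns the file
 * or has CAP_FOWNER (collapsed here into one boolean).
 */
bool open_flags_permitted(unsigned int flags, bool owner_or_capable)
{
	if ((flags & (X_O_NOATIME | X_O_NOMTIME)) && !owner_or_capable)
		return false;
	return true;
}

/*
 * file_update_time() model: skip the timestamp update if the inode is
 * marked no-cmtime or the file was opened with the proposed flag.
 */
bool skips_time_update(bool inode_nocmtime, unsigned int f_flags)
{
	return inode_nocmtime || (f_flags & X_O_NOMTIME);
}
```

Note that clearing the flag through F_SETFL never needs the ownership check in this model, matching the setfl() hunk, and that O_NOATIME on its own does not suppress the mtime update.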