
[RFC] vfs: add a O_NOMTIME flag

Message ID	1430949612-21356-1-git-send-email-zab@redhat.com (mailing list archive)
State	New, archived

Headers

From: Zach Brown <zab@redhat.com>
To: Alexander Viro <viro@zeniv.linux.org.uk>,
	Sage Weil <sweil@redhat.com>, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org
Subject: [PATCH RFC] vfs: add a O_NOMTIME flag
Date: Wed,  6 May 2015 15:00:12 -0700
Message-Id: <1430949612-21356-1-git-send-email-zab@redhat.com>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk

Commit Message

Zach Brown May 6, 2015, 10 p.m. UTC

Add the O_NOMTIME flag, which prevents mtime from being updated.  This can
greatly reduce the IO overhead of writes to allocated and initialized
regions of files.

ceph servers can have loads where they perform O_DIRECT overwrites of
allocated file data and then sync to make sure that the O_DIRECT writes
are flushed from write caches.  If the writes dirty the inode with mtime
updates then the syncs also write out the metadata needed to track the
inodes which can add significant iop and latency overhead.

The ceph servers don't use mtime at all.  They're using the local file
system as a backing store and any backups would be driven by their upper
level ceph metadata.  For ceph, slow IO from mtime updates in the file
system is as daft as if we had block devices slowing down IO for
per-block write timestamps that file systems never use.

In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
sync went from 2 serial write round trips to 1 in XFS and from 4 serial
IO round trips to 1 in ext4.

file_update_time() checks for O_NOMTIME and aborts the update if it's
set, just like the current check for the in-kernel inode flag
S_NOCMTIME.  I didn't update any other mtime update sites. They could be
added as we decide that it's appropriate to do so.
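
As a rough illustration of the check described above, here is a minimal
userspace model in C.  It is not the patch itself: the struct names, the
MODEL_ flag values and model_file_update_time() are hypothetical stand-ins,
and the real file_update_time() does more work than is shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values for illustration; not the kernel's real assignments. */
#define MODEL_S_NOCMTIME 0x1u	/* models the in-kernel inode flag S_NOCMTIME */
#define MODEL_O_NOMTIME  0x2u	/* models the proposed open(2) flag O_NOMTIME */

struct model_inode {
	unsigned int i_flags;
	long mtime;
	long ctime;
};

struct model_file {
	unsigned int f_flags;
	struct model_inode *inode;
};

/*
 * Returns true if the timestamps were changed (the inode would be dirtied),
 * false if the update was skipped.
 */
static bool model_file_update_time(struct model_file *file, long now)
{
	struct model_inode *inode;

	assert(file != NULL && file->inode != NULL);
	inode = file->inode;

	/* Existing behaviour: the inode itself opts out of c/mtime updates. */
	if (inode->i_flags & MODEL_S_NOCMTIME)
		return false;

	/* Proposed behaviour: this particular open file opts out. */
	if (file->f_flags & MODEL_O_NOMTIME)
		return false;

	/* Nothing to do if both timestamps are already current. */
	if (inode->mtime == now && inode->ctime == now)
		return false;

	inode->mtime = now;
	inode->ctime = now;
	return true;
}
```

In this model a write through a file opened with the flag leaves both
timestamps alone, which is the ctime side-effect discussed in the next
paragraph.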

I opted not to name the flag O_NOCMTIME because I didn't want the name
to imply that ctime updates would be prevented for other inode changes
like updating i_size in truncate.  Not updating ctime is a side-effect
of removing mtime updates when it's the only thing changing in the
inode.

The criteria for using O_NOMTIME is the same as for using O_NOATIME:
owning the file or having the CAP_FOWNER capability.  If we're not
comfortable allowing owners to prevent mtime/ctime updates then we
should add a tunable to allow O_NOMTIME.  Maybe a mount option?
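
The ownership rule can be modelled in a few lines of C.  This is only a
sketch under assumptions: struct model_cred and model_may_open_nomtime() are
hypothetical names, and returning -EPERM assumes O_NOMTIME would fail the
same way open(2) documents for O_NOATIME.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified view of the caller's credentials. */
struct model_cred {
	uid_t fsuid;		/* filesystem UID of the caller */
	bool cap_fowner;	/* true if the caller holds CAP_FOWNER */
};

/*
 * Returns 0 if the caller may open a file owned by file_owner with the
 * flag set, or -EPERM otherwise.
 */
static int model_may_open_nomtime(const struct model_cred *cred,
				  uid_t file_owner)
{
	assert(cred != NULL);

	if (cred->fsuid == file_owner || cred->cap_fowner)
		return 0;
	return -EPERM;
}
```

A mount option or other tunable, if one were added, would be an extra
condition in front of this check.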

Signed-off-by: Zach Brown <zab@redhat.com>
Cc: Sage Weil <sweil@redhat.com>
---
 fs/fcntl.c                       | 12 +++++++-----
 fs/inode.c                       |  2 +-
 fs/namei.c                       |  4 ++--
 include/linux/fs.h               |  7 +------
 include/uapi/asm-generic/fcntl.h |  4 ++++
 5 files changed, 15 insertions(+), 14 deletions(-)

Comments

Trond Myklebust May 6, 2015, 10:14 p.m. UTC | #1

Hi Zach,

On Wed, May 6, 2015 at 6:00 PM, Zach Brown <zab@redhat.com> wrote:
>
> Add the O_NOMTIME flag which prevents mtime from being updated which can
> greatly reduce the IO overhead of writes to allocated and initialized
> regions of files.
>
> ceph servers can have loads where they perform O_DIRECT overwrites of
> allocated file data and then sync to make sure that the O_DIRECT writes
> are flushed from write caches.  If the writes dirty the inode with mtime
> updates then the syncs also write out the metadata needed to track the
> inodes which can add significant iop and latency overhead.
>
> The ceph servers don't use mtime at all.  They're using the local file
> system as a backing store and any backups would be driven by their upper
> level ceph metadata.  For ceph, slow IO from mtime updates in the file
> system is as daft as if we had block devices slowing down IO for
> per-block write timestamps that file systems never use.
>
> In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
> sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> IO round trips to 1 in ext4.
>
> file_update_time() checks for O_NOMTIME and aborts the update if it's
> set, just like the current check for the in-kernel inode flag
> S_NOCMTIME.  I didn't update any other mtime update sites. They could be
> added as we decide that it's appropriate to do so.
>
> I opted not to name the flag O_NOCMTIME because I didn't want the name
> to imply that ctime updates would be prevented for other inode changes
> like updating i_size in truncate.  Not updating ctime is a side-effect
> of removing mtime updates when it's the only thing changing in the
> inode.
>
> The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> owning the file or having the CAP_FOWNER capability.  If we're not
> comfortable allowing owners to prevent mtime/ctime updates then we
> should add a tunable to allow O_NOMTIME.  Maybe a mount option?
>

Just out of curiosity, if you need to modify the application anyway,
why wouldn't use of fdatasync() when flushing be able to offer a
similar performance boost?

Cheers
  Trond
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sage Weil May 6, 2015, 10:19 p.m. UTC | #2

On Wed, 6 May 2015, Trond Myklebust wrote:
> Hi Zach,
> 
> On Wed, May 6, 2015 at 6:00 PM, Zach Brown <zab@redhat.com> wrote:
> >
> > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > greatly reduce the IO overhead of writes to allocated and initialized
> > regions of files.
> >
> > ceph servers can have loads where they perform O_DIRECT overwrites of
> > allocated file data and then sync to make sure that the O_DIRECT writes
> > are flushed from write caches.  If the writes dirty the inode with mtime
> > updates then the syncs also write out the metadata needed to track the
> > inodes which can add significant iop and latency overhead.
> >
> > The ceph servers don't use mtime at all.  They're using the local file
> > system as a backing store and any backups would be driven by their upper
> > level ceph metadata.  For ceph, slow IO from mtime updates in the file
> > system is as daft as if we had block devices slowing down IO for
> > per-block write timestamps that file systems never use.
> >
> > In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
> > sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> > IO round trips to 1 in ext4.
> >
> > file_update_time() checks for O_NOMTIME and aborts the update if it's
> > set, just like the current check for the in-kernel inode flag
> > S_NOCMTIME.  I didn't update any other mtime update sites. They could be
> > added as we decide that it's appropriate to do so.
> >
> > I opted not to name the flag O_NOCMTIME because I didn't want the name
> > to imply that ctime updates would be prevented for other inode changes
> > like updating i_size in truncate.  Not updating ctime is a side-effect
> > of removing mtime updates when it's the only thing changing in the
> > inode.
> >
> > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > owning the file or having the CAP_FOWNER capability.  If we're not
> > comfortable allowing owners to prevent mtime/ctime updates then we
> > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> >
> 
> Just out of curiosity, if you need to modify the application anyway,
> why wouldn't use of fdatasync() when flushing be able to offer a
> similar performance boost?

Although fdatasync(2) doesn't have to write the mtime update synchronously,
the dirty inode does eventually get written, and that can trigger lots of
unwanted IO.

In practice we fsync(2) to avoid deferred IO that we can't control/bound, 
but that's a long and sad story.  O_NOMTIME would make for a much better 
ending!
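
For readers following along, the fdatasync(2) pattern being discussed looks
roughly like the sketch below.  It is an illustration, not ceph code: the
helper name overwrite_and_fdatasync() is hypothetical, and it uses buffered
IO rather than O_DIRECT to avoid the buffer alignment requirements.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Overwrite len bytes at offset off in an existing file, then flush the
 * data with fdatasync(2).  Returns 0 on success or a negative errno value.
 */
static int overwrite_and_fdatasync(const char *path, const void *buf,
				   size_t len, off_t off)
{
	const char *p = buf;
	int fd, ret = 0;

	assert(path != NULL && buf != NULL);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			ret = -errno;
			break;
		}
		p += n;
		len -= (size_t)n;
		off += n;
	}

	/*
	 * fdatasync(2) need not flush metadata that is not required to read
	 * the data back, so an mtime-only change can stay dirty in memory
	 * and be written out later by background writeback.
	 */
	if (ret == 0 && fdatasync(fd) < 0)
		ret = -errno;

	if (close(fd) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

The write still dirties the inode with a new mtime; the call only defers
when that inode is written.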

sage

Zach Brown May 6, 2015, 10:41 p.m. UTC | #3

On Wed, May 06, 2015 at 03:19:13PM -0700, Sage Weil wrote:
> On Wed, 6 May 2015, Trond Myklebust wrote:
> > Hi Zach,
> > 
> > On Wed, May 6, 2015 at 6:00 PM, Zach Brown <zab@redhat.com> wrote:
> > >
> > > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > > greatly reduce the IO overhead of writes to allocated and initialized
> > > regions of files.
> > >
> > > ceph servers can have loads where they perform O_DIRECT overwrites of
> > > allocated file data and then sync to make sure that the O_DIRECT writes
> > > are flushed from write caches.  If the writes dirty the inode with mtime
> > > updates then the syncs also write out the metadata needed to track the
> > > inodes which can add significant iop and latency overhead.
> > >
> > > The ceph servers don't use mtime at all.  They're using the local file
> > > system as a backing store and any backups would be driven by their upper
> > > level ceph metadata.  For ceph, slow IO from mtime updates in the file
> > > system is as daft as if we had block devices slowing down IO for
> > > per-block write timestamps that file systems never use.
> > >
> > > In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
> > > sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> > > IO round trips to 1 in ext4.
> > >
> > > file_update_time() checks for O_NOMTIME and aborts the update if it's
> > > set, just like the current check for the in-kernel inode flag
> > > S_NOCMTIME.  I didn't update any other mtime update sites. They could be
> > > added as we decide that it's appropriate to do so.
> > >
> > > I opted not to name the flag O_NOCMTIME because I didn't want the name
> > > to imply that ctime updates would be prevented for other inode changes
> > > like updating i_size in truncate.  Not updating ctime is a side-effect
> > > of removing mtime updates when it's the only thing changing in the
> > > inode.
> > >
> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > owning the file or having the CAP_FOWNER capability.  If we're not
> > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> > >
> > 
> > Just out of curiosity, if you need to modify the application anyway,
> > why wouldn't use of fdatasync() when flushing be able to offer a
> > similar performance boost?
> 
> Although fdatasync(2) doesn't have to update synchronously, it does 
> eventually get written, and that can trigger lots of unwanted IO.

And the unwanted IO is per file.  Are there circumstances where the
write:file ratio is small enough that dirty inode writes could start to
add up to meaningful write amplification?

- z

Sage Weil May 6, 2015, 10:46 p.m. UTC | #4

On Wed, 6 May 2015, Zach Brown wrote:
> On Wed, May 06, 2015 at 03:19:13PM -0700, Sage Weil wrote:
> > On Wed, 6 May 2015, Trond Myklebust wrote:
> > > Hi Zach,
> > > 
> > > On Wed, May 6, 2015 at 6:00 PM, Zach Brown <zab@redhat.com> wrote:
> > > >
> > > > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > > > greatly reduce the IO overhead of writes to allocated and initialized
> > > > regions of files.
> > > >
> > > > ceph servers can have loads where they perform O_DIRECT overwrites of
> > > > allocated file data and then sync to make sure that the O_DIRECT writes
> > > > are flushed from write caches.  If the writes dirty the inode with mtime
> > > > updates then the syncs also write out the metadata needed to track the
> > > > inodes which can add significant iop and latency overhead.
> > > >
> > > > The ceph servers don't use mtime at all.  They're using the local file
> > > > system as a backing store and any backups would be driven by their upper
> > > > level ceph metadata.  For ceph, slow IO from mtime updates in the file
> > > > system is as daft as if we had block devices slowing down IO for
> > > > per-block write timestamps that file systems never use.
> > > >
> > > > In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
> > > > sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> > > > IO round trips to 1 in ext4.
> > > >
> > > > file_update_time() checks for O_NOMTIME and aborts the update if it's
> > > > set, just like the current check for the in-kernel inode flag
> > > > S_NOCMTIME.  I didn't update any other mtime update sites. They could be
> > > > added as we decide that it's appropriate to do so.
> > > >
> > > > I opted not to name the flag O_NOCMTIME because I didn't want the name
> > > > to imply that ctime updates would be prevented for other inode changes
> > > > like updating i_size in truncate.  Not updating ctime is a side-effect
> > > > of removing mtime updates when it's the only thing changing in the
> > > > inode.
> > > >
> > > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > > owning the file or having the CAP_FOWNER capability.  If we're not
> > > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> > > >
> > > 
> > > Just out of curiosity, if you need to modify the application anyway,
> > > why wouldn't use of fdatasync() when flushing be able to offer a
> > > similar performance boost?
> > 
> > Although fdatasync(2) doesn't have to update synchronously, it does 
> > eventually get written, and that can trigger lots of unwanted IO.
> 
> And the unwanted IO is per file.  Are there circumstances where the
> write:file ratio is small enough that dirty inode writes could start to
> add up to meaningful write amplification?

Yeah, exactly: in some not-so-uncommon workloads it's approaching 1:1.

sage

Theodore Ts'o May 6, 2015, 11:21 p.m. UTC | #5

On Wed, May 06, 2015 at 03:19:13PM -0700, Sage Weil wrote:
> > Just out of curiosity, if you need to modify the application anyway,
> > why wouldn't use of fdatasync() when flushing be able to offer a
> > similar performance boost?
> 
> Although fdatasync(2) doesn't have to update synchronously, it does 
> eventually get written, and that can trigger lots of unwanted IO.

Something that might be worth trying out is using MS_LAZYTIME plus
fdatasync(2).  That should significantly reduce the unwanted IO, while
eventually letting the mtimes get updated, plus allowing updates of
adjacent inodes in the same inode table block update the mtime "for
free".

Regards,

					- Ted

Dave Chinner May 7, 2015, 12:26 a.m. UTC | #6

On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> Add the O_NOMTIME flag which prevents mtime from being updated which can
> greatly reduce the IO overhead of writes to allocated and initialized
> regions of files.

Hmmm. How do backup programs now work out if the file has changed
and hence needs copying again? ie. applications using this will
break other critical infrastructure in subtle ways.

> ceph servers can have loads where they perform O_DIRECT overwrites of
> allocated file data and then sync to make sure that the O_DIRECT writes
> are flushed from write caches.  If the writes dirty the inode with mtime
> updates then the syncs also write out the metadata needed to track the
> inodes which can add significant iop and latency overhead.
> 
> The ceph servers don't use mtime at all.  They're using the local file
> system as a backing store and any backups would be driven by their upper
> level ceph metadata.  For ceph, slow IO from mtime updates in the file
> system is as daft as if we had block devices slowing down IO for
> per-block write timestamps that file systems never use.
> 
> In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
> sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> IO round trips to 1 in ext4.
> 
> file_update_time() checks for O_NOMTIME and aborts the update if it's
> set, just like the current check for the in-kernel inode flag
> S_NOCMTIME.  I didn't update any other mtime update sites. They could be
> added as we decide that it's appropriate to do so.
> 
> I opted not to name the flag O_NOCMTIME because I didn't want the name
> to imply that ctime updates would be prevented for other inode changes
> like updating i_size in truncate.  Not updating ctime is a side-effect
> of removing mtime updates when it's the only thing changing in the
> inode.

If adding this, wouldn't we want to unify O_NOMTIME and
FMODE_NOCMTIME at the same time?

i.e. it makes no sense to add O_NOMTIME and not add O_NOCMTIME,
likewise it makes no sense to have two different "no mtime"
detection mechanisms.  i.e. file_is_nomtime(file) should return
true for both files opened with O_NOMTIME, files that have had
FMODE_NOCMTIME added to them and inodes with the S_NOCMTIME flag
set on them.
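
A unified helper along those lines might look like the following userspace
model.  The bit values and the nm_ struct names are hypothetical, and
nm_file_is_nomtime() only mirrors the suggested file_is_nomtime(); it is a
sketch of the idea, not existing kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values for illustration only. */
#define NM_O_NOMTIME      0x1u	/* models O_NOMTIME in file->f_flags */
#define NM_FMODE_NOCMTIME 0x2u	/* models FMODE_NOCMTIME in file->f_mode */
#define NM_S_NOCMTIME     0x4u	/* models S_NOCMTIME in inode->i_flags */

struct nm_inode {
	unsigned int i_flags;
};

struct nm_file {
	unsigned int f_flags;
	unsigned int f_mode;
	const struct nm_inode *inode;
};

/* One predicate covering all three ways of asking for "no mtime". */
static bool nm_file_is_nomtime(const struct nm_file *file)
{
	assert(file != NULL && file->inode != NULL);

	return (file->f_flags & NM_O_NOMTIME) ||
	       (file->f_mode & NM_FMODE_NOCMTIME) ||
	       (file->inode->i_flags & NM_S_NOCMTIME);
}
```

Every timestamp update site would then ask the same question instead of
testing each flag separately.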

> The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> owning the file or having the CAP_FOWNER capability.  If we're not
> comfortable allowing owners to prevent mtime/ctime updates then we
> should add a tunable to allow O_NOMTIME.  Maybe a mount option?

I dislike "turn off safety for performance" options because Joe
SpeedRacer will always select performance over safety.

Cheers,

Dave.

Zach Brown May 7, 2015, 5:20 p.m. UTC | #7

On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > greatly reduce the IO overhead of writes to allocated and initialized
> > regions of files.
> 
> Hmmm. How do backup programs now work out if the file has changed
> and hence needs copying again? ie. applications using this will
> break other critical infrastructure in subtle ways.

By using backup infrastructure that doesn't use cmtime.  Like btrfs
send/recv.  Or application level backups that know how to do
incrementals from metadata in giant database files, say, without
walking, comparing, and copying the entire thing.

> > I opted not to name the flag O_NOCMTIME because I didn't want the name
> > to imply that ctime updates would be prevented for other inode changes
> > like updating i_size in truncate.  Not updating ctime is a side-effect
> > of removing mtime updates when it's the only thing changing in the
> > inode.
> 
> If adding this, wouldn't we want to unify O_NOMTIME and
> FMODE_NOCMTIME at the same time?

I could see that, sure.

> > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > owning the file or having the CAP_FOWNER capability.  If we're not
> > comfortable allowing owners to prevent mtime/ctime updates then we
> > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> 
> I dislike "turn off safety for performance" options because Joe
> SpeedRacer will always select performance over safety.

Well, for ceph there's no safety concern.  They never use cmtime in
these files.

So are you suggesting not implementing this and making them rework their
IO paths to avoid the fs maintaining mtime so that we don't give Joe
Speedracer more rope?  Or are we talking about adding some speed bumps
that ceph can flip on that might give Joe Speedracer pause?

- z

Zach Brown May 7, 2015, 6:43 p.m. UTC | #8

> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > owning the file or having the CAP_FOWNER capability.  If we're not
> > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> > 
> > I dislike "turn off safety for performance" options because Joe
> > SpeedRacer will always select performance over safety.
> 
> Well, for ceph there's no safety concern.  They never use cmtime in
> these files.
> 
> So are you suggesting not implementing this and making them rework their
> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> Speedracer more rope?  Or are we talking about adding some speed bumps
> that ceph can flip on that might give Joe Speedracer pause?

Maybe one way to make it less of an attractive nuisance would be to hide
it under open_by_handle_at().  Like xfs_open_by_handle() does today but
we probably don't want to unconditionally add it to the generic path so
we'd have a flag.

They want to move to opening by handles anyway to avoid dirent lookups
when opening cold files.

- z

Richard Weinberger May 7, 2015, 7:09 p.m. UTC | #9

On Thu, May 7, 2015 at 7:20 PM, Zach Brown <zab@redhat.com> wrote:
> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>> > Add the O_NOMTIME flag which prevents mtime from being updated which can
>> > greatly reduce the IO overhead of writes to allocated and initialized
>> > regions of files.
>>
>> Hmmm. How do backup programs now work out if the file has changed
>> and hence needs copying again? ie. applications using this will
>> break other critical infrastructure in subtle ways.
>
> By using backup infrastructure that doesn't use cmtime.  Like btrfs
> send/recv.  Or application level backups that know how to do
> incrementals from metadata in giant database files, say, without
> walking, comparing, and copying the entire thing.

But how can Joey random user know that some of his
applications are using O_NOMTIME and that his KISS backup
program no longer functions as expected?

Andy Lutomirski May 7, 2015, 7:53 p.m. UTC | #10

On Thu, May 7, 2015 at 12:09 PM, Richard Weinberger
<richard.weinberger@gmail.com> wrote:
> On Thu, May 7, 2015 at 7:20 PM, Zach Brown <zab@redhat.com> wrote:
>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>>> > Add the O_NOMTIME flag which prevents mtime from being updated which can
>>> > greatly reduce the IO overhead of writes to allocated and initialized
>>> > regions of files.
>>>
>>> Hmmm. How do backup programs now work out if the file has changed
>>> and hence needs copying again? ie. applications using this will
>>> break other critical infrastructure in subtle ways.
>>
>> By using backup infrastructure that doesn't use cmtime.  Like btrfs
>> send/recv.  Or application level backups that know how to do
>> incrementals from metadata in giant database files, say, without
>> walking, comparing, and copying the entire thing.
>
> But how can Joey random user know that some of his
> applications are using O_NOMTIME and his KISS backup
> program does no longer function as expected?
>

Joey random user can't have a working KISS backup anyway, though,
because we screw up mtime updates on mmap writes.  I have patches
gathering dust that fix that, though.

--Andy

Andy Lutomirski May 7, 2015, 8:06 p.m. UTC | #11

On Thu, May 7, 2015 at 1:02 PM, Richard Weinberger <richard@nod.at> wrote:
> On 07.05.2015 at 21:53, Andy Lutomirski wrote:
>> On Thu, May 7, 2015 at 12:09 PM, Richard Weinberger
>> <richard.weinberger@gmail.com> wrote:
>>> On Thu, May 7, 2015 at 7:20 PM, Zach Brown <zab@redhat.com> wrote:
>>>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>>>>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>>>>>> Add the O_NOMTIME flag which prevents mtime from being updated which can
>>>>>> greatly reduce the IO overhead of writes to allocated and initialized
>>>>>> regions of files.
>>>>>
>>>>> Hmmm. How do backup programs now work out if the file has changed
>>>>> and hence needs copying again? ie. applications using this will
>>>>> break other critical infrastructure in subtle ways.
>>>>
>>>> By using backup infrastructure that doesn't use cmtime.  Like btrfs
>>>> send/recv.  Or application level backups that know how to do
>>>> incrementals from metadata in giant database files, say, without
>>>> walking, comparing, and copying the entire thing.
>>>
>>> But how can Joey random user know that some of his
>>> applications are using O_NOMTIME and his KISS backup
>>> program does no longer function as expected?
>>>
>>
>> Joey random user can't have a working KISS backup anyway, though,
>> because we screw up mtime updates on mmap writes.  I have patches
>> gathering dust that fix that, though.
>
> Hmmm, I thought mtime will be updated upon msync()?
> Assuming a sane application is using msync()...
>

So would I.  Unfortunately, mtime is updated on the page fault that
makes an mmapped page writeable, thus guaranteeing that the resulting
mtime is stale if you mmap a file, write to it, unmap it, and close
it.  It's much more stale if you mmap it, write, wait for a while but
not long enough that the page is automatically written back, write
again, unmap, and close.
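
The sequence described above can be made concrete with a small C sketch.
The helper name mmap_write_twice() is hypothetical, and the mtime it reports
will depend on the kernel version and filesystem, so no particular result is
claimed here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Store to the first byte of a shared mapping, wait gap_seconds, store
 * again, then unmap, close, and report the file's mtime.  The file must
 * already exist and be at least one byte long.  Returns 0 on success or
 * a negative errno value.
 */
static int mmap_write_twice(const char *path, unsigned int gap_seconds,
			    struct timespec *mtime_out)
{
	struct stat st;
	char *map;
	int fd, ret = 0;

	assert(path != NULL && mtime_out != NULL);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -errno;

	map = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		ret = -errno;
		close(fd);
		return ret;
	}

	map[0] = 'a';		/* first store: takes the write fault */
	sleep(gap_seconds);
	map[0] = 'b';		/* second store: may not fault again */

	if (munmap(map, 1) < 0)
		ret = -errno;
	if (close(fd) < 0 && ret == 0)
		ret = -errno;
	if (ret == 0 && stat(path, &st) < 0)
		ret = -errno;
	if (ret == 0)
		*mtime_out = st.st_mtim;
	return ret;
}
```

Comparing the reported mtime against the wall-clock time of the second
store, with a gap of a few seconds, is one way to observe the staleness.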

--Andy

Sage Weil May 8, 2015, 1:01 a.m. UTC | #12

On Thu, 7 May 2015, Zach Brown wrote:
> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > owning the file or having the CAP_FOWNER capability.  If we're not
> > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> > 
> > I dislike "turn off safety for performance" options because Joe
> > SpeedRacer will always select performance over safety.
> 
> Well, for ceph there's no safety concern.  They never use cmtime in
> these files.
> 
> So are you suggesting not implementing this and making them rework their
> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> Speedracer more rope?  Or are we talking about adding some speed bumps
> that ceph can flip on that might give Joe Speedracer pause?

I think this is the fundamental question: who do we give the ammunition 
to, the user or app writer, or the sysadmin?

One might argue that we gave the user a similar power with O_NOATIME (the 
power to break applications that assume atime is accurate).  Here we give 
developers/users the power to not update mtime and suffer the consequences 
(like, obviously, breaking mtime-based backups).  It should be pretty 
obvious to anyone using the flag what the consequences are.

Note that we can suffer similar lapses in mtime with fdatasync followed by 
a system crash.  And as Andy points out it's semi-broken for writable 
mmap.  The crash case is obviously a slightly different thing, but the 
idea that mtime can't always be trusted certainly isn't crazy talk.

Or, we can be conservative and require a mount option so that the admin 
has to explicitly allow behavior that might break some existing 
assumptions about mtime/ctime ('-o user_noatime' I guess?).

I'm happy either way, so long as in the end an unprivileged ceph daemon 
avoids the useless work.  In our case we always own the entire mount/disk, 
so a mount option is just fine.

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Trond Myklebust May 8, 2015, 1:23 a.m. UTC | #13

On Thu, May 7, 2015 at 9:01 PM, Sage Weil <sage@newdream.net> wrote:
> On Thu, 7 May 2015, Zach Brown wrote:
>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
>> > > owning the file or having the CAP_FOWNER capability.  If we're not
>> > > comfortable allowing owners to prevent mtime/ctime updates then we
>> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
>> >
>> > I dislike "turn off safety for performance" options because Joe
>> > SpeedRacer will always select performance over safety.
>>
>> Well, for ceph there's no safety concern.  They never use cmtime in
>> these files.
>>
>> So are you suggesting not implementing this and making them rework their
>> IO paths to avoid the fs maintaining mtime so that we don't give Joe
>> Speedracer more rope?  Or are we talking about adding some speed bumps
>> that ceph can flip on that might give Joe Speedracer pause?
>
> I think this is the fundamental question: who do we give the ammunition
> to, the user or app writer, or the sysadmin?
>
> One might argue that we gave the user a similar power with O_NOATIME (the
> power to break applications that assume atime is accurate).  Here we give
> developers/users the power to not update mtime and suffer the consequences
> (like, obviously, breaking mtime-based backups).  It should be pretty
> obvious to anyone using the flag what the consequences are.
>
> Note that we can suffer similar lapses in mtime with fdatasync followed by
> a system crash.  And as Andy points out it's semi-broken for writable
> mmap.  The crash case is obviously a slightly different thing, but the
> idea that mtime can't always be trusted certainly isn't crazy talk.
>
> Or, we can be conservative and require a mount option so that the admin
> has to explicitly allow behavior that might break some existing
> assumptions about mtime/ctime ('-o user_noatime' I guess?).
>
> I'm happy either way, so long as in the end an unprivileged ceph daemon
> avoids the useless work.  In our case we always own the entire mount/disk,
> so a mount option is just fine.
>

So, what is the expectation here for filesystems that cannot support
this flag? NFSv3 in particular would break pretty catastrophically if
someone decided on a whim to turn off mtime: they will have turned off
the client's ability to detect cache incoherencies.

Cheers
  Trond

Dave Chinner May 8, 2015, 2:37 a.m. UTC | #14

On Thu, May 07, 2015 at 10:20:53AM -0700, Zach Brown wrote:
> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> > > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > > greatly reduce the IO overhead of writes to allocated and initialized
> > > regions of files.
> > 
> > Hmmm. How do backup programs now work out if the file has changed
> > and hence needs copying again? ie. applications using this will
> > break other critical infrastructure in subtle ways.
> 
> By using backup infrastructure that doesn't use cmtime.  Like btrfs
> send/recv.  Or application level backups that know how to do
> incrementals from metadata in giant database files, say, without
> walking, comparing, and copying the entire thing.

"Use magical thing that doesn't exist"? Really?

e.g. you can't do incremental backups with tools like xfsdump if
mtime is not being updated.  The last thing an admin wants when
doing disaster recovery is to find out that the app started using
O_NOMTIME as a result of the upgrade they did 6 months ago. Hence
the last 6 months of production data isn't in the backups despite
the backup procedure having been extensively tested and verified
when it was first put in place.

> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > owning the file or having the CAP_FOWNER capability.  If we're not
> > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> > 
> > I dislike "turn off safety for performance" options because Joe
> > SpeedRacer will always select performance over safety.
> 
> Well, for ceph there's no safety concern.  They never use cmtime in
> these files.

Understood.

> So are you suggesting not implementing this

No.

> Or are we talking about adding some speed bumps
> that ceph can flip on that might give Joe Speedracer pause?

Yes, but not just Joe Speedracer - if it can be turned on silently
by apps then it's a great big landmine that most users and sysadmins
will not know about until it is too late.

Cheers,

Dave.

Dave Chinner May 8, 2015, 2:42 a.m. UTC | #15

On Thu, May 07, 2015 at 12:53:46PM -0700, Andy Lutomirski wrote:
> On Thu, May 7, 2015 at 12:09 PM, Richard Weinberger
> <richard.weinberger@gmail.com> wrote:
> > On Thu, May 7, 2015 at 7:20 PM, Zach Brown <zab@redhat.com> wrote:
> >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> >>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> >>> > Add the O_NOMTIME flag which prevents mtime from being updated which can
> >>> > greatly reduce the IO overhead of writes to allocated and initialized
> >>> > regions of files.
> >>>
> >>> Hmmm. How do backup programs now work out if the file has changed
> >>> and hence needs copying again? ie. applications using this will
> >>> break other critical infrastructure in subtle ways.
> >>
> >> By using backup infrastructure that doesn't use cmtime.  Like btrfs
> >> send/recv.  Or application level backups that know how to do
> >> incrementals from metadata in giant database files, say, without
> >> walking, comparing, and copying the entire thing.
> >
> > But how can Joey random user know that some of his
> > applications are using O_NOMTIME and his KISS backup
> > program does no longer function as expected?
> >
> 
> Joey random user can't have a working KISS backup anyway, though,
> because we screw up mtime updates on mmap writes.  I have patches
> gathering dust that fix that, though.

They are close enough to be good for backup purposes. The mtime only
need change once per backup period - it doesn't need to be
millisecond accurate. Yes, I know you needed that changed for
different reasons (avoid variable page fault latency), but it
doesn't matter for once-a-day or even once-an-hour incremental
backup scans.

Besides, anyone who cares about accurate backups is doing a backup
from a snapshot so the data and metadata are consistent across the
entire backup. And that makes worries about mmap and mtime
completely irrelevant because a snapshot freezes the filesystem and
hence cleans all the mapped pages. Once the snapshot is taken
the next mmap write will trigger a page fault and so change the
mtime and it will be picked up in the next backup scan...

Cheers,

Dave.

Andy Lutomirski May 8, 2015, 3:24 a.m. UTC | #16

On May 8, 2015 8:11 AM, "Dave Chinner" <david@fromorbit.com> wrote:
>
> On Thu, May 07, 2015 at 10:20:53AM -0700, Zach Brown wrote:
> > On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> > > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> > > > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > > > greatly reduce the IO overhead of writes to allocated and initialized
> > > > regions of files.
> > >
> > > Hmmm. How do backup programs now work out if the file has changed
> > > and hence needs copying again? ie. applications using this will
> > > break other critical infrastructure in subtle ways.
> >
> > By using backup infrastructure that doesn't use cmtime.  Like btrfs
> > send/recv.  Or application level backups that know how to do
> > incrementals from metadata in giant database files, say, without
> > walking, comparing, and copying the entire thing.
>
> "Use magical thing that doesn't exist"? Really?
>
> e.g. you can't do incremental backups with tools like xfsdump if
> mtime is not being updated.  The last thing an admin wants when
> doing disaster recovery is to find out that the app started using
> O_NOMTIME as a result of the upgrade they did 6 months ago. Hence
> the last 6 months of production data isn't in the backups despite
> the backup procedure having been extensively tested and verified
> when it was first put in place.
>
> > > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > > owning the file or having the CAP_FOWNER capability.  If we're not
> > > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> > >
> > > I dislike "turn off safety for performance" options because Joe
> > > SpeedRacer will always select performance over safety.
> >
> > Well, for ceph there's no safety concern.  They never use cmtime in
> > these files.
>
> Understood.
>
> > So are you suggesting not implementing this
>
> No.
>
> > Or are we talking about adding some speed bumps
> > that ceph can flip on that might give Joe Speedracer pause?
>
> Yes, but not just Joe Speedracer - if it can be turned on silently
> by apps then it's a great big landmine that most users and sysadmins
> will not know about until it is too late.

What about programs like tar that explicitly override mtime?  No admin
buy-in is required for that.  Admittedly, that doesn't affect ctime,
nor is it as likely to bite unexpectedly as a nomtime flag.

I think it would be reasonably safe if a mount option had to be set to
allow O_NOCMTIME or such.

--Andy

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

John Stoffel May 8, 2015, 2:29 p.m. UTC | #17

>>>>> "Sage" == Sage Weil <sage@newdream.net> writes:

Sage> On Thu, 7 May 2015, Zach Brown wrote:
>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
>> > > owning the file or having the CAP_FOWNER capability.  If we're not
>> > > comfortable allowing owners to prevent mtime/ctime updates then we
>> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
>> > 
>> > I dislike "turn off safety for performance" options because Joe
>> > SpeedRacer will always select performance over safety.
>> 
>> Well, for ceph there's no safety concern.  They never use cmtime in
>> these files.
>> 
>> So are you suggesting not implementing this and making them rework their
>> IO paths to avoid the fs maintaining mtime so that we don't give Joe
>> Speedracer more rope?  Or are we talking about adding some speed bumps
>> that ceph can flip on that might give Joe Speedracer pause?

Sage> I think this is the fundamental question: who do we give the
Sage> ammunition to, the user or app writer, or the sysadmin?

Sage> One might argue that we gave the user a similar power with
Sage> O_NOATIME (the power to break applications that assume atime is
Sage> accurate).  Here we give developers/users the power to not
Sage> update mtime and suffer the consequences (like, obviously,
Sage> breaking mtime-based backups).  It should be pretty obvious to
Sage> anyone using the flag what the consequences are.

Not modifying atime doesn't really break anything except people who
think they can tell when a file was last accessed.  Which isn't
critical (unless you're in a paranoid, security-conscious place...) but
MTIME is another beast entirely.   Turning that off is going to break
lots of hidden assumptions.  

Sage> Note that we can suffer similar lapses in mtime with fdatasync
Sage> followed by a system crash.  And as Andy points out it's
Sage> semi-broken for writable mmap.  The crash case is obviously a
Sage> slightly different thing, but the idea that mtime can't always
Sage> be trusted certainly isn't crazy talk.

True, but after a crash... people expect and understand there might be
corruption in a filesystem.  

Sage> Or, we can be conservative and require a mount option so that
Sage> the admin has to explicitly allow behavior that might break some
Sage> existing assumptions about mtime/ctime ('-o user_noatime' I
Sage> guess?).

Sage> I'm happy either way, so long as in the end an unprivileged ceph
Sage> daemon avoids the useless work.  In our case we always own the
Sage> entire mount/disk, so a mount option is just fine.

I agree with the mount option, makes it crystal clear.  And then it's
on the sysadmin/owner of the system to understand (ha!) the problems.

This is all me speaking with my Sysadmin hat firmly on my head.

Austin S. Hemmelgarn May 8, 2015, 2:43 p.m. UTC | #18

On 2015-05-07 21:01, Sage Weil wrote:
> On Thu, 7 May 2015, Zach Brown wrote:
>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>>>> The criteria for using O_NOMTIME is the same as for using O_NOATIME:
>>>> owning the file or having the CAP_FOWNER capability.  If we're not
>>>> comfortable allowing owners to prevent mtime/ctime updates then we
>>>> should add a tunable to allow O_NOMTIME.  Maybe a mount option?
>>>
>>> I dislike "turn off safety for performance" options because Joe
>>> SpeedRacer will always select performance over safety.
>>
>> Well, for ceph there's no safety concern.  They never use cmtime in
>> these files.
>>
>> So are you suggesting not implementing this and making them rework their
>> IO paths to avoid the fs maintaining mtime so that we don't give Joe
>> Speedracer more rope?  Or are we talking about adding some speed bumps
>> that ceph can flip on that might give Joe Speedracer pause?
>
> I think this is the fundamental question: who do we give the ammunition
> to, the user or app writer, or the sysadmin?
>
> One might argue that we gave the user a similar power with O_NOATIME (the
> power to break applications that assume atime is accurate).  Here we give
> developers/users the power to not update mtime and suffer the consequences
> (like, obviously, breaking mtime-based backups).  It should be pretty
> obvious to anyone using the flag what the consequences are.
The difference is that the only widely used program that uses atime for 
anything is Mutt (and many people who don't use Mutt just disable 
updating it altogether to improve performance), whereas mtime is used at 
the very least by many backup tools, and pretty much all NFSv{3,2} 
clients, as well as a number of other pieces of software.
>
> Note that we can suffer similar lapses in mtime with fdatasync followed by
> a system crash.  And as Andy points out it's semi-broken for writable
> mmap.  The crash case is obviously a slightly different thing, but the
> idea that mtime can't always be trusted certainly isn't crazy talk.
>
> Or, we can be conservative and require a mount option so that the admin
> has to explicitly allow behavior that might break some existing
> assumptions about mtime/ctime ('-o user_noatime' I guess?).
Personally, I agree that there should be a mount option.  We should make 
sure to put a big fat warning about it in the manpage however, 
irrespective of how it is controlled.
>
> I'm happy either way, so long as in the end an unprivileged ceph daemon
> avoids the useless work.  In our case we always own the entire mount/disk,
> so a mount option is just fine.
> Thanks!
> sage

Eric Sandeen May 8, 2015, 2:44 p.m. UTC | #19

On 5/7/15 10:24 PM, Andy Lutomirski wrote:
> On May 8, 2015 8:11 AM, "Dave Chinner" <david@fromorbit.com> wrote:
>>
>> On Thu, May 07, 2015 at 10:20:53AM -0700, Zach Brown wrote:
>>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>>>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>>>>> Add the O_NOMTIME flag which prevents mtime from being updated which can
>>>>> greatly reduce the IO overhead of writes to allocated and initialized
>>>>> regions of files.
>>>>
>>>> Hmmm. How do backup programs now work out if the file has changed
>>>> and hence needs copying again? ie. applications using this will
>>>> break other critical infrastructure in subtle ways.
>>>
>>> By using backup infrastructure that doesn't use cmtime.  Like btrfs
>>> send/recv.  Or application level backups that know how to do
>>> incrementals from metadata in giant database files, say, without
>>> walking, comparing, and copying the entire thing.
>>
>> "Use magical thing that doesn't exist"? Really?
>>
>> e.g. you can't do incremental backups with tools like xfsdump if
>> mtime is not being updated.  The last thing an admin wants when
>> doing disaster recovery is to find out that the app started using
>> O_NOMTIME as a result of the upgrade they did 6 months ago. Hence
>> the last 6 months of production data isn't in the backups despite
>> the backup procedure having been extensively tested and verified
>> when it was first put in place.
>>
>>>>> The criteria for using O_NOMTIME is the same as for using O_NOATIME:
>>>>> owning the file or having the CAP_FOWNER capability.  If we're not
>>>>> comfortable allowing owners to prevent mtime/ctime updates then we
>>>>> should add a tunable to allow O_NOMTIME.  Maybe a mount option?
>>>>
>>>> I dislike "turn off safety for performance" options because Joe
>>>> SpeedRacer will always select performance over safety.
>>>
>>> Well, for ceph there's no safety concern.  They never use cmtime in
>>> these files.
>>
>> Understood.
>>
>>> So are you suggesting not implementing this
>>
>> No.
>>
>>> Or are we talking about adding some speed bumps
>>> that ceph can flip on that might give Joe Speedracer pause?
>>
>> Yes, but not just Joe Speedracer - if it can be turned on silently
>> by apps then it's a great big landmine that most users and sysadmins
>> will not know about until it is too late.
> 
> What about programs like tar that explicitly override mtime?  No admin
> buy-in is required for that.  Admittedly, that doesn't affect ctime,
> nor is it as likely to bite unexpectedly as a nomtime flag.
> 
> I think it would be reasonably safe if a mount option had to be set to
> allow O_NOCMTIME or such.

I was going to suggest the same.  Make infrastructure available for an app
to request O_NOMTIME, but a mount option must be set to allow it, so the
administrator doesn't get an unhappy surprise at backup-restore time.

(Not a big fan of more twiddly knobs, but that seems to put the control
in all the right places).

-Eric


Sage Weil May 8, 2015, 3:19 p.m. UTC | #20

On Thu, 7 May 2015, Trond Myklebust wrote:
> On Thu, May 7, 2015 at 9:01 PM, Sage Weil <sage@newdream.net> wrote:
> > On Thu, 7 May 2015, Zach Brown wrote:
> >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> >> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> >> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> >> > > owning the file or having the CAP_FOWNER capability.  If we're not
> >> > > comfortable allowing owners to prevent mtime/ctime updates then we
> >> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> >> >
> >> > I dislike "turn off safety for performance" options because Joe
> >> > SpeedRacer will always select performance over safety.
> >>
> >> Well, for ceph there's no safety concern.  They never use cmtime in
> >> these files.
> >>
> >> So are you suggesting not implementing this and making them rework their
> >> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> >> Speedracer more rope?  Or are we talking about adding some speed bumps
> >> that ceph can flip on that might give Joe Speedracer pause?
> >
> > I think this is the fundamental question: who do we give the ammunition
> > to, the user or app writer, or the sysadmin?
> >
> > One might argue that we gave the user a similar power with O_NOATIME (the
> > power to break applications that assume atime is accurate).  Here we give
> > developers/users the power to not update mtime and suffer the consequences
> > (like, obviously, breaking mtime-based backups).  It should be pretty
> > obvious to anyone using the flag what the consequences are.
> >
> > Note that we can suffer similar lapses in mtime with fdatasync followed by
> > a system crash.  And as Andy points out it's semi-broken for writable
> > mmap.  The crash case is obviously a slightly different thing, but the
> > idea that mtime can't always be trusted certainly isn't crazy talk.
> >
> > Or, we can be conservative and require a mount option so that the admin
> > has to explicitly allow behavior that might break some existing
> > assumptions about mtime/ctime ('-o user_noatime' I guess?).
> >
> > I'm happy either way, so long as in the end an unprivileged ceph daemon
> > avoids the useless work.  In our case we always own the entire mount/disk,
> > so a mount option is just fine.
> >
> 
> So, what is the expectation here for filesystems that cannot support
> this flag? NFSv3 in particular would break pretty catastrophically if
> someone decided on a whim to turn off mtime: they will have turned off
> the client's ability to detect cache incoherencies.

Is this based on mtime or ctime?  If the former, could things also
break if a user does, say, some stat(2), write(2), utimes(2) shenanigans?
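
The sequence in question can be sketched as follows. This is an illustration only: the function name write_preserving_mtime() is made up, and it uses fstat(2) and futimens(2) on the open descriptor rather than the path-based stat(2) and utimes(2). It needs no privilege beyond owning the file.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Illustrative only: overwrite the first byte of `path`, then put the
 * original atime/mtime back with futimens(2). Returns 0 on success,
 * -1 on any failure. ctime still changes, because futimens() itself
 * counts as an inode change.
 */
int write_preserving_mtime(const char *path, char byte)
{
	struct stat before;
	struct timespec times[2];
	int fd, ret = -1;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &before) != 0)
		goto out;
	if (pwrite(fd, &byte, 1, 0) != 1)
		goto out;

	times[0] = before.st_atim;	/* restore atime */
	times[1] = before.st_mtim;	/* restore mtime */
	if (futimens(fd, times) == 0)
		ret = 0;
out:
	close(fd);
	return ret;
}
```

After a successful call the data has changed but st_mtime reads the same as before, so a consumer that keys only on mtime would not notice; one that keys on ctime would.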

So, my assumption is that if the mount option isn't there allowing this 
then O_NOMTIME would be a no-op (as opposed to EPERM or something)... but 
maybe that's not the right thing to do.  Whatever we do there, though, I 
suppose NFS would do the same thing?
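
The two possible behaviours when the mount option is absent (silently ignore the flag, or fail the open) could look roughly like this. Everything here is hypothetical: TOY_O_NOMTIME, its value, and toy_filter_flags() are invented for illustration and do not correspond to anything in the patch or the kernel.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_O_NOMTIME 0x1u	/* hypothetical flag bit, not a real O_* value */

enum toy_policy {
	TOY_IGNORE,	/* drop the flag silently (no-op) */
	TOY_REJECT,	/* refuse the open with -EPERM */
};

/*
 * Hypothetical open-time filter: if the caller asked for
 * TOY_O_NOMTIME on a mount that does not allow it, either clear the
 * flag or return -EPERM, depending on `policy`. Returns 0 when the
 * open may proceed with the (possibly modified) flags.
 */
int toy_filter_flags(unsigned int *flags, bool mount_allows,
		     enum toy_policy policy)
{
	if (!(*flags & TOY_O_NOMTIME) || mount_allows)
		return 0;
	if (policy == TOY_REJECT)
		return -EPERM;
	*flags &= ~TOY_O_NOMTIME;
	return 0;
}
```

The no-op variant lets one binary run unchanged on any mount; the EPERM variant tells the application up front that it will not get the behaviour it asked for.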

sage

Zach Brown May 8, 2015, 5:11 p.m. UTC | #21

On Thu, May 07, 2015 at 06:01:23PM -0700, Sage Weil wrote:
> On Thu, 7 May 2015, Zach Brown wrote:
> > On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> > > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> > > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > > owning the file or having the CAP_FOWNER capability.  If we're not
> > > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> > > 
> > > I dislike "turn off safety for performance" options because Joe
> > > SpeedRacer will always select performance over safety.
> > 
> > Well, for ceph there's no safety concern.  They never use cmtime in
> > these files.
> > 
> > So are you suggesting not implementing this and making them rework their
> > IO paths to avoid the fs maintaining mtime so that we don't give Joe
> > Speedracer more rope?  Or are we talking about adding some speed bumps
> > that ceph can flip on that might give Joe Speedracer pause?
> 
> I think this is the fundamental question: who do we give the ammunition 
> to, the user or app writer, or the sysadmin?

Yeah, I think this is right.  Dave doesn't want the possibility of it
bleeding in to installations through irresponsible default use in apps
without explicit buy-in from the people responsible for the backups.

> [...]
> 
> Or, we can be conservative and require a mount option so that the admin 
> has to explicitly allow behavior that might break some existing 
> assumptions about mtime/ctime ('-o user_noatime' I guess?).
> 
> I'm happy either way, so long as in the end an unprivileged ceph daemon 
> avoids the useless work.  In our case we always own the entire mount/disk, 
> so a mount option is just fine.

It seems that the thread has headed towards responding to my suggestion
of a possible mount option with an enthusiastic "yes, please, no
surprises."

So I'll try that.

- z

Dave Chinner May 8, 2015, 10:13 p.m. UTC | #22

On Thu, May 07, 2015 at 09:23:24PM -0400, Trond Myklebust wrote:
> On Thu, May 7, 2015 at 9:01 PM, Sage Weil <sage@newdream.net> wrote:
> > On Thu, 7 May 2015, Zach Brown wrote:
> >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> >> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> >> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> >> > > owning the file or having the CAP_FOWNER capability.  If we're not
> >> > > comfortable allowing owners to prevent mtime/ctime updates then we
> >> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> >> >
> >> > I dislike "turn off safety for performance" options because Joe
> >> > SpeedRacer will always select performance over safety.
> >>
> >> Well, for ceph there's no safety concern.  They never use cmtime in
> >> these files.
> >>
> >> So are you suggesting not implementing this and making them rework their
> >> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> >> Speedracer more rope?  Or are we talking about adding some speed bumps
> >> that ceph can flip on that might give Joe Speedracer pause?
> >
> > I think this is the fundamental question: who do we give the ammunition
> > to, the user or app writer, or the sysadmin?
> >
> > One might argue that we gave the user a similar power with O_NOATIME (the
> > power to break applications that assume atime is accurate).  Here we give
> > developers/users the power to not update mtime and suffer the consequences
> > (like, obviously, breaking mtime-based backups).  It should be pretty
> > obvious to anyone using the flag what the consequences are.
> >
> > Note that we can suffer similar lapses in mtime with fdatasync followed by
> > a system crash.  And as Andy points out it's semi-broken for writable
> > mmap.  The crash case is obviously a slightly different thing, but the
> > idea that mtime can't always be trusted certainly isn't crazy talk.
> >
> > Or, we can be conservative and require a mount option so that the admin
> > has to explicitly allow behavior that might break some existing
> > assumptions about mtime/ctime ('-o user_noatime' I guess?).
> >
> > I'm happy either way, so long as in the end an unprivileged ceph daemon
> > avoids the useless work.  In our case we always own the entire mount/disk,
> > so a mount option is just fine.
> >
> 
> So, what is the expectation here for filesystems that cannot support
> this flag? NFSv3 in particular would break pretty catastrophically if
> someone decided on a whim to turn off mtime: they will have turned off
> the client's ability to detect cache incoherencies.

It's worse than that, now that I think about it. I think nomtime
will break nfsv4 as the I_VERSION check is done *after* the
NO[C]MTIME checks. e.g. the atomic change count used to detect file
changes is only updated during the mtime update on write() calls in
XFS. i.e. when the timestamp is changed, a transaction to change
mtime is run, and that transaction commit bumps the change count.

So cutting out mtime updates at the VFS will prevent XFS and other
I_VERSION aware filesystems from updating the change count that
NFSv4 clients rely on to detect foreign data changes in a file.

Not sure what to do here, because the current NOCMTIME
implementation intentionally cuts out the timestamp update because
its usage is fully invisible IO. i.e. it is used by utilities like
xfs_fsr and HSMs to move data into and out of files without the
application being able to detect the data movement in any way. These
are not data modification operations, though - the file contents as
read by the application do not change despite the fact we are moving
data in and out of the file. In this case we don't want timestamps
or change counters to change on the data movement, so I think we've
actually got a difference in behaviour here between O_NOMTIME and
O_NOCMTIME, right?

i.e. for nfsv4 sanity O_NOMTIME still needs to bump I_VERSION on
write, just not modify the timestamp? In which case, not modifying
the timestamps gains us nothing, because the inode is still dirtied?
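
The ordering problem and the alternative can be pictured with a toy userspace model. This is not kernel code: struct toy_inode and both functions are invented for illustration, and the real VFS and XFS paths are considerably more involved.

```c
#include <stdbool.h>

/* Toy stand-ins for kernel state; none of these are real kernel types. */
struct toy_inode {
	long mtime;		/* last modification time */
	unsigned long version;	/* change counter of the kind NFSv4 reads */
	bool nocmtime;		/* models a "skip timestamp updates" flag */
	bool dirty;		/* inode must be written back on sync */
};

/*
 * Models the ordering described above: the "no timestamp update"
 * check returns early, so the change counter bump is never reached.
 */
void toy_update_time(struct toy_inode *inode, long now)
{
	if (inode->nocmtime)
		return;		/* skips mtime *and* the version bump */

	inode->mtime = now;
	inode->version++;
	inode->dirty = true;
}

/*
 * The alternative: always bump the counter and only skip the
 * timestamp. The inode is dirtied either way, so no IO is saved.
 */
void toy_update_time_keep_version(struct toy_inode *inode, long now)
{
	if (!inode->nocmtime)
		inode->mtime = now;
	inode->version++;
	inode->dirty = true;
}
```

In the first variant a flagged write leaves the inode clean but invisible to anything watching the counter; in the second the counter stays trustworthy but the inode is dirty again.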

The list of caveats on O_NOMTIME seems to be growing...

Cheers,

Dave.

Sage Weil May 8, 2015, 10:24 p.m. UTC | #23

On Sat, 9 May 2015, Dave Chinner wrote:
> On Thu, May 07, 2015 at 09:23:24PM -0400, Trond Myklebust wrote:
> > On Thu, May 7, 2015 at 9:01 PM, Sage Weil <sage@newdream.net> wrote:
> > > On Thu, 7 May 2015, Zach Brown wrote:
> > >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> > >> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> > >> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > >> > > owning the file or having the CAP_FOWNER capability.  If we're not
> > >> > > comfortable allowing owners to prevent mtime/ctime updates then we
> > >> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> > >> >
> > >> > I dislike "turn off safety for performance" options because Joe
> > >> > SpeedRacer will always select performance over safety.
> > >>
> > >> Well, for ceph there's no safety concern.  They never use cmtime in
> > >> these files.
> > >>
> > >> So are you suggesting not implementing this and making them rework their
> > >> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> > >> Speedracer more rope?  Or are we talking about adding some speed bumps
> > >> that ceph can flip on that might give Joe Speedracer pause?
> > >
> > > I think this is the fundamental question: who do we give the ammunition
> > > to, the user or app writer, or the sysadmin?
> > >
> > > One might argue that we gave the user a similar power with O_NOATIME (the
> > > power to break applications that assume atime is accurate).  Here we give
> > > developers/users the power to not update mtime and suffer the consequences
> > > (like, obviously, breaking mtime-based backups).  It should be pretty
> > > obvious to anyone using the flag what the consequences are.
> > >
> > > Note that we can suffer similar lapses in mtime with fdatasync followed by
> > > a system crash.  And as Andy points out it's semi-broken for writable
> > > mmap.  The crash case is obviously a slightly different thing, but the
> > > idea that mtime can't always be trusted certainly isn't crazy talk.
> > >
> > > Or, we can be conservative and require a mount option so that the admin
> > > has to explicitly allow behavior that might break some existing
> > > assumptions about mtime/ctime ('-o user_noatime' I guess?).
> > >
> > > I'm happy either way, so long as in the end an unprivileged ceph daemon
> > > avoids the useless work.  In our case we always own the entire mount/disk,
> > > so a mount option is just fine.
> > >
> > 
> > So, what is the expectation here for filesystems that cannot support
> > this flag? NFSv3 in particular would break pretty catastrophically if
> > someone decided on a whim to turn off mtime: they will have turned off
> > the client's ability to detect cache incoherencies.
> 
> It's worse than that, now that I think about it. I think nomtime
> will break nfsv4 as the I_VERSION check is done *after* the
> NO[C]MTIME checks. e.g. the atomic change count used to detect file
> changes is only updated during the mtime update on write() calls in
> XFS. i.e. when the timestamp is changed, a transaction to change
> mtime is run, and that transaction commit bumps the change count.
> 
> So cutting out mtime updates at the VFS will prevent XFS and other
> I_VERSION aware filesystems from updating the change count that
> NFSv4 clients rely on to detect foreign data changes in a file.
> 
> Not sure what to do here, because the current NOCMTIME
> implementation intentionally cuts out the timestamp update because
> its usage is fully invisible IO. i.e. it is used by utilities like
> xfs_fsr and HSMs to move data into and out of files without the
> application being able to detect the data movement in any way. These
> are not data modification operations, though - the file contents as
> read by the application do not change despite the fact we are moving
> data in and out of the file. In this case we don't want timestamps
> or change counters to change on the data movement, so I think we've
> actually got a difference in behaviour here between O_NOMTIME and
> O_NOCMTIME, right?
> 
> i.e. for nfsv4 sanity O_NOMTIME still needs to bump I_VERSION on
> write, just not modify the timestamp? In which case, not modifying
> the timestamps gains us nothing, because the inode is still dirtied?

Right: if we dirty the inode we've defeated the purpose of the patch.

> The list of caveats on O_NOMTIME seems to be growing...

...and remain consistent with our goals.  We couldn't care less if NFS or 
backup software or anything else doesn't notice these changes.  This is 
private data that is wholly managed by the ceph daemon.  The goal is to 
derive *some* value from the file system and avoid reimplementing it in 
userspace (without the bits we don't need).

I'm sure you realize what we're trying to achieve is the same "invisible IO"
that the XFS open by handle ioctls do by default.  Would you be more
comfortable if this option were only available to the generic
open_by_handle syscall, and not to open(2)?

sage

Trond Myklebust May 10, 2015, 11:13 p.m. UTC | #24

On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote:
> On Sat, 9 May 2015, Dave Chinner wrote:
>> On Thu, May 07, 2015 at 09:23:24PM -0400, Trond Myklebust wrote:
>> > On Thu, May 7, 2015 at 9:01 PM, Sage Weil <sage@newdream.net> wrote:
>> > > On Thu, 7 May 2015, Zach Brown wrote:
>> > >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>> > >> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>> > >> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
>> > >> > > owning the file or having the CAP_FOWNER capability.  If we're not
>> > >> > > comfortable allowing owners to prevent mtime/ctime updates then we
>> > >> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
>> > >> >
>> > >> > I dislike "turn off safety for performance" options because Joe
>> > >> > SpeedRacer will always select performance over safety.
>> > >>
>> > >> Well, for ceph there's no safety concern.  They never use cmtime in
>> > >> these files.
>> > >>
>> > >> So are you suggesting not implementing this and making them rework their
>> > >> IO paths to avoid the fs maintaining mtime so that we don't give Joe
>> > >> Speedracer more rope?  Or are we talking about adding some speed bumps
>> > >> that ceph can flip on that might give Joe Speedracer pause?
>> > >
>> > > I think this is the fundamental question: who do we give the ammunition
>> > > to, the user or app writer, or the sysadmin?
>> > >
>> > > One might argue that we gave the user a similar power with O_NOATIME (the
>> > > power to break applications that assume atime is accurate).  Here we give
>> > > developers/users the power to not update mtime and suffer the consequences
>> > > (like, obviously, breaking mtime-based backups).  It should be pretty
>> > > obvious to anyone using the flag what the consequences are.
>> > >
>> > > Note that we can suffer similar lapses in mtime with fdatasync followed by
>> > > a system crash.  And as Andy points out it's semi-broken for writable
>> > > mmap.  The crash case is obviously a slightly different thing, but the
>> > > idea that mtime can't always be trusted certainly isn't crazy talk.
>> > >
>> > > Or, we can be conservative and require a mount option so that the admin
>> > > has to explicitly allow behavior that might break some existing
>> > > assumptions about mtime/ctime ('-o user_noatime' I guess?).
>> > >
>> > > I'm happy either way, so long as in the end an unprivileged ceph daemon
>> > > avoids the useless work.  In our case we always own the entire mount/disk,
>> > > so a mount option is just fine.
>> > >
>> >
>> > So, what is the expectation here for filesystems that cannot support
>> > this flag? NFSv3 in particular would break pretty catastrophically if
>> > someone decided on a whim to turn off mtime: they will have turned off
>> > the client's ability to detect cache incoherencies.
>>
>> It's worse than that, now that I think about it. I think nomtime
>> will break nfsv4 as the I_VERSION check is done *after* the
>> NO[C]MTIME checks. e.g. the atomic change count used to detect file
>> changes is only updated during the mtime update on write() calls in
>> XFS. i.e. when the timestamp is changed, a transaction to change
>> mtime is run, and that transaction commit bumps the change count.
>>
>> So cutting out mtime updates at the VFS will prevent XFS and other
>> I_VERSION aware filesystems from updating the change count that
>> NFSv4 clients rely on to detect foreign data changes in a file.
>>
>> Not sure what to do here, because the current NOCMTIME
>> implementation intentionally cuts out the timestamp update because
>> it's usage is fully invisible IO. i.e. it is used by utilities like
>> xfs_fsr and HSMs to move data into and out of files without the
>> application being able to detect the data movement in any way. These
>> are not data modification operations, though - the file contents as
>> read by the application do not change despite the fact we are moving
>> data in and out of the file. In this case we don't want timestamps
>> or change counters to change on the data movement, so I think we've
>> actually got a difference in behaviour here between O_NOMTIME and
>> O_NOCMTIME, right?
>>
>> i.e. for nfsv4 sanity O_NOMTIME still needs to bump I_VERSION on
>> write, just not modify the timestamp? In which case, not modifying
>> the timestamps gains us nothing, because the inode is still dirtied?
>
> Right: if we dirty the inode we've defeated the purpose of the patch.
>
>> The list of caveats on O_NOMTIME seems to be growing...
>
> ...and remain consistent with our goals.  We couldn't care less if NFS or
> backup software or anything else doesn't notice these changes.  This is
> private data that is wholly managed by the ceph daemon.  The goal is to
> derive *some* value from the file system and avoid reimplementing it in
> userspace (without the bits we don't need).

That makes it completely non-generic though. By putting this in the
VFS, you are giving applications a loaded gun that is pointed straight
at the application user's head.

> I'm sure you realize what we're try to achieve is the same "invisible IO"
> that the XFS open by handle ioctls do by default.  Would you be more
> comfortable if this option where only available to the generic
> open_by_handle syscall, and not to open(2)?

It should be an ioctl(). It has no business being part of
open_by_handle either, since that is another generic interface.

Cheers
  Trond

Dave Chinner May 11, 2015, 7:31 a.m. UTC | #25

On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote:
> > I'm sure you realize what we're try to achieve is the same "invisible IO"
> > that the XFS open by handle ioctls do by default.  Would you be more
> > comfortable if this option where only available to the generic
> > open_by_handle syscall, and not to open(2)?
> 
> It should be an ioctl(). It has no business being part of
> open_by_handle either, since that is another generic interface.

I'm happy for it to be an ioctl interface - even an XFS specific
interface if you want to go that route, Sage - and it probably
should emit a warning to syslog the first time it is used so there is
a trace for bug triage purposes. i.e. we know the app is not using
mtime updates, so bug reports that are the result of mtime
mishandling don't result in large amounts of wasted developer time
trying to understand them...

Cheers,

Dave.

Theodore Ts'o May 11, 2015, 2:47 p.m. UTC | #26

On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> That makes it completely non-generic though. By putting this in the
> VFS, you are giving applications a loaded gun that is pointed straight
> at the application user's head.

Let me re-ask the question that I asked last week (and that was
apparently ignored).  Why not try to use the lazytime feature instead of
pointing a gun straight at the application's --- and system
administrators' --- heads?

						- Ted

Sage Weil May 11, 2015, 4:24 p.m. UTC | #27

On Mon, 11 May 2015, Theodore Ts'o wrote:
> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> > That makes it completely non-generic though. By putting this in the
> > VFS, you are giving applications a loaded gun that is pointed straight
> > at the application user's head.
> 
> Let me re-ask the question that I asked last week (and was apparently
> ignored).  Why not trying to use the lazytime feature instead of
> pointing a head straight at the application's --- and system
> administrators' --- heads?

Sorry Ted, I thought I responded already.

The goal is to avoid inode writeout entirely when we can, and 
as I understand it lazytime will still force writeout before the inode 
is dropped from the cache.  In systems like Ceph in particular, the 
IOs can be spread across lots of files, so simply deferring writeout 
doesn't always help.

sage

Sage Weil May 11, 2015, 4:39 p.m. UTC | #28

On Mon, 11 May 2015, Dave Chinner wrote:
> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote:
> > > I'm sure you realize what we're try to achieve is the same "invisible IO"
> > > that the XFS open by handle ioctls do by default.  Would you be more
> > > comfortable if this option where only available to the generic
> > > open_by_handle syscall, and not to open(2)?
> > 
> > It should be an ioctl(). It has no business being part of
> > open_by_handle either, since that is another generic interface.

Our use-case doesn't make sense on network file systems, but it does on 
any reasonably featureful local filesystem, and the goal is to be generic 
there.  If mtime is critical to a network file system's consistency it 
seems pretty reasonable to disallow/ignore it for just that file system 
(e.g., by masking off the flag at open time), as others won't have that 
same problem (cephfs doesn't, for example).

Perhaps making each fs opt-in instead of handling it in a generic path 
would alleviate this concern?

> I'm happy for it to be an ioctl interface - even an XFS specific
> interface if you want to go that route, Sage - and it probably
> should emit a warning to syslog first time it is used so there is
> trace for bug triage purposes. i.e. we know the app is not using
> mtime updates, so bug reports that are the result of mtime
> mishandling don't result in large amounts of wasted developer time
> trying to understand them...

A warning on using the interface (or when mounting with user_nomtime) 
sounds reasonable.

I'd rather not make this XFS specific as other local filesystems (ext4,
f2fs, possibly btrfs) would similarly benefit.  (And if we want to target 
XFS specifically the existing XFS open-by-handle ioctl is sufficient as it 
already does O_NOMTIME unconditionally.)

sage
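
One way to picture the gating proposals made so far in the thread (the owner or CAP_FOWNER check from the patch description, a per-filesystem opt-in, and an admin-controlled mount option) is the toy check below. It is a sketch, not the patch's implementation: `effective_open_flags` and its parameters are hypothetical, the flag bit values are placeholders, and combining all three gates in this order is an assumption made for illustration.

```python
O_NOMTIME = 0x1   # hypothetical bit, meaningful only inside this toy model
O_DIRECT = 0x2    # placeholder bit for an unrelated flag; not the real value


def effective_open_flags(requested, *, is_owner, has_cap_fowner,
                         fs_opted_in, mount_allows):
    """Return the flags a toy open() would actually honour.

    Callers that neither own the file nor hold CAP_FOWNER are refused,
    mirroring the O_NOATIME-style criteria quoted in the thread. A
    filesystem that has not opted in, or a mount without the admin's
    option, silently masks the flag off instead of failing.
    """
    if not requested & O_NOMTIME:
        return requested
    if not (is_owner or has_cap_fowner):
        raise PermissionError("O_NOMTIME requires ownership or CAP_FOWNER")
    if not (fs_opted_in and mount_allows):
        return requested & ~O_NOMTIME   # mask the flag off at open time
    return requested
```

Silently masking the flag keeps applications portable across filesystems, at the cost of the application not learning that mtime updates are still happening; returning an error instead would be the other obvious choice.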

Trond Myklebust May 11, 2015, 5:12 p.m. UTC | #29

On Mon, May 11, 2015 at 12:39 PM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 11 May 2015, Dave Chinner wrote:
>> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
>> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote:
>> > > I'm sure you realize what we're try to achieve is the same "invisible IO"
>> > > that the XFS open by handle ioctls do by default.  Would you be more
>> > > comfortable if this option where only available to the generic
>> > > open_by_handle syscall, and not to open(2)?
>> >
>> > It should be an ioctl(). It has no business being part of
>> > open_by_handle either, since that is another generic interface.
>
> Our use-case doesn't make sense on network file systems, but it does on
> any reasonably featureful local filesystem, and the goal is to be generic
> there.  If mtime is critical to a network file system's consistency it
> seems pretty reasonable to disallow/ignore it for just that file system
> (e.g., by masking off the flag at open time), as others won't have that
> same problem (cephfs doesn't, for example).
>
> Perhaps making each fs opt-in instead of handling it in a generic path
> would alleviate this concern?

The issue isn't whether or not you have a network file system, it's
whether or not you want users to be able to manage data. mtime isn't
useful for the application (which knows whether or not it has changed
the file) or for the filesystem (ditto). It exists, rather, in order
to enable data management by users and other applications, letting
them know whether or not the data contents of the file have changed,
and when that change occurred.

If you are able to guarantee that your users don't care about that,
then fine, but that would be a very special case that doesn't fit the
way that most data centres are run. Backups are one case where mtime
matters, tiering and archiving is another. Neither of these example
cases is under the control of the application that calls
open(O_NOMTIME).

>> I'm happy for it to be an ioctl interface - even an XFS specific
>> interface if you want to go that route, Sage - and it probably
>> should emit a warning to syslog first time it is used so there is
>> trace for bug triage purposes. i.e. we know the app is not using
>> mtime updates, so bug reports that are the result of mtime
>> mishandling don't result in large amounts of wasted developer time
>> trying to understand them...
>
> A warning on using the interface (or when mounting with user_nomtime)
> sounds reasonable.
>
> I'd rather not make this XFS specific as other local filesystmes (ext4,
> f2fs, possibly btrfs) would similarly benefit.  (And if we want to target
> XFS specifically the existing XFS open-by-handle ioctl is sufficient as it
> already does O_NOMTIME unconditionally.)

Lack of a namespace doesn't imply that you don't want to manage the
data. The whole point of using object storage instead of plain old
block storage is to be able to provide whatever metadata you still
need in order to manage the object.

Cheers
  Trond

Sage Weil May 11, 2015, 5:30 p.m. UTC | #30

On Mon, 11 May 2015, Trond Myklebust wrote:
> On Mon, May 11, 2015 at 12:39 PM, Sage Weil <sage@newdream.net> wrote:
> > On Mon, 11 May 2015, Dave Chinner wrote:
> >> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> >> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote:
> >> > > I'm sure you realize what we're try to achieve is the same "invisible IO"
> >> > > that the XFS open by handle ioctls do by default.  Would you be more
> >> > > comfortable if this option where only available to the generic
> >> > > open_by_handle syscall, and not to open(2)?
> >> >
> >> > It should be an ioctl(). It has no business being part of
> >> > open_by_handle either, since that is another generic interface.
> >
> > Our use-case doesn't make sense on network file systems, but it does on
> > any reasonably featureful local filesystem, and the goal is to be generic
> > there.  If mtime is critical to a network file system's consistency it
> > seems pretty reasonable to disallow/ignore it for just that file system
> > (e.g., by masking off the flag at open time), as others won't have that
> > same problem (cephfs doesn't, for example).
> >
> > Perhaps making each fs opt-in instead of handling it in a generic path
> > would alleviate this concern?
> 
> The issue isn't whether or not you have a network file system, it's
> whether or not you want users to be able to manage data. mtime isn't
> useful for the application (which knows whether or not it has changed
> the file) or for the filesystem (ditto). It exists, rather, in order
> to enable data management by users and other applications, letting
> them know whether or not the data contents of the file have changed,
> and when that change occurred.

Agreed.
 
> If you are able to guarantee that your users don't care about that,
> then fine, but that would be a very special case that doesn't fit the
> way that most data centres are run. Backups are one case where mtime
> matters, tiering and archiving is another.

This is true, although I argue it is becoming increasingly common for the 
data management (including backups and so forth) to be layered not on top 
of the POSIX file system but on something higher up in the stack. This is 
true of pretty much any distributed system (ceph, cassandra, mongo, etc., 
and I assume commercial databases like Oracle, too) where backups, 
replication, and any other DR strategies need to be orchestrated across 
nodes to be consistent--simply copying files out from underneath them is 
already insufficient and a recipe for disaster.

There is a growing category of applications that can benefit from this 
capability...

> Neither of these examples
> cases are under the control of the application that calls
> open(O_NOMTIME).

Wouldn't a mount option (e.g., allow_nomtime) address this concern?  Only
nodes provisioned explicitly to run these systems would enable this
option.

> >> I'm happy for it to be an ioctl interface - even an XFS specific
> >> interface if you want to go that route, Sage - and it probably
> >> should emit a warning to syslog first time it is used so there is
> >> trace for bug triage purposes. i.e. we know the app is not using
> >> mtime updates, so bug reports that are the result of mtime
> >> mishandling don't result in large amounts of wasted developer time
> >> trying to understand them...
> >
> > A warning on using the interface (or when mounting with user_nomtime)
> > sounds reasonable.
> >
> > I'd rather not make this XFS specific as other local filesystmes (ext4,
> > f2fs, possibly btrfs) would similarly benefit.  (And if we want to target
> > XFS specifically the existing XFS open-by-handle ioctl is sufficient as it
> > already does O_NOMTIME unconditionally.)
> 
> Lack of a namespace, doesn't imply that you don't want to manage the
> data. The whole point of using object storage instead of plain old
> block storage is to be able to provide whatever metadata you still
> need in order to manage the object.

Yeah, agreed--this is presumably why open_by_handle(2) (which is what we'd 
like to use) doesn't assume O_NOMTIME.

Thanks!
sage

J. Bruce Fields May 11, 2015, 8:36 p.m. UTC | #31

On Fri, May 08, 2015 at 09:44:25AM -0500, Eric Sandeen wrote:
> On 5/7/15 10:24 PM, Andy Lutomirski wrote:
> > On May 8, 2015 8:11 AM, "Dave Chinner" <david@fromorbit.com> wrote:
> >>
> >> On Thu, May 07, 2015 at 10:20:53AM -0700, Zach Brown wrote:
> >>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> >>>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> >>>>> Add the O_NOMTIME flag which prevents mtime from being updated which can
> >>>>> greatly reduce the IO overhead of writes to allocated and initialized
> >>>>> regions of files.
> >>>>
> >>>> Hmmm. How do backup programs now work out if the file has changed
> >>>> and hence needs copying again? ie. applications using this will
> >>>> break other critical infrastructure in subtle ways.
> >>>
> >>> By using backup infrastructure that doesn't use cmtime.  Like btrfs
> >>> send/recv.  Or application level backups that know how to do
> >>> incrementals from metadata in giant database files, say, without
> >>> walking, comparing, and copying the entire thing.
> >>
> >> "Use magical thing that doesn't exist"? Really?
> >>
> >> e.g. you can't do incremental backups with tools like xfsdump if
> >> mtime is not being updated.  The last thing an admin wants when
> >> doing disaster recovery is to find out that the app started using
> >> O_NOMTIME as a result of the upgrade they did 6 months ago. Hence
> >> the last 6 months of production data isn't in the backups despite
> >> the backup procedure having been extensively tested and verified
> >> when it was first put in place.
> >>
> >>>>> The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> >>>>> owning the file or having the CAP_FOWNER capability.  If we're not
> >>>>> comfortable allowing owners to prevent mtime/ctime updates then we
> >>>>> should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> >>>>
> >>>> I dislike "turn off safety for performance" options because Joe
> >>>> SpeedRacer will always select performance over safety.
> >>>
> >>> Well, for ceph there's no safety concern.  They never use cmtime in
> >>> these files.
> >>
> >> Understood.
> >>
> >>> So are you suggesting not implementing this
> >>
> >> No.
> >>
> >>> Or are we talking about adding some speed bumps
> >>> that ceph can flip on that might give Joe Speedracer pause?
> >>
> >> Yes, but not just Joe Speedracer - if it can be turned on silently
> >> by apps then it's a great big landmine that most users and sysadmins
> >> will not know about until it is too late.
> > 
> > What about programs like tar that explicitly override mtime?  No admin
> > buy-in is required for that.  Admittedly, that doesn't affect ctime,
> > nor is it as likely to bite unexpectedly as a nomtime flag.
> > 
> > I think it would be reasonably safe if a mount option had to be set to
> > allow O_NOCMTIME or such.
> 
> I was going to suggest the same.  Make infrastructure available for an app
> to request O_NOMTIME, but a mount option must be set to allow it, so the
> administrator doesn't get an unhappy surprise at backup-restore time.
> 
> (Not a big fan of more twiddly knobs, but that seems to put the control
> in all the right places).

It seems more like a permanent feature of the filesystem than a
per-mount option: once you've turned off mtime updates you lose
information that can't be regained after remounting.  A mkfs option
might make more sense?  But I guess those aren't very generic.

(I do hope we can get an O_NOMTIME flag, it will make me smile every
time I see it....)

--b.

Theodore Ts'o May 11, 2015, 11:10 p.m. UTC | #32

On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
> > Let me re-ask the question that I asked last week (and was apparently
> > ignored).  Why not trying to use the lazytime feature instead of
> > pointing a head straight at the application's --- and system
> > administrators' --- heads?
> 
> Sorry Ted, I thought I responded already.
> 
> The goal is to avoid inode writeout entirely when we can, and 
> as I understand it lazytime will still force writeout before the inode 
> is dropped from the cache.  In systems like Ceph in particular, the 
> IOs can be spread across lots of files, so simply deferring writeout 
> doesn't always help.

Sure, but it would reduce the writeout by orders of magnitude.  I can
understand if you want to reduce it further, but it might be good
enough for your purposes.

I considered doing the equivalent of O_NOMTIME for our purposes at
$WORK, and our use case is actually not that different from Ceph's
(i.e., using a local disk file system to support a cluster file
system), and lazytime was (a) something I figured I could upstream
in good conscience, and (b) more than good enough for us.

Cheers,

					- Ted

P.S.  I do agree that if we do need this upstream, requiring a mount
option to enable the feature is probably a good compromise.
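
The disagreement over lazytime can be made concrete with a toy cache model. This is a sketch under simplifying assumptions, not a description of the kernel's behaviour: `lazytime_writeouts` and `eager_writeouts` are hypothetical helpers, the inode cache is a plain LRU of fixed size, and the model counts only the eviction-time writeout mentioned in the thread, ignoring every other reason an inode might be written.

```python
from collections import OrderedDict


def lazytime_writeouts(file_ids, cache_size):
    """Count inode writeouts when timestamp updates are flushed only on eviction.

    file_ids is the sequence of files written to; cache_size is how many
    inodes the toy LRU cache can hold. Every write marks the inode
    timestamp-dirty, and evicting a dirty inode costs one writeout.
    """
    cache = OrderedDict()   # file id -> timestamp-dirty flag, oldest first
    writeouts = 0
    for fid in file_ids:
        if fid in cache:
            cache.move_to_end(fid)                     # refresh LRU position
        elif len(cache) >= cache_size:
            _, was_dirty = cache.popitem(last=False)   # evict least recently used
            if was_dirty:
                writeouts += 1
        cache[fid] = True                              # this write dirties the timestamp
    return writeouts


def eager_writeouts(file_ids):
    """Baseline: every write is followed by a sync that writes the inode."""
    return len(file_ids)


hot = [i % 10 for i in range(10_000)]        # 10 files; they all fit in the cache
spread = [i % 1_000 for i in range(10_000)]  # 1000 files cycled through a smaller cache
```

With a 100-entry cache, the `hot` workload never evicts anything, which matches the "orders of magnitude" reduction; the `spread` workload misses on every access, so nearly every write still ends in an inode writeout, which is the case where deferring alone does not help.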

Dave Chinner May 12, 2015, 1:21 a.m. UTC | #33

On Mon, May 11, 2015 at 10:30:58AM -0700, Sage Weil wrote:
> On Mon, 11 May 2015, Trond Myklebust wrote:
> > On Mon, May 11, 2015 at 12:39 PM, Sage Weil <sage@newdream.net> wrote:
> > > On Mon, 11 May 2015, Dave Chinner wrote:
> > >> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> > >> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote:
> > >> > > I'm sure you realize what we're try to achieve is the same "invisible IO"
> > >> > > that the XFS open by handle ioctls do by default.  Would you be more
> > >> > > comfortable if this option where only available to the generic
> > >> > > open_by_handle syscall, and not to open(2)?
> > >> >
> > >> > It should be an ioctl(). It has no business being part of
> > >> > open_by_handle either, since that is another generic interface.
> > >
> > > Our use-case doesn't make sense on network file systems, but it does on
> > > any reasonably featureful local filesystem, and the goal is to be generic
> > > there.  If mtime is critical to a network file system's consistency it
> > > seems pretty reasonable to disallow/ignore it for just that file system
> > > (e.g., by masking off the flag at open time), as others won't have that
> > > same problem (cephfs doesn't, for example).
> > >
> > > Perhaps making each fs opt-in instead of handling it in a generic path
> > > would alleviate this concern?
> > 
> > The issue isn't whether or not you have a network file system, it's
> > whether or not you want users to be able to manage data. mtime isn't
> > useful for the application (which knows whether or not it has changed
> > the file) or for the filesystem (ditto). It exists, rather, in order
> > to enable data management by users and other applications, letting
> > them know whether or not the data contents of the file have changed,
> > and when that change occurred.
> 
> Agreed.
>  
> > If you are able to guarantee that your users don't care about that,
> > then fine, but that would be a very special case that doesn't fit the
> > way that most data centres are run. Backups are one case where mtime
> > matters, tiering and archiving is another.
> 
> This is true, although I argue it is becoming increasingly common for the 
> data management (including backups and so forth) to be layered not on top 
> of the POSIX file system but on something higher up in the stack. This is 

In the cloud storage world, yes. In the rest of the world, no.
It's the rest of the world we are worried about here. :/

> > Neither of these examples
> > cases are under the control of the application that calls
> > open(O_NOMTIME).
> 
> Wouldn't a mount option (e.g., allow_nomtime) address this concern?  Only 
> nodes provisioned explicitly to run these systems would be enable this 
> option.

Back to my Joe Speedracer comments.....

I'm not sure what the right answer is - mount options are simply too
easy to add without understanding the full implications of them.
e.g. we didn't merge FALLOC_FL_NO_HIDE_STALE simply because it was
too dangerous for unsuspecting users. This isn't at that same level
of concern, but it's still a landmine we want to prevent users from
arming without realising it...

> > >> I'm happy for it to be an ioctl interface - even an XFS specific
> > >> interface if you want to go that route, Sage - and it probably
> > >> should emit a warning to syslog first time it is used so there is
> > >> trace for bug triage purposes. i.e. we know the app is not using
> > >> mtime updates, so bug reports that are the result of mtime
> > >> mishandling don't result in large amounts of wasted developer time
> > >> trying to understand them...
> > >
> > > A warning on using the interface (or when mounting with user_nomtime)
> > > sounds reasonable.
> > >
> > > I'd rather not make this XFS specific as other local filesystmes (ext4,
> > > f2fs, possibly btrfs) would similarly benefit.  (And if we want to target
> > > XFS specifically the existing XFS open-by-handle ioctl is sufficient as it
> > > already does O_NOMTIME unconditionally.)
> > 
> > Lack of a namespace, doesn't imply that you don't want to manage the
> > data. The whole point of using object storage instead of plain old
> > block storage is to be able to provide whatever metadata you still
> > need in order to manage the object.
> 
> Yeah, agreed--this is presumably why open_by_handle(2) (which is what we'd 
> like to use) doesn't assume O_NOMTIME.

Right - the XFS ioctls were designed specifically for applications
that interacted directly with the structure of XFS filesystems and
so needed invisible IO (e.g. online defragmenter). IOWs, they are
not interfaces intended for general usage. They are also only
available to root, so a typical user application won't be making use
of them, either.

Cheers,

Dave.

Kevin Easton May 12, 2015, 5:08 a.m. UTC | #34

On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
> > > Let me re-ask the question that I asked last week (and was apparently
> > > ignored).  Why not trying to use the lazytime feature instead of
> > > pointing a head straight at the application's --- and system
> > > administrators' --- heads?
> > 
> > Sorry Ted, I thought I responded already.
> > 
> > The goal is to avoid inode writeout entirely when we can, and 
> > as I understand it lazytime will still force writeout before the inode 
> > is dropped from the cache.  In systems like Ceph in particular, the 
> > IOs can be spread across lots of files, so simply deferring writeout 
> > doesn't always help.
> 
> Sure, but it would reduce the writeout by orders of magnitude.  I can
> understand if you want to reduce it further, but it might be good
> enough for your purposes.
> 
> I considered doing the equivalent of O_NOMTIME for our purposes at
> $WORK, and our use case is actually not that different from Ceph's
> (i.e., using a local disk file system to support a cluster file
> system), and lazytime was (a) something I figured was something I
> could upstream in good conscience, and (b) was more than good enough
> for us.

A safer alternative might be a chattr file attribute that, when set,
stops the mtime from being updated on writes and makes stat() on the file
always show the mtime as "right now".  At least that way, the file won't
accidentally get left out of backups that rely on the mtime.

(If the file attribute is unset, you immediately update the mtime then
too, and from then on the file is back to normal).

    - Kevin
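
The semantics of the attribute proposed above can be sketched as a small model. Everything here is hypothetical: `ToyFile`, `nomtime_attr` and the injected `clock` callable are invented for illustration, and no such chattr attribute is being claimed to exist.

```python
class ToyFile:
    """Toy file whose mtime behaviour follows the proposed attribute."""

    def __init__(self, clock):
        self._clock = clock              # callable returning the current time
        self._stored_mtime = clock()     # what would be persisted in the inode
        self.nomtime_attr = False

    def write(self):
        """A data write; the stored mtime changes only if the attribute is unset."""
        if not self.nomtime_attr:
            self._stored_mtime = self._clock()

    def stat_mtime(self):
        """What stat() would report: "right now" while the attribute is set."""
        if self.nomtime_attr:
            return self._clock()
        return self._stored_mtime

    def set_attr(self, enabled):
        """Set or clear the attribute; clearing it stamps the mtime immediately."""
        if self.nomtime_attr and not enabled:
            self._stored_mtime = self._clock()
        self.nomtime_attr = enabled
```

Because the reported mtime is always current while the attribute is set, an mtime-based incremental backup would treat the file as changed on every run, which errs on the side of copying too much rather than missing data.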

Austin S. Hemmelgarn May 12, 2015, 11:45 a.m. UTC | #35

On 2015-05-12 01:08, Kevin Easton wrote:
> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
>> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
>>>> Let me re-ask the question that I asked last week (and was apparently
>>>> ignored).  Why not trying to use the lazytime feature instead of
>>>> pointing a head straight at the application's --- and system
>>>> administrators' --- heads?
>>>
>>> Sorry Ted, I thought I responded already.
>>>
>>> The goal is to avoid inode writeout entirely when we can, and
>>> as I understand it lazytime will still force writeout before the inode
>>> is dropped from the cache.  In systems like Ceph in particular, the
>>> IOs can be spread across lots of files, so simply deferring writeout
>>> doesn't always help.
>>
>> Sure, but it would reduce the writeout by orders of magnitude.  I can
>> understand if you want to reduce it further, but it might be good
>> enough for your purposes.
>>
>> I considered doing the equivalent of O_NOMTIME for our purposes at
>> $WORK, and our use case is actually not that different from Ceph's
>> (i.e., using a local disk file system to support a cluster file
>> system), and lazytime was (a) something I figured was something I
>> could upstream in good conscience, and (b) was more than good enough
>> for us.
>
> A safer alternative might be a chattr file attribute that if set, the
> mtime is not updated on writes, and stat() on the file always shows the
> mtime as "right now".  At least that way, the file won't accidentally
> get left out of backups that rely on the mtime.
>
> (If the file attribute is unset, you immediately update the mtime then
> too, and from then on the file is back to normal).
>
I like this even better than the flag suggestion, it provides better 
control, means that you don't need to update applications to get the 
benefits, and prevents backup software from breaking (although backups 
would be bigger).
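
To make the proposed semantics concrete, here is a minimal userspace model of the attribute as described above. It is only a sketch: `struct model_inode` and the `model_*` helpers are hypothetical names invented for illustration and are not kernel code.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical in-memory model of an inode; not a kernel structure. */
struct model_inode {
    bool   nomtime;      /* hypothetical "no mtime updates" attribute */
    time_t stored_mtime; /* last mtime that was actually recorded */
};

/* A write only records a new mtime while the attribute is clear. */
void model_write(struct model_inode *ino, time_t now)
{
    if (!ino->nomtime)
        ino->stored_mtime = now;
}

/* stat() reports "right now" for as long as the attribute is set. */
time_t model_stat_mtime(const struct model_inode *ino, time_t now)
{
    return ino->nomtime ? now : ino->stored_mtime;
}

/* Clearing the attribute stamps mtime once; the file is then normal again. */
void model_clear_nomtime(struct model_inode *ino, time_t now)
{
    ino->nomtime = false;
    ino->stored_mtime = now;
}
```

With this model, an mtime-based backup always sees a flagged file as changed, which is the "bigger but never missing" failure mode mentioned above.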

John Stoffel May 12, 2015, 1:41 p.m. UTC | #36

>>>>> "Sage" == Sage Weil <sage@newdream.net> writes:

Sage> On Mon, 11 May 2015, Trond Myklebust wrote:
>> On Mon, May 11, 2015 at 12:39 PM, Sage Weil <sage@newdream.net> wrote:
>> > On Mon, 11 May 2015, Dave Chinner wrote:
>> >> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
>> >> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil <sage@newdream.net> wrote:
>> >> > > I'm sure you realize what we're try to achieve is the same "invisible IO"
>> >> > > that the XFS open by handle ioctls do by default.  Would you be more
>> >> > > comfortable if this option where only available to the generic
>> >> > > open_by_handle syscall, and not to open(2)?
>> >> >
>> >> > It should be an ioctl(). It has no business being part of
>> >> > open_by_handle either, since that is another generic interface.
>> >
>> > Our use-case doesn't make sense on network file systems, but it does on
>> > any reasonably featureful local filesystem, and the goal is to be generic
>> > there.  If mtime is critical to a network file system's consistency it
>> > seems pretty reasonable to disallow/ignore it for just that file system
>> > (e.g., by masking off the flag at open time), as others won't have that
>> > same problem (cephfs doesn't, for example).
>> >
>> > Perhaps making each fs opt-in instead of handling it in a generic path
>> > would alleviate this concern?
>> 
>> The issue isn't whether or not you have a network file system, it's
>> whether or not you want users to be able to manage data. mtime isn't
>> useful for the application (which knows whether or not it has changed
>> the file) or for the filesystem (ditto). It exists, rather, in order
>> to enable data management by users and other applications, letting
>> them know whether or not the data contents of the file have changed,
>> and when that change occurred.

Sage> Agreed.

>> If you are able to guarantee that your users don't care about that,
>> then fine, but that would be a very special case that doesn't fit the
>> way that most data centres are run. Backups are one case where mtime
>> matters, tiering and archiving is another.

Sage> This is true, although I argue it is becoming increasingly
Sage> common for the data management (including backups and so forth)
Sage> to be layered not on top of the POSIX file system but on
Sage> something higher up in the stack. This is true of pretty much
Sage> any distributed system (ceph, cassandra, mongo, etc., and I
Sage> assume commercial databases like Oracle, too) where backups,
Sage> replication, and any other DR strategies need to be orchestrated
Sage> across nodes to be consistent--simply copying files out from
Sage> underneath them is already insufficient and a recipe for
Sage> disaster.

You're smoking crack here.  Backups are not layered at higher layers
unless absolutely necessary, such as for databases.  Now Mongo, Hadoop
and others might also fit this model, but for day to day backup of
data, it's mtime all the way.  

I don't see why you insist that this is a good idea to implement for a
very special corner case.  

Sage> There is a growing category of applications that can benefit
Sage> from this capability...

There is a perceived growing category of super special niche
applications which might think they want this capability.  

Why are you even using a filesystem in the first place if you're so
worried about writing out inodes being a performance problem?  Just
use raw partitions and do all the work yourself.  Oracle and other DBs
can do this when they want.  

>> Neither of these examples
>> cases are under the control of the application that calls
>> open(O_NOMTIME).

Sage> Wouldn't a mount option (e.g., allow_nomtime) address this
Sage> concern?  Only nodes provisioned explicitly to run these systems
Sage> would be enable this option.

Why do you keep coming back to a mount option?  What's wrong with a
per-file ioctl option?  Making this a mount option means that you
default to a fail hard setup.  If someone screws up and mounts user
home directories with this option thinking that it's like the noatime
option, then suddenly all their backups will silently break unless
they're aware of disk space churn numbers and notice that they are
only backing up tiny bits.

With an ioctl, it's up to the damn application to *request* this
change, and then the VFS/filesystem can *maybe* support this, but the
application shouldn't actually know or care what the result is, it's
just a performance hint/request.

We should default to sane semantics and not give out such a big
foot-gun if at all possible.  

I'm a sysadm by day (and night, evening, early morning... :-) and I
know my users don't think about things like this. They don't even
think about backups until they want to restore something.  Users only
care about restores, not backups.

John


John Stoffel May 12, 2015, 1:54 p.m. UTC | #37

>>>>> "Austin" == Austin S Hemmelgarn <ahferroin7@gmail.com> writes:

Austin> On 2015-05-12 01:08, Kevin Easton wrote:
>> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
>>> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
>>>>> Let me re-ask the question that I asked last week (and was apparently
>>>>> ignored).  Why not trying to use the lazytime feature instead of
>>>>> pointing a head straight at the application's --- and system
>>>>> administrators' --- heads?
>>>> 
>>>> Sorry Ted, I thought I responded already.
>>>> 
>>>> The goal is to avoid inode writeout entirely when we can, and
>>>> as I understand it lazytime will still force writeout before the inode
>>>> is dropped from the cache.  In systems like Ceph in particular, the
>>>> IOs can be spread across lots of files, so simply deferring writeout
>>>> doesn't always help.
>>> 
>>> Sure, but it would reduce the writeout by orders of magnitude.  I can
>>> understand if you want to reduce it further, but it might be good
>>> enough for your purposes.
>>> 
>>> I considered doing the equivalent of O_NOMTIME for our purposes at
>>> $WORK, and our use case is actually not that different from Ceph's
>>> (i.e., using a local disk file system to support a cluster file
>>> system), and lazytime was (a) something I figured was something I
>>> could upstream in good conscience, and (b) was more than good enough
>>> for us.
>> 
>> A safer alternative might be a chattr file attribute that if set, the
>> mtime is not updated on writes, and stat() on the file always shows the
>> mtime as "right now".  At least that way, the file won't accidentally
>> get left out of backups that rely on the mtime.
>> 
>> (If the file attribute is unset, you immediately update the mtime then
>> too, and from then on the file is back to normal).
>> 

Austin> I like this even better than the flag suggestion, it provides
Austin> better control, means that you don't need to update
Austin> applications to get the benefits, and prevents backup software
Austin> from breaking (although backups would be bigger).

Me too, it fails in a safer mode, where you do more work on backups
than strictly needed.  I'm still against this as a mount option
though, way way way too many bullets in the foot gun.  And as someone
else said, once you mount with O_NOMTIME, then unmount, then mount
again without O_NOMTIME, you've lost information.  Not good.  


J. Bruce Fields May 12, 2015, 2:36 p.m. UTC | #38

On Tue, May 12, 2015 at 09:54:27AM -0400, John Stoffel wrote:
> >>>>> "Austin" == Austin S Hemmelgarn <ahferroin7@gmail.com> writes:
> 
> Austin> On 2015-05-12 01:08, Kevin Easton wrote:
> >> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
> >>> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
> >>>>> Let me re-ask the question that I asked last week (and was apparently
> >>>>> ignored).  Why not trying to use the lazytime feature instead of
> >>>>> pointing a head straight at the application's --- and system
> >>>>> administrators' --- heads?
> >>>> 
> >>>> Sorry Ted, I thought I responded already.
> >>>> 
> >>>> The goal is to avoid inode writeout entirely when we can, and
> >>>> as I understand it lazytime will still force writeout before the inode
> >>>> is dropped from the cache.  In systems like Ceph in particular, the
> >>>> IOs can be spread across lots of files, so simply deferring writeout
> >>>> doesn't always help.
> >>> 
> >>> Sure, but it would reduce the writeout by orders of magnitude.  I can
> >>> understand if you want to reduce it further, but it might be good
> >>> enough for your purposes.
> >>> 
> >>> I considered doing the equivalent of O_NOMTIME for our purposes at
> >>> $WORK, and our use case is actually not that different from Ceph's
> >>> (i.e., using a local disk file system to support a cluster file
> >>> system), and lazytime was (a) something I figured was something I
> >>> could upstream in good conscience, and (b) was more than good enough
> >>> for us.
> >> 
> >> A safer alternative might be a chattr file attribute that if set, the
> >> mtime is not updated on writes, and stat() on the file always shows the
> >> mtime as "right now".  At least that way, the file won't accidentally
> >> get left out of backups that rely on the mtime.
> >> 
> >> (If the file attribute is unset, you immediately update the mtime then
> >> too, and from then on the file is back to normal).
> >> 
> 
> Austin> I like this even better than the flag suggestion, it provides
> Austin> better control, means that you don't need to update
> Austin> applications to get the benefits, and prevents backup software
> Austin> from breaking (although backups would be bigger).
> 
> Me too, it fails in a safer mode, where you do more work on backups
> than strictly needed.  I'm still against this as a mount option
> though, way way way too many bullets in the foot gun.  And as someone
> else said, once you mount with O_NOMTIME, then unmount, then mount
> again without O_NOMTIME, you've lost information.  Not good.  

That was me.  Zach also pointed out to me that'd mean figuring out where
to store that information on-disk for every filesystem you care about.
I like the idea of something persistent, but maybe it's more trouble
than it's worth--I honestly don't know.

--b.

Austin S. Hemmelgarn May 12, 2015, 2:53 p.m. UTC | #39

On 2015-05-12 10:36, J. Bruce Fields wrote:
> On Tue, May 12, 2015 at 09:54:27AM -0400, John Stoffel wrote:
>>>>>>> "Austin" == Austin S Hemmelgarn <ahferroin7@gmail.com> writes:
>>
>> Austin> On 2015-05-12 01:08, Kevin Easton wrote:
>>>> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
>>>>> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
>>>>>>> Let me re-ask the question that I asked last week (and was apparently
>>>>>>> ignored).  Why not trying to use the lazytime feature instead of
>>>>>>> pointing a head straight at the application's --- and system
>>>>>>> administrators' --- heads?
>>>>>>
>>>>>> Sorry Ted, I thought I responded already.
>>>>>>
>>>>>> The goal is to avoid inode writeout entirely when we can, and
>>>>>> as I understand it lazytime will still force writeout before the inode
>>>>>> is dropped from the cache.  In systems like Ceph in particular, the
>>>>>> IOs can be spread across lots of files, so simply deferring writeout
>>>>>> doesn't always help.
>>>>>
>>>>> Sure, but it would reduce the writeout by orders of magnitude.  I can
>>>>> understand if you want to reduce it further, but it might be good
>>>>> enough for your purposes.
>>>>>
>>>>> I considered doing the equivalent of O_NOMTIME for our purposes at
>>>>> $WORK, and our use case is actually not that different from Ceph's
>>>>> (i.e., using a local disk file system to support a cluster file
>>>>> system), and lazytime was (a) something I figured was something I
>>>>> could upstream in good conscience, and (b) was more than good enough
>>>>> for us.
>>>>
>>>> A safer alternative might be a chattr file attribute that if set, the
>>>> mtime is not updated on writes, and stat() on the file always shows the
>>>> mtime as "right now".  At least that way, the file won't accidentally
>>>> get left out of backups that rely on the mtime.
>>>>
>>>> (If the file attribute is unset, you immediately update the mtime then
>>>> too, and from then on the file is back to normal).
>>>>
>>
>> Austin> I like this even better than the flag suggestion, it provides
>> Austin> better control, means that you don't need to update
>> Austin> applications to get the benefits, and prevents backup software
>> Austin> from breaking (although backups would be bigger).
>>
>> Me too, it fails in a safer mode, where you do more work on backups
>> than strictly needed.  I'm still against this as a mount option
>> though, way way way too many bullets in the foot gun.  And as someone
>> else said, once you mount with O_NOMTIME, then unmount, then mount
>> again without O_NOMTIME, you've lost information.  Not good.
>
> That was me.  Zach also pointed out to me that'd mean figuring out where
> to store that information on-disk for every filesystem you care about.
> I like the idea of something persistent, but maybe it's more trouble
> than it's worth--I honestly don't know.
>
But if we do it as a flag controlled by the API used by chattr, it 
becomes the responsibility of the filesystems to deal with where to 
store the information, assuming they choose to support it; personally, I 
would be really surprised if XFS and BTRFS didn't add support for this 
relatively soon after the API getting merged upstream, and ext4 would 
likely follow soon afterwards.

As far as support goes, I really think this will be easier to _safely_ 
implement (mount options are just too easy to arbitrarily change without 
knowing the consequences), although I think that reporting mtime as the 
current wall time for files under this effect is important regardless of 
what methodology gets implemented.

Sage Weil May 12, 2015, 9:35 p.m. UTC | #40

On Tue, 12 May 2015, Kevin Easton wrote:
> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
> > On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
> > > > Let me re-ask the question that I asked last week (and was apparently
> > > > ignored).  Why not trying to use the lazytime feature instead of
> > > > pointing a head straight at the application's --- and system
> > > > administrators' --- heads?
> > > 
> > > Sorry Ted, I thought I responded already.
> > > 
> > > The goal is to avoid inode writeout entirely when we can, and 
> > > as I understand it lazytime will still force writeout before the inode 
> > > is dropped from the cache.  In systems like Ceph in particular, the 
> > > IOs can be spread across lots of files, so simply deferring writeout 
> > > doesn't always help.
> > 
> > Sure, but it would reduce the writeout by orders of magnitude.  I can
> > understand if you want to reduce it further, but it might be good
> > enough for your purposes.
> > 
> > I considered doing the equivalent of O_NOMTIME for our purposes at
> > $WORK, and our use case is actually not that different from Ceph's
> > (i.e., using a local disk file system to support a cluster file
> > system), and lazytime was (a) something I figured was something I
> > could upstream in good conscience, and (b) was more than good enough
> > for us.
> 
> A safer alternative might be a chattr file attribute that if set, the
> mtime is not updated on writes, and stat() on the file always shows the
> mtime as "right now".  At least that way, the file won't accidentally
> get left out of backups that rely on the mtime.
> 
> (If the file attribute is unset, you immediately update the mtime then
> too, and from then on the file is back to normal).

Interesting!  I didn't realize there was already a chattr +A that disabled 
atime (although I suspect it doesn't do the "right now" for stat thing). 
This makes the nomtime-ness a bit more obscure (I don't think most users 
would think to check these file attributes), but it's a safer failure 
condition for backups at least.

The fact that chattr +A (and hopefully +M) will work for non-root is a 
bonus, as we're also trying to get ceph daemons to drop most privileges.
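
For reference, a minimal sketch of what `chattr +A` does under the hood, using the existing `FS_IOC_GETFLAGS`/`FS_IOC_SETFLAGS` ioctls and the existing `FS_NOATIME_FL` flag from `<linux/fs.h>`. The helper name `set_noatime` is made up for this example. There is no `+M` flag today; if one were added, presumably a new flag bit would be ORed in the same way, but that is an assumption.

```c
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Set the existing "no atime updates" inode flag (what `chattr +A` sets).
 * Returns 0 on success, -1 on failure with errno left as set by ioctl().
 */
int set_noatime(int fd)
{
    int flags;

    /* Read the current inode flags so the other bits are preserved. */
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
        return -1;

    flags |= FS_NOATIME_FL;

    /* Write the flags back; filesystems without support fail here. */
    return ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0 ? -1 : 0;
}
```

A caller that treats this purely as a performance hint can ignore a -1 return and carry on with normal timestamp behaviour.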

sage

Dave Chinner May 12, 2015, 9:51 p.m. UTC | #41

On Tue, May 12, 2015 at 10:53:29AM -0400, Austin S Hemmelgarn wrote:
> On 2015-05-12 10:36, J. Bruce Fields wrote:
> >On Tue, May 12, 2015 at 09:54:27AM -0400, John Stoffel wrote:
> >>>>>>>"Austin" == Austin S Hemmelgarn <ahferroin7@gmail.com> writes:
> >>
> >>Austin> On 2015-05-12 01:08, Kevin Easton wrote:
> >>>>On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
> >>>>>On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
> >>>>>>>Let me re-ask the question that I asked last week (and was apparently
> >>>>>>>ignored).  Why not trying to use the lazytime feature instead of
> >>>>>>>pointing a head straight at the application's --- and system
> >>>>>>>administrators' --- heads?
> >>>>>>
> >>>>>>Sorry Ted, I thought I responded already.
> >>>>>>
> >>>>>>The goal is to avoid inode writeout entirely when we can, and
> >>>>>>as I understand it lazytime will still force writeout before the inode
> >>>>>>is dropped from the cache.  In systems like Ceph in particular, the
> >>>>>>IOs can be spread across lots of files, so simply deferring writeout
> >>>>>>doesn't always help.
> >>>>>
> >>>>>Sure, but it would reduce the writeout by orders of magnitude.  I can
> >>>>>understand if you want to reduce it further, but it might be good
> >>>>>enough for your purposes.
> >>>>>
> >>>>>I considered doing the equivalent of O_NOMTIME for our purposes at
> >>>>>$WORK, and our use case is actually not that different from Ceph's
> >>>>>(i.e., using a local disk file system to support a cluster file
> >>>>>system), and lazytime was (a) something I figured was something I
> >>>>>could upstream in good conscience, and (b) was more than good enough
> >>>>>for us.
> >>>>
> >>>>A safer alternative might be a chattr file attribute that if set, the
> >>>>mtime is not updated on writes, and stat() on the file always shows the
> >>>>mtime as "right now".  At least that way, the file won't accidentally
> >>>>get left out of backups that rely on the mtime.
> >>>>
> >>>>(If the file attribute is unset, you immediately update the mtime then
> >>>>too, and from then on the file is back to normal).
> >>>>
> >>
> >>Austin> I like this even better than the flag suggestion, it provides
> >>Austin> better control, means that you don't need to update
> >>Austin> applications to get the benefits, and prevents backup software
> >>Austin> from breaking (although backups would be bigger).
> >>
> >>Me too, it fails in a safer mode, where you do more work on backups
> >>than strictly needed.  I'm still against this as a mount option
> >>though, way way way too many bullets in the foot gun.  And as someone
> >>else said, once you mount with O_NOMTIME, then unmount, then mount
> >>again without O_NOMTIME, you've lost information.  Not good.
> >
> >That was me.  Zach also pointed out to me that'd mean figuring out where
> >to store that information on-disk for every filesystem you care about.
> >I like the idea of something persistent, but maybe it's more trouble
> >than it's worth--I honestly don't know.
> >
> But if we do it as a flag controlled by the API used by chattr, it
> becomes the responsibility of the filesystems to deal with where to
> store the information, assuming they choose to support it;
> personally, I would be really surprised if XFS and BTRFS didn't add
> support for this relatively soon after the API getting merged
> upstream, and ext4 would likely follow soon afterwards.

It's an on-disk format change, which means that there are all sorts
of compatibility issues to take into account, as well as all the
work needed to teach the filesystem userspace tools about the new
flag. e.g. xfs_repair, xfs_db, xfsdump/restore, xfs_io, test code in
xfstests, etc.

Keep in mind that the moment we make something persistent, the
amount of work needed to implement and verify the new functionality
goes up by an order of magnitude *for each filesystem*.  IOWs,
support for new features that require persistence doesn't just
magically appear overnight...

Cheers,

Dave.

NeilBrown May 12, 2015, 10:39 p.m. UTC | #42

On Tue, 12 May 2015 10:36:37 -0400 bfields@fieldses.org (J. Bruce Fields)
wrote:

> On Tue, May 12, 2015 at 09:54:27AM -0400, John Stoffel wrote:
> > >>>>> "Austin" == Austin S Hemmelgarn <ahferroin7@gmail.com> writes:
> > 
> > Austin> On 2015-05-12 01:08, Kevin Easton wrote:
> > >> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
> > >>> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
> > >>>>> Let me re-ask the question that I asked last week (and was apparently
> > >>>>> ignored).  Why not trying to use the lazytime feature instead of
> > >>>>> pointing a head straight at the application's --- and system
> > >>>>> administrators' --- heads?
> > >>>> 
> > >>>> Sorry Ted, I thought I responded already.
> > >>>> 
> > >>>> The goal is to avoid inode writeout entirely when we can, and
> > >>>> as I understand it lazytime will still force writeout before the inode
> > >>>> is dropped from the cache.  In systems like Ceph in particular, the
> > >>>> IOs can be spread across lots of files, so simply deferring writeout
> > >>>> doesn't always help.
> > >>> 
> > >>> Sure, but it would reduce the writeout by orders of magnitude.  I can
> > >>> understand if you want to reduce it further, but it might be good
> > >>> enough for your purposes.
> > >>> 
> > >>> I considered doing the equivalent of O_NOMTIME for our purposes at
> > >>> $WORK, and our use case is actually not that different from Ceph's
> > >>> (i.e., using a local disk file system to support a cluster file
> > >>> system), and lazytime was (a) something I figured was something I
> > >>> could upstream in good conscience, and (b) was more than good enough
> > >>> for us.
> > >> 
> > >> A safer alternative might be a chattr file attribute that if set, the
> > >> mtime is not updated on writes, and stat() on the file always shows the
> > >> mtime as "right now".  At least that way, the file won't accidentally
> > >> get left out of backups that rely on the mtime.
> > >> 
> > >> (If the file attribute is unset, you immediately update the mtime then
> > >> too, and from then on the file is back to normal).
> > >> 
> > 
> > Austin> I like this even better than the flag suggestion, it provides
> > Austin> better control, means that you don't need to update
> > Austin> applications to get the benefits, and prevents backup software
> > Austin> from breaking (although backups would be bigger).
> > 
> > Me too, it fails in a safer mode, where you do more work on backups
> > than strictly needed.  I'm still against this as a mount option
> > though, way way way too many bullets in the foot gun.  And as someone
> > else said, once you mount with O_NOMTIME, then unmount, then mount
> > again without O_NOMTIME, you've lost information.  Not good.  
> 
> That was me.  Zach also pointed out to me that'd mean figuring out where
> to store that information on-disk for every filesystem you care about.
> I like the idea of something persistent, but maybe it's more trouble
> than it's worth--I honestly don't know.
> 

When this persistent flag is in effect, the values stored in mtime and atime,
and probably ctime, become irrelevant.  Surely we can choose some magic value
to store there that would never happen in practice.

e.g. ctime is signed and so goes back to 1902 (is that right?).  As ctime
cannot be set (via POSIX) to anything but "now", and as there were no Unix
systems in 1902, such values are impossible.

So a specific large negative value in ctime could safely be taken to mean
"don't update time stamps, and always report them as 'now'".
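
A minimal sketch of that sentinel idea, assuming a signed 32-bit seconds field as the stored ctime. The most negative such value corresponds to 1901-12-13 20:45:52 UTC, close to the 1902 figure above. `NOTIMES_MAGIC` and both helper names are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical sentinel: the most negative 32-bit timestamp (December 1901). */
#define NOTIMES_MAGIC INT32_MIN

/* Returns the time to report: "now" if the stored value is the sentinel. */
int64_t reported_ctime(int32_t stored_ctime, int64_t now)
{
    return stored_ctime == NOTIMES_MAGIC ? now : (int64_t)stored_ctime;
}

/* Returns nonzero if a write should update the stored timestamps. */
int should_update_times(int32_t stored_ctime)
{
    return stored_ctime != NOTIMES_MAGIC;
}
```

The appeal of this encoding is that it needs no new on-disk field; the open question raised below, whether ctime has to stay 'real', is not addressed by the sketch.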

Or do we need to keep ctime 'real'?


BTW When you "swap" to a file the mtime doesn't get updated.  No one seems to
complain about that.  I guess it is a rather narrow use-case though.


NeilBrown

Sage Weil May 12, 2015, 11:12 p.m. UTC | #43

On Tue, 12 May 2015, Dave Chinner wrote:
> > > Neither of these examples cases are under the control of the 
> > > application that calls open(O_NOMTIME).
> > 
> > Wouldn't a mount option (e.g., allow_nomtime) address this concern?  Only 
> > nodes provisioned explicitly to run these systems would be enable this 
> > option.
> 
> Back to my Joe Speedracer comments.....
> 
> I'm not sure what the right answer is - mount options are simply too
> easy to add without understanding the full implications of them.
> e.g. we didn't merge FALLOC_FL_NO_HIDE_STALE simply because it was
> too dangerous for unsuspecting users. This isn't at that same level
> or concern, but it's still a landmine we want to avoid users from
> arming without realising it...
> 
> > > >> I'm happy for it to be an ioctl interface - even an XFS specific
> > > >> interface if you want to go that route, Sage - and it probably
> > > >> should emit a warning to syslog first time it is used so there is
> > > >> trace for bug triage purposes. i.e. we know the app is not using
> > > >> mtime updates, so bug reports that are the result of mtime
> > > >> mishandling don't result in large amounts of wasted developer time
> > > >> trying to understand them...
> > > >
> > > > A warning on using the interface (or when mounting with user_nomtime)
> > > > sounds reasonable.
> > > >
> > > > I'd rather not make this XFS specific as other local filesystmes (ext4,
> > > > f2fs, possibly btrfs) would similarly benefit.  (And if we want to target
> > > > XFS specifically the existing XFS open-by-handle ioctl is sufficient as it
> > > > already does O_NOMTIME unconditionally.)
> > > 
> > > Lack of a namespace, doesn't imply that you don't want to manage the
> > > data. The whole point of using object storage instead of plain old
> > > block storage is to be able to provide whatever metadata you still
> > > need in order to manage the object.
> > 
> > Yeah, agreed--this is presumably why open_by_handle(2) (which is what we'd 
> > like to use) doesn't assume O_NOMTIME.
> 
> Right - the XFS ioctls were designed specifically for applications
> that interacted directly with the structure of XFS filesystems and
> so needed invisible IO (e.g. online defragmenter). IOWs, they are
> not interfaces intended for general usage. They are also only
> available to root, so a typical user application won't be making use
> of them, either.

I understand that's what they're intended for, but I'm having a hard time 
parsing out the difference between what they *do* and what O_NOMTIME + -o 
allow_nomtime does.  The open-by-handle ioctls have nothing to do with the 
online XFS format--they simply allow you to open a file via an opaque 
handle (albeit a differently formatted one than the generic 
open_by_handle_at(2)).  They also force you into an O_NOMTIME-equivalent 
mode.

AFAICS the only differences are that

1) the ioctl is XFS specific.  (As open_by_handle_at(2) demonstrates, this 
needn't be the case.)

2) the NOMTIME mode is only available via the open-by-handle interface, 
not open(2).

3) it is an ioctl interface, and thus more obscure.  (Well, there is a 
libhandle library, but it doesn't seem to be widely used.)

Would you object less if 

1) the O_NOMTIME flag were only available via open_by_handle_at(2)?

2) an equivalent ioctl were implemented for each file system of interest 
that (say) called into open_by_handle_at(2) code, adding in the O_NOMTIME 
flag?

3) O_NOMTIME required root (vs a mount option that requires root and 
unprivileged O_NOMTIME)?

Just trying to tease apart which part is problematic...

Thanks!
sage


Dave Chinner May 13, 2015, 12:57 a.m. UTC | #44

On Tue, May 12, 2015 at 04:12:46PM -0700, Sage Weil wrote:
> On Tue, 12 May 2015, Dave Chinner wrote:
> > > > > I'd rather not make this XFS specific as other local filesystmes (ext4,
> > > > > f2fs, possibly btrfs) would similarly benefit.  (And if we want to target
> > > > > XFS specifically the existing XFS open-by-handle ioctl is sufficient as it
> > > > > already does O_NOMTIME unconditionally.)
> > > > 
> > > > Lack of a namespace, doesn't imply that you don't want to manage the
> > > > data. The whole point of using object storage instead of plain old
> > > > block storage is to be able to provide whatever metadata you still
> > > > need in order to manage the object.
> > > 
> > > Yeah, agreed--this is presumably why open_by_handle(2) (which is what we'd 
> > > like to use) doesn't assume O_NOMTIME.
> > 
> > Right - the XFS ioctls were designed specifically for applications
> > that interacted directly with the structure of XFS filesystems and
> > so needed invisible IO (e.g. online defragmenter). IOWs, they are
> > not interfaces intended for general usage. They are also only
> > available to root, so a typical user application won't be making use
> > of them, either.
> 
> I understand that's what they're intended for, but I'm having a hard time 
> parsing out the difference between what they *do* and what O_NOMTIME + -o 
> allow_nomtime does.  The open-by-handle ioctls have nothing to do with the 
> online XFS format--they simply allow you to open a file via an opaque 
> handle (albeit a differently formatted one than the generic 
> open_by_handle_at(2)).  They also force you into an O_NOMTIME-equivalent 
> mode.

Actually, the handle is derived from the information on disk. We
don't do directory lookups to build handles in many cases, we do a
bulkstat to get *on-disk* inode information (inode number, generation,
timestamps, etc) and then use that to build a handle in userspace
*and* validate the file has not changed since the information was
retrieved and the handle was built.

> AFAICS the only difference that I see is that
> 
> 1) the ioctl is XFS specific.  (As open_by_handle_at(2) demonstrates, this 
> needn't be the case.)

Of course - it's been in use for 15 years longer than the generic
interface. :)

> 2) the NOMTIME mode is only available via the open-by-handle interface, 
> not open(2).

Right, because the XFS handle interfaces are intended for
invisible IO which is required by applications interacting directly
with the XFS on-disk data layout.

> 3) it is an ioctl interface, and thus more obscure.  (Well, there is a 
> libhandle library, but it doesn't seem to be widely used.)

The library only exists for xfsdump and the HSMs that interact
directly with the XFS on disk data. These are very constrained
applications.

> Would you object less if 
> 
> 1) the O_NOMTIME flag were only available via open_by_handle_at(2)?

Which limits it to files that have already been created and written to
disk, otherwise there is no handle....

> 2) an equivalent ioctl were implemented for each file system of interest 
> that (say) called into open_by_handle_at(2) code, adding in the O_NOMTIME 
> flag?

Seems like a silly hoop to jump through. I was thinking of a
root-only fcntl() style flag that could be set, but....

> 3) O_NOMTIME required root (vs a mount option that requires root and 
> unprivileged O_NOMTIME)?
>
> Just trying to tease apart which part is problematic...

... its very existence as either an open or fcntl flag is still
problematic. :/

The concept of it being an on-disk attribute flag is less prone to
silent abuse - it's easily discoverable and is persistent. And it's
manageable if we make it an "inherit from parent" style flag, because
then ceph can simply set it on the root dir, and every file it then
creates will not do mtime updates.

The other thing that is worth noting here is that we also have a
NODUMP flag on disk (chattr +d). Hence we could define that the
nomtime attribute also implies/sets the nodump attribute, and hence
makes it clear and upfront that turning on the nomtime inode
attribute will mean the files with this set will not get backed up
by mtime sensitive backup programs....
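
For illustration only, here is a minimal userspace sketch of how such an attribute might be driven through the same ioctl interface that chattr uses. FS_IOC_GETFLAGS, FS_IOC_SETFLAGS and FS_NODUMP_FL are existing definitions from <linux/fs.h>; the nomtime bit itself is hypothetical (no such inode flag has been assigned), so the helper takes it as a parameter rather than inventing a value. The function names are made up for this sketch.

```c
#include <linux/fs.h>	/* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_NODUMP_FL */
#include <sys/ioctl.h>

/*
 * Hypothetical policy from the paragraph above: if the (not yet existing)
 * nomtime bit is set, force nodump on as well.  The caller supplies the
 * bit because no value has been assigned to it.
 */
static int nomtime_implies_nodump(int flags, int nomtime_bit)
{
	if (flags & nomtime_bit)
		flags |= FS_NODUMP_FL;
	return flags;
}

/*
 * Read-modify-write the inode flags of an open file or directory, the
 * way chattr does.  Returns 0 on success, -1 on error (errno is set).
 */
static int add_inode_flags(int fd, int extra)
{
	int flags;

	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
		return -1;
	flags |= extra;
	return ioctl(fd, FS_IOC_SETFLAGS, &flags);
}
```

Calling add_inode_flags(fd, FS_NODUMP_FL) should have the same effect as chattr +d on that file. Whether newly created files inherit a flag from the parent directory would be up to each filesystem.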

Cheers,

Dave.

Jan Kara May 13, 2015, 12:32 p.m. UTC | #45

On Mon 11-05-15 09:24:09, Sage Weil wrote:
> On Mon, 11 May 2015, Theodore Ts'o wrote:
> > On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> > > That makes it completely non-generic though. By putting this in the
> > > VFS, you are giving applications a loaded gun that is pointed straight
> > > at the application user's head.
> > 
> > Let me re-ask the question that I asked last week (and was apparently
> > ignored).  Why not trying to use the lazytime feature instead of
> > pointing a gun straight at the application's --- and system
> > administrators' --- heads?
> 
> Sorry Ted, I thought I responded already.
> 
> The goal is to avoid inode writeout entirely when we can, and 
> as I understand it lazytime will still force writeout before the inode 
> is dropped from the cache.  In systems like Ceph in particular, the 
> IOs can be spread across lots of files, so simply deferring writeout 
> doesn't always help.
  Can we get some numbers on this? Before we go on and implement new mount
options, persistent inode flags, open flags, or whatever other crap
(none of which looks particularly appealing to me) I'd like to know how
big the performance difference is between lazytime + fdatasync and not
updating mtime at all for Ceph...
								Honza
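
For reference, a minimal sketch of the write pattern such a measurement would exercise: overwrite an already-allocated region, then fdatasync(). This is only an illustration. The helper name is made up, and it uses buffered pwrite() for brevity, whereas the Ceph workload described in the patch uses O_DIRECT (which needs aligned buffers and offsets). The comparison would run the same loop on a lazytime mount and on a kernel with mtime updates suppressed.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Overwrite len bytes at offset off, then flush the data to stable
 * storage.  Returns 0 on success, -1 on error.
 */
static int overwrite_and_sync(int fd, const void *buf, size_t len, off_t off)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);

		if (n <= 0)	/* treat a zero-length write as an error too */
			return -1;
		p += n;
		off += n;
		len -= (size_t)n;
	}
	/* fdatasync() may skip metadata not needed to read the data back */
	return fdatasync(fd);
}
```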

Austin S. Hemmelgarn May 13, 2015, 3:16 p.m. UTC | #46

On 2015-05-12 17:51, Dave Chinner wrote:
> On Tue, May 12, 2015 at 10:53:29AM -0400, Austin S Hemmelgarn wrote:
>> On 2015-05-12 10:36, J. Bruce Fields wrote:
>>> On Tue, May 12, 2015 at 09:54:27AM -0400, John Stoffel wrote:
>>>>>>>>> "Austin" == Austin S Hemmelgarn <ahferroin7@gmail.com> writes:
>>>>
>>>> Austin> On 2015-05-12 01:08, Kevin Easton wrote:
>>>>>> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
>>>>>>> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
>>>>>>>>> Let me re-ask the question that I asked last week (and was apparently
>>>>>>>>> ignored).  Why not trying to use the lazytime feature instead of
>>>>>>>>> pointing a gun straight at the application's --- and system
>>>>>>>>> administrators' --- heads?
>>>>>>>>
>>>>>>>> Sorry Ted, I thought I responded already.
>>>>>>>>
>>>>>>>> The goal is to avoid inode writeout entirely when we can, and
>>>>>>>> as I understand it lazytime will still force writeout before the inode
>>>>>>>> is dropped from the cache.  In systems like Ceph in particular, the
>>>>>>>> IOs can be spread across lots of files, so simply deferring writeout
>>>>>>>> doesn't always help.
>>>>>>>
>>>>>>> Sure, but it would reduce the writeout by orders of magnitude.  I can
>>>>>>> understand if you want to reduce it further, but it might be good
>>>>>>> enough for your purposes.
>>>>>>>
>>>>>>> I considered doing the equivalent of O_NOMTIME for our purposes at
>>>>>>> $WORK, and our use case is actually not that different from Ceph's
>>>>>>> (i.e., using a local disk file system to support a cluster file
>>>>>>> system), and lazytime was (a) something I figured was something I
>>>>>>> could upstream in good conscience, and (b) was more than good enough
>>>>>>> for us.
>>>>>>
>>>>>> A safer alternative might be a chattr file attribute that if set, the
>>>>>> mtime is not updated on writes, and stat() on the file always shows the
>>>>>> mtime as "right now".  At least that way, the file won't accidentally
>>>>>> get left out of backups that rely on the mtime.
>>>>>>
>>>>>> (If the file attribute is unset, you immediately update the mtime then
>>>>>> too, and from then on the file is back to normal).
>>>>>>
>>>>
>>>> Austin> I like this even better than the flag suggestion, it provides
>>>> Austin> better control, means that you don't need to update
>>>> Austin> applications to get the benefits, and prevents backup software
>>>> Austin> from breaking (although backups would be bigger).
>>>>
>>>> Me too, it fails in a safer mode, where you do more work on backups
>>>> than strictly needed.  I'm still against this as a mount option
>>>> though, way way way too many bullets in the foot gun.  And as someone
>>>> else said, once you mount with O_NOMTIME, then unmount, then mount
>>>> again without O_NOMTIME, you've lost information.  Not good.
>>>
>>> That was me.  Zach also pointed out to me that'd mean figuring out where
>>> to store that information on-disk for every filesystem you care about.
>>> I like the idea of something persistent, but maybe it's more trouble
>>> than it's worth--I honestly don't know.
>>>
>> But if we do it as a flag controlled by the API used by chattr, it
>> becomes the responsibility of the filesystems to deal with where to
>> store the information, assuming they choose to support it;
>> personally, I would be really surprised if XFS and BTRFS didn't add
>> support for this relatively soon after the API getting merged
>> upstream, and ext4 would likely follow soon afterwards.
>
> It's an on-disk format change, which means that there are all sorts
> of compatibility issues to take into account, as well as all the
> work needed to teach the filesystem userspace tools about the new
> flag. e.g. xfs_repair, xfs_db, xfsdump/restore, xfs_io, test code in
> xfstests, etc.
>
> Keep in mind that the moment we make something persistent, the
> amount of work needed to implement and verify the new functionality
> goes up by an order of magnitude *for each filesystem*.  IOWs,
> support for new features that require persistence doesn't just
> magically appear overnight...
>
I'm not saying that it will, and any sane way of safely implementing 
this will _almost_ certainly need some kind of work done on the 
filesystems themselves.  My only point was that it would be simpler on 
the VFS side of things than most of the other proposals so far.

Also, BTRFS at least won't (theoretically) need a format change for 
this, as it could just be added to the property interface.  As for the 
other filesystems, it would probably be possible to re-purpose one of 
the other bits for this, s (secure delete) and u (undeletion) are both 
not honored by any filesystem in the kernel, and also not honored by any 
other UNIX filesystem implementation that I know of; s would probably be 
the better of the 2 to use for this, as its currently assigned purpose 
is functionally impossible to implement properly on modern hardware.

Pavel Machek July 14, 2015, 11:44 a.m. UTC | #47

On Thu 2015-05-07 12:53:46, Andy Lutomirski wrote:
> On Thu, May 7, 2015 at 12:09 PM, Richard Weinberger
> <richard.weinberger@gmail.com> wrote:
> > On Thu, May 7, 2015 at 7:20 PM, Zach Brown <zab@redhat.com> wrote:
> >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> >>> On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> >>> > Add the O_NOMTIME flag which prevents mtime from being updated which can
> >>> > greatly reduce the IO overhead of writes to allocated and initialized
> >>> > regions of files.
> >>>
> >>> Hmmm. How do backup programs now work out if the file has changed
> >>> and hence needs copying again? ie. applications using this will
> >>> break other critical infrastructure in subtle ways.
> >>
> >> By using backup infrastructure that doesn't use cmtime.  Like btrfs
> >> send/recv.  Or application level backups that know how to do
> >> incrementals from metadata in giant database files, say, without
> >> walking, comparing, and copying the entire thing.
> >
> > But how can Joey random user know that some of his
> > applications are using O_NOMTIME and his KISS backup
> > program does no longer function as expected?
> >
> 
> Joey random user can't have a working KISS backup anyway, though,
> because we screw up mtime updates on mmap writes.  I have patches
> gathering dust that fix that, though.

I'm using unison, and yes, I believe I have already seen failures from
mmap(). I'd like to see that fixed, and I can test the patches.

Thanks,
									Pavel

Pavel Machek July 14, 2015, 11:50 a.m. UTC | #48

> Sage> I think this is the fundamental question: who do we give the
> Sage> ammunition to, the user or app writer, or the sysadmin?
> 
> Sage> One might argue that we gave the user a similar power with
> Sage> O_NOATIME (the power to break applications that assume atime is
> Sage> accurate).  Here we give developers/users the power to not
> Sage> update mtime and suffer the consequences (like, obviously,
> Sage> breaking mtime-based backups).  It should be pretty obvious to
> Sage> anyone using the flag what the consequences are.
> 
> Not modifying atime doesn't really break anything except people who
> think they can tell when a file was last accessed.  Which isn't
> critical (unless you're in a paranoid security conscious place...) but
> MTIME is another beast entirely.   Turning that off is going to break
> lots of hidden assumptions.  
> 
> Sage> Note that we can suffer similar lapses in mtime with fdatasync
> Sage> followed by a system crash.  And as Andy points out it's
> Sage> semi-broken for writable mmap.  The crash case is obviously a
> Sage> slightly different thing, but the idea that mtime can't always
> Sage> be trusted certainly isn't crazy talk.
> 
> True, but after a crash... people expect and understand there might be
> corruption in a filesystem.

Umm. No; people do not expect anything newer than ext3 to get
corrupted, ever.

In fact, I did not know about fdatasync/crash. That's a rather nasty
surprise.

									Pavel

Pavel Machek July 14, 2015, 1:13 p.m. UTC | #49

Hi!

> BTW When you "swap" to a file the mtime doesn't get updated.  No one seems to
> complain about that.  I guess it is a rather narrow use-case though.

Actually yes, I'd like to complain.

It was not swap, it was mount -o loop, but I guess that's the same
case. Then rsync refused to work on that file... and being on a slow ARM
system it took me a while to figure out WTF is going on.

So yes, we have problems with mtime, and yes, they matter.
									Pavel

NeilBrown July 15, 2015, 4:54 a.m. UTC | #50

On Tue, 14 Jul 2015 15:13:00 +0200 Pavel Machek <pavel@ucw.cz> wrote:

> Hi!
> 
> > BTW When you "swap" to a file the mtime doesn't get updated.  No one seems to
> > complain about that.  I guess it is a rather narrow use-case though.
> 
> Actually yes, I'd like to complain.
> 
> It was not swap, it was mount -o loop, but I guess that's the same
> case. Then rsync refused to work on that file... and being on a slow ARM
> system it took me a while to figure out WTF is going on.
> 
> So yes, we have problems with mtime, and yes, they matter.
> 									Pavel

Odd...
I assume you mean
  mount -o loop /some/file  /mountpoint

and then when you write to the filesystem on /mountpoint the mtime
of /some/file doesn't get updated?
I think it should.
 drivers/block/loop.c uses vfs_iter_write() to write to a file.
 That calls f_op->write_iter which will typically call
 generic_file_write_iter() which will call file_update_time() to update
 the time stamps.

What filesystem was /some/file on?
I just did some testing on ext4 and it seems to do the right thing:
mtime gets updated.
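
A quick check of this kind can be scripted. The sketch below is only an assumption about how one might test it, not the test that was actually run: it winds a file's mtime back to a known value with futimens(), writes one byte, and reports whether mtime moved. It exercises a direct write() rather than the loop driver path, and the function name is made up.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Returns 1 if a write() changed the file's mtime, 0 if it did not,
 * and -1 on error.  The file is created if it does not exist.
 */
static int write_updates_mtime(const char *path)
{
	/* atime and mtime both set to an arbitrary moment in the past */
	struct timespec old[2] = { { 1000000, 0 }, { 1000000, 0 } };
	struct stat st;
	int ret = -1;
	int fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return -1;
	if (futimens(fd, old) == 0 &&
	    write(fd, "x", 1) == 1 &&
	    fstat(fd, &st) == 0)
		ret = st.st_mtime != 1000000;
	close(fd);
	return ret;
}
```

To cover the loop case, one would run the same before/after comparison on the backing file while writing through the mounted filesystem.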

NeilBrown

Pavel Machek July 22, 2015, 1:47 p.m. UTC | #51

On Wed 2015-07-15 14:54:56, NeilBrown wrote:
> On Tue, 14 Jul 2015 15:13:00 +0200 Pavel Machek <pavel@ucw.cz> wrote:
> 
> > Hi!
> > 
> > > BTW When you "swap" to a file the mtime doesn't get updated.  No one seems to
> > > complain about that.  I guess it is a rather narrow use-case though.
> > 
> > Actually yes, I'd like to complain.
> > 
> > It was not swap, it was mount -o loop, but I guess that's the same
> > case. Then rsync refused to work on that file... and being on a slow ARM
> > system it took me a while to figure out WTF is going on.
> > 
> > So yes, we have problems with mtime, and yes, they matter.
> > 									Pavel
> 
> Odd...
> I assume you mean
>   mount -o loop /some/file  /mountpoint
> 
> and then when you write to the filesystem on /mountpoint the mtime
> of /some/file doesn't get updated?
> I think it should.
>  drivers/block/loop.c uses vfs_iter_write() to write to a file.
>  That calls f_op->write_iter which will typically call
>  generic_file_write_iter() which will call file_update_time() to update
>  the time stamps.

Yes, that. I'm pretty sure I've seen it, but it was probably on 2.6.X
kernel... Does it make sense to try to reproduce it on the old kernel?

> What filesystem was /some/file on?

Very probably VFAT.

> I just did some testing on ext4 and it seems to do the right thing:
> mtime gets updated.

Yes, I tried here, and it seems to be ok.

Thanks,
									Pavel

Patch

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..9e48092 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -27,7 +27,8 @@ 
 #include <asm/siginfo.h>
 #include <asm/uaccess.h>
 
-#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
+#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
+		    O_NOMTIME)
 
 static int setfl(int fd, struct file * filp, unsigned long arg)
 {
@@ -41,8 +42,9 @@  static int setfl(int fd, struct file * filp, unsigned long arg)
 	if (((arg ^ filp->f_flags) & O_APPEND) && IS_APPEND(inode))
 		return -EPERM;
 
-	/* O_NOATIME can only be set by the owner or superuser */
-	if ((arg & O_NOATIME) && !(filp->f_flags & O_NOATIME))
+	/* O_NOATIME and O_NOMTIME can only be set by the owner or superuser */
+	if (((arg & O_NOATIME) && !(filp->f_flags & O_NOATIME)) ||
+	    ((arg & O_NOMTIME) && !(filp->f_flags & O_NOMTIME)))
 		if (!inode_owner_or_capable(inode))
 			return -EPERM;
 
@@ -740,7 +742,7 @@  static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
 		O_RDONLY	| O_WRONLY	| O_RDWR	|
 		O_CREAT		| O_EXCL	| O_NOCTTY	|
 		O_TRUNC		| O_APPEND	| /* O_NONBLOCK	| */
@@ -748,7 +750,7 @@  static int __init fcntl_init(void)
 		O_DIRECT	| O_LARGEFILE	| O_DIRECTORY	|
 		O_NOFOLLOW	| O_NOATIME	| O_CLOEXEC	|
 		__FMODE_EXEC	| O_PATH	| __O_TMPFILE	|
-		__FMODE_NONOTIFY
+		__FMODE_NONOTIFY| O_NOMTIME
 		));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
diff --git a/fs/inode.c b/fs/inode.c
index ea37cd1..8976edc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1721,7 +1721,7 @@  int file_update_time(struct file *file)
 	int ret;
 
 	/* First try to exhaust all avenues to not sync */
-	if (IS_NOCMTIME(inode))
+	if (IS_NOCMTIME(inode) || (file->f_flags & O_NOMTIME))
 		return 0;
 
 	now = current_fs_time(inode->i_sb);
diff --git a/fs/namei.c b/fs/namei.c
index 4a8d998b..1a3ccb3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2609,8 +2609,8 @@  static int may_open(struct path *path, int acc_mode, int flag)
 			return -EPERM;
 	}
 
-	/* O_NOATIME can only be set by the owner or superuser */
-	if (flag & O_NOATIME && !inode_owner_or_capable(inode))
+	/* O_NOATIME and O_NOMTIME can only be set by the owner or superuser */
+	if (flag & (O_NOATIME|O_NOMTIME) && !inode_owner_or_capable(inode))
 		return -EPERM;
 
 	return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 35ec87e..34602f5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -110,12 +110,7 @@  typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* 64bit hashes as llseek() offset (for directories) */
 #define FMODE_64BITHASH         ((__force fmode_t)0x400)
 
-/*
- * Don't update ctime and mtime.
- *
- * Currently a special hack for the XFS open_by_handle ioctl, but we'll
- * hopefully graduate it to a proper O_CMTIME flag supported by open(2) soon.
- */
+/* Don't update ctime and mtime. */
 #define FMODE_NOCMTIME		((__force fmode_t)0x800)
 
 /* Expect random access pattern */
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index e063eff..8e484ae 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@ 
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_NOMTIME
+#define O_NOMTIME	040000000
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
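
As a rough illustration of how userspace might use the flag proposed above, the sketch below defines O_NOMTIME locally with the asm-generic value from this RFC, since released headers do not provide it. Both helper names are made up. With the patch applied, either call would fail with EPERM unless the caller owns the file or is suitably privileged; a kernel without the patch may simply ignore the bit, so success does not prove that mtime updates are suppressed.

```c
#include <fcntl.h>
#include <unistd.h>

#ifndef O_NOMTIME
#define O_NOMTIME 040000000	/* asm-generic value proposed by this RFC */
#endif

/* Open (creating if needed) a file for writing without mtime updates. */
static int open_nomtime(const char *path)
{
	return open(path, O_WRONLY | O_CREAT | O_NOMTIME, 0600);
}

/* Turn the flag on for an already open descriptor via F_SETFL. */
static int set_nomtime(int fd)
{
	int fl = fcntl(fd, F_GETFL);

	if (fl < 0)
		return -1;
	return fcntl(fd, F_SETFL, fl | O_NOMTIME);
}
```

The F_SETFL path relies on the SETFL_MASK change in fs/fcntl.c; without it, fcntl() would not accept the bit even on a patched open(2).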