[ANNOUNCE] Native Linux KVM tool v2

Message ID BANLkTikoMZCFzA6jUw+ddHZ3x2cvO_NzhA@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Pekka Enberg June 16, 2011, 9:57 a.m. UTC
On Thu, Jun 16, 2011 at 12:48 PM, Christoph Hellwig <hch@infradead.org> wrote:
> You also missed:
>
> " This system call does not flush disk write caches and thus does not
>  provide any data integrity on systems with volatile disk write
>  caches."
>
> so it's not safe if you either have a cache, or are using btrfs, or
> are using a sparse image, or are using an image preallocated using
> fallocate/posix_fallocate.

Uh-oh. Someone needs to apply this patch to sync_file_range():

                goto out;


>> What's the right thing to do here? Is fdatasync() sufficient?
>
> Yes.

We'll fix that up. Thanks Christoph!

                         Pekka
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
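
As a rough illustration of the conclusion above (use fdatasync() rather than sync_file_range() when the data has to survive a crash), here is a minimal C sketch. The helper name write_durable is hypothetical and is not taken from the KVM tool sources; the sketch only assumes a Linux/POSIX system and a file descriptor opened for writing.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original patch): write len bytes from
 * buf to fd at offset off, then ask the kernel to make them durable.
 * Returns 0 on success, -1 on error with errno set.
 */
static int write_durable(int fd, const void *buf, size_t len, off_t off)
{
	const char *p = buf;

	/* pwrite() may write fewer bytes than requested, so loop. */
	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		off += n;
		len -= (size_t)n;
	}

	/*
	 * Unlike sync_file_range(), fdatasync() also writes out the metadata
	 * needed to read the data back (for example a changed file size).
	 */
	return fdatasync(fd);
}
```

A caller would treat a non-zero return as "the data may not be on disk" and report the error rather than acknowledge the write to the guest.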

Comments

Christoph Hellwig June 16, 2011, 10:02 a.m. UTC | #1
On Thu, Jun 16, 2011 at 12:57:36PM +0300, Pekka Enberg wrote:
> Uh-oh. Someone needs to apply this patch to sync_file_range():

There actually are a few cases where using it makes sense.  It's just
the minority.  

Ingo Molnar June 16, 2011, 11:22 a.m. UTC | #2
* Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Jun 16, 2011 at 12:57:36PM +0300, Pekka Enberg wrote:
> > Uh-oh. Someone needs to apply this patch to sync_file_range():
> 
> There actually are a few cases where using it makes sense. [...]

Such as? I don't think apps can actually know whether disk blocks 
have been 'instantiated' by a particular filesystem or not, so the 
manpage:

   Some details
       None of these operations write out the file’s metadata. Therefore, unless the
       application is strictly performing overwrites of already-instantiated disk
       blocks, there are no guarantees that the data will be available after a crash.

is rather misleading. This is a dangerous (and rather pointless) 
syscall and this should be made much clearer in the manpage.

Thanks,

	Ingo
Christoph Hellwig June 16, 2011, 11:25 a.m. UTC | #3
On Thu, Jun 16, 2011 at 01:22:30PM +0200, Ingo Molnar wrote:
> Such as? I don't think apps can actually know whether disk blocks 
> have been 'instantiated' by a particular filesystem or not, so the 
> manpage:

In general they can't.  The only good use case for sync_file_range
is to paper over^H^H^H^H^H^H^H^H^Hcontrol write back behaviour.

Ingo Molnar June 16, 2011, 11:40 a.m. UTC | #4
* Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Jun 16, 2011 at 01:22:30PM +0200, Ingo Molnar wrote:

> > Such as? I don't think apps can actually know whether disk blocks 
> > have been 'instantiated' by a particular filesystem or not, so 
> > the manpage:
> 
> In general they can't.  The only good use case for sync_file_range 
> is to paper over^H^H^H^H^H^H^H^H^Hcontrol write back behaviour.

Well, if overwrite is fundamentally safe on a filesystem (which is 
most of them) then sync_file_range() would work - and it has the big 
advantage that it's a pretty simple facility.

Filesystems that cannot guarantee that should map their 
sync_file_range() implementation to fdatasync() or so, right?

Thanks,

	Ingo
Christoph Hellwig June 16, 2011, 11:51 a.m. UTC | #5
On Thu, Jun 16, 2011 at 01:40:45PM +0200, Ingo Molnar wrote:
> Filesystems that cannot guarantee that should map their 
> sync_file_range() implementation to fdatasync() or so, right?

Filesystems aren't even told about sync_file_range, it's purely a VM
thing, which is the root of the problem.

In-kernel we have all the infrastructure for a real ranged
fsync/fdatasync, and once we get a killer user for that we can trivially
export it at the syscall level.  I don't think mapping sync_file_range
with its weird set of flags and confusing behaviour to it is a good
idea, though.

Jeff Garzik June 17, 2011, 7:21 a.m. UTC | #6
On 06/16/2011 07:22 AM, Ingo Molnar wrote:
>
> * Christoph Hellwig<hch@infradead.org>  wrote:
>
>> On Thu, Jun 16, 2011 at 12:57:36PM +0300, Pekka Enberg wrote:
>>> Uh-oh. Someone needs to apply this patch to sync_file_range():
>>
>> There actually are a few cases where using it makes sense. [...]
>
> Such as? I don't think apps can actually know whether disk blocks
> have been 'instantiated' by a particular filesystem or not, so the
> manpage:
>
>     Some details
>         None of these operations write out the file’s metadata. Therefore, unless the
>         application is strictly performing overwrites of already-instantiated disk
>         blocks, there are no guarantees that the data will be available after a crash.
>
> is rather misleading. This is a dangerous (and rather pointless)
> syscall and this should be made much clearer in the manpage.

Not pointless at all -- see Linus's sync_file_range() examples in "Re: 
Unexpected splice "always copy" behavior observed" thread from May 2010.

Apps like MythTV may use it for streaming data to disk, basically 
shoving the VM out of the way to give the app more fine-grained writeout 
control.

Just don't mistake sync_file_range() for a data integrity syscall.

	Jeff
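
The streaming use case described above can be sketched roughly as follows. This is an illustration of the general pattern only, not a reproduction of the examples from the cited thread; the helper name stream_chunk_written and the 8 MiB CHUNK_SIZE are assumptions made for the sketch. It starts writeback of each chunk as soon as it has been written, then waits for the previous chunk and drops it from the page cache. It controls writeout pacing only and gives no data integrity guarantee.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed chunk size for this sketch; tune for the workload. */
#define CHUNK_SIZE ((off_t)8 * 1024 * 1024)

/*
 * Hypothetical helper: call once chunk number `index` of the stream has
 * been written to fd. Returns 0 on success, -1 on error.
 */
static int stream_chunk_written(int fd, off_t index)
{
	off_t start = index * CHUNK_SIZE;

	/* Start asynchronous writeback of the chunk just written. */
	if (sync_file_range(fd, start, CHUNK_SIZE, SYNC_FILE_RANGE_WRITE) < 0)
		return -1;

	if (index > 0) {
		off_t prev = start - CHUNK_SIZE;

		/* Wait until the previous chunk has been written back. */
		if (sync_file_range(fd, prev, CHUNK_SIZE,
				    SYNC_FILE_RANGE_WAIT_BEFORE |
				    SYNC_FILE_RANGE_WRITE |
				    SYNC_FILE_RANGE_WAIT_AFTER) < 0)
			return -1;

		/* Its pages are clean now, so let the kernel drop them. */
		if (posix_fadvise(fd, prev, CHUNK_SIZE, POSIX_FADV_DONTNEED) != 0)
			return -1;
	}

	return 0;
}
```

If the stream also has to survive a crash, the application would still need fsync() or fdatasync() at the points where durability matters.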



Patch

diff --git a/fs/sync.c b/fs/sync.c
index ba76b96..32078aa 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -277,6 +277,8 @@ SYSCALL_DEFINE(sync_file_range)(int fd, loff_t offset, loff_t nbytes,
        int fput_needed;
        umode_t i_mode;

+       WARN_ONCE(1, "when this breaks, you get to keep both pieces");
+
        ret = -EINVAL;
        if (flags & ~VALID_FLAGS)