[v7,5/4] copy_file_range.2: New page documenting copy_file_range()

Message ID 1445628736-13058-6-git-send-email-Anna.Schumaker@Netapp.com (mailing list archive)
State New, archived

Commit Message

Schumaker, Anna Oct. 23, 2015, 7:32 p.m. UTC
copy_file_range() is a new system call for copying ranges of data
completely in the kernel.  This gives filesystems an opportunity to
implement some kind of "copy acceleration", such as reflinks or
server-side-copy (in the case of NFS).

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v7:
- Remove COPY_FR_REFLINK
- Fix extra punctuation in the license
- Make a note about sparse file expansion
---
 man2/copy_file_range.2 | 199 +++++++++++++++++++++++++++++++++++++++++++++++++
 man2/splice.2          |   1 +
 2 files changed, 200 insertions(+)
 create mode 100644 man2/copy_file_range.2

Comments

Pádraig Brady Oct. 24, 2015, 12:02 p.m. UTC | #1
On 23/10/15 20:32, Anna Schumaker wrote:
> +    len = stat.st_size;
> +
> +    fd_out = open(argv[2], O_CREAT|O_WRONLY|O_TRUNC, 0644);
> +    if (fd_out == \-1) {
> +        perror("open (argv[2])");
> +        exit(EXIT_FAILURE);
> +    }
> +
> +    do {
> +        ret = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);
> +        if (ret == \-1) {
> +            perror("copy_file_range");
> +            exit(EXIT_FAILURE);
> +        }
> +
> +        len \-= ret;
> +    } while (len > 0);

Is this an infinite loop if len decreases before the copy completes?
Perhaps this should be: while (len && ret);
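A minimal sketch of the loop with that termination condition, assuming the glibc copy_file_range() wrapper (glibc 2.27 or later) instead of the syscall() shim in the patch, and assuming the call returns 0 rather than an error once the source has no more data. The helper name copy_range_loop is hypothetical:

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to len bytes from fd_in to fd_out, using and
 * advancing the current file offsets of both descriptors.
 * Returns the number of bytes actually copied, or -1 with errno set.
 */
static off_t copy_range_loop(int fd_in, int fd_out, off_t len)
{
    off_t copied = 0;

    while (len > 0) {
        ssize_t ret = copy_file_range(fd_in, NULL, fd_out, NULL,
                                      (size_t)len, 0);
        if (ret == -1)
            return -1;
        if (ret == 0)       /* no more source data: stop instead of spinning */
            break;

        copied += ret;
        len -= ret;
    }
    return copied;
}
```

A caller can compare the return value with the st_size it sampled earlier to detect that the source shrank during the copy.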

Otherwise this set looks good.

I'm a bit worried about the sparse expansion and default reflinking
which might preclude cp(1) from using this call in most cases, but I will
test and try to use it. coreutils has heuristics for determining if files
are remote, which we might use to restrict to that use case.

thanks,
Pádraig.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pádraig Brady Oct. 26, 2015, 12:19 p.m. UTC | #2
On 26/10/15 03:39, Christoph Hellwig wrote:
> On Sat, Oct 24, 2015 at 01:02:21PM +0100, Pádraig Brady wrote:
>> I'm a bit worried about the sparse expansion and default reflinking
>> which might preclude cp(1) from using this call in most cases, but I will
>> test and try to use it. coreutils has heuristics for determining if files
>> are remote, which we might use to restrict to that use case.
> 
> Can you explain why reflinking and hole expansion are an issue if done
> locally and not if done remotely?  I'd really like to make the call as
> usable as possible for everyone, but we really need clear semantics for
> that.

Fair point on local vs remote. I was just assuming that remote
copy offload would not do reflinking on the backend, or at
least wasn't an exposed option over the remote interface.

I get the impression that you think reflinking should be hidden
from the user, i.e. cp(1) should not have had the --reflink option
(for the last 6 years)?  I'm not convinced of that, and even so
I think lower level interfaces would benefit from finer grained options.
This would be especially useful since there is no general interface
to reflink at present. I was happy with the reflink control options,
thinking the extra control could allow cp to use this by default.

> Also note that Anna's current series allows for hole filling - any decent
> implementation should not do them, but that's really a quality of
> implementation and not an interface issue.

I think you're saying the default `cp --sparse=auto` operation
could rely on copy_file_range(...complete file...), while
cp --sparse={always,never} would have to iterate over the
file, punching or filling holes as appropriate. I thought
Anna indicated differently wrt splice filling holes by default.
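As a rough sketch of the "iterate over the file" case, the following copies only the data segments and leaves holes unwritten in the destination. It assumes the glibc copy_file_range() wrapper (glibc 2.27 or later) and a filesystem that supports lseek() with SEEK_DATA/SEEK_HOLE. The helper name copy_sparse is hypothetical, and this is not how cp(1) is implemented:

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the first size bytes of fd_in to the same
 * offsets in fd_out, skipping holes. Returns 0 on success, -1 on error.
 */
static int copy_sparse(int fd_in, int fd_out, off_t size)
{
    off_t pos = 0;

    while (pos < size) {
        /* Find the next data segment at or after pos. */
        off_t data = lseek(fd_in, pos, SEEK_DATA);
        if (data == -1) {
            if (errno == ENXIO)     /* only a hole remains up to EOF */
                break;
            return -1;
        }
        off_t hole = lseek(fd_in, data, SEEK_HOLE);
        if (hole == -1)
            return -1;
        if (hole > size)
            hole = size;

        /* Explicit offsets, so the lseek() calls above do not interfere. */
        off_t off_in = data, off_out = data;
        while (off_in < hole) {
            ssize_t ret = copy_file_range(fd_in, &off_in, fd_out, &off_out,
                                          (size_t)(hole - off_in), 0);
            if (ret == -1)
                return -1;
            if (ret == 0) {         /* source shrank: copy what exists */
                size = off_in;
                break;
            }
        }
        pos = hole;
    }

    /* Set the final length so that a trailing hole is preserved. */
    return ftruncate(fd_out, size);
}
```

Whether the destination ends up sparse still depends on the filesystem; the sketch only avoids asking the kernel to copy the hole ranges.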

TBH I'm not clear on the semantics of the current implementation,
so need to test the above in various cases.

thanks,
Pádraig.
J. Bruce Fields Oct. 26, 2015, 9:41 p.m. UTC | #3
On Mon, Oct 26, 2015 at 12:19:33PM +0000, Pádraig Brady wrote:
> On 26/10/15 03:39, Christoph Hellwig wrote:
> > On Sat, Oct 24, 2015 at 01:02:21PM +0100, Pádraig Brady wrote:
> >> I'm a bit worried about the sparse expansion and default reflinking
> >> which might preclude cp(1) from using this call in most cases, but I will
> >> test and try to use it. coreutils has heuristics for determining if files
> >> are remote, which we might use to restrict to that use case.
> > 
> > Can you explain why reflinking and hole expansion are an issue if done
> > locally and not if done remotely?  I'd really like to make the call as
> > usable as possible for everyone, but we really need clear semantics for
> > that.
> 
> Fair point on local vs remote. I was just assuming that remote
> copy offload would not do reflinking on the backend, or at
> least wasn't an exposed option over the remote interface.

The server could definitely do a reflink.  More generally, from the
description of the NFS COPY operation:

	https://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-39#page-64

	If the copy completes successfully, either synchronously or
	asynchronously, the data copied from the source file to the
	destination file MUST appear identical to the NFS client.
	However, the NFS server's on disk representation of the data in
	the source file and destination file MAY differ.  For example,
	the NFS server might encrypt, compress, deduplicate, or
	otherwise represent the on disk data in the source and
	destination file differently.

> I get the impression that you think reflinking should be hidden
> from the user, i.e. cp(1) should not have had the --reflink option
> (for the last 6 years)?  I'm not convinced of that, and even so
> I think lower level interfaces would benefit from finer grained options.
> This would be especially useful since there is no general interface
> to reflink at present. I was happy with the reflink control options,
> thinking the extra control could allow cp to use this by default.

Maybe that's a case for Christoph's "clone" operation.

I agree with him that it makes sense to allow the filesystem to
implement "copy" using reflink or similar tricks under the covers.  And
that in fact it's difficult to imagine how you'd prevent that in the
presence of layers of filesystem or block protocols underneath.

That "cp" flag seems strange to me, but if "cp" wants to take advantage
of a copy system call while continuing to make something like that
distinction then I suppose it could fallocate the destination range file
after the copy.
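A sketch of that allocation step, using posix_fallocate() on the range that was just copied. The helper name allocate_range is hypothetical. Whether allocating a range actually un-shares reflinked extents is filesystem-dependent and is only an assumption here, not something this thread establishes:

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: ask the filesystem to allocate storage for
 * [offset, offset + len) in fd. Returns 0 on success, -1 with errno set.
 */
static int allocate_range(int fd, off_t offset, off_t len)
{
    int err;

    if (len <= 0)           /* posix_fallocate() rejects a zero length */
        return 0;

    /* posix_fallocate() returns an error number and does not set errno. */
    err = posix_fallocate(fd, offset, len);
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}
```

A caller would invoke allocate_range(fd_out, start, copied) once the copy loop has finished, where start and copied describe the destination range that was written.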

--b.

> > Also note that Anna's current series allows for hole filling - any decent
> > implementation should not do them, but that's really a quality of
> > implementation and not an interface issue.
> 
> I think you're saying the default `cp --sparse=auto` operation
> could rely on copy_file_range(...complete file...), while
> cp --sparse={always,never} would have to iterate over the
> file, punching or filling holes as appropriate. I thought
> Anna indicated differently wrt splice filling holes by default.
> 
> TBH I'm not clear on the semantics of the current implementation,
> so need to test the above in various cases.
> 
> thanks,
> Pádraig.
Austin S. Hemmelgarn Oct. 27, 2015, 11:34 a.m. UTC | #4
On 2015-10-26 17:41, J. Bruce Fields wrote:
> On Mon, Oct 26, 2015 at 12:19:33PM +0000, Pádraig Brady wrote:
>> I get the impression that you think reflinking should be hidden
>> from the user, i.e. cp(1) should not have had the --reflink option
>> (for the last 6 years)?  I'm not convinced of that, and even so
>> I think lower level interfaces would benefit from finer grained options.
>> This would be especially useful since there is no general interface
>> to reflink at present. I was happy with the reflink control options,
>> thinking the extra control could allow cp to use this by default.
>
> Maybe that's a case for Christoph's "clone" operation.
>
> I agree with him that it makes sense to allow the filesystem to
> implement "copy" using reflink or similar tricks under the covers.  And
> that in fact it's difficult to imagine how you'd prevent that in the
> presence of layers of filesystem or block protocols underneath.
>
> That "cp" flag seems strange to me, but if "cp" wants to take advantage
> of a copy system call while continuing to make something like that
> distinction then I suppose it could fallocate the destination range file
> after the copy.
FWIW, I'm pretty sure that the '--reflink=never' option was added 
originally just for those poor misguided people who don't understand 
that deduplication is perfectly safe as long as you do it right. 
Personally, I really hope that Busybox and the other Coreutils 
replacements don't make that mistake, as the very fact that cp allows 
you to force it not to reflink things indirectly implies that it isn't 
safe in some circumstances, which is completely bogus WRT all the 
filesystems in Linux that support it if they are used properly.

If you want to make sure the space is allocated on disk, you should be 
using fallocate (or dd, or something equivalent), not cp.
Patch

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
new file mode 100644
index 0000000..d619e37
--- /dev/null
+++ b/man2/copy_file_range.2
@@ -0,0 +1,199 @@ 
+.\"This manpage is Copyright (C) 2015 Anna Schumaker <Anna.Schumaker@Netapp.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of
+.\" this manual under the conditions for verbatim copying, provided that
+.\" the entire resulting derived work is distributed under the terms of
+.\" a permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume
+.\" no responsibility for errors or omissions, or for damages resulting
+.\" from the use of the information contained herein.  The author(s) may
+.\" not have taken the same level of care in the production of this
+.\" manual, which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH COPY_FILE_RANGE 2 2015-10-16 "Linux" "Linux Programmer's Manual"
+.SH NAME
+copy_file_range \- Copy a range of data from one file to another
+.SH SYNOPSIS
+.nf
+.B #include <sys/syscall.h>
+.B #include <unistd.h>
+
+.BI "ssize_t copy_file_range(int " fd_in ", loff_t *" off_in ", int " fd_out ",
+.BI "                        loff_t *" off_out ", size_t " len \
+", unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR copy_file_range ()
+system call performs an in-kernel copy between two file descriptors
+without the additional cost of transferring data from the kernel to userspace
+and then back into the kernel.
+It copies up to
+.I len
+bytes of data from file descriptor
+.I fd_in
+to file descriptor
+.IR fd_out ,
+overwriting any data that exists within the requested range of the target file.
+
+The following semantics apply for
+.IR off_in ,
+and similar statements apply to
+.IR off_out :
+.IP * 3
+If
+.I off_in
+is NULL, then bytes are read from
+.I fd_in
+starting from the current file offset, and the offset is
+adjusted by the number of bytes copied.
+.IP *
+If
+.I off_in
+is not NULL, then
+.I off_in
+must point to a buffer that specifies the starting
+offset where bytes from
+.I fd_in
+will be read.  The current file offset of
+.I fd_in
+is not changed, but
+.I off_in
+is adjusted appropriately.
+.PP
+
+The
+.I flags
+argument must be set to 0.
+.SH RETURN VALUE
+Upon successful completion,
+.BR copy_file_range ()
+will return the number of bytes copied between files.
+This could be less than the length originally requested.
+
+On error,
+.BR copy_file_range ()
+returns \-1 and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBADF
+One or more file descriptors are not valid; or
+.I fd_in
+is not open for reading; or
+.I fd_out
+is not open for writing.
+.TP
+.B EINVAL
+Requested range extends beyond the end of the source file; or the
+.I flags
+argument is not 0.
+.TP
+.B EIO
+A low level I/O error occurred while copying.
+.TP
+.B ENOMEM
+Out of memory.
+.TP
+.B ENOSPC
+There is not enough space on the target filesystem to complete the copy.
+.TP
+.B EXDEV
+.IR fd_in " and " fd_out
+are not on the same mounted filesystem.
+.SH VERSIONS
+The
+.BR copy_file_range ()
+system call first appeared in Linux 4.4.
+.SH CONFORMING TO
+The
+.BR copy_file_range ()
+system call is a nonstandard Linux extension.
+.SH NOTES
+If
+.I fd_in
+is a sparse file, then
+.BR copy_file_range ()
+may expand any holes existing in the requested range.
+Users may benefit from calling
+.BR copy_file_range ()
+in a loop, and using
+.BR lseek (2)
+to find the locations of data segments.
+.SH EXAMPLE
+.nf
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+loff_t copy_file_range(int fd_in, loff_t *off_in, int fd_out,
+                       loff_t *off_out, size_t len, unsigned int flags)
+{
+    return syscall(__NR_copy_file_range, fd_in, off_in, fd_out,
+                   off_out, len, flags);
+}
+
+int main(int argc, char **argv)
+{
+    int fd_in, fd_out;
+    struct stat stat;
+    loff_t len, ret;
+    char buf[2];
+
+    if (argc != 3) {
+        fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
+        exit(EXIT_FAILURE);
+    }
+
+    fd_in = open(argv[1], O_RDONLY);
+    if (fd_in == \-1) {
+        perror("open (argv[1])");
+        exit(EXIT_FAILURE);
+    }
+
+    if (fstat(fd_in, &stat) == \-1) {
+        perror("fstat");
+        exit(EXIT_FAILURE);
+    }
+    len = stat.st_size;
+
+    fd_out = open(argv[2], O_CREAT|O_WRONLY|O_TRUNC, 0644);
+    if (fd_out == \-1) {
+        perror("open (argv[2])");
+        exit(EXIT_FAILURE);
+    }
+
+    do {
+        ret = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);
+        if (ret == \-1) {
+            perror("copy_file_range");
+            exit(EXIT_FAILURE);
+        }
+
+        len \-= ret;
+    } while (len > 0);
+
+    close(fd_in);
+    close(fd_out);
+    exit(EXIT_SUCCESS);
+}
+.fi
+.SH SEE ALSO
+.BR splice (2)
diff --git a/man2/splice.2 b/man2/splice.2
index b9b4f42..5c162e0 100644
--- a/man2/splice.2
+++ b/man2/splice.2
@@ -238,6 +238,7 @@  only pointers are copied, not the pages of the buffer.
 See
 .BR tee (2).
 .SH SEE ALSO
+.BR copy_file_range (2),
 .BR sendfile (2),
 .BR tee (2),
 .BR vmsplice (2)