From patchwork Thu Sep 13 09:16:19 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Luis Henriques X-Patchwork-Id: 10599037 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C638C15A7 for ; Thu, 13 Sep 2018 09:15:31 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B634A2A932 for ; Thu, 13 Sep 2018 09:15:31 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id AA88B2A937; Thu, 13 Sep 2018 09:15:31 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 50A952A932 for ; Thu, 13 Sep 2018 09:15:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727656AbeIMOYE (ORCPT ); Thu, 13 Sep 2018 10:24:04 -0400 Received: from mx2.suse.de ([195.135.220.15]:54320 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726862AbeIMOYE (ORCPT ); Thu, 13 Sep 2018 10:24:04 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 7662CAD25; Thu, 13 Sep 2018 09:15:28 +0000 (UTC) From: Luis Henriques To: "Yan, Zheng" , Sage Weil , Ilya Dryomov , Gregory Farnum Cc: ceph-devel@vger.kernel.org, Luis Henriques Subject: [RFC PATCH v4 1/4] ceph: add non-blocking parameter to ceph_try_get_caps() Date: Thu, 13 Sep 2018 10:16:19 +0100 Message-Id: <20180913091622.28489-2-lhenriques@suse.com> In-Reply-To: <20180913091622.28489-1-lhenriques@suse.com> References: <20180913091622.28489-1-lhenriques@suse.com> Sender: ceph-devel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP ceph_try_get_caps currently calls try_get_cap_refs with the nonblock parameter always set to 'true'. This change adds a new parameter that allows to set it's value. This will be useful for a follow-up patch that will need to get two sets of capabilities for two different inodes without risking a deadlock. Signed-off-by: Luis Henriques --- fs/ceph/addr.c | 2 +- fs/ceph/caps.c | 7 ++++--- fs/ceph/super.h | 2 +- 3 files changed, 6 insertions(+), 5 deletions(-) diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 9c332a6f6667..8eade7a993c1 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -322,7 +322,7 @@ static int start_read(struct inode *inode, struct ceph_rw_context *rw_ctx, /* caller of readpages does not hold buffer and read caps * (fadvise, madvise and readahead cases) */ int want = CEPH_CAP_FILE_CACHE; - ret = ceph_try_get_caps(ci, CEPH_CAP_FILE_RD, want, &got); + ret = ceph_try_get_caps(ci, CEPH_CAP_FILE_RD, want, true, &got); if (ret < 0) { dout("start_read %p, error getting cap\n", inode); } else if (!(got & want)) { diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index dd7dfdd2ba13..b1697d2837f0 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -2671,17 +2671,18 @@ static void check_max_size(struct inode *inode, loff_t endoff) ceph_check_caps(ci, CHECK_CAPS_AUTHONLY, NULL); } -int ceph_try_get_caps(struct ceph_inode_info *ci, int need, int want, int *got) +int ceph_try_get_caps(struct ceph_inode_info *ci, int need, int want, + bool nonblock, int *got) { int ret, err = 0; BUG_ON(need & ~CEPH_CAP_FILE_RD); - BUG_ON(want & ~(CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)); + BUG_ON(want & ~(CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO|CEPH_CAP_FILE_SHARED)); ret = ceph_pool_perm_check(ci, need); if (ret < 0) return ret; - ret = try_get_cap_refs(ci, need, want, 0, true, got, &err); + ret = try_get_cap_refs(ci, need, want, 0, nonblock, got, &err); if (ret) { if (err == -EAGAIN) { ret = 0; diff --git a/fs/ceph/super.h b/fs/ceph/super.h index 582e28fd1b7b..91b13400badd 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -1008,7 +1008,7 @@ extern int ceph_encode_dentry_release(void **p, struct dentry *dn, extern int ceph_get_caps(struct ceph_inode_info *ci, int need, int want, loff_t endoff, int *got, struct page **pinned_page); extern int ceph_try_get_caps(struct ceph_inode_info *ci, - int need, int want, int *got); + int need, int want, bool nonblock, int *got); /* for counting open files by mode */ extern void __ceph_get_fmode(struct ceph_inode_info *ci, int mode); From patchwork Thu Sep 13 09:16:20 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Luis Henriques X-Patchwork-Id: 10599041 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BEB5513BF for ; Thu, 13 Sep 2018 09:15:33 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id AD55E2A932 for ; Thu, 13 Sep 2018 09:15:33 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id A1B902A934; Thu, 13 Sep 2018 09:15:33 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0C8D92A937 for ; Thu, 13 Sep 2018 09:15:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727685AbeIMOYG (ORCPT ); Thu, 13 Sep 2018 10:24:06 -0400 Received: from mx2.suse.de ([195.135.220.15]:54334 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726832AbeIMOYG (ORCPT ); Thu, 13 Sep 2018 10:24:06 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay1.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 6CFE6AD73; Thu, 13 Sep 2018 09:15:29 +0000 (UTC) From: Luis Henriques To: "Yan, Zheng" , Sage Weil , Ilya Dryomov , Gregory Farnum Cc: ceph-devel@vger.kernel.org, Luis Henriques Subject: [RFC PATCH v4 2/4] ceph: support the RADOS copy-from operation Date: Thu, 13 Sep 2018 10:16:20 +0100 Message-Id: <20180913091622.28489-3-lhenriques@suse.com> In-Reply-To: <20180913091622.28489-1-lhenriques@suse.com> References: <20180913091622.28489-1-lhenriques@suse.com> Sender: ceph-devel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Add support for performing remote object copies using the 'copy-from' operation. Signed-off-by: Luis Henriques --- include/linux/ceph/osd_client.h | 17 ++++++++ include/linux/ceph/rados.h | 19 +++++++++ net/ceph/osd_client.c | 72 +++++++++++++++++++++++++++++++++ 3 files changed, 108 insertions(+) diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index 02096da01845..f25b4aaaab09 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -136,6 +136,13 @@ struct ceph_osd_req_op { u64 expected_object_size; u64 expected_write_size; } alloc_hint; + struct { + u64 snapid; + u64 src_version; + u8 flags; + u32 src_fadvise_flags; + struct ceph_osd_data osd_data; + } copy_from; }; }; @@ -511,6 +518,16 @@ extern int ceph_osdc_writepages(struct ceph_osd_client *osdc, struct timespec64 *mtime, struct page **pages, int nr_pages); +extern int ceph_osdc_copy_from(struct ceph_osd_client *osdc, + u64 src_snapid, u64 src_version, + struct ceph_object_id *src_oid, + struct ceph_object_locator *src_oloc, + u32 src_fadvise_flags, + u64 dst_snapid, + struct ceph_object_id *dst_oid, + struct ceph_object_locator *dst_oloc, + u8 dst_fadvise_flags); + /* watch/notify */ struct ceph_osd_linger_request * ceph_osdc_watch(struct ceph_osd_client *osdc, diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h index f1988387c5ad..d47540986eff 100644 --- a/include/linux/ceph/rados.h +++ b/include/linux/ceph/rados.h @@ -410,6 +410,14 @@ enum { enum { CEPH_OSD_OP_FLAG_EXCL = 1, /* EXCL object create */ CEPH_OSD_OP_FLAG_FAILOK = 2, /* continue despite failure */ + CEPH_OSD_OP_FLAG_FADVISE_RANDOM = 0x4, /* the op is random */ + CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL = 0x8, /* the op is sequential */ + CEPH_OSD_OP_FLAG_FADVISE_WILLNEED = 0x10,/* data will be accessed in + the near future */ + CEPH_OSD_OP_FLAG_FADVISE_DONTNEED = 0x20,/* data will not be accessed + in the near future */ + CEPH_OSD_OP_FLAG_FADVISE_NOCACHE = 0x40,/* data will be accessed only + once by this client */ }; #define EOLDSNAPC ERESTART /* ORDERSNAP flag set; writer has old snapc*/ @@ -497,6 +505,17 @@ struct ceph_osd_op { __le64 expected_object_size; __le64 expected_write_size; } __attribute__ ((packed)) alloc_hint; + struct { + __le64 snapid; + __le64 src_version; + __u8 flags; + /* + * __le32 flags: CEPH_OSD_OP_FLAG_FADVISE_: mean the + * fadvise flags for dest object src_fadvise_flags mean + * the fadvise flags for src object + */ + __le32 src_fadvise_flags; + } __attribute__ ((packed)) copy_from; }; __le32 payload_len; } __attribute__ ((packed)); diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 60934bd8796c..9fd19ff556af 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -955,6 +955,14 @@ static u32 osd_req_encode_op(struct ceph_osd_op *dst, case CEPH_OSD_OP_CREATE: case CEPH_OSD_OP_DELETE: break; + case CEPH_OSD_OP_COPY_FROM: + dst->copy_from.snapid = cpu_to_le64(src->copy_from.snapid); + dst->copy_from.src_version = + cpu_to_le64(src->copy_from.src_version); + dst->copy_from.flags = src->copy_from.flags; + dst->copy_from.src_fadvise_flags = + cpu_to_le32(src->copy_from.src_fadvise_flags); + break; default: pr_err("unsupported osd opcode %s\n", ceph_osd_op_name(src->op)); @@ -1908,6 +1916,10 @@ static void setup_request_data(struct ceph_osd_request *req, ceph_osdc_msg_data_add(req->r_reply, &op->notify.response_data); break; + case CEPH_OSD_OP_COPY_FROM: + ceph_osdc_msg_data_add(msg, + &op->copy_from.osd_data); + break; } data_len += op->indata_len; @@ -5168,6 +5180,66 @@ int ceph_osdc_writepages(struct ceph_osd_client *osdc, struct ceph_vino vino, } EXPORT_SYMBOL(ceph_osdc_writepages); +int ceph_osdc_copy_from(struct ceph_osd_client *osdc, + u64 src_snapid, u64 src_version, + struct ceph_object_id *src_oid, + struct ceph_object_locator *src_oloc, + u32 src_fadvise_flags, + u64 dst_snapid, + struct ceph_object_id *dst_oid, + struct ceph_object_locator *dst_oloc, + u8 dst_fadvise_flags) +{ + struct ceph_osd_request *req = NULL; + struct ceph_options *opts = osdc->client->options; + struct ceph_osd_req_op *op; + struct page **pages; + void *p, *end; + int ret; + + pages = ceph_alloc_page_vector(1, GFP_NOIO); + if (IS_ERR(pages)) + return PTR_ERR(pages); + req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_NOIO); + if (!req) { + ret = -ENOMEM; + goto out; + } + req->r_flags = CEPH_OSD_FLAG_WRITE; + op = _osd_req_op_init(req, 0, CEPH_OSD_OP_COPY_FROM, 0); + op->copy_from.snapid = src_snapid; + op->copy_from.src_version = src_version; + op->copy_from.flags = dst_fadvise_flags; + op->copy_from.src_fadvise_flags = src_fadvise_flags; + + ceph_oloc_copy(&req->r_t.base_oloc, dst_oloc); + ceph_oid_copy(&req->r_t.base_oid, dst_oid); + + p = page_address(pages[0]); + end = p + PAGE_SIZE; + ceph_encode_string(&p, end, src_oid->name, src_oid->name_len); + encode_oloc(&p, end, src_oloc); + op->indata_len = PAGE_SIZE - (end - p); + + ceph_osd_data_pages_init(&op->copy_from.osd_data, pages, + op->indata_len, 0, + false, true); + req->r_snapid = dst_snapid; + req->r_data_offset = 0; /* XXX dst_off? */ + + ret = ceph_osdc_alloc_messages(req, GFP_NOFS); + if (ret) + goto out; + ceph_osdc_start_request(osdc, req, false); + ret = wait_request_timeout(req, opts->mount_timeout); +out: + ceph_release_page_vector(pages, 1); + if (req) + ceph_osdc_put_request(req); + return ret; +} +EXPORT_SYMBOL(ceph_osdc_copy_from); + int __init ceph_osdc_setup(void) { size_t size = sizeof(struct ceph_osd_request) + From patchwork Thu Sep 13 09:16:21 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Luis Henriques X-Patchwork-Id: 10599043 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2E54113B6 for ; Thu, 13 Sep 2018 09:15:35 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1C91D2A932 for ; Thu, 13 Sep 2018 09:15:35 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 1124F2A936; Thu, 13 Sep 2018 09:15:35 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 56ADB2A932 for ; Thu, 13 Sep 2018 09:15:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727690AbeIMOYH (ORCPT ); Thu, 13 Sep 2018 10:24:07 -0400 Received: from mx2.suse.de ([195.135.220.15]:54342 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726862AbeIMOYH (ORCPT ); Thu, 13 Sep 2018 10:24:07 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id B452FAD17; Thu, 13 Sep 2018 09:15:30 +0000 (UTC) From: Luis Henriques To: "Yan, Zheng" , Sage Weil , Ilya Dryomov , Gregory Farnum Cc: ceph-devel@vger.kernel.org, Luis Henriques Subject: [RFC PATCH v4 3/4] ceph: support copy_file_range file operation Date: Thu, 13 Sep 2018 10:16:21 +0100 Message-Id: <20180913091622.28489-4-lhenriques@suse.com> In-Reply-To: <20180913091622.28489-1-lhenriques@suse.com> References: <20180913091622.28489-1-lhenriques@suse.com> Sender: ceph-devel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP This commit implements support for the copy_file_range syscall in cephfs. It is implemented using the RADOS 'copy-from' operation, which allows to do a remote object copy, without the need to download/upload data from/to the OSDs. Some manual copy may however be required if the source/destination file offsets aren't object aligned or if the copy lenght is smaller than the object size. Signed-off-by: Luis Henriques --- fs/ceph/file.c | 272 ++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 271 insertions(+), 1 deletion(-) diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 92ab20433682..cf2d6fb4735e 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 #include +#include #include #include @@ -1829,6 +1830,275 @@ static long ceph_fallocate(struct file *file, int mode, return ret; } +/* + * This function tries to get FILE_WR capabilities for dst_ci and FILE_RD for + * src_ci. Two attempts are made to obtain both caps, and an error is return if + * this fails; zero is returned on success. + */ +static int get_rd_wr_caps(struct ceph_inode_info *src_ci, + loff_t src_endoff, int *src_got, + struct ceph_inode_info *dst_ci, + loff_t dst_endoff, int *dst_got) +{ + int ret = 0; + bool retrying = false; + +retry_caps: + ret = ceph_get_caps(dst_ci, CEPH_CAP_FILE_WR, CEPH_CAP_FILE_BUFFER, + dst_endoff, dst_got, NULL); + if (ret < 0) + return ret; + + /* + * Since we're already holding the FILE_WR capability for the dst file, + * we would risk a deadlock by using ceph_get_caps. Thus, we'll do some + * retry dance instead to try to get both capabilities. + */ + ret = ceph_try_get_caps(src_ci, CEPH_CAP_FILE_RD, CEPH_CAP_FILE_SHARED, + false, src_got); + if (ret <= 0) { + /* Start by dropping dst_ci caps and getting src_ci caps */ + ceph_put_cap_refs(dst_ci, *dst_got); + if (retrying) { + if (!ret) + /* ceph_try_get_caps masks EAGAIN */ + ret = -EAGAIN; + return ret; + } + ret = ceph_get_caps(src_ci, CEPH_CAP_FILE_RD, + CEPH_CAP_FILE_SHARED, src_endoff, + src_got, NULL); + if (ret < 0) + return ret; + /*... drop src_ci caps too, and retry */ + ceph_put_cap_refs(src_ci, *src_got); + retrying = true; + goto retry_caps; + } + return ret; +} + +static void put_rd_wr_caps(struct ceph_inode_info *src_ci, int src_got, + struct ceph_inode_info *dst_ci, int dst_got) +{ + ceph_put_cap_refs(src_ci, src_got); + ceph_put_cap_refs(dst_ci, dst_got); +} + +static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off, + struct file *dst_file, loff_t dst_off, + size_t len, unsigned int flags) +{ + struct inode *src_inode = file_inode(src_file); + struct inode *dst_inode = file_inode(dst_file); + struct ceph_inode_info *src_ci = ceph_inode(src_inode); + struct ceph_inode_info *dst_ci = ceph_inode(dst_inode); + struct ceph_osd_client *osdc = + &ceph_inode_to_client(src_inode)->client->osdc; + struct ceph_cap_flush *prealloc_cf; + struct ceph_object_locator src_oloc, dst_oloc; + loff_t endoff = 0; + loff_t size; + ssize_t ret = -EIO; + int src_got = 0; + int dst_got = 0; + int err, dirty; + + if (src_inode == dst_inode) + return -EINVAL; + if (ceph_snap(dst_inode) != CEPH_NOSNAP) + return -EROFS; + + prealloc_cf = ceph_alloc_cap_flush(); + if (!prealloc_cf) + return -ENOMEM; + + /* Start by sync'ing the source file */ + ret = file_write_and_wait_range(src_file, src_off, (src_off + len)); + if (ret < 0) + goto out; + + /* + * If something fails we just return -EOPNOTSUPP and fallback to the VFS + * default copy_file_range implementation. + */ + ret = -EOPNOTSUPP; + + /* + * We need FILE_WR caps for dst_ci and FILE_RD for src_ci as other + * clients may have dirty data in their caches. And OSDs know nothing + * about caps, so they can't safely do the remote object copies. + */ + err = get_rd_wr_caps(src_ci, (src_off + len), &src_got, + dst_ci, (dst_off + len), &dst_got); + if (err < 0) { + dout("get_rd_wr_caps returned %d\n", err); + goto out; + } + + size = i_size_read(src_inode); + /* + * Don't copy beyond source file EOF. Instead of simply setting lenght + * to (size - src_off), just drop to VFS default implementation, as the + * local i_size may be stale due to other clients writing to the source + * inode. + */ + if (src_off + len > size) { + dout("Copy beyond EOF (%llu + %ld > %llu)\n", + src_off, len, size); + goto out_caps; + } + size = i_size_read(dst_inode); + + endoff = dst_off + len; + ret = inode_newsize_ok(dst_inode, endoff); + if (ret) + goto out_caps; + + if (ceph_quota_is_max_bytes_exceeded(dst_inode, endoff)) { + ret = -EDQUOT; + goto out_caps; + } + + /* Drop dst file cached pages */ + ret = invalidate_inode_pages2_range(dst_inode->i_mapping, + dst_off >> PAGE_SHIFT, + endoff >> PAGE_SHIFT); + if (ret < 0) { + dout("Failed to invalidate inode pages (%ld)\n", ret); + ret = 0; /* XXX */ + } + src_oloc.pool = src_ci->i_layout.pool_id; + src_oloc.pool_ns = ceph_try_get_string(src_ci->i_layout.pool_ns); + dst_oloc.pool = dst_ci->i_layout.pool_id; + dst_oloc.pool_ns = ceph_try_get_string(dst_ci->i_layout.pool_ns); + + while (len > 0) { + struct ceph_object_id src_oid, dst_oid; + u64 src_objnum, dst_objnum, objoff; + u32 objlen; + size_t copy_len = min_t(size_t, src_ci->i_layout.object_size, len); + + ceph_calc_file_object_mapping(&src_ci->i_layout, src_off, + copy_len, &src_objnum, &objoff, + &objlen); + + /* Do manual copy if: + * - source file offset isn't object aligned, or + * - copy length is smaller than object size + */ + if (objoff || (copy_len < src_ci->i_layout.object_size)) { + /* Do not copy beyond this object */ + if (copy_len > objlen) + copy_len = objlen; + /* we need to temporarily drop all caps as we'll be + * calling {read,write}_iter, which will get caps + * again. + */ + put_rd_wr_caps(src_ci, src_got, dst_ci, dst_got); + err = do_splice_direct(src_file, &src_off, dst_file, + &dst_off, copy_len, flags); + if (err < 0) { + dout("do_splice_direct returned %d\n", err); + goto out; + } + len -= err; + ret += err; + err = get_rd_wr_caps(src_ci, (src_off + len), + &src_got, dst_ci, + (dst_off + len), &dst_got); + if (err < 0) { + goto out; + } + continue; + } + + ceph_calc_file_object_mapping(&dst_ci->i_layout, dst_off, + copy_len, &dst_objnum, &objoff, + &objlen); + /* Again... do a manual copy if: + * - destination file offset isn't object aligned, or + * - copy length is smaller than object size + * (although the object size should be the same for different + * files in the same filesystem...) + */ + if (objoff || (copy_len < dst_ci->i_layout.object_size)) { + if (copy_len > objlen) + copy_len = objlen; + put_rd_wr_caps(src_ci, src_got, dst_ci, dst_got); + err = do_splice_direct(src_file, &src_off, dst_file, + &dst_off, copy_len, flags); + if (err < 0) { + dout("do_splice_direct returned %d\n", err); + goto out; + } + len -= err; + ret += err; + err = get_rd_wr_caps(src_ci, (src_off + len), + &src_got, dst_ci, + (dst_off + len), &dst_got); + if (err < 0) { + goto out; + } + continue; + } + ceph_oid_init(&src_oid); + ceph_oid_printf(&src_oid, "%llx.%08llx", + src_ci->i_vino.ino, src_objnum); + ceph_oid_init(&dst_oid); + ceph_oid_printf(&dst_oid, "%llx.%08llx", + dst_ci->i_vino.ino, dst_objnum); + /* Finally... do an object remote copy */ + err = ceph_osdc_copy_from( + osdc, src_ci->i_vino.snap, 0, /* XXX src_ci->i_version? */ + &src_oid, &src_oloc, + CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL | + CEPH_OSD_OP_FLAG_FADVISE_WILLNEED, + dst_ci->i_vino.snap, &dst_oid, &dst_oloc, + CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL | + CEPH_OSD_OP_FLAG_FADVISE_WILLNEED); + ceph_oid_destroy(&src_oid); + ceph_oid_destroy(&dst_oid); + if (err) { + dout("ceph_osdc_copy_from returned %d\n", err); + goto out_caps; + } + len -= copy_len; + src_off += copy_len; + dst_off += copy_len; + ret += copy_len; + } + file_update_time(dst_file); + if (endoff > size) { + int caps_flags = CHECK_CAPS_AUTHONLY; + + /* Let the MDS know about dst file size change */ + if (ceph_quota_is_max_bytes_approaching(dst_inode, endoff)) + caps_flags |= CHECK_CAPS_NODELAY; + if (ceph_inode_set_size(dst_inode, endoff)) + caps_flags |= CHECK_CAPS_AUTHONLY; + if (caps_flags) + ceph_check_caps(dst_ci, caps_flags, NULL); + } + /* Mark Fw dirty */ + spin_lock(&dst_ci->i_ceph_lock); + dst_ci->i_inline_version = CEPH_INLINE_NONE; + dirty = __ceph_mark_dirty_caps(dst_ci, CEPH_CAP_FILE_WR, &prealloc_cf); + spin_unlock(&dst_ci->i_ceph_lock); + if (dirty) + __mark_inode_dirty(dst_inode, dirty); + +out_caps: + put_rd_wr_caps(src_ci, src_got, dst_ci, dst_got); +out: + ceph_free_cap_flush(prealloc_cf); + + if (!ret) + ret = err; + + return ret; +} + const struct file_operations ceph_file_fops = { .open = ceph_open, .release = ceph_release, @@ -1844,5 +2114,5 @@ const struct file_operations ceph_file_fops = { .unlocked_ioctl = ceph_ioctl, .compat_ioctl = ceph_ioctl, .fallocate = ceph_fallocate, + .copy_file_range = ceph_copy_file_range, }; - From patchwork Thu Sep 13 09:16:22 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Luis Henriques X-Patchwork-Id: 10599045 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id AB07313BF for ; Thu, 13 Sep 2018 09:15:36 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9A8B72A932 for ; Thu, 13 Sep 2018 09:15:36 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 8F3A22A936; Thu, 13 Sep 2018 09:15:36 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 2BC6D2A932 for ; Thu, 13 Sep 2018 09:15:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727721AbeIMOYJ (ORCPT ); Thu, 13 Sep 2018 10:24:09 -0400 Received: from mx2.suse.de ([195.135.220.15]:54354 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726832AbeIMOYI (ORCPT ); Thu, 13 Sep 2018 10:24:08 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay1.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 0348AAD25; Thu, 13 Sep 2018 09:15:32 +0000 (UTC) From: Luis Henriques To: "Yan, Zheng" , Sage Weil , Ilya Dryomov , Gregory Farnum Cc: ceph-devel@vger.kernel.org, Luis Henriques Subject: [RFC PATCH v4 4/4] ceph: new mount option to disable usage of RADOS 'copy-from' op Date: Thu, 13 Sep 2018 10:16:22 +0100 Message-Id: <20180913091622.28489-5-lhenriques@suse.com> In-Reply-To: <20180913091622.28489-1-lhenriques@suse.com> References: <20180913091622.28489-1-lhenriques@suse.com> Sender: ceph-devel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Add a new mount option 'nocopyfrom' that will prevent the usage of the RADOS 'copy-from' operation in cephfs. This could be useful, for example, for an administrator to temporarily mitigate any possible bugs in the 'copy-from' implementation. Currently, only copy_file_range uses this RADOS operation. Setting this mount option will result in this syscall reverting to the default VFS implementation, i.e. to perform the copies locally instead of doing remote object copies. Signed-off-by: Luis Henriques --- Documentation/filesystems/ceph.txt | 5 +++++ fs/ceph/file.c | 3 +++ fs/ceph/super.c | 13 +++++++++++++ fs/ceph/super.h | 1 + 4 files changed, 22 insertions(+) diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt index 8bf62240e10d..1177052701e1 100644 --- a/Documentation/filesystems/ceph.txt +++ b/Documentation/filesystems/ceph.txt @@ -151,6 +151,11 @@ Mount Options Report overall filesystem usage in statfs instead of using the root directory quota. + nocopyfrom + Don't use the RADOS 'copy-from' operation to perform remote object + copies. Currently, it's only used in copy_file_range, which will revert + to the default VFS implementation if this option is used. + More Information ================ diff --git a/fs/ceph/file.c b/fs/ceph/file.c index cf2d6fb4735e..40a80ccfd7bc 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -1909,6 +1909,9 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off, if (ceph_snap(dst_inode) != CEPH_NOSNAP) return -EROFS; + if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM)) + return -EOPNOTSUPP; + prealloc_cf = ceph_alloc_cap_flush(); if (!prealloc_cf) return -ENOMEM; diff --git a/fs/ceph/super.c b/fs/ceph/super.c index 43ca3b763875..e329a0f659dc 100644 --- a/fs/ceph/super.c +++ b/fs/ceph/super.c @@ -165,6 +165,8 @@ enum { Opt_noacl, Opt_quotadf, Opt_noquotadf, + Opt_copyfrom, + Opt_nocopyfrom, }; static match_table_t fsopt_tokens = { @@ -203,6 +205,8 @@ static match_table_t fsopt_tokens = { {Opt_noacl, "noacl"}, {Opt_quotadf, "quotadf"}, {Opt_noquotadf, "noquotadf"}, + {Opt_copyfrom, "copyfrom"}, + {Opt_nocopyfrom, "nocopyfrom"}, {-1, NULL} }; @@ -355,6 +359,12 @@ static int parse_fsopt_token(char *c, void *private) case Opt_noquotadf: fsopt->flags |= CEPH_MOUNT_OPT_NOQUOTADF; break; + case Opt_copyfrom: + fsopt->flags &= ~CEPH_MOUNT_OPT_NOCOPYFROM; + break; + case Opt_nocopyfrom: + fsopt->flags |= CEPH_MOUNT_OPT_NOCOPYFROM; + break; #ifdef CONFIG_CEPH_FS_POSIX_ACL case Opt_acl: fsopt->sb_flags |= SB_POSIXACL; @@ -553,6 +563,9 @@ static int ceph_show_options(struct seq_file *m, struct dentry *root) seq_puts(m, ",noacl"); #endif + if (fsopt->flags & CEPH_MOUNT_OPT_NOCOPYFROM) + seq_puts(m, ",nocopyfrom"); + if (fsopt->mds_namespace) seq_show_option(m, "mds_namespace", fsopt->mds_namespace); if (fsopt->wsize != CEPH_MAX_WRITE_SIZE) diff --git a/fs/ceph/super.h b/fs/ceph/super.h index 91b13400badd..c005a5400f2e 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -40,6 +40,7 @@ #define CEPH_MOUNT_OPT_NOPOOLPERM (1<<11) /* no pool permission check */ #define CEPH_MOUNT_OPT_MOUNTWAIT (1<<12) /* mount waits if no mds is up */ #define CEPH_MOUNT_OPT_NOQUOTADF (1<<13) /* no root dir quota in statfs */ +#define CEPH_MOUNT_OPT_NOCOPYFROM (1<<14) /* don't use RADOS 'copy-from' op */ #define CEPH_MOUNT_OPT_DEFAULT CEPH_MOUNT_OPT_DCACHE