From patchwork Mon Mar 18 09:15:19 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857005
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 01/16] libceph: introduce ceph_extract_encoded_string_kvmalloc
Date: Mon, 18 Mar 2019 05:15:19 -0400
Message-Id: <1552900534-29026-2-git-send-email-dongsheng.yang@easystack.cn>
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>

The string we are going to extract can be very large. In the journaling
case especially, with the default journal object size of 16M, a single
object can hold a 16M encoded string, which is too large to allocate
reliably with kmalloc(). Introduce a variant that allocates the buffer
with ceph_kvmalloc() instead.
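As a usage sketch (the decode context and variable names here are
illustrative, not from this series), a caller that may face
multi-megabyte strings would pick the kvmalloc-backed helper and release
the result with kvfree(), which handles both kmalloc'ed and vmalloc'ed
memory:

	/* Hypothetical decode snippet: p/end bracket the encoded data. */
	size_t data_len;
	char *data;

	data = ceph_extract_encoded_string_kvmalloc(&p, end, &data_len,
						    GFP_NOIO);
	if (IS_ERR(data))
		return PTR_ERR(data);

	/* ... consume data[0..data_len) ... */
	kvfree(data);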
Signed-off-by: Dongsheng Yang
---
 include/linux/ceph/decode.h | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/ceph/decode.h b/include/linux/ceph/decode.h
index a6c2a48..05dae1b 100644
--- a/include/linux/ceph/decode.h
+++ b/include/linux/ceph/decode.h
@@ -104,8 +104,11 @@ static inline bool ceph_has_room(void **p, void *end, size_t n)
  * beyond the "end" pointer provided (-ERANGE)
  * - memory could not be allocated for the result (-ENOMEM)
  */
-static inline char *ceph_extract_encoded_string(void **p, void *end,
-						size_t *lenp, gfp_t gfp)
+typedef void * (mem_alloc_t)(size_t size, gfp_t flags);
+extern void *ceph_kvmalloc(size_t size, gfp_t flags);
+static inline char *extract_encoded_string(void **p, void *end,
+					   mem_alloc_t alloc_fn,
+					   size_t *lenp, gfp_t gfp)
 {
 	u32 len;
 	void *sp = *p;
@@ -115,7 +118,7 @@ static inline char *ceph_extract_encoded_string(void **p, void *end,
 	if (!ceph_has_room(&sp, end, len))
 		goto bad;
 
-	buf = kmalloc(len + 1, gfp);
+	buf = alloc_fn(len + 1, gfp);
 	if (!buf)
 		return ERR_PTR(-ENOMEM);
 
@@ -133,6 +136,18 @@ static inline char *ceph_extract_encoded_string(void **p, void *end,
 	return ERR_PTR(-ERANGE);
 }
 
+static inline char *ceph_extract_encoded_string(void **p, void *end,
+						size_t *lenp, gfp_t gfp)
+{
+	return extract_encoded_string(p, end, kmalloc, lenp, gfp);
+}
+
+static inline char *ceph_extract_encoded_string_kvmalloc(void **p, void *end,
+							 size_t *lenp, gfp_t gfp)
+{
+	return extract_encoded_string(p, end, ceph_kvmalloc, lenp, gfp);
+}
+
 /*
  * skip helpers
  */
From patchwork Mon Mar 18 09:15:20 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857083
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 02/16] libceph: introduce a new parameter of workqueue in ceph_osdc_watch()
Date: Mon, 18 Mar 2019 05:15:20 -0400
Message-Id: <1552900534-29026-3-git-send-email-dongsheng.yang@easystack.cn>
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>

Currently, when an rbd device and its journaling share an osdc, they also
share the osdc's notify_wq to run their watch callbacks. Closing the
journal while holding the rbd device mutex requires flushing notify_wq,
but we must not flush the rbd device's own watch callbacks at the same
time, because some of them may need to take the rbd mutex. To solve this,
allow users to supply and manage their own notify workqueue for a watch.
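As a usage sketch (callback and variable names are hypothetical), a
journaling user supplies its own workqueue so that tearing it down does
not flush other watchers' callbacks on the shared osdc:

	struct workqueue_struct *notify_wq;
	struct ceph_osd_linger_request *handle;

	notify_wq = create_singlethread_workqueue("journaler-notify");
	if (!notify_wq)
		return -ENOMEM;

	handle = ceph_osdc_watch(osdc, &header_oid, &header_oloc,
				 notify_wq,	/* new parameter */
				 my_watch_cb, my_watch_errcb, my_data);
	if (IS_ERR(handle)) {
		destroy_workqueue(notify_wq);
		return PTR_ERR(handle);
	}

	/* On teardown, flushing/destroying notify_wq only waits for our
	 * own callbacks, not for every watcher sharing the osdc. */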
Signed-off-by: Dongsheng Yang
---
 drivers/block/rbd.c             | 2 +-
 include/linux/ceph/osd_client.h | 2 ++
 net/ceph/osd_client.c           | 8 +++++++-
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 63f73e8..57816c2 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3375,7 +3375,7 @@ static int __rbd_register_watch(struct rbd_device *rbd_dev)
 	dout("%s rbd_dev %p\n", __func__, rbd_dev);
 
 	handle = ceph_osdc_watch(osdc, &rbd_dev->header_oid,
-				 &rbd_dev->header_oloc, rbd_watch_cb,
+				 &rbd_dev->header_oloc, NULL, rbd_watch_cb,
 				 rbd_watch_errcb, rbd_dev);
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 7a2af50..bde296f 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -282,6 +282,7 @@ struct ceph_osd_linger_request {
 	rados_watcherrcb_t errcb;
 	void *data;
 
+	struct workqueue_struct *wq;
 	struct page ***preply_pages;
 	size_t *preply_len;
 };
@@ -532,6 +533,7 @@ struct ceph_osd_linger_request *
 ceph_osdc_watch(struct ceph_osd_client *osdc,
 		struct ceph_object_id *oid,
 		struct ceph_object_locator *oloc,
+		struct workqueue_struct *wq,
 		rados_watchcb2_t wcb,
 		rados_watcherrcb_t errcb,
 		void *data);
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index d23a9f8..e8ce744 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2850,7 +2850,10 @@ static void lwork_queue(struct linger_work *lwork)
 	lwork->queued_stamp = jiffies;
 	list_add_tail(&lwork->pending_item, &lreq->pending_lworks);
-	queue_work(osdc->notify_wq, &lwork->work);
+	if (lreq->wq)
+		queue_work(lreq->wq, &lwork->work);
+	else
+		queue_work(osdc->notify_wq, &lwork->work);
 }
 
 static void do_watch_notify(struct work_struct *w)
@@ -4603,6 +4606,7 @@ struct ceph_osd_linger_request *
 ceph_osdc_watch(struct ceph_osd_client *osdc,
 		struct ceph_object_id *oid,
 		struct ceph_object_locator *oloc,
+		struct workqueue_struct *wq,
 		rados_watchcb2_t wcb,
 		rados_watcherrcb_t errcb,
 		void *data)
@@ -4618,6 +4622,8 @@ struct ceph_osd_linger_request *
 	lreq->wcb = wcb;
 	lreq->errcb = errcb;
 	lreq->data = data;
+	if (wq)
+		lreq->wq = wq;
 	lreq->watch_valid_thru = jiffies;
 
 	ceph_oid_copy(&lreq->t.base_oid, oid);

From patchwork Mon Mar 18 09:15:21 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857079
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 03/16] libceph: support op append
Date: Mon, 18 Mar 2019 05:15:21 -0400
Message-Id: <1552900534-29026-4-git-send-email-dongsheng.yang@easystack.cn>
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>

We need to send the append operation to support journaling in the
kernel client.
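As a sketch (request allocation and page setup elided; names are
hypothetical), an append op is initialized like a write, with the length
accounted into the payload; the offset argument is passed as 0 here on
the assumption that the OSD appends at the current object size:

	/* Initialize op 0 of osd_req as an append of "len" bytes. */
	osd_req_op_extent_init(osd_req, 0, CEPH_OSD_OP_APPEND,
			       0 /* offset */, len,
			       0 /* truncate_size */, 0 /* truncate_seq */);
	osd_req_op_extent_osd_data_pages(osd_req, 0, pages, len, 0,
					 false, false);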
Signed-off-by: Dongsheng Yang
---
 net/ceph/osd_client.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index e8ce744..c5807b3 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -386,6 +386,7 @@ static void osd_req_op_data_release(struct ceph_osd_request *osd_req,
 	case CEPH_OSD_OP_READ:
 	case CEPH_OSD_OP_WRITE:
 	case CEPH_OSD_OP_WRITEFULL:
+	case CEPH_OSD_OP_APPEND:
 		ceph_osd_data_release(&op->extent.osd_data);
 		break;
 	case CEPH_OSD_OP_CALL:
@@ -702,6 +703,7 @@ static void get_num_data_items(struct ceph_osd_request *req,
 		/* request */
 		case CEPH_OSD_OP_WRITE:
 		case CEPH_OSD_OP_WRITEFULL:
+		case CEPH_OSD_OP_APPEND:
 		case CEPH_OSD_OP_SETXATTR:
 		case CEPH_OSD_OP_CMPXATTR:
 		case CEPH_OSD_OP_NOTIFY_ACK:
@@ -787,13 +789,14 @@ void osd_req_op_extent_init(struct ceph_osd_request *osd_req,
 	BUG_ON(opcode != CEPH_OSD_OP_READ && opcode != CEPH_OSD_OP_WRITE &&
 	       opcode != CEPH_OSD_OP_WRITEFULL && opcode != CEPH_OSD_OP_ZERO &&
-	       opcode != CEPH_OSD_OP_TRUNCATE);
+	       opcode != CEPH_OSD_OP_TRUNCATE && opcode != CEPH_OSD_OP_APPEND);
 
 	op->extent.offset = offset;
 	op->extent.length = length;
 	op->extent.truncate_size = truncate_size;
 	op->extent.truncate_seq = truncate_seq;
-	if (opcode == CEPH_OSD_OP_WRITE || opcode == CEPH_OSD_OP_WRITEFULL)
+	if (opcode == CEPH_OSD_OP_WRITE || opcode == CEPH_OSD_OP_WRITEFULL ||
+	    opcode == CEPH_OSD_OP_APPEND)
 		payload_len += length;
 
 	op->indata_len = payload_len;
@@ -815,7 +818,8 @@ void osd_req_op_extent_update(struct ceph_osd_request *osd_req,
 	BUG_ON(length > previous);
 
 	op->extent.length = length;
-	if (op->op == CEPH_OSD_OP_WRITE || op->op == CEPH_OSD_OP_WRITEFULL)
+	if (op->op == CEPH_OSD_OP_WRITE || op->op == CEPH_OSD_OP_WRITEFULL ||
+	    op->op == CEPH_OSD_OP_APPEND)
 		op->indata_len -= previous - length;
 }
 EXPORT_SYMBOL(osd_req_op_extent_update);
@@ -837,7 +841,8 @@ void osd_req_op_extent_dup_last(struct ceph_osd_request *osd_req,
 	op->extent.offset += offset_inc;
 	op->extent.length -= offset_inc;
-	if (op->op == CEPH_OSD_OP_WRITE || op->op == CEPH_OSD_OP_WRITEFULL)
+	if (op->op == CEPH_OSD_OP_WRITE || op->op == CEPH_OSD_OP_WRITEFULL ||
+	    op->op == CEPH_OSD_OP_APPEND)
 		op->indata_len -= offset_inc;
 }
 EXPORT_SYMBOL(osd_req_op_extent_dup_last);
@@ -977,6 +982,7 @@ static u32 osd_req_encode_op(struct ceph_osd_op *dst,
 	case CEPH_OSD_OP_READ:
 	case CEPH_OSD_OP_WRITE:
 	case CEPH_OSD_OP_WRITEFULL:
+	case CEPH_OSD_OP_APPEND:
 	case CEPH_OSD_OP_ZERO:
 	case CEPH_OSD_OP_TRUNCATE:
 		dst->extent.offset = cpu_to_le64(src->extent.offset);
@@ -1070,7 +1076,8 @@ struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *osdc,
 	BUG_ON(opcode != CEPH_OSD_OP_READ && opcode != CEPH_OSD_OP_WRITE &&
 	       opcode != CEPH_OSD_OP_ZERO && opcode != CEPH_OSD_OP_TRUNCATE &&
-	       opcode != CEPH_OSD_OP_CREATE && opcode != CEPH_OSD_OP_DELETE);
+	       opcode != CEPH_OSD_OP_CREATE && opcode != CEPH_OSD_OP_DELETE &&
+	       opcode != CEPH_OSD_OP_APPEND);
 
 	req = ceph_osdc_alloc_request(osdc, snapc, num_ops, use_mempool,
 				      GFP_NOFS);
@@ -1944,6 +1951,7 @@ static void setup_request_data(struct ceph_osd_request *req)
 		/* request */
 		case CEPH_OSD_OP_WRITE:
 		case CEPH_OSD_OP_WRITEFULL:
+		case CEPH_OSD_OP_APPEND:
 			WARN_ON(op->indata_len != op->extent.length);
 			ceph_osdc_msg_data_add(request_msg,
 					       &op->extent.osd_data);
From patchwork Mon Mar 18 09:15:22 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857003
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 04/16] libceph: add prefix and suffix in ceph_osd_req_op.extent
Date: Mon, 18 Mar 2019 05:15:22 -0400
Message-Id: <1552900534-29026-5-git-send-email-dongsheng.yang@easystack.cn>
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>

To support rbd journaling, we need a prefix and a suffix in
ceph_osd_req_op.extent for the append op.

Signed-off-by: Dongsheng Yang
---
 include/linux/ceph/osd_client.h | 17 +++++++++++++++++
 net/ceph/osd_client.c           | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index bde296f..db1ec88 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -97,7 +97,13 @@ struct ceph_osd_req_op {
 			u64 offset, length;
 			u64 truncate_size;
 			u32 truncate_seq;
+			// In the common case, extent only needs
+			// one ceph_osd_data: extent.osd_data.
+			// But for the append op in journaling,
+			// we also need a prefix and a suffix.
+			struct ceph_osd_data prefix;
 			struct ceph_osd_data osd_data;
+			struct ceph_osd_data suffix;
 		} extent;
 		struct {
 			u32 name_len;
@@ -435,6 +441,17 @@ void osd_req_op_extent_osd_data_bvec_pos(struct ceph_osd_request *osd_req,
 					 unsigned int which,
 					 struct ceph_bvec_iter *bvec_pos);
+
+extern void osd_req_op_extent_prefix_pages(struct ceph_osd_request *,
+					unsigned int which,
+					struct page **pages, u64 length,
+					u32 alignment, bool pages_from_pool,
+					bool own_pages);
+extern void osd_req_op_extent_suffix_pages(struct ceph_osd_request *,
+					unsigned int which,
+					struct page **pages, u64 length,
+					u32 alignment, bool pages_from_pool,
+					bool own_pages);
 extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *,
 					unsigned int which,
 					struct ceph_pagelist *pagelist);
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index c5807b3..ba6092b 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -272,6 +272,32 @@ void osd_req_op_extent_osd_data_bvec_pos(struct ceph_osd_request *osd_req,
 }
 EXPORT_SYMBOL(osd_req_op_extent_osd_data_bvec_pos);
 
+void osd_req_op_extent_prefix_pages(struct ceph_osd_request *osd_req,
+			unsigned int which, struct page **pages,
+			u64 length, u32 alignment,
+			bool pages_from_pool, bool own_pages)
+{
+	struct ceph_osd_data *prefix;
+
+	prefix = osd_req_op_data(osd_req, which, extent, prefix);
+	ceph_osd_data_pages_init(prefix, pages, length, alignment,
+				 pages_from_pool, own_pages);
+}
+EXPORT_SYMBOL(osd_req_op_extent_prefix_pages);
+
+void osd_req_op_extent_suffix_pages(struct ceph_osd_request *osd_req,
+			unsigned int which, struct page **pages,
+			u64 length, u32 alignment,
+			bool pages_from_pool, bool own_pages)
+{
+	struct ceph_osd_data *suffix;
+
+	suffix = osd_req_op_data(osd_req, which, extent, suffix);
+	ceph_osd_data_pages_init(suffix, pages, length, alignment,
+				 pages_from_pool, own_pages);
+}
+EXPORT_SYMBOL(osd_req_op_extent_suffix_pages);
+
 static void osd_req_op_cls_request_info_pagelist(
 			struct ceph_osd_request *osd_req,
 			unsigned int which, struct ceph_pagelist *pagelist)
@@ -387,7 +413,9 @@ static void osd_req_op_data_release(struct ceph_osd_request *osd_req,
 	case CEPH_OSD_OP_WRITE:
 	case CEPH_OSD_OP_WRITEFULL:
 	case CEPH_OSD_OP_APPEND:
+		ceph_osd_data_release(&op->extent.prefix);
 		ceph_osd_data_release(&op->extent.osd_data);
+		ceph_osd_data_release(&op->extent.suffix);
 		break;
 	case CEPH_OSD_OP_CALL:
 		ceph_osd_data_release(&op->cls.request_info);
@@ -704,6 +732,8 @@ static void get_num_data_items(struct ceph_osd_request *req,
 	case CEPH_OSD_OP_WRITE:
 	case CEPH_OSD_OP_WRITEFULL:
 	case CEPH_OSD_OP_APPEND:
+		*num_request_data_items += 3;
+		break;
 	case CEPH_OSD_OP_SETXATTR:
 	case CEPH_OSD_OP_CMPXATTR:
 	case CEPH_OSD_OP_NOTIFY_ACK:
@@ -1953,8 +1983,13 @@ static void setup_request_data(struct ceph_osd_request *req)
 		case CEPH_OSD_OP_WRITEFULL:
 		case CEPH_OSD_OP_APPEND:
 			WARN_ON(op->indata_len != op->extent.length);
+			// op->extent.prefix and op->extent.suffix can be NONE
+			ceph_osdc_msg_data_add(request_msg,
+					       &op->extent.prefix);
 			ceph_osdc_msg_data_add(request_msg,
 					       &op->extent.osd_data);
+			ceph_osdc_msg_data_add(request_msg,
+					       &op->extent.suffix);
 			break;
 		case CEPH_OSD_OP_SETXATTR:
 		case CEPH_OSD_OP_CMPXATTR:
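A sketch of how a journaling writer might wrap an entry header and footer
around the payload with the helpers above (names are hypothetical; that
extent.length must cover prefix + payload + suffix is inferred from the
WARN_ON/indata_len accounting and should be treated as an assumption):

	/* Append [prefix][payload][suffix] as one extent op. */
	osd_req_op_extent_init(osd_req, 0, CEPH_OSD_OP_APPEND, 0,
			       prefix_len + payload_len + suffix_len, 0, 0);
	osd_req_op_extent_prefix_pages(osd_req, 0, prefix_pages,
				       prefix_len, 0, false, false);
	osd_req_op_extent_osd_data_pages(osd_req, 0, payload_pages,
					 payload_len, 0, false, false);
	osd_req_op_extent_suffix_pages(osd_req, 0, suffix_pages,
				       suffix_len, 0, false, false);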
From patchwork Mon Mar 18 09:15:23 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857015
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 05/16] libceph: introduce cls_journaler_client
Date: Mon, 18 Mar 2019 05:15:23 -0400
Message-Id: <1552900534-29026-6-git-send-email-dongsheng.yang@easystack.cn>
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>

This is the object class (cls) client module for the journaler: kernel-side
wrappers around the "journal" class methods on the OSD.
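A minimal usage sketch, assuming a journal header object already exists
(oid/oloc setup elided; variable names are illustrative):

	/* Read the journal's mutable metadata via the cls wrappers. */
	uint64_t minimum_set, active_set;
	int ret;

	ret = ceph_cls_journaler_get_mutable_metas(osdc, &header_oid,
						   &header_oloc,
						   &minimum_set, &active_set);
	if (ret)
		return ret;
	/* minimum_set/active_set bound the object sets currently in use. */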
Signed-off-by: Dongsheng Yang
---
 include/linux/ceph/cls_journaler_client.h |  91 +++++
 net/ceph/cls_journaler_client.c           | 556 ++++++++++++++++++++++++++++++
 2 files changed, 647 insertions(+)
 create mode 100644 include/linux/ceph/cls_journaler_client.h
 create mode 100644 net/ceph/cls_journaler_client.c

diff --git a/include/linux/ceph/cls_journaler_client.h b/include/linux/ceph/cls_journaler_client.h
new file mode 100644
index 0000000..27740fc
--- /dev/null
+++ b/include/linux/ceph/cls_journaler_client.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CEPH_CLS_JOURNAL_CLIENT_H
+#define _LINUX_CEPH_CLS_JOURNAL_CLIENT_H
+
+#include 
+
+struct ceph_journaler;
+struct ceph_journaler_client;
+
+struct ceph_journaler_object_pos {
+	struct list_head node;
+	u64 object_num;
+	u64 tag_tid;
+	u64 entry_tid;
+	bool in_using;
+};
+
+struct ceph_journaler_client {
+	struct list_head node;
+	size_t id_len;
+	char *id;
+	size_t data_len;
+	char *data;
+	struct list_head object_positions;
+	struct ceph_journaler_object_pos *object_positions_array;
+};
+
+struct ceph_journaler_tag {
+	uint64_t tid;
+	uint64_t tag_class;
+	size_t data_len;
+	char *data;
+};
+
+void destroy_client(struct ceph_journaler_client *client);
+
+int ceph_cls_journaler_get_immutable_metas(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint8_t *order,
+					uint8_t *splay_width,
+					int64_t *pool_id);
+
+int ceph_cls_journaler_get_mutable_metas(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t *minimum_set, uint64_t *active_set);
+
+int ceph_cls_journaler_client_list(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					struct list_head *clients,
+					uint8_t splay_width);
+
+int ceph_cls_journaler_get_next_tag_tid(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t *tag_tid);
+
+int ceph_cls_journaler_get_tag(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t tag_tid, struct ceph_journaler_tag *tag);
+
+int ceph_cls_journaler_tag_create(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t tag_tid, uint64_t tag_class,
+					void *buf, uint32_t buf_len);
+
+int ceph_cls_journaler_client_committed(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					struct ceph_journaler_client *client,
+					struct list_head *object_positions);
+
+int ceph_cls_journaler_set_active_set(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t active_set);
+
+int ceph_cls_journaler_set_minimum_set(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t minimum_set);
+
+int ceph_cls_journaler_guard_append(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t soft_limit);
+#endif
diff --git a/net/ceph/cls_journaler_client.c b/net/ceph/cls_journaler_client.c
new file mode 100644
index 0000000..789a6ac
--- /dev/null
+++ b/net/ceph/cls_journaler_client.c
@@ -0,0 +1,556 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+
+#include 
+
+//TODO: get all metas in one single request
+int ceph_cls_journaler_get_immutable_metas(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint8_t *order,
+					uint8_t *splay_width,
+					int64_t *pool_id)
+{
+	struct page *reply_page;
+	size_t reply_len = sizeof(*order);
+	int ret = 0;
+
+	reply_page = alloc_page(GFP_NOIO);
+	if (!reply_page)
+		return -ENOMEM;
+
+	ret = ceph_osdc_call(osdc, oid, oloc, "journal", "get_order",
+			     CEPH_OSD_FLAG_READ, NULL,
+			     0, reply_page, &reply_len);
+	if (!ret) {
+		memcpy(order, page_address(reply_page), reply_len);
+	} else {
+		goto out;
+	}
+
+	reply_len = sizeof(*splay_width);
+	ret = ceph_osdc_call(osdc, oid, oloc, "journal", "get_splay_width",
+			     CEPH_OSD_FLAG_READ, NULL,
+			     0, reply_page, &reply_len);
+	if (!ret) {
+		memcpy(splay_width, page_address(reply_page), reply_len);
+	} else {
+		goto out;
+	}
+
+	reply_len = sizeof(*pool_id);
+	ret = ceph_osdc_call(osdc, oid, oloc, "journal", "get_pool_id",
+			     CEPH_OSD_FLAG_READ, NULL,
+			     0, reply_page, &reply_len);
+	if (!ret) {
+		memcpy(pool_id, page_address(reply_page), reply_len);
+	} else {
+		goto out;
+	}
+out:
+	__free_page(reply_page);
+	return ret;
+}
+EXPORT_SYMBOL(ceph_cls_journaler_get_immutable_metas);
+
+//TODO: get all metas in one single request
+int ceph_cls_journaler_get_mutable_metas(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t *minimum_set, uint64_t *active_set)
+{
+	struct page *reply_page;
+	int ret = 0;
+	size_t reply_len = sizeof(*minimum_set);
+
+	reply_page = alloc_page(GFP_NOIO);
+	if (!reply_page)
+		return -ENOMEM;
+
+	ret = ceph_osdc_call(osdc, oid, oloc, "journal", "get_minimum_set",
+			     CEPH_OSD_FLAG_READ, NULL,
+			     0, reply_page, &reply_len);
+	if (!ret) {
+		memcpy(minimum_set, page_address(reply_page), reply_len);
+	} else {
+		goto out;
+	}
+
+	reply_len = sizeof(*active_set);
+	ret = ceph_osdc_call(osdc, oid, oloc, "journal", "get_active_set",
+			     CEPH_OSD_FLAG_READ, NULL,
+			     0, reply_page, &reply_len);
+	if (!ret) {
+		memcpy(active_set, page_address(reply_page), reply_len);
+	} else {
+		goto out;
+	}
+out:
+	__free_page(reply_page);
+	return ret;
+}
+EXPORT_SYMBOL(ceph_cls_journaler_get_mutable_metas);
+
+static int decode_object_position(void **p, void *end, struct ceph_journaler_object_pos *pos)
+{
+	u8 struct_v;
+	u32 struct_len;
+	int ret = 0;
+	u64 object_num = 0;
+	u64 tag_tid = 0;
+	u64 entry_tid = 0;
+
+	ret = ceph_start_decoding(p, end, 1, "cls_journal_object_position",
+				  &struct_v, &struct_len);
+	if (ret)
+		return ret;
+
+	object_num = ceph_decode_64(p);
+	tag_tid = ceph_decode_64(p);
+	entry_tid = ceph_decode_64(p);
+
+	pos->object_num = object_num;
+	pos->tag_tid = tag_tid;
+	pos->entry_tid = entry_tid;
+
+	return ret;
+}
+
+void destroy_client(struct ceph_journaler_client *client)
+{
+	kfree(client->object_positions_array);
+	kfree(client->id);
+	kfree(client->data);
+
+	kfree(client);
+}
+
+struct ceph_journaler_client *create_client(uint8_t splay_width)
+{
+	struct ceph_journaler_client *client;
+	struct ceph_journaler_object_pos *pos;
+	int i = 0;
+
+	client = kzalloc(sizeof(*client), GFP_NOIO);
+	if (!client)
+		return NULL;
+
+	client->object_positions_array = kcalloc(splay_width, sizeof(*pos), GFP_NOIO);
+	if (!client->object_positions_array)
+		goto free_client;
+
+	INIT_LIST_HEAD(&client->object_positions);
+	for (i = 0; i < splay_width; i++) {
+		pos = &client->object_positions_array[i];
+		INIT_LIST_HEAD(&pos->node);
+		list_add_tail(&pos->node, &client->object_positions);
+	}
+
+	INIT_LIST_HEAD(&client->node);
+	client->data = NULL;
+	client->id = NULL;
+
+	return client;
+free_client:
+	kfree(client);
+	return NULL;
+}
+
+static int decode_client(void **p, void *end,
+			 struct ceph_journaler_client *client)
+{
+	u8 struct_v;
+	u8 state_raw;
+	u32 struct_len;
+	int ret = 0, num = 0, i = 0;
+	struct ceph_journaler_object_pos *pos;
+
+	ret = ceph_start_decoding(p, end, 1, "cls_journal_get_client_reply",
+				  &struct_v, &struct_len);
+	if (ret)
+		return ret;
+
+	client->id = ceph_extract_encoded_string(p, end, &client->id_len, GFP_NOIO);
+	if (IS_ERR(client->id)) {
+		ret = PTR_ERR(client->id);
+		client->id = NULL;
+		goto err;
+	}
+
+	client->data = ceph_extract_encoded_string(p, end, &client->data_len, GFP_NOIO);
+	if (IS_ERR(client->data)) {
+		ret = PTR_ERR(client->data);
+		client->data = NULL;
+		goto free_id;
+	}
+
+	ret = ceph_start_decoding(p, end, 1, "cls_journal_client_object_set_position",
+				  &struct_v, &struct_len);
+	if (ret)
+		goto free_data;
+
+	num = ceph_decode_32(p);
+	i = 0;
+	list_for_each_entry(pos, &client->object_positions, node) {
+		if (i < num) {
+			// we will use this position stub
+			pos->in_using = true;
+			ret = decode_object_position(p, end, pos);
+			if (ret)
+				goto free_data;
+		} else {
+			// this stub is not used anymore
+			pos->in_using = false;
+		}
+		i++;
+	}
+
+	state_raw = ceph_decode_8(p);
+	return 0;
+
+free_data:
+	kfree(client->data);
+free_id:
+	kfree(client->id);
+err:
+	return ret;
+}
+
+static int decode_clients(void **p, void *end, struct list_head *clients, uint8_t splay_width)
+{
+	int i = 0;
+	int ret = 0;
+	uint32_t client_num;
+	struct ceph_journaler_client *client, *next;
+
+	client_num = ceph_decode_32(p);
+	// Reuse the clients that already exist in the list.
+	list_for_each_entry_safe(client, next, clients, node) {
+		if (i < client_num) {
+			kfree(client->id);
+			kfree(client->data);
+			ret = decode_client(p, end, client);
+			if (ret)
+				return ret;
+		} else {
+			// Some clients have unregistered.
+			list_del(&client->node);
+			destroy_client(client);
+		}
+		i++;
+	}
+
+	// Some more clients have registered.
+	for (; i < client_num; i++) {
+		client = create_client(splay_width);
+		if (!client)
+			return -ENOMEM;
+
+		ret = decode_client(p, end, client);
+		if (ret) {
+			destroy_client(client);
+			return ret;
+		}
+		list_add_tail(&client->node, clients);
+	}
+	return 0;
+}
+
+int ceph_cls_journaler_client_list(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					struct list_head *clients,
+					uint8_t splay_width)
+{
+	struct page *reply_page;
+	struct page *req_page;
+	int ret = 0;
+	size_t reply_len = PAGE_SIZE;
+	int buf_size;
+	void *p, *end;
+	char name[] = "";
+
+	buf_size = strlen(name) + sizeof(__le32) + sizeof(uint64_t);
+
+	if (buf_size > PAGE_SIZE)
+		return -E2BIG;
+
+	reply_page = alloc_page(GFP_NOIO);
+	if (!reply_page)
+		return -ENOMEM;
+
+	req_page = alloc_page(GFP_NOIO);
+	if (!req_page) {
+		ret = -ENOMEM;
+		goto free_reply_page;
+	}
+
+	p = page_address(req_page);
+	end = p + buf_size;
+
+	ceph_encode_string(&p, end, name, strlen(name));
+	ceph_encode_64(&p, (uint64_t)256);
+
+	ret = ceph_osdc_call(osdc, oid, oloc, "journal", "client_list",
+			     CEPH_OSD_FLAG_READ, req_page,
+			     buf_size, reply_page, &reply_len);
+	if (!ret) {
+		p = page_address(reply_page);
+		end = p + reply_len;
+
+		ret = decode_clients(&p, end, clients, splay_width);
+	}
+
+	__free_page(req_page);
+free_reply_page:
+	__free_page(reply_page);
+	return ret;
+}
+EXPORT_SYMBOL(ceph_cls_journaler_client_list);
+
+int ceph_cls_journaler_get_next_tag_tid(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t *tag_tid)
+{
+	struct page *reply_page;
+	int ret = 0;
+	size_t reply_len = PAGE_SIZE;
+
+	reply_page = alloc_page(GFP_NOIO);
+	if (!reply_page)
+		return -ENOMEM;
+
+	ret = ceph_osdc_call(osdc, oid, oloc, "journal", "get_next_tag_tid",
+			     CEPH_OSD_FLAG_READ, NULL,
+			     0, reply_page, &reply_len);
+	if (!ret) {
+		memcpy(tag_tid, page_address(reply_page), reply_len);
+	}
+
+	__free_page(reply_page);
+	return ret;
+}
+EXPORT_SYMBOL(ceph_cls_journaler_get_next_tag_tid);
+
+int ceph_cls_journaler_tag_create(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t tag_tid, uint64_t tag_class,
+					void *buf, uint32_t buf_len)
+{
+	struct page *req_page;
+	int ret = 0;
+	int buf_size;
+	void *p, *end;
+
+	buf_size = buf_len + sizeof(__le32) + sizeof(uint64_t) + sizeof(uint64_t);
+
+	if (buf_size > PAGE_SIZE)
+		return -E2BIG;
+
+	req_page = alloc_page(GFP_NOIO);
+	if (!req_page)
+		return -ENOMEM;
+
+	p = page_address(req_page);
+	end = p + buf_size;
+
+	ceph_encode_64(&p, tag_tid);
+	ceph_encode_64(&p, tag_class);
+	ceph_encode_string(&p, end, buf, buf_len);
+
+	ret = ceph_osdc_call(osdc, oid, oloc, "journal", "tag_create",
+			     CEPH_OSD_FLAG_WRITE, req_page,
+			     buf_size, NULL, NULL);
+
+	__free_page(req_page);
+	return ret;
+}
+EXPORT_SYMBOL(ceph_cls_journaler_tag_create);
+
+int decode_tag(void **p, void *end, struct ceph_journaler_tag *tag)
+{
+	int ret = 0;
+	u8 struct_v;
+	u32 struct_len;
+
+	ret = ceph_start_decoding(p, end, 1, "cls_journaler_tag",
+				  &struct_v, &struct_len);
+	if (ret)
+		return ret;
+
+	tag->tid = ceph_decode_64(p);
+	tag->tag_class = ceph_decode_64(p);
+	tag->data = ceph_extract_encoded_string(p, end, &tag->data_len, GFP_NOIO);
+	if (IS_ERR(tag->data)) {
+		ret = PTR_ERR(tag->data);
+		tag->data = NULL;
+		return ret;
+	}
+
+	return 0;
+}
+
+int ceph_cls_journaler_get_tag(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t
+					tag_tid, struct ceph_journaler_tag *tag)
+{
+	struct page *reply_page;
+	struct page *req_page;
+	int ret = 0;
+	size_t reply_len = PAGE_SIZE;
+	int buf_size;
+	void *p, *end;
+
+	buf_size = sizeof(tag_tid);
+
+	reply_page = alloc_page(GFP_NOIO);
+	if (!reply_page)
+		return -ENOMEM;
+
+	req_page = alloc_page(GFP_NOIO);
+	if (!req_page) {
+		ret = -ENOMEM;
+		goto free_reply_page;
+	}
+
+	p = page_address(req_page);
+	end = p + buf_size;
+	ceph_encode_64(&p, tag_tid);
+
+	ret = ceph_osdc_call(osdc, oid, oloc, "journal", "get_tag",
+			     CEPH_OSD_FLAG_READ, req_page,
+			     buf_size, reply_page, &reply_len);
+	if (!ret) {
+		p = page_address(reply_page);
+		end = p + reply_len;
+
+		ret = decode_tag(&p, end, tag);
+	}
+
+	__free_page(req_page);
+free_reply_page:
+	__free_page(reply_page);
+	return ret;
+}
+EXPORT_SYMBOL(ceph_cls_journaler_get_tag);
+
+static int version_len = 6;
+
+int ceph_cls_journaler_client_committed(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					struct ceph_journaler_client *client,
+					struct list_head *object_positions)
+{
+	struct page *req_page;
+	int ret = 0;
+	int buf_size;
+	void *p, *end;
+	struct ceph_journaler_object_pos *position = NULL;
+	int object_position_len = version_len + 8 + 8 + 8;
+	int pos_num = 0;
+
+	buf_size = 4 + client->id_len + version_len + 4;
+
+	list_for_each_entry(position, object_positions, node) {
+		buf_size += object_position_len;
+		pos_num++;
+	}
+
+	if (buf_size > PAGE_SIZE)
+		return -E2BIG;
+
+	req_page = alloc_page(GFP_NOIO);
+	if (!req_page)
+		return -ENOMEM;
+
+	p = page_address(req_page);
+	end = p + buf_size;
+	ceph_encode_string(&p, end, client->id, client->id_len);
+	ceph_start_encoding(&p, 1, 1, buf_size - client->id_len - version_len - 4);
+	ceph_encode_32(&p, pos_num);
+
+	list_for_each_entry(position, object_positions, node) {
+		ceph_start_encoding(&p, 1, 1, 24);
+		ceph_encode_64(&p, position->object_num);
+		ceph_encode_64(&p, position->tag_tid);
+		ceph_encode_64(&p, position->entry_tid);
+	}
+
+	ret = ceph_osdc_call(osdc, oid, oloc, "journal", "client_commit",
+			     CEPH_OSD_FLAG_WRITE, req_page,
+			     buf_size, NULL, NULL);
+
+	__free_page(req_page);
+	return ret;
+}
+EXPORT_SYMBOL(ceph_cls_journaler_client_committed);
+
+int ceph_cls_journaler_set_minimum_set(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t minimum_set)
+{
+	struct page *req_page;
+	int ret = 0;
+	void *p;
+
+	req_page = alloc_page(GFP_NOIO);
+	if (!req_page)
+		return -ENOMEM;
+
+	p = page_address(req_page);
+	ceph_encode_64(&p, minimum_set);
+
+	ret = ceph_osdc_call(osdc, oid, oloc, "journal", "set_minimum_set",
+			     CEPH_OSD_FLAG_WRITE, req_page,
+			     8, NULL, NULL);
+
+	__free_page(req_page);
+	return ret;
+}
+EXPORT_SYMBOL(ceph_cls_journaler_set_minimum_set);
+
+int ceph_cls_journaler_set_active_set(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t active_set)
+{
+	struct page *req_page;
+	int ret = 0;
+	void *p;
+
+	req_page = alloc_page(GFP_NOIO);
+	if (!req_page)
+		return -ENOMEM;
+
+	p = page_address(req_page);
+	ceph_encode_64(&p, active_set);
+
+	ret = ceph_osdc_call(osdc, oid, oloc, "journal", "set_active_set",
+			     CEPH_OSD_FLAG_WRITE, req_page,
+			     8, NULL, NULL);
+
+	__free_page(req_page);
+	return ret;
+}
+EXPORT_SYMBOL(ceph_cls_journaler_set_active_set);
+
+int ceph_cls_journaler_guard_append(struct ceph_osd_client *osdc,
+					struct ceph_object_id *oid,
+					struct ceph_object_locator *oloc,
+					uint64_t soft_limit)
+{
+	struct page *req_page;
+	int ret = 0;
+	void *p;
+
+	req_page = alloc_page(GFP_NOIO);
+	if (!req_page)
+		return -ENOMEM;
+
+	p = page_address(req_page);
+	ceph_encode_64(&p, soft_limit);
+
+	ret = ceph_osdc_call(osdc, oid, oloc, "journal", "guard_append",
+			     CEPH_OSD_FLAG_READ, req_page,
+			     8, NULL, NULL);
+
+	__free_page(req_page);
+	return ret;
+}
+EXPORT_SYMBOL(ceph_cls_journaler_guard_append);

From patchwork Mon Mar 18 09:15:24 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857077
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 06/16] libceph: introduce generic journaling
Date: Mon, 18 Mar 2019 05:15:24 -0400
Message-Id: <1552900534-29026-7-git-send-email-dongsheng.yang@easystack.cn>
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>

This commit introduces the generic journaling code. It is an initial
commit that only includes some generic functions, such as
journaler_create|destroy() and journaler_open|close().
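A hedged lifecycle sketch using the new API (error handling abbreviated;
osdc, oloc, journal_id, and client_id are assumed to be set up by the
caller):

	struct ceph_journaler *journaler;
	int ret;

	journaler = ceph_journaler_create(osdc, &oloc, journal_id, client_id);
	if (!journaler)
		return -ENOMEM;

	ret = ceph_journaler_open(journaler);	/* fetch metas, etc. */
	if (ret) {
		ceph_journaler_destroy(journaler);
		return ret;
	}

	/* ... replay and append entries ... */

	ceph_journaler_close(journaler);
	ceph_journaler_destroy(journaler);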
Signed-off-by: Dongsheng Yang
---
 include/linux/ceph/journaler.h | 180 ++++++++++++
 net/ceph/Makefile              |   3 +-
 net/ceph/journaler.c           | 610 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 792 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/ceph/journaler.h
 create mode 100644 net/ceph/journaler.c

diff --git a/include/linux/ceph/journaler.h b/include/linux/ceph/journaler.h
new file mode 100644
index 0000000..730d0ed
--- /dev/null
+++ b/include/linux/ceph/journaler.h
@@ -0,0 +1,180 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _FS_CEPH_JOURNAL_H
+#define _FS_CEPH_JOURNAL_H
+
+#include 
+#include 
+
+struct ceph_osd_client;
+
+#define JOURNAL_HEADER_PREFIX "journal."
+#define JOURNAL_OBJECT_PREFIX "journal_data."
+
+#define LOCAL_MIRROR_UUID ""
+
+/// preamble, version, entry tid, tag id
+static const uint32_t HEADER_FIXED_SIZE = 25;
+/// data size, crc
+static const uint32_t REMAINDER_FIXED_SIZE = 8;
+static const uint64_t PREAMBLE = 0x3141592653589793;
+
+struct ceph_journaler_ctx;
+typedef void (*ceph_journalecallback_t)(struct ceph_journaler_ctx *);
+
+struct ceph_journaler_ctx {
+	struct list_head node;
+	struct ceph_bio_iter *bio_iter;
+	size_t bio_len;
+
+	struct page *prefix_page;
+	unsigned int prefix_offset;
+	unsigned int prefix_len;
+
+	struct page *suffix_page;
+	unsigned int suffix_offset;
+	unsigned int suffix_len;
+
+	int result;
+	void *priv;
+	ceph_journalecallback_t callback;
+};
+
+struct ceph_journaler_future {
+	uint64_t tag_tid;
+	uint64_t entry_tid;
+	uint64_t commit_tid;
+
+	spinlock_t lock;
+	bool safe;
+	bool consistent;
+
+	struct ceph_journaler_ctx *ctx;
+	struct ceph_journaler_future *wait;
+};
+
+struct ceph_journaler_entry {
+	struct list_head node;
+	uint64_t tag_tid;
+	uint64_t entry_tid;
+	ssize_t data_len;
+	char *data;
+	struct ceph_bio_iter *bio_iter;
+};
+
+struct entry_tid {
+	struct list_head node;
+	uint64_t tag_tid;
+	uint64_t entry_tid;
+};
+
+struct commit_entry {
+	struct rb_node r_node;
+	uint64_t commit_tid;
+	uint64_t object_num;
+	uint64_t tag_tid;
+	uint64_t entry_tid;
+	bool committed;
+};
+
+struct object_recorder {
+	spinlock_t lock;
+	bool overflowed;
+	uint64_t splay_offset;
+	uint64_t inflight_append;
+
+	struct list_head append_list;
+	struct list_head overflow_list;
+};
+
+struct object_replayer {
+	spinlock_t lock;
+	uint64_t object_num;
+	struct ceph_journaler_object_pos *pos;
+	struct list_head entry_list;
+};
+
+struct ceph_journaler {
+	struct ceph_osd_client *osdc;
+	struct ceph_object_id header_oid;
+	struct ceph_object_locator header_oloc;
+	struct ceph_object_locator data_oloc;
+	char *object_oid_prefix;
+	char *client_id;
+
+	bool advancing;
+	bool overflowed;
+	bool commit_scheduled;
+	uint8_t order;
+	uint8_t splay_width;
+	int64_t pool_id;
+	uint64_t splay_offset;
+	uint64_t active_tag_tid;
+	uint64_t prune_tag_tid;
+	uint64_t commit_tid;
+	uint64_t minimum_set;
+	uint64_t active_set;
+
+	struct ceph_journaler_future *prev_future;
+	struct ceph_journaler_client *client;
+	struct object_recorder *obj_recorders;
+	struct object_replayer *obj_replayers;
+
+	struct ceph_journaler_object_pos *obj_pos_pending_array;
+	struct list_head obj_pos_pending;
+	struct ceph_journaler_object_pos *obj_pos_committing_array;
+	struct list_head obj_pos_committing;
+
+	struct mutex meta_lock;
+	struct mutex commit_lock;
+	spinlock_t entry_tid_lock;
+	spinlock_t finish_lock;
+	struct list_head ctx_list;
+	struct list_head clients;
+	struct list_head clients_cache;
+	struct list_head entry_tids;
+	struct rb_root commit_entries;
+
+	struct workqueue_struct *task_wq;
+	struct workqueue_struct *notify_wq;
+	struct work_struct flush_work;
+	struct delayed_work commit_work;
+	struct work_struct overflow_work;
+	struct work_struct finish_work;
+	struct work_struct notify_update_work;
+
+	void *fetch_buf;
+	int (*handle_entry)(void *entry_handler,
+			    struct ceph_journaler_entry *entry,
+			    uint64_t commit_tid);
+	void *entry_handler;
+	struct ceph_osd_linger_request *watch_handle;
+};
+
+// generic functions
+struct ceph_journaler *ceph_journaler_create(struct ceph_osd_client *osdc,
+					     struct ceph_object_locator *oloc,
+					     const char *journal_id,
+					     const char *client_id);
+void ceph_journaler_destroy(struct ceph_journaler *journal);
+
+int ceph_journaler_open(struct ceph_journaler *journal);
+void ceph_journaler_close(struct ceph_journaler *journal);
+
+int ceph_journaler_get_cached_client(struct ceph_journaler *journaler, char *client_id,
+				     struct ceph_journaler_client **client_result);
+
+// replaying
+int ceph_journaler_start_replay(struct ceph_journaler *journaler);
+
+// recording
+struct ceph_journaler_ctx *ceph_journaler_ctx_alloc(void);
+void ceph_journaler_ctx_put(struct ceph_journaler_ctx *journaler_ctx);
+int ceph_journaler_append(struct ceph_journaler *journaler,
+			  uint64_t tag_tid, uint64_t *commit_tid,
+			  struct ceph_journaler_ctx *ctx);
+void ceph_journaler_client_committed(struct ceph_journaler *journaler,
+				     uint64_t commit_tid);
+int ceph_journaler_allocate_tag(struct ceph_journaler *journaler,
+				uint64_t tag_class, void *buf,
+				uint32_t buf_len,
+				struct ceph_journaler_tag *tag);
+#endif
diff --git a/net/ceph/Makefile b/net/ceph/Makefile
index db09defe..a965d64 100644
--- a/net/ceph/Makefile
+++ b/net/ceph/Makefile
@@ -14,4 +14,5 @@ libceph-y := ceph_common.o messenger.o msgpool.o buffer.o pagelist.o \
 	crypto.o armor.o \
 	auth_x.o \
 	ceph_fs.o ceph_strings.o ceph_hash.o \
-	pagevec.o snapshot.o string_table.o
+	pagevec.o snapshot.o string_table.o \
+	journaler.o cls_journaler_client.o
diff --git a/net/ceph/journaler.c b/net/ceph/journaler.c
new file mode 100644
index 0000000..d8721a9
--- /dev/null
+++ b/net/ceph/journaler.c
@@ -0,0 +1,610 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#define JOURNALER_COMMIT_INTERVAL	msecs_to_jiffies(5000)
+
+static char *object_oid_prefix(int pool_id, const char *journal_id)
+{
+	char *prefix = NULL;
+	ssize_t len = snprintf(NULL, 0, "%s%d.%s.", JOURNAL_OBJECT_PREFIX,
+			       pool_id, journal_id);
+
+	prefix = kzalloc(len + 1, GFP_KERNEL);
+	if (!prefix)
+		return NULL;
+
+	WARN_ON(snprintf(prefix, len + 1, "%s%d.%s.", JOURNAL_OBJECT_PREFIX,
+			 pool_id, journal_id) != len);
+
+	return prefix;
+}
+
+static void journaler_flush(struct work_struct *work);
+static void journaler_finish(struct work_struct *work);
+static void journaler_client_commit(struct work_struct *work);
+static void journaler_notify_update(struct work_struct *work);
+static void journaler_overflow(struct work_struct *work);
+
+struct journaler_append_ctx {
+	struct list_head node;
+	struct ceph_journaler *journaler;
+
+	uint64_t splay_offset;
+	uint64_t object_num;
+	struct page *req_page;
+
+	struct ceph_journaler_future future;
+	struct ceph_journaler_entry entry;
+	struct ceph_journaler_ctx journaler_ctx;
+
+	struct kref kref;
+};
+
+static struct kmem_cache *journaler_commit_entry_cache;
+static struct kmem_cache *journaler_append_ctx_cache;
+
+struct ceph_journaler *ceph_journaler_create(struct ceph_osd_client *osdc,
+					     struct ceph_object_locator *oloc,
+					     const char *journal_id,
+					     const char *client_id)
+{
+	struct ceph_journaler *journaler = NULL;
+	int ret = 0;
+
+	journaler = kzalloc(sizeof(struct ceph_journaler), GFP_KERNEL);
+	if (!journaler)
+		return NULL;
+
+	journaler->osdc = osdc;
+	ceph_oid_init(&journaler->header_oid);
+	ret = ceph_oid_aprintf(&journaler->header_oid, GFP_NOIO, "%s%s",
+			       JOURNAL_HEADER_PREFIX, journal_id);
+	if (ret) {
+		pr_err("aprintf error : %d", ret);
+		goto err_free_journaler;
+	}
+
+	ceph_oloc_init(&journaler->header_oloc);
+	ceph_oloc_copy(&journaler->header_oloc, oloc);
+	ceph_oloc_init(&journaler->data_oloc);
+
+	journaler->object_oid_prefix = object_oid_prefix(journaler->header_oloc.pool,
+							 journal_id);
+	if (!journaler->object_oid_prefix)
+		goto err_destroy_data_oloc;
+
+	journaler->client_id = kstrdup(client_id, GFP_NOIO);
+	if (!journaler->client_id) {
+		ret = -ENOMEM;
+		goto err_free_object_oid_prefix;
+	}
+
+	journaler->advancing = false;
+	journaler->overflowed = false;
+	journaler->commit_scheduled = false;
+	journaler->order = 0;
+	journaler->splay_width = 0;
+	journaler->pool_id = -1;
+	journaler->splay_offset = 0;
+	journaler->active_tag_tid = UINT_MAX;
+	journaler->prune_tag_tid = UINT_MAX;
+	journaler->commit_tid = 0;
+	journaler->minimum_set = 0;
+	journaler->active_set = 0;
+
+	journaler->prev_future = NULL;
+	journaler->client = NULL;
+	journaler->obj_recorders = NULL;
+	journaler->obj_replayers = NULL;
+
+	mutex_init(&journaler->meta_lock);
+	mutex_init(&journaler->commit_lock);
+	spin_lock_init(&journaler->entry_tid_lock);
+	spin_lock_init(&journaler->finish_lock);
+
+	INIT_LIST_HEAD(&journaler->ctx_list);
+	INIT_LIST_HEAD(&journaler->clients);
+	INIT_LIST_HEAD(&journaler->clients_cache);
+	INIT_LIST_HEAD(&journaler->entry_tids);
+	INIT_LIST_HEAD(&journaler->obj_pos_pending);
+	INIT_LIST_HEAD(&journaler->obj_pos_committing);
+
+	journaler->commit_entries = RB_ROOT;
+
+	journaler_commit_entry_cache = KMEM_CACHE(commit_entry, 0);
+	if (!journaler_commit_entry_cache)
+		goto err_free_client_id;
+
+	journaler_append_ctx_cache = KMEM_CACHE(journaler_append_ctx, 0);
+	if (!journaler_append_ctx_cache)
+		goto err_destroy_commit_entry_cache;
+
+	journaler->task_wq = alloc_ordered_workqueue("journaler-tasks",
+						     WQ_MEM_RECLAIM);
+	if (!journaler->task_wq)
+		goto err_destroy_append_ctx_cache;
+
+	journaler->notify_wq = create_singlethread_workqueue("journaler-notify");
+	if (!journaler->notify_wq)
+		goto err_destroy_task_wq;
+
+	INIT_WORK(&journaler->flush_work, journaler_flush);
+	INIT_WORK(&journaler->finish_work, journaler_finish);
+	INIT_DELAYED_WORK(&journaler->commit_work, journaler_client_commit);
+	INIT_WORK(&journaler->notify_update_work, journaler_notify_update);
+	INIT_WORK(&journaler->overflow_work, journaler_overflow);
+
+	journaler->fetch_buf = NULL;
+	journaler->handle_entry = NULL;
+	journaler->entry_handler = NULL;
+	journaler->watch_handle = NULL;
+
+	return journaler;
+
+err_destroy_task_wq:
+	destroy_workqueue(journaler->task_wq);
+err_destroy_append_ctx_cache:
+	kmem_cache_destroy(journaler_append_ctx_cache);
+err_destroy_commit_entry_cache:
+	kmem_cache_destroy(journaler_commit_entry_cache);
+err_free_client_id:
+	kfree(journaler->client_id);
+err_free_object_oid_prefix:
+	kfree(journaler->object_oid_prefix);
+err_destroy_data_oloc:
+	ceph_oloc_destroy(&journaler->data_oloc);
+	ceph_oloc_destroy(&journaler->header_oloc);
+	ceph_oid_destroy(&journaler->header_oid);
+err_free_journaler:
+	kfree(journaler);
+	return NULL;
+}
+EXPORT_SYMBOL(ceph_journaler_create);
+
+void ceph_journaler_destroy(struct ceph_journaler *journaler)
+{
+	destroy_workqueue(journaler->notify_wq);
+	destroy_workqueue(journaler->task_wq);
+
+	kmem_cache_destroy(journaler_append_ctx_cache);
+	kmem_cache_destroy(journaler_commit_entry_cache);
+
+	kfree(journaler->client_id);
+	kfree(journaler->object_oid_prefix);
+	ceph_oloc_destroy(&journaler->data_oloc);
+	ceph_oloc_destroy(&journaler->header_oloc);
+	ceph_oid_destroy(&journaler->header_oid);
+	kfree(journaler);
+}
+EXPORT_SYMBOL(ceph_journaler_destroy);
+
+static int remove_set(struct ceph_journaler *journaler, uint64_t object_set);
+static int set_minimum_set(struct ceph_journaler *journaler,
+			   uint64_t minimum_set);
+
+static int refresh(struct ceph_journaler *journaler, bool init)
+{
+	int ret = 0;
+	struct ceph_journaler_client *client;
+	uint64_t minimum_commit_set;
+	uint64_t minimum_set;
+	uint64_t active_set;
+	bool need_advance = false;
+	LIST_HEAD(tmp_clients);
+
+	INIT_LIST_HEAD(&tmp_clients);
+	ret = ceph_cls_journaler_get_mutable_metas(journaler->osdc,
+						   &journaler->header_oid,
+						   &journaler->header_oloc,
+						   &minimum_set, &active_set);
+	if (ret)
+		return ret;
+
+	ret = ceph_cls_journaler_client_list(journaler->osdc, &journaler->header_oid,
+					     &journaler->header_oloc,
+					     &journaler->clients_cache,
+					     journaler->splay_width);
+	if (ret)
+		return ret;
+
+	mutex_lock(&journaler->meta_lock);
+	if (init) {
+		journaler->active_set = active_set;
+		journaler->minimum_set = minimum_set;
+	} else {
+		// check whether we need to advance the active_set.
+		need_advance = active_set > journaler->active_set;
+		journaler->minimum_set = minimum_set;
+	}
+
+	// swap clients with clients_cache
+	list_splice_tail_init(&journaler->clients, &tmp_clients);
+	list_splice_tail_init(&journaler->clients_cache, &journaler->clients);
+	list_splice_tail_init(&tmp_clients, &journaler->clients_cache);
+
+	// calculate the minimum_commit_set.
+ minimum_commit_set = journaler->active_set; + list_for_each_entry(client, &journaler->clients, node) { + struct ceph_journaler_object_pos *pos; + + list_for_each_entry(pos, &client->object_positions, node) { + uint64_t object_set = pos->object_num / journaler->splay_width; + if (object_set < minimum_commit_set) { + minimum_commit_set = object_set; + } + } + + if (!strcmp(client->id, journaler->client_id)) { + journaler->client = client; + } + } + mutex_unlock(&journaler->meta_lock); + + if (need_advance) { + mutex_lock(&journaler->meta_lock); + journaler->active_set = active_set; + journaler->overflowed = false; + journaler->advancing = false; + mutex_unlock(&journaler->meta_lock); + + queue_work(journaler->task_wq, &journaler->flush_work); + } + + // remove set if necessary + if (minimum_commit_set > minimum_set) { + uint64_t trim_set = minimum_set; + while (trim_set < minimum_commit_set) { + ret = remove_set(journaler, trim_set); + if (ret < 0 && ret != -ENOENT) { + pr_err("failed to trim object_set: %llu", trim_set); + return ret; + } + trim_set++; + } + + ret = set_minimum_set(journaler, minimum_commit_set); + if (ret < 0) { + pr_err("failed to set minimum set to %llu", minimum_commit_set); + return ret; + } + } + + return 0; + +} + +static void journaler_watch_cb(void *arg, u64 notify_id, u64 cookie, + u64 notifier_id, void *data, size_t data_len) +{ + struct ceph_journaler *journaler = arg; + int ret = 0; + + ret = refresh(journaler, false); + if (ret < 0) { + pr_err("%s: failed to refresh journaler: %d", __func__, ret); + } + + ret = ceph_osdc_notify_ack(journaler->osdc, &journaler->header_oid, + &journaler->header_oloc, notify_id, + cookie, NULL, 0); + if (ret) + pr_err("acknowledge_notify failed: %d", ret); +} + +static void journaler_watch_errcb(void *arg, u64 cookie, int err) +{ + // TODO re-watch in watch error. 
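+ // (One possible approach, mirroring rbd's re-watch logic: schedule a
+ // delayed task that re-arms ceph_osdc_watch() and then runs refresh(),
+ // so a blacklisted or disconnected watch does not go unnoticed forever.)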
+ pr_err("journaler watch error: %d", err); +} + +static int journaler_watch(struct ceph_journaler *journaler) +{ + struct ceph_osd_client *osdc = journaler->osdc; + struct ceph_osd_linger_request *handle; + + handle = ceph_osdc_watch(osdc, &journaler->header_oid, + &journaler->header_oloc, journaler->notify_wq, + journaler_watch_cb, journaler_watch_errcb, + journaler); + if (IS_ERR(handle)) + return PTR_ERR(handle); + + journaler->watch_handle = handle; + return 0; +} + +static void journaler_unwatch(struct ceph_journaler *journaler) +{ + struct ceph_osd_client *osdc = journaler->osdc; + int ret = 0; + + ret = ceph_osdc_unwatch(osdc, journaler->watch_handle); + if (ret) + pr_err("%s: failed to unwatch: %d", __func__, ret); + + journaler->watch_handle = NULL; +} + +static void copy_object_pos(struct ceph_journaler_object_pos *src_pos, + struct ceph_journaler_object_pos *dst_pos) +{ + dst_pos->object_num = src_pos->object_num; + dst_pos->tag_tid = src_pos->tag_tid; + dst_pos->entry_tid = src_pos->entry_tid; +} + +static void copy_pos_list(struct list_head *src_list, struct list_head *dst_list) +{ + struct ceph_journaler_object_pos *src_pos, *dst_pos; + + src_pos = list_first_entry(src_list, struct ceph_journaler_object_pos, node); + dst_pos = list_first_entry(dst_list, struct ceph_journaler_object_pos, node); + while (&src_pos->node != src_list && &dst_pos->node != dst_list) { + copy_object_pos(src_pos, dst_pos); + src_pos = list_next_entry(src_pos, node); + dst_pos = list_next_entry(dst_pos, node); + } +} + +int ceph_journaler_open(struct ceph_journaler *journaler) +{ + uint8_t order, splay_width; + int64_t pool_id; + int i = 0, ret = 0; + struct ceph_journaler_client *client = NULL, *next_client = NULL; + + ret = ceph_cls_journaler_get_immutable_metas(journaler->osdc, + &journaler->header_oid, + &journaler->header_oloc, + &order, + &splay_width, + &pool_id); + if (ret) { + pr_err("failed to get immutable metas.");; + goto out; + } + + mutex_lock(&journaler->meta_lock); + // set the immutable metas. + journaler->order = order; + journaler->splay_width = splay_width; + journaler->pool_id = pool_id; + + if (journaler->pool_id == -1) { + ceph_oloc_copy(&journaler->data_oloc, &journaler->header_oloc); + journaler->pool_id = journaler->data_oloc.pool; + } else { + journaler->data_oloc.pool = journaler->pool_id; + } + + // initialize ->obj_recorders and ->obj_replayers. 
+ journaler->obj_recorders = kzalloc(sizeof(struct object_recorder) *
+ journaler->splay_width, GFP_KERNEL);
+
+ if (!journaler->obj_recorders) {
+ ret = -ENOMEM;
+ mutex_unlock(&journaler->meta_lock);
+ goto out;
+ }
+
+ journaler->obj_replayers = kzalloc(sizeof(struct object_replayer) *
+ journaler->splay_width, GFP_KERNEL);
+
+ if (!journaler->obj_replayers) {
+ ret = -ENOMEM;
+ mutex_unlock(&journaler->meta_lock);
+ goto free_recorders;
+ }
+
+ journaler->obj_pos_pending_array = kzalloc(sizeof(struct ceph_journaler_object_pos) *
+ journaler->splay_width, GFP_KERNEL);
+
+ if (!journaler->obj_pos_pending_array) {
+ ret = -ENOMEM;
+ mutex_unlock(&journaler->meta_lock);
+ goto free_replayers;
+ }
+
+ journaler->obj_pos_committing_array = kzalloc(sizeof(struct ceph_journaler_object_pos) *
+ journaler->splay_width, GFP_KERNEL);
+
+ if (!journaler->obj_pos_committing_array) {
+ ret = -ENOMEM;
+ mutex_unlock(&journaler->meta_lock);
+ goto free_pos_pending;
+ }
+
+ for (i = 0; i < journaler->splay_width; i++) {
+ struct object_recorder *obj_recorder = &journaler->obj_recorders[i];
+ struct object_replayer *obj_replayer = &journaler->obj_replayers[i];
+ struct ceph_journaler_object_pos *pos_pending = &journaler->obj_pos_pending_array[i];
+ struct ceph_journaler_object_pos *pos_committing = &journaler->obj_pos_committing_array[i];
+
+ spin_lock_init(&obj_recorder->lock);
+ obj_recorder->overflowed = false;
+ obj_recorder->splay_offset = i;
+ obj_recorder->inflight_append = 0;
+ INIT_LIST_HEAD(&obj_recorder->append_list);
+ INIT_LIST_HEAD(&obj_recorder->overflow_list);
+
+ spin_lock_init(&obj_replayer->lock);
+ obj_replayer->object_num = i;
+ obj_replayer->pos = NULL;
+ INIT_LIST_HEAD(&obj_replayer->entry_list);
+
+ pos_pending->in_using = false;
+ INIT_LIST_HEAD(&pos_pending->node);
+ list_add_tail(&pos_pending->node, &journaler->obj_pos_pending);
+
+ pos_committing->in_using = false;
+ INIT_LIST_HEAD(&pos_committing->node);
+ list_add_tail(&pos_committing->node, &journaler->obj_pos_committing);
+ }
+ mutex_unlock(&journaler->meta_lock);
+
+ ret = refresh(journaler, true);
+ if (ret)
+ goto free_pos_committing;
+
+ mutex_lock(&journaler->meta_lock);
+ if (journaler->client) {
+ copy_pos_list(&journaler->client->object_positions,
+ &journaler->obj_pos_pending);
+ }
+ mutex_unlock(&journaler->meta_lock);
+
+ ret = journaler_watch(journaler);
+ if (ret) {
+ pr_err("journaler_watch error: %d", ret);
+ goto destroy_clients;
+ }
+ return 0;
+
+destroy_clients:
+ list_for_each_entry_safe(client, next_client,
+ &journaler->clients, node) {
+ list_del(&client->node);
+ destroy_client(client);
+ }
+
+ list_for_each_entry_safe(client, next_client,
+ &journaler->clients_cache, node) {
+ list_del(&client->node);
+ destroy_client(client);
+ }
+free_pos_committing:
+ kfree(journaler->obj_pos_committing_array);
+free_pos_pending:
+ kfree(journaler->obj_pos_pending_array);
+free_replayers:
+ kfree(journaler->obj_replayers);
+free_recorders:
+ kfree(journaler->obj_recorders);
+out:
+ return ret;
+}
+EXPORT_SYMBOL(ceph_journaler_open);
+
+DEFINE_RB_INSDEL_FUNCS(commit_entry, struct commit_entry, commit_tid, r_node)
+
+void ceph_journaler_close(struct ceph_journaler *journaler)
+{
+ struct ceph_journaler_client *client = NULL, *next = NULL;
+ struct commit_entry *commit_entry = NULL;
+ struct entry_tid *entry_tid = NULL, *entry_tid_next = NULL;
+ struct ceph_journaler_object_pos *pos = NULL, *next_pos = NULL;
+ struct rb_node *n;
+ int i = 0;
+
+ // Stop watching
+ journaler_unwatch(journaler);
+ flush_workqueue(journaler->notify_wq);
+
+ flush_delayed_work(&journaler->commit_work);
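+ // Teardown order matters: unwatch stops new notifies, flushing
+ // notify_wq completes in-flight refreshes, flushing commit_work pushes
+ // out the last commit position, and draining task_wq below ensures no
+ // flush/overflow/finish work remains before state is torn down.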
+ drain_workqueue(journaler->task_wq); + list_for_each_entry_safe(pos, next_pos, + &journaler->obj_pos_pending, node) { + list_del(&pos->node); + } + + list_for_each_entry_safe(pos, next_pos, + &journaler->obj_pos_committing, node) { + list_del(&pos->node); + } + journaler->client = NULL; + list_for_each_entry_safe(client, next, &journaler->clients, node) { + list_del(&client->node); + destroy_client(client); + } + list_for_each_entry_safe(client, next, &journaler->clients_cache, node) { + list_del(&client->node); + destroy_client(client); + } + + for (n = rb_first(&journaler->commit_entries); n;) { + commit_entry = rb_entry(n, struct commit_entry, r_node); + + n = rb_next(n); + erase_commit_entry(&journaler->commit_entries, commit_entry); + kmem_cache_free(journaler_commit_entry_cache, commit_entry); + } + + for (i = 0; i < journaler->splay_width; i++) { + struct object_recorder *obj_recorder = &journaler->obj_recorders[i]; + struct object_replayer *obj_replayer = &journaler->obj_replayers[i]; + + spin_lock(&obj_recorder->lock); + BUG_ON(!list_empty(&obj_recorder->append_list) || + !list_empty(&obj_recorder->overflow_list)); + spin_unlock(&obj_recorder->lock); + + spin_lock(&obj_replayer->lock); + BUG_ON(!list_empty(&obj_replayer->entry_list)); + spin_unlock(&obj_replayer->lock); + } + + kfree(journaler->obj_pos_committing_array); + kfree(journaler->obj_pos_pending_array); + kfree(journaler->obj_recorders); + kfree(journaler->obj_replayers); + journaler->obj_recorders = NULL; + journaler->obj_replayers = NULL; + + list_for_each_entry_safe(entry_tid, entry_tid_next, + &journaler->entry_tids, node) { + list_del(&entry_tid->node); + kfree(entry_tid); + } + + WARN_ON(!list_empty(&journaler->ctx_list)); + WARN_ON(!list_empty(&journaler->clients)); + WARN_ON(!list_empty(&journaler->clients_cache)); + WARN_ON(!list_empty(&journaler->entry_tids)); + WARN_ON(!list_empty(&journaler->obj_pos_pending)); + WARN_ON(rb_first(&journaler->commit_entries) != NULL); + + mutex_lock(&journaler->meta_lock); + ceph_oloc_init(&journaler->data_oloc); + journaler->advancing = false; + journaler->overflowed = false; + journaler->commit_scheduled = false; + journaler->order = 0; + journaler->splay_width = 0; + journaler->pool_id = -1; + journaler->splay_offset = 0; + journaler->active_tag_tid = UINT_MAX; + journaler->prune_tag_tid = UINT_MAX; + journaler->commit_tid = 0; + journaler->minimum_set = 0; + journaler->active_set = 0; + journaler->prev_future = NULL; + journaler->fetch_buf = NULL; + journaler->handle_entry = NULL; + journaler->entry_handler = NULL; + journaler->watch_handle = NULL; + + mutex_unlock(&journaler->meta_lock); + + return; +} +EXPORT_SYMBOL(ceph_journaler_close); + +int ceph_journaler_get_cached_client(struct ceph_journaler *journaler, char *client_id, + struct ceph_journaler_client **client_result) +{ + struct ceph_journaler_client *client = NULL; + int ret = -ENOENT; + + list_for_each_entry(client, &journaler->clients, node) { + if (!strcmp(client->id, client_id)) { + *client_result = client; + ret = 0; + break; + } + } + + return ret; +} +EXPORT_SYMBOL(ceph_journaler_get_cached_client); From patchwork Mon Mar 18 09:15:25 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 10857011 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 60B3A1709 for ; Mon, 18 Mar 2019 09:23:06 
+0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 415CE2922F for ; Mon, 18 Mar 2019 09:23:06 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 35902292F6; Mon, 18 Mar 2019 09:23:06 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 551F62922F for ; Mon, 18 Mar 2019 09:23:04 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727234AbfCRJXD (ORCPT ); Mon, 18 Mar 2019 05:23:03 -0400
Received: from m97134.mail.qiye.163.com ([220.181.97.134]:50295 "EHLO m97134.mail.qiye.163.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727043AbfCRJXB (ORCPT ); Mon, 18 Mar 2019 05:23:01 -0400
Received: from atest-guest.localdomain (unknown [218.94.118.90]) by smtp5 (Coremail) with SMTP id huCowABHr4m4YY9cO8zEAg--.331S9; Mon, 18 Mar 2019 17:15:40 +0800 (CST)
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 07/16] libceph: journaling: introduce api to replay uncommitted journal events
Date: Mon, 18 Mar 2019 05:15:25 -0400
Message-Id: <1552900534-29026-8-git-send-email-dongsheng.yang@easystack.cn>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
X-CM-TRANSID: huCowABHr4m4YY9cO8zEAg--.331S9
X-Coremail-Antispam: 1Uf129KBjvAXoW3uw45Ww45uFykCr15KF45KFg_yoW8WF15Co WxZF4UC3W5Ga45uFyxGr1kW34Sqa48JayrAr4aqFWY93Z7Ary0939rGr15JryYyF45ur4D Xws7Jwn3tw4DA34Un29KB7ZKAUJUUUUr529EdanIXcx71UUUUU7v73VFW2AGmfu7bjvjm3 AaLaJ3UbIYCTnIWIevJa73UjIFyTuYvjfUL1xRDUUUU
X-Originating-IP: [218.94.118.90]
X-CM-SenderInfo: 5grqw2pkhqwhp1dqwq5hdv52pwdfyhdfq/1tbiixx7eltVgCnUhgAAsj
Sender: ceph-devel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: ceph-devel@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

To make sure the data and the journal are consistent when opening the journal, we can call start_replay() to replay all events that were recorded but not yet committed.

Signed-off-by: Dongsheng Yang
---
 net/ceph/journaler.c | 568 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 568 insertions(+)

diff --git a/net/ceph/journaler.c b/net/ceph/journaler.c index d8721a9..699b5258 100644 --- a/net/ceph/journaler.c +++ b/net/ceph/journaler.c @@ -370,6 +370,24 @@ int ceph_journaler_open(struct ceph_journaler *journaler) goto out; }
+ // TODO: When the journal object size is smaller than 8M, it is possible
+ // to build a journal object larger than 2*object_size.
+ // e.g.:
+ // When the journal object is 4M,
+ // (1) append first entry size of (4M - 1)
+ // (2) append second entry size of (4M + prefix_len + suffix_len)
+ // (3) journaler object size would be > 8M, which is 2*object_size.
+ //
+ // But when we do fetch, we will read a whole object into fetch_buf, which is
+ // 2*object_size in size. So if the object size is larger than fetch_len, the
+ // fetch will fail.
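+ // (fetch_len here is the 2 << order fetch_buf allocated in
+ // ceph_journaler_start_replay().)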
+ //
+ // To solve this problem, we need to split the append entry when we find
+ // the entry size would be larger than the object size.
+ if (order < 23) {
+ return -EOPNOTSUPP;
+ }
+
 mutex_lock(&journaler->meta_lock); // set the immutable metas. journaler->order = order; @@ -608,3 +626,553 @@ int ceph_journaler_get_cached_client(struct ceph_journaler *journaler, char *cli return ret; } EXPORT_SYMBOL(ceph_journaler_get_cached_client);
+
+// replaying
+static int ceph_journaler_obj_read_sync(struct ceph_journaler *journaler,
+ struct ceph_object_id *oid,
+ struct ceph_object_locator *oloc,
+ void *buf, uint32_t read_off, uint64_t buf_len)
+
+{
+ struct ceph_osd_client *osdc = journaler->osdc;
+ struct ceph_osd_request *req;
+ struct page **pages;
+ int num_pages = calc_pages_for(0, buf_len);
+ int ret;
+
+ req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_NOIO);
+ if (!req)
+ return -ENOMEM;
+
+ ceph_oid_copy(&req->r_base_oid, oid);
+ ceph_oloc_copy(&req->r_base_oloc, oloc);
+ req->r_flags = CEPH_OSD_FLAG_READ;
+
+ pages = ceph_alloc_page_vector(num_pages, GFP_NOIO);
+ if (IS_ERR(pages)) {
+ ret = PTR_ERR(pages);
+ goto out_req;
+ }
+
+ osd_req_op_extent_init(req, 0, CEPH_OSD_OP_READ, read_off, buf_len, 0, 0);
+ osd_req_op_extent_osd_data_pages(req, 0, pages, buf_len, 0, false,
+ true);
+
+ ret = ceph_osdc_alloc_messages(req, GFP_NOIO);
+ if (ret)
+ goto out_req;
+
+ ceph_osdc_start_request(osdc, req, false);
+ ret = ceph_osdc_wait_request(osdc, req);
+ if (ret >= 0)
+ ceph_copy_from_page_vector(pages, buf, 0, ret);
+
+out_req:
+ ceph_osdc_put_request(req);
+ return ret;
+}
+
+static bool entry_is_readable(struct ceph_journaler *journaler, void *buf,
+ void *end, uint32_t *bytes_needed)
+{
+ uint32_t remaining = end - buf;
+ uint64_t preamble;
+ uint32_t data_size;
+ void *origin_buf = buf;
+ uint32_t crc = 0, crc_encoded = 0;
+
+ if (remaining < HEADER_FIXED_SIZE) {
+ *bytes_needed = HEADER_FIXED_SIZE - remaining;
+ return false;
+ }
+
+ preamble = ceph_decode_64(&buf);
+ if (PREAMBLE != preamble) {
+ *bytes_needed = 0;
+ return false;
+ }
+
+ buf += (HEADER_FIXED_SIZE - sizeof(preamble));
+ remaining = end - buf;
+ if (remaining < sizeof(uint32_t)) {
+ *bytes_needed = sizeof(uint32_t) - remaining;
+ return false;
+ }
+
+ data_size = ceph_decode_32(&buf);
+ remaining = end - buf;
+ if (remaining < data_size) {
+ *bytes_needed = data_size - remaining;
+ return false;
+ }
+
+ buf += data_size;
+
+ remaining = end - buf;
+ if (remaining < sizeof(uint32_t)) {
+ *bytes_needed = sizeof(uint32_t) - remaining;
+ return false;
+ }
+
+ *bytes_needed = 0;
+ crc = crc32c(0, origin_buf, buf - origin_buf);
+ crc_encoded = ceph_decode_32(&buf);
+ if (crc != crc_encoded) {
+ pr_err("crc mismatch");
+ return false;
+ }
+ return true;
+}
+
+static int playback_entry(struct ceph_journaler *journaler,
+ struct ceph_journaler_entry *entry,
+ uint64_t commit_tid)
+{
+ int ret = 0;
+
+ if (journaler->handle_entry != NULL)
+ ret = journaler->handle_entry(journaler->entry_handler,
+ entry, commit_tid);
+
+ return ret;
+}
+
+static bool get_last_entry_tid(struct ceph_journaler *journaler,
+ uint64_t tag_tid, uint64_t *entry_tid)
+{
+ struct entry_tid *pos = NULL;
+
+ spin_lock(&journaler->entry_tid_lock);
+ list_for_each_entry(pos, &journaler->entry_tids, node) {
+ if (pos->tag_tid == tag_tid) {
+ *entry_tid = pos->entry_tid;
+ spin_unlock(&journaler->entry_tid_lock);
+ return true;
+ }
+ }
+ spin_unlock(&journaler->entry_tid_lock);
+
+ return false;
+}
+
+// There would not be too many entry_tids here, we need
+ 
only one entry_tid for all entries with same tag_tid. +static struct entry_tid *entry_tid_alloc(struct ceph_journaler *journaler, + uint64_t tag_tid) +{ + struct entry_tid *entry_tid; + + entry_tid = kzalloc(sizeof(struct entry_tid), GFP_NOIO); + if (!entry_tid) { + pr_err("failed to allocate new entry."); + return NULL; + } + + entry_tid->tag_tid = tag_tid; + entry_tid->entry_tid = 0; + INIT_LIST_HEAD(&entry_tid->node); + + list_add_tail(&entry_tid->node, &journaler->entry_tids); + return entry_tid; +} + +static int reserve_entry_tid(struct ceph_journaler *journaler, + uint64_t tag_tid, uint64_t entry_tid) +{ + struct entry_tid *pos; + + spin_lock(&journaler->entry_tid_lock); + list_for_each_entry(pos, &journaler->entry_tids, node) { + if (pos->tag_tid == tag_tid) { + if (pos->entry_tid < entry_tid) { + pos->entry_tid = entry_tid; + } + + spin_unlock(&journaler->entry_tid_lock); + return 0; + } + } + + pos = entry_tid_alloc(journaler, tag_tid); + if (!pos) { + spin_unlock(&journaler->entry_tid_lock); + pr_err("failed to allocate new entry."); + return -ENOMEM; + } + + pos->entry_tid = entry_tid; + spin_unlock(&journaler->entry_tid_lock); + + return 0; +} + +static void journaler_entry_free(struct ceph_journaler_entry *entry) +{ + if (entry->data) + kvfree(entry->data); + kfree(entry); +} + +static struct ceph_journaler_entry *journaler_entry_decode(void **p, void *end) +{ + struct ceph_journaler_entry *entry = NULL; + uint64_t preamble = 0; + uint8_t version = 0; + uint32_t crc = 0, crc_encoded = 0; + void *start = *p; + + preamble = ceph_decode_64(p); + if (PREAMBLE != preamble) { + return NULL; + } + + version = ceph_decode_8(p); + if (version != 1) + return NULL; + + entry = kzalloc(sizeof(struct ceph_journaler_entry), GFP_NOIO); + if (!entry) { + goto err; + } + + INIT_LIST_HEAD(&entry->node); + entry->entry_tid = ceph_decode_64(p); + entry->tag_tid = ceph_decode_64(p); + entry->data = ceph_extract_encoded_string_kvmalloc(p, end, + &entry->data_len, GFP_KERNEL); + if (IS_ERR(entry->data)) { + entry->data = NULL; + goto free_entry; + } + + crc = crc32c(0, start, *p - start); + crc_encoded = ceph_decode_32(p); + if (crc != crc_encoded) { + goto free_entry; + } + return entry; + +free_entry: + journaler_entry_free(entry); +err: + return NULL; +} + +static int fetch(struct ceph_journaler *journaler, uint64_t object_num) +{ + struct ceph_object_id object_oid; + int ret = 0; + void *read_buf = NULL, *end = NULL; + uint64_t read_len = 2 << journaler->order; + struct ceph_journaler_object_pos *pos; + struct object_replayer *obj_replayer = NULL; + + obj_replayer = &journaler->obj_replayers[object_num % journaler->splay_width]; + obj_replayer->object_num = object_num; + list_for_each_entry(pos, &journaler->client->object_positions, node) { + if (pos->in_using && pos->object_num == object_num) { + obj_replayer->pos = pos; + break; + } + } + + ceph_oid_init(&object_oid); + ret = ceph_oid_aprintf(&object_oid, GFP_NOIO, "%s%llu", + journaler->object_oid_prefix, object_num); + if (ret) { + pr_err("failed to initialize object_id : %d", ret); + return ret; + } + + read_buf = journaler->fetch_buf; + ret = ceph_journaler_obj_read_sync(journaler, &object_oid, + &journaler->data_oloc, read_buf, + 0, read_len); + if (ret == -ENOENT) { + dout("no such object, %s: %d", object_oid.name, ret); + goto err_free_object_oid; + } else if (ret < 0) { + pr_err("failed to read: %d", ret); + goto err_free_object_oid; + } else if (ret == 0) { + pr_err("no data: %d", ret); + goto err_free_object_oid; + } + + end = 
read_buf + ret;
+ while (read_buf < end) {
+ uint32_t bytes_needed = 0;
+ struct ceph_journaler_entry *entry = NULL;
+
+ if (!entry_is_readable(journaler, read_buf, end, &bytes_needed)) {
+ ret = -EIO;
+ goto err_free_object_oid;
+ }
+
+ entry = journaler_entry_decode(&read_buf, end);
+ if (!entry) {
+ ret = -EIO;
+ goto err_free_object_oid;
+ }
+
+ list_add_tail(&entry->node, &obj_replayer->entry_list);
+ }
+ ret = 0;
+
+err_free_object_oid:
+ ceph_oid_destroy(&object_oid);
+ return ret;
+}
+
+static int add_commit_entry(struct ceph_journaler *journaler, uint64_t commit_tid,
+ uint64_t object_num, uint64_t tag_tid, uint64_t entry_tid)
+{
+ struct commit_entry *commit_entry = NULL;
+
+ commit_entry = kmem_cache_zalloc(journaler_commit_entry_cache, GFP_NOIO);
+ if (!commit_entry)
+ return -ENOMEM;
+
+ RB_CLEAR_NODE(&commit_entry->r_node);
+
+ commit_entry->commit_tid = commit_tid;
+ commit_entry->object_num = object_num;
+ commit_entry->tag_tid = tag_tid;
+ commit_entry->entry_tid = entry_tid;
+
+ mutex_lock(&journaler->commit_lock);
+ insert_commit_entry(&journaler->commit_entries, commit_entry);
+ mutex_unlock(&journaler->commit_lock);
+
+ return 0;
+}
+
+static uint64_t allocate_commit_tid(struct ceph_journaler *journaler)
+{
+ return ++journaler->commit_tid;
+}
+
+static void prune_tag(struct ceph_journaler *journaler, uint64_t tag_tid)
+{
+ struct ceph_journaler_entry *entry, *next;
+ struct object_replayer *obj_replayer = NULL;
+ int i = 0;
+
+ if (journaler->prune_tag_tid == UINT_MAX ||
+ journaler->prune_tag_tid < tag_tid) {
+ journaler->prune_tag_tid = tag_tid;
+ }
+
+ for (i = 0; i < journaler->splay_width; i++) {
+ obj_replayer = &journaler->obj_replayers[i];
+ list_for_each_entry_safe(entry, next,
+ &obj_replayer->entry_list, node) {
+ if (entry->tag_tid == tag_tid) {
+ list_del(&entry->node);
+ journaler_entry_free(entry);
+ }
+ }
+ }
+}
+
+static int get_first_entry(struct ceph_journaler *journaler,
+ struct ceph_journaler_entry **entry,
+ uint64_t *commit_tid)
+{
+ struct object_replayer *obj_replayer = NULL;
+ struct ceph_journaler_entry *tmp_entry = NULL;
+ uint64_t last_entry_tid = 0;
+ int ret = 0;
+
+next:
+ obj_replayer = &journaler->obj_replayers[journaler->splay_offset];
+ if (list_empty(&obj_replayer->entry_list)) {
+ return -ENOENT;
+ }
+
+ tmp_entry = list_first_entry(&obj_replayer->entry_list,
+ struct ceph_journaler_entry, node);
+
+ journaler->splay_offset = (journaler->splay_offset + 1) % journaler->splay_width;
+ if (journaler->active_tag_tid == UINT_MAX) {
+ journaler->active_tag_tid = tmp_entry->tag_tid;
+ } else if (tmp_entry->tag_tid < journaler->active_tag_tid ||
+ (journaler->prune_tag_tid != UINT_MAX &&
+ tmp_entry->tag_tid <= journaler->prune_tag_tid)) {
+ list_del(&tmp_entry->node);
+ // prune before freeing: tmp_entry must not be used after free.
+ prune_tag(journaler, tmp_entry->tag_tid);
+ journaler_entry_free(tmp_entry);
+ goto next;
+ } else if (tmp_entry->tag_tid > journaler->active_tag_tid) {
+ prune_tag(journaler, journaler->active_tag_tid);
+ journaler->active_tag_tid = tmp_entry->tag_tid;
+
+ if (tmp_entry->entry_tid != 0) {
+ journaler->splay_offset = 0;
+ goto next;
+ }
+ }
+
+ list_del(&tmp_entry->node);
+ ret = get_last_entry_tid(journaler, tmp_entry->tag_tid, &last_entry_tid);
+ if (ret && tmp_entry->entry_tid != last_entry_tid + 1) {
+ pr_err("missing prior journal entry, last_entry_tid: %llu",
+ last_entry_tid);
+ ret = -ENOMSG;
+ goto free_entry;
+ }
+
+ if (list_empty(&obj_replayer->entry_list)) {
+ ret = fetch(journaler, obj_replayer->object_num + journaler->splay_width);
+ if (ret && ret != -ENOENT) {
+ goto 
free_entry; + } + } + + ret = reserve_entry_tid(journaler, tmp_entry->tag_tid, tmp_entry->entry_tid); + if (ret) + goto free_entry; + + *commit_tid = allocate_commit_tid(journaler); + ret = add_commit_entry(journaler, *commit_tid, obj_replayer->object_num, + tmp_entry->tag_tid, tmp_entry->entry_tid); + if (ret) + goto free_entry; + + *entry = tmp_entry; + return 0; + +free_entry: + journaler_entry_free(tmp_entry); + return ret; +} + +static int process_replay(struct ceph_journaler *journaler) +{ + int r = 0; + struct ceph_journaler_entry *entry = NULL; + uint64_t commit_tid = 0; + +next: + r = get_first_entry(journaler, &entry, &commit_tid); + if (r) { + if (r == -ENOENT) { + prune_tag(journaler, journaler->active_tag_tid); + r = 0; + } + return r; + } + + r = playback_entry(journaler, entry, commit_tid); + journaler_entry_free(entry); + if (r) { + return r; + } + + goto next; +} + +static int preprocess_replay(struct ceph_journaler *journaler) +{ + struct ceph_journaler_entry *entry, *next; + bool found_commit = false; + struct object_replayer *obj_replayer = NULL; + int i = 0; + int ret = 0; + + for (i = 0; i < journaler->splay_width; i++) { + obj_replayer = &journaler->obj_replayers[i]; + + if (!obj_replayer->pos) + continue; + + found_commit = false; + list_for_each_entry_safe(entry, next, + &obj_replayer->entry_list, node) { + if (entry->tag_tid == obj_replayer->pos->tag_tid && + entry->entry_tid == obj_replayer->pos->entry_tid) { + found_commit = true; + } else if (found_commit) { + break; + } + + ret = reserve_entry_tid(journaler, entry->tag_tid, entry->entry_tid); + if (ret) + return ret; + list_del(&entry->node); + journaler_entry_free(entry); + } + } + return 0; +} + +int ceph_journaler_start_replay(struct ceph_journaler *journaler) +{ + struct ceph_journaler_object_pos *active_pos = NULL; + uint64_t *fetch_objects = NULL; + uint64_t buf_len = (2 << journaler->order); + uint64_t object_num; + int i = 0; + int ret = 0; + + fetch_objects = kzalloc(sizeof(uint64_t) * journaler->splay_width, GFP_NOIO); + if (!fetch_objects) { + return -ENOMEM; + } + + mutex_lock(&journaler->meta_lock); + active_pos = list_first_entry(&journaler->client->object_positions, + struct ceph_journaler_object_pos, node); + if (active_pos->in_using) { + journaler->splay_offset = (active_pos->object_num + 1) % journaler->splay_width; + journaler->active_tag_tid = active_pos->tag_tid; + + list_for_each_entry(active_pos, &journaler->client->object_positions, node) { + if (active_pos->in_using) { + fetch_objects[active_pos->object_num % + journaler->splay_width] = active_pos->object_num; + } + } + } + mutex_unlock(&journaler->meta_lock); + + journaler->fetch_buf = ceph_kvmalloc(buf_len, GFP_NOIO); + if (!journaler->fetch_buf) { + pr_err("failed to alloc fetch buf: %llu", buf_len); + ret = -ENOMEM; + goto out; + } + + for (i = 0; i < journaler->splay_width; i++) { + if (fetch_objects[i] == 0) { + object_num = i; + } else { + object_num = fetch_objects[i]; + } + ret = fetch(journaler, object_num); + if (ret && ret != -ENOENT) + goto free_fetch_buf; + } + + ret = preprocess_replay(journaler); + if (ret) + goto free_fetch_buf; + + ret = process_replay(journaler); + +free_fetch_buf: + kvfree(journaler->fetch_buf); +out: + for (i = 0; i < journaler->splay_width; i++) { + struct object_replayer *obj_replayer = &journaler->obj_replayers[i]; + struct ceph_journaler_entry *entry = NULL, *next_entry = NULL; + + spin_lock(&obj_replayer->lock); + list_for_each_entry_safe(entry, next_entry, &obj_replayer->entry_list, node) { + 
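+ // free anything fetch() queued that replay did not consume
+ // (e.g. entries left behind after an error in process_replay).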
list_del(&entry->node);
+ journaler_entry_free(entry);
+ }
+ spin_unlock(&obj_replayer->lock);
+ }
+ kfree(fetch_objects);
+ return ret;
+}
+EXPORT_SYMBOL(ceph_journaler_start_replay);
From patchwork Mon Mar 18 09:15:26 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857017
Return-Path:
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B2C9F6C2 for ; Mon, 18 Mar 2019 09:23:08 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 979F529249 for ; Mon, 18 Mar 2019 09:23:08 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 8BAD7292A2; Mon, 18 Mar 2019 09:23:08 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4EBBB29249 for ; Mon, 18 Mar 2019 09:23:07 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727238AbfCRJXG (ORCPT ); Mon, 18 Mar 2019 05:23:06 -0400
Received: from m97134.mail.qiye.163.com ([220.181.97.134]:50203 "EHLO m97134.mail.qiye.163.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726860AbfCRJXE (ORCPT ); Mon, 18 Mar 2019 05:23:04 -0400
Received: from atest-guest.localdomain (unknown [218.94.118.90]) by smtp5 (Coremail) with SMTP id huCowABHr4m4YY9cO8zEAg--.331S10; Mon, 18 Mar 2019 17:15:40 +0800 (CST)
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 08/16] libceph: journaling: introduce api for journal appending
Date: Mon, 18 Mar 2019 05:15:26 -0400
Message-Id: <1552900534-29026-9-git-send-email-dongsheng.yang@easystack.cn>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
X-CM-TRANSID: huCowABHr4m4YY9cO8zEAg--.331S10
X-Coremail-Antispam: 1Uf129KBjvAXoWftFW8KryUKF43Cw1UXr43Wrg_yoW8tFy5Ko WxWr48uFn5Ga429F97Kry8Ja4rX348XayrArWYgF4a9anrAry8Z3y7Gr15Jry5Aw4UCrsF qw1xJwn3WF4DJ3WUn29KB7ZKAUJUUUUr529EdanIXcx71UUUUU7v73VFW2AGmfu7bjvjm3 AaLaJ3UbIYCTnIWIevJa73UjIFyTuYvjfUlU5rDUUUU
X-Originating-IP: [218.94.118.90]
X-CM-SenderInfo: 5grqw2pkhqwhp1dqwq5hdv52pwdfyhdfq/1tbihB17elsfmou8-gAAsP
Sender: ceph-devel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: ceph-devel@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

This commit introduces 3 APIs for journal recording:

(1) ceph_journaler_allocate_tag()
This API allocates a new tag so the user gets a unified tag_tid; each event appended by this user is then tagged with this tag_tid.

(2) ceph_journaler_append()
This API allows the user to append an event to the journal objects.

(3) ceph_journaler_client_committed()
This API notifies the journaling layer that an event has been committed; the event can be removed from the journal once no other client refers to it.
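For example, a typical append sequence would look like this (the completion callback name is illustrative; error handling omitted):

	struct ceph_journaler_ctx *ctx;
	uint64_t commit_tid;

	ctx = ceph_journaler_ctx_alloc();
	ctx->bio_iter = bio_iter;	/* payload to be journaled */
	ctx->callback = append_done;	/* illustrative completion hook */
	ceph_journaler_append(journaler, tag_tid, &commit_tid, ctx);
	/* later, once the event has been applied to the image: */
	ceph_journaler_client_committed(journaler, commit_tid);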
Signed-off-by: Dongsheng Yang --- net/ceph/journaler.c | 726 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 726 insertions(+) diff --git a/net/ceph/journaler.c b/net/ceph/journaler.c index 699b5258..05c94a2 100644 --- a/net/ceph/journaler.c +++ b/net/ceph/journaler.c @@ -1176,3 +1176,729 @@ int ceph_journaler_start_replay(struct ceph_journaler *journaler) return ret; } EXPORT_SYMBOL(ceph_journaler_start_replay); + +// recording +static int get_new_entry_tid(struct ceph_journaler *journaler, + uint64_t tag_tid, uint64_t *entry_tid) +{ + struct entry_tid *pos = NULL; + + spin_lock(&journaler->entry_tid_lock); + list_for_each_entry(pos, &journaler->entry_tids, node) { + if (pos->tag_tid == tag_tid) { + *entry_tid = pos->entry_tid++; + spin_unlock(&journaler->entry_tid_lock); + return 0; + } + } + + pos = entry_tid_alloc(journaler, tag_tid); + if (!pos) { + spin_unlock(&journaler->entry_tid_lock); + pr_err("failed to allocate new entry."); + return -ENOMEM; + } + + *entry_tid = pos->entry_tid++; + spin_unlock(&journaler->entry_tid_lock); + + return 0; +} + +static uint64_t get_object(struct ceph_journaler *journaler, uint64_t splay_offset) +{ + return splay_offset + (journaler->splay_width * journaler->active_set); +} + +static void future_init(struct ceph_journaler_future *future, + uint64_t tag_tid, + uint64_t entry_tid, + uint64_t commit_tid, + struct ceph_journaler_ctx *journaler_ctx) +{ + future->tag_tid = tag_tid; + future->entry_tid = entry_tid; + future->commit_tid = commit_tid; + + spin_lock_init(&future->lock); + future->safe = false; + future->consistent = false; + + future->ctx = journaler_ctx; + future->wait = NULL; +} + +static void set_prev_future(struct ceph_journaler *journaler, + struct ceph_journaler_future *future) +{ + bool prev_future_finished = false; + + if (journaler->prev_future == NULL) { + prev_future_finished = true; + } else { + spin_lock(&journaler->prev_future->lock); + prev_future_finished = (journaler->prev_future->consistent && + journaler->prev_future->safe); + journaler->prev_future->wait = future; + spin_unlock(&journaler->prev_future->lock); + } + + spin_lock(&future->lock); + if (prev_future_finished) { + future->consistent = true; + } + spin_unlock(&future->lock); + + journaler->prev_future = future; +} + +static void entry_init(struct ceph_journaler_entry *entry, + uint64_t tag_tid, + uint64_t entry_tid, + struct ceph_journaler_ctx *journaler_ctx) +{ + entry->tag_tid = tag_tid; + entry->entry_tid = entry_tid; + entry->data_len = journaler_ctx->bio_len + + journaler_ctx->prefix_len + journaler_ctx->suffix_len; +} + +static void journaler_entry_encode_prefix(struct ceph_journaler_entry *entry, + void **p, void *end) +{ + ceph_encode_64(p, PREAMBLE); + ceph_encode_8(p, (uint8_t)1); + ceph_encode_64(p, entry->entry_tid); + ceph_encode_64(p, entry->tag_tid); + + ceph_encode_32(p, entry->data_len); +} + +static uint32_t crc_bio(uint32_t crc, struct bio *bio) +{ + struct bio_vec bv; + struct bvec_iter iter; + char *buf = NULL; + u64 offset = 0; + +next: + bio_for_each_segment(bv, bio, iter) { + buf = page_address(bv.bv_page) + bv.bv_offset; + crc = crc32c(crc, buf, bv.bv_len); + offset += bv.bv_len; + } + + if (bio->bi_next) { + bio = bio->bi_next; + goto next; + } + + return crc; +} + +static void journaler_finish(struct work_struct *work) +{ + struct ceph_journaler *journaler = container_of(work, struct ceph_journaler, + finish_work); + struct ceph_journaler_ctx *ctx_pos, *next; + + spin_lock(&journaler->finish_lock); + 
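+ // note: the callbacks below run with finish_lock held, so they must
+ // not sleep.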
list_for_each_entry_safe(ctx_pos, next, &journaler->ctx_list, node) { + list_del(&ctx_pos->node); + ctx_pos->callback(ctx_pos); + } + spin_unlock(&journaler->finish_lock); +} + +static void future_consistent(struct ceph_journaler *journaler, + struct ceph_journaler_future *future, + int result); +static void future_finish(struct ceph_journaler *journaler, + struct ceph_journaler_future *future, + int result) { + struct ceph_journaler_ctx *journaler_ctx = future->ctx; + struct ceph_journaler_future *future_wait = future->wait; + + mutex_lock(&journaler->meta_lock); + if (journaler->prev_future == future) + journaler->prev_future = NULL; + mutex_unlock(&journaler->meta_lock); + + spin_lock(&journaler->finish_lock); + if (journaler_ctx->result == 0) + journaler_ctx->result = result; + list_add_tail(&journaler_ctx->node, &journaler->ctx_list); + spin_unlock(&journaler->finish_lock); + + queue_work(journaler->task_wq, &journaler->finish_work); + if (future_wait) + future_consistent(journaler, future_wait, result); +} + +static void future_consistent(struct ceph_journaler *journaler, + struct ceph_journaler_future *future, + int result) { + bool future_finished = false; + + spin_lock(&future->lock); + future->consistent = true; + future_finished = (future->safe && future->consistent); + spin_unlock(&future->lock); + + if (future_finished) + future_finish(journaler, future, result); +} + +static void future_safe(struct ceph_journaler *journaler, + struct ceph_journaler_future *future, + int result) { + bool future_finished = false; + + spin_lock(&future->lock); + future->safe = true; + future_finished = (future->safe && future->consistent); + spin_unlock(&future->lock); + + if (future_finished) + future_finish(journaler, future, result); +} + +static void journaler_notify_update(struct work_struct *work) +{ + struct ceph_journaler *journaler = container_of(work, + struct ceph_journaler, + notify_update_work); + int ret = 0; + + ret = ceph_osdc_notify(journaler->osdc, &journaler->header_oid, + &journaler->header_oloc, NULL, 0, + 5000, NULL, NULL); + if (ret) + pr_err("notify_update failed: %d", ret); +} + +static bool advance_object_set(struct ceph_journaler *journaler) +{ + int ret = 0; + int i = 0; + struct object_recorder *obj_recorder; + uint64_t active_set = 0; + + mutex_lock(&journaler->meta_lock); + if (journaler->advancing) { + mutex_unlock(&journaler->meta_lock); + return false; + } + + // make sure all inflight appending finish + for (i = 0; i < journaler->splay_width; i++) { + obj_recorder = &journaler->obj_recorders[i]; + spin_lock(&obj_recorder->lock); + if (obj_recorder->inflight_append) { + spin_unlock(&obj_recorder->lock); + mutex_unlock(&journaler->meta_lock); + return false; + } + spin_unlock(&obj_recorder->lock); + } + + journaler->advancing = true; + + active_set = journaler->active_set + 1; + mutex_unlock(&journaler->meta_lock); + + ret = ceph_cls_journaler_set_active_set(journaler->osdc, + &journaler->header_oid, &journaler->header_oloc, + active_set); + if (ret) { + pr_err("error in set active_set: %d", ret); + } + + queue_work(journaler->task_wq, &journaler->notify_update_work); + + return true; +} + +static void journaler_overflow(struct work_struct *work) +{ + struct ceph_journaler *journaler = container_of(work, + struct ceph_journaler, + overflow_work); + if (advance_object_set(journaler)) { + queue_work(journaler->task_wq, &journaler->flush_work); + } +} + +static void journaler_append_callback(struct ceph_osd_request *osd_req) +{ + struct journaler_append_ctx *ctx = 
osd_req->r_priv; + struct ceph_journaler *journaler = ctx->journaler; + struct ceph_journaler_future *future = &ctx->future; + int ret = osd_req->r_result; + struct object_recorder *obj_recorder = &journaler->obj_recorders[ctx->splay_offset]; + + if (ret) + pr_err("ret of journaler_append_callback: %d", ret); + + __free_page(ctx->req_page); + ceph_osdc_put_request(osd_req); + + if (ret == -EOVERFLOW) { + mutex_lock(&journaler->meta_lock); + journaler->overflowed = true; + mutex_unlock(&journaler->meta_lock); + + spin_lock(&obj_recorder->lock); + list_add_tail(&ctx->node, &obj_recorder->overflow_list); + if (--obj_recorder->inflight_append == 0) + queue_work(journaler->task_wq, &journaler->overflow_work); + spin_unlock(&obj_recorder->lock); + return; + } + + spin_lock(&obj_recorder->lock); + if (--obj_recorder->inflight_append == 0) { + mutex_lock(&journaler->meta_lock); + if (journaler->overflowed) + queue_work(journaler->task_wq, &journaler->overflow_work); + mutex_unlock(&journaler->meta_lock); + } + spin_unlock(&obj_recorder->lock); + + ret = add_commit_entry(journaler, ctx->future.commit_tid, ctx->object_num, + ctx->future.tag_tid, ctx->future.entry_tid); + if (ret) { + pr_err("failed to add_commit_entry: %d", ret); + future_finish(journaler, future, -ENOMEM); + return; + } + + future_safe(journaler, future, ret); +} + +static int append(struct ceph_journaler *journaler, + struct ceph_object_id *oid, + struct ceph_object_locator *oloc, + struct journaler_append_ctx *ctx) + +{ + struct ceph_osd_client *osdc = journaler->osdc; + struct ceph_osd_request *req; + void *p; + int ret; + + req = ceph_osdc_alloc_request(osdc, NULL, 2, false, GFP_NOIO); + if (!req) + return -ENOMEM; + + ceph_oid_copy(&req->r_base_oid, oid); + ceph_oloc_copy(&req->r_base_oloc, oloc); + req->r_flags = CEPH_OSD_FLAG_WRITE; + req->r_callback = journaler_append_callback; + req->r_priv = ctx; + + // guard_append + ctx->req_page = alloc_page(GFP_NOIO); + if (!ctx->req_page) { + ret = -ENOMEM; + goto out_req; + } + p = page_address(ctx->req_page); + ceph_encode_64(&p, 1 << journaler->order); + ret = osd_req_op_cls_init(req, 0, "journal", "guard_append"); + if (ret) + goto out_free_page; + osd_req_op_cls_request_data_pages(req, 0, &ctx->req_page, 8, 0, false, false); + + // append_data + osd_req_op_extent_init(req, 1, CEPH_OSD_OP_APPEND, 0, + ctx->journaler_ctx.prefix_len + ctx->journaler_ctx.bio_len + ctx->journaler_ctx.suffix_len, 0, 0); + + if (ctx->journaler_ctx.prefix_len) + osd_req_op_extent_prefix_pages(req, 1, &ctx->journaler_ctx.prefix_page, + ctx->journaler_ctx.prefix_len, + ctx->journaler_ctx.prefix_offset, + false, false); + if (ctx->journaler_ctx.bio_len) + osd_req_op_extent_osd_data_bio(req, 1, ctx->journaler_ctx.bio_iter, ctx->journaler_ctx.bio_len); + if (ctx->journaler_ctx.suffix_len) + osd_req_op_extent_suffix_pages(req, 1, &ctx->journaler_ctx.suffix_page, + ctx->journaler_ctx.suffix_len, + ctx->journaler_ctx.suffix_offset, + false, false); + ret = ceph_osdc_alloc_messages(req, GFP_NOIO); + if (ret) + goto out_free_page; + + ceph_osdc_start_request(osdc, req, false); + return 0; + +out_free_page: + __free_page(ctx->req_page); +out_req: + ceph_osdc_put_request(req); + return ret; +} + +static int send_append_request(struct ceph_journaler *journaler, + uint64_t object_num, + struct journaler_append_ctx *ctx) +{ + struct ceph_object_id object_oid; + int ret = 0; + + ceph_oid_init(&object_oid); + ret = ceph_oid_aprintf(&object_oid, GFP_NOIO, "%s%llu", + journaler->object_oid_prefix, object_num); + if (ret) 
{
+ pr_err("failed to initialize object id: %d", ret);
+ goto out;
+ }
+
+ ret = append(journaler, &object_oid, &journaler->data_oloc, ctx);
+out:
+ ceph_oid_destroy(&object_oid);
+ return ret;
+}
+
+static void journaler_flush(struct work_struct *work)
+{
+ struct ceph_journaler *journaler = container_of(work,
+ struct ceph_journaler,
+ flush_work);
+ int i = 0;
+ int ret = 0;
+ struct object_recorder *obj_recorder;
+ struct journaler_append_ctx *ctx, *next_ctx;
+ int req_num = 0;
+ LIST_HEAD(tmp);
+
+ if (journaler->overflowed) {
+ return;
+ }
+
+ for (i = 0; i < journaler->splay_width; i++) {
+ req_num = 0;
+ INIT_LIST_HEAD(&tmp);
+ obj_recorder = &journaler->obj_recorders[i];
+
+ spin_lock(&obj_recorder->lock);
+ list_splice_tail_init(&obj_recorder->overflow_list, &tmp);
+ list_splice_tail_init(&obj_recorder->append_list, &tmp);
+ spin_unlock(&obj_recorder->lock);
+
+ list_for_each_entry_safe(ctx, next_ctx, &tmp, node) {
+ ctx->object_num = get_object(journaler, obj_recorder->splay_offset);
+ ret = send_append_request(journaler, ctx->object_num, ctx);
+ if (ret) {
+ pr_err("failed to send append request: %d", ret);
+ list_del(&ctx->node);
+ future_finish(journaler, &ctx->future, ret);
+ continue;
+ }
+ req_num++;
+ }
+
+ spin_lock(&obj_recorder->lock);
+ obj_recorder->inflight_append += req_num;
+ spin_unlock(&obj_recorder->lock);
+ }
+}
+
+static int ceph_journaler_object_append(struct ceph_journaler *journaler,
+ struct journaler_append_ctx *append_ctx)
+{
+ void *buf = NULL;
+ void *end = NULL;
+ int ret = 0;
+ uint32_t crc = 0;
+ struct ceph_journaler_ctx *journaler_ctx = &append_ctx->journaler_ctx;
+ struct ceph_bio_iter *bio_iter = journaler_ctx->bio_iter;
+ struct object_recorder *obj_recorder;
+
+ // PREAMBLE(8) + version(1) + entry_tid(8) + tag_tid(8) + string_len(4) = 29
+ journaler_ctx->prefix_offset -= 29;
+ journaler_ctx->prefix_len += 29;
+ buf = page_address(journaler_ctx->prefix_page) + journaler_ctx->prefix_offset;
+ end = buf + 29;
+ journaler_entry_encode_prefix(&append_ctx->entry, &buf, end);
+
+ // the crc trailer takes 4 bytes; suffix_offset stays at 0.
+ journaler_ctx->suffix_len += 4;
+ buf = page_address(journaler_ctx->suffix_page);
+ end = buf + 4;
+ crc = crc32c(crc, page_address(journaler_ctx->prefix_page) + journaler_ctx->prefix_offset,
+ journaler_ctx->prefix_len);
+ if (journaler_ctx->bio_len)
+ crc = crc_bio(crc, bio_iter->bio);
+ ceph_encode_32(&buf, crc);
+ obj_recorder = &journaler->obj_recorders[append_ctx->splay_offset];
+
+ spin_lock(&obj_recorder->lock);
+ list_add_tail(&append_ctx->node, &obj_recorder->append_list);
+ queue_work(journaler->task_wq, &journaler->flush_work);
+ spin_unlock(&obj_recorder->lock);
+
+ return ret;
+}
+
+static struct journaler_append_ctx *journaler_append_ctx_alloc(void)
+{
+ struct journaler_append_ctx *append_ctx;
+ struct ceph_journaler_ctx *journaler_ctx;
+
+ append_ctx = kmem_cache_zalloc(journaler_append_ctx_cache, GFP_NOIO);
+ if (!append_ctx)
+ return NULL;
+
+ journaler_ctx = &append_ctx->journaler_ctx;
+ journaler_ctx->prefix_page = alloc_page(GFP_NOIO);
+ if (!journaler_ctx->prefix_page)
+ goto free_journaler_ctx;
+
+ journaler_ctx->suffix_page = alloc_page(GFP_NOIO);
+ if (!journaler_ctx->suffix_page)
+ goto free_prefix_page;
+
+ memset(page_address(journaler_ctx->prefix_page), 0, PAGE_SIZE);
+ memset(page_address(journaler_ctx->suffix_page), 0, PAGE_SIZE);
+ INIT_LIST_HEAD(&journaler_ctx->node);
+
+ kref_init(&append_ctx->kref);
+ INIT_LIST_HEAD(&append_ctx->node);
+ return append_ctx;
+
+free_prefix_page:
+ 
__free_page(journaler_ctx->prefix_page);
+free_journaler_ctx:
+ kmem_cache_free(journaler_append_ctx_cache, append_ctx);
+ return NULL;
+}
+
+struct ceph_journaler_ctx *ceph_journaler_ctx_alloc(void)
+{
+ struct journaler_append_ctx *append_ctx;
+
+ append_ctx = journaler_append_ctx_alloc();
+ if (!append_ctx)
+ return NULL;
+
+ return &append_ctx->journaler_ctx;
+}
+EXPORT_SYMBOL(ceph_journaler_ctx_alloc);
+
+static void journaler_append_ctx_release(struct kref *kref)
+{
+ struct journaler_append_ctx *append_ctx;
+ struct ceph_journaler_ctx *journaler_ctx;
+
+ append_ctx = container_of(kref, struct journaler_append_ctx, kref);
+ journaler_ctx = &append_ctx->journaler_ctx;
+
+ __free_page(journaler_ctx->prefix_page);
+ __free_page(journaler_ctx->suffix_page);
+ kmem_cache_free(journaler_append_ctx_cache, append_ctx);
+}
+
+static void journaler_append_ctx_put(struct journaler_append_ctx *append_ctx)
+{
+ if (append_ctx) {
+ kref_put(&append_ctx->kref, journaler_append_ctx_release);
+ }
+}
+
+void ceph_journaler_ctx_put(struct ceph_journaler_ctx *journaler_ctx)
+{
+ struct journaler_append_ctx *append_ctx;
+
+ if (journaler_ctx) {
+ append_ctx = container_of(journaler_ctx,
+ struct journaler_append_ctx,
+ journaler_ctx);
+ journaler_append_ctx_put(append_ctx);
+ }
+}
+EXPORT_SYMBOL(ceph_journaler_ctx_put);
+
+int ceph_journaler_append(struct ceph_journaler *journaler,
+ uint64_t tag_tid,
+ uint64_t *commit_tid,
+ struct ceph_journaler_ctx *journaler_ctx)
+{
+ uint8_t splay_width;
+ uint64_t entry_tid;
+ struct journaler_append_ctx *append_ctx;
+ int ret = 0;
+
+ append_ctx = container_of(journaler_ctx,
+ struct journaler_append_ctx,
+ journaler_ctx);
+
+ append_ctx->journaler = journaler;
+
+ mutex_lock(&journaler->meta_lock);
+ ret = get_new_entry_tid(journaler, tag_tid, &entry_tid);
+ if (ret) {
+ goto unlock;
+ }
+ splay_width = journaler->splay_width;
+ append_ctx->splay_offset = entry_tid % splay_width;
+ append_ctx->object_num = get_object(journaler, append_ctx->splay_offset);
+
+ *commit_tid = allocate_commit_tid(journaler);
+ entry_init(&append_ctx->entry, tag_tid, entry_tid, journaler_ctx);
+ future_init(&append_ctx->future, tag_tid, entry_tid, *commit_tid, journaler_ctx);
+ set_prev_future(journaler, &append_ctx->future);
+ mutex_unlock(&journaler->meta_lock);
+
+ ret = ceph_journaler_object_append(journaler, append_ctx);
+ return ret;
+
+unlock:
+ mutex_unlock(&journaler->meta_lock);
+ return ret;
+}
+EXPORT_SYMBOL(ceph_journaler_append);
+
+static void journaler_client_commit(struct work_struct *work)
+{
+ struct ceph_journaler *journaler = container_of(to_delayed_work(work),
+ struct ceph_journaler,
+ commit_work);
+ int ret = 0;
+
+ mutex_lock(&journaler->commit_lock);
+ copy_pos_list(&journaler->obj_pos_pending,
+ &journaler->obj_pos_committing);
+ mutex_unlock(&journaler->commit_lock);
+ ret = ceph_cls_journaler_client_committed(journaler->osdc,
+ &journaler->header_oid, &journaler->header_oloc,
+ journaler->client, &journaler->obj_pos_committing);
+
+ if (ret) {
+ pr_err("failed to update client committed position: %d", ret);
+ }
+
+ queue_work(journaler->task_wq, &journaler->notify_update_work);
+
+ mutex_lock(&journaler->commit_lock);
+ journaler->commit_scheduled = false;
+ mutex_unlock(&journaler->commit_lock);
+}
+
+// caller must hold journaler->commit_lock
+static void add_object_position(struct commit_entry *entry,
+ struct list_head *object_positions,
+ uint64_t splay_width)
+{
+ struct 
ceph_journaler_object_pos *position = NULL; + uint8_t splay_offset = entry->object_num % splay_width; + bool found = false; + + list_for_each_entry(position, object_positions, node) { + if (position->in_using == false) { + found = true; + break; + } + + if (splay_offset == position->object_num % splay_width) { + found = true; + break; + } + } + + BUG_ON(!found); + if (position->in_using == false) + position->in_using = true; + position->object_num = entry->object_num; + position->tag_tid = entry->tag_tid; + position->entry_tid = entry->entry_tid; + list_move(&position->node, object_positions); +} + +void ceph_journaler_client_committed(struct ceph_journaler *journaler, uint64_t commit_tid) +{ + struct commit_entry *commit_entry = NULL; + bool update_client_commit = true; + struct rb_node *n; + + mutex_lock(&journaler->commit_lock); + for (n = rb_first(&journaler->commit_entries); n; n = rb_next(n)) { + commit_entry = rb_entry(n, struct commit_entry, r_node); + if (commit_entry->commit_tid == commit_tid) { + commit_entry->committed = true; + break; + } + if (commit_entry->committed == false) + update_client_commit = false; + } + + if (update_client_commit) { + for (n = rb_first(&journaler->commit_entries); n;) { + commit_entry = rb_entry(n, struct commit_entry, r_node); + n = rb_next(n); + + if (commit_entry->commit_tid > commit_tid) + break; + add_object_position(commit_entry, + &journaler->obj_pos_pending, + journaler->splay_width); + erase_commit_entry(&journaler->commit_entries, commit_entry); + kmem_cache_free(journaler_commit_entry_cache, commit_entry); + } + } + + if (update_client_commit && !journaler->commit_scheduled) { + queue_delayed_work(journaler->task_wq, &journaler->commit_work, + JOURNALER_COMMIT_INTERVAL); + journaler->commit_scheduled = true; + } + mutex_unlock(&journaler->commit_lock); + +} +EXPORT_SYMBOL(ceph_journaler_client_committed); + +int ceph_journaler_allocate_tag(struct ceph_journaler *journaler, + uint64_t tag_class, void *buf, + uint32_t buf_len, + struct ceph_journaler_tag *tag) +{ + uint64_t tag_tid = 0; + int ret = 0; + +retry: + ret = ceph_cls_journaler_get_next_tag_tid(journaler->osdc, + &journaler->header_oid, + &journaler->header_oloc, + &tag_tid); + if (ret) + goto out; + + ret = ceph_cls_journaler_tag_create(journaler->osdc, + &journaler->header_oid, + &journaler->header_oloc, + tag_tid, tag_class, + buf, buf_len); + if (ret < 0) { + if (ret == -ESTALE) { + goto retry; + } else { + goto out; + } + } + + ret = ceph_cls_journaler_get_tag(journaler->osdc, + &journaler->header_oid, + &journaler->header_oloc, + tag_tid, tag); + if (ret) + goto out; + +out: + return ret; +} +EXPORT_SYMBOL(ceph_journaler_allocate_tag); From patchwork Mon Mar 18 09:15:27 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 10857009 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A201F139A for ; Mon, 18 Mar 2019 09:23:05 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 8830D29249 for ; Mon, 18 Mar 2019 09:23:05 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 7C62C292F6; Mon, 18 Mar 2019 09:23:05 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 
required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B00DD29249 for ; Mon, 18 Mar 2019 09:23:04 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727222AbfCRJXC (ORCPT ); Mon, 18 Mar 2019 05:23:02 -0400
Received: from m97134.mail.qiye.163.com ([220.181.97.134]:50244 "EHLO m97134.mail.qiye.163.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727196AbfCRJXA (ORCPT ); Mon, 18 Mar 2019 05:23:00 -0400
Received: from atest-guest.localdomain (unknown [218.94.118.90]) by smtp5 (Coremail) with SMTP id huCowABHr4m4YY9cO8zEAg--.331S11; Mon, 18 Mar 2019 17:15:41 +0800 (CST)
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 09/16] libceph: journaling: trim object set when we find there is no client referring to it
Date: Mon, 18 Mar 2019 05:15:27 -0400
Message-Id: <1552900534-29026-10-git-send-email-dongsheng.yang@easystack.cn>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
X-CM-TRANSID: huCowABHr4m4YY9cO8zEAg--.331S11
X-Coremail-Antispam: 1Uf129KBjvJXoWxAFykXrW8Aw4UXF1UAr4ktFb_yoW5uFW3pw sxJr1fArW8ZF1fCrs3JanYqFZ0vrW0vrW7GrnIkF9aka1UXrWaqF18JFyqqFy3Jr17G3WD tF48tan8Gw47tFDanT9S1TB71UUUUUJqnTZGkaVYY2UrUUUUjbIjqfuFe4nvWSU5nxnvy2 9KBjDUYxBIdaVFxhVjvjDU0xZFpf9x0JbAoGLUUUUU=
X-Originating-IP: [218.94.118.90]
X-CM-SenderInfo: 5grqw2pkhqwhp1dqwq5hdv52pwdfyhdfq/1tbiih17eltVf9eI7AAAsU
Sender: ceph-devel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: ceph-devel@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

When we find there is no client referring to the object set, we can remove its objects.

Signed-off-by: Dongsheng Yang
---
 net/ceph/journaler.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/net/ceph/journaler.c b/net/ceph/journaler.c index 05c94a2..b4291f8 100644 --- a/net/ceph/journaler.c +++ b/net/ceph/journaler.c @@ -229,6 +229,9 @@ static int refresh(struct ceph_journaler *journaler, bool init) list_splice_tail_init(&tmp_clients, &journaler->clients_cache); // calculate the minimum_commit_set.
+ // TODO: unregister clients whose commit position falls too far behind
+ // the active position, similar to rbd_journal_max_concurrent_object_sets
+ // in the userspace journal.
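+ // (Without such a limit, one dead client that never advances its commit
+ // position pins minimum_set forever and the journal grows unbounded.)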
minimum_commit_set = journaler->active_set; list_for_each_entry(client, &journaler->clients, node) { struct ceph_journaler_object_pos *pos; @@ -1902,3 +1905,88 @@ int ceph_journaler_allocate_tag(struct ceph_journaler *journaler, return ret; } EXPORT_SYMBOL(ceph_journaler_allocate_tag); + +// trimming +static int ceph_journaler_obj_remove_sync(struct ceph_journaler *journaler, + struct ceph_object_id *oid, + struct ceph_object_locator *oloc) + +{ + struct ceph_osd_client *osdc = journaler->osdc; + struct ceph_osd_request *req; + int ret; + + req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_NOIO); + if (!req) + return -ENOMEM; + + ceph_oid_copy(&req->r_base_oid, oid); + ceph_oloc_copy(&req->r_base_oloc, oloc); + req->r_flags = CEPH_OSD_FLAG_WRITE; + + osd_req_op_init(req, 0, CEPH_OSD_OP_DELETE, 0); + + ret = ceph_osdc_alloc_messages(req, GFP_NOIO); + if (ret) + goto out_req; + + ceph_osdc_start_request(osdc, req, false); + ret = ceph_osdc_wait_request(osdc, req); + +out_req: + ceph_osdc_put_request(req); + return ret; +} + +static int remove_set(struct ceph_journaler *journaler, uint64_t object_set) +{ + uint64_t object_num = 0; + int splay_offset = 0; + struct ceph_object_id object_oid; + int ret = 0; + + ceph_oid_init(&object_oid); + for (splay_offset = 0; splay_offset < journaler->splay_width; splay_offset++) { + object_num = splay_offset + (object_set * journaler->splay_width); + if (!ceph_oid_empty(&object_oid)) { + ceph_oid_destroy(&object_oid); + ceph_oid_init(&object_oid); + } + ret = ceph_oid_aprintf(&object_oid, GFP_NOIO, "%s%llu", + journaler->object_oid_prefix, object_num); + if (ret) { + pr_err("aprintf error : %d", ret); + goto out; + } + ret = ceph_journaler_obj_remove_sync(journaler, &object_oid, + &journaler->data_oloc); + if (ret < 0 && ret != -ENOENT) { + pr_err("%s: failed to remove object: %llu", + __func__, object_num); + goto out; + } + } + ret = 0; +out: + ceph_oid_destroy(&object_oid); + return ret; +} + +static int set_minimum_set(struct ceph_journaler* journaler, + uint64_t minimum_set) +{ + int ret = 0; + + ret = ceph_cls_journaler_set_minimum_set(journaler->osdc, + &journaler->header_oid, + &journaler->header_oloc, + minimum_set); + if (ret < 0) { + pr_err("%s: failed to set_minimum_set: %d", __func__, ret); + return ret; + } + + queue_work(journaler->task_wq, &journaler->notify_update_work); + + return ret; +} From patchwork Mon Mar 18 09:15:28 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 10857007 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 13CA96C2 for ; Mon, 18 Mar 2019 09:23:03 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id EEA442922F for ; Mon, 18 Mar 2019 09:23:02 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id E290A29296; Mon, 18 Mar 2019 09:23:02 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B2F662922F for ; Mon, 18 Mar 2019 09:23:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by 
From patchwork Mon Mar 18 09:15:28 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857007
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 10/16] rbd: wakeup requests when we get -EBLACKLISTED in lock acquiring
Date: Mon, 18 Mar 2019 05:15:28 -0400
Message-Id: <1552900534-29026-11-git-send-email-dongsheng.yang@easystack.cn>
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>

When we get -EBLACKLISTED, we need to wake up the waiters; otherwise
they will wait forever.

Signed-off-by: Dongsheng Yang
---
 drivers/block/rbd.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 57816c2..4c5f36e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3031,14 +3031,19 @@ static void rbd_acquire_lock(struct work_struct *work)
 	dout("%s rbd_dev %p\n", __func__, rbd_dev);
 again:
 	lock_state = rbd_try_acquire_lock(rbd_dev, &ret);
-	if (lock_state != RBD_LOCK_STATE_UNLOCKED || ret == -EBLACKLISTED) {
-		if (lock_state == RBD_LOCK_STATE_LOCKED)
-			wake_requests(rbd_dev, true);
+	if (lock_state == RBD_LOCK_STATE_LOCKED) {
+		wake_requests(rbd_dev, true);
 		dout("%s rbd_dev %p lock_state %d ret %d - done\n", __func__,
 		     rbd_dev, lock_state, ret);
 		return;
 	}
+
+	if (ret == -EBLACKLISTED) {
+		set_bit(RBD_DEV_FLAG_BLACKLISTED, &rbd_dev->flags);
+		wake_requests(rbd_dev, true);
+		return;
+	}
+
 	ret = rbd_request_lock(rbd_dev);
 	if (ret == -ETIMEDOUT) {
 		goto again; /* treat this as a dead client */
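The fix follows a common kernel pattern: publish the fatal condition in a
flag bit before waking the waiters, so every woken waiter can observe the
condition and fail fast instead of going back to sleep. A generic sketch of
that pattern (hypothetical demo types and flag, not the rbd code):

    #include <linux/bitops.h>
    #include <linux/wait.h>

    #define DEMO_FLAG_DEAD	0	/* hypothetical 'fatal error' bit */

    struct demo_dev {
    	unsigned long flags;
    	wait_queue_head_t waitq;
    };

    static void demo_dev_init(struct demo_dev *dev)
    {
    	dev->flags = 0;
    	init_waitqueue_head(&dev->waitq);
    }

    /* Called when the client learns it has been blacklisted. */
    static void demo_mark_dead(struct demo_dev *dev)
    {
    	/* Publish the state first ... */
    	set_bit(DEMO_FLAG_DEAD, &dev->flags);
    	/* ... then wake everyone, so no waiter sleeps past the event. */
    	wake_up_all(&dev->waitq);
    }

    /* Waiters re-check the flag after every wakeup. */
    static int demo_wait(struct demo_dev *dev)
    {
    	int ret = wait_event_interruptible(dev->waitq,
    			test_bit(DEMO_FLAG_DEAD, &dev->flags));

    	return ret ? ret : -ESHUTDOWN;	/* stand-in for -EBLACKLISTED */
    }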
From patchwork Mon Mar 18 09:15:29 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857023
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 11/16] rbd: wait for all img_requests to complete in lock releasing
Date: Mon, 18 Mar 2019 05:15:29 -0400
Message-Id: <1552900534-29026-12-git-send-email-dongsheng.yang@easystack.cn>
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>

Currently we only sync the osdc requests when releasing the lock. But
if we are going to support journaling, we need to wait for all
img_requests to complete, not only the low-level requests in the
osd_client.
Signed-off-by: Dongsheng Yang
---
 drivers/block/rbd.c | 39 ++++++++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4c5f36e..a583c2e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -386,6 +386,9 @@ struct rbd_device {

 	struct list_head node;

+	atomic_t inflight_ios;
+	struct completion inflight_wait;
+
 	/* sysfs related */
 	struct device dev;
 	unsigned long open_count;	/* protected by lock */
@@ -1654,6 +1657,7 @@ static struct rbd_img_request *rbd_img_request_create(
 	spin_lock_init(&img_request->completion_lock);
 	INIT_LIST_HEAD(&img_request->object_extents);
 	kref_init(&img_request->kref);
+	atomic_inc(&rbd_dev->inflight_ios);

 	dout("%s: rbd_dev %p %s -> img %p\n", __func__, rbd_dev,
 	     obj_op_name(op_type), img_request);
@@ -1670,6 +1674,8 @@ static void rbd_img_request_destroy(struct kref *kref)

 	dout("%s: img %p\n", __func__, img_request);

+	atomic_dec(&img_request->rbd_dev->inflight_ios);
+	complete_all(&img_request->rbd_dev->inflight_wait);
 	for_each_obj_request_safe(img_request, obj_request, next_obj_request)
 		rbd_img_obj_request_del(img_request, obj_request);
 	rbd_assert(img_request->obj_request_count == 0);
@@ -3075,26 +3081,39 @@ static void rbd_acquire_lock(struct work_struct *work)
 	}
 }

+static int rbd_inflight_wait(struct rbd_device *rbd_dev)
+{
+	int ret = 0;
+
+	while (atomic_read(&rbd_dev->inflight_ios)) {
+		ret = wait_for_completion_interruptible(&rbd_dev->inflight_wait);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
 /*
  * lock_rwsem must be held for write
  */
 static bool rbd_release_lock(struct rbd_device *rbd_dev)
 {
+	int ret = 0;
+
 	dout("%s rbd_dev %p read lock_state %d\n", __func__, rbd_dev,
 	     rbd_dev->lock_state);
 	if (rbd_dev->lock_state != RBD_LOCK_STATE_LOCKED)
 		return false;

 	rbd_dev->lock_state = RBD_LOCK_STATE_RELEASING;
-	downgrade_write(&rbd_dev->lock_rwsem);
-	/*
-	 * Ensure that all in-flight IO is flushed.
-	 *
-	 * FIXME: ceph_osdc_sync() flushes the entire OSD client, which
-	 * may be shared with other devices.
-	 */
-	ceph_osdc_sync(&rbd_dev->rbd_client->client->osdc);
-	up_read(&rbd_dev->lock_rwsem);
+	up_write(&rbd_dev->lock_rwsem);
+
+	ret = rbd_inflight_wait(rbd_dev);
+	if (ret) {
+		down_write(&rbd_dev->lock_rwsem);
+		return false;
+	}

 	down_write(&rbd_dev->lock_rwsem);
 	dout("%s rbd_dev %p write lock_state %d\n", __func__, rbd_dev,
@@ -4392,6 +4411,8 @@ static struct rbd_device *__rbd_dev_create(struct rbd_client *rbdc,
 	INIT_LIST_HEAD(&rbd_dev->node);
 	init_rwsem(&rbd_dev->header_rwsem);

+	atomic_set(&rbd_dev->inflight_ios, 0);
+	init_completion(&rbd_dev->inflight_wait);
 	rbd_dev->header.data_pool_id = CEPH_NOPOOL;
 	ceph_oid_init(&rbd_dev->header_oid);
 	rbd_dev->header_oloc.pool = spec->pool_id;

From patchwork Mon Mar 18 09:15:30 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857013
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 12/16] rbd: introduce completion for each img_request
Date: Mon, 18 Mar 2019 05:15:30 -0400
Message-Id: <1552900534-29026-13-git-send-email-dongsheng.yang@easystack.cn>
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>

When we are going to do sync IO, we need a way to wait for an
img_request to complete. For example, when we do journal replay, we
need to replay synchronously and return only after the img_request has
completed.
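The drain above pairs an atomic in-flight counter with a completion: every
img_request increments the counter at creation and signals the completion
at destruction, so the releaser can sleep until the counter hits zero. One
caveat worth noting: complete_all() leaves the completion permanently
signalled, so once any request has finished the loop effectively polls; a
real implementation might pair the re-check with reinit_completion(). A
self-contained sketch of the pattern (hypothetical demo type, not the rbd
structures):

    #include <linux/atomic.h>
    #include <linux/completion.h>

    struct demo_queue {
    	atomic_t inflight;		/* requests not yet destroyed */
    	struct completion drained;	/* signalled on every destruction */
    };

    static void demo_queue_init(struct demo_queue *q)
    {
    	atomic_set(&q->inflight, 0);
    	init_completion(&q->drained);
    }

    static void demo_request_start(struct demo_queue *q)
    {
    	atomic_inc(&q->inflight);
    }

    static void demo_request_end(struct demo_queue *q)
    {
    	atomic_dec(&q->inflight);
    	complete_all(&q->drained);	/* wake waiters to re-check */
    }

    /* Returns 0 once all in-flight requests are gone, or -ERESTARTSYS. */
    static int demo_drain(struct demo_queue *q)
    {
    	int ret = 0;

    	while (atomic_read(&q->inflight)) {
    		ret = wait_for_completion_interruptible(&q->drained);
    		if (ret)
    			break;
    	}
    	return ret;
    }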
Signed-off-by: Dongsheng Yang
---
 drivers/block/rbd.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index a583c2e..cc0642c 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -294,6 +294,8 @@ struct rbd_img_request {
 	u32 obj_request_count;
 	u32 pending_count;

+	struct completion completion;
+
 	struct kref kref;
 };
@@ -1654,6 +1656,7 @@ static struct rbd_img_request *rbd_img_request_create(
 	if (rbd_dev_parent_get(rbd_dev))
 		img_request_layered_set(img_request);

+	init_completion(&img_request->completion);
 	spin_lock_init(&img_request->completion_lock);
 	INIT_LIST_HEAD(&img_request->object_extents);
 	kref_init(&img_request->kref);
@@ -2598,6 +2601,7 @@ static void rbd_img_end_request(struct rbd_img_request *img_req)
 	blk_mq_end_request(img_req->rq,
 			   errno_to_blk_status(img_req->result));

+	complete_all(&img_req->completion);
 	rbd_img_request_put(img_req);
 }
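With the completion in place, a caller that needs synchronous semantics can
submit and then block until rbd_img_end_request() fires complete_all().
A minimal usage sketch on top of the helpers from this series (a hedged
illustration, not code from the patch):

    /* Submit an img_request and wait for it to finish. */
    static int demo_submit_and_wait(struct rbd_img_request *img_req)
    {
    	int ret;

    	rbd_img_request_submit(img_req);	/* fire the I/O */
    	ret = wait_for_completion_interruptible(&img_req->completion);
    	if (ret)
    		return ret;			/* interrupted by signal */

    	return img_req->result;			/* final I/O status */
    }

This is exactly the shape the replay handlers in patch 15 rely on.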
From patchwork Mon Mar 18 09:15:31 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857027
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 13/16] rbd: introduce journal in rbd_device
Date: Mon, 18 Mar 2019 05:15:31 -0400
Message-Id: <1552900534-29026-14-git-send-email-dongsheng.yang@easystack.cn>
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>

This commit introduces rbd_journal into rbd_device. With the journaling
feature enabled, we open the journal after the exclusive lock is
acquired, and close it before the exclusive lock is released.

Signed-off-by: Dongsheng Yang
---
 drivers/block/rbd.c | 237 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index cc0642c..bd90c17 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -28,16 +28,19 @@
  */

+#include
 #include
 #include
 #include
 #include
 #include
+#include
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -378,6 +381,8 @@ struct rbd_device {
 	atomic_t parent_ref;
 	struct rbd_device *parent;

+	struct rbd_journal *journal;
+
 	/* Block layer tags. */
 	struct blk_mq_tag_set tag_set;
@@ -408,6 +413,22 @@ enum rbd_dev_flags {
 	RBD_DEV_FLAG_BLACKLISTED, /* our ceph_client is blacklisted */
 };

+#define LOCAL_MIRROR_UUID ""
+#define LOCAL_CLIENT_ID ""
+
+enum rbd_journal_state {
+	RBD_JOURNAL_STATE_INITIALIZED,
+	RBD_JOURNAL_STATE_OPENED,
+	RBD_JOURNAL_STATE_CLOSED,
+};
+
+struct rbd_journal {
+	struct ceph_journaler *journaler;
+	uint64_t tag_tid;
+	/* state is protected by rbd_dev->lock_rwsem */
+	enum rbd_journal_state state;
+};
+
 static DEFINE_MUTEX(client_mutex);	/* Serialize client creation */

 static LIST_HEAD(rbd_dev_list);		/* devices */
@@ -2681,6 +2702,7 @@ static void __rbd_lock(struct rbd_device *rbd_dev, const char *cookie)
 /*
  * lock_rwsem must be held for write
  */
+static int rbd_dev_open_journal(struct rbd_device *rbd_dev);
 static int rbd_lock(struct rbd_device *rbd_dev)
 {
 	struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
@@ -2697,6 +2719,15 @@ static int rbd_lock(struct rbd_device *rbd_dev)
 	if (ret)
 		return ret;

+	if (rbd_dev->header.features & RBD_FEATURE_JOURNALING) {
+		ret = rbd_dev_open_journal(rbd_dev);
+		if (ret) {
+			rbd_warn(rbd_dev, "open journal failed: %d", ret);
+			set_disk_ro(rbd_dev->disk, true);
+			set_bit(RBD_DEV_FLAG_BLACKLISTED, &rbd_dev->flags);
+			return -EBLACKLISTED;
+		}
+	}
 	rbd_dev->lock_state = RBD_LOCK_STATE_LOCKED;
 	__rbd_lock(rbd_dev, cookie);
 	return 0;
@@ -2705,6 +2736,7 @@ static int rbd_lock(struct rbd_device *rbd_dev)
 /*
  * lock_rwsem must be held for write
  */
+static void rbd_journal_close(struct rbd_journal *journal);
 static void rbd_unlock(struct rbd_device *rbd_dev)
 {
 	struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
@@ -2713,6 +2745,9 @@ static void rbd_unlock(struct rbd_device *rbd_dev)
 	WARN_ON(!__rbd_is_lock_owner(rbd_dev) ||
 		rbd_dev->lock_cookie[0] == '\0');

+	if (rbd_dev->journal)
+		rbd_journal_close(rbd_dev->journal);
+
 	ret = ceph_cls_unlock(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc,
 			      RBD_LOCK_NAME, rbd_dev->lock_cookie);
 	if (ret && ret != -ENOENT)
@@ -5750,6 +5785,207 @@ static int rbd_dev_header_name(struct rbd_device *rbd_dev)
 	return ret;
 }

+static int rbd_journal_allocate_tag(struct rbd_journal *journal);
+static int rbd_journal_open(struct rbd_journal *journal)
+{
+	struct ceph_journaler *journaler = journal->journaler;
+	int ret = 0;
+
+	ret = ceph_journaler_open(journaler);
+	if (ret)
+		goto out;
+
+	ret = ceph_journaler_start_replay(journaler);
+	if (ret)
+		goto err_close_journaler;
+
+	ret = rbd_journal_allocate_tag(journal);
+	if (ret)
+		goto err_close_journaler;
+
+	journal->state = RBD_JOURNAL_STATE_OPENED;
+	return ret;
+
+err_close_journaler:
+	ceph_journaler_close(journaler);
+out:
+	return ret;
+}
+
+static int rbd_dev_open_journal(struct rbd_device *rbd_dev)
+{
+	int ret = 0;
+	struct rbd_journal *journal = NULL;
+	struct ceph_journaler *journaler = NULL;
+	struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
+
+	if (rbd_dev->journal &&
+	    rbd_dev->journal->state == RBD_JOURNAL_STATE_OPENED)
+		return 0;
+
+	// create journal
+	if (!rbd_dev->journal) {
+		journal = kzalloc(sizeof(struct rbd_journal), GFP_KERNEL);
+		if (!journal)
+			return -ENOMEM;
+
+		journal->state = RBD_JOURNAL_STATE_INITIALIZED;
+		journaler = ceph_journaler_create(osdc, &rbd_dev->header_oloc,
+						  rbd_dev->spec->image_id,
+						  LOCAL_CLIENT_ID);
+		if (!journaler) {
+			ret = -ENOMEM;
+			goto err_free_journal;
+		}
+
+		journaler->entry_handler = rbd_dev;
+		journaler->handle_entry = rbd_journal_replay;
+
+		journal->journaler = journaler;
+		rbd_dev->journal = journal;
+	}
+
+	// open journal
+	ret = rbd_journal_open(rbd_dev->journal);
+	if (ret)
+		goto err_destroy_journaler;
+
+	return ret;
+
+err_destroy_journaler:
+	ceph_journaler_destroy(journaler);
+err_free_journal:
+	kfree(rbd_dev->journal);
+	rbd_dev->journal = NULL;
+	return ret;
+}
+
+static void rbd_journal_close(struct rbd_journal *journal)
+{
+	if (journal->state == RBD_JOURNAL_STATE_CLOSED)
+		return;
+
+	ceph_journaler_close(journal->journaler);
+	journal->tag_tid = 0;
+	journal->state = RBD_JOURNAL_STATE_CLOSED;
+}
+
+static void rbd_dev_close_journal(struct rbd_device *rbd_dev)
+{
+	struct ceph_journaler *journaler = NULL;
+
+	if (!rbd_dev->journal)
+		return;
+
+	rbd_journal_close(rbd_dev->journal);
+
+	journaler = rbd_dev->journal->journaler;
+	ceph_journaler_destroy(journaler);
+	kfree(rbd_dev->journal);
+	rbd_dev->journal = NULL;
+}
+
+typedef struct rbd_journal_tag_predecessor {
+	bool commit_valid;
+	uint64_t tag_tid;
+	uint64_t entry_tid;
+	uint32_t uuid_len;
+	char *mirror_uuid;
+} rbd_journal_tag_predecessor;
+
+typedef struct rbd_journal_tag_data {
+	struct rbd_journal_tag_predecessor predecessor;
+	uint32_t uuid_len;
+	char *mirror_uuid;
+} rbd_journal_tag_data;
+
+static uint32_t tag_data_encoding_size(struct rbd_journal_tag_data *tag_data)
+{
+	// 4 (sizeof uuid_len) + uuid_len + 4 (sizeof predecessor uuid_len) +
+	// predecessor uuid_len + 1 (commit_valid) + 8 (tag_tid) + 8 (entry_tid)
+	return (4 + tag_data->uuid_len + 1 + 8 + 8 + 4 +
+		tag_data->predecessor.uuid_len);
+}
+
+static void predecessor_encode(void **p, void *end,
+			       struct rbd_journal_tag_predecessor *predecessor)
+{
+	ceph_encode_string(p, end, predecessor->mirror_uuid,
+			   predecessor->uuid_len);
+	ceph_encode_8(p, predecessor->commit_valid);
+	ceph_encode_64(p, predecessor->tag_tid);
+	ceph_encode_64(p, predecessor->entry_tid);
+}
+
+static int rbd_journal_encode_tag_data(void **p, void *end,
+				       struct rbd_journal_tag_data *tag_data)
+{
+	struct rbd_journal_tag_predecessor *predecessor = &tag_data->predecessor;
+
+	ceph_encode_string(p, end, tag_data->mirror_uuid, tag_data->uuid_len);
+	predecessor_encode(p, end, predecessor);
+
+	return 0;
+}
+
+static int rbd_journal_allocate_tag(struct rbd_journal *journal)
+{
+	struct ceph_journaler_tag tag = {};
+	struct rbd_journal_tag_data tag_data = {};
+	struct ceph_journaler *journaler = journal->journaler;
+	struct ceph_journaler_client *client;
+	struct rbd_journal_tag_predecessor *predecessor;
+	struct ceph_journaler_object_pos *position;
+	void *orig_buf = NULL, *buf = NULL, *p = NULL, *end = NULL;
+	uint32_t buf_len;
+	int ret = 0;
+
+	ret = ceph_journaler_get_cached_client(journaler, LOCAL_CLIENT_ID,
+					       &client);
+	if (ret)
+		goto out;
+
+	predecessor = &tag_data.predecessor;
+	position = list_first_entry(&client->object_positions,
+				    struct ceph_journaler_object_pos, node);
+
+	predecessor->commit_valid = true;
+	predecessor->tag_tid = position->tag_tid;
+	predecessor->entry_tid = position->entry_tid;
+	predecessor->uuid_len = 0;
+	predecessor->mirror_uuid = LOCAL_MIRROR_UUID;
+
+	tag_data.uuid_len = 0;
+	tag_data.mirror_uuid = LOCAL_MIRROR_UUID;
+
+	buf_len = tag_data_encoding_size(&tag_data);
+
+	p = kmalloc(buf_len, GFP_KERNEL);
+	if (!p) {
+		pr_err("failed to allocate tag data");
+		return -ENOMEM;
+	}
+
+	end = p + buf_len;
+	orig_buf = buf = p;
+	ret = rbd_journal_encode_tag_data(&p, end, &tag_data);
+	if (ret) {
+		pr_err("failed to encode tag data");
+		goto free_buf;
+	}
+
+	ret = ceph_journaler_allocate_tag(journaler, 0, buf, buf_len, &tag);
+	if (ret)
+		goto free_data;
+
+	journal->tag_tid = tag.tid;
+free_data:
+	kfree(tag.data);
+free_buf:
+	kfree(orig_buf);
+out:
+	return ret;
+}
+
 static void rbd_dev_image_release(struct rbd_device *rbd_dev)
 {
 	rbd_dev_unprobe(rbd_dev);
@@ -6074,6 +6310,7 @@ static ssize_t do_rbd_remove(struct bus_type *bus,
 	device_del(&rbd_dev->dev);
 	rbd_dev_image_unlock(rbd_dev);
+	rbd_dev_close_journal(rbd_dev);
 	rbd_dev_device_release(rbd_dev);
 	rbd_dev_image_release(rbd_dev);
 	rbd_dev_destroy(rbd_dev);
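The encoded tag layout above is: a u32 mirror-uuid length plus the uuid
bytes, then the predecessor as a u32 uuid length plus uuid bytes, a u8
commit_valid, a u64 tag_tid and a u64 entry_tid. For the local client both
UUIDs are empty strings, so the buffer is 4 + 0 + 4 + 0 + 1 + 8 + 8 = 25
bytes. A tiny user-space check of that arithmetic (hypothetical struct
mirroring the sizes, not the kernel types):

    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical mirror of rbd_journal_tag_data, lengths only. */
    struct tag_sizes {
    	uint32_t uuid_len;		/* tag's mirror uuid */
    	uint32_t predecessor_uuid_len;	/* predecessor's mirror uuid */
    };

    static uint32_t tag_encoding_size(const struct tag_sizes *t)
    {
    	/* u32 len + uuid, u32 len + uuid, u8 commit_valid,
    	 * u64 tag_tid, u64 entry_tid */
    	return 4 + t->uuid_len + 4 + t->predecessor_uuid_len + 1 + 8 + 8;
    }

    int main(void)
    {
    	struct tag_sizes local = { 0, 0 };	/* LOCAL_MIRROR_UUID is "" */

    	assert(tag_encoding_size(&local) == 25);
    	return 0;
    }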
From patchwork Mon Mar 18 09:15:32 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857029
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 14/16] rbd: append journal first before sending img_request
Date: Mon, 18 Mar 2019 05:15:32 -0400
Message-Id: <1552900534-29026-15-git-send-email-dongsheng.yang@easystack.cn>
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>

With the journaling feature enabled, we need to append an event to the
journal before sending the img_request.

Signed-off-by: Dongsheng Yang
---
 drivers/block/rbd.c | 204 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 196 insertions(+), 8 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index bd90c17..5b641f8 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -298,6 +298,7 @@ struct rbd_img_request {
 	u32 pending_count;

 	struct completion completion;
+	uint64_t journaler_commit_tid;

 	struct kref kref;
 };
@@ -441,6 +442,7 @@ struct rbd_journal {
 static struct kmem_cache *rbd_img_request_cache;
 static struct kmem_cache *rbd_obj_request_cache;
+static struct kmem_cache *rbd_journal_ctx_cache;

 static int rbd_major;
 static DEFINE_IDA(rbd_dev_id_ida);
@@ -2616,12 +2618,20 @@ static void rbd_img_end_child_request(struct rbd_img_request *img_req)
 static void rbd_img_end_request(struct rbd_img_request *img_req)
 {
 	rbd_assert(!test_bit(IMG_REQ_CHILD, &img_req->flags));
-	rbd_assert((!img_req->result &&
-		    img_req->xferred == blk_rq_bytes(img_req->rq)) ||
-		   (img_req->result < 0 && !img_req->xferred));

-	blk_mq_end_request(img_req->rq,
-			   errno_to_blk_status(img_req->result));
+	if (img_req->rq) {
+		rbd_assert((!img_req->result &&
+			    img_req->xferred == blk_rq_bytes(img_req->rq)) ||
+			   (img_req->result < 0 && !img_req->xferred));
+
+		blk_mq_end_request(img_req->rq,
+				   errno_to_blk_status(img_req->result));
+	}
+
+	if (img_req->journaler_commit_tid) {
+		ceph_journaler_client_committed(img_req->rbd_dev->journal->journaler,
+						img_req->journaler_commit_tid);
+	}

 	complete_all(&img_req->completion);
 	rbd_img_request_put(img_req);
 }
@@ -3689,6 +3699,21 @@ static int rbd_wait_state_locked(struct rbd_device *rbd_dev, bool may_acquire)
 	return ret;
 }

+struct rbd_journal_ctx {
+	struct rbd_device *rbd_dev;
+	struct rbd_img_request *img_request;
+	struct request *rq;
+	struct ceph_snap_context *snapc;
+	int result;
+	bool must_be_locked;
+
+	struct ceph_bio_iter bio_iter;
+};
+
+static int rbd_journal_append(struct rbd_device *rbd_dev, struct bio *bio,
+			      u64 offset, u64 length,
+			      enum obj_operation_type op_type,
+			      struct rbd_journal_ctx *ctx);
+
 static void rbd_queue_workfn(struct work_struct *work)
 {
 	struct request *rq = blk_mq_rq_from_pdu(work);
@@ -3794,7 +3819,29 @@ static void rbd_queue_workfn(struct work_struct *work)
 	if (result)
 		goto err_img_request;

-	rbd_img_request_submit(img_request);
+	if (!(rbd_dev->header.features & RBD_FEATURE_JOURNALING) ||
+	    (op_type == OBJ_OP_READ)) {
+		rbd_img_request_submit(img_request);
+	} else {
+		struct rbd_journal_ctx *ctx =
+			kmem_cache_zalloc(rbd_journal_ctx_cache, GFP_NOIO);
+
+		if (!ctx) {
+			result = -ENOMEM;
+			goto err_unlock;
+		}
+
+		ctx->img_request = img_request;
+		ctx->rq = rq;
+		ctx->snapc = snapc;
+		ctx->must_be_locked = must_be_locked;
+		ctx->rbd_dev = rbd_dev;
+		result = rbd_journal_append(rbd_dev, rq->bio, offset, length,
+					    op_type, ctx);
+		if (result) {
+			rbd_warn(rbd_dev, "error in rbd_journal_append");
+			goto err_unlock;
+		}
+	}
+
 	if (must_be_locked)
 		up_read(&rbd_dev->lock_rwsem);
 	return;
@@ -5996,6 +6043,140 @@ static void rbd_dev_image_release(struct rbd_device *rbd_dev)
 	rbd_dev->spec->image_id = NULL;
 }

+static void rbd_journal_callback(struct ceph_journaler_ctx *journaler_ctx)
+{
+	struct rbd_journal_ctx *ctx = journaler_ctx->priv;
+	int result = journaler_ctx->result;
+	struct rbd_device *rbd_dev = ctx->rbd_dev;
+	bool must_be_locked = ctx->must_be_locked;
+
+	if (result)
+		goto err_rq;
+
+	if (must_be_locked)
+		down_read(&rbd_dev->lock_rwsem);
+
+	rbd_img_request_submit(ctx->img_request);
+
+	if (must_be_locked)
+		up_read(&rbd_dev->lock_rwsem);
+
+	goto out;
+
+err_rq:
+	ceph_put_snap_context(ctx->snapc);
+	blk_mq_end_request(ctx->rq, errno_to_blk_status(result));
+	rbd_img_request_put(ctx->img_request);
+out:
+	kmem_cache_free(rbd_journal_ctx_cache, ctx);
+	ceph_journaler_ctx_put(journaler_ctx);
+}
+
+static int rbd_journal_append_write_event(struct rbd_device *rbd_dev,
+					  struct bio *bio, u64 offset,
+					  u64 length,
+					  struct rbd_journal_ctx *ctx)
+{
+	void *p = NULL;
+	struct ceph_journaler_ctx *journaler_ctx;
+	int ret = 0;
+
+	journaler_ctx = ceph_journaler_ctx_alloc();
+	if (!journaler_ctx)
+		return -ENOMEM;
+
+	ctx->bio_iter.bio = bio;
+	ctx->bio_iter.iter = bio->bi_iter;
+
+	journaler_ctx->bio_iter = &ctx->bio_iter;
+	journaler_ctx->bio_len = length;
+
+	// EVENT_FIXED_SIZE (10 = CEPH_ENCODING_START_BLK_LEN(6) + EVENT_TYPE(4)) +
+	// offset(8) + length(8) + string_len(4) = 30
+	journaler_ctx->prefix_len = 30;
+	journaler_ctx->prefix_offset = PAGE_SIZE - journaler_ctx->prefix_len;
+
+	p = page_address(journaler_ctx->prefix_page) +
+	    journaler_ctx->prefix_offset;
+
+	ceph_start_encoding(&p, 1, 1,
+			    journaler_ctx->prefix_len +
+			    journaler_ctx->bio_len - 6);
+	ceph_encode_32(&p, EVENT_TYPE_AIO_WRITE);
+	ceph_encode_64(&p, offset);
+	ceph_encode_64(&p, length);
+	// first part of ceph_encode_string(); the bio data is the string body
+	ceph_encode_32(&p, journaler_ctx->bio_len);
+
+	journaler_ctx->priv = ctx;
+	journaler_ctx->callback = rbd_journal_callback;
+
+	ret = ceph_journaler_append(rbd_dev->journal->journaler,
+				    rbd_dev->journal->tag_tid,
+				    &ctx->img_request->journaler_commit_tid,
+				    journaler_ctx);
+	if (ret) {
+		ceph_journaler_ctx_put(journaler_ctx);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int rbd_journal_append_discard_event(struct rbd_device *rbd_dev,
+					    struct bio *bio, u64 offset,
+					    u64 length,
+					    struct rbd_journal_ctx *ctx)
+{
+	void *p = NULL;
+	struct ceph_journaler_ctx *journaler_ctx;
+	int ret = 0;
+
+	journaler_ctx = ceph_journaler_ctx_alloc();
+	if (!journaler_ctx)
+		return -ENOMEM;
+
+	ctx->bio_iter.bio = bio;
+	ctx->bio_iter.iter = bio->bi_iter;
+
+	journaler_ctx->bio_iter = &ctx->bio_iter;
+	journaler_ctx->bio_len = 0;
+
+	// EVENT_FIXED_SIZE (10 = CEPH_ENCODING_START_BLK_LEN(6) + EVENT_TYPE(4)) +
+	// offset(8) + length(8) = 26
+	journaler_ctx->prefix_len = 26;
+	journaler_ctx->prefix_offset = PAGE_SIZE - journaler_ctx->prefix_len;
+
+	p = page_address(journaler_ctx->prefix_page) +
+	    journaler_ctx->prefix_offset;
+
+	ceph_start_encoding(&p, 1, 1,
+			    journaler_ctx->prefix_len +
+			    journaler_ctx->bio_len - 6);
+	ceph_encode_32(&p, EVENT_TYPE_AIO_DISCARD);
+	ceph_encode_64(&p, offset);
+	ceph_encode_64(&p, length);
+
+	journaler_ctx->priv = ctx;
+	journaler_ctx->callback = rbd_journal_callback;
+
+	ret = ceph_journaler_append(rbd_dev->journal->journaler,
+				    rbd_dev->journal->tag_tid,
+				    &ctx->img_request->journaler_commit_tid,
+				    journaler_ctx);
+	if (ret) {
+		ceph_journaler_ctx_put(journaler_ctx);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int rbd_journal_append(struct rbd_device *rbd_dev, struct bio *bio,
+			      u64 offset, u64 length,
+			      enum obj_operation_type op_type,
+			      struct rbd_journal_ctx *ctx)
+{
+	switch (op_type) {
+	case OBJ_OP_WRITE:
+		return rbd_journal_append_write_event(rbd_dev, bio, offset,
+						      length, ctx);
+	case OBJ_OP_DISCARD:
+		return rbd_journal_append_discard_event(rbd_dev, bio, offset,
+							length, ctx);
+	default:
+		return 0;
+	}
+}
+
 /*
  * Probe for the existence of the header object for the given rbd
  * device. If this image is the one being mapped (i.e., not a
@@ -6369,11 +6550,18 @@ static int __init rbd_slab_init(void)
 	rbd_assert(!rbd_obj_request_cache);
 	rbd_obj_request_cache = KMEM_CACHE(rbd_obj_request, 0);
 	if (!rbd_obj_request_cache)
-		goto out_err;
+		goto destroy_img_request_cache;

+	rbd_assert(!rbd_journal_ctx_cache);
+	rbd_journal_ctx_cache = KMEM_CACHE(rbd_journal_ctx, 0);
+	if (!rbd_journal_ctx_cache)
+		goto destroy_obj_request_cache;

 	return 0;

-out_err:
+destroy_obj_request_cache:
+	kmem_cache_destroy(rbd_obj_request_cache);
+	rbd_obj_request_cache = NULL;
+destroy_img_request_cache:
 	kmem_cache_destroy(rbd_img_request_cache);
 	rbd_img_request_cache = NULL;
 	return -ENOMEM;
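The 30-byte write prefix built above breaks down as: 6 bytes of versioned
encoding header (u8 struct_v, u8 struct_compat, u32 struct_len), a u32
event type, a u64 offset, a u64 length, and the u32 length prefix of the
data string; the bio payload then follows as the string body, which is why
struct_len is prefix_len + bio_len - 6. A small user-space sketch of
building that prefix (assumes little-endian encoding as in ceph_encode_*;
the helpers are hypothetical, not the kernel API):

    #include <stdint.h>
    #include <string.h>

    #define EVENT_TYPE_AIO_WRITE	1

    /* Minimal little-endian emitters standing in for ceph_encode_*(). */
    static void put8(uint8_t **p, uint8_t v)   { *(*p)++ = v; }
    static void put32(uint8_t **p, uint32_t v) { memcpy(*p, &v, 4); *p += 4; }
    static void put64(uint8_t **p, uint64_t v) { memcpy(*p, &v, 8); *p += 8; }

    /* Build the 30-byte prefix for an AIO_WRITE journal event. */
    static size_t encode_write_prefix(uint8_t *buf, uint64_t off,
    				  uint64_t len, uint32_t bio_len)
    {
    	uint8_t *p = buf;

    	put8(&p, 1);			/* struct_v */
    	put8(&p, 1);			/* struct_compat */
    	put32(&p, 30 + bio_len - 6);	/* struct_len: bytes after header */
    	put32(&p, EVENT_TYPE_AIO_WRITE);
    	put64(&p, off);
    	put64(&p, len);
    	put32(&p, bio_len);		/* string length; bio data follows */

    	return p - buf;			/* == 30 */
    }

The discard prefix is the same layout minus the trailing string length,
hence 26 bytes.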
From patchwork Mon Mar 18 09:15:33 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857019
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 15/16] rbd: replay events in journal
Date: Mon, 18 Mar 2019 05:15:33 -0400
Message-Id: <1552900534-29026-16-git-send-email-dongsheng.yang@easystack.cn>
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>

When we find uncommitted events in the journal, we need to replay them.
This commit only implements replay for three kinds of events:

EVENT_TYPE_AIO_DISCARD: send an img_request to the image with
OBJ_OP_DISCARD, and wait for it to complete.

EVENT_TYPE_AIO_WRITE: send an img_request to the image with
OBJ_OP_WRITE, and wait for it to complete.

EVENT_TYPE_AIO_FLUSH: as all other events are replayed synchronously,
everything before this event is already flushed, so nothing needs to be
done here.

Signed-off-by: Dongsheng Yang
---
 drivers/block/rbd.c | 184 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 184 insertions(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 5b641f8..6f31734 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5832,6 +5832,190 @@ static int rbd_dev_header_name(struct rbd_device *rbd_dev)
 	return ret;
 }

+enum rbd_journal_event_type {
+	EVENT_TYPE_AIO_DISCARD = 0,
+	EVENT_TYPE_AIO_WRITE = 1,
+	EVENT_TYPE_AIO_FLUSH = 2,
+	EVENT_TYPE_OP_FINISH = 3,
+	EVENT_TYPE_SNAP_CREATE = 4,
+	EVENT_TYPE_SNAP_REMOVE = 5,
+	EVENT_TYPE_SNAP_RENAME = 6,
+	EVENT_TYPE_SNAP_PROTECT = 7,
+	EVENT_TYPE_SNAP_UNPROTECT = 8,
+	EVENT_TYPE_SNAP_ROLLBACK = 9,
+	EVENT_TYPE_RENAME = 10,
+	EVENT_TYPE_RESIZE = 11,
+	EVENT_TYPE_FLATTEN = 12,
+	EVENT_TYPE_DEMOTE_PROMOTE = 13,
+	EVENT_TYPE_SNAP_LIMIT = 14,
+	EVENT_TYPE_UPDATE_FEATURES = 15,
+	EVENT_TYPE_METADATA_SET = 16,
+	EVENT_TYPE_METADATA_REMOVE = 17,
+	EVENT_TYPE_AIO_WRITESAME = 18,
+	EVENT_TYPE_AIO_COMPARE_AND_WRITE = 19,
+};
+
+static struct bio_vec *setup_write_bvecs(void *buf, u64 offset, u64 length)
+{
+	u32 i;
+	struct bio_vec *bvecs = NULL;
+	u32 bvec_count = 0;
+
+	bvec_count = calc_pages_for(offset, length);
+	bvecs = kcalloc(bvec_count, sizeof(*bvecs), GFP_NOIO);
+	if (!bvecs)
+		return NULL;
+
+	offset = offset % PAGE_SIZE;
+	for (i = 0; i < bvec_count; i++) {
+		unsigned int len = min(length, (u64)PAGE_SIZE - offset);
+
+		bvecs[i].bv_page = alloc_page(GFP_NOIO);
+		if (!bvecs[i].bv_page)
+			goto free_bvecs;
+
+		bvecs[i].bv_offset = offset;
+		bvecs[i].bv_len = len;
+		memcpy(page_address(bvecs[i].bv_page) + bvecs[i].bv_offset,
+		       buf, bvecs[i].bv_len);
+		length -= len;
+		buf += len;
+		offset = 0;
+	}
+
+	rbd_assert(!length);
+
+	return bvecs;
+
+free_bvecs:
+	while (i--)
+		__free_page(bvecs[i].bv_page);
+	kfree(bvecs);
+	return NULL;
+}
+
+static int rbd_journal_handle_aio_discard(struct rbd_device *rbd_dev,
+					  void **p, void *end, u8 struct_v,
+					  uint64_t commit_tid)
+{
+	uint64_t offset;
+	uint64_t length;
+	int result = 0;
+	enum obj_operation_type op_type;
+	struct rbd_img_request *img_request;
+	struct ceph_snap_context *snapc = NULL;
+
+	offset = ceph_decode_64(p);
+	length = ceph_decode_64(p);
+
+	snapc = rbd_dev->header.snapc;
+	ceph_get_snap_context(snapc);
+	op_type = OBJ_OP_DISCARD;
+
+	img_request = rbd_img_request_create(rbd_dev, op_type, snapc);
+	if (!img_request) {
+		result = -ENOMEM;
+		goto err;
+	}
+	img_request->journaler_commit_tid = commit_tid;
+
+	result = rbd_img_fill_nodata(img_request, offset, length);
+	if (result)
+		goto err;
+
+	rbd_img_request_submit(img_request);
+	result = wait_for_completion_interruptible(&img_request->completion);
+err:
+	return result;
+}
+
+static int rbd_journal_handle_aio_write(struct rbd_device *rbd_dev,
+					void **p, void *end, u8 struct_v,
+					uint64_t commit_tid)
+{
+	uint64_t offset;
+	uint64_t length;
+	char *data;
+	ssize_t data_len;
+	int result = 0;
+	enum obj_operation_type op_type;
+	struct ceph_snap_context *snapc = NULL;
+	struct rbd_img_request *img_request;
+	struct ceph_file_extent ex;
+	struct bio_vec *bvecs = NULL;
+
+	offset = ceph_decode_64(p);
+	length = ceph_decode_64(p);
+
+	data_len = ceph_decode_32(p);
+	if (!ceph_has_room(p, end, data_len)) {
+		pr_err("out of range");
+		return -ERANGE;
+	}
+
+	data = *p;
+	*p = (char *)*p + data_len;
+
+	snapc = rbd_dev->header.snapc;
+	ceph_get_snap_context(snapc);
+	op_type = OBJ_OP_WRITE;
+
+	img_request = rbd_img_request_create(rbd_dev, op_type, snapc);
+	if (!img_request) {
+		result = -ENOMEM;
+		goto err;
+	}
+
+	img_request->journaler_commit_tid = commit_tid;
+	snapc = NULL; /* img_request consumes a ref */
+
+	ex.fe_off = offset;
+	ex.fe_len = length;
+
+	bvecs = setup_write_bvecs(data, offset, length);
+	if (!bvecs) {
+		rbd_warn(rbd_dev, "failed to alloc bvecs");
+		result = -ENOMEM;
+		goto err;
+	}
+
+	result = rbd_img_fill_from_bvecs(img_request, &ex, 1, bvecs);
+	if (result)
+		goto err;
+
+	rbd_img_request_submit(img_request);
+	result = wait_for_completion_interruptible(&img_request->completion);
+err:
+	kfree(bvecs);
+	return result;
+}
+
+static int rbd_journal_replay(void *entry_handler,
+			      struct ceph_journaler_entry *entry,
+			      uint64_t commit_tid)
+{
+	struct rbd_device *rbd_dev = entry_handler;
+	void *data = entry->data;
+	void **p = &data;
+	void *end = *p + entry->data_len;
+	uint32_t event_type;
+	u8 struct_v;
+	u32 struct_len;
+	int ret = 0;
+
+	ret = ceph_start_decoding(p, end, 1, "rbd_decode_entry",
+				  &struct_v, &struct_len);
+	if (ret)
+		return -EINVAL;
+
+	event_type = ceph_decode_32(p);
+
+	switch (event_type) {
+	case EVENT_TYPE_AIO_WRITE:
+		ret = rbd_journal_handle_aio_write(rbd_dev, p, end,
+						   struct_v, commit_tid);
+		break;
+	case EVENT_TYPE_AIO_DISCARD:
+		ret = rbd_journal_handle_aio_discard(rbd_dev, p, end,
+						     struct_v, commit_tid);
+		break;
+	case EVENT_TYPE_AIO_FLUSH:
+		break;
+	default:
+		rbd_warn(rbd_dev, "unknown event_type: %u", event_type);
+		return -EINVAL;
+	}
+
+	return ret;
+}
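For context, the journaler side of this contract (net/ceph/journaler.c is
not quoted in this message) is expected to walk the uncommitted entries
during ceph_journaler_start_replay() and hand each one to the handler
registered in patch 13: journaler->handle_entry, with
journaler->entry_handler as its cookie. A hedged sketch of that dispatch
loop, with a hypothetical iterator since the real journaler internals are
not shown here:

    /* Hypothetical iterator: yields entries in commit order. */
    static struct ceph_journaler_entry *
    demo_next_uncommitted(struct ceph_journaler *journaler, uint64_t *tid);

    /* Sketch of the caller side of the replay contract. */
    static int demo_replay_uncommitted(struct ceph_journaler *journaler)
    {
    	struct ceph_journaler_entry *entry;
    	uint64_t commit_tid;
    	int ret;

    	while ((entry = demo_next_uncommitted(journaler, &commit_tid))) {
    		ret = journaler->handle_entry(journaler->entry_handler,
    					      entry, commit_tid);
    		if (ret)
    			return ret;	/* stop replay on first failure */
    	}
    	return 0;
    }

Because each handler waits on the img_request completion before returning,
replay is strictly ordered, which is what makes the AIO_FLUSH case a no-op.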
From patchwork Mon Mar 18 09:15:34 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10857025
From: Dongsheng Yang
To: idryomov@gmail.com, jdillama@redhat.com, sage@redhat.com, elder@kernel.org
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH v2 16/16] rbd: add support for feature of RBD_FEATURE_JOURNALING
Date: Mon, 18 Mar 2019 05:15:34 -0400
Message-Id: <1552900534-29026-17-git-send-email-dongsheng.yang@easystack.cn>
In-Reply-To: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>
References: <1552900534-29026-1-git-send-email-dongsheng.yang@easystack.cn>

Allow users to map rbd images with journaling enabled, but print a
warning to dmesg:

WARNING: kernel journaling is EXPERIMENTAL!

Signed-off-by: Dongsheng Yang
---
 drivers/block/rbd.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6f31734..db6ca79 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -118,12 +118,14 @@ static int atomic_dec_return_safe(atomic_t *v)
 #define RBD_FEATURE_LAYERING		(1ULL<<0)
 #define RBD_FEATURE_STRIPINGV2		(1ULL<<1)
 #define RBD_FEATURE_EXCLUSIVE_LOCK	(1ULL<<2)
+#define RBD_FEATURE_JOURNALING		(1ULL<<6)
 #define RBD_FEATURE_DATA_POOL		(1ULL<<7)
 #define RBD_FEATURE_OPERATIONS		(1ULL<<8)

 #define RBD_FEATURES_ALL	(RBD_FEATURE_LAYERING |		\
 				 RBD_FEATURE_STRIPINGV2 |	\
 				 RBD_FEATURE_EXCLUSIVE_LOCK |	\
+				 RBD_FEATURE_JOURNALING |	\
 				 RBD_FEATURE_DATA_POOL |	\
 				 RBD_FEATURE_OPERATIONS)
@@ -6437,6 +6439,11 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, int depth)
 			 "WARNING: kernel layering is EXPERIMENTAL!");
 	}

+	if (rbd_dev->header.features & RBD_FEATURE_JOURNALING) {
+		rbd_warn(rbd_dev,
+			 "WARNING: kernel journaling is EXPERIMENTAL!");
+	}
+
 	ret = rbd_dev_probe_parent(rbd_dev, depth);
 	if (ret)
 		goto err_out_probe;
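Feature negotiation here is a plain bit test: journaling is bit 6 of the
image's feature mask, and an image can only be mapped if every set bit is
also in RBD_FEATURES_ALL, which now includes RBD_FEATURE_JOURNALING. A
small user-space illustration of the check (RBD_FEATURES_SUPPORTED is a
stand-in for RBD_FEATURES_ALL, showing only a subset of the bits):

    #include <stdint.h>
    #include <stdio.h>

    #define RBD_FEATURE_EXCLUSIVE_LOCK	(1ULL << 2)
    #define RBD_FEATURE_JOURNALING	(1ULL << 6)
    #define RBD_FEATURES_SUPPORTED	(RBD_FEATURE_EXCLUSIVE_LOCK | \
    					 RBD_FEATURE_JOURNALING)

    int main(void)
    {
    	uint64_t image_features = RBD_FEATURE_EXCLUSIVE_LOCK |
    				  RBD_FEATURE_JOURNALING;
    	uint64_t unsupported = image_features & ~RBD_FEATURES_SUPPORTED;

    	if (unsupported)
    		printf("cannot map: unsupported features 0x%llx\n",
    		       (unsigned long long)unsupported);
    	else if (image_features & RBD_FEATURE_JOURNALING)
    		printf("WARNING: kernel journaling is EXPERIMENTAL!\n");
    	return 0;
    }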