From: Roman Penyaev
Cc: David Disseldorp, Roman Penyaev, Ilya Dryomov, Sage Weil, Alex Elder, "Yan, Zheng", ceph-devel@vger.kernel.org
Subject: [RFC 0/2] rbd: respect REQ_NOUNMAP by setting new nounmap flag for ZERO op
Date: Fri, 18 Jan 2019 15:56:05 +0100
Message-Id: <20190118145607.30018-1-rpenyaev@suse.de>

Hi all,

This is an attempt to split the DISCARD and WRITE_ZEROES paths on the krbd
side when the REQ_NOUNMAP flag is set on a block layer request.

Currently both REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES block layer requests
end up as a CEPH_OSD_OP_ZERO request, which punches holes on the osd side.
With a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP flag on the CEPH_OSD_OP_ZERO
request, the osd can zero out blocks instead of punching holes.

Possible handling of the new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP flag on the osd:

diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index dff549535d39..d812d88d7bc0 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -5788,6 +5788,7 @@ int PrimaryLogPG::do_osd_ops(OpContext *ctx, vector& ops)
 
       // munge ZERO -> TRUNCATE?  (don't munge to DELETE or we risk hosing attributes)
       if (op.op == CEPH_OSD_OP_ZERO &&
+          !(op.flags & CEPH_OSD_OP_FLAG_ZERO_NOUNMAP) &&
           obs.exists &&
           op.extent.offset < static_cast(osd->osd_max_object_size) &&
           op.extent.length >= 1 &&
@@ -6583,3 +6584,28 @@ int PrimaryLogPG::do_osd_ops(OpContext *ctx, vector& ops)
         result = -EOPNOTSUPP;
         break;
       }
+      if (op.flags & CEPH_OSD_OP_FLAG_ZERO_NOUNMAP) {
+        // ZERO -> WRITE_SAME
+        // Why? Internally storage backend punches holes on zeroing, but we
+        // need zeroed blocks instead.
+
+        if (osd_op.indata.length()) {
+          // Zero op with data? No way.
+          result = -EINVAL;
+          goto fail;
+        }
+
+        // Extent and writesame layouts are almost similar, so reset union
+        // members which are different
+        op.extent.truncate_size = 0;
+        op.extent.truncate_seq = 0;
+
+        // Fill in zero data, will be duplicated inside do_writesame()
+        const char buf[2] = {0};
+        osd_op.indata.append(buf, sizeof(buf));
+        op.writesame.data_length = sizeof(buf);
+
+        result = do_writesame(ctx, osd_op);
+        break;
+      }
+

Another possible solution is to send CEPH_OSD_OP_WRITESAME directly from
krbd instead of CEPH_OSD_OP_ZERO + NOUNMAP flag, but IMO that has a
drawback: OP_WRITESAME was implemented on the osd only a couple of years
ago, and it does not seem nice to break compatibility for someone who has
an updated kernel but is still running an old ceph cluster. Also,
ZERO + NOUNMAP has cleaner semantics.

These are just thoughts; nothing is tested, hence the RFC.

Roman Penyaev (2):
  libceph, rbd: pass op flags to osd_req_op_extent_init()
  libceph, rbd: respect REQ_NOUNMAP by setting new nounmap flag for CEPH_OSD_OP_ZERO

 drivers/block/rbd.c             | 51 ++++++++++++++++++++++++---------
 include/linux/ceph/osd_client.h |  2 +-
 include/linux/ceph/rados.h      |  1 +
 net/ceph/osd_client.c           |  6 ++--
 4 files changed, 42 insertions(+), 18 deletions(-)

Signed-off-by: Roman Penyaev
Cc: Ilya Dryomov
Cc: Sage Weil
Cc: Alex Elder
Cc: "Yan, Zheng"
Cc: ceph-devel@vger.kernel.org
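
To make the intended difference concrete, here is a small userspace model in C. It assumes a toy object store that tracks a per-block "allocated" bit. It is not krbd, libceph or osd code: toy_obj, toy_zero_flags(), toy_writesame(), toy_osd_zero() and TOY_ZERO_NOUNMAP are hypothetical names, and the flag value is made up.

The client-side mapping is also an assumption. Both block layer ops become ZERO, and only a WRITE_ZEROES carrying REQ_NOUNMAP sets the flag. On the server side, a flagged ZERO is turned into a WRITE_SAME of a 2-byte zero pattern, as in the osd diff, instead of punching holes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for REQ_OP_DISCARD / REQ_OP_WRITE_ZEROES. */
enum toy_blk_op { TOY_OP_DISCARD, TOY_OP_WRITE_ZEROES };

/* Hypothetical flag bit standing in for CEPH_OSD_OP_FLAG_ZERO_NOUNMAP;
 * the real value added to rados.h is not shown in this mail. */
#define TOY_ZERO_NOUNMAP 0x1u

#define TOY_BLOCK_SIZE 4
#define TOY_NBLOCKS    8

/* Toy object: byte contents plus a per-block "allocated" bit. */
struct toy_obj {
	unsigned char data[TOY_BLOCK_SIZE * TOY_NBLOCKS];
	bool allocated[TOY_NBLOCKS];
};

/* Client side (assumed): every request becomes a ZERO op; only a
 * WRITE_ZEROES carrying REQ_NOUNMAP gets the flag. */
unsigned toy_zero_flags(enum toy_blk_op op, bool req_nounmap)
{
	return (op == TOY_OP_WRITE_ZEROES && req_nounmap) ? TOY_ZERO_NOUNMAP : 0;
}

/* WRITE_SAME-like helper: replicate pat over [off, off + len) and mark
 * every block the range touches as allocated. */
void toy_writesame(struct toy_obj *o, size_t off, size_t len,
		   const unsigned char *pat, size_t pat_len)
{
	assert(pat_len > 0 && len % pat_len == 0);
	assert(off + len <= sizeof(o->data));
	for (size_t i = 0; i < len; i += pat_len)
		memcpy(o->data + off + i, pat, pat_len);
	for (size_t b = off / TOY_BLOCK_SIZE; b * TOY_BLOCK_SIZE < off + len; b++)
		o->allocated[b] = true;
}

/* Server side: ZERO punches holes unless TOY_ZERO_NOUNMAP is set, in
 * which case it becomes a WRITE_SAME of a 2-byte zero pattern. */
void toy_osd_zero(struct toy_obj *o, size_t off, size_t len, unsigned flags)
{
	assert(off + len <= sizeof(o->data));
	if (flags & TOY_ZERO_NOUNMAP) {
		const unsigned char zeros[2] = {0};
		toy_writesame(o, off, len, zeros, sizeof(zeros));
		return;
	}
	/* Reads of the range return zeros either way... */
	memset(o->data + off, 0, len);
	/* ...but here fully covered blocks are deallocated (hole punched). */
	for (size_t b = (off + TOY_BLOCK_SIZE - 1) / TOY_BLOCK_SIZE;
	     (b + 1) * TOY_BLOCK_SIZE <= off + len; b++)
		o->allocated[b] = false;
}
```

Data reads back as zeros in every case. The toy exposes the difference only through allocated[]. A plain ZERO over whole blocks leaves them unallocated. A flagged ZERO leaves them allocated and zero-filled. An unaligned plain ZERO only zeroes the partial blocks it touches.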