From patchwork Tue Mar 21 16:45:28 2023
X-Patchwork-Submitter: Boris Burkov
X-Patchwork-Id: 13182937
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v4 1/6] btrfs: add function to create and return an ordered extent
Date:
Tue, 21 Mar 2023 09:45:28 -0700 Message-Id: <3fac8b7cb05dabbb11205aa9076c889ca2894eb3.1679416511.git.boris@bur.io> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Currently, btrfs_add_ordered_extent allocates a new ordered extent, adds it to the rb_tree, but doesn't return a referenced pointer to the caller. There are cases where it is useful for the creator of a new ordered_extent to hang on to such a pointer, so add a new function btrfs_alloc_ordered_extent which is the same as btrfs_add_ordered_extent, except it takes an additional reference count and returns a pointer to the ordered_extent. Implement btrfs_add_ordered_extent as btrfs_alloc_ordered_extent followed by dropping the new reference and handling the IS_ERR case. The type of flags in btrfs_alloc_ordered_extent and btrfs_add_ordered_extent is changed from unsigned int to unsigned long so it's unified with the other ordered extent functions. Reviewed-by: Filipe Manana Reviewed-by: Christoph Hellwig Signed-off-by: Boris Burkov Signed-off-by: David Sterba --- fs/btrfs/ordered-data.c | 46 +++++++++++++++++++++++++++++++++-------- fs/btrfs/ordered-data.h | 7 ++++++- 2 files changed, 43 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 6c24b69e2d0a..1848d0d1a9c4 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -160,14 +160,16 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree, * @compress_type: Compression algorithm used for data. * * Most of these parameters correspond to &struct btrfs_file_extent_item. The - * tree is given a single reference on the ordered extent that was inserted. + * tree is given a single reference on the ordered extent that was inserted, and + * the returned pointer is given a second reference. * - * Return: 0 or -ENOMEM. + * Return: the new ordered extent or ERR_PTR(-ENOMEM). */ -int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, - u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, - u64 disk_num_bytes, u64 offset, unsigned flags, - int compress_type) +struct btrfs_ordered_extent *btrfs_alloc_ordered_extent( + struct btrfs_inode *inode, u64 file_offset, + u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, + u64 disk_num_bytes, u64 offset, unsigned long flags, + int compress_type) { struct btrfs_root *root = inode->root; struct btrfs_fs_info *fs_info = root->fs_info; @@ -181,7 +183,7 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, /* For nocow write, we can release the qgroup rsv right now */ ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes); if (ret < 0) - return ret; + return ERR_PTR(ret); ret = 0; } else { /* @@ -190,11 +192,11 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, */ ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes); if (ret < 0) - return ret; + return ERR_PTR(ret); } entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS); if (!entry) - return -ENOMEM; + return ERR_PTR(-ENOMEM); entry->file_offset = file_offset; entry->num_bytes = num_bytes; @@ -256,6 +258,32 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, btrfs_mod_outstanding_extents(inode, 1); spin_unlock(&inode->lock); + /* One ref for the returned entry to match semantics of lookup. 
*/ + refcount_inc(&entry->refs); + + return entry; +} + +/* + * Add a new btrfs_ordered_extent for the range, but drop the reference instead + * of returning it to the caller. + */ +int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, + u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, + u64 disk_num_bytes, u64 offset, unsigned long flags, + int compress_type) +{ + struct btrfs_ordered_extent *ordered; + + ordered = btrfs_alloc_ordered_extent(inode, file_offset, num_bytes, + ram_bytes, disk_bytenr, + disk_num_bytes, offset, flags, + compress_type); + + if (IS_ERR(ordered)) + return PTR_ERR(ordered); + btrfs_put_ordered_extent(ordered); + return 0; } diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index eb40cb39f842..18007f9c00ad 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -178,9 +178,14 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode, bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode, struct btrfs_ordered_extent **cached, u64 file_offset, u64 io_size); +struct btrfs_ordered_extent *btrfs_alloc_ordered_extent( + struct btrfs_inode *inode, u64 file_offset, + u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, + u64 disk_num_bytes, u64 offset, unsigned long flags, + int compress_type); int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, - u64 disk_num_bytes, u64 offset, unsigned flags, + u64 disk_num_bytes, u64 offset, unsigned long flags, int compress_type); void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry, struct btrfs_ordered_sum *sum); From patchwork Tue Mar 21 16:45:29 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Burkov X-Patchwork-Id: 13182938 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D7988C7619A for ; Tue, 21 Mar 2023 16:46:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230120AbjCUQqH (ORCPT ); Tue, 21 Mar 2023 12:46:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59978 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230024AbjCUQp5 (ORCPT ); Tue, 21 Mar 2023 12:45:57 -0400 Received: from out5-smtp.messagingengine.com (out5-smtp.messagingengine.com [66.111.4.29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4581852915 for ; Tue, 21 Mar 2023 09:45:40 -0700 (PDT) Received: from compute5.internal (compute5.nyi.internal [10.202.2.45]) by mailout.nyi.internal (Postfix) with ESMTP id DA1B25C0194; Tue, 21 Mar 2023 12:45:38 -0400 (EDT) Received: from mailfrontend2 ([10.202.2.163]) by compute5.internal (MEProxy); Tue, 21 Mar 2023 12:45:38 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bur.io; h=cc :content-transfer-encoding:content-type:date:date:from:from :in-reply-to:in-reply-to:message-id:mime-version:references :reply-to:sender:subject:subject:to:to; s=fm2; t=1679417138; x= 1679503538; bh=zrGpKiPh2OFEqA9I9wGR/IEfFSnJXMuned77DwsLncQ=; b=Y V3qDvzErkgvuzQOyL3SIqvTAXBvVmTQDLV7Ay/XpyIWWpH1CJmZ3ylPjAd/lIong cJyAP7cqs3E38Z8yiBMPN8CuNcirvTr/z0FO522bQczISyFktHScYTNHptryXr4S LEb39L41tItpWvszaVYN2eVQ6nc2MJH9lit2n5e9fI5jQ1uZkYf74h7SlKx9DCaj wzz3NScq7+FZxWsQmrwzG375msVzGy3voe6SqV9SqLk94JhX2I71mfJetnKvUNPw 
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v4 2/6] btrfs: stash ordered extent in dio_data during iomap dio
Date: Tue, 21 Mar 2023 09:45:29 -0700

While it is not feasible for an ordered extent to survive across the calls
btrfs_direct_write makes into __iomap_dio_rw, it is still helpful to stash it
on the dio_data in between creating it in iomap_begin and finishing it in
either end_io or iomap_end. The specific use I have in mind is that we can
check whether a particular bio is partial in submit_io without unconditionally
looking up the ordered extent. This is a preparatory patch for a later patch
which does just that.
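As a rough illustration of the reference flow that patches 1 and 2 set up
(btrfs_alloc_ordered_extent hands back a pointer that already holds an extra
reference, iomap_begin stashes it in the per-call dio_data, and iomap_end
drops the stashed reference while the tree's own reference survives until the
ordered IO finishes), here is a minimal userspace C sketch. The struct names
and the plain-int refcount are simplified stand-ins of my own, not kernel
code.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for struct btrfs_ordered_extent. */
struct ordered_extent {
	unsigned long long file_offset;
	unsigned long long num_bytes;
	int refs;
};

/* Simplified stand-in for struct btrfs_dio_data. */
struct dio_data {
	struct ordered_extent *ordered;	/* stashed between begin and end */
};

/* Allocate with two references: one for the "tree", one for the caller. */
static struct ordered_extent *alloc_ordered_extent(unsigned long long off,
						   unsigned long long len)
{
	struct ordered_extent *oe = calloc(1, sizeof(*oe));

	if (!oe)
		return NULL;
	oe->file_offset = off;
	oe->num_bytes = len;
	oe->refs = 2;
	return oe;
}

static void put_ordered_extent(struct ordered_extent *oe)
{
	if (oe && --oe->refs == 0)
		free(oe);
}

/* "iomap_begin": create the ordered extent and stash the returned ref. */
static int dio_iomap_begin(struct dio_data *dio, unsigned long long off,
			   unsigned long long len)
{
	assert(!dio->ordered);
	dio->ordered = alloc_ordered_extent(off, len);
	return dio->ordered ? 0 : -1;
}

/* "iomap_end": drop only the stashed reference. */
static void dio_iomap_end(struct dio_data *dio)
{
	put_ordered_extent(dio->ordered);
	dio->ordered = NULL;
}

int main(void)
{
	struct dio_data dio = { 0 };	/* zeroed, like btrfs_dio_read/write */
	struct ordered_extent *tree_ref;

	if (dio_iomap_begin(&dio, 0, 1 << 20))
		return 1;
	printf("stashed ordered extent of %llu bytes\n", dio.ordered->num_bytes);
	tree_ref = dio.ordered;		/* stands in for the tree's reference */
	dio_iomap_end(&dio);
	put_ordered_extent(tree_ref);	/* stands in for finish_ordered_io */
	return 0;
}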
Signed-off-by: Boris Burkov --- fs/btrfs/inode.c | 37 ++++++++++++++++++++++++------------- 1 file changed, 24 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 76d93b9e94a9..5ab486f448eb 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -81,6 +81,7 @@ struct btrfs_dio_data { struct extent_changeset *data_reserved; bool data_space_reserved; bool nocow_done; + struct btrfs_ordered_extent *ordered; }; struct btrfs_dio_private { @@ -6968,6 +6969,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode, } static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, + struct btrfs_dio_data *dio_data, const u64 start, const u64 len, const u64 orig_start, @@ -6978,7 +6980,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, const int type) { struct extent_map *em = NULL; - int ret; + struct btrfs_ordered_extent *ordered; if (type != BTRFS_ORDERED_NOCOW) { em = create_io_em(inode, start, len, orig_start, block_start, @@ -6988,18 +6990,21 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, if (IS_ERR(em)) goto out; } - ret = btrfs_add_ordered_extent(inode, start, len, len, block_start, - block_len, 0, - (1 << type) | - (1 << BTRFS_ORDERED_DIRECT), - BTRFS_COMPRESS_NONE); - if (ret) { + ordered = btrfs_alloc_ordered_extent(inode, start, len, len, + block_start, block_len, 0, + (1 << type) | + (1 << BTRFS_ORDERED_DIRECT), + BTRFS_COMPRESS_NONE); + if (IS_ERR(ordered)) { if (em) { free_extent_map(em); btrfs_drop_extent_map_range(inode, start, start + len - 1, false); } - em = ERR_PTR(ret); + em = ERR_PTR(PTR_ERR(ordered)); + } else { + ASSERT(!dio_data->ordered); + dio_data->ordered = ordered; } out: @@ -7007,6 +7012,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, } static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode, + struct btrfs_dio_data *dio_data, u64 start, u64 len) { struct btrfs_root *root = inode->root; @@ -7022,7 +7028,8 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode, if (ret) return ERR_PTR(ret); - em = btrfs_create_dio_extent(inode, start, ins.offset, start, + em = btrfs_create_dio_extent(inode, dio_data, + start, ins.offset, start, ins.objectid, ins.offset, ins.offset, ins.offset, BTRFS_ORDERED_REGULAR); btrfs_dec_block_group_reservations(fs_info, ins.objectid); @@ -7367,7 +7374,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, } space_reserved = true; - em2 = btrfs_create_dio_extent(BTRFS_I(inode), start, len, + em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len, orig_start, block_start, len, orig_block_len, ram_bytes, type); @@ -7409,7 +7416,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, goto out; space_reserved = true; - em = btrfs_new_extent_direct(BTRFS_I(inode), start, len); + em = btrfs_new_extent_direct(BTRFS_I(inode), dio_data, start, len); if (IS_ERR(em)) { ret = PTR_ERR(em); goto out; @@ -7715,6 +7722,10 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length, pos + length - 1, NULL); ret = -ENOTBLK; } + if (write) { + btrfs_put_ordered_extent(dio_data->ordered); + dio_data->ordered = NULL; + } if (write) extent_changeset_free(dio_data->data_reserved); @@ -7776,7 +7787,7 @@ static const struct iomap_dio_ops btrfs_dio_ops = { ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, size_t done_before) { - struct btrfs_dio_data data; + struct btrfs_dio_data data = { 0 }; 
return iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
 IOMAP_DIO_PARTIAL, &data, done_before);
@@ -7785,7 +7796,7 @@ ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, size_t done_be
 struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 size_t done_before)
 {
- struct btrfs_dio_data data;
+ struct btrfs_dio_data data = { 0 };
 return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
 IOMAP_DIO_PARTIAL, &data, done_before);

From patchwork Tue Mar 21 16:45:30 2023
X-Patchwork-Submitter: Boris Burkov
X-Patchwork-Id: 13182941
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v4 3/6] btrfs: repurpose split_zoned_em for partial dio splitting
Date: Tue, 21 Mar 2023 09:45:30 -0700
Message-Id: <5faf0148f526b4e9eb373c177de3c70284999ce7.1679416511.git.boris@bur.io>

In a subsequent patch I will be "extracting" a partial dio write bio from its
ordered extent, creating a 1:1 bio<->ordered_extent relation. This is
necessary to avoid triggering an assertion in unpin_extent_cache, called from
btrfs_finish_ordered_io, that checks that the em matches the finished ordered
extent.

Since there is already a function which splits an uncompressed extent_map for
a zoned bio use case, adapt it to this new, similar use case. Mostly, modify
it to handle the case where the extent_map is bigger than the ordered_extent,
and we cannot assume the em "post" split can be computed from the
ordered_extent and bio. This comes up in btrfs/250, for example. I felt that
these relaxations were not so damaging to the legibility of the zoned case as
to merit a fully separate codepath, but I admit that is not my area of
expertise.

Signed-off-by: Boris Burkov
---
 fs/btrfs/inode.c | 104 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 71 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5ab486f448eb..2f8baf4797ea 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2512,37 +2512,59 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 }

 /*
- * Split an extent_map at [start, start + len]
+ * Split out a middle extent_map [start, start+len] from within an extent_map.
 *
- * This function is intended to be used only for extract_ordered_extent().
+ * @inode: the inode to which the extent map belongs.
+ * @start: the start index of the middle split
+ * @len: the length of the middle split
+ *
+ * The result is two or three extent_maps inserted in the tree, depending on
+ * whether start and len imply an uncovered area at the beginning or end of the
+ * extent map. If the implied split lines up with the end or beginning, there
+ * will only be two extent maps in the resulting split, otherwise there will be
+ * three. (If they both match, the split operation is a no-op)
+ *
+ * extent map splitting assumptions:
+ * end = start + len
+ * em-end = em-start + em-len
+ * start >= em-start
+ * len < em-len
+ * end <= em-end
+ *
+ * Diagrams explaining the splitting cases:
+ * original em:
+ * [em-start---start---end---em-end)
+ * resulting ems:
+ * start != em-start && end != em-end (full tri split):
+ * [em-start---start) [start---end) [end---em-end)
+ * start == em-start (no pre split):
+ * [em-start---end) [end---em-end)
+ * end == em-end (no post split):
+ * [em-start---start) [start--em-end)
+ *
+ * Returns: 0 on success, -errno on failure.
*/ -static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len, - u64 pre, u64 post) +static int split_em(struct btrfs_inode *inode, u64 start, u64 len) { struct extent_map_tree *em_tree = &inode->extent_tree; struct extent_map *em; + u64 pre_start, pre_len, pre_end; + u64 mid_start, mid_len, mid_end; + u64 post_start, post_len, post_end; struct extent_map *split_pre = NULL; struct extent_map *split_mid = NULL; struct extent_map *split_post = NULL; int ret = 0; unsigned long flags; - /* Sanity check */ - if (pre == 0 && post == 0) - return 0; - split_pre = alloc_extent_map(); - if (pre) - split_mid = alloc_extent_map(); - if (post) - split_post = alloc_extent_map(); - if (!split_pre || (pre && !split_mid) || (post && !split_post)) { + split_mid = alloc_extent_map(); + split_post = alloc_extent_map(); + if (!split_pre || !split_mid || !split_post) { ret = -ENOMEM; goto out; } - ASSERT(pre + post < len); - lock_extent(&inode->io_tree, start, start + len - 1, NULL); write_lock(&em_tree->lock); em = lookup_extent_mapping(em_tree, start, len); @@ -2551,19 +2573,38 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len, goto out_unlock; } - ASSERT(em->len == len); + pre_start = em->start; + pre_end = start; + pre_len = pre_end - pre_start; + mid_start = start; + mid_end = start + len; + mid_len = len; + post_start = mid_end; + post_end = em->start + em->len; + post_len = post_end - post_start; + ASSERT(pre_start == em->start); + ASSERT(pre_start + pre_len == mid_start); + ASSERT(mid_start + mid_len == post_start); + ASSERT(post_start + post_len == em->start + em->len); + + /* Sanity check */ + if (pre_len == 0 && post_len == 0) { + ret = 0; + goto out_unlock; + } + ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)); ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE); - ASSERT(test_bit(EXTENT_FLAG_PINNED, &em->flags)); ASSERT(!test_bit(EXTENT_FLAG_LOGGING, &em->flags)); ASSERT(!list_empty(&em->list)); flags = em->flags; - clear_bit(EXTENT_FLAG_PINNED, &em->flags); + if (test_bit(EXTENT_FLAG_PINNED, &em->flags)) + clear_bit(EXTENT_FLAG_PINNED, &em->flags); /* First, replace the em with a new extent_map starting from * em->start */ split_pre->start = em->start; - split_pre->len = (pre ? pre : em->len - post); + split_pre->len = (pre_len ? 
pre_len : mid_len); split_pre->orig_start = split_pre->start; split_pre->block_start = em->block_start; split_pre->block_len = split_pre->len; @@ -2577,16 +2618,15 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len, /* * Now we only have an extent_map at: - * [em->start, em->start + pre] if pre != 0 - * [em->start, em->start + em->len - post] if pre == 0 + * [em->start, em->start + pre_len] if pre_len != 0 + * [em->start, em->start + mid_len] if pre == 0 */ - - if (pre) { + if (pre_len) { /* Insert the middle extent_map */ - split_mid->start = em->start + pre; - split_mid->len = em->len - pre - post; + split_mid->start = mid_start; + split_mid->len = mid_len; split_mid->orig_start = split_mid->start; - split_mid->block_start = em->block_start + pre; + split_mid->block_start = em->block_start + pre_len; split_mid->block_len = split_mid->len; split_mid->orig_block_len = split_mid->block_len; split_mid->ram_bytes = split_mid->len; @@ -2596,11 +2636,11 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len, add_extent_mapping(em_tree, split_mid, 1); } - if (post) { - split_post->start = em->start + em->len - post; - split_post->len = post; + if (post_len) { + split_post->start = post_start; + split_post->len = post_len; split_post->orig_start = split_post->start; - split_post->block_start = em->block_start + em->len - post; + split_post->block_start = em->block_start + pre_len + mid_len; split_post->block_len = split_post->len; split_post->orig_block_len = split_post->block_len; split_post->ram_bytes = split_post->len; @@ -2632,7 +2672,6 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio) u64 len = bbio->bio.bi_iter.bi_size; struct btrfs_inode *inode = bbio->inode; struct btrfs_ordered_extent *ordered; - u64 file_len; u64 end = start + len; u64 ordered_end; u64 pre, post; @@ -2671,14 +2710,13 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio) goto out; } - file_len = ordered->num_bytes; pre = start - ordered->disk_bytenr; post = ordered_end - end; ret = btrfs_split_ordered_extent(ordered, pre, post); if (ret) goto out; - ret = split_zoned_em(inode, bbio->file_offset, file_len, pre, post); + ret = split_em(inode, bbio->file_offset, len); out: btrfs_put_ordered_extent(ordered); From patchwork Tue Mar 21 16:45:31 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Burkov X-Patchwork-Id: 13182940 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9CA8CC6FD1D for ; Tue, 21 Mar 2023 16:46:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229996AbjCUQqL (ORCPT ); Tue, 21 Mar 2023 12:46:11 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60286 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230220AbjCUQqC (ORCPT ); Tue, 21 Mar 2023 12:46:02 -0400 Received: from out5-smtp.messagingengine.com (out5-smtp.messagingengine.com [66.111.4.29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C092F51CA6 for ; Tue, 21 Mar 2023 09:45:44 -0700 (PDT) Received: from compute3.internal (compute3.nyi.internal [10.202.2.43]) by mailout.nyi.internal (Postfix) with ESMTP id 5065D5C014B; Tue, 21 Mar 2023 12:45:42 -0400 (EDT) Received: from mailfrontend2 ([10.202.2.163]) by compute3.internal 
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v4 4/6] btrfs: return ordered_extent splits from bio extraction
Date: Tue, 21 Mar 2023 09:45:31 -0700

When extracting a bio from its ordered extent for dio partial writes, we need
the "remainder" ordered extent. It would be possible to look it up in that
case, but since we can grab the ordered_extent from the new allocation
function, we might as well wire it up to be returned to the caller via an out
parameter and save that lookup.

Refactor the clone ordered extent function to return the new ordered extent,
then refactor the split and extract functions to pass back the new pre and
post split ordered extents via output parameters.
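The calling convention this patch gives btrfs_split_ordered_extent (a clone
helper that returns either a valid pointer or an encoded errno, and a split
function that hands the optional pre/post results back through nullable out
parameters) can be modeled outside the kernel. The sketch below is my own
illustration with invented names, not the btrfs code; it only mirrors the
ERR_PTR/IS_ERR/PTR_ERR plus out-parameter pattern.

#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins for the kernel's pointer-encoded error helpers. */
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct piece {
	uint64_t offset;
	uint64_t len;
};

/* Clone [pos, pos + len) out of @src; returns a piece or an ERR_PTR. */
static struct piece *clone_piece(const struct piece *src, uint64_t pos,
				 uint64_t len)
{
	struct piece *p;

	if (pos + len > src->len)
		return ERR_PTR(-EINVAL);
	p = malloc(sizeof(*p));
	if (!p)
		return ERR_PTR(-ENOMEM);
	p->offset = src->offset + pos;
	p->len = len;
	return p;
}

/*
 * Trim @pre bytes off the front and @post bytes off the back of @p, handing
 * the trimmed pieces back through optional out parameters (callers that do
 * not care simply pass NULL).
 */
static int split_piece(struct piece *p, uint64_t pre, uint64_t post,
		       struct piece **ret_pre, struct piece **ret_post)
{
	struct piece *oe;

	if (pre) {
		oe = clone_piece(p, 0, pre);
		if (IS_ERR(oe))
			return PTR_ERR(oe);
		if (ret_pre)
			*ret_pre = oe;
		else
			free(oe);
	}
	if (post) {
		oe = clone_piece(p, p->len - post, post);
		if (IS_ERR(oe))
			return PTR_ERR(oe);
		if (ret_post)
			*ret_post = oe;
		else
			free(oe);
	}
	p->offset += pre;
	p->len -= pre + post;
	return 0;
}

int main(void)
{
	struct piece whole = { .offset = 0, .len = 4096 };
	struct piece *pre = NULL, *post = NULL;

	if (split_piece(&whole, 512, 1024, &pre, &post))
		return 1;
	printf("pre: %" PRIu64 " bytes, mid: %" PRIu64 " bytes, post: %" PRIu64 " bytes\n",
	       pre->len, whole.len, post->len);
	free(pre);
	free(post);
	return 0;
}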
Signed-off-by: Boris Burkov --- fs/btrfs/bio.c | 2 +- fs/btrfs/btrfs_inode.h | 5 ++++- fs/btrfs/inode.c | 23 ++++++++++++++++++----- fs/btrfs/ordered-data.c | 36 +++++++++++++++++++++++------------- fs/btrfs/ordered-data.h | 6 ++++-- 5 files changed, 50 insertions(+), 22 deletions(-) diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c index cf09c6271edb..b849ced40d37 100644 --- a/fs/btrfs/bio.c +++ b/fs/btrfs/bio.c @@ -653,7 +653,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num) if (use_append) { bio->bi_opf &= ~REQ_OP_WRITE; bio->bi_opf |= REQ_OP_ZONE_APPEND; - ret = btrfs_extract_ordered_extent(bbio); + ret = btrfs_extract_ordered_extent_bio(bbio, NULL, NULL, NULL); if (ret) goto fail_put_bio; } diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 9dc21622806e..e92a09559058 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -407,7 +407,10 @@ static inline void btrfs_inode_split_flags(u64 inode_item_flags, int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page, u32 pgoff, u8 *csum, const u8 * const csum_expected); -blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio); +blk_status_t btrfs_extract_ordered_extent_bio(struct btrfs_bio *bbio, + struct btrfs_ordered_extent *ordered, + struct btrfs_ordered_extent **ret_pre, + struct btrfs_ordered_extent **ret_post); bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev, u32 bio_offset, struct bio_vec *bv); noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 2f8baf4797ea..dbea124c9fa3 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2666,21 +2666,35 @@ static int split_em(struct btrfs_inode *inode, u64 start, u64 len) return ret; } -blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio) +/* + * Extract a bio from an ordered extent to enforce an invariant where the bio + * fully matches a single ordered extent. + * + * @bbio: the bio to extract. + * @ordered: the ordered extent the bio is in, will be shrunk to fit. If NULL we + * will look it up. + * @ret_pre: out parameter to return the new oe in front of the bio, if needed. + * @ret_post: out parameter to return the new oe past the bio, if needed. 
+ */ +blk_status_t btrfs_extract_ordered_extent_bio(struct btrfs_bio *bbio, + struct btrfs_ordered_extent *ordered, + struct btrfs_ordered_extent **ret_pre, + struct btrfs_ordered_extent **ret_post) { u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT; u64 len = bbio->bio.bi_iter.bi_size; struct btrfs_inode *inode = bbio->inode; - struct btrfs_ordered_extent *ordered; u64 end = start + len; u64 ordered_end; u64 pre, post; int ret = 0; - ordered = btrfs_lookup_ordered_extent(inode, bbio->file_offset); + if (!ordered) + ordered = btrfs_lookup_ordered_extent(inode, bbio->file_offset); if (WARN_ON_ONCE(!ordered)) return BLK_STS_IOERR; + ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes; /* No need to split */ if (ordered->disk_num_bytes == len) goto out; @@ -2697,7 +2711,6 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio) goto out; } - ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes; /* bio must be in one ordered extent */ if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) { ret = -EINVAL; @@ -2713,7 +2726,7 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio) pre = start - ordered->disk_bytenr; post = ordered_end - end; - ret = btrfs_split_ordered_extent(ordered, pre, post); + ret = btrfs_split_ordered_extent(ordered, pre, post, ret_pre, ret_post); if (ret) goto out; ret = split_em(inode, bbio->file_offset, len); diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 1848d0d1a9c4..4bebebb9b434 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -1117,8 +1117,8 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end, } -static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos, - u64 len) +static struct btrfs_ordered_extent *clone_ordered_extent(struct btrfs_ordered_extent *ordered, + u64 pos, u64 len) { struct inode *inode = ordered->inode; struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; @@ -1133,18 +1133,22 @@ static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos, percpu_counter_add_batch(&fs_info->ordered_bytes, -len, fs_info->delalloc_batch); WARN_ON_ONCE(flags & (1 << BTRFS_ORDERED_COMPRESSED)); - return btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, len, len, - disk_bytenr, len, 0, flags, - ordered->compress_type); + return btrfs_alloc_ordered_extent(BTRFS_I(inode), file_offset, len, len, + disk_bytenr, len, 0, flags, + ordered->compress_type); } -int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre, - u64 post) +int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, + u64 pre, u64 post, + struct btrfs_ordered_extent **ret_pre, + struct btrfs_ordered_extent **ret_post) + { struct inode *inode = ordered->inode; struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree; struct rb_node *node; struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_ordered_extent *oe; int ret = 0; trace_btrfs_ordered_extent_split(BTRFS_I(inode), ordered); @@ -1172,12 +1176,18 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre, spin_unlock_irq(&tree->lock); - if (pre) - ret = clone_ordered_extent(ordered, 0, pre); - if (ret == 0 && post) - ret = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, - post); - + if (pre) { + oe = clone_ordered_extent(ordered, 0, pre); + ret = IS_ERR(oe) ? 
PTR_ERR(oe) : 0; + if (!ret && ret_pre) + *ret_pre = oe; + } + if (!ret && post) { + oe = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post); + ret = IS_ERR(oe) ? PTR_ERR(oe) : 0; + if (!ret && ret_post) + *ret_post = oe; + } return ret; } diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 18007f9c00ad..933f6f0d8c10 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -212,8 +212,10 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start, struct extent_state **cached_state); bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end, struct extent_state **cached_state); -int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre, - u64 post); +int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, + u64 pre, u64 post, + struct btrfs_ordered_extent **ret_pre, + struct btrfs_ordered_extent **ret_post); int __init ordered_data_init(void); void __cold ordered_data_exit(void); From patchwork Tue Mar 21 16:45:32 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Burkov X-Patchwork-Id: 13182943 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E23E7C6FD1D for ; Tue, 21 Mar 2023 16:46:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229916AbjCUQqZ (ORCPT ); Tue, 21 Mar 2023 12:46:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60950 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230024AbjCUQqW (ORCPT ); Tue, 21 Mar 2023 12:46:22 -0400 Received: from out5-smtp.messagingengine.com (out5-smtp.messagingengine.com [66.111.4.29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6188711656 for ; Tue, 21 Mar 2023 09:45:59 -0700 (PDT) Received: from compute1.internal (compute1.nyi.internal [10.202.2.41]) by mailout.nyi.internal (Postfix) with ESMTP id 1BEB85C00E0; Tue, 21 Mar 2023 12:45:44 -0400 (EDT) Received: from mailfrontend1 ([10.202.2.162]) by compute1.internal (MEProxy); Tue, 21 Mar 2023 12:45:44 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bur.io; h=cc :content-transfer-encoding:content-type:date:date:from:from :in-reply-to:in-reply-to:message-id:mime-version:references :reply-to:sender:subject:subject:to:to; s=fm2; t=1679417144; x= 1679503544; bh=xxveT1NlhD1MHsMmSBTdjJEHd7Q8vXuSckDasrhE94A=; b=e vNTOuVt0YsNi7ubkExamLmIFzQimYsS/bxyfEfCLeD2qp0cFW1tSsnwntHNCNzgK hk0/4Xsx7S34Yp8lZQAdb8KE96Vr9pbC3cKgG8GKiDzkNoFHqCvmkm6+P0CMGzz7 Z/vJx8dN5cpwW91NVj6x7QDZm8sDxswX16vHQbtVVvltWP76nt6jGu9cG+d6O+x1 I07pSQiklAtKjslLy8neABcdw1OptMkG7s0BMp7jcF/vzhr3FnOgmKMoOam7S5DF gaDGXCyDpcI7ZrlSEKckWHUPnJsUE2lF+tVdn7IBZUgTY7Rpyt/2M0YwI1c/vkrq iPPlzhAi4clM9aDQSvQlw== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:content-type :date:date:feedback-id:feedback-id:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:sender :subject:subject:to:to:x-me-proxy:x-me-proxy:x-me-sender :x-me-sender:x-sasl-enc; s=fm2; t=1679417144; x=1679503544; bh=x xveT1NlhD1MHsMmSBTdjJEHd7Q8vXuSckDasrhE94A=; b=rP4AIgLpFtyXSCucm 0zN2BbxValW3tolNshXODi4jOZxgSdMFEWX5M87VbLbHy80ZChh45eqMJZEaLFvW SiLomBA0Zkyk437IvRYZ9+xbYX4tkkQtcJWbGsUPW1q9j5Yj4FCi2y0eiKEBDVnH 
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v4 5/6] btrfs: fix crash with non-zero pre in btrfs_split_ordered_extent
Date: Tue, 21 Mar 2023 09:45:32 -0700
Message-Id: <4154ce05313d40d1ba18e0648536426240119f41.1679416511.git.boris@bur.io>

If pre != 0 in btrfs_split_ordered_extent, then we do the following:

1. remove ordered (at file_offset) from the rb tree
2. modify file_offset += pre
3. re-insert ordered
4. clone an ordered extent at offset 0, length pre, from ordered
5. clone an ordered extent for the post range, if necessary

Step 4 is not correct, as at this point the start of ordered is already the
end of the desired new pre extent. Further, this causes a panic when
btrfs_alloc_ordered_extent sees that the node (from the modified and
re-inserted ordered) is already present at file_offset + 0 = file_offset.

We can fix this either by using a negative offset, or by moving the clone of
the pre extent to after we remove the original one, but before we modify and
re-insert it. The former feels quite kludgy, as we would be "cloning" from
outside the range of the ordered extent, so opt for the latter, which does
have some locking annoyances.

Signed-off-by: Boris Burkov
---
 fs/btrfs/ordered-data.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 4bebebb9b434..d14a3fe1a113 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1161,6 +1161,17 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
 if (tree->last == node)
 tree->last = NULL;
+ if (pre) {
+ spin_unlock_irq(&tree->lock);
+ oe = clone_ordered_extent(ordered, 0, pre);
+ ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
+ if (!ret && ret_pre)
+ *ret_pre = oe;
+ if (ret)
+ goto out;
+ spin_lock_irq(&tree->lock);
+ }
+
 ordered->file_offset += pre;
 ordered->disk_bytenr += pre;
 ordered->num_bytes -= (pre + post);
@@ -1176,18 +1187,13 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
 spin_unlock_irq(&tree->lock);
- if (pre) {
- oe = clone_ordered_extent(ordered, 0, pre);
- ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
- if (!ret && ret_pre)
- *ret_pre = oe;
- }
- if (!ret && post) {
+ if (post) {
 oe = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post);
 ret = IS_ERR(oe) ?
PTR_ERR(oe) : 0;
 if (!ret && ret_post)
 *ret_post = oe;
 }
+out:
 return ret;
 }

From patchwork Tue Mar 21 16:45:33 2023
X-Patchwork-Submitter: Boris Burkov
X-Patchwork-Id: 13182942
From: Boris Burkov To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v4 6/6] btrfs: split partial dio bios before submit Date: Tue, 21 Mar 2023 09:45:33 -0700 Message-Id: <1216b857841d01b0494199129068baf23546959e.1679416511.git.boris@bur.io> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org If an application is doing direct io to a btrfs file and experiences a page fault reading from the write buffer, iomap will issue a partial bio, and allow the fs to keep going. However, there was a subtle bug in this codepath in the btrfs dio iomap implementation that led to the partial write ending up as a gap in the file's extents and to be read back as zeros. The sequence of events in a partial write, lightly summarized and trimmed down for brevity is as follows: ====WRITING TASK==== btrfs_direct_write __iomap_dio_write iomap_iter btrfs_dio_iomap_begin # create full ordered extent iomap_dio_bio_iter bio_iov_iter_get_pages # page fault; partial read submit_bio # partial bio iomap_iter btrfs_dio_iomap_end btrfs_mark_ordered_io_finished # sets BTRFS_ORDERED_IOERR; # submit to finish_ordered_fn wq fault_in_iov_iter_readable # btrfs_direct_write detects partial write __iomap_dio_write iomap_iter btrfs_dio_iomap_begin # create second partial ordered extent iomap_dio_bio_iter bio_iov_iter_get_pages # read all of remainder submit_bio # partial bio with all of remainder iomap_iter btrfs_dio_iomap_end # nothing exciting to do with ordered io ====DIO ENDIO==== ==FIRST PARTIAL BIO== btrfs_dio_end_io btrfs_mark_ordered_io_finished # bytes_left > 0 # don't submit to finish_ordered_fn wq ==SECOND PARTIAL BIO== btrfs_dio_end_io btrfs_mark_ordered_io_finished # bytes_left == 0 # submit to finish_ordered_fn wq ====BTRFS FINISH ORDERED WQ==== ==FIRST PARTIAL BIO== btrfs_finish_ordered_io # called by dio_iomap_end_io, sees # BTRFS_ORDERED_IOERR, just drops the # ordered_extent ==SECOND PARTIAL BIO== btrfs_finish_ordered_io # called by btrfs_dio_end_io, writes out file # extents, csums, etc... The essence of the problem is that while btrfs_direct_write and iomap properly interact to submit all the correct bios, there is insufficient logic in the btrfs dio functions (btrfs_dio_iomap_begin, btrfs_dio_submit_io, btrfs_dio_end_io, and btrfs_dio_iomap_end) to ensure that every bio is at least a part of a completed ordered_extent. And it is completing an ordered_extent that results in crucial functionality like writing out a file extent for the range. More specifically, btrfs_dio_end_io treats the ordered extent as unfinished but btrfs_dio_iomap_end sets BTRFS_ORDERED_IOERR on it. Thus, the finish io work doesn't result in file extents, csums, etc... In the aftermath, such a file behaves as though it has a hole in it, instead of the purportedly written data. We considered a few options for fixing the bug (apologies for any incorrect summary of a proposal which I didn't implement and fully understand): 1. treat the partial bio as if we had truncated the file, which would result in properly finishing it. 2. split the ordered extent when submitting a partial bio. 3. cache the ordered extent across calls to __iomap_dio_rw in iter->private, so that we could reuse it and correctly apply several bios to it. I had trouble with 1, and it felt the most like a hack, so I tried 2 and 3. 
Since 3 has the benefit of also not creating an extra file extent, and avoids an ordered extent lookup during bio submission, it felt like the best option. However, that turned out to re-introduce a deadlock which this code discarding the ordered_extent between faults was meant to fix in the first place. (Link to an explanation of the deadlock below) Therefore, go with fix #2, which requires a bit more setup work but fixes the corruption without introducing the deadlock, which is fundamentally caused by the ordered extent existing when we attempt to fault in a range that overlaps with it. Put succinctly, what this patch does is: when we submit a dio bio, check if it is partial against the ordered extent stored in dio_data, and if it is, extract the ordered_extent that matches the bio exactly out of the larger ordered_extent. Keep the remaining ordered_extent around in dio_data for cancellation in iomap_end. Thanks to Josef, Christoph, and Filipe with their help figuring out the bug and the fix. Fixes: 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO reads and writes") Link: https://bugzilla.redhat.com/show_bug.cgi?id=2169947 Link: https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/ Link: https://pastebin.com/3SDaH8C6 Link: https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/#t Signed-off-by: Boris Burkov --- fs/btrfs/inode.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index dbea124c9fa3..69fdcbb89522 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7815,6 +7815,7 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, struct btrfs_dio_private *dip = container_of(bbio, struct btrfs_dio_private, bbio); struct btrfs_dio_data *dio_data = iter->private; + int err = 0; btrfs_bio_init(bbio, BTRFS_I(iter->inode), btrfs_dio_end_io, bio->bi_private); bbio->file_offset = file_offset; @@ -7823,7 +7824,25 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, dip->bytes = bio->bi_iter.bi_size; dio_data->submitted += bio->bi_iter.bi_size; - btrfs_submit_bio(bbio, 0); + /* + * Check if we are doing a partial write. If we are, we need to split + * the ordered extent to match the submitted bio. Hang on to the + * remaining unfinishable ordered_extent in dio_data so that it can be + * cancelled in iomap_end to avoid a deadlock wherein faulting the + * remaining pages is blocked on the outstanding ordered extent. + */ + if (iter->flags & IOMAP_WRITE) { + struct btrfs_ordered_extent *ordered = dio_data->ordered; + + ASSERT(ordered); + if (bio->bi_iter.bi_size < ordered->num_bytes) + err = btrfs_extract_ordered_extent_bio(bbio, ordered, NULL, + &dio_data->ordered); + } + if (err) + btrfs_bio_end_io(bbio, err); + else + btrfs_submit_bio(bbio, 0); } static const struct iomap_ops btrfs_dio_iomap_ops = {
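Taking the series as a whole, the key arithmetic is how a partial dio bio is
carved out of the ordered extent it was issued against: a "pre" remainder in
front of the bio, the bio-sized middle, and a "post" remainder behind it, with
no split needed when the bio covers the whole thing. The standalone C sketch
below illustrates just that arithmetic with simplified fields of my own; the
real code works on disk_bytenr/disk_num_bytes, also splits the extent map, and
keeps the leftover ordered extent in dio_data so btrfs_dio_iomap_end can
cancel it.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct range {
	uint64_t start;
	uint64_t len;
};

/*
 * Split @ordered around @bio. Returns the sizes of the leading and trailing
 * remainders through @pre and @post; both are zero when the bio already
 * covers the whole ordered extent (the "no need to split" case).
 */
static void extract_bio_range(const struct range *ordered,
			      const struct range *bio,
			      uint64_t *pre, uint64_t *post)
{
	uint64_t ordered_end = ordered->start + ordered->len;
	uint64_t bio_end = bio->start + bio->len;

	/* The bio must lie entirely inside the ordered extent. */
	assert(bio->start >= ordered->start);
	assert(bio_end <= ordered_end);

	*pre = bio->start - ordered->start;
	*post = ordered_end - bio_end;
}

int main(void)
{
	/*
	 * A 1 MiB ordered extent, of which only the first 256 KiB made it
	 * into the submitted bio (e.g. a page fault stopped the copy).
	 */
	struct range ordered = { .start = 1 << 20, .len = 1 << 20 };
	struct range bio = { .start = 1 << 20, .len = 256 << 10 };
	uint64_t pre, post;

	extract_bio_range(&ordered, &bio, &pre, &post);
	printf("pre: %llu bytes, bio: %llu bytes, post: %llu bytes\n",
	       (unsigned long long)pre, (unsigned long long)bio.len,
	       (unsigned long long)post);
	return 0;
}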