From patchwork Tue Mar 21 16:45:28 2023
X-Patchwork-Submitter: Boris Burkov
X-Patchwork-Id: 13182937
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v4 1/6] btrfs: add function to create and return an ordered extent
Date:
Tue, 21 Mar 2023 09:45:28 -0700 Message-Id: <3fac8b7cb05dabbb11205aa9076c889ca2894eb3.1679416511.git.boris@bur.io> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Currently, btrfs_add_ordered_extent allocates a new ordered extent, adds it to the rb_tree, but doesn't return a referenced pointer to the caller. There are cases where it is useful for the creator of a new ordered_extent to hang on to such a pointer, so add a new function btrfs_alloc_ordered_extent which is the same as btrfs_add_ordered_extent, except it takes an additional reference count and returns a pointer to the ordered_extent. Implement btrfs_add_ordered_extent as btrfs_alloc_ordered_extent followed by dropping the new reference and handling the IS_ERR case. The type of flags in btrfs_alloc_ordered_extent and btrfs_add_ordered_extent is changed from unsigned int to unsigned long so it's unified with the other ordered extent functions. Reviewed-by: Filipe Manana Reviewed-by: Christoph Hellwig Signed-off-by: Boris Burkov Signed-off-by: David Sterba --- fs/btrfs/ordered-data.c | 46 +++++++++++++++++++++++++++++++++-------- fs/btrfs/ordered-data.h | 7 ++++++- 2 files changed, 43 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 6c24b69e2d0a..1848d0d1a9c4 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -160,14 +160,16 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree, * @compress_type: Compression algorithm used for data. * * Most of these parameters correspond to &struct btrfs_file_extent_item. The - * tree is given a single reference on the ordered extent that was inserted. + * tree is given a single reference on the ordered extent that was inserted, and + * the returned pointer is given a second reference. * - * Return: 0 or -ENOMEM. + * Return: the new ordered extent or ERR_PTR(-ENOMEM). */ -int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, - u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, - u64 disk_num_bytes, u64 offset, unsigned flags, - int compress_type) +struct btrfs_ordered_extent *btrfs_alloc_ordered_extent( + struct btrfs_inode *inode, u64 file_offset, + u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, + u64 disk_num_bytes, u64 offset, unsigned long flags, + int compress_type) { struct btrfs_root *root = inode->root; struct btrfs_fs_info *fs_info = root->fs_info; @@ -181,7 +183,7 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, /* For nocow write, we can release the qgroup rsv right now */ ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes); if (ret < 0) - return ret; + return ERR_PTR(ret); ret = 0; } else { /* @@ -190,11 +192,11 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, */ ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes); if (ret < 0) - return ret; + return ERR_PTR(ret); } entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS); if (!entry) - return -ENOMEM; + return ERR_PTR(-ENOMEM); entry->file_offset = file_offset; entry->num_bytes = num_bytes; @@ -256,6 +258,32 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, btrfs_mod_outstanding_extents(inode, 1); spin_unlock(&inode->lock); + /* One ref for the returned entry to match semantics of lookup. 
*/ + refcount_inc(&entry->refs); + + return entry; +} + +/* + * Add a new btrfs_ordered_extent for the range, but drop the reference instead + * of returning it to the caller. + */ +int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, + u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, + u64 disk_num_bytes, u64 offset, unsigned long flags, + int compress_type) +{ + struct btrfs_ordered_extent *ordered; + + ordered = btrfs_alloc_ordered_extent(inode, file_offset, num_bytes, + ram_bytes, disk_bytenr, + disk_num_bytes, offset, flags, + compress_type); + + if (IS_ERR(ordered)) + return PTR_ERR(ordered); + btrfs_put_ordered_extent(ordered); + return 0; } diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index eb40cb39f842..18007f9c00ad 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -178,9 +178,14 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode, bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode, struct btrfs_ordered_extent **cached, u64 file_offset, u64 io_size); +struct btrfs_ordered_extent *btrfs_alloc_ordered_extent( + struct btrfs_inode *inode, u64 file_offset, + u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, + u64 disk_num_bytes, u64 offset, unsigned long flags, + int compress_type); int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, - u64 disk_num_bytes, u64 offset, unsigned flags, + u64 disk_num_bytes, u64 offset, unsigned long flags, int compress_type); void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry, struct btrfs_ordered_sum *sum); From patchwork Tue Mar 21 16:45:29 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Burkov X-Patchwork-Id: 13182938 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D7988C7619A for ; Tue, 21 Mar 2023 16:46:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230120AbjCUQqH (ORCPT ); Tue, 21 Mar 2023 12:46:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59978 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230024AbjCUQp5 (ORCPT ); Tue, 21 Mar 2023 12:45:57 -0400 Received: from out5-smtp.messagingengine.com (out5-smtp.messagingengine.com [66.111.4.29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4581852915 for ; Tue, 21 Mar 2023 09:45:40 -0700 (PDT) Received: from compute5.internal (compute5.nyi.internal [10.202.2.45]) by mailout.nyi.internal (Postfix) with ESMTP id DA1B25C0194; Tue, 21 Mar 2023 12:45:38 -0400 (EDT) Received: from mailfrontend2 ([10.202.2.163]) by compute5.internal (MEProxy); Tue, 21 Mar 2023 12:45:38 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bur.io; h=cc :content-transfer-encoding:content-type:date:date:from:from :in-reply-to:in-reply-to:message-id:mime-version:references :reply-to:sender:subject:subject:to:to; s=fm2; t=1679417138; x= 1679503538; bh=zrGpKiPh2OFEqA9I9wGR/IEfFSnJXMuned77DwsLncQ=; b=Y V3qDvzErkgvuzQOyL3SIqvTAXBvVmTQDLV7Ay/XpyIWWpH1CJmZ3ylPjAd/lIong cJyAP7cqs3E38Z8yiBMPN8CuNcirvTr/z0FO522bQczISyFktHScYTNHptryXr4S LEb39L41tItpWvszaVYN2eVQ6nc2MJH9lit2n5e9fI5jQ1uZkYf74h7SlKx9DCaj wzz3NScq7+FZxWsQmrwzG375msVzGy3voe6SqV9SqLk94JhX2I71mfJetnKvUNPw 
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v4 2/6] btrfs: stash ordered extent in dio_data during iomap dio
Date: Tue, 21 Mar 2023 09:45:29 -0700

While it is not feasible for an ordered extent to survive across the calls
btrfs_direct_write makes into __iomap_dio_rw, it is still helpful to stash it
on the dio_data in between creating it in iomap_begin and finishing it in
either end_io or iomap_end. The specific use I have in mind is that we can
check whether a particular bio is partial in submit_io without unconditionally
looking up the ordered extent. This is a preparatory patch for a later patch
which does just that.
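As a rough illustration of the reference flow that patches 1 and 2 set up
(btrfs_alloc_ordered_extent hands back a pointer that already holds an extra
reference, iomap_begin stashes it in the per-call dio_data, and iomap_end
drops the stashed reference while the tree's own reference survives until the
ordered IO finishes), here is a minimal userspace C sketch. The struct names
and the plain-int refcount are simplified stand-ins of my own, not kernel
code.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for struct btrfs_ordered_extent. */
struct ordered_extent {
	unsigned long long file_offset;
	unsigned long long num_bytes;
	int refs;
};

/* Simplified stand-in for struct btrfs_dio_data. */
struct dio_data {
	struct ordered_extent *ordered;	/* stashed between begin and end */
};

/* Allocate with two references: one for the "tree", one for the caller. */
static struct ordered_extent *alloc_ordered_extent(unsigned long long off,
						   unsigned long long len)
{
	struct ordered_extent *oe = calloc(1, sizeof(*oe));

	if (!oe)
		return NULL;
	oe->file_offset = off;
	oe->num_bytes = len;
	oe->refs = 2;
	return oe;
}

static void put_ordered_extent(struct ordered_extent *oe)
{
	if (oe && --oe->refs == 0)
		free(oe);
}

/* "iomap_begin": create the ordered extent and stash the returned ref. */
static int dio_iomap_begin(struct dio_data *dio, unsigned long long off,
			   unsigned long long len)
{
	assert(!dio->ordered);
	dio->ordered = alloc_ordered_extent(off, len);
	return dio->ordered ? 0 : -1;
}

/* "iomap_end": drop only the stashed reference. */
static void dio_iomap_end(struct dio_data *dio)
{
	put_ordered_extent(dio->ordered);
	dio->ordered = NULL;
}

int main(void)
{
	struct dio_data dio = { 0 };	/* zeroed, like btrfs_dio_read/write */
	struct ordered_extent *tree_ref;

	if (dio_iomap_begin(&dio, 0, 1 << 20))
		return 1;
	printf("stashed ordered extent of %llu bytes\n", dio.ordered->num_bytes);
	tree_ref = dio.ordered;		/* stands in for the tree's reference */
	dio_iomap_end(&dio);
	put_ordered_extent(tree_ref);	/* stands in for finish_ordered_io */
	return 0;
}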
Signed-off-by: Boris Burkov --- fs/btrfs/inode.c | 37 ++++++++++++++++++++++++------------- 1 file changed, 24 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 76d93b9e94a9..5ab486f448eb 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -81,6 +81,7 @@ struct btrfs_dio_data { struct extent_changeset *data_reserved; bool data_space_reserved; bool nocow_done; + struct btrfs_ordered_extent *ordered; }; struct btrfs_dio_private { @@ -6968,6 +6969,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode, } static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, + struct btrfs_dio_data *dio_data, const u64 start, const u64 len, const u64 orig_start, @@ -6978,7 +6980,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, const int type) { struct extent_map *em = NULL; - int ret; + struct btrfs_ordered_extent *ordered; if (type != BTRFS_ORDERED_NOCOW) { em = create_io_em(inode, start, len, orig_start, block_start, @@ -6988,18 +6990,21 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, if (IS_ERR(em)) goto out; } - ret = btrfs_add_ordered_extent(inode, start, len, len, block_start, - block_len, 0, - (1 << type) | - (1 << BTRFS_ORDERED_DIRECT), - BTRFS_COMPRESS_NONE); - if (ret) { + ordered = btrfs_alloc_ordered_extent(inode, start, len, len, + block_start, block_len, 0, + (1 << type) | + (1 << BTRFS_ORDERED_DIRECT), + BTRFS_COMPRESS_NONE); + if (IS_ERR(ordered)) { if (em) { free_extent_map(em); btrfs_drop_extent_map_range(inode, start, start + len - 1, false); } - em = ERR_PTR(ret); + em = ERR_PTR(PTR_ERR(ordered)); + } else { + ASSERT(!dio_data->ordered); + dio_data->ordered = ordered; } out: @@ -7007,6 +7012,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, } static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode, + struct btrfs_dio_data *dio_data, u64 start, u64 len) { struct btrfs_root *root = inode->root; @@ -7022,7 +7028,8 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode, if (ret) return ERR_PTR(ret); - em = btrfs_create_dio_extent(inode, start, ins.offset, start, + em = btrfs_create_dio_extent(inode, dio_data, + start, ins.offset, start, ins.objectid, ins.offset, ins.offset, ins.offset, BTRFS_ORDERED_REGULAR); btrfs_dec_block_group_reservations(fs_info, ins.objectid); @@ -7367,7 +7374,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, } space_reserved = true; - em2 = btrfs_create_dio_extent(BTRFS_I(inode), start, len, + em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len, orig_start, block_start, len, orig_block_len, ram_bytes, type); @@ -7409,7 +7416,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, goto out; space_reserved = true; - em = btrfs_new_extent_direct(BTRFS_I(inode), start, len); + em = btrfs_new_extent_direct(BTRFS_I(inode), dio_data, start, len); if (IS_ERR(em)) { ret = PTR_ERR(em); goto out; @@ -7715,6 +7722,10 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length, pos + length - 1, NULL); ret = -ENOTBLK; } + if (write) { + btrfs_put_ordered_extent(dio_data->ordered); + dio_data->ordered = NULL; + } if (write) extent_changeset_free(dio_data->data_reserved); @@ -7776,7 +7787,7 @@ static const struct iomap_dio_ops btrfs_dio_ops = { ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, size_t done_before) { - struct btrfs_dio_data data; + struct btrfs_dio_data data = { 0 }; 
return iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
 IOMAP_DIO_PARTIAL, &data, done_before);
@@ -7785,7 +7796,7 @@ ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, size_t done_be
 struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 size_t done_before)
 {
- struct btrfs_dio_data data;
+ struct btrfs_dio_data data = { 0 };
 return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
 IOMAP_DIO_PARTIAL, &data, done_before);

From patchwork Tue Mar 21 16:45:30 2023
X-Patchwork-Submitter: Boris Burkov
X-Patchwork-Id: 13182941
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v4 3/6] btrfs: repurpose split_zoned_em for partial dio splitting
Date: Tue, 21 Mar 2023 09:45:30 -0700
Message-Id: <5faf0148f526b4e9eb373c177de3c70284999ce7.1679416511.git.boris@bur.io>

In a subsequent patch I will be "extracting" a partial dio write bio from its
ordered extent, creating a 1:1 bio<->ordered_extent relation. This is
necessary to avoid triggering an assertion in unpin_extent_cache, called from
btrfs_finish_ordered_io, that checks that the em matches the finished ordered
extent.

Since there is already a function which splits an uncompressed extent_map for
a zoned bio use case, adapt it to this new, similar use case. Mostly, modify
it to handle the case where the extent_map is bigger than the ordered_extent,
and we cannot assume the em "post" split can be computed from the
ordered_extent and bio. This comes up in btrfs/250, for example. I felt that
these relaxations were not so damaging to the legibility of the zoned case as
to merit a fully separate codepath, but I admit that is not my area of
expertise.

Signed-off-by: Boris Burkov
---
 fs/btrfs/inode.c | 104 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 71 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5ab486f448eb..2f8baf4797ea 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2512,37 +2512,59 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 }

 /*
- * Split an extent_map at [start, start + len]
+ * Split out a middle extent_map [start, start+len] from within an extent_map.
 *
- * This function is intended to be used only for extract_ordered_extent().
+ * @inode: the inode to which the extent map belongs.
+ * @start: the start index of the middle split
+ * @len: the length of the middle split
+ *
+ * The result is two or three extent_maps inserted in the tree, depending on
+ * whether start and len imply an uncovered area at the beginning or end of the
+ * extent map. If the implied split lines up with the end or beginning, there
+ * will only be two extent maps in the resulting split, otherwise there will be
+ * three. (If they both match, the split operation is a no-op)
+ *
+ * extent map splitting assumptions:
+ * end = start + len
+ * em-end = em-start + em-len
+ * start >= em-start
+ * len < em-len
+ * end <= em-end
+ *
+ * Diagrams explaining the splitting cases:
+ * original em:
+ * [em-start---start---end---em-end)
+ * resulting ems:
+ * start != em-start && end != em-end (full tri split):
+ * [em-start---start) [start---end) [end---em-end)
+ * start == em-start (no pre split):
+ * [em-start---end) [end---em-end)
+ * end == em-end (no post split):
+ * [em-start---start) [start--em-end)
+ *
+ * Returns: 0 on success, -errno on failure.
*/ -static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len, - u64 pre, u64 post) +static int split_em(struct btrfs_inode *inode, u64 start, u64 len) { struct extent_map_tree *em_tree = &inode->extent_tree; struct extent_map *em; + u64 pre_start, pre_len, pre_end; + u64 mid_start, mid_len, mid_end; + u64 post_start, post_len, post_end; struct extent_map *split_pre = NULL; struct extent_map *split_mid = NULL; struct extent_map *split_post = NULL; int ret = 0; unsigned long flags; - /* Sanity check */ - if (pre == 0 && post == 0) - return 0; - split_pre = alloc_extent_map(); - if (pre) - split_mid = alloc_extent_map(); - if (post) - split_post = alloc_extent_map(); - if (!split_pre || (pre && !split_mid) || (post && !split_post)) { + split_mid = alloc_extent_map(); + split_post = alloc_extent_map(); + if (!split_pre || !split_mid || !split_post) { ret = -ENOMEM; goto out; } - ASSERT(pre + post < len); - lock_extent(&inode->io_tree, start, start + len - 1, NULL); write_lock(&em_tree->lock); em = lookup_extent_mapping(em_tree, start, len); @@ -2551,19 +2573,38 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len, goto out_unlock; } - ASSERT(em->len == len); + pre_start = em->start; + pre_end = start; + pre_len = pre_end - pre_start; + mid_start = start; + mid_end = start + len; + mid_len = len; + post_start = mid_end; + post_end = em->start + em->len; + post_len = post_end - post_start; + ASSERT(pre_start == em->start); + ASSERT(pre_start + pre_len == mid_start); + ASSERT(mid_start + mid_len == post_start); + ASSERT(post_start + post_len == em->start + em->len); + + /* Sanity check */ + if (pre_len == 0 && post_len == 0) { + ret = 0; + goto out_unlock; + } + ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)); ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE); - ASSERT(test_bit(EXTENT_FLAG_PINNED, &em->flags)); ASSERT(!test_bit(EXTENT_FLAG_LOGGING, &em->flags)); ASSERT(!list_empty(&em->list)); flags = em->flags; - clear_bit(EXTENT_FLAG_PINNED, &em->flags); + if (test_bit(EXTENT_FLAG_PINNED, &em->flags)) + clear_bit(EXTENT_FLAG_PINNED, &em->flags); /* First, replace the em with a new extent_map starting from * em->start */ split_pre->start = em->start; - split_pre->len = (pre ? pre : em->len - post); + split_pre->len = (pre_len ? 
pre_len : mid_len); split_pre->orig_start = split_pre->start; split_pre->block_start = em->block_start; split_pre->block_len = split_pre->len; @@ -2577,16 +2618,15 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len, /* * Now we only have an extent_map at: - * [em->start, em->start + pre] if pre != 0 - * [em->start, em->start + em->len - post] if pre == 0 + * [em->start, em->start + pre_len] if pre_len != 0 + * [em->start, em->start + mid_len] if pre == 0 */ - - if (pre) { + if (pre_len) { /* Insert the middle extent_map */ - split_mid->start = em->start + pre; - split_mid->len = em->len - pre - post; + split_mid->start = mid_start; + split_mid->len = mid_len; split_mid->orig_start = split_mid->start; - split_mid->block_start = em->block_start + pre; + split_mid->block_start = em->block_start + pre_len; split_mid->block_len = split_mid->len; split_mid->orig_block_len = split_mid->block_len; split_mid->ram_bytes = split_mid->len; @@ -2596,11 +2636,11 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len, add_extent_mapping(em_tree, split_mid, 1); } - if (post) { - split_post->start = em->start + em->len - post; - split_post->len = post; + if (post_len) { + split_post->start = post_start; + split_post->len = post_len; split_post->orig_start = split_post->start; - split_post->block_start = em->block_start + em->len - post; + split_post->block_start = em->block_start + pre_len + mid_len; split_post->block_len = split_post->len; split_post->orig_block_len = split_post->block_len; split_post->ram_bytes = split_post->len; @@ -2632,7 +2672,6 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio) u64 len = bbio->bio.bi_iter.bi_size; struct btrfs_inode *inode = bbio->inode; struct btrfs_ordered_extent *ordered; - u64 file_len; u64 end = start + len; u64 ordered_end; u64 pre, post; @@ -2671,14 +2710,13 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio) goto out; } - file_len = ordered->num_bytes; pre = start - ordered->disk_bytenr; post = ordered_end - end; ret = btrfs_split_ordered_extent(ordered, pre, post); if (ret) goto out; - ret = split_zoned_em(inode, bbio->file_offset, file_len, pre, post); + ret = split_em(inode, bbio->file_offset, len); out: btrfs_put_ordered_extent(ordered); From patchwork Tue Mar 21 16:45:31 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Burkov X-Patchwork-Id: 13182940 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9CA8CC6FD1D for ; Tue, 21 Mar 2023 16:46:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229996AbjCUQqL (ORCPT ); Tue, 21 Mar 2023 12:46:11 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60286 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230220AbjCUQqC (ORCPT ); Tue, 21 Mar 2023 12:46:02 -0400 Received: from out5-smtp.messagingengine.com (out5-smtp.messagingengine.com [66.111.4.29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C092F51CA6 for ; Tue, 21 Mar 2023 09:45:44 -0700 (PDT) Received: from compute3.internal (compute3.nyi.internal [10.202.2.43]) by mailout.nyi.internal (Postfix) with ESMTP id 5065D5C014B; Tue, 21 Mar 2023 12:45:42 -0400 (EDT) Received: from mailfrontend2 ([10.202.2.163]) by compute3.internal 
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v4 4/6] btrfs: return ordered_extent splits from bio extraction
Date: Tue, 21 Mar 2023 09:45:31 -0700

When extracting a bio from its ordered extent for dio partial writes, we need
the "remainder" ordered extent. It would be possible to look it up in that
case, but since we can grab the ordered_extent from the new allocation
function, we might as well wire it up to be returned to the caller via an out
parameter and save that lookup.

Refactor the clone ordered extent function to return the new ordered extent,
then refactor the split and extract functions to pass back the new pre and
post split ordered extents via output parameters.
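The calling convention this patch gives btrfs_split_ordered_extent (a clone
helper that returns either a valid pointer or an encoded errno, and a split
function that hands the optional pre/post results back through nullable out
parameters) can be modeled outside the kernel. The sketch below is my own
illustration with invented names, not the btrfs code; it only mirrors the
ERR_PTR/IS_ERR/PTR_ERR plus out-parameter pattern.

#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins for the kernel's pointer-encoded error helpers. */
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct piece {
	uint64_t offset;
	uint64_t len;
};

/* Clone [pos, pos + len) out of @src; returns a piece or an ERR_PTR. */
static struct piece *clone_piece(const struct piece *src, uint64_t pos,
				 uint64_t len)
{
	struct piece *p;

	if (pos + len > src->len)
		return ERR_PTR(-EINVAL);
	p = malloc(sizeof(*p));
	if (!p)
		return ERR_PTR(-ENOMEM);
	p->offset = src->offset + pos;
	p->len = len;
	return p;
}

/*
 * Trim @pre bytes off the front and @post bytes off the back of @p, handing
 * the trimmed pieces back through optional out parameters (callers that do
 * not care simply pass NULL).
 */
static int split_piece(struct piece *p, uint64_t pre, uint64_t post,
		       struct piece **ret_pre, struct piece **ret_post)
{
	struct piece *oe;

	if (pre) {
		oe = clone_piece(p, 0, pre);
		if (IS_ERR(oe))
			return PTR_ERR(oe);
		if (ret_pre)
			*ret_pre = oe;
		else
			free(oe);
	}
	if (post) {
		oe = clone_piece(p, p->len - post, post);
		if (IS_ERR(oe))
			return PTR_ERR(oe);
		if (ret_post)
			*ret_post = oe;
		else
			free(oe);
	}
	p->offset += pre;
	p->len -= pre + post;
	return 0;
}

int main(void)
{
	struct piece whole = { .offset = 0, .len = 4096 };
	struct piece *pre = NULL, *post = NULL;

	if (split_piece(&whole, 512, 1024, &pre, &post))
		return 1;
	printf("pre: %" PRIu64 " bytes, mid: %" PRIu64 " bytes, post: %" PRIu64 " bytes\n",
	       pre->len, whole.len, post->len);
	free(pre);
	free(post);
	return 0;
}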
Signed-off-by: Boris Burkov --- fs/btrfs/bio.c | 2 +- fs/btrfs/btrfs_inode.h | 5 ++++- fs/btrfs/inode.c | 23 ++++++++++++++++++----- fs/btrfs/ordered-data.c | 36 +++++++++++++++++++++++------------- fs/btrfs/ordered-data.h | 6 ++++-- 5 files changed, 50 insertions(+), 22 deletions(-) diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c index cf09c6271edb..b849ced40d37 100644 --- a/fs/btrfs/bio.c +++ b/fs/btrfs/bio.c @@ -653,7 +653,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num) if (use_append) { bio->bi_opf &= ~REQ_OP_WRITE; bio->bi_opf |= REQ_OP_ZONE_APPEND; - ret = btrfs_extract_ordered_extent(bbio); + ret = btrfs_extract_ordered_extent_bio(bbio, NULL, NULL, NULL); if (ret) goto fail_put_bio; } diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 9dc21622806e..e92a09559058 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -407,7 +407,10 @@ static inline void btrfs_inode_split_flags(u64 inode_item_flags, int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page, u32 pgoff, u8 *csum, const u8 * const csum_expected); -blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio); +blk_status_t btrfs_extract_ordered_extent_bio(struct btrfs_bio *bbio, + struct btrfs_ordered_extent *ordered, + struct btrfs_ordered_extent **ret_pre, + struct btrfs_ordered_extent **ret_post); bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev, u32 bio_offset, struct bio_vec *bv); noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 2f8baf4797ea..dbea124c9fa3 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2666,21 +2666,35 @@ static int split_em(struct btrfs_inode *inode, u64 start, u64 len) return ret; } -blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio) +/* + * Extract a bio from an ordered extent to enforce an invariant where the bio + * fully matches a single ordered extent. + * + * @bbio: the bio to extract. + * @ordered: the ordered extent the bio is in, will be shrunk to fit. If NULL we + * will look it up. + * @ret_pre: out parameter to return the new oe in front of the bio, if needed. + * @ret_post: out parameter to return the new oe past the bio, if needed. 
+ */ +blk_status_t btrfs_extract_ordered_extent_bio(struct btrfs_bio *bbio, + struct btrfs_ordered_extent *ordered, + struct btrfs_ordered_extent **ret_pre, + struct btrfs_ordered_extent **ret_post) { u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT; u64 len = bbio->bio.bi_iter.bi_size; struct btrfs_inode *inode = bbio->inode; - struct btrfs_ordered_extent *ordered; u64 end = start + len; u64 ordered_end; u64 pre, post; int ret = 0; - ordered = btrfs_lookup_ordered_extent(inode, bbio->file_offset); + if (!ordered) + ordered = btrfs_lookup_ordered_extent(inode, bbio->file_offset); if (WARN_ON_ONCE(!ordered)) return BLK_STS_IOERR; + ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes; /* No need to split */ if (ordered->disk_num_bytes == len) goto out; @@ -2697,7 +2711,6 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio) goto out; } - ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes; /* bio must be in one ordered extent */ if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) { ret = -EINVAL; @@ -2713,7 +2726,7 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio) pre = start - ordered->disk_bytenr; post = ordered_end - end; - ret = btrfs_split_ordered_extent(ordered, pre, post); + ret = btrfs_split_ordered_extent(ordered, pre, post, ret_pre, ret_post); if (ret) goto out; ret = split_em(inode, bbio->file_offset, len); diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 1848d0d1a9c4..4bebebb9b434 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -1117,8 +1117,8 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end, } -static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos, - u64 len) +static struct btrfs_ordered_extent *clone_ordered_extent(struct btrfs_ordered_extent *ordered, + u64 pos, u64 len) { struct inode *inode = ordered->inode; struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; @@ -1133,18 +1133,22 @@ static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos, percpu_counter_add_batch(&fs_info->ordered_bytes, -len, fs_info->delalloc_batch); WARN_ON_ONCE(flags & (1 << BTRFS_ORDERED_COMPRESSED)); - return btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, len, len, - disk_bytenr, len, 0, flags, - ordered->compress_type); + return btrfs_alloc_ordered_extent(BTRFS_I(inode), file_offset, len, len, + disk_bytenr, len, 0, flags, + ordered->compress_type); } -int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre, - u64 post) +int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, + u64 pre, u64 post, + struct btrfs_ordered_extent **ret_pre, + struct btrfs_ordered_extent **ret_post) + { struct inode *inode = ordered->inode; struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree; struct rb_node *node; struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_ordered_extent *oe; int ret = 0; trace_btrfs_ordered_extent_split(BTRFS_I(inode), ordered); @@ -1172,12 +1176,18 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre, spin_unlock_irq(&tree->lock); - if (pre) - ret = clone_ordered_extent(ordered, 0, pre); - if (ret == 0 && post) - ret = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, - post); - + if (pre) { + oe = clone_ordered_extent(ordered, 0, pre); + ret = IS_ERR(oe) ? 
PTR_ERR(oe) : 0; + if (!ret && ret_pre) + *ret_pre = oe; + } + if (!ret && post) { + oe = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post); + ret = IS_ERR(oe) ? PTR_ERR(oe) : 0; + if (!ret && ret_post) + *ret_post = oe; + } return ret; } diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 18007f9c00ad..933f6f0d8c10 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -212,8 +212,10 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start, struct extent_state **cached_state); bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end, struct extent_state **cached_state); -int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre, - u64 post); +int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, + u64 pre, u64 post, + struct btrfs_ordered_extent **ret_pre, + struct btrfs_ordered_extent **ret_post); int __init ordered_data_init(void); void __cold ordered_data_exit(void); From patchwork Tue Mar 21 16:45:32 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Burkov X-Patchwork-Id: 13182943 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E23E7C6FD1D for ; Tue, 21 Mar 2023 16:46:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229916AbjCUQqZ (ORCPT ); Tue, 21 Mar 2023 12:46:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60950 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230024AbjCUQqW (ORCPT ); Tue, 21 Mar 2023 12:46:22 -0400 Received: from out5-smtp.messagingengine.com (out5-smtp.messagingengine.com [66.111.4.29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6188711656 for ; Tue, 21 Mar 2023 09:45:59 -0700 (PDT) Received: from compute1.internal (compute1.nyi.internal [10.202.2.41]) by mailout.nyi.internal (Postfix) with ESMTP id 1BEB85C00E0; Tue, 21 Mar 2023 12:45:44 -0400 (EDT) Received: from mailfrontend1 ([10.202.2.162]) by compute1.internal (MEProxy); Tue, 21 Mar 2023 12:45:44 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bur.io; h=cc :content-transfer-encoding:content-type:date:date:from:from :in-reply-to:in-reply-to:message-id:mime-version:references :reply-to:sender:subject:subject:to:to; s=fm2; t=1679417144; x= 1679503544; bh=xxveT1NlhD1MHsMmSBTdjJEHd7Q8vXuSckDasrhE94A=; b=e vNTOuVt0YsNi7ubkExamLmIFzQimYsS/bxyfEfCLeD2qp0cFW1tSsnwntHNCNzgK hk0/4Xsx7S34Yp8lZQAdb8KE96Vr9pbC3cKgG8GKiDzkNoFHqCvmkm6+P0CMGzz7 Z/vJx8dN5cpwW91NVj6x7QDZm8sDxswX16vHQbtVVvltWP76nt6jGu9cG+d6O+x1 I07pSQiklAtKjslLy8neABcdw1OptMkG7s0BMp7jcF/vzhr3FnOgmKMoOam7S5DF gaDGXCyDpcI7ZrlSEKckWHUPnJsUE2lF+tVdn7IBZUgTY7Rpyt/2M0YwI1c/vkrq iPPlzhAi4clM9aDQSvQlw== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:content-type :date:date:feedback-id:feedback-id:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:sender :subject:subject:to:to:x-me-proxy:x-me-proxy:x-me-sender :x-me-sender:x-sasl-enc; s=fm2; t=1679417144; x=1679503544; bh=x xveT1NlhD1MHsMmSBTdjJEHd7Q8vXuSckDasrhE94A=; b=rP4AIgLpFtyXSCucm 0zN2BbxValW3tolNshXODi4jOZxgSdMFEWX5M87VbLbHy80ZChh45eqMJZEaLFvW SiLomBA0Zkyk437IvRYZ9+xbYX4tkkQtcJWbGsUPW1q9j5Yj4FCi2y0eiKEBDVnH 
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v4 5/6] btrfs: fix crash with non-zero pre in btrfs_split_ordered_extent
Date: Tue, 21 Mar 2023 09:45:32 -0700
Message-Id: <4154ce05313d40d1ba18e0648536426240119f41.1679416511.git.boris@bur.io>

If pre != 0 in btrfs_split_ordered_extent, then we do the following:

1. remove ordered (at file_offset) from the rb tree
2. modify file_offset += pre
3. re-insert ordered
4. clone an ordered extent at offset 0, length pre, from ordered
5. clone an ordered extent for the post range, if necessary

Step 4 is not correct, as at this point the start of ordered is already the
end of the desired new pre extent. Further, this causes a panic when
btrfs_alloc_ordered_extent sees that the node (from the modified and
re-inserted ordered) is already present at file_offset + 0 = file_offset.

We can fix this either by using a negative offset, or by moving the clone of
the pre extent to after we remove the original one, but before we modify and
re-insert it. The former feels quite kludgy, as we would be "cloning" from
outside the range of the ordered extent, so opt for the latter, which does
have some locking annoyances.

Signed-off-by: Boris Burkov
---
 fs/btrfs/ordered-data.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 4bebebb9b434..d14a3fe1a113 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1161,6 +1161,17 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
 if (tree->last == node)
 tree->last = NULL;
+ if (pre) {
+ spin_unlock_irq(&tree->lock);
+ oe = clone_ordered_extent(ordered, 0, pre);
+ ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
+ if (!ret && ret_pre)
+ *ret_pre = oe;
+ if (ret)
+ goto out;
+ spin_lock_irq(&tree->lock);
+ }
+
 ordered->file_offset += pre;
 ordered->disk_bytenr += pre;
 ordered->num_bytes -= (pre + post);
@@ -1176,18 +1187,13 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
 spin_unlock_irq(&tree->lock);
- if (pre) {
- oe = clone_ordered_extent(ordered, 0, pre);
- ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
- if (!ret && ret_pre)
- *ret_pre = oe;
- }
- if (!ret && post) {
+ if (post) {
 oe = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post);
 ret = IS_ERR(oe) ?
PTR_ERR(oe) : 0;
 if (!ret && ret_post)
 *ret_post = oe;
 }
+out:
 return ret;
 }

From patchwork Tue Mar 21 16:45:33 2023
X-Patchwork-Submitter: Boris Burkov
X-Patchwork-Id: 13182942
From: Boris Burkov To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v4 6/6] btrfs: split partial dio bios before submit Date: Tue, 21 Mar 2023 09:45:33 -0700 Message-Id: <1216b857841d01b0494199129068baf23546959e.1679416511.git.boris@bur.io> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org If an application is doing direct io to a btrfs file and experiences a page fault reading from the write buffer, iomap will issue a partial bio, and allow the fs to keep going. However, there was a subtle bug in this codepath in the btrfs dio iomap implementation that led to the partial write ending up as a gap in the file's extents and to be read back as zeros. The sequence of events in a partial write, lightly summarized and trimmed down for brevity is as follows: ====WRITING TASK==== btrfs_direct_write __iomap_dio_write iomap_iter btrfs_dio_iomap_begin # create full ordered extent iomap_dio_bio_iter bio_iov_iter_get_pages # page fault; partial read submit_bio # partial bio iomap_iter btrfs_dio_iomap_end btrfs_mark_ordered_io_finished # sets BTRFS_ORDERED_IOERR; # submit to finish_ordered_fn wq fault_in_iov_iter_readable # btrfs_direct_write detects partial write __iomap_dio_write iomap_iter btrfs_dio_iomap_begin # create second partial ordered extent iomap_dio_bio_iter bio_iov_iter_get_pages # read all of remainder submit_bio # partial bio with all of remainder iomap_iter btrfs_dio_iomap_end # nothing exciting to do with ordered io ====DIO ENDIO==== ==FIRST PARTIAL BIO== btrfs_dio_end_io btrfs_mark_ordered_io_finished # bytes_left > 0 # don't submit to finish_ordered_fn wq ==SECOND PARTIAL BIO== btrfs_dio_end_io btrfs_mark_ordered_io_finished # bytes_left == 0 # submit to finish_ordered_fn wq ====BTRFS FINISH ORDERED WQ==== ==FIRST PARTIAL BIO== btrfs_finish_ordered_io # called by dio_iomap_end_io, sees # BTRFS_ORDERED_IOERR, just drops the # ordered_extent ==SECOND PARTIAL BIO== btrfs_finish_ordered_io # called by btrfs_dio_end_io, writes out file # extents, csums, etc... The essence of the problem is that while btrfs_direct_write and iomap properly interact to submit all the correct bios, there is insufficient logic in the btrfs dio functions (btrfs_dio_iomap_begin, btrfs_dio_submit_io, btrfs_dio_end_io, and btrfs_dio_iomap_end) to ensure that every bio is at least a part of a completed ordered_extent. And it is completing an ordered_extent that results in crucial functionality like writing out a file extent for the range. More specifically, btrfs_dio_end_io treats the ordered extent as unfinished but btrfs_dio_iomap_end sets BTRFS_ORDERED_IOERR on it. Thus, the finish io work doesn't result in file extents, csums, etc... In the aftermath, such a file behaves as though it has a hole in it, instead of the purportedly written data. We considered a few options for fixing the bug (apologies for any incorrect summary of a proposal which I didn't implement and fully understand): 1. treat the partial bio as if we had truncated the file, which would result in properly finishing it. 2. split the ordered extent when submitting a partial bio. 3. cache the ordered extent across calls to __iomap_dio_rw in iter->private, so that we could reuse it and correctly apply several bios to it. I had trouble with 1, and it felt the most like a hack, so I tried 2 and 3. 
Since 3 has the benefit of also not creating an extra file extent, and avoids an ordered extent lookup during bio submission, it felt like the best option. However, that turned out to re-introduce a deadlock which this code discarding the ordered_extent between faults was meant to fix in the first place. (Link to an explanation of the deadlock below) Therefore, go with fix #2, which requires a bit more setup work but fixes the corruption without introducing the deadlock, which is fundamentally caused by the ordered extent existing when we attempt to fault in a range that overlaps with it. Put succinctly, what this patch does is: when we submit a dio bio, check if it is partial against the ordered extent stored in dio_data, and if it is, extract the ordered_extent that matches the bio exactly out of the larger ordered_extent. Keep the remaining ordered_extent around in dio_data for cancellation in iomap_end. Thanks to Josef, Christoph, and Filipe with their help figuring out the bug and the fix. Fixes: 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO reads and writes") Link: https://bugzilla.redhat.com/show_bug.cgi?id=2169947 Link: https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/ Link: https://pastebin.com/3SDaH8C6 Link: https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/#t Signed-off-by: Boris Burkov --- fs/btrfs/inode.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index dbea124c9fa3..69fdcbb89522 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7815,6 +7815,7 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, struct btrfs_dio_private *dip = container_of(bbio, struct btrfs_dio_private, bbio); struct btrfs_dio_data *dio_data = iter->private; + int err = 0; btrfs_bio_init(bbio, BTRFS_I(iter->inode), btrfs_dio_end_io, bio->bi_private); bbio->file_offset = file_offset; @@ -7823,7 +7824,25 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio, dip->bytes = bio->bi_iter.bi_size; dio_data->submitted += bio->bi_iter.bi_size; - btrfs_submit_bio(bbio, 0); + /* + * Check if we are doing a partial write. If we are, we need to split + * the ordered extent to match the submitted bio. Hang on to the + * remaining unfinishable ordered_extent in dio_data so that it can be + * cancelled in iomap_end to avoid a deadlock wherein faulting the + * remaining pages is blocked on the outstanding ordered extent. + */ + if (iter->flags & IOMAP_WRITE) { + struct btrfs_ordered_extent *ordered = dio_data->ordered; + + ASSERT(ordered); + if (bio->bi_iter.bi_size < ordered->num_bytes) + err = btrfs_extract_ordered_extent_bio(bbio, ordered, NULL, + &dio_data->ordered); + } + if (err) + btrfs_bio_end_io(bbio, err); + else + btrfs_submit_bio(bbio, 0); } static const struct iomap_ops btrfs_dio_iomap_ops = {
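Taking the series as a whole, the key arithmetic is how a partial dio bio is
carved out of the ordered extent it was issued against: a "pre" remainder in
front of the bio, the bio-sized middle, and a "post" remainder behind it, with
no split needed when the bio covers the whole thing. The standalone C sketch
below illustrates just that arithmetic with simplified fields of my own; the
real code works on disk_bytenr/disk_num_bytes, also splits the extent map, and
keeps the leftover ordered extent in dio_data so btrfs_dio_iomap_end can
cancel it.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct range {
	uint64_t start;
	uint64_t len;
};

/*
 * Split @ordered around @bio. Returns the sizes of the leading and trailing
 * remainders through @pre and @post; both are zero when the bio already
 * covers the whole ordered extent (the "no need to split" case).
 */
static void extract_bio_range(const struct range *ordered,
			      const struct range *bio,
			      uint64_t *pre, uint64_t *post)
{
	uint64_t ordered_end = ordered->start + ordered->len;
	uint64_t bio_end = bio->start + bio->len;

	/* The bio must lie entirely inside the ordered extent. */
	assert(bio->start >= ordered->start);
	assert(bio_end <= ordered_end);

	*pre = bio->start - ordered->start;
	*post = ordered_end - bio_end;
}

int main(void)
{
	/*
	 * A 1 MiB ordered extent, of which only the first 256 KiB made it
	 * into the submitted bio (e.g. a page fault stopped the copy).
	 */
	struct range ordered = { .start = 1 << 20, .len = 1 << 20 };
	struct range bio = { .start = 1 << 20, .len = 256 << 10 };
	uint64_t pre, post;

	extract_bio_range(&ordered, &bio, &pre, &post);
	printf("pre: %llu bytes, bio: %llu bytes, post: %llu bytes\n",
	       (unsigned long long)pre, (unsigned long long)bio.len,
	       (unsigned long long)post);
	return 0;
}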