From patchwork Fri May 17 16:52:37 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 13667181 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8024013D8A2 for ; Fri, 17 May 2024 16:52:42 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715964762; cv=none; b=UIUP3e2DKiw/N4V0s2kkOnXFF1OgXcZSTnv5vhdl5V1HNmfMX0+YsTrMiNiWyVq4Uztp+mm2iDNJ2cXH3Dm/mKQiGO8u5JP3ZwKcbHTwF/sbkj31R92Xe8YZlWYfIc4B4BP9/AxDmQlpm29ueBe2wsR/5B5EUjZyh4O63j+xml4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715964762; c=relaxed/simple; bh=REt/3OJfNCITQy/2bGXlVBs6wsXgtZcqxRqyCSeg0W8=; h=From:To:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=GU+gFjvCbvzfNVOkOlcDb474mLeA+e7GnuRs4X0O9LDsxL5jrOhmpVd/q9S7D7aws2e+v9E3GjESicXHsNmU5r9P16Ocv/+BiFmz0Pjcxy6Rh6d7yq3gwxO8pckeL/YVSwOnCxCZxUHorbYKoVwi80tb7i3RJ1Y0b6INdBmMlLg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=gA1ZridC; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="gA1ZridC" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 98707C32782 for ; Fri, 17 May 2024 16:52:41 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1715964762; bh=REt/3OJfNCITQy/2bGXlVBs6wsXgtZcqxRqyCSeg0W8=; h=From:To:Subject:Date:In-Reply-To:References:From; b=gA1ZridCYLCIXnRSFrlPltIbINHnoBTEWwKubtTy9ZJhKJXQlBfyzYhv45zzn9eLn R9RZkKohPEZIhb2e5Z1+Ba/phednAup16fxt1C/55/gBY/xrkn9xSzhWSs8QxCV4gd +A/1sStYS9krSQXjnYMZrWXIQw06kbfMVeaASFcYYMy641ciIjK4lD5BMIeOwc+mgD ItVMkG5lfg5eF85pAcRByyNw3VdTGYcCpQJAhPljFbNH1ZmbrGENOgd3AeWMXrHxDz QIZB3MCioLF92meQYpHi1rUVSdMpl1WChmwP/W+YDIfHB9bhLE7rZxN+4Cm0kWZxmV 08LGfbFh0HRSA== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v3 1/2] btrfs: ensure fast fsync waits for ordered extents after a write failure Date: Fri, 17 May 2024 17:52:37 +0100 Message-Id: X-Mailer: git-send-email 2.34.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-btrfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Filipe Manana If a write path in COW mode fails, either before submitting a bio for the new extents or an actual IO error happens, we can end up allowing a fast fsync to log file extent items that point to unwritten extents. This is because dropping the extent maps happens when completing ordered extents, at btrfs_finish_one_ordered(), and the completion of an ordered extent is executed in a work queue. This can result in a fast fsync to start logging file extent items based on existing extent maps before the ordered extents complete, therefore resulting in a log that has file extent items that point to unwritten extents, resulting in a corrupt file if a crash happens after and the log tree is replayed the next time the fs is mounted. This can happen for both direct IO writes and buffered writes. For example consider a direct IO write, in COW mode, that fails at btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an error: 1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter set to false, meaning an error happened; 2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR flag; 3) btrfs_finish_ordered_extent() queues the completion of the ordered extent - so that btrfs_finish_one_ordered() will be executed later in a work queue. That function will drop extent maps in the range when it's executed, since the extent maps point to unwritten locations (signaled by the BTRFS_ORDERED_IOERR flag); 4) After calling btrfs_finish_ordered_extent() we keep going down the write path and unlock the inode; 5) After that a fast fsync starts and locks the inode; 6) Before the work queue executes btrfs_finish_one_ordered(), the fsync task sees the extent maps that point to the unwritten locations and logs file extent items based on them - it does not know they are unwritten, and the fast fsync path does not wait for ordered extents to complete, which is an intentional behaviour in order to reduce latency. For the buffered write case, here's one example: 1) A fast fsync begins, and it starts by flushing delalloc and waiting for the writeback to complete by calling filemap_fdatawait_range(); 2) Flushing the dellaloc created a new extent map X; 3) During the writeback some IO error happened, and at the end io callback (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its completion; 4) After queuing the ordered extent completion, the end io callback clears the writeback flag from all pages (or folios), and from that moment the fast fsync can proceed; 5) The fast fsync proceeds sees extent map X and logs a file extent item based on extent map X, resulting in a log that points to an unwritten data extent - because the ordered extent completion hasn't run yet, it happens only after the logging. To fix this make btrfs_finish_ordered_extent() set the inode flag BTRFS_INODE_NEEDS_FULL_SYNC in case an error happened for a COW write, so that a fast fsync will wait for ordered extent completion. Note that this issues of using extent maps that point to unwritten locations can not happen for reads, because in read paths we start by locking the extent range and wait for any ordered extents in the range to complete before looking for extent maps. Signed-off-by: Filipe Manana --- fs/btrfs/btrfs_inode.h | 19 +++++++++++++------ fs/btrfs/file.c | 15 +++++++++++++++ fs/btrfs/ordered-data.c | 31 +++++++++++++++++++++++++++++++ 3 files changed, 59 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 3c8bc7a8ebdd..26ee7251ad6c 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -56,7 +56,9 @@ enum { /* * Always set under the VFS' inode lock, otherwise it can cause races * during fsync (we start as a fast fsync and then end up in a full - * fsync racing with ordered extent completion). + * fsync racing with ordered extent completion). It's also safe to be + * set during writeback without holding the VFS' inode lock because + * a fsync flushes and waits for delalloc before starting to log. */ BTRFS_INODE_NEEDS_FULL_SYNC, BTRFS_INODE_COPY_EVERYTHING, @@ -453,14 +455,19 @@ static inline void btrfs_set_inode_last_sub_trans(struct btrfs_inode *inode) } /* - * Should be called while holding the inode's VFS lock in exclusive mode, or - * while holding the inode's mmap lock (struct btrfs_inode::i_mmap_lock) in + * Should be called while holding the inode's VFS lock in exclusive/shared mode, + * or while holding the inode's mmap lock (struct btrfs_inode::i_mmap_lock) in * either shared or exclusive mode, or in a context where no one else can access * the inode concurrently (during inode creation or when loading an inode from - * disk). + * disk). Can also be called in irq context from btrfs_finish_ordered_extent() + * because at that point we are either in a direct IO write, in which case the + * inode's VFS lock is held, or in end IO callback of a buffered write, in which + * case fsync allways waits for writeback completion. */ static inline void btrfs_set_inode_full_sync(struct btrfs_inode *inode) { + unsigned long flags; + set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags); /* * The inode may have been part of a reflink operation in the last @@ -478,10 +485,10 @@ static inline void btrfs_set_inode_full_sync(struct btrfs_inode *inode) * while ->last_trans was not yet updated in the current transaction, * and therefore has a lower value. */ - spin_lock(&inode->lock); + spin_lock_irqsave(&inode->lock, flags); if (inode->last_reflink_trans < inode->last_trans) inode->last_reflink_trans = inode->last_trans; - spin_unlock(&inode->lock); + spin_unlock_irqrestore(&inode->lock, flags); } static inline bool btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 0c7c1b42028e..d635bc0c01df 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1894,6 +1894,21 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) btrfs_get_ordered_extents_for_logging(BTRFS_I(inode), &ctx.ordered_extents); ret = filemap_fdatawait_range(inode->i_mapping, start, end); + if (ret) + goto out_release_extents; + + /* + * Check again the full sync flag, because it may have been set + * during the end IO callback (end_bbio_data_write() -> + * btrfs_finish_ordered_extent()) in case an error happened and + * we need to wait for ordered extents to complete so that any + * extent maps that point to unwritten locations are dropped and + * we don't log them. + */ + full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, + &BTRFS_I(inode)->runtime_flags); + if (full_sync) + ret = btrfs_wait_ordered_range(inode, start, len); } if (ret) diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 44157e43fd2a..55a9aeed7344 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -388,6 +388,37 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate); spin_unlock_irqrestore(&inode->ordered_tree_lock, flags); + /* + * If this is a COW write it means we created new extent maps for the + * range and they point to unwritten locations if we got an error either + * before submitting a bio or during IO. + * + * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we + * are queuing its completion below. During completion, at + * btrfs_finish_one_ordered(), we will drop the extent maps for the + * unwritten extents. + * + * However because completion runs in a work queue we can end up having + * a fast fsync running before that. In the case of direct IO, once we + * unlock the inode the fsync might start, and we queue the completion + * before unlocking the inode. In the case of buffered IO when writeback + * finishes (end_bbio_data_write()) we queue the completion, so if the + * writeback was triggered by a fast fsync, the fsync might start + * logging before ordered extent completion runs in the work queue. + * + * The fast fsync will log file extent items based on the extent maps it + * finds, so if by the time it collects extent maps the ordered extent + * completion didn't happen yet, it will log file extent items that + * point to unwritten extents, resulting in a corruption if a crash + * happens and the log tree is replayed. Note that a fast fsync does not + * wait for completion of ordered extents in order to reduce latency. + * + * Set a flag in the inode so that the next fast fsync will wait for + * ordered extents to complete before starting to log. + */ + if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) + btrfs_set_inode_full_sync(inode); + if (ret) btrfs_queue_ordered_fn(ordered); return ret; From patchwork Fri May 17 16:52:38 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 13667182 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 40D3813D8B6 for ; Fri, 17 May 2024 16:52:43 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715964763; cv=none; b=UCOdmVLj4CuGTLxqH2pHVoceeT/cqJE57PIFUOrWSF+8aBUKpDm+w0Ru4XqKO7C9JXhofJwa5J8wfyj7mMdhnT82UAcrYmxWJBXjBY3xrnKyoIprkWhNaRNd0ZRsPUIfayLizMgnnTKFY0RdI3ks0el38AG+gmMpnNR7Talnk0g= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715964763; c=relaxed/simple; bh=giaF07HsS/o92Sx8QlESt03wiFeVuvpcaqG+P8qleA0=; h=From:To:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=JH8/6JYamzmarIUY7gzPbHg8pc6U9NnimRpdhV8vORfakaXlnK1XZ9/Wc17IPDDFWL8b6aAspvtwZEGdH4I2C6WRPhM6xDhDJFEMZ/BKfYbLHzxsnjog+3JSntEo1b7diDtVgdolB5DRAFSLSZCwAWRd37bDTT5Gg/RxIZHQRGE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=uGgQiiLI; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="uGgQiiLI" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 90361C32781 for ; Fri, 17 May 2024 16:52:42 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1715964763; bh=giaF07HsS/o92Sx8QlESt03wiFeVuvpcaqG+P8qleA0=; h=From:To:Subject:Date:In-Reply-To:References:From; b=uGgQiiLIV3KWqLtdaH1g1GuDcEokLU1l25DvaMiIucsyJHbCLqAX25GxobxIKagTk FQHYV3J2bSSanFtBGduC43wsBn1jjMYjx6GoPGxJYIiH68cawCEVt/tlRIAFJdkhxZ ispbjJrRIOCjLvdoZiQFS1vzIgVS52r0exCSGGTWlBDERm5utBPco3J+eznBY1frz1 y95oXbsFzlKt3GFbK/o2+AuDEYmqWQaYjOFE0cDhKovO6tHYmpfalAF/0SbaHDHkJn TV9o8a5si4pSL3mAZKwynaL9n7+LVtU/XaMrUtpwOgf49H8hNspvlYFGhyoMWUbhv9 kI4nwXujuPldg== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v3 2/2] btrfs: make btrfs_finish_ordered_extent() return void Date: Fri, 17 May 2024 17:52:38 +0100 Message-Id: X-Mailer: git-send-email 2.34.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-btrfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Filipe Manana Currently btrfs_finish_ordered_extent() returns a boolean indicating if the ordered extent was added to the work queue for completion, but none of its callers cares about it, so make it return void. Reviewed-by: Qu Wenruo Signed-off-by: Filipe Manana Signed-off-by: David Sterba --- fs/btrfs/ordered-data.c | 3 +-- fs/btrfs/ordered-data.h | 2 +- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 55a9aeed7344..adc274605776 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -374,7 +374,7 @@ static void btrfs_queue_ordered_fn(struct btrfs_ordered_extent *ordered) btrfs_queue_work(wq, &ordered->work); } -bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, +void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, struct page *page, u64 file_offset, u64 len, bool uptodate) { @@ -421,7 +421,6 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, if (ret) btrfs_queue_ordered_fn(ordered); - return ret; } /* diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index b6f6c6b91732..bef22179e7c5 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -162,7 +162,7 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent); void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry); void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode, struct btrfs_ordered_extent *entry); -bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, +void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, struct page *page, u64 file_offset, u64 len, bool uptodate); void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,