From patchwork Tue Aug 1 16:28:28 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Mason X-Patchwork-Id: 13336997 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 137B6C04A94 for ; Tue, 1 Aug 2023 16:32:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230322AbjHAQcB (ORCPT ); Tue, 1 Aug 2023 12:32:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56516 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229716AbjHAQb5 (ORCPT ); Tue, 1 Aug 2023 12:31:57 -0400 Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4D177E57 for ; Tue, 1 Aug 2023 09:31:55 -0700 (PDT) Received: from pps.filterd (m0044012.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 371FCZ03008235 for ; Tue, 1 Aug 2023 09:31:55 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : subject : date : message-id : mime-version : content-transfer-encoding : content-type; s=facebook; bh=w3FQcTTuIVHnUR9i2jp48OrYpF9bCcl8+YO6btnpqVE=; b=KK23Kcn6jQWoqEY6T6dvo6aICI0yIRPtYSiJOKG36+5hx43NYOmrrokTNHr9fuj3o+JF ZbNyvMPQlQzYKXYF2MpXL2RyPui8V4Y+b7bJzYyMFKRqdNgxmN5FbiRRWXQZXWpPnegn fJrmnPoJuTSL4dunkvN29faSeYZTTts6uZU= Received: from maileast.thefacebook.com ([163.114.130.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3s6qfgxau3-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Tue, 01 Aug 2023 09:31:55 -0700 Received: from ash-exhub204.TheFacebook.com (2620:10d:c0a8:83::4) by ash-exhub203.TheFacebook.com (2620:10d:c0a8:83::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.27; Tue, 1 Aug 2023 09:31:54 -0700 Received: from twshared3345.02.ash8.facebook.com (2620:10d:c0a8:1c::11) by mail.thefacebook.com (2620:10d:c0a8:83::4) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.27; Tue, 1 Aug 2023 09:31:53 -0700 Received: by devbig003.nao1.facebook.com (Postfix, from userid 8731) id 3456C1D9325F7; Tue, 1 Aug 2023 09:30:05 -0700 (PDT) From: Chris Mason To: , , , Subject: [PATCH v2] Btrfs: only subtract from len_to_oe_boundary when it is tracking an extent Date: Tue, 1 Aug 2023 09:28:28 -0700 Message-ID: <20230801162828.1396380-1-clm@fb.com> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: TE2Wu92MVK-m_DiO1GV-Iytxtp-ylklX X-Proofpoint-GUID: TE2Wu92MVK-m_DiO1GV-Iytxtp-ylklX X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.254,Aquarius:18.0.957,Hydra:6.0.591,FMLib:17.11.176.26 definitions=2023-08-01_13,2023-08-01_01,2023-05-22_02 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org [Note: I dropped the RFC because I can now trigger on Linus kernels, and I think we need to send something to stable as well ] bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone as we submit bios for writes. Every time we add a page to the bio, we decrement those bytes from len_to_oe_boundary, and then we submit the bio if we happen to hit zero. Most of the time, len_to_oe_boundary gets set to U32_MAX. submit_extent_page() adds pages into our bio, and the size of the bio ends up limited by: - Are we contiguous on disk? - Does bio_add_page() allow us to stuff more in? - is len_to_oe_boundary > 0? The len_to_oe_boundary math starts with U32_MAX, which isn't page or sector aligned, and subtracts from it until it hits zero. In the non-zoned case, the last IO we submit before we hit zero is going to be unaligned, triggering BUGs and other sadness. This is hard to trigger because bio_add_page() isn't going to make a bio of U32_MAX size unless you give it a perfect set of pages and fully contiguous extents on disk. We can hit it pretty reliably while making large swapfiles during provisioning because the machine is freshly booted, mostly idle, and the disk is freshly formatted. It's also possible to trigger with reads when read_ahead_kb is set to 4GB. The code has been clean up and shifted around a few times, but this flaw has been lurking since the counter was added. I think Christoph's commit ended up exposing the bug. The fix used here is to skip doing math on len_to_oe_boundary unless we've changed it from the default U32_MAX value. bio_add_page() is the real limit we want, and there's no reason to do extra math when Jens is doing it for us. Sample repro, note you'll need to change the path to the bdi and device: SUBVOL=/btrfs/swapvol SWAPFILE=$SUBVOL/swapfile SZMB=8192 mkfs.btrfs -f /dev/vdb mount /dev/vdb /btrfs btrfs subvol create $SUBVOL chattr +C $SUBVOL dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB sync;sync;sync echo 4 > /proc/sys/vm/drop_caches echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb while(true) ; do echo 1 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/drop_caches dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock done Signed-off-by: Chris Mason Reviewed-by: Sweet Tea Dorminy CC: stable@vger.kernel.org # 6.4 Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") Reviewed-by: Qu Wenruo Reviewed-by: Christoph Hellwig --- fs/btrfs/extent_io.c | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-) v1 -> v2: update the comments, add repro to commit log diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 6b40189a1a3e..c25115592d99 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -849,7 +849,30 @@ static void submit_extent_page(struct btrfs_bio_ctrl *bio_ctrl, size -= len; pg_offset += len; disk_bytenr += len; - bio_ctrl->len_to_oe_boundary -= len; + + /* + * len_to_oe_boundary defaults to U32_MAX, which isn't page or + * sector aligned. alloc_new_bio() then sets it to the end of + * our ordered extent for writes into zoned devices. + * + * When len_to_oe_boundary is tracking an ordered extent, we + * trust the ordered extent code to align things properly, and + * the check above to cap our write to the ordered extent + * boundary is correct. + * + * When len_to_oe_boundary is U32_MAX, the cap above would + * result in a 4095 byte IO for the last page riiiiight before + * we hit the bio limit of UINT_MAX. bio_add_page() has all + * the checks required to make sure we don't overflow the bio, + * and we should just ignore len_to_oe_boundary completely + * unless we're using it to track an ordered extent. + * + * It's pretty hard to make a bio sized U32_MAX, but it can + * happen when the page cache is able to feed us contiguous + * pages for large extents. + */ + if (bio_ctrl->len_to_oe_boundary != U32_MAX) + bio_ctrl->len_to_oe_boundary -= len; /* Ordered extent boundary: move on to a new bio. */ if (bio_ctrl->len_to_oe_boundary == 0)