From patchwork Tue Nov 15 01:30:41 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 13043083 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id DBB2BC4321E for ; Tue, 15 Nov 2022 01:30:56 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B14C08E0005; Mon, 14 Nov 2022 20:30:51 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id A75DE8E0002; Mon, 14 Nov 2022 20:30:51 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7DE698E0005; Mon, 14 Nov 2022 20:30:51 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 6878B8E0002 for ; Mon, 14 Nov 2022 20:30:51 -0500 (EST) Received: from smtpin24.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 44D0080EE3 for ; Tue, 15 Nov 2022 01:30:51 +0000 (UTC) X-FDA: 80133947502.24.5D9DB69 Received: from mail-pf1-f180.google.com (mail-pf1-f180.google.com [209.85.210.180]) by imf13.hostedemail.com (Postfix) with ESMTP id DE5E920005 for ; Tue, 15 Nov 2022 01:30:50 +0000 (UTC) Received: by mail-pf1-f180.google.com with SMTP id 130so12709133pfu.8 for ; Mon, 14 Nov 2022 17:30:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20210112.gappssmtp.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=phVBYhudr1GO3CmXNiH4VHKFg2wktmsa7QR6ncKeGUY=; b=CnJLmS65oJuTXNJdauefJ7ELys/nPCvRWwVX6p7OIzQ5KbUM7QPVl79+tgw88wsm7+ NOXvdQrEcikF6PXWKwS95asN6EGPpnzUx4O94A/xedi5MfZ0G1d9aYt66X1a174iCgoc 0JjUtGc3s7wTtBFNUVjLpCumTSkYKzhoX//mcOfUCLLr8Bi7+bNTlSTceik1XNXEd56e LZDIx6vy3A4MIyexw2GJYAGzRvIPhhRvA37WNdtT44zXvD5F7EOfE02hxAvunBjy4eMz pq23L43nJleoxqZgKUFR8qycS5KZl1kKLcng6jqyuEfe5RJ4kfVkcy0Abz4coSFE5Chx DMbw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=phVBYhudr1GO3CmXNiH4VHKFg2wktmsa7QR6ncKeGUY=; b=Bn5jgusy+6qYAnNkV0tNhtSwQ1NrY9DEVRfTqPN42PsoiqproULTWlAnjLPOgC4keF NgGcvLmL3hkcPEqnCK9RmYTg/iU3gRELFlpYpmHzY6IyJLYA2IWUE3y3H9nthtYbAPMg NoqmWTb+xHSQndt6qKRxGAJsGozoSg+c0mdf3bpxFuY5+QqHT7KhjZZ+QBuymYxoGjQM uiKqPJoLkJpkvqeKHAJsc5ysXgO8GZwF1jY4P49/XtoXXclR4MYfKKiwf4CJTWnZchSF Tlpr5vU1jVv9k780M6SciUoigHS1c1qyiGHDUccju0sYSlihb104jpBK9txCScoUG4WA XXqQ== X-Gm-Message-State: ANoB5pnpotGCGgRauWNI/64SMd0JRWYmu+ejBnrdcdlTa5K6gN4L+r5K gxjfIhuF9JTgqqGrOHATpcyUPiVO33LWug== X-Google-Smtp-Source: AA0mqf7IBDdvpfbRaR4daIa3KxGQlGc37lKNd2h/683Ib/X7tTmU2xe6OcRHlDoNxxbEc5NDvgl6aA== X-Received: by 2002:a62:403:0:b0:572:5be2:505b with SMTP id 3-20020a620403000000b005725be2505bmr2765779pfe.52.1668475849890; Mon, 14 Nov 2022 17:30:49 -0800 (PST) Received: from dread.disaster.area (pa49-181-106-210.pa.nsw.optusnet.com.au. [49.181.106.210]) by smtp.gmail.com with ESMTPSA id e2-20020a170902d38200b00186e2b3e12fsm8178832pld.261.2022.11.14.17.30.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Nov 2022 17:30:48 -0800 (PST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1oukmj-00EKG7-Es; Tue, 15 Nov 2022 12:30:45 +1100 Received: from dave by discord.disaster.area with local (Exim 4.96) (envelope-from ) id 1oukmj-001Vph-1O; Tue, 15 Nov 2022 12:30:45 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 7/9] iomap: write iomap validity checks Date: Tue, 15 Nov 2022 12:30:41 +1100 Message-Id: <20221115013043.360610-8-david@fromorbit.com> X-Mailer: git-send-email 2.37.2 In-Reply-To: <20221115013043.360610-1-david@fromorbit.com> References: <20221115013043.360610-1-david@fromorbit.com> MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1668475850; a=rsa-sha256; cv=none; b=Stf5fpOZRHJ6yTVM3sjPahmdQU6o4Mi1Wb2z7VxXqNlyD/bDZS4TJeZvJm8WJWgFS3m8Rl DSzejepfuya1X2QZEHiwrbep/tY+hVrlSIsVGeepyNlPf1G3s55wL7QVScfsPj9nvP2clz WMtaarQu/ZlDCWky2vraDXGyxwbcyRM= ARC-Authentication-Results: i=1; imf13.hostedemail.com; dkim=pass header.d=fromorbit-com.20210112.gappssmtp.com header.s=20210112 header.b=CnJLmS65; dmarc=none; spf=none (imf13.hostedemail.com: domain of david@fromorbit.com has no SPF policy when checking 209.85.210.180) smtp.mailfrom=david@fromorbit.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1668475850; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=phVBYhudr1GO3CmXNiH4VHKFg2wktmsa7QR6ncKeGUY=; b=MbGoOZ/3swZojtV9Uflz6TlbdlrsVuxqZ/Ghc7ckdJD2IgPGA3HTaa1Xj+DFSXU8qDjCM3 smFYvagZcMX8HPtZpI/YQhKjqbGbde7idVEplvx/8ABCpLq4T1X+G+e+bCfPJ2fkqfsSDA nbcHbmRjcQsoGBh+FNUUP3Xp9c4YT5k= X-Rspamd-Queue-Id: DE5E920005 X-Rspam-User: Authentication-Results: imf13.hostedemail.com; dkim=pass header.d=fromorbit-com.20210112.gappssmtp.com header.s=20210112 header.b=CnJLmS65; dmarc=none; spf=none (imf13.hostedemail.com: domain of david@fromorbit.com has no SPF policy when checking 209.85.210.180) smtp.mailfrom=david@fromorbit.com X-Rspamd-Server: rspam06 X-Stat-Signature: qdr1agyz6tcy53yu5wg7n6oif1eipbnp X-HE-Tag: 1668475850-303698 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner A recent multithreaded write data corruption has been uncovered in the iomap write code. The core of the problem is partial folio writes can be flushed to disk while a new racing write can map it and fill the rest of the page: writeback new write allocate blocks blocks are unwritten submit IO ..... map blocks iomap indicates UNWRITTEN range loop { lock folio copyin data ..... IO completes runs unwritten extent conv blocks are marked written get next folio } Now add memory pressure such that memory reclaim evicts the partially written folio that has already been written to disk. When the new write finally gets to the last partial page of the new write, it does not find it in cache, so it instantiates a new page, sees the iomap is unwritten, and zeros the part of the page that it does not have data from. This overwrites the data on disk that was originally written. The full description of the corruption mechanism can be found here: https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/ To solve this problem, we need to check whether the iomap is still valid after we lock each folio during the write. We have to do it after we lock the page so that we don't end up with state changes occurring while we wait for the folio to be locked. Hence we need a mechanism to be able to check that the cached iomap is still valid (similar to what we already do in buffered writeback), and we need a way for ->begin_write to back out and tell the high level iomap iterator that we need to remap the remaining write range. The iomap needs to grow some storage for the validity cookie that the filesystem provides to travel with the iomap. XFS, in particular, also needs to know some more information about what the iomap maps (attribute extents rather than file data extents) to for the validity cookie to cover all the types of iomaps we might need to validate. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/iomap/buffered-io.c | 29 +++++++++++++++++++++++++++- fs/iomap/iter.c | 19 ++++++++++++++++++- include/linux/iomap.h | 43 ++++++++++++++++++++++++++++++++++-------- 3 files changed, 81 insertions(+), 10 deletions(-) diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 91ee0b308e13..8354b0fdaa94 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -584,7 +584,7 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter, return iomap_read_inline_data(iter, folio); } -static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos, +static int iomap_write_begin(struct iomap_iter *iter, loff_t pos, size_t len, struct folio **foliop) { const struct iomap_page_ops *page_ops = iter->iomap.page_ops; @@ -618,6 +618,27 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos, status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM; goto out_no_page; } + + /* + * Now we have a locked folio, before we do anything with it we need to + * check that the iomap we have cached is not stale. The inode extent + * mapping can change due to concurrent IO in flight (e.g. + * IOMAP_UNWRITTEN state can change and memory reclaim could have + * reclaimed a previously partially written page at this index after IO + * completion before this write reaches this file offset) and hence we + * could do the wrong thing here (zero a page range incorrectly or fail + * to zero) and corrupt data. + */ + if (page_ops && page_ops->iomap_valid) { + bool iomap_valid = page_ops->iomap_valid(iter->inode, + &iter->iomap); + if (!iomap_valid) { + iter->iomap.flags |= IOMAP_F_STALE; + status = 0; + goto out_unlock; + } + } + if (pos + len > folio_pos(folio) + folio_size(folio)) len = folio_pos(folio) + folio_size(folio) - pos; @@ -773,6 +794,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i) status = iomap_write_begin(iter, pos, bytes, &folio); if (unlikely(status)) break; + if (iter->iomap.flags & IOMAP_F_STALE) + break; page = folio_file_page(folio, pos >> PAGE_SHIFT); if (mapping_writably_mapped(mapping)) @@ -856,6 +879,8 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter) status = iomap_write_begin(iter, pos, bytes, &folio); if (unlikely(status)) return status; + if (iter->iomap.flags & IOMAP_F_STALE) + break; status = iomap_write_end(iter, pos, bytes, bytes, folio); if (WARN_ON_ONCE(status == 0)) @@ -911,6 +936,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) status = iomap_write_begin(iter, pos, bytes, &folio); if (status) return status; + if (iter->iomap.flags & IOMAP_F_STALE) + break; offset = offset_in_folio(folio, pos); if (bytes > folio_size(folio) - offset) diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c index a1c7592d2ade..79a0614eaab7 100644 --- a/fs/iomap/iter.c +++ b/fs/iomap/iter.c @@ -7,12 +7,28 @@ #include #include "trace.h" +/* + * Advance to the next range we need to map. + * + * If the iomap is marked IOMAP_F_STALE, it means the existing map was not fully + * processed - it was aborted because the extent the iomap spanned may have been + * changed during the operation. In this case, the iteration behaviour is to + * remap the unprocessed range of the iter, and that means we may need to remap + * even when we've made no progress (i.e. iter->processed = 0). Hence the + * "finished iterating" case needs to distinguish between + * (processed = 0) meaning we are done and (processed = 0 && stale) meaning we + * need to remap the entire remaining range. + */ static inline int iomap_iter_advance(struct iomap_iter *iter) { + bool stale = iter->iomap.flags & IOMAP_F_STALE; + /* handle the previous iteration (if any) */ if (iter->iomap.length) { - if (iter->processed <= 0) + if (iter->processed < 0) return iter->processed; + if (!iter->processed && !stale) + return 0; if (WARN_ON_ONCE(iter->processed > iomap_length(iter))) return -EIO; iter->pos += iter->processed; @@ -33,6 +49,7 @@ static inline void iomap_iter_done(struct iomap_iter *iter) WARN_ON_ONCE(iter->iomap.offset > iter->pos); WARN_ON_ONCE(iter->iomap.length == 0); WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos); + WARN_ON_ONCE(iter->iomap.flags & IOMAP_F_STALE); trace_iomap_iter_dstmap(iter->inode, &iter->iomap); if (iter->srcmap.type != IOMAP_HOLE) diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 238a03087e17..f166d80b68bf 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -49,26 +49,35 @@ struct vm_fault; * * IOMAP_F_BUFFER_HEAD indicates that the file system requires the use of * buffer heads for this mapping. + * + * IOMAP_F_XATTR indicates that the iomap is for an extended attribute extent + * rather than a file data extent. */ -#define IOMAP_F_NEW 0x01 -#define IOMAP_F_DIRTY 0x02 -#define IOMAP_F_SHARED 0x04 -#define IOMAP_F_MERGED 0x08 -#define IOMAP_F_BUFFER_HEAD 0x10 -#define IOMAP_F_ZONE_APPEND 0x20 +#define IOMAP_F_NEW (1U << 0) +#define IOMAP_F_DIRTY (1U << 1) +#define IOMAP_F_SHARED (1U << 2) +#define IOMAP_F_MERGED (1U << 3) +#define IOMAP_F_BUFFER_HEAD (1U << 4) +#define IOMAP_F_ZONE_APPEND (1U << 5) +#define IOMAP_F_XATTR (1U << 6) /* * Flags set by the core iomap code during operations: * * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size * has changed as the result of this write operation. + * + * IOMAP_F_STALE indicates that the iomap is not valid any longer and the file + * range it covers needs to be remapped by the high level before the operation + * can proceed. */ -#define IOMAP_F_SIZE_CHANGED 0x100 +#define IOMAP_F_SIZE_CHANGED (1U << 8) +#define IOMAP_F_STALE (1U << 9) /* * Flags from 0x1000 up are for file system specific usage: */ -#define IOMAP_F_PRIVATE 0x1000 +#define IOMAP_F_PRIVATE (1U << 12) /* @@ -89,6 +98,7 @@ struct iomap { void *inline_data; void *private; /* filesystem private */ const struct iomap_page_ops *page_ops; + u64 validity_cookie; /* used with .iomap_valid() */ }; static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos) @@ -128,6 +138,23 @@ struct iomap_page_ops { int (*page_prepare)(struct inode *inode, loff_t pos, unsigned len); void (*page_done)(struct inode *inode, loff_t pos, unsigned copied, struct page *page); + + /* + * Check that the cached iomap still maps correctly to the filesystem's + * internal extent map. FS internal extent maps can change while iomap + * is iterating a cached iomap, so this hook allows iomap to detect that + * the iomap needs to be refreshed during a long running write + * operation. + * + * The filesystem can store internal state (e.g. a sequence number) in + * iomap->validity_cookie when the iomap is first mapped to be able to + * detect changes between mapping time and whenever .iomap_valid() is + * called. + * + * This is called with the folio over the specified file position held + * locked by the iomap code. + */ + bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap); }; /*