From patchwork Fri Oct 14 23:57:25 2022
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 13007519
Subject: [PATCH v3 05/25] fsdax: Wait for pinned pages during truncate_inode_pages_final()
From: Dan Williams
To: linux-mm@kvack.org
Cc: Matthew Wilcox, Jan Kara, "Darrick J. Wong", Jason Gunthorpe,
 Christoph Hellwig, John Hubbard, Dave Chinner, nvdimm@lists.linux.dev,
 akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org
Date: Fri, 14 Oct 2022 16:57:25 -0700
Message-ID: <166579184544.2236710.791897642091142558.stgit@dwillia2-xfh.jf.intel.com>
In-Reply-To: <166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com>
References: <166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com>
User-Agent: StGit/0.18-3-g996c
MIME-Version: 1.0
The fsdax truncate vs page pinning solution is incomplete. The initial
solution landed in v4.17 and covered the typical truncates invoked through
truncate(2) and fallocate(2), i.e. truncate_inode_pages() called on open
files. However, that enabling left truncate_inode_pages_final(), called
after iput_final() to free the inode, unprotected. Thankfully, that v4.17
enabling also left a warning in place to fire if any truncate is attempted
while a DAX page is still pinned:

commit d2c997c0f145 ("fs, dax: use page->mapping to warn if truncate
collides with a busy page")

While a lore search indicates no reports of that warning firing, the hole
is there nonetheless.
The concern is that if/when that warning fires it indicates a
use-after-free condition: the filesystem has lost the ability to arbitrate
access to its storage blocks. In the worst case, DMA may be ongoing while
the filesystem thinks the block is free to be reallocated to another
inode.

This patch is based on an observation from Dave that during iput_final()
there is no need to hold filesystem locks as in the explicit truncate
path. The wait can occur from within dax_delete_mapping_entry() as called
by truncate_folio_batch_exceptionals().

This solution trades off fixing the use-after-free against a theoretical
deadlock scenario: if the agent holding the page pin triggers inode
reclaim, and that reclaim waits for the pin to drop, it will deadlock. Two
observations make this approach still worth pursuing:

1/ Any existing scenario where that happens would have triggered the
warning referenced above, which has shipped upstream for ~5 years without
a bug report on lore.

2/ Most I/O drivers only hold page pins in their fast paths, and new
__GFP_FS allocations are unlikely in a driver fast path. I.e. if the
deadlock triggers, the likely fix would be in the offending driver, not
new band-aids in fsdax.

So, update the DAX core to notice that inode->i_mapping is in the exiting
state and use that as a signal that the inode is unreferenced, and await
page-pins to drain.

Cc: Matthew Wilcox
Cc: Jan Kara
Cc: "Darrick J. Wong"
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Cc: John Hubbard
Reported-by: Dave Chinner
Signed-off-by: Dan Williams
---
 fs/dax.c |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index a75d4bf541b4..e3deb60a792f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -803,13 +803,37 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 	return ret;
 }
 
+/*
+ * Wait indefinitely for all pins to drop; the alternative to waiting is
+ * a potential use-after-free scenario.
+ */
+static void dax_break_layout(struct address_space *mapping, pgoff_t index)
+{
+	/* To do this without locks, the inode needs to be unreferenced */
+	WARN_ON(atomic_read(&mapping->host->i_count));
+	do {
+		struct page *page;
+
+		page = dax_zap_mappings_range(mapping, index << PAGE_SHIFT,
+					      (index + 1) << PAGE_SHIFT);
+		if (!page)
+			return;
+		wait_var_event(page, dax_page_idle(page));
+	} while (true);
+}
+
 /*
  * Delete DAX entry at @index from @mapping. Wait for it
  * to be unlocked before deleting it.
  */
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	int ret = __dax_invalidate_entry(mapping, index, true);
+	int ret;
+
+	if (mapping_exiting(mapping))
+		dax_break_layout(mapping, index);
+
+	ret = __dax_invalidate_entry(mapping, index, true);
 
 	/*
 	 * This gets called from truncate / punch_hole path. As such, the caller