From patchwork Mon Oct 23 07:20:46 2023
X-Patchwork-Submitter: Shiyang Ruan
X-Patchwork-Id: 13432436
From: Shiyang Ruan
To: linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev,
 linux-xfs@vger.kernel.org, linux-mm@kvack.org, chandanbabu@kernel.org
Cc: dan.j.williams@intel.com, willy@infradead.org, jack@suse.cz,
 akpm@linux-foundation.org, djwong@kernel.org, mcgrof@kernel.org
Subject: [PATCH v15.1] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
Date: Mon, 23 Oct 2023 15:20:46 +0800
Message-ID: <20231023072046.1626474-1-ruansy.fnst@fujitsu.com>
X-Mailer: git-send-email 2.42.0
In-Reply-To: <20230928103227.250550-1-ruansy.fnst@fujitsu.com>
References: <20230928103227.250550-1-ruansy.fnst@fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

Changes since v15:
 1. Rebased on v6.6-rc7

Currently, if a PMEM device that contains an FSDAX filesystem is removed
abruptly (by calling unbind) while programs are still accessing data on
it, e.g.:
```
 $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
 # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
 echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
```
the system can end up in an unacceptable state:
  1. the device is gone but the mount point still exists, and umount
     fails with "target is busy"
  2. programs hang and cannot be killed
  3. the kernel may crash with a NULL pointer dereference

To fix this, introduce an MF_MEM_PRE_REMOVE flag that tells the
filesystem the whole device is about to be removed, and make sure all
related processes are notified so that they can exit gracefully.

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the
filesystem on it to unmap all files in use, and to notify the processes
that are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()
    -> kill_dax()
     -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
      -> xfs_dax_notify_failure()
      `-> freeze_super()             // freeze (kernel call)
      `-> do xfs rmap
      ` -> mf_dax_kill_procs()
      `  -> collect_procs_fsdax()    // all associated processes
      `  -> unmap_and_kill()
      ` -> invalidate_inode_pages2_range() // drop file's cache
      `-> thaw_super()               // thaw (both kernel & user call)

Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event.  Use the exclusive freeze/thaw[2] to lock the filesystem and
prevent new dax mappings from being created.  Do not shut down the
filesystem directly if the configuration is not supported, or if the
failure range includes the metadata area.  Make sure all files and
processes (not only the current process) are handled correctly.  Also
drop the cache of associated files before the pmem is removed.

[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/

Signed-off-by: Shiyang Ruan
Reviewed-by: Darrick J. Wong
Reviewed-by: Dan Williams
---
 drivers/dax/super.c         |   3 +-
 fs/xfs/xfs_notify_failure.c | 108 ++++++++++++++++++++++++++++++++++--
 include/linux/mm.h          |   1 +
 mm/memory-failure.c         |  21 +++++--
 4 files changed, 122 insertions(+), 11 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..f4b635526345 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -326,7 +326,8 @@ void kill_dax(struct dax_device *dax_dev)
 		return;
 
 	if (dax_dev->holder_data != NULL)
-		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+				MF_MEM_PRE_REMOVE);
 
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index a7daa522e00f..fa50e5308292 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
 
 #include
 #include
+#include
 
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
@@ -73,10 +74,16 @@ xfs_dax_failure_fn(
 	struct xfs_mount		*mp = cur->bc_mp;
 	struct xfs_inode		*ip;
 	struct xfs_failure_info		*notify = data;
+	struct address_space		*mapping;
+	pgoff_t				pgoff;
+	unsigned long			pgcnt;
 	int				error = 0;
 
 	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
 	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		/* Continue the query because this isn't a failure. */
+		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		notify->want_shutdown = true;
 		return 0;
 	}
@@ -92,14 +99,60 @@ xfs_dax_failure_fn(
 		return 0;
 	}
 
-	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
-				  xfs_failure_pgoff(mp, rec, notify),
-				  xfs_failure_pgcnt(mp, rec, notify),
-				  notify->mf_flags);
+	mapping = VFS_I(ip)->i_mapping;
+	pgoff = xfs_failure_pgoff(mp, rec, notify);
+	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+	/* Continue the rmap query if the inode isn't a dax file. */
+	if (dax_mapping(mapping))
+		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+					  notify->mf_flags);
+
+	/* Invalidate the cache in dax pages. */
+	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+		invalidate_inode_pages2_range(mapping, pgoff,
+					      pgoff + pgcnt - 1);
+
 	xfs_irele(ip);
 	return error;
 }
 
+static int
+xfs_dax_notify_failure_freeze(
+	struct xfs_mount	*mp)
+{
+	struct super_block	*sb = mp->m_super;
+	int			error;
+
+	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+	if (error)
+		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
+
+	return error;
+}
+
+static void
+xfs_dax_notify_failure_thaw(
+	struct xfs_mount	*mp,
+	bool			kernel_frozen)
+{
+	struct super_block	*sb = mp->m_super;
+	int			error;
+
+	if (kernel_frozen) {
+		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+		if (error)
+			xfs_emerg(mp, "still frozen after notify failure, err=%d",
+				error);
+	}
+
+	/*
+	 * Also thaw userspace call anyway because the device is about to be
+	 * removed immediately.
+	 */
+	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+}
+
 static int
 xfs_dax_notify_ddev_failure(
 	struct xfs_mount	*mp,
@@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
 	struct xfs_btree_cur	*cur = NULL;
 	struct xfs_buf		*agf_bp = NULL;
 	int			error = 0;
+	bool			kernel_frozen = false;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
 	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
 	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
 							     daddr + bblen - 1);
 	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
 
+	if (mf_flags & MF_MEM_PRE_REMOVE) {
+		xfs_info(mp, "Device is about to be removed!");
+		/*
+		 * Freeze fs to prevent new mappings from being created.
+		 * - Keep going on if others already hold the kernel frozen.
+		 * - Keep going on if other errors too because this device is
+		 *   starting to fail.
+		 * - If kernel frozen state is held successfully here, thaw it
+		 *   here as well at the end.
+		 */
+		kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
+	}
+
 	error = xfs_trans_alloc_empty(mp, &tp);
 	if (error)
-		return error;
+		goto out;
 
 	for (; agno <= end_agno; agno++) {
 		struct xfs_rmap_irec	ri_low = { };
@@ -165,11 +232,26 @@ xfs_dax_notify_ddev_failure(
 	}
 
 	xfs_trans_cancel(tp);
-	if (error || notify.want_shutdown) {
+
+	/*
+	 * Shutdown fs from a force umount in pre-remove case which won't fail,
+	 * so errors can be ignored.  Otherwise, shutdown the filesystem with
+	 * CORRUPT flag if error occurred or notify.want_shutdown was set during
+	 * RMAP querying.
+	 */
+	if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+	else if (error || notify.want_shutdown) {
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		if (!error)
 			error = -EFSCORRUPTED;
 	}
+
+out:
+	/* Thaw the fs if it has been frozen before. */
+	if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
+
 	return error;
 }
 
@@ -197,6 +279,14 @@ xfs_dax_notify_failure(
 
 	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
 	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		/*
+		 * In the pre-remove case the failure notification is attempting
+		 * to trigger a force unmount.  The expectation is that the
+		 * device is still present, but its removal is in progress and
+		 * can not be cancelled, proceed with accessing the log device.
+		 */
+		if (mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		return -EFSCORRUPTED;
@@ -210,6 +300,12 @@ xfs_dax_notify_failure(
 	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
 	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+	/* Notify failure on the whole device. */
+	if (offset == 0 && len == U64_MAX) {
+		offset = ddev_start;
+		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+	}
+
 	/* Ignore the range out of filesystem area */
 	if (offset + len - 1 < ddev_start)
 		return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..385eee0d05a2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3831,6 +3831,7 @@ enum mf_flags {
 	MF_UNPOISON = 1 << 4,
 	MF_SW_SIMULATED = 1 << 5,
 	MF_NO_RETRY = 1 << 6,
+	MF_MEM_PRE_REMOVE = 1 << 7,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		      unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4d6e43c88489..6e43ae369fef 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -679,7 +679,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
  */
 static void collect_procs_fsdax(struct page *page,
 		struct address_space *mapping, pgoff_t pgoff,
-		struct list_head *to_kill)
+		struct list_head *to_kill, bool pre_remove)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -687,8 +687,15 @@ static void collect_procs_fsdax(struct page *page,
 	i_mmap_lock_read(mapping);
 	rcu_read_lock();
 	for_each_process(tsk) {
-		struct task_struct *t = task_early_kill(tsk, true);
+		struct task_struct *t = tsk;
 
+		/*
+		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
+		 * the current may not be the one accessing the fsdax page.
+		 * Otherwise, search for the current task.
+		 */
+		if (!pre_remove)
+			t = task_early_kill(tsk, true);
 		if (!t)
 			continue;
 		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1792,6 +1799,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 	dax_entry_t cookie;
 	struct page *page;
 	size_t end = index + count;
+	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
 
 	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
 
@@ -1803,9 +1811,14 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		if (!page)
 			goto unlock;
 
-		SetPageHWPoison(page);
+		if (!pre_remove)
+			SetPageHWPoison(page);
 
-		collect_procs_fsdax(page, mapping, index, &to_kill);
+		/*
+		 * The pre_remove case is revoking access, the memory is still
+		 * good and could theoretically be put back into service.
+		 */
+		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
 		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
 				index, mf_flags);
 unlock:
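
For readers following the flag handling in the hunks above, here is a minimal user-space sketch of three decisions that MF_MEM_PRE_REMOVE changes: which shutdown reason XFS picks after the rmap walk, whether the page is marked HWPoison, and whether every task is scanned instead of only early-kill tasks. This is not kernel code. The `sketch_*` names and the enum are hypothetical and exist only for illustration; only the MF_MEM_PRE_REMOVE bit value is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Same bit the patch adds to enum mf_flags in include/linux/mm.h. */
#define MF_MEM_PRE_REMOVE (1 << 7)

/* Hypothetical stand-ins for the XFS shutdown reasons (illustration only). */
enum sketch_shutdown {
	SKETCH_NO_SHUTDOWN,
	SKETCH_FORCE_UMOUNT,	/* models SHUTDOWN_FORCE_UMOUNT */
	SKETCH_CORRUPT_ONDISK,	/* models SHUTDOWN_CORRUPT_ONDISK */
};

/*
 * Models the tail of xfs_dax_notify_ddev_failure(): in the pre-remove case
 * the filesystem is always force-unmounted and rmap errors are ignored;
 * otherwise it is shut down as corrupt only on error or want_shutdown.
 */
static enum sketch_shutdown
sketch_shutdown_reason(int mf_flags, int error, bool want_shutdown)
{
	if (mf_flags & MF_MEM_PRE_REMOVE)
		return SKETCH_FORCE_UMOUNT;
	if (error || want_shutdown)
		return SKETCH_CORRUPT_ONDISK;
	return SKETCH_NO_SHUTDOWN;
}

/*
 * Models mf_dax_kill_procs(): the page is only poisoned for a real memory
 * failure, because in the pre-remove case the memory is still good.
 */
static bool sketch_should_poison(int mf_flags)
{
	return !(mf_flags & MF_MEM_PRE_REMOVE);
}

/*
 * Models collect_procs_fsdax(): with pre_remove set, every task is a
 * candidate; otherwise the task_early_kill() filter applies.
 */
static bool sketch_scan_all_tasks(int mf_flags)
{
	return (mf_flags & MF_MEM_PRE_REMOVE) != 0;
}
```

Calling `sketch_shutdown_reason(MF_MEM_PRE_REMOVE, error, want_shutdown)` yields the force-umount reason for any `error` or `want_shutdown` value, which mirrors the "errors can be ignored" comment in the patch.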