From patchwork Fri Sep 16 03:35:15 2022
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 12978053
Subject: [PATCH v2 01/18] fsdax: Wait on @page not @page->_refcount
From: Dan Williams
To: akpm@linux-foundation.org
Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:35:15 -0700 Message-ID: <166329931529.2786261.12375427940949385300.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Authentication-Results: i=1; imf11.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=ZiFQbuI9; spf=pass (imf11.hostedemail.com: domain of dan.j.williams@intel.com designates 192.55.52.136 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299317; a=rsa-sha256; cv=none; b=awXlRxvrYAumBnDHeydA94D4X1KB+o5XYRdquAFIfFI6Gpr+uPD5rUOZXzTp0HVdDPpD6j ikEBsTSFmoibfNhKYmF1YxZ2hbm5ojzw+mYsi+W8QxI2J12kwhd/eW4a6iOB1b0zb//urp ecjXqkaoZmIq+P39WxWOb2piPbkNRaQ= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299317; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=wxwqPldOGGZadzp6x+m/jxCXNhQRyLNrYwqy7ZPvlB0=; b=sjfdzvD6b8TnVSAYPavkEOSb06QdnCxkCIGa0M1TD/BuK1ge7IMbg8gNdxsOjWQ/SwQWsh guQiyb009PiyUV+rwZAmbYV+mNFXMSVQgzNE8lTIn40/mzVNr81qkBIU1fRC2ImNeg2n+n Sqo8Mb0OrFjpeqU5rfTrK1BBZGvgZ6c= X-Rspam-User: X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 8CE374009D Authentication-Results: imf11.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=ZiFQbuI9; spf=pass (imf11.hostedemail.com: domain of dan.j.williams@intel.com designates 192.55.52.136 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com X-Stat-Signature: 8xwyu98r1cq67ddqbo6hczdeag89fc9z X-HE-Tag: 1663299317-393094 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The __wait_var_event facility calculates a wait queue from a hash of the address of the variable being passed. Use the @page argument directly as it is less to type and is the object that is being waited upon. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. 
Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams Reviewed-by: Jason Gunthorpe --- fs/ext4/inode.c | 8 ++++---- fs/fuse/dax.c | 6 +++--- fs/xfs/xfs_file.c | 6 +++--- mm/memremap.c | 2 +- 4 files changed, 11 insertions(+), 11 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 601214453c3a..b028a4413bea 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3961,10 +3961,10 @@ int ext4_break_layouts(struct inode *inode) if (!page) return 0; - error = ___wait_var_event(&page->_refcount, - atomic_read(&page->_refcount) == 1, - TASK_INTERRUPTIBLE, 0, 0, - ext4_wait_dax_page(inode)); + error = ___wait_var_event(page, + atomic_read(&page->_refcount) == 1, + TASK_INTERRUPTIBLE, 0, 0, + ext4_wait_dax_page(inode)); } while (error == 0); return error; diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index e23e802a8013..4e12108c68af 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -676,9 +676,9 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry, return 0; *retry = true; - return ___wait_var_event(&page->_refcount, - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE, - 0, 0, fuse_wait_dax_page(inode)); + return ___wait_var_event(page, atomic_read(&page->_refcount) == 1, + TASK_INTERRUPTIBLE, 0, 0, + fuse_wait_dax_page(inode)); } /* dmap_end == 0 leads to unmapping of whole file */ diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index c6c80265c0b2..73e7b7ec0a4c 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -827,9 +827,9 @@ xfs_break_dax_layouts( return 0; *retry = true; - return ___wait_var_event(&page->_refcount, - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE, - 0, 0, xfs_wait_dax_page(inode)); + return ___wait_var_event(page, atomic_read(&page->_refcount) == 1, + TASK_INTERRUPTIBLE, 0, 0, + xfs_wait_dax_page(inode)); } int diff --git a/mm/memremap.c b/mm/memremap.c index 58b20c3c300b..95f6ffe9cb0f 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -520,7 +520,7 @@ bool __put_devmap_managed_page_refs(struct page *page, int refs) * stable because nobody holds a reference on the page. 
      */
     if (page_ref_sub_return(page, refs) == 1)
-        wake_up_var(&page->_refcount);
+        wake_up_var(page);
     return true;
 }
 EXPORT_SYMBOL(__put_devmap_managed_page_refs);

From patchwork Fri Sep 16 03:35:21 2022
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 12978054
Subject: [PATCH v2 02/18] fsdax: Use dax_page_idle() to document DAX busy page checking
From: Dan Williams
To: akpm@linux-foundation.org
Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:35:21 -0700 Message-ID: <166329932151.2786261.15762187070104795379.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299323; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=KSDLRYTOs6o9D6qKBSGqXW9YZwe42Bkzum27o4zwuVk=; b=04LQX1hsIx7f6XJjeTN8HOHcn3C8GWDkFj4FJWomu2UcbctZq2GTaaZwxMEN/B0xutjTR8 //4MmjjjKusdH8lywVZ/AmELkXl7IakEatWq7ng+CllzftKHg/8zNXPUqWyHkse8FLumLQ lnUHc8ntegxnQu2Gdggcq3xURyz5jyQ= ARC-Authentication-Results: i=1; imf25.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=aMXaOd+S; spf=pass (imf25.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299323; a=rsa-sha256; cv=none; b=E18kGg6BGJcqNpdFB3G7I7Id4sZeUC72zodrrhkX7vqMOVujui9fntv9tACQG0fs0bp2dh YXYy+bCaRW3LTwzAwX45Yi79wnMAZW2DVEaGlJOsS+ntR2liFdpkKNvwFADBtXxt8zgOJo TaZS9z3LHbVktnzxzoUauBoCgYKxB6o= X-Stat-Signature: pubhyashipndrjdkrr1dokte9z7e6n4u X-Rspamd-Queue-Id: 883DAA00CF X-Rspam-User: Authentication-Results: imf25.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=aMXaOd+S; spf=pass (imf25.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com X-Rspamd-Server: rspam06 X-HE-Tag: 1663299323-718899 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: In advance of converting DAX pages to be 0-based, use a new dax_page_idle() helper to both simplify that future conversion, but also document all the kernel locations that are watching for DAX page idle events. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. 
Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams Reviewed-by: Jason Gunthorpe --- fs/dax.c | 4 ++-- fs/ext4/inode.c | 3 +-- fs/fuse/dax.c | 5 ++--- fs/xfs/xfs_file.c | 5 ++--- include/linux/dax.h | 9 +++++++++ 5 files changed, 16 insertions(+), 10 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index c440dcef4b1b..e762b9c04fb4 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -395,7 +395,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping, for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn); - WARN_ON_ONCE(trunc && page_ref_count(page) > 1); + WARN_ON_ONCE(trunc && !dax_page_idle(page)); if (dax_mapping_is_cow(page->mapping)) { /* keep the CoW flag if this page is still shared */ if (page->index-- > 0) @@ -414,7 +414,7 @@ static struct page *dax_busy_page(void *entry) for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn); - if (page_ref_count(page) > 1) + if (!dax_page_idle(page)) return page; } return NULL; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index b028a4413bea..478ec6bc0935 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3961,8 +3961,7 @@ int ext4_break_layouts(struct inode *inode) if (!page) return 0; - error = ___wait_var_event(page, - atomic_read(&page->_refcount) == 1, + error = ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE, 0, 0, ext4_wait_dax_page(inode)); } while (error == 0); diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index 4e12108c68af..ae52ef7dbabe 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -676,9 +676,8 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry, return 0; *retry = true; - return ___wait_var_event(page, atomic_read(&page->_refcount) == 1, - TASK_INTERRUPTIBLE, 0, 0, - fuse_wait_dax_page(inode)); + return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE, + 0, 0, fuse_wait_dax_page(inode)); } /* dmap_end == 0 leads to unmapping of whole file */ diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 73e7b7ec0a4c..556e28d06788 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -827,9 +827,8 @@ xfs_break_dax_layouts( return 0; *retry = true; - return ___wait_var_event(page, atomic_read(&page->_refcount) == 1, - TASK_INTERRUPTIBLE, 0, 0, - xfs_wait_dax_page(inode)); + return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE, + 0, 0, xfs_wait_dax_page(inode)); } int diff --git a/include/linux/dax.h b/include/linux/dax.h index ba985333e26b..04987d14d7e0 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -210,6 +210,15 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero, const struct iomap_ops *ops); +/* + * Document all the code locations that want know when a dax page is + * unreferenced. 
+ */ +static inline bool dax_page_idle(struct page *page) +{ + return page_ref_count(page) == 1; +} + #if IS_ENABLED(CONFIG_DAX) int dax_read_lock(void); void dax_read_unlock(int id);
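[ Editor's note: a sketch of how dax_page_idle() reads at a call site,
  modeled on the converted ext4/fuse/xfs wait loops in this patch;
  fs_break_layouts() and fs_wait_dax_page() are hypothetical stand-ins
  for the per-filesystem helpers. ]

static int fs_break_layouts(struct inode *inode)
{
    struct page *page;
    int error = 0;

    do {
        page = dax_layout_busy_page(inode->i_mapping);
        if (!page)
            return 0;

        /* dax_page_idle(): only the DAX core's reference remains */
        error = ___wait_var_event(page, dax_page_idle(page),
                                  TASK_INTERRUPTIBLE, 0, 0,
                                  fs_wait_dax_page(inode));
    } while (error == 0);

    return error;
}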
Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:35:27 -0700 Message-ID: <166329932730.2786261.8645669907699123863.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=PLTJVHVC; spf=pass (imf18.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.24 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299329; a=rsa-sha256; cv=none; b=veDRjFzlpv2K7x6TBvs2SNoYKunAjAyrmuxoP0lGDMDRsd7Bj3uU1u+5p0sWbFK3oxn2AR cP/E5si9/ISoin7CjQ6Hb9fubGKDMVLa55dRSaAH64CFPv/kGNY8iZeNOqwCkaQkdPVWPo T2l8znvR6wcYd+59UDohQ8xJZd57ljY= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299329; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=mg4sUw8CPGU23UE7zMKNYtu26deQnjSIF5al9lMmnVE=; b=mTaVJPsPRP8zyhjjI3chIUQLknxGWuAJtvEiU3H11Cstn05XX7zIMQknesjNhcLa4pFSCQ oq4hLaoCbPOPrMlYiTEupc6kPR9l40LJKgz2ITRRNsjQL+eP3RWdSFqMpzMv65UGXiyP/H Bg6s7hzf7fK0gXSjOUgQq4Kt5DUmrLo= X-Rspamd-Queue-Id: 459BB1C00A1 X-Rspam-User: Authentication-Results: imf18.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=PLTJVHVC; spf=pass (imf18.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.24 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com X-Stat-Signature: gue6se85zo5kqbf1tcwbeapfx5y5cb3u X-Rspamd-Server: rspam04 X-HE-Tag: 1663299329-527865 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: A page can remain pinned even after it has been unmapped from userspace / removed from the rmap. In advance of requiring that all dax_insert_entry() events are followed up 'break layouts' before a truncate event, make sure that 'break layouts' can find unmapped entries. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. 
Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- fs/dax.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/dax.c b/fs/dax.c index e762b9c04fb4..76bad1c095c0 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -698,7 +698,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping, if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) return NULL; - if (!dax_mapping(mapping) || !mapping_mapped(mapping)) + if (!dax_mapping(mapping)) return NULL; /* If end == LLONG_MAX, all pages from start to till end of file */ From patchwork Fri Sep 16 03:35:33 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12978056 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4251FC32771 for ; Fri, 16 Sep 2022 03:35:36 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id DA7608D0001; Thu, 15 Sep 2022 23:35:35 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D2FEA80008; Thu, 15 Sep 2022 23:35:35 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BAA5A8D0006; Thu, 15 Sep 2022 23:35:35 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id AABA48D0001 for ; Thu, 15 Sep 2022 23:35:35 -0400 (EDT) Received: from smtpin18.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 8011E1A06A4 for ; Fri, 16 Sep 2022 03:35:35 +0000 (UTC) X-FDA: 79916533830.18.94B1980 Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by imf08.hostedemail.com (Postfix) with ESMTP id E39F41600AA for ; Fri, 16 Sep 2022 03:35:34 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1663299335; x=1694835335; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=JMW9u8fUfl6xVouZZpm8FHKzsQjwv8MqqGLA900GgrQ=; b=JCNfjSAGP5zZLy+N/qJaLoTWDZfm1n5Fx/GZiYqeiJtwXGc2oVAXIipf GUj+EWzIpAqAGLHE6vB9G8gXVyE+3b1KP54X1hhIDexQ9/YwXZZUIEFn+ BcCs7pLTFWWhMzkU7Y27Vy2+wKz3xjOwp6zyLjD6j5AWXekgU288aG+LV eSaJouDUlQKVTvcuWSgJ93VN1YXiHA5ansFBYhGZv8iw5T3KVdFuIFQNZ K3DZnxKv/iFa2jYl7Cwv4rvaLU/e2S+C+mmQnkGTCxZ1ROZVigDvSv7vV idcw4Ws4YajsCHESbUm69BdA2r6Wxal/kjeYiVwKiRDZJCb79pTauY3Cj w==; X-IronPort-AV: E=McAfee;i="6500,9779,10471"; a="360643112" X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="360643112" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:35:33 -0700 X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="792961823" Received: from colinlix-mobl.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.209.29.52]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:35:33 -0700 Subject: [PATCH v2 04/18] ext4: Add ext4_break_layouts() to the inode eviction path From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. 
Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:35:33 -0700 Message-ID: <166329933305.2786261.13953404062673878108.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299335; a=rsa-sha256; cv=none; b=xopSIGl1TPA2hioKhCy6VMi8HPmfS8ZBve57OvMaCX6ue87f30Gq2j3J07yc5OeQDOZyfB RTVnK9rTCbn5encSCDsO0BJVJxj0kVr0oTpQKHDVAqJqXqG/R4Diae7o09Svr56Q6BDc2u 9MP5IOr2RhbbhvZzD1k6tBCFFCdMWcY= ARC-Authentication-Results: i=1; imf08.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=JCNfjSAG; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf08.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.31 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299335; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=yDhWgkGnFcULcWPVdwPBarlx337nQjWxNrIx6swgbO4=; b=riEremv5yNFZNIAJhVLJivlTpVKklhGwMlgGZM7FkwIn+cnbCCsx1IOdJISsfJwzYfwC+y 7/E9uT58OsKdwe3TNHweZRYcpdzEiHPxTDgiiF2r7iaAXR9oU6fmYqpLgMOwHpMlYt6SCv jmPW3VD4gf9ssVRYYNOZkOQaKGzdItU= X-Rspam-User: X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: E39F41600AA Authentication-Results: imf08.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=JCNfjSAG; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf08.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.31 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com X-Stat-Signature: bp15tbn4ydq5jpsxqh76rmyxpggjgqgk X-HE-Tag: 1663299334-98886 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: In preparation for moving DAX pages to be 0-based rather than 1-based for the idle refcount, the fsdax core wants to have all mappings in a "zapped" state before truncate. For typical pages this happens naturally via unmap_mapping_range(), for DAX pages some help is needed to record this state in the 'struct address_space' of the inode(s) where the page is mapped. That "zapped" state is recorded in DAX entries as a side effect of ext4_break_layouts(). Arrange for it to be called before all truncation events which already happens for truncate() and PUNCH_HOLE, but not truncate_inode_pages_final(). Arrange for ext4_break_layouts() before truncate_inode_pages_final(). Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. 
Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- fs/ext4/inode.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 478ec6bc0935..326269ad3961 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -207,7 +207,11 @@ void ext4_evict_inode(struct inode *inode) jbd2_complete_transaction(journal, commit_tid); filemap_write_and_wait(&inode->i_data); } + + filemap_invalidate_lock(inode->i_mapping); + ext4_break_layouts(inode); truncate_inode_pages_final(&inode->i_data); + filemap_invalidate_unlock(inode->i_mapping); goto no_delete; } @@ -218,7 +222,11 @@ void ext4_evict_inode(struct inode *inode) if (ext4_should_order_data(inode)) ext4_begin_ordered_truncate(inode, 0); + + filemap_invalidate_lock(inode->i_mapping); + ext4_break_layouts(inode); truncate_inode_pages_final(&inode->i_data); + filemap_invalidate_unlock(inode->i_mapping); /* * For inodes with journalled data, transaction commit could have From patchwork Fri Sep 16 03:35:38 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12978057 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5C869ECAAD3 for ; Fri, 16 Sep 2022 03:35:42 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id EB90180009; Thu, 15 Sep 2022 23:35:41 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id E410F80008; Thu, 15 Sep 2022 23:35:41 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CBB2A80009; Thu, 15 Sep 2022 23:35:41 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id BAAEA80008 for ; Thu, 15 Sep 2022 23:35:41 -0400 (EDT) Received: from smtpin13.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 972D1120665 for ; Fri, 16 Sep 2022 03:35:41 +0000 (UTC) X-FDA: 79916534082.13.AAC59CF Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by imf31.hostedemail.com (Postfix) with ESMTP id E7A48200C7 for ; Fri, 16 Sep 2022 03:35:40 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1663299340; x=1694835340; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=MOEX+9Os5MyF4ZlF0WkG4j7W52nt+aIbvJOvJCH+Qx0=; b=Rx65N7N+wsVwtZUqR+EoOSXCcj0Jj4IhAkt9tDLdAA71oWOPdSSB5DYp URPSBrjpfu5xDPJdJOUXn2uZTGgbCZojARkluhlovTdNfIGB4TDAprOCM 4tz/JW0IsoRvL3QTu8vpPvhG5SkgGPg+LVzX0j6NGC3T7fr9V5MapVXsr Ymb6W76FlBRdrv6GQg1lnHG/a+7owLyuAThWLpmhppTaD0q8UVRqISfKm EHbpcixvRpo4i2hto2Ep2MzDOnLrFPcCnsAaP0WU1RBjN7zgG6clbZPzP GprbrGXqdE1MXouwb02ExSrS+ha6nyzxyM+rbjNgbYqSn2oRnPAiTguPc w==; X-IronPort-AV: E=McAfee;i="6500,9779,10471"; a="362866863" X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="362866863" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:35:39 -0700 X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="792961859" Received: from colinlix-mobl.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.209.29.52]) by 
Subject: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
From: Dan Williams
To: akpm@linux-foundation.org
Cc: Matthew Wilcox, Jan Kara, "Darrick J. Wong", Jason Gunthorpe,
    Christoph Hellwig, John Hubbard, linux-fsdevel@vger.kernel.org,
    nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org,
    linux-ext4@vger.kernel.org
Date: Thu, 15 Sep 2022 20:35:38 -0700
Message-ID: <166329933874.2786261.18236541386474985669.stgit@dwillia2-xfh.jf.intel.com>
In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com>
References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com>

In preparation for moving DAX pages to be 0-based rather than 1-based
for the idle refcount, the fsdax core wants to have all mappings in a
"zapped" state before truncate. For typical pages this happens naturally
via unmap_mapping_range(); for DAX pages some help is needed to record
this state in the 'struct address_space' of the inode(s) where the page
is mapped.

That "zapped" state is recorded in DAX entries as a side effect of
xfs_break_layouts(). It is already called before the truncation done by
truncate() and PUNCH_HOLE, but not before truncate_inode_pages_final().
Arrange for xfs_break_layouts() to run before
truncate_inode_pages_final() as well.

Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J.
Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- fs/xfs/xfs_file.c | 13 +++++++++---- fs/xfs/xfs_inode.c | 3 ++- fs/xfs/xfs_inode.h | 6 ++++-- fs/xfs/xfs_super.c | 22 ++++++++++++++++++++++ 4 files changed, 37 insertions(+), 7 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 556e28d06788..d3ff692d5546 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -816,7 +816,8 @@ xfs_wait_dax_page( int xfs_break_dax_layouts( struct inode *inode, - bool *retry) + bool *retry, + int state) { struct page *page; @@ -827,8 +828,8 @@ xfs_break_dax_layouts( return 0; *retry = true; - return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE, - 0, 0, xfs_wait_dax_page(inode)); + return ___wait_var_event(page, dax_page_idle(page), state, 0, 0, + xfs_wait_dax_page(inode)); } int @@ -839,14 +840,18 @@ xfs_break_layouts( { bool retry; int error; + int state = TASK_INTERRUPTIBLE; ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)); do { retry = false; switch (reason) { + case BREAK_UNMAP_FINAL: + state = TASK_UNINTERRUPTIBLE; + fallthrough; case BREAK_UNMAP: - error = xfs_break_dax_layouts(inode, &retry); + error = xfs_break_dax_layouts(inode, &retry, state); if (error || retry) break; fallthrough; diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 28493c8e9bb2..72ce1cb72736 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -3452,6 +3452,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout( struct xfs_inode *ip1, struct xfs_inode *ip2) { + int state = TASK_INTERRUPTIBLE; int error; bool retry; struct page *page; @@ -3463,7 +3464,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout( retry = false; /* Lock the first inode */ xfs_ilock(ip1, XFS_MMAPLOCK_EXCL); - error = xfs_break_dax_layouts(VFS_I(ip1), &retry); + error = xfs_break_dax_layouts(VFS_I(ip1), &retry, state); if (error || retry) { xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL); if (error == 0 && retry) diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index fa780f08dc89..e4994eb6e521 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -454,11 +454,13 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip) * layout-holder has a consistent view of the file's extent map. While * BREAK_WRITE breaks can be satisfied by recalling FL_LAYOUT leases, * BREAK_UNMAP breaks additionally require waiting for busy dax-pages to - * go idle. + * go idle. BREAK_UNMAP_FINAL is an uninterruptible version of + * BREAK_UNMAP. 
*/ enum layout_break_reason { BREAK_WRITE, BREAK_UNMAP, + BREAK_UNMAP_FINAL, }; /* @@ -531,7 +533,7 @@ xfs_itruncate_extents( } /* from xfs_file.c */ -int xfs_break_dax_layouts(struct inode *inode, bool *retry); +int xfs_break_dax_layouts(struct inode *inode, bool *retry, int state); int xfs_break_layouts(struct inode *inode, uint *iolock, enum layout_break_reason reason); diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 9ac59814bbb6..ebb4a6eba3fc 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -725,6 +725,27 @@ xfs_fs_drop_inode( return generic_drop_inode(inode); } +STATIC void +xfs_fs_evict_inode( + struct inode *inode) +{ + struct xfs_inode *ip = XFS_I(inode); + uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; + long error; + + xfs_ilock(ip, iolock); + + error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP_FINAL); + + /* The final layout break is uninterruptible */ + ASSERT_ALWAYS(!error); + + truncate_inode_pages_final(&inode->i_data); + clear_inode(inode); + + xfs_iunlock(ip, iolock); +} + static void xfs_mount_free( struct xfs_mount *mp) @@ -1144,6 +1165,7 @@ static const struct super_operations xfs_super_operations = { .destroy_inode = xfs_fs_destroy_inode, .dirty_inode = xfs_fs_dirty_inode, .drop_inode = xfs_fs_drop_inode, + .evict_inode = xfs_fs_evict_inode, .put_super = xfs_fs_put_super, .sync_fs = xfs_fs_sync_fs, .freeze_fs = xfs_fs_freeze,
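[ Editor's note: the effect of the new BREAK_UNMAP_FINAL reason reduces
  to the choice of task state for the dax-page wait; a condensed sketch
  with an illustrative helper name. ]

static int layout_break_wait_state(enum layout_break_reason reason)
{
    /*
     * Inode eviction cannot be backed out of, so the wait for DAX
     * pages to go idle must not be interruptible by signals.
     */
    return reason == BREAK_UNMAP_FINAL ? TASK_UNINTERRUPTIBLE
                                       : TASK_INTERRUPTIBLE;
}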
From patchwork Fri Sep 16 03:35:44 2022
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 12978058
Subject: [PATCH v2 06/18] fsdax: Rework dax_layout_busy_page() to dax_zap_mappings()
From: Dan Williams
To: akpm@linux-foundation.org
Cc: Matthew Wilcox, Jan Kara, "Darrick J. Wong", Jason Gunthorpe,
    Christoph Hellwig, John Hubbard, linux-fsdevel@vger.kernel.org,
    nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org,
    linux-ext4@vger.kernel.org
Date: Thu, 15 Sep 2022 20:35:44 -0700
Message-ID: <166329934448.2786261.3862047806587561874.stgit@dwillia2-xfh.jf.intel.com>
In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com>
References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com>

In preparation for moving the truncate vs DAX-busy-page detection from
detecting _refcount == 1 to _refcount == 0, change the busy page
tracking to take references at dax_insert_entry(), drop references at
dax_zap_mappings() time, and finally clean out the entries at
dax_delete_mapping_entries(). This approach will rely on all paths that
call truncate_inode_pages() to first call dax_zap_mappings().
This mirrors the zapped state of pages after unmap_mapping_range(), but since DAX pages do not maintain _mapcount this DAX specific flow is introduced. This approach helps address the immediate _refcount problem, but continues to kick the "DAX without pages?" question down the road. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- fs/dax.c | 82 ++++++++++++++++++++++++++++++++++++--------------- fs/ext4/inode.c | 2 + fs/fuse/dax.c | 4 +- fs/xfs/xfs_file.c | 2 + fs/xfs/xfs_inode.c | 4 +- include/linux/dax.h | 11 ++++--- 6 files changed, 71 insertions(+), 34 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 76bad1c095c0..616bac4b7df3 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -74,11 +74,12 @@ fs_initcall(init_dax_wait_table); * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem * block allocation. */ -#define DAX_SHIFT (4) +#define DAX_SHIFT (5) #define DAX_LOCKED (1UL << 0) #define DAX_PMD (1UL << 1) #define DAX_ZERO_PAGE (1UL << 2) #define DAX_EMPTY (1UL << 3) +#define DAX_ZAP (1UL << 4) static unsigned long dax_to_pfn(void *entry) { @@ -95,6 +96,11 @@ static bool dax_is_locked(void *entry) return xa_to_value(entry) & DAX_LOCKED; } +static bool dax_is_zapped(void *entry) +{ + return xa_to_value(entry) & DAX_ZAP; +} + static unsigned int dax_entry_order(void *entry) { if (xa_to_value(entry) & DAX_PMD) @@ -380,6 +386,7 @@ static void dax_associate_entry(void *entry, struct address_space *mapping, WARN_ON_ONCE(page->mapping); page->mapping = mapping; page->index = index + i++; + page_ref_inc(page); } } } @@ -395,31 +402,20 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping, for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn); - WARN_ON_ONCE(trunc && !dax_page_idle(page)); if (dax_mapping_is_cow(page->mapping)) { /* keep the CoW flag if this page is still shared */ if (page->index-- > 0) continue; - } else + } else { + WARN_ON_ONCE(trunc && !dax_is_zapped(entry)); + WARN_ON_ONCE(trunc && !dax_page_idle(page)); WARN_ON_ONCE(page->mapping && page->mapping != mapping); + } page->mapping = NULL; page->index = 0; } } -static struct page *dax_busy_page(void *entry) -{ - unsigned long pfn; - - for_each_mapped_pfn(entry, pfn) { - struct page *page = pfn_to_page(pfn); - - if (!dax_page_idle(page)) - return page; - } - return NULL; -} - /* * dax_lock_page - Lock the DAX entry corresponding to a page * @page: The page whose entry we want to lock @@ -664,8 +660,46 @@ static void *grab_mapping_entry(struct xa_state *xas, return xa_mk_internal(VM_FAULT_FALLBACK); } +static void *dax_zap_entry(struct xa_state *xas, void *entry) +{ + unsigned long v = xa_to_value(entry); + + return xas_store(xas, xa_mk_value(v | DAX_ZAP)); +} + +/** + * Return NULL if the entry is zapped and all pages in the entry are + * idle, otherwise return the non-idle page in the entry + */ +static struct page *dax_zap_pages(struct xa_state *xas, void *entry) +{ + struct page *ret = NULL; + unsigned long pfn; + bool zap; + + if (!dax_entry_size(entry)) + return NULL; + + zap = !dax_is_zapped(entry); + + for_each_mapped_pfn(entry, pfn) { + struct page *page = pfn_to_page(pfn); + + if (zap) + page_ref_dec(page); + + if (!ret && !dax_page_idle(page)) + ret = page; + } + + if (zap) + dax_zap_entry(xas, entry); + + return ret; +} + /** - * dax_layout_busy_page_range - find first pinned page in @mapping + * dax_zap_mappings_range - find first pinned page in @mapping * 
@mapping: address space to scan for a page with ref count > 1 * @start: Starting offset. Page containing 'start' is included. * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX, @@ -682,8 +716,8 @@ static void *grab_mapping_entry(struct xa_state *xas, * to be able to run unmap_mapping_range() and subsequently not race * mapping_mapped() becoming true. */ -struct page *dax_layout_busy_page_range(struct address_space *mapping, - loff_t start, loff_t end) +struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start, + loff_t end) { void *entry; unsigned int scanned = 0; @@ -727,7 +761,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping, if (unlikely(dax_is_locked(entry))) entry = get_unlocked_entry(&xas, 0); if (entry) - page = dax_busy_page(entry); + page = dax_zap_pages(&xas, entry); put_unlocked_entry(&xas, entry, WAKE_NEXT); if (page) break; @@ -742,13 +776,13 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping, xas_unlock_irq(&xas); return page; } -EXPORT_SYMBOL_GPL(dax_layout_busy_page_range); +EXPORT_SYMBOL_GPL(dax_zap_mappings_range); -struct page *dax_layout_busy_page(struct address_space *mapping) +struct page *dax_zap_mappings(struct address_space *mapping) { - return dax_layout_busy_page_range(mapping, 0, LLONG_MAX); + return dax_zap_mappings_range(mapping, 0, LLONG_MAX); } -EXPORT_SYMBOL_GPL(dax_layout_busy_page); +EXPORT_SYMBOL_GPL(dax_zap_mappings); static int __dax_invalidate_entry(struct address_space *mapping, pgoff_t index, bool trunc) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 326269ad3961..0ce73af69c49 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3965,7 +3965,7 @@ int ext4_break_layouts(struct inode *inode) return -EINVAL; do { - page = dax_layout_busy_page(inode->i_mapping); + page = dax_zap_mappings(inode->i_mapping); if (!page) return 0; diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index ae52ef7dbabe..8cdc9402e8f7 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -443,7 +443,7 @@ static int fuse_setup_new_dax_mapping(struct inode *inode, loff_t pos, /* * Can't do inline reclaim in fault path. We call - * dax_layout_busy_page() before we free a range. And + * dax_zap_mappings() before we free a range. And * fuse_wait_dax_page() drops mapping->invalidate_lock and requires it. * In fault path we enter with mapping->invalidate_lock held and can't * drop it. Also in fault path we hold mapping->invalidate_lock shared @@ -671,7 +671,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry, { struct page *page; - page = dax_layout_busy_page_range(inode->i_mapping, start, end); + page = dax_zap_mappings_range(inode->i_mapping, start, end); if (!page) return 0; diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index d3ff692d5546..918ab9130c96 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -823,7 +823,7 @@ xfs_break_dax_layouts( ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL)); - page = dax_layout_busy_page(inode->i_mapping); + page = dax_zap_mappings(inode->i_mapping); if (!page) return 0; diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 72ce1cb72736..9bbc68500cec 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -3482,8 +3482,8 @@ xfs_mmaplock_two_inodes_and_break_dax_layout( * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable * for this nested lock case. 
*/ - page = dax_layout_busy_page(VFS_I(ip2)->i_mapping); - if (page && page_ref_count(page) != 1) { + page = dax_zap_mappings(VFS_I(ip2)->i_mapping); + if (page) { xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL); xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL); goto again; diff --git a/include/linux/dax.h b/include/linux/dax.h index 04987d14d7e0..f6acb4ed73cb 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -157,8 +157,9 @@ static inline void fs_put_dax(struct dax_device *dax_dev, void *holder) int dax_writeback_mapping_range(struct address_space *mapping, struct dax_device *dax_dev, struct writeback_control *wbc); -struct page *dax_layout_busy_page(struct address_space *mapping); -struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end); +struct page *dax_zap_mappings(struct address_space *mapping); +struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start, + loff_t end); dax_entry_t dax_lock_page(struct page *page); void dax_unlock_page(struct page *page, dax_entry_t cookie); dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, @@ -166,12 +167,14 @@ dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, void dax_unlock_mapping_entry(struct address_space *mapping, unsigned long index, dax_entry_t cookie); #else -static inline struct page *dax_layout_busy_page(struct address_space *mapping) { return NULL; } -static inline struct page *dax_layout_busy_page_range(struct address_space *mapping, pgoff_t start, pgoff_t nr_pages) { return NULL; }
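[ Editor's note: a sketch of the truncate-path contract after this patch,
  per the commit message above; fs_truncate_mapping() is a hypothetical
  caller, the dax_* calls are the ones introduced or renamed by the
  patch. ]

static void fs_truncate_mapping(struct inode *inode)
{
    struct page *page;

    /*
     * Zap the DAX entries (dropping the references taken at
     * dax_insert_entry() time) and wait for any remaining users
     * before the pages are truncated.
     */
    for (page = dax_zap_mappings(inode->i_mapping); page;
         page = dax_zap_mappings(inode->i_mapping))
        wait_var_event(page, dax_page_idle(page));

    truncate_inode_pages(inode->i_mapping, 0);
}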
From patchwork Fri Sep 16 03:35:50 2022
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 12978059
Subject: [PATCH v2 07/18] fsdax: Update dax_insert_entry() calling convention to return an error
From: Dan Williams
To: akpm@linux-foundation.org
Cc: Matthew Wilcox, Jan Kara, "Darrick J. Wong", Jason Gunthorpe,
    Christoph Hellwig, John Hubbard, linux-fsdevel@vger.kernel.org,
    nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org,
    linux-ext4@vger.kernel.org
Date: Thu, 15 Sep 2022 20:35:50 -0700
Message-ID: <166329935018.2786261.15861171979773593749.stgit@dwillia2-xfh.jf.intel.com>
In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com>
References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com>

In preparation for teaching dax_insert_entry() to take live @pgmap
references, enable it to return errors.
Given the observation that all callers overwrite the passed in entry with the return value, just update @entry in place and convert the return code to a vm_fault_t status. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- fs/dax.c | 27 +++++++++++++++++++-------- 1 file changed, 19 insertions(+), 8 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 616bac4b7df3..8382aab0d2f7 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -887,14 +887,15 @@ static bool dax_fault_is_cow(const struct iomap_iter *iter) * already in the tree, we will skip the insertion and just dirty the PMD as * appropriate. */ -static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, - const struct iomap_iter *iter, void *entry, pfn_t pfn, - unsigned long flags) +static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, + const struct iomap_iter *iter, void **pentry, + pfn_t pfn, unsigned long flags) { struct address_space *mapping = vmf->vma->vm_file->f_mapping; void *new_entry = dax_make_entry(pfn, flags); bool dirty = !dax_fault_is_synchronous(iter, vmf->vma); bool cow = dax_fault_is_cow(iter); + void *entry = *pentry; if (dirty) __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); @@ -940,7 +941,8 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, xas_set_mark(xas, PAGECACHE_TAG_TOWRITE); xas_unlock_irq(xas); - return entry; + *pentry = entry; + return 0; } static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev, @@ -1188,9 +1190,12 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf, pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr)); vm_fault_t ret; - *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE); + ret = dax_insert_entry(xas, vmf, iter, entry, pfn, DAX_ZERO_PAGE); + if (ret) + goto out; ret = vmf_insert_mixed(vmf->vma, vaddr, pfn); +out: trace_dax_load_hole(inode, vmf, ret); return ret; } @@ -1207,6 +1212,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf, struct page *zero_page; spinlock_t *ptl; pmd_t pmd_entry; + vm_fault_t ret; pfn_t pfn; zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm); @@ -1215,8 +1221,10 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf, goto fallback; pfn = page_to_pfn_t(zero_page); - *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, - DAX_PMD | DAX_ZERO_PAGE); + ret = dax_insert_entry(xas, vmf, iter, entry, pfn, + DAX_PMD | DAX_ZERO_PAGE); + if (ret) + return ret; if (arch_needs_pgtable_deposit()) { pgtable = pte_alloc_one(vma->vm_mm); @@ -1568,6 +1576,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf, loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT; bool write = iter->flags & IOMAP_WRITE; unsigned long entry_flags = pmd ? DAX_PMD : 0; + vm_fault_t ret; int err = 0; pfn_t pfn; void *kaddr; @@ -1592,7 +1601,9 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf, if (err) return pmd ? 
VM_FAULT_FALLBACK : dax_fault_return(err); - *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, entry_flags); + ret = dax_insert_entry(xas, vmf, iter, entry, pfn, entry_flags); + if (ret) + return ret; if (write && srcmap->type != IOMAP_HOLE && srcmap->addr != iomap->addr) { From patchwork Fri Sep 16 03:35:56 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12978060 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 222DDECAAD3 for ; Fri, 16 Sep 2022 03:35:59 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8EC478D0001; Thu, 15 Sep 2022 23:35:58 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 89B8280008; Thu, 15 Sep 2022 23:35:58 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7166E8D0001; Thu, 15 Sep 2022 23:35:58 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 5E66B80008 for ; Thu, 15 Sep 2022 23:35:58 -0400 (EDT) Received: from smtpin08.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 4042AAB791 for ; Fri, 16 Sep 2022 03:35:58 +0000 (UTC) X-FDA: 79916534796.08.F7F1111 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by imf17.hostedemail.com (Postfix) with ESMTP id BE1EF4009F for ; Fri, 16 Sep 2022 03:35:57 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1663299357; x=1694835357; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=j5tRXbceFLZcGlHWWzLaIriWLFbXvQfVW7h15WIRvnM=; b=UL2QWDO2OrOLviyzqmyxn6CUPUzYJot0RcrZm3TKK6dWevqx6XjZQNLG zD/k58eRw8F9083R4cd0K8e8NPKKGmVcWyI/TqIBfP7z89HpzRcPV+cIL 5EmsdGTR5eU8lhPNPT6eo4mxcHM0qj5KjO55WBRR+aYZ/7e9O6ftlFSJP eLkeT4uhhXHJBAKpbAfIlHve3xIAmY6QPmoANXun1CRjzFB5YgBaFAqyn 6S9Cr0nglpiqmKEM8dVe71HAfrmjSm0lWmO+3VYC+FfqFYAr9HLojg+rS 2s2tvHZJjSg2HnVl7jaZCryit6JV7bK+knz7BXbNTMp+rleHYLe2Xgsmp Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10471"; a="281930499" X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="281930499" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:35:56 -0700 X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="648099914" Received: from colinlix-mobl.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.209.29.52]) by orsmga008-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:35:56 -0700 Subject: [PATCH v2 08/18] fsdax: Cleanup dax_associate_entry() From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. 
Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:35:56 -0700 Message-ID: <166329935598.2786261.15591591637555586864.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299357; a=rsa-sha256; cv=none; b=BzdXPmvsCh1wmYpMVlL/E4gSpYrJqf1turknlvxY6ns2/Oqv0u+BSyoUSYtehTNQ47YBAm Rd+Rqyk+jMFWftb4NqNcrhWCGnB9WdaGdqvMlVOojVKb4LYl+wH+WgkatN9iO+KZuyNa43 XRjqF8cCyhPZq3lWNQK9A2EXCYlD4Ac= ARC-Authentication-Results: i=1; imf17.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=UL2QWDO2; spf=pass (imf17.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299357; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=1MpRCRDLHif5gtcTvfZSuqk9tx0XT6kDDZSl0SePm9c=; b=nI9uHird2I/nZC6QzZCJEOmDg0phtQUWbALH362jAzZTQepg98J+hH6+cjyad3SE2D2/Gc TcKU0Me9h+d0e9OO/+0Aa+UTkjDW8N/4sx5Lt7sfqYeKyl7Dp2cmcnr8hswt2FUD/9xF8w h2NOXAKZ5RDDczUtY8wtH9MKngR/sBk= X-Rspam-User: Authentication-Results: imf17.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=UL2QWDO2; spf=pass (imf17.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com X-Rspamd-Server: rspam03 X-Stat-Signature: m41fec7qowct998s6teebcrkodn38xr9 X-Rspamd-Queue-Id: BE1EF4009F X-HE-Tag: 1663299357-354975 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Pass @vmf to drop the separate @vma and @address arguments to dax_associate_entry(), use the existing DAX flags to convey the @cow argument, and replace the open-coded ALIGN(). Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- fs/dax.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 8382aab0d2f7..bd5c6b6e371e 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -368,7 +368,7 @@ static inline void dax_mapping_set_cow(struct page *page) * FS_DAX_MAPPING_COW, and use page->index as refcount. 
*/ static void dax_associate_entry(void *entry, struct address_space *mapping, - struct vm_area_struct *vma, unsigned long address, bool cow) + struct vm_fault *vmf, unsigned long flags) { unsigned long size = dax_entry_size(entry), pfn, index; int i = 0; @@ -376,11 +376,11 @@ static void dax_associate_entry(void *entry, struct address_space *mapping, if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) return; - index = linear_page_index(vma, address & ~(size - 1)); + index = linear_page_index(vmf->vma, ALIGN(vmf->address, size)); for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn); - if (cow) { + if (flags & DAX_COW) { dax_mapping_set_cow(page); } else { WARN_ON_ONCE(page->mapping); @@ -916,8 +916,7 @@ static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, void *old; dax_disassociate_entry(entry, mapping, false); - dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address, - cow); + dax_associate_entry(new_entry, mapping, vmf, flags); /* * Only swap our new entry into the page cache if the current * entry is a zero page or an empty entry. If a normal PTE or From patchwork Fri Sep 16 03:36:01 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12978061 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 975FAECAAD3 for ; Fri, 16 Sep 2022 03:36:04 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3E3208D0006; Thu, 15 Sep 2022 23:36:04 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 344608D0003; Thu, 15 Sep 2022 23:36:04 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 197EF8D0006; Thu, 15 Sep 2022 23:36:04 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 0B3598D0003 for ; Thu, 15 Sep 2022 23:36:04 -0400 (EDT) Received: from smtpin23.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id D7157ABC1A for ; Fri, 16 Sep 2022 03:36:03 +0000 (UTC) X-FDA: 79916535006.23.3EB356E Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by imf26.hostedemail.com (Postfix) with ESMTP id 5DC6D1400D0 for ; Fri, 16 Sep 2022 03:36:03 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1663299363; x=1694835363; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=bcuK2KG5EjEAqreYAdzRR3rq0EZdnkZSoL2J30lxqoE=; b=lp1U3OCfCV+qGfUV+6dbeB09e/4MKNE7DTbDCDGVQ+qaf4ccCpOsffQZ zc/5ljU1+v+D7Yiec5KBa654X3A2uJCN3rd/jjucgzyZ45qLQeuZPI+Jb XRb8cK0iOtp1J6Us1FCJ0C88Zi7XfOiLJoqR7CqvFB1SsBdDFXsGQSH/G dg69iMQPSH9hF297E4DajzntYAQnnjg+FEUBMzGCctf0st2obkihBUvBT bxBxXlXrIgmb0F/kiBZnlkADHGqAOiWGk+oZfI26nlVAxTZcTRnZLnzEs qNZ2ZUVkFZmCgEUuiXDqS+3QR2hlEEVnWb8BLXtOAHAmhT08Rz4SHaI4W w==; X-IronPort-AV: E=McAfee;i="6500,9779,10471"; a="362866919" X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="362866919" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:02 -0700 X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="648099926" 
Received: from colinlix-mobl.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.209.29.52]) by orsmga008-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:01 -0700 Subject: [PATCH v2 09/18] fsdax: Rework dax_insert_entry() calling convention From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:36:01 -0700 Message-ID: <166329936170.2786261.6094157723547541341.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299363; a=rsa-sha256; cv=none; b=msVaijv+mNkr6LPD5CRIphSaW6drGj1JATJ3iV2/udLSpDg9lLKM1lizwCL57Zuti1G/0z SLVpW5AdAQdBo5VQcGD4ag5v0HLUwp6i83jgrG0miBRusXAxSH0pjKCyNuejWGNMqQ8AXW 9UZOzZAdFPIDfSdLkBFLtN6IaFJN1uc= ARC-Authentication-Results: i=1; imf26.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=lp1U3OCf; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf26.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.100 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299363; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=3QqiST1BUsThE+uICixNAiL8FZOZnuA9KvCbeklWwj0=; b=rQcgxtHV0af5kFrcsM0SCtYgRz0DYKRzGU97nOKsqilN2FWVDCwgnROkwkXMAfemG95jvo 5ayb5m70mebyHdiC94aHHHq6vOd/w1VmBSj/yhsjVGPYvL/aL1wyt/Z2EPUVMYpKdIT/tK Uyzs5FazV4P+eUc1VF6QlZPHpHoc4ug= X-Rspam-User: Authentication-Results: imf26.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=lp1U3OCf; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf26.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.100 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com X-Stat-Signature: rwnwotyf6u8r95dgmnshof6fw964ssx4 X-Rspamd-Queue-Id: 5DC6D1400D0 X-Rspamd-Server: rspam09 X-HE-Tag: 1663299363-776252 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Move the determination of @dirty and @cow in dax_insert_entry() to flags (DAX_DIRTY and DAX_COW) that are passed in. This allows the iomap related code to remain fs/dax.c in preparation for the Xarray infrastructure to move to drivers/dax/mapping.c. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- fs/dax.c | 44 +++++++++++++++++++++++++++++++++++--------- 1 file changed, 35 insertions(+), 9 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index bd5c6b6e371e..5d9f30105db4 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -75,12 +75,20 @@ fs_initcall(init_dax_wait_table); * block allocation. 
*/ #define DAX_SHIFT (5) +#define DAX_MASK ((1UL << DAX_SHIFT) - 1) #define DAX_LOCKED (1UL << 0) #define DAX_PMD (1UL << 1) #define DAX_ZERO_PAGE (1UL << 2) #define DAX_EMPTY (1UL << 3) #define DAX_ZAP (1UL << 4) +/* + * These flags are not conveyed in Xarray value entries, they are just + * modifiers to dax_insert_entry(). + */ +#define DAX_DIRTY (1UL << (DAX_SHIFT + 0)) +#define DAX_COW (1UL << (DAX_SHIFT + 1)) + static unsigned long dax_to_pfn(void *entry) { return xa_to_value(entry) >> DAX_SHIFT; @@ -88,7 +96,8 @@ static unsigned long dax_to_pfn(void *entry) static void *dax_make_entry(pfn_t pfn, unsigned long flags) { - return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT)); + return xa_mk_value((flags & DAX_MASK) | + (pfn_t_to_pfn(pfn) << DAX_SHIFT)); } static bool dax_is_locked(void *entry) @@ -880,6 +889,20 @@ static bool dax_fault_is_cow(const struct iomap_iter *iter) (iter->iomap.flags & IOMAP_F_SHARED); } +static unsigned long dax_iter_flags(const struct iomap_iter *iter, + struct vm_fault *vmf) +{ + unsigned long flags = 0; + + if (!dax_fault_is_synchronous(iter, vmf->vma)) + flags |= DAX_DIRTY; + + if (dax_fault_is_cow(iter)) + flags |= DAX_COW; + + return flags; +} + /* * By this point grab_mapping_entry() has ensured that we have a locked entry * of the appropriate size so we don't have to worry about downgrading PMDs to @@ -888,13 +911,13 @@ static bool dax_fault_is_cow(const struct iomap_iter *iter) * appropriate. */ static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, - const struct iomap_iter *iter, void **pentry, - pfn_t pfn, unsigned long flags) + void **pentry, pfn_t pfn, + unsigned long flags) { struct address_space *mapping = vmf->vma->vm_file->f_mapping; void *new_entry = dax_make_entry(pfn, flags); - bool dirty = !dax_fault_is_synchronous(iter, vmf->vma); - bool cow = dax_fault_is_cow(iter); + bool dirty = flags & DAX_DIRTY; + bool cow = flags & DAX_COW; void *entry = *pentry; if (dirty) @@ -1189,7 +1212,8 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf, pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr)); vm_fault_t ret; - ret = dax_insert_entry(xas, vmf, iter, entry, pfn, DAX_ZERO_PAGE); + ret = dax_insert_entry(xas, vmf, entry, pfn, + DAX_ZERO_PAGE | dax_iter_flags(iter, vmf)); if (ret) goto out; @@ -1220,8 +1244,9 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf, goto fallback; pfn = page_to_pfn_t(zero_page); - ret = dax_insert_entry(xas, vmf, iter, entry, pfn, - DAX_PMD | DAX_ZERO_PAGE); + ret = dax_insert_entry(xas, vmf, entry, pfn, + DAX_PMD | DAX_ZERO_PAGE | + dax_iter_flags(iter, vmf)); if (ret) return ret; @@ -1600,7 +1625,8 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf, if (err) return pmd ? 
VM_FAULT_FALLBACK : dax_fault_return(err); - ret = dax_insert_entry(xas, vmf, iter, entry, pfn, entry_flags); + ret = dax_insert_entry(xas, vmf, entry, pfn, + entry_flags | dax_iter_flags(iter, vmf)); if (ret) return ret; From patchwork Fri Sep 16 03:36:07 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12978062 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 86C42ECAAD3 for ; Fri, 16 Sep 2022 03:36:10 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 2834E8D0003; Thu, 15 Sep 2022 23:36:10 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 20C7A8D0002; Thu, 15 Sep 2022 23:36:10 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 086A38D0003; Thu, 15 Sep 2022 23:36:10 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id EAFBE8D0002 for ; Thu, 15 Sep 2022 23:36:09 -0400 (EDT) Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id C4774C05FF for ; Fri, 16 Sep 2022 03:36:09 +0000 (UTC) X-FDA: 79916535258.01.7BBAEAD Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by imf13.hostedemail.com (Postfix) with ESMTP id 3E55D200B0 for ; Fri, 16 Sep 2022 03:36:09 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1663299369; x=1694835369; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=iTs/4dIxqdixjfZltzlbEEuflTOzT+TWXRHIgDMaAds=; b=VENFNSjaNPOe2FYTehkOHSsem3xTJTeUAT1uSIkKX9Ucg4+XsGM3xQUl +3ItQWAKXjXPssUoIYfZzUwiqEDATQHJ5hIAOmqWf8p8ZCNHJQXJD07pP fVl+VYgvqO1gNyyM05WSikR2vRXFn0RFlLO46VBDBwDBCm5vhzm9+Xx7y s95/3GHMzC76A9KYzJWheMsDwpMXt9hF+JHUT5ZowZ4oGH6pE553bdvFw g4fMgkx7MuQKiJrKL+ZL4Drj3FZqRr1j2TUQZ9Bg1hSe1u9C1Z88ObHr1 7L0BjMtDr76KfHeSNBLGABQctedWFAj2/+ZUrWOoeQmL6NFXzX98QoxVT Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10471"; a="300264015" X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="300264015" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:08 -0700 X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="648099943" Received: from colinlix-mobl.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.209.29.52]) by orsmga008-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:07 -0700 Subject: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. 
Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:36:07 -0700 Message-ID: <166329936739.2786261.14035402420254589047.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299369; a=rsa-sha256; cv=none; b=eSJffz/iaNcYa5M6fKxT9ShSe1X+ACa+5zbEyOIbT7vV5r6BMpExsdMElZ1mUSii3m3Qn/ REd8QHtOoUsBJomUOptRK3IYdqsWX+pGqb3kYtm2h6FZAl8ruoKofTm8wmlLRhWr9vnk18 h6zQ65x+hqqqkiv8nT8Nf4beWIs9rtM= ARC-Authentication-Results: i=1; imf13.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=VENFNSja; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf13.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.65 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299369; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=UTDcOrfdTsip2YQKKp9cBvH/zqTGNVpC2GBtfJsmh1Y=; b=iJNgh9elPJ6gdsf80wmvveCtBlcSZ0eLsCvtvHf/XkkFua/4+4ryc+yj16/cgZME5pr9JF YHaipF0ooMMD2gFTLmTF01e7L4PGyeRc5U60jS7cs7NHnh/KRFQIs0C0mvoo5OEX2HUL3b rI4Y2ngQSYGs9N02eCklBE5i7DOQ//I= X-Stat-Signature: fefg115rzkwaczxdwfamhz496dibm4xw X-Rspamd-Queue-Id: 3E55D200B0 Authentication-Results: imf13.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=VENFNSja; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf13.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.65 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com X-Rspam-User: X-Rspamd-Server: rspam10 X-HE-Tag: 1663299369-341625 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The percpu_ref in 'struct dev_pagemap' is used to coordinate active mappings of device-memory with the device-removal / unbind path. It enables the semantic that initiating device-removal (or device-driver-unbind) blocks new mapping and DMA attempts, and waits for mapping revocation or inflight DMA to complete. Expand the scope of the reference count to pin the DAX device active at mapping time and not later at the first gup event. With a device reference being held while any page on that device is mapped the need to manage pgmap reference counts in the gup code is eliminated. That cleanup is saved for a follow-on change. For now, teach dax_insert_entry() and dax_delete_mapping_entry() to take and drop pgmap references respectively. Where dax_insert_entry() is called to take the initial reference on the page, and dax_delete_mapping_entry() is called once there are no outstanding references to the given page(s). Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. 
Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- fs/dax.c | 34 ++++++++++++++++++++++++++++------ include/linux/memremap.h | 18 ++++++++++++++---- mm/memremap.c | 13 ++++++++----- 3 files changed, 50 insertions(+), 15 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 5d9f30105db4..ee2568c8b135 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -376,14 +376,26 @@ static inline void dax_mapping_set_cow(struct page *page) * whether this entry is shared by multiple files. If so, set the page->mapping * FS_DAX_MAPPING_COW, and use page->index as refcount. */ -static void dax_associate_entry(void *entry, struct address_space *mapping, - struct vm_fault *vmf, unsigned long flags) +static vm_fault_t dax_associate_entry(void *entry, + struct address_space *mapping, + struct vm_fault *vmf, unsigned long flags) { unsigned long size = dax_entry_size(entry), pfn, index; + struct dev_pagemap *pgmap; int i = 0; if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) - return; + return 0; + + if (!size) + return 0; + + if (!(flags & DAX_COW)) { + pfn = dax_to_pfn(entry); + pgmap = get_dev_pagemap_many(pfn, NULL, PHYS_PFN(size)); + if (!pgmap) + return VM_FAULT_SIGBUS; + } index = linear_page_index(vmf->vma, ALIGN(vmf->address, size)); for_each_mapped_pfn(entry, pfn) { @@ -398,19 +410,24 @@ static void dax_associate_entry(void *entry, struct address_space *mapping, page_ref_inc(page); } } + + return 0; } static void dax_disassociate_entry(void *entry, struct address_space *mapping, bool trunc) { - unsigned long pfn; + unsigned long size = dax_entry_size(entry), pfn; + struct page *page; if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) return; - for_each_mapped_pfn(entry, pfn) { - struct page *page = pfn_to_page(pfn); + if (!size) + return; + for_each_mapped_pfn(entry, pfn) { + page = pfn_to_page(pfn); if (dax_mapping_is_cow(page->mapping)) { /* keep the CoW flag if this page is still shared */ if (page->index-- > 0) @@ -423,6 +440,11 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping, page->mapping = NULL; page->index = 0; } + + if (trunc && !dax_mapping_is_cow(page->mapping)) { + page = pfn_to_page(dax_to_pfn(entry)); + put_dev_pagemap_many(page->pgmap, PHYS_PFN(size)); + } } /* diff --git a/include/linux/memremap.h b/include/linux/memremap.h index c3b4cc84877b..fd57407e7f3d 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -191,8 +191,13 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid); void memunmap_pages(struct dev_pagemap *pgmap); void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap); -struct dev_pagemap *get_dev_pagemap(unsigned long pfn, - struct dev_pagemap *pgmap); +struct dev_pagemap *get_dev_pagemap_many(unsigned long pfn, + struct dev_pagemap *pgmap, int refs); +static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn, + struct dev_pagemap *pgmap) +{ + return get_dev_pagemap_many(pfn, pgmap, 1); +} bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn); unsigned long vmem_altmap_offset(struct vmem_altmap *altmap); @@ -244,10 +249,15 @@ static inline unsigned long memremap_compat_align(void) } #endif /* CONFIG_ZONE_DEVICE */ -static inline void put_dev_pagemap(struct dev_pagemap *pgmap) +static inline void put_dev_pagemap_many(struct dev_pagemap *pgmap, int refs) { if (pgmap) - percpu_ref_put(&pgmap->ref); + percpu_ref_put_many(&pgmap->ref, refs); +} + +static inline void put_dev_pagemap(struct dev_pagemap 
*pgmap) +{ + put_dev_pagemap_many(pgmap, 1); } #endif /* _LINUX_MEMREMAP_H_ */ diff --git a/mm/memremap.c b/mm/memremap.c index 95f6ffe9cb0f..83c5e6fafd84 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -430,15 +430,16 @@ void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns) } /** - * get_dev_pagemap() - take a new live reference on the dev_pagemap for @pfn + * get_dev_pagemap_many() - take new live references(s) on the dev_pagemap for @pfn * @pfn: page frame number to lookup page_map * @pgmap: optional known pgmap that already has a reference + * @refs: number of references to take * * If @pgmap is non-NULL and covers @pfn it will be returned as-is. If @pgmap * is non-NULL but does not cover @pfn the reference to it will be released. */ -struct dev_pagemap *get_dev_pagemap(unsigned long pfn, - struct dev_pagemap *pgmap) +struct dev_pagemap *get_dev_pagemap_many(unsigned long pfn, + struct dev_pagemap *pgmap, int refs) { resource_size_t phys = PFN_PHYS(pfn); @@ -454,13 +455,15 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn, /* fall back to slow path lookup */ rcu_read_lock(); pgmap = xa_load(&pgmap_array, PHYS_PFN(phys)); - if (pgmap && !percpu_ref_tryget_live(&pgmap->ref)) + if (pgmap && !percpu_ref_tryget_live_rcu(&pgmap->ref)) pgmap = NULL; + if (pgmap && refs > 1) + percpu_ref_get_many(&pgmap->ref, refs - 1); rcu_read_unlock(); return pgmap; } -EXPORT_SYMBOL_GPL(get_dev_pagemap); +EXPORT_SYMBOL_GPL(get_dev_pagemap_many); void free_zone_device_page(struct page *page) { From patchwork Fri Sep 16 03:36:13 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12978063 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 559F5C32771 for ; Fri, 16 Sep 2022 03:36:16 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E92828D0007; Thu, 15 Sep 2022 23:36:15 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id E4E038D0002; Thu, 15 Sep 2022 23:36:15 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CE28D8D0007; Thu, 15 Sep 2022 23:36:15 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id BEB0A8D0002 for ; Thu, 15 Sep 2022 23:36:15 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 9BDF0C0584 for ; Fri, 16 Sep 2022 03:36:15 +0000 (UTC) X-FDA: 79916535510.05.0D74AA6 Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by imf08.hostedemail.com (Postfix) with ESMTP id 251101600AA for ; Fri, 16 Sep 2022 03:36:14 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1663299375; x=1694835375; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=qfzx5y9B9XHcY/WNGZvpd9adYJw1q+hyjxJ0sZQS8JA=; b=duC/wdcX3P/UACbq+maPmC3ScRGncvXOsXpsPCjhvYkTMKDHfo25HfkH pmLkS/+WtkmETM8/N8DjmiSLq9HOizy3Rn3TuDVOvTr6Fuh2BJNXegtv1 +uillFeIvA38VyEOthSIjiaB8WSY+Hl0fxtkQoQ9KZbq0SZQYtfuDnUGN Kin1o6Drr6pYKIcZ3E/gnmw4xiiF8EjPD7PKDB0PoG+XHCpQqXJg9BQvK FB0Eo8Q7khKo9Tza1WRhKNJ50Tf4Psj7Qf1XVTmHqTSgonCKY7n+u0Apo 
4iOCHjrSp6oiYfxWxSyKHqQ1RBfzuKQw2ZOO5fPaw3HoEusCkXwitvTxr A==; X-IronPort-AV: E=McAfee;i="6500,9779,10471"; a="297625069" X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="297625069" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:13 -0700 X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="648099963" Received: from colinlix-mobl.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.209.29.52]) by orsmga008-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:13 -0700 Subject: [PATCH v2 11/18] devdax: Minor warning fixups From: Dan Williams To: akpm@linux-foundation.org Cc: hch@lst.de, linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:36:13 -0700 Message-ID: <166329937313.2786261.6805174536617254263.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Authentication-Results: i=1; imf08.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b="duC/wdcX"; spf=pass (imf08.hostedemail.com: domain of dan.j.williams@intel.com designates 192.55.52.120 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299375; a=rsa-sha256; cv=none; b=yVhfUyzWylxpmk9CQNvvvho5+AIalFm91eCTdNe6Gicxo7/9GXKW7NzkXUZE6YMh1UhAg8 wQDiYxum5qbmcTlusaAJFVtSFQSiouEQQ3VgZDuXriWCkR/v1PDvFNxXGS+Az42RYiJYwF Lw4sVHr8cNb3fcG+8p00LyzYZY48j+4= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299375; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=1vgZdDU9Ny4uGe480EWaiLp5fsg1lyKdoZ2O0fPs0wc=; b=fEXJxbr+Hxs3tsB+EdO/lXoPoC26LxRTkC7Ucp0WYobdobPfmvyn2YXGD9E1SKFFjNz25P vOBA4QKDy63jhH05V0sIzZ6EjVo957Zb1gEEW8tvvfnY27/VMh/mgTPOhtlC/29nRF80aJ oYrXPsGLWiEIWhYO3fGr6GTx4wwGnfs= X-Rspamd-Queue-Id: 251101600AA X-Rspam-User: Authentication-Results: imf08.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b="duC/wdcX"; spf=pass (imf08.hostedemail.com: domain of dan.j.williams@intel.com designates 192.55.52.120 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com X-Stat-Signature: 7pb8q8d3hokaytcqpka6wc638esgqh7t X-Rspamd-Server: rspam04 X-HE-Tag: 1663299374-322394 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Fix a missing prototype warning for dev_dax_probe(), and fix dax_holder() comment block format. 
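(Aside, illustrative only: the prototype warning in question is the kind GCC raises via -Wmissing-prototypes, typically surfaced by W=1 builds, when a non-static function is defined without a prior declaration in scope. A minimal sketch, not taken from this patch:

	/* example.c -- hypothetical file */
	int some_probe(void)	/* warns: no previous prototype for this function */
	{
		return 0;
	}

Declaring the function in a header that the defining file includes, as dax-private.h does below for dev_dax_probe(), provides that prototype and silences the warning.)
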
Signed-off-by: Dan Williams --- drivers/dax/dax-private.h | 1 + drivers/dax/super.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 1c974b7caae6..202cafd836e8 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -87,6 +87,7 @@ static inline struct dax_mapping *to_dax_mapping(struct device *dev) } phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size); +int dev_dax_probe(struct dev_dax *dev_dax); #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline bool dax_align_valid(unsigned long align) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 9b5e2a5eb0ae..4909ad945a49 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -475,7 +475,7 @@ EXPORT_SYMBOL_GPL(put_dax); /** * dax_holder() - obtain the holder of a dax device * @dax_dev: a dax_device instance - + * * Return: the holder's data which represents the holder if registered, * otherwize NULL. */ From patchwork Fri Sep 16 03:36:18 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12978064 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4276BECAAD3 for ; Fri, 16 Sep 2022 03:36:24 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D81A68D0005; Thu, 15 Sep 2022 23:36:23 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D08438D0002; Thu, 15 Sep 2022 23:36:23 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A98888D0005; Thu, 15 Sep 2022 23:36:23 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 94A6B8D0002 for ; Thu, 15 Sep 2022 23:36:23 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 120B11C2E04 for ; Fri, 16 Sep 2022 03:36:22 +0000 (UTC) X-FDA: 79916535804.16.5DC31E6 Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by imf20.hostedemail.com (Postfix) with ESMTP id 3E2041C00C3 for ; Fri, 16 Sep 2022 03:36:21 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1663299381; x=1694835381; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=yokvy2vJtxWd/DCQDxwhex72QghSUCJTijVP30FYlfM=; b=e3LG0W6CaGvm8RcfpJ8znQx9SPcxmqTBqpKA8SZLBn69TKvgmLBZ744j RsedNsWSJ0xfU5A3YCQSUaFIFmJpem9zDaQcOmsITT3As4KZiGDQIG1sI HXiSfsMEdO9/imdMPrDtm6E6xN6oy9o9dj3WLoyAoUWB8ZEKU9/brGqXN JsDqKjPBkKm3lhCc0rstCiq32T7nBgLCBEYZibi3eC2FTkXzDiSjYS01K jzKmuySptAtMt75vtm0JL8eECpaos2R0SqZRqxNk72ugyi+1zpDNnLJab Dkv627alA8VwYt8ppIObz/H8UY38wG4wZqgD33HAS0iPpC+2efUeD9/zA g==; X-IronPort-AV: E=McAfee;i="6500,9779,10471"; a="296491109" X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="296491109" Received: from fmsmga007.fm.intel.com ([10.253.24.52]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:20 -0700 X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="619942605" Received: from colinlix-mobl.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.209.29.52]) by 
fmsmga007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:19 -0700 Subject: [PATCH v2 12/18] devdax: Move address_space helpers to the DAX core From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:36:18 -0700 Message-ID: <166329937873.2786261.10966526479509910698.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Authentication-Results: i=1; imf20.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=e3LG0W6C; spf=pass (imf20.hostedemail.com: domain of dan.j.williams@intel.com designates 192.55.52.93 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299381; a=rsa-sha256; cv=none; b=G2YhAe+2TcHeil6dWWGGOEDFG4E0xZRDoQ/LLy8Xly8TwBOll+XoITzWOGjOSzqZ7B05yX BTlIxMzrZa9FGChCjNZx10hRpFP/m0bVYCvioMkWhU6vjaOWpfhRpAdmwXVGkZjEZimzvm gsOn4q778ZJJJXWm0Ec64kIloQqo5fE= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299381; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=OuupUXZwlFwpF31znKzYCej5q8F42C1rMLsr8QBBRKg=; b=wsfyeVTjicT4uvTsheMiHzo7ZTbVpa9rWAMjRyMp9NXNDGAo+vMQ3eanjhqvLfGNx4OfuM aQmaxV8J7xmXedB1UM5qKd3ZYFfWLK7HTV2h2kVgGHiLZFouH+5I8NOlmPCCxyrlxvqHEk 4AN12PutNbtO8nPv3C4ju+C14A0xFJY= X-Rspam-User: X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 3E2041C00C3 Authentication-Results: imf20.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=e3LG0W6C; spf=pass (imf20.hostedemail.com: domain of dan.j.williams@intel.com designates 192.55.52.93 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com X-Stat-Signature: ad96pidpsjmb35t1cxttuuduf4a56c8r X-HE-Tag: 1663299381-132662 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: In preparation for decamping get_dev_pagemap() and put_devmap_managed_page() from code paths outside of DAX, device-dax needs to track mapping references similar to the tracking done for fsdax. Reuse the same infrastructure as fsdax (dax_insert_entry() and dax_delete_mapping_entry()). For now, just move that infrastructure into a common location with no other code changes. The move involves splitting iomap and supporting helpers into fs/dax.c and all 'struct address_space' and DAX-entry manipulation into drivers/dax/mapping.c. grab_mapping_entry() is renamed dax_grab_mapping_entry(), and some common definitions and declarations are moved to include/linux/dax.h. No functional change is intended, just code movement. 
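To make the consumer-facing result of the move concrete, here is a minimal sketch of how a fault handler can use the relocated helpers once they are exported from the DAX core. This is a hypothetical function for illustration only; device-dax is not converted until later in the series, and a real user would also call dax_insert_entry() to associate the pfn with the mapping:

	/* assumes <linux/dax.h> and <linux/mm.h>; example_dax_fault() is not part of the patch */
	static vm_fault_t example_dax_fault(struct vm_fault *vmf, pfn_t pfn)
	{
		struct address_space *mapping = vmf->vma->vm_file->f_mapping;
		XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
		vm_fault_t ret;
		void *entry;

		/*
		 * Returns a locked entry, or a VM_FAULT code encoded as an
		 * xarray internal value on failure.
		 */
		entry = dax_grab_mapping_entry(&xas, mapping, 0);
		if (xa_is_internal(entry))
			return xa_to_internal(entry);

		ret = vmf_insert_mixed(vmf->vma, vmf->address, pfn);

		dax_unlock_entry(&xas, entry);
		return ret;
	}

Centralizing the Xarray entry locking behind dax_grab_mapping_entry()/dax_unlock_entry() is what lets device-dax later share the same mapping discipline that fsdax already uses.
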
The interactions between drivers/dax/mapping.o and mm/memory-failure.o result in drivers/dax/mapping.o and the rest of the dax core losing its option to be compiled as a module. That can be addressed later given the fact the CONFIG_FS_DAX has always been forcing the dax core to be compiled in. I.e. this is only a vmlinux size regression for CONFIG_FS_DAX=n and CONFIG_DEV_DAX=m builds. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- .clang-format | 1 drivers/Makefile | 2 drivers/dax/Kconfig | 4 drivers/dax/Makefile | 1 drivers/dax/dax-private.h | 1 drivers/dax/mapping.c | 1010 +++++++++++++++++++++++++++++++++++++++++ drivers/dax/super.c | 4 drivers/nvdimm/Kconfig | 1 fs/dax.c | 1109 +-------------------------------------------- include/linux/dax.h | 110 +++- include/linux/memremap.h | 6 11 files changed, 1143 insertions(+), 1106 deletions(-) create mode 100644 drivers/dax/mapping.c diff --git a/.clang-format b/.clang-format index 1247d54f9e49..336fa266386e 100644 --- a/.clang-format +++ b/.clang-format @@ -269,6 +269,7 @@ ForEachMacros: - 'for_each_link_cpus' - 'for_each_link_platforms' - 'for_each_lru' + - 'for_each_mapped_pfn' - 'for_each_matching_node' - 'for_each_matching_node_and_match' - 'for_each_mem_pfn_range' diff --git a/drivers/Makefile b/drivers/Makefile index 057857258bfd..ec6c4146b966 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -71,7 +71,7 @@ obj-$(CONFIG_FB_INTEL) += video/fbdev/intelfb/ obj-$(CONFIG_PARPORT) += parport/ obj-y += base/ block/ misc/ mfd/ nfc/ obj-$(CONFIG_LIBNVDIMM) += nvdimm/ -obj-$(CONFIG_DAX) += dax/ +obj-y += dax/ obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/ obj-$(CONFIG_NUBUS) += nubus/ obj-y += cxl/ diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig index 5fdf269a822e..205e9dda8928 100644 --- a/drivers/dax/Kconfig +++ b/drivers/dax/Kconfig @@ -1,8 +1,8 @@ # SPDX-License-Identifier: GPL-2.0-only menuconfig DAX - tristate "DAX: direct access to differentiated memory" + bool "DAX: direct access to differentiated memory" + depends on MMU select SRCU - default m if NVDIMM_DAX if DAX diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile index 90a56ca3b345..3546bca7adbf 100644 --- a/drivers/dax/Makefile +++ b/drivers/dax/Makefile @@ -6,6 +6,7 @@ obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o dax-y := super.o dax-y += bus.o +dax-y += mapping.o device_dax-y := device.o dax_pmem-y := pmem.o diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 202cafd836e8..19076f9d5c51 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -15,6 +15,7 @@ struct dax_device *inode_dax(struct inode *inode); struct inode *dax_inode(struct dax_device *dax_dev); int dax_bus_init(void); void dax_bus_exit(void); +void dax_mapping_init(void); /** * struct dax_region - mapping infrastructure for dax devices diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c new file mode 100644 index 000000000000..70576aa02148 --- /dev/null +++ b/drivers/dax/mapping.c @@ -0,0 +1,1010 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Direct Access mapping infrastructure split from fs/dax.c + * Copyright (c) 2013-2014 Intel Corporation + * Author: Matthew Wilcox + * Author: Ross Zwisler + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "dax-private.h" + +#define CREATE_TRACE_POINTS +#include + +/* We choose 4096 entries - same as per-zone page wait tables */ +#define DAX_WAIT_TABLE_BITS 12 +#define 
DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS) + +static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES]; + +void __init dax_mapping_init(void) +{ + int i; + + for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++) + init_waitqueue_head(wait_table + i); +} + +static unsigned long dax_to_pfn(void *entry) +{ + return xa_to_value(entry) >> DAX_SHIFT; +} + +static void *dax_make_entry(pfn_t pfn, unsigned long flags) +{ + return xa_mk_value((flags & DAX_MASK) | + (pfn_t_to_pfn(pfn) << DAX_SHIFT)); +} + +static bool dax_is_locked(void *entry) +{ + return xa_to_value(entry) & DAX_LOCKED; +} + +static bool dax_is_zapped(void *entry) +{ + return xa_to_value(entry) & DAX_ZAP; +} + +static unsigned int dax_entry_order(void *entry) +{ + if (xa_to_value(entry) & DAX_PMD) + return PMD_ORDER; + return 0; +} + +static unsigned long dax_is_pmd_entry(void *entry) +{ + return xa_to_value(entry) & DAX_PMD; +} + +static bool dax_is_pte_entry(void *entry) +{ + return !(xa_to_value(entry) & DAX_PMD); +} + +static int dax_is_zero_entry(void *entry) +{ + return xa_to_value(entry) & DAX_ZERO_PAGE; +} + +static int dax_is_empty_entry(void *entry) +{ + return xa_to_value(entry) & DAX_EMPTY; +} + +/* + * true if the entry that was found is of a smaller order than the entry + * we were looking for + */ +static bool dax_is_conflict(void *entry) +{ + return entry == XA_RETRY_ENTRY; +} + +/* + * DAX page cache entry locking + */ +struct exceptional_entry_key { + struct xarray *xa; + pgoff_t entry_start; +}; + +struct wait_exceptional_entry_queue { + wait_queue_entry_t wait; + struct exceptional_entry_key key; +}; + +/** + * enum dax_wake_mode: waitqueue wakeup behaviour + * @WAKE_ALL: wake all waiters in the waitqueue + * @WAKE_NEXT: wake only the first waiter in the waitqueue + */ +enum dax_wake_mode { + WAKE_ALL, + WAKE_NEXT, +}; + +static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas, void *entry, + struct exceptional_entry_key *key) +{ + unsigned long hash; + unsigned long index = xas->xa_index; + + /* + * If 'entry' is a PMD, align the 'index' that we use for the wait + * queue to the start of that PMD. This ensures that all offsets in + * the range covered by the PMD map to the same bit lock. + */ + if (dax_is_pmd_entry(entry)) + index &= ~PG_PMD_COLOUR; + key->xa = xas->xa; + key->entry_start = index; + + hash = hash_long((unsigned long)xas->xa ^ index, DAX_WAIT_TABLE_BITS); + return wait_table + hash; +} + +static int wake_exceptional_entry_func(wait_queue_entry_t *wait, + unsigned int mode, int sync, void *keyp) +{ + struct exceptional_entry_key *key = keyp; + struct wait_exceptional_entry_queue *ewait = + container_of(wait, struct wait_exceptional_entry_queue, wait); + + if (key->xa != ewait->key.xa || + key->entry_start != ewait->key.entry_start) + return 0; + return autoremove_wake_function(wait, mode, sync, NULL); +} + +/* + * @entry may no longer be the entry at the index in the mapping. + * The important information it's conveying is whether the entry at + * this index used to be a PMD entry. + */ +static void dax_wake_entry(struct xa_state *xas, void *entry, + enum dax_wake_mode mode) +{ + struct exceptional_entry_key key; + wait_queue_head_t *wq; + + wq = dax_entry_waitqueue(xas, entry, &key); + + /* + * Checking for locked entry and prepare_to_wait_exclusive() happens + * under the i_pages lock, ditto for entry handling in our callers. + * So at this point all tasks that could have seen our entry locked + * must be in the waitqueue and the following check will see them. 
+ */ + if (waitqueue_active(wq)) + __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key); +} + +/* + * Look up entry in page cache, wait for it to become unlocked if it + * is a DAX entry and return it. The caller must subsequently call + * put_unlocked_entry() if it did not lock the entry or dax_unlock_entry() + * if it did. The entry returned may have a larger order than @order. + * If @order is larger than the order of the entry found in i_pages, this + * function returns a dax_is_conflict entry. + * + * Must be called with the i_pages lock held. + */ +static void *get_unlocked_entry(struct xa_state *xas, unsigned int order) +{ + void *entry; + struct wait_exceptional_entry_queue ewait; + wait_queue_head_t *wq; + + init_wait(&ewait.wait); + ewait.wait.func = wake_exceptional_entry_func; + + for (;;) { + entry = xas_find_conflict(xas); + if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) + return entry; + if (dax_entry_order(entry) < order) + return XA_RETRY_ENTRY; + if (!dax_is_locked(entry)) + return entry; + + wq = dax_entry_waitqueue(xas, entry, &ewait.key); + prepare_to_wait_exclusive(wq, &ewait.wait, + TASK_UNINTERRUPTIBLE); + xas_unlock_irq(xas); + xas_reset(xas); + schedule(); + finish_wait(wq, &ewait.wait); + xas_lock_irq(xas); + } +} + +/* + * The only thing keeping the address space around is the i_pages lock + * (it's cycled in clear_inode() after removing the entries from i_pages) + * After we call xas_unlock_irq(), we cannot touch xas->xa. + */ +static void wait_entry_unlocked(struct xa_state *xas, void *entry) +{ + struct wait_exceptional_entry_queue ewait; + wait_queue_head_t *wq; + + init_wait(&ewait.wait); + ewait.wait.func = wake_exceptional_entry_func; + + wq = dax_entry_waitqueue(xas, entry, &ewait.key); + /* + * Unlike get_unlocked_entry() there is no guarantee that this + * path ever successfully retrieves an unlocked entry before an + * inode dies. Perform a non-exclusive wait in case this path + * never successfully performs its own wake up. + */ + prepare_to_wait(wq, &ewait.wait, TASK_UNINTERRUPTIBLE); + xas_unlock_irq(xas); + schedule(); + finish_wait(wq, &ewait.wait); +} + +static void put_unlocked_entry(struct xa_state *xas, void *entry, + enum dax_wake_mode mode) +{ + if (entry && !dax_is_conflict(entry)) + dax_wake_entry(xas, entry, mode); +} + +/* + * We used the xa_state to get the entry, but then we locked the entry and + * dropped the xa_lock, so we know the xa_state is stale and must be reset + * before use. + */ +void dax_unlock_entry(struct xa_state *xas, void *entry) +{ + void *old; + + WARN_ON(dax_is_locked(entry)); + xas_reset(xas); + xas_lock_irq(xas); + old = xas_store(xas, entry); + xas_unlock_irq(xas); + WARN_ON(!dax_is_locked(old)); + dax_wake_entry(xas, entry, WAKE_NEXT); +} + +/* + * Return: The entry stored at this location before it was locked. + */ +static void *dax_lock_entry(struct xa_state *xas, void *entry) +{ + unsigned long v = xa_to_value(entry); + + return xas_store(xas, xa_mk_value(v | DAX_LOCKED)); +} + +static unsigned long dax_entry_size(void *entry) +{ + if (dax_is_zero_entry(entry)) + return 0; + else if (dax_is_empty_entry(entry)) + return 0; + else if (dax_is_pmd_entry(entry)) + return PMD_SIZE; + else + return PAGE_SIZE; +} + +static unsigned long dax_end_pfn(void *entry) +{ + return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE; +} + +/* + * Iterate through all mapped pfns represented by an entry, i.e. skip + * 'empty' and 'zero' entries. 
+ */ +#define for_each_mapped_pfn(entry, pfn) \ + for (pfn = dax_to_pfn(entry); pfn < dax_end_pfn(entry); pfn++) + +static bool dax_mapping_is_cow(struct address_space *mapping) +{ + return (unsigned long)mapping == PAGE_MAPPING_DAX_COW; +} + +/* + * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount. + */ +static void dax_mapping_set_cow(struct page *page) +{ + if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) { + /* + * Reset the index if the page was already mapped + * regularly before. + */ + if (page->mapping) + page->index = 1; + page->mapping = (void *)PAGE_MAPPING_DAX_COW; + } + page->index++; +} + +/* + * When it is called in dax_insert_entry(), the cow flag will indicate that + * whether this entry is shared by multiple files. If so, set the page->mapping + * FS_DAX_MAPPING_COW, and use page->index as refcount. + */ +static vm_fault_t dax_associate_entry(void *entry, + struct address_space *mapping, + struct vm_fault *vmf, unsigned long flags) +{ + unsigned long size = dax_entry_size(entry), pfn, index; + struct dev_pagemap *pgmap; + int i = 0; + + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) + return 0; + + if (!size) + return 0; + + if (!(flags & DAX_COW)) { + pfn = dax_to_pfn(entry); + pgmap = get_dev_pagemap_many(pfn, NULL, PHYS_PFN(size)); + if (!pgmap) + return VM_FAULT_SIGBUS; + } + + index = linear_page_index(vmf->vma, ALIGN(vmf->address, size)); + for_each_mapped_pfn(entry, pfn) { + struct page *page = pfn_to_page(pfn); + + if (flags & DAX_COW) { + dax_mapping_set_cow(page); + } else { + WARN_ON_ONCE(page->mapping); + page->mapping = mapping; + page->index = index + i++; + page_ref_inc(page); + } + } + + return 0; +} + +static void dax_disassociate_entry(void *entry, struct address_space *mapping, + bool trunc) +{ + unsigned long size = dax_entry_size(entry), pfn; + struct page *page; + + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) + return; + + if (!size) + return; + + for_each_mapped_pfn(entry, pfn) { + page = pfn_to_page(pfn); + if (dax_mapping_is_cow(page->mapping)) { + /* keep the CoW flag if this page is still shared */ + if (page->index-- > 0) + continue; + } else { + WARN_ON_ONCE(trunc && !dax_is_zapped(entry)); + WARN_ON_ONCE(trunc && !dax_page_idle(page)); + WARN_ON_ONCE(page->mapping && page->mapping != mapping); + } + page->mapping = NULL; + page->index = 0; + } + + if (trunc && !dax_mapping_is_cow(page->mapping)) { + page = pfn_to_page(dax_to_pfn(entry)); + put_dev_pagemap_many(page->pgmap, PHYS_PFN(size)); + } +} + +/* + * dax_lock_page - Lock the DAX entry corresponding to a page + * @page: The page whose entry we want to lock + * + * Context: Process context. + * Return: A cookie to pass to dax_unlock_page() or 0 if the entry could + * not be locked. + */ +dax_entry_t dax_lock_page(struct page *page) +{ + XA_STATE(xas, NULL, 0); + void *entry; + + /* Ensure page->mapping isn't freed while we look at it */ + rcu_read_lock(); + for (;;) { + struct address_space *mapping = READ_ONCE(page->mapping); + + entry = NULL; + if (!mapping || !dax_mapping(mapping)) + break; + + /* + * In the device-dax case there's no need to lock, a + * struct dev_pagemap pin is sufficient to keep the + * inode alive, and we assume we have dev_pagemap pin + * otherwise we would not have a valid pfn_to_page() + * translation. 
+ */ + entry = (void *)~0UL; + if (S_ISCHR(mapping->host->i_mode)) + break; + + xas.xa = &mapping->i_pages; + xas_lock_irq(&xas); + if (mapping != page->mapping) { + xas_unlock_irq(&xas); + continue; + } + xas_set(&xas, page->index); + entry = xas_load(&xas); + if (dax_is_locked(entry)) { + rcu_read_unlock(); + wait_entry_unlocked(&xas, entry); + rcu_read_lock(); + continue; + } + dax_lock_entry(&xas, entry); + xas_unlock_irq(&xas); + break; + } + rcu_read_unlock(); + return (dax_entry_t)entry; +} + +void dax_unlock_page(struct page *page, dax_entry_t cookie) +{ + struct address_space *mapping = page->mapping; + XA_STATE(xas, &mapping->i_pages, page->index); + + if (S_ISCHR(mapping->host->i_mode)) + return; + + dax_unlock_entry(&xas, (void *)cookie); +} + +/* + * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping + * @mapping: the file's mapping whose entry we want to lock + * @index: the offset within this file + * @page: output the dax page corresponding to this dax entry + * + * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry + * could not be locked. + */ +dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index, + struct page **page) +{ + XA_STATE(xas, NULL, 0); + void *entry; + + rcu_read_lock(); + for (;;) { + entry = NULL; + if (!dax_mapping(mapping)) + break; + + xas.xa = &mapping->i_pages; + xas_lock_irq(&xas); + xas_set(&xas, index); + entry = xas_load(&xas); + if (dax_is_locked(entry)) { + rcu_read_unlock(); + wait_entry_unlocked(&xas, entry); + rcu_read_lock(); + continue; + } + if (!entry || dax_is_zero_entry(entry) || + dax_is_empty_entry(entry)) { + /* + * Because we are looking for entry from file's mapping + * and index, so the entry may not be inserted for now, + * or even a zero/empty entry. We don't think this is + * an error case. So, return a special value and do + * not output @page. + */ + entry = (void *)~0UL; + } else { + *page = pfn_to_page(dax_to_pfn(entry)); + dax_lock_entry(&xas, entry); + } + xas_unlock_irq(&xas); + break; + } + rcu_read_unlock(); + return (dax_entry_t)entry; +} + +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index, + dax_entry_t cookie) +{ + XA_STATE(xas, &mapping->i_pages, index); + + if (cookie == ~0UL) + return; + + dax_unlock_entry(&xas, (void *)cookie); +} + +/* + * Find page cache entry at given index. If it is a DAX entry, return it + * with the entry locked. If the page cache doesn't contain an entry at + * that index, add a locked empty entry. + * + * When requesting an entry with size DAX_PMD, dax_grab_mapping_entry() will + * either return that locked entry or will return VM_FAULT_FALLBACK. + * This will happen if there are any PTE entries within the PMD range + * that we are requesting. + * + * We always favor PTE entries over PMD entries. There isn't a flow where we + * evict PTE entries in order to 'upgrade' them to a PMD entry. A PMD + * insertion will fail if it finds any PTE entries already in the tree, and a + * PTE insertion will cause an existing PMD entry to be unmapped and + * downgraded to PTE entries. This happens for both PMD zero pages as + * well as PMD empty entries. + * + * The exception to this downgrade path is for PMD entries that have + * real storage backing them. We will leave these real PMD entries in + * the tree, and PTE writes will simply dirty the entire PMD entry. + * + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For + * persistent memory the benefit is doubtful. 
We can add that later if we can + * show it helps. + * + * On error, this function does not return an ERR_PTR. Instead it returns + * a VM_FAULT code, encoded as an xarray internal entry. The ERR_PTR values + * overlap with xarray value entries. + */ +void *dax_grab_mapping_entry(struct xa_state *xas, + struct address_space *mapping, unsigned int order) +{ + unsigned long index = xas->xa_index; + bool pmd_downgrade; /* splitting PMD entry into PTE entries? */ + void *entry; + +retry: + pmd_downgrade = false; + xas_lock_irq(xas); + entry = get_unlocked_entry(xas, order); + + if (entry) { + if (dax_is_conflict(entry)) + goto fallback; + if (!xa_is_value(entry)) { + xas_set_err(xas, -EIO); + goto out_unlock; + } + + if (order == 0) { + if (dax_is_pmd_entry(entry) && + (dax_is_zero_entry(entry) || + dax_is_empty_entry(entry))) { + pmd_downgrade = true; + } + } + } + + if (pmd_downgrade) { + /* + * Make sure 'entry' remains valid while we drop + * the i_pages lock. + */ + dax_lock_entry(xas, entry); + + /* + * Besides huge zero pages the only other thing that gets + * downgraded are empty entries which don't need to be + * unmapped. + */ + if (dax_is_zero_entry(entry)) { + xas_unlock_irq(xas); + unmap_mapping_pages(mapping, + xas->xa_index & ~PG_PMD_COLOUR, + PG_PMD_NR, false); + xas_reset(xas); + xas_lock_irq(xas); + } + + dax_disassociate_entry(entry, mapping, false); + xas_store(xas, NULL); /* undo the PMD join */ + dax_wake_entry(xas, entry, WAKE_ALL); + mapping->nrpages -= PG_PMD_NR; + entry = NULL; + xas_set(xas, index); + } + + if (entry) { + dax_lock_entry(xas, entry); + } else { + unsigned long flags = DAX_EMPTY; + + if (order > 0) + flags |= DAX_PMD; + entry = dax_make_entry(pfn_to_pfn_t(0), flags); + dax_lock_entry(xas, entry); + if (xas_error(xas)) + goto out_unlock; + mapping->nrpages += 1UL << order; + } + +out_unlock: + xas_unlock_irq(xas); + if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM)) + goto retry; + if (xas->xa_node == XA_ERROR(-ENOMEM)) + return xa_mk_internal(VM_FAULT_OOM); + if (xas_error(xas)) + return xa_mk_internal(VM_FAULT_SIGBUS); + return entry; +fallback: + xas_unlock_irq(xas); + return xa_mk_internal(VM_FAULT_FALLBACK); +} + +static void *dax_zap_entry(struct xa_state *xas, void *entry) +{ + unsigned long v = xa_to_value(entry); + + return xas_store(xas, xa_mk_value(v | DAX_ZAP)); +} + +/* + * Return NULL if the entry is zapped and all pages in the entry are + * idle, otherwise return the non-idle page in the entry + */ +static struct page *dax_zap_pages(struct xa_state *xas, void *entry) +{ + struct page *ret = NULL; + unsigned long pfn; + bool zap; + + if (!dax_entry_size(entry)) + return NULL; + + zap = !dax_is_zapped(entry); + + for_each_mapped_pfn(entry, pfn) { + struct page *page = pfn_to_page(pfn); + + if (zap) + page_ref_dec(page); + + if (!ret && !dax_page_idle(page)) + ret = page; + } + + if (zap) + dax_zap_entry(xas, entry); + + return ret; +} + +/** + * dax_zap_mappings_range - find first pinned page in @mapping + * @mapping: address space to scan for a page with ref count > 1 + * @start: Starting offset. Page containing 'start' is included. + * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX, + * pages from 'start' till the end of file are included. + * + * DAX requires ZONE_DEVICE mapped pages. These pages are never + * 'onlined' to the page allocator so they are considered idle when + * page->count == 1. A filesystem uses this interface to determine if + * any page in the mapping is busy, i.e. 
for DMA, or other + * get_user_pages() usages. + * + * It is expected that the filesystem is holding locks to block the + * establishment of new mappings in this address_space. I.e. it expects + * to be able to run unmap_mapping_range() and subsequently not race + * mapping_mapped() becoming true. + */ +struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start, + loff_t end) +{ + void *entry; + unsigned int scanned = 0; + struct page *page = NULL; + pgoff_t start_idx = start >> PAGE_SHIFT; + pgoff_t end_idx; + XA_STATE(xas, &mapping->i_pages, start_idx); + + /* + * In the 'limited' case get_user_pages() for dax is disabled. + */ + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) + return NULL; + + if (!dax_mapping(mapping)) + return NULL; + + /* If end == LLONG_MAX, all pages from start to till end of file */ + if (end == LLONG_MAX) + end_idx = ULONG_MAX; + else + end_idx = end >> PAGE_SHIFT; + /* + * If we race get_user_pages_fast() here either we'll see the + * elevated page count in the iteration and wait, or + * get_user_pages_fast() will see that the page it took a reference + * against is no longer mapped in the page tables and bail to the + * get_user_pages() slow path. The slow path is protected by + * pte_lock() and pmd_lock(). New references are not taken without + * holding those locks, and unmap_mapping_pages() will not zero the + * pte or pmd without holding the respective lock, so we are + * guaranteed to either see new references or prevent new + * references from being established. + */ + unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0); + + xas_lock_irq(&xas); + xas_for_each(&xas, entry, end_idx) { + if (WARN_ON_ONCE(!xa_is_value(entry))) + continue; + if (unlikely(dax_is_locked(entry))) + entry = get_unlocked_entry(&xas, 0); + if (entry) + page = dax_zap_pages(&xas, entry); + put_unlocked_entry(&xas, entry, WAKE_NEXT); + if (page) + break; + if (++scanned % XA_CHECK_SCHED) + continue; + + xas_pause(&xas); + xas_unlock_irq(&xas); + cond_resched(); + xas_lock_irq(&xas); + } + xas_unlock_irq(&xas); + return page; +} +EXPORT_SYMBOL_GPL(dax_zap_mappings_range); + +struct page *dax_zap_mappings(struct address_space *mapping) +{ + return dax_zap_mappings_range(mapping, 0, LLONG_MAX); +} +EXPORT_SYMBOL_GPL(dax_zap_mappings); + +static int __dax_invalidate_entry(struct address_space *mapping, pgoff_t index, + bool trunc) +{ + XA_STATE(xas, &mapping->i_pages, index); + int ret = 0; + void *entry; + + xas_lock_irq(&xas); + entry = get_unlocked_entry(&xas, 0); + if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) + goto out; + if (!trunc && (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) || + xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE))) + goto out; + dax_disassociate_entry(entry, mapping, trunc); + xas_store(&xas, NULL); + mapping->nrpages -= 1UL << dax_entry_order(entry); + ret = 1; +out: + put_unlocked_entry(&xas, entry, WAKE_ALL); + xas_unlock_irq(&xas); + return ret; +} + +int dax_invalidate_mapping_entry_sync(struct address_space *mapping, + pgoff_t index) +{ + return __dax_invalidate_entry(mapping, index, false); +} + +/* + * Delete DAX entry at @index from @mapping. Wait for it + * to be unlocked before deleting it. + */ +int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index) +{ + int ret = __dax_invalidate_entry(mapping, index, true); + + /* + * This gets called from truncate / punch_hole path. 
As such, the caller + * must hold locks protecting against concurrent modifications of the + * page cache (usually fs-private i_mmap_sem for writing). Since the + * caller has seen a DAX entry for this index, we better find it + * at that index as well... + */ + WARN_ON_ONCE(!ret); + return ret; +} + +/* + * By this point dax_grab_mapping_entry() has ensured that we have a locked entry + * of the appropriate size so we don't have to worry about downgrading PMDs to + * PTEs. If we happen to be trying to insert a PTE and there is a PMD + * already in the tree, we will skip the insertion and just dirty the PMD as + * appropriate. + */ +vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, + void **pentry, pfn_t pfn, unsigned long flags) +{ + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + void *new_entry = dax_make_entry(pfn, flags); + bool dirty = flags & DAX_DIRTY; + bool cow = flags & DAX_COW; + void *entry = *pentry; + + if (dirty) + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); + + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) { + unsigned long index = xas->xa_index; + /* we are replacing a zero page with block mapping */ + if (dax_is_pmd_entry(entry)) + unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR, + PG_PMD_NR, false); + else /* pte entry */ + unmap_mapping_pages(mapping, index, 1, false); + } + + xas_reset(xas); + xas_lock_irq(xas); + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { + void *old; + + dax_disassociate_entry(entry, mapping, false); + dax_associate_entry(new_entry, mapping, vmf, flags); + /* + * Only swap our new entry into the page cache if the current + * entry is a zero page or an empty entry. If a normal PTE or + * PMD entry is already in the cache, we leave it alone. This + * means that if we are trying to insert a PTE and the + * existing entry is a PMD, we will just leave the PMD in the + * tree and dirty it if necessary. + */ + old = dax_lock_entry(xas, new_entry); + WARN_ON_ONCE(old != + xa_mk_value(xa_to_value(entry) | DAX_LOCKED)); + entry = new_entry; + } else { + xas_load(xas); /* Walk the xa_state */ + } + + if (dirty) + xas_set_mark(xas, PAGECACHE_TAG_DIRTY); + + if (cow) + xas_set_mark(xas, PAGECACHE_TAG_TOWRITE); + + xas_unlock_irq(xas); + *pentry = entry; + return 0; +} + +int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev, + struct address_space *mapping, void *entry) +{ + unsigned long pfn, index, count, end; + long ret = 0; + struct vm_area_struct *vma; + + /* + * A page got tagged dirty in DAX mapping? Something is seriously + * wrong. + */ + if (WARN_ON(!xa_is_value(entry))) + return -EIO; + + if (unlikely(dax_is_locked(entry))) { + void *old_entry = entry; + + entry = get_unlocked_entry(xas, 0); + + /* Entry got punched out / reallocated? */ + if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) + goto put_unlocked; + /* + * Entry got reallocated elsewhere? No need to writeback. + * We have to compare pfns as we must not bail out due to + * difference in lockbit or entry type. 
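 *
 * (dax_to_pfn() shifts the low DAX_* flag bits out of the XArray value,
 *  so two entries for the same backing pfn compare equal here regardless
 *  of their lock bit or PTE/PMD type bits.)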
+ */ + if (dax_to_pfn(old_entry) != dax_to_pfn(entry)) + goto put_unlocked; + if (WARN_ON_ONCE(dax_is_empty_entry(entry) || + dax_is_zero_entry(entry))) { + ret = -EIO; + goto put_unlocked; + } + + /* Another fsync thread may have already done this entry */ + if (!xas_get_mark(xas, PAGECACHE_TAG_TOWRITE)) + goto put_unlocked; + } + + /* Lock the entry to serialize with page faults */ + dax_lock_entry(xas, entry); + + /* + * We can clear the tag now but we have to be careful so that concurrent + * dax_writeback_one() calls for the same index cannot finish before we + * actually flush the caches. This is achieved as the calls will look + * at the entry only under the i_pages lock and once they do that + * they will see the entry locked and wait for it to unlock. + */ + xas_clear_mark(xas, PAGECACHE_TAG_TOWRITE); + xas_unlock_irq(xas); + + /* + * If dax_writeback_mapping_range() was given a wbc->range_start + * in the middle of a PMD, the 'index' we use needs to be + * aligned to the start of the PMD. + * This allows us to flush for PMD_SIZE and not have to worry about + * partial PMD writebacks. + */ + pfn = dax_to_pfn(entry); + count = 1UL << dax_entry_order(entry); + index = xas->xa_index & ~(count - 1); + end = index + count - 1; + + /* Walk all mappings of a given index of a file and writeprotect them */ + i_mmap_lock_read(mapping); + vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) { + pfn_mkclean_range(pfn, count, index, vma); + cond_resched(); + } + i_mmap_unlock_read(mapping); + + dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE); + /* + * After we have flushed the cache, we can clear the dirty tag. There + * cannot be new dirty data in the pfn after the flush has completed as + * the pfn mappings are writeprotected and fault waits for mapping + * entry lock. + */ + xas_reset(xas); + xas_lock_irq(xas); + xas_store(xas, entry); + xas_clear_mark(xas, PAGECACHE_TAG_DIRTY); + dax_wake_entry(xas, entry, WAKE_NEXT); + + trace_dax_writeback_one(mapping->host, index, count); + return ret; + + put_unlocked: + put_unlocked_entry(xas, entry, WAKE_NEXT); + return ret; +} + +/* + * dax_insert_pfn_mkwrite - insert PTE or PMD entry into page tables + * @vmf: The description of the fault + * @pfn: PFN to insert + * @order: Order of entry to insert. + * + * This function inserts a writeable PTE or PMD entry into the page tables + * for an mmaped DAX file. It also marks the page cache entry as dirty. + */ +vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, + unsigned int order) +{ + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order); + void *entry; + vm_fault_t ret; + + xas_lock_irq(&xas); + entry = get_unlocked_entry(&xas, order); + /* Did we race with someone splitting entry or so? 
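 *
 * (Context, assumed from the callers in this series: dax_finish_sync_fault()
 *  invokes this helper after the filesystem has committed metadata for a
 *  MAP_SYNC write fault, so the entry may have been split, invalidated or
 *  replaced while the entry lock was not held.)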
*/ + if (!entry || dax_is_conflict(entry) || + (order == 0 && !dax_is_pte_entry(entry))) { + put_unlocked_entry(&xas, entry, WAKE_NEXT); + xas_unlock_irq(&xas); + trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf, + VM_FAULT_NOPAGE); + return VM_FAULT_NOPAGE; + } + xas_set_mark(&xas, PAGECACHE_TAG_DIRTY); + dax_lock_entry(&xas, entry); + xas_unlock_irq(&xas); + if (order == 0) + ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn); +#ifdef CONFIG_FS_DAX_PMD + else if (order == PMD_ORDER) + ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE); +#endif + else + ret = VM_FAULT_FALLBACK; + dax_unlock_entry(&xas, entry); + trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret); + return ret; +} diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 4909ad945a49..0976857ec7f2 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -564,6 +564,8 @@ static int __init dax_core_init(void) if (rc) return rc; + dax_mapping_init(); + rc = alloc_chrdev_region(&dax_devt, 0, MINORMASK+1, "dax"); if (rc) goto err_chrdev; @@ -590,5 +592,5 @@ static void __exit dax_core_exit(void) MODULE_AUTHOR("Intel Corporation"); MODULE_LICENSE("GPL v2"); -subsys_initcall(dax_core_init); +fs_initcall(dax_core_init); module_exit(dax_core_exit); diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig index 5a29046e3319..3bb17448d1c8 100644 --- a/drivers/nvdimm/Kconfig +++ b/drivers/nvdimm/Kconfig @@ -78,6 +78,7 @@ config NVDIMM_DAX bool "NVDIMM DAX: Raw access to persistent memory" default LIBNVDIMM depends on NVDIMM_PFN + depends on DAX help Support raw device dax access to a persistent memory namespace. For environments that want to hard partition diff --git a/fs/dax.c b/fs/dax.c index ee2568c8b135..79e49e718d33 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -27,847 +27,8 @@ #include #include -#define CREATE_TRACE_POINTS #include -static inline unsigned int pe_order(enum page_entry_size pe_size) -{ - if (pe_size == PE_SIZE_PTE) - return PAGE_SHIFT - PAGE_SHIFT; - if (pe_size == PE_SIZE_PMD) - return PMD_SHIFT - PAGE_SHIFT; - if (pe_size == PE_SIZE_PUD) - return PUD_SHIFT - PAGE_SHIFT; - return ~0; -} - -/* We choose 4096 entries - same as per-zone page wait tables */ -#define DAX_WAIT_TABLE_BITS 12 -#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS) - -/* The 'colour' (ie low bits) within a PMD of a page offset. */ -#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1) -#define PG_PMD_NR (PMD_SIZE >> PAGE_SHIFT) - -/* The order of a PMD entry */ -#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT) - -static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES]; - -static int __init init_dax_wait_table(void) -{ - int i; - - for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++) - init_waitqueue_head(wait_table + i); - return 0; -} -fs_initcall(init_dax_wait_table); - -/* - * DAX pagecache entries use XArray value entries so they can't be mistaken - * for pages. We use one bit for locking, one bit for the entry size (PMD) - * and two more to tell us if the entry is a zero page or an empty entry that - * is just used for locking. In total four special bits. - * - * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE - * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem - * block allocation. 
- */ -#define DAX_SHIFT (5) -#define DAX_MASK ((1UL << DAX_SHIFT) - 1) -#define DAX_LOCKED (1UL << 0) -#define DAX_PMD (1UL << 1) -#define DAX_ZERO_PAGE (1UL << 2) -#define DAX_EMPTY (1UL << 3) -#define DAX_ZAP (1UL << 4) - -/* - * These flags are not conveyed in Xarray value entries, they are just - * modifiers to dax_insert_entry(). - */ -#define DAX_DIRTY (1UL << (DAX_SHIFT + 0)) -#define DAX_COW (1UL << (DAX_SHIFT + 1)) - -static unsigned long dax_to_pfn(void *entry) -{ - return xa_to_value(entry) >> DAX_SHIFT; -} - -static void *dax_make_entry(pfn_t pfn, unsigned long flags) -{ - return xa_mk_value((flags & DAX_MASK) | - (pfn_t_to_pfn(pfn) << DAX_SHIFT)); -} - -static bool dax_is_locked(void *entry) -{ - return xa_to_value(entry) & DAX_LOCKED; -} - -static bool dax_is_zapped(void *entry) -{ - return xa_to_value(entry) & DAX_ZAP; -} - -static unsigned int dax_entry_order(void *entry) -{ - if (xa_to_value(entry) & DAX_PMD) - return PMD_ORDER; - return 0; -} - -static unsigned long dax_is_pmd_entry(void *entry) -{ - return xa_to_value(entry) & DAX_PMD; -} - -static bool dax_is_pte_entry(void *entry) -{ - return !(xa_to_value(entry) & DAX_PMD); -} - -static int dax_is_zero_entry(void *entry) -{ - return xa_to_value(entry) & DAX_ZERO_PAGE; -} - -static int dax_is_empty_entry(void *entry) -{ - return xa_to_value(entry) & DAX_EMPTY; -} - -/* - * true if the entry that was found is of a smaller order than the entry - * we were looking for - */ -static bool dax_is_conflict(void *entry) -{ - return entry == XA_RETRY_ENTRY; -} - -/* - * DAX page cache entry locking - */ -struct exceptional_entry_key { - struct xarray *xa; - pgoff_t entry_start; -}; - -struct wait_exceptional_entry_queue { - wait_queue_entry_t wait; - struct exceptional_entry_key key; -}; - -/** - * enum dax_wake_mode: waitqueue wakeup behaviour - * @WAKE_ALL: wake all waiters in the waitqueue - * @WAKE_NEXT: wake only the first waiter in the waitqueue - */ -enum dax_wake_mode { - WAKE_ALL, - WAKE_NEXT, -}; - -static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas, - void *entry, struct exceptional_entry_key *key) -{ - unsigned long hash; - unsigned long index = xas->xa_index; - - /* - * If 'entry' is a PMD, align the 'index' that we use for the wait - * queue to the start of that PMD. This ensures that all offsets in - * the range covered by the PMD map to the same bit lock. - */ - if (dax_is_pmd_entry(entry)) - index &= ~PG_PMD_COLOUR; - key->xa = xas->xa; - key->entry_start = index; - - hash = hash_long((unsigned long)xas->xa ^ index, DAX_WAIT_TABLE_BITS); - return wait_table + hash; -} - -static int wake_exceptional_entry_func(wait_queue_entry_t *wait, - unsigned int mode, int sync, void *keyp) -{ - struct exceptional_entry_key *key = keyp; - struct wait_exceptional_entry_queue *ewait = - container_of(wait, struct wait_exceptional_entry_queue, wait); - - if (key->xa != ewait->key.xa || - key->entry_start != ewait->key.entry_start) - return 0; - return autoremove_wake_function(wait, mode, sync, NULL); -} - -/* - * @entry may no longer be the entry at the index in the mapping. - * The important information it's conveying is whether the entry at - * this index used to be a PMD entry. 
- */ -static void dax_wake_entry(struct xa_state *xas, void *entry, - enum dax_wake_mode mode) -{ - struct exceptional_entry_key key; - wait_queue_head_t *wq; - - wq = dax_entry_waitqueue(xas, entry, &key); - - /* - * Checking for locked entry and prepare_to_wait_exclusive() happens - * under the i_pages lock, ditto for entry handling in our callers. - * So at this point all tasks that could have seen our entry locked - * must be in the waitqueue and the following check will see them. - */ - if (waitqueue_active(wq)) - __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key); -} - -/* - * Look up entry in page cache, wait for it to become unlocked if it - * is a DAX entry and return it. The caller must subsequently call - * put_unlocked_entry() if it did not lock the entry or dax_unlock_entry() - * if it did. The entry returned may have a larger order than @order. - * If @order is larger than the order of the entry found in i_pages, this - * function returns a dax_is_conflict entry. - * - * Must be called with the i_pages lock held. - */ -static void *get_unlocked_entry(struct xa_state *xas, unsigned int order) -{ - void *entry; - struct wait_exceptional_entry_queue ewait; - wait_queue_head_t *wq; - - init_wait(&ewait.wait); - ewait.wait.func = wake_exceptional_entry_func; - - for (;;) { - entry = xas_find_conflict(xas); - if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) - return entry; - if (dax_entry_order(entry) < order) - return XA_RETRY_ENTRY; - if (!dax_is_locked(entry)) - return entry; - - wq = dax_entry_waitqueue(xas, entry, &ewait.key); - prepare_to_wait_exclusive(wq, &ewait.wait, - TASK_UNINTERRUPTIBLE); - xas_unlock_irq(xas); - xas_reset(xas); - schedule(); - finish_wait(wq, &ewait.wait); - xas_lock_irq(xas); - } -} - -/* - * The only thing keeping the address space around is the i_pages lock - * (it's cycled in clear_inode() after removing the entries from i_pages) - * After we call xas_unlock_irq(), we cannot touch xas->xa. - */ -static void wait_entry_unlocked(struct xa_state *xas, void *entry) -{ - struct wait_exceptional_entry_queue ewait; - wait_queue_head_t *wq; - - init_wait(&ewait.wait); - ewait.wait.func = wake_exceptional_entry_func; - - wq = dax_entry_waitqueue(xas, entry, &ewait.key); - /* - * Unlike get_unlocked_entry() there is no guarantee that this - * path ever successfully retrieves an unlocked entry before an - * inode dies. Perform a non-exclusive wait in case this path - * never successfully performs its own wake up. - */ - prepare_to_wait(wq, &ewait.wait, TASK_UNINTERRUPTIBLE); - xas_unlock_irq(xas); - schedule(); - finish_wait(wq, &ewait.wait); -} - -static void put_unlocked_entry(struct xa_state *xas, void *entry, - enum dax_wake_mode mode) -{ - if (entry && !dax_is_conflict(entry)) - dax_wake_entry(xas, entry, mode); -} - -/* - * We used the xa_state to get the entry, but then we locked the entry and - * dropped the xa_lock, so we know the xa_state is stale and must be reset - * before use. - */ -static void dax_unlock_entry(struct xa_state *xas, void *entry) -{ - void *old; - - BUG_ON(dax_is_locked(entry)); - xas_reset(xas); - xas_lock_irq(xas); - old = xas_store(xas, entry); - xas_unlock_irq(xas); - BUG_ON(!dax_is_locked(old)); - dax_wake_entry(xas, entry, WAKE_NEXT); -} - -/* - * Return: The entry stored at this location before it was locked. 
- */ -static void *dax_lock_entry(struct xa_state *xas, void *entry) -{ - unsigned long v = xa_to_value(entry); - return xas_store(xas, xa_mk_value(v | DAX_LOCKED)); -} - -static unsigned long dax_entry_size(void *entry) -{ - if (dax_is_zero_entry(entry)) - return 0; - else if (dax_is_empty_entry(entry)) - return 0; - else if (dax_is_pmd_entry(entry)) - return PMD_SIZE; - else - return PAGE_SIZE; -} - -static unsigned long dax_end_pfn(void *entry) -{ - return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE; -} - -/* - * Iterate through all mapped pfns represented by an entry, i.e. skip - * 'empty' and 'zero' entries. - */ -#define for_each_mapped_pfn(entry, pfn) \ - for (pfn = dax_to_pfn(entry); \ - pfn < dax_end_pfn(entry); pfn++) - -static inline bool dax_mapping_is_cow(struct address_space *mapping) -{ - return (unsigned long)mapping == PAGE_MAPPING_DAX_COW; -} - -/* - * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount. - */ -static inline void dax_mapping_set_cow(struct page *page) -{ - if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) { - /* - * Reset the index if the page was already mapped - * regularly before. - */ - if (page->mapping) - page->index = 1; - page->mapping = (void *)PAGE_MAPPING_DAX_COW; - } - page->index++; -} - -/* - * When it is called in dax_insert_entry(), the cow flag will indicate that - * whether this entry is shared by multiple files. If so, set the page->mapping - * FS_DAX_MAPPING_COW, and use page->index as refcount. - */ -static vm_fault_t dax_associate_entry(void *entry, - struct address_space *mapping, - struct vm_fault *vmf, unsigned long flags) -{ - unsigned long size = dax_entry_size(entry), pfn, index; - struct dev_pagemap *pgmap; - int i = 0; - - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) - return 0; - - if (!size) - return 0; - - if (!(flags & DAX_COW)) { - pfn = dax_to_pfn(entry); - pgmap = get_dev_pagemap_many(pfn, NULL, PHYS_PFN(size)); - if (!pgmap) - return VM_FAULT_SIGBUS; - } - - index = linear_page_index(vmf->vma, ALIGN(vmf->address, size)); - for_each_mapped_pfn(entry, pfn) { - struct page *page = pfn_to_page(pfn); - - if (flags & DAX_COW) { - dax_mapping_set_cow(page); - } else { - WARN_ON_ONCE(page->mapping); - page->mapping = mapping; - page->index = index + i++; - page_ref_inc(page); - } - } - - return 0; -} - -static void dax_disassociate_entry(void *entry, struct address_space *mapping, - bool trunc) -{ - unsigned long size = dax_entry_size(entry), pfn; - struct page *page; - - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) - return; - - if (!size) - return; - - for_each_mapped_pfn(entry, pfn) { - page = pfn_to_page(pfn); - if (dax_mapping_is_cow(page->mapping)) { - /* keep the CoW flag if this page is still shared */ - if (page->index-- > 0) - continue; - } else { - WARN_ON_ONCE(trunc && !dax_is_zapped(entry)); - WARN_ON_ONCE(trunc && !dax_page_idle(page)); - WARN_ON_ONCE(page->mapping && page->mapping != mapping); - } - page->mapping = NULL; - page->index = 0; - } - - if (trunc && !dax_mapping_is_cow(page->mapping)) { - page = pfn_to_page(dax_to_pfn(entry)); - put_dev_pagemap_many(page->pgmap, PHYS_PFN(size)); - } -} - -/* - * dax_lock_page - Lock the DAX entry corresponding to a page - * @page: The page whose entry we want to lock - * - * Context: Process context. - * Return: A cookie to pass to dax_unlock_page() or 0 if the entry could - * not be locked. 
- */ -dax_entry_t dax_lock_page(struct page *page) -{ - XA_STATE(xas, NULL, 0); - void *entry; - - /* Ensure page->mapping isn't freed while we look at it */ - rcu_read_lock(); - for (;;) { - struct address_space *mapping = READ_ONCE(page->mapping); - - entry = NULL; - if (!mapping || !dax_mapping(mapping)) - break; - - /* - * In the device-dax case there's no need to lock, a - * struct dev_pagemap pin is sufficient to keep the - * inode alive, and we assume we have dev_pagemap pin - * otherwise we would not have a valid pfn_to_page() - * translation. - */ - entry = (void *)~0UL; - if (S_ISCHR(mapping->host->i_mode)) - break; - - xas.xa = &mapping->i_pages; - xas_lock_irq(&xas); - if (mapping != page->mapping) { - xas_unlock_irq(&xas); - continue; - } - xas_set(&xas, page->index); - entry = xas_load(&xas); - if (dax_is_locked(entry)) { - rcu_read_unlock(); - wait_entry_unlocked(&xas, entry); - rcu_read_lock(); - continue; - } - dax_lock_entry(&xas, entry); - xas_unlock_irq(&xas); - break; - } - rcu_read_unlock(); - return (dax_entry_t)entry; -} - -void dax_unlock_page(struct page *page, dax_entry_t cookie) -{ - struct address_space *mapping = page->mapping; - XA_STATE(xas, &mapping->i_pages, page->index); - - if (S_ISCHR(mapping->host->i_mode)) - return; - - dax_unlock_entry(&xas, (void *)cookie); -} - -/* - * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping - * @mapping: the file's mapping whose entry we want to lock - * @index: the offset within this file - * @page: output the dax page corresponding to this dax entry - * - * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry - * could not be locked. - */ -dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index, - struct page **page) -{ - XA_STATE(xas, NULL, 0); - void *entry; - - rcu_read_lock(); - for (;;) { - entry = NULL; - if (!dax_mapping(mapping)) - break; - - xas.xa = &mapping->i_pages; - xas_lock_irq(&xas); - xas_set(&xas, index); - entry = xas_load(&xas); - if (dax_is_locked(entry)) { - rcu_read_unlock(); - wait_entry_unlocked(&xas, entry); - rcu_read_lock(); - continue; - } - if (!entry || - dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { - /* - * Because we are looking for entry from file's mapping - * and index, so the entry may not be inserted for now, - * or even a zero/empty entry. We don't think this is - * an error case. So, return a special value and do - * not output @page. - */ - entry = (void *)~0UL; - } else { - *page = pfn_to_page(dax_to_pfn(entry)); - dax_lock_entry(&xas, entry); - } - xas_unlock_irq(&xas); - break; - } - rcu_read_unlock(); - return (dax_entry_t)entry; -} - -void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index, - dax_entry_t cookie) -{ - XA_STATE(xas, &mapping->i_pages, index); - - if (cookie == ~0UL) - return; - - dax_unlock_entry(&xas, (void *)cookie); -} - -/* - * Find page cache entry at given index. If it is a DAX entry, return it - * with the entry locked. If the page cache doesn't contain an entry at - * that index, add a locked empty entry. - * - * When requesting an entry with size DAX_PMD, grab_mapping_entry() will - * either return that locked entry or will return VM_FAULT_FALLBACK. - * This will happen if there are any PTE entries within the PMD range - * that we are requesting. - * - * We always favor PTE entries over PMD entries. There isn't a flow where we - * evict PTE entries in order to 'upgrade' them to a PMD entry. 
A PMD - * insertion will fail if it finds any PTE entries already in the tree, and a - * PTE insertion will cause an existing PMD entry to be unmapped and - * downgraded to PTE entries. This happens for both PMD zero pages as - * well as PMD empty entries. - * - * The exception to this downgrade path is for PMD entries that have - * real storage backing them. We will leave these real PMD entries in - * the tree, and PTE writes will simply dirty the entire PMD entry. - * - * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For - * persistent memory the benefit is doubtful. We can add that later if we can - * show it helps. - * - * On error, this function does not return an ERR_PTR. Instead it returns - * a VM_FAULT code, encoded as an xarray internal entry. The ERR_PTR values - * overlap with xarray value entries. - */ -static void *grab_mapping_entry(struct xa_state *xas, - struct address_space *mapping, unsigned int order) -{ - unsigned long index = xas->xa_index; - bool pmd_downgrade; /* splitting PMD entry into PTE entries? */ - void *entry; - -retry: - pmd_downgrade = false; - xas_lock_irq(xas); - entry = get_unlocked_entry(xas, order); - - if (entry) { - if (dax_is_conflict(entry)) - goto fallback; - if (!xa_is_value(entry)) { - xas_set_err(xas, -EIO); - goto out_unlock; - } - - if (order == 0) { - if (dax_is_pmd_entry(entry) && - (dax_is_zero_entry(entry) || - dax_is_empty_entry(entry))) { - pmd_downgrade = true; - } - } - } - - if (pmd_downgrade) { - /* - * Make sure 'entry' remains valid while we drop - * the i_pages lock. - */ - dax_lock_entry(xas, entry); - - /* - * Besides huge zero pages the only other thing that gets - * downgraded are empty entries which don't need to be - * unmapped. - */ - if (dax_is_zero_entry(entry)) { - xas_unlock_irq(xas); - unmap_mapping_pages(mapping, - xas->xa_index & ~PG_PMD_COLOUR, - PG_PMD_NR, false); - xas_reset(xas); - xas_lock_irq(xas); - } - - dax_disassociate_entry(entry, mapping, false); - xas_store(xas, NULL); /* undo the PMD join */ - dax_wake_entry(xas, entry, WAKE_ALL); - mapping->nrpages -= PG_PMD_NR; - entry = NULL; - xas_set(xas, index); - } - - if (entry) { - dax_lock_entry(xas, entry); - } else { - unsigned long flags = DAX_EMPTY; - - if (order > 0) - flags |= DAX_PMD; - entry = dax_make_entry(pfn_to_pfn_t(0), flags); - dax_lock_entry(xas, entry); - if (xas_error(xas)) - goto out_unlock; - mapping->nrpages += 1UL << order; - } - -out_unlock: - xas_unlock_irq(xas); - if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM)) - goto retry; - if (xas->xa_node == XA_ERROR(-ENOMEM)) - return xa_mk_internal(VM_FAULT_OOM); - if (xas_error(xas)) - return xa_mk_internal(VM_FAULT_SIGBUS); - return entry; -fallback: - xas_unlock_irq(xas); - return xa_mk_internal(VM_FAULT_FALLBACK); -} - -static void *dax_zap_entry(struct xa_state *xas, void *entry) -{ - unsigned long v = xa_to_value(entry); - - return xas_store(xas, xa_mk_value(v | DAX_ZAP)); -} - -/** - * Return NULL if the entry is zapped and all pages in the entry are - * idle, otherwise return the non-idle page in the entry - */ -static struct page *dax_zap_pages(struct xa_state *xas, void *entry) -{ - struct page *ret = NULL; - unsigned long pfn; - bool zap; - - if (!dax_entry_size(entry)) - return NULL; - - zap = !dax_is_zapped(entry); - - for_each_mapped_pfn(entry, pfn) { - struct page *page = pfn_to_page(pfn); - - if (zap) - page_ref_dec(page); - - if (!ret && !dax_page_idle(page)) - ret = page; - } - - if (zap) - dax_zap_entry(xas, entry); - - return ret; -} - 
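/*
 * For context, a filesystem consumes the zap interface above (whether at
 * its old home here in fs/dax.c or its new home in drivers/dax/mapping.c)
 * following the existing XFS break-layouts pattern. A minimal sketch, not
 * part of this patch; the function name is illustrative and
 * xfs_wait_dax_page() is assumed from fs/xfs/xfs_file.c:
 */
static int example_break_dax_layouts(struct inode *inode, bool *retry)
{
	struct page *page;

	/* unmap the file and drop the mapping's references on its pages */
	page = dax_zap_mappings(inode->i_mapping);
	if (!page)
		return 0;	/* no page is pinned, safe to truncate */

	/* otherwise sleep until the busy page goes idle, then retry */
	*retry = true;
	return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE,
				 0, 0, xfs_wait_dax_page(inode));
}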
-/** - * dax_zap_mappings_range - find first pinned page in @mapping - * @mapping: address space to scan for a page with ref count > 1 - * @start: Starting offset. Page containing 'start' is included. - * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX, - * pages from 'start' till the end of file are included. - * - * DAX requires ZONE_DEVICE mapped pages. These pages are never - * 'onlined' to the page allocator so they are considered idle when - * page->count == 1. A filesystem uses this interface to determine if - * any page in the mapping is busy, i.e. for DMA, or other - * get_user_pages() usages. - * - * It is expected that the filesystem is holding locks to block the - * establishment of new mappings in this address_space. I.e. it expects - * to be able to run unmap_mapping_range() and subsequently not race - * mapping_mapped() becoming true. - */ -struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start, - loff_t end) -{ - void *entry; - unsigned int scanned = 0; - struct page *page = NULL; - pgoff_t start_idx = start >> PAGE_SHIFT; - pgoff_t end_idx; - XA_STATE(xas, &mapping->i_pages, start_idx); - - /* - * In the 'limited' case get_user_pages() for dax is disabled. - */ - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) - return NULL; - - if (!dax_mapping(mapping)) - return NULL; - - /* If end == LLONG_MAX, all pages from start to till end of file */ - if (end == LLONG_MAX) - end_idx = ULONG_MAX; - else - end_idx = end >> PAGE_SHIFT; - /* - * If we race get_user_pages_fast() here either we'll see the - * elevated page count in the iteration and wait, or - * get_user_pages_fast() will see that the page it took a reference - * against is no longer mapped in the page tables and bail to the - * get_user_pages() slow path. The slow path is protected by - * pte_lock() and pmd_lock(). New references are not taken without - * holding those locks, and unmap_mapping_pages() will not zero the - * pte or pmd without holding the respective lock, so we are - * guaranteed to either see new references or prevent new - * references from being established. 
- */ - unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0); - - xas_lock_irq(&xas); - xas_for_each(&xas, entry, end_idx) { - if (WARN_ON_ONCE(!xa_is_value(entry))) - continue; - if (unlikely(dax_is_locked(entry))) - entry = get_unlocked_entry(&xas, 0); - if (entry) - page = dax_zap_pages(&xas, entry); - put_unlocked_entry(&xas, entry, WAKE_NEXT); - if (page) - break; - if (++scanned % XA_CHECK_SCHED) - continue; - - xas_pause(&xas); - xas_unlock_irq(&xas); - cond_resched(); - xas_lock_irq(&xas); - } - xas_unlock_irq(&xas); - return page; -} -EXPORT_SYMBOL_GPL(dax_zap_mappings_range); - -struct page *dax_zap_mappings(struct address_space *mapping) -{ - return dax_zap_mappings_range(mapping, 0, LLONG_MAX); -} -EXPORT_SYMBOL_GPL(dax_zap_mappings); - -static int __dax_invalidate_entry(struct address_space *mapping, - pgoff_t index, bool trunc) -{ - XA_STATE(xas, &mapping->i_pages, index); - int ret = 0; - void *entry; - - xas_lock_irq(&xas); - entry = get_unlocked_entry(&xas, 0); - if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) - goto out; - if (!trunc && - (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) || - xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE))) - goto out; - dax_disassociate_entry(entry, mapping, trunc); - xas_store(&xas, NULL); - mapping->nrpages -= 1UL << dax_entry_order(entry); - ret = 1; -out: - put_unlocked_entry(&xas, entry, WAKE_ALL); - xas_unlock_irq(&xas); - return ret; -} - -/* - * Delete DAX entry at @index from @mapping. Wait for it - * to be unlocked before deleting it. - */ -int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index) -{ - int ret = __dax_invalidate_entry(mapping, index, true); - - /* - * This gets called from truncate / punch_hole path. As such, the caller - * must hold locks protecting against concurrent modifications of the - * page cache (usually fs-private i_mmap_sem for writing). Since the - * caller has seen a DAX entry for this index, we better find it - * at that index as well... - */ - WARN_ON_ONCE(!ret); - return ret; -} - -/* - * Invalidate DAX entry if it is clean. - */ -int dax_invalidate_mapping_entry_sync(struct address_space *mapping, - pgoff_t index) -{ - return __dax_invalidate_entry(mapping, index, false); -} - static pgoff_t dax_iomap_pgoff(const struct iomap *iomap, loff_t pos) { return PHYS_PFN(iomap->addr + (pos & PAGE_MASK) - iomap->offset); @@ -894,195 +55,6 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter return 0; } -/* - * MAP_SYNC on a dax mapping guarantees dirty metadata is - * flushed on write-faults (non-cow), but not read-faults. - */ -static bool dax_fault_is_synchronous(const struct iomap_iter *iter, - struct vm_area_struct *vma) -{ - return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) && - (iter->iomap.flags & IOMAP_F_DIRTY); -} - -static bool dax_fault_is_cow(const struct iomap_iter *iter) -{ - return (iter->flags & IOMAP_WRITE) && - (iter->iomap.flags & IOMAP_F_SHARED); -} - -static unsigned long dax_iter_flags(const struct iomap_iter *iter, - struct vm_fault *vmf) -{ - unsigned long flags = 0; - - if (!dax_fault_is_synchronous(iter, vmf->vma)) - flags |= DAX_DIRTY; - - if (dax_fault_is_cow(iter)) - flags |= DAX_COW; - - return flags; -} - -/* - * By this point grab_mapping_entry() has ensured that we have a locked entry - * of the appropriate size so we don't have to worry about downgrading PMDs to - * PTEs. 
If we happen to be trying to insert a PTE and there is a PMD - * already in the tree, we will skip the insertion and just dirty the PMD as - * appropriate. - */ -static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, - void **pentry, pfn_t pfn, - unsigned long flags) -{ - struct address_space *mapping = vmf->vma->vm_file->f_mapping; - void *new_entry = dax_make_entry(pfn, flags); - bool dirty = flags & DAX_DIRTY; - bool cow = flags & DAX_COW; - void *entry = *pentry; - - if (dirty) - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); - - if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) { - unsigned long index = xas->xa_index; - /* we are replacing a zero page with block mapping */ - if (dax_is_pmd_entry(entry)) - unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR, - PG_PMD_NR, false); - else /* pte entry */ - unmap_mapping_pages(mapping, index, 1, false); - } - - xas_reset(xas); - xas_lock_irq(xas); - if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { - void *old; - - dax_disassociate_entry(entry, mapping, false); - dax_associate_entry(new_entry, mapping, vmf, flags); - /* - * Only swap our new entry into the page cache if the current - * entry is a zero page or an empty entry. If a normal PTE or - * PMD entry is already in the cache, we leave it alone. This - * means that if we are trying to insert a PTE and the - * existing entry is a PMD, we will just leave the PMD in the - * tree and dirty it if necessary. - */ - old = dax_lock_entry(xas, new_entry); - WARN_ON_ONCE(old != xa_mk_value(xa_to_value(entry) | - DAX_LOCKED)); - entry = new_entry; - } else { - xas_load(xas); /* Walk the xa_state */ - } - - if (dirty) - xas_set_mark(xas, PAGECACHE_TAG_DIRTY); - - if (cow) - xas_set_mark(xas, PAGECACHE_TAG_TOWRITE); - - xas_unlock_irq(xas); - *pentry = entry; - return 0; -} - -static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev, - struct address_space *mapping, void *entry) -{ - unsigned long pfn, index, count, end; - long ret = 0; - struct vm_area_struct *vma; - - /* - * A page got tagged dirty in DAX mapping? Something is seriously - * wrong. - */ - if (WARN_ON(!xa_is_value(entry))) - return -EIO; - - if (unlikely(dax_is_locked(entry))) { - void *old_entry = entry; - - entry = get_unlocked_entry(xas, 0); - - /* Entry got punched out / reallocated? */ - if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) - goto put_unlocked; - /* - * Entry got reallocated elsewhere? No need to writeback. - * We have to compare pfns as we must not bail out due to - * difference in lockbit or entry type. - */ - if (dax_to_pfn(old_entry) != dax_to_pfn(entry)) - goto put_unlocked; - if (WARN_ON_ONCE(dax_is_empty_entry(entry) || - dax_is_zero_entry(entry))) { - ret = -EIO; - goto put_unlocked; - } - - /* Another fsync thread may have already done this entry */ - if (!xas_get_mark(xas, PAGECACHE_TAG_TOWRITE)) - goto put_unlocked; - } - - /* Lock the entry to serialize with page faults */ - dax_lock_entry(xas, entry); - - /* - * We can clear the tag now but we have to be careful so that concurrent - * dax_writeback_one() calls for the same index cannot finish before we - * actually flush the caches. This is achieved as the calls will look - * at the entry only under the i_pages lock and once they do that - * they will see the entry locked and wait for it to unlock. 
- */ - xas_clear_mark(xas, PAGECACHE_TAG_TOWRITE); - xas_unlock_irq(xas); - - /* - * If dax_writeback_mapping_range() was given a wbc->range_start - * in the middle of a PMD, the 'index' we use needs to be - * aligned to the start of the PMD. - * This allows us to flush for PMD_SIZE and not have to worry about - * partial PMD writebacks. - */ - pfn = dax_to_pfn(entry); - count = 1UL << dax_entry_order(entry); - index = xas->xa_index & ~(count - 1); - end = index + count - 1; - - /* Walk all mappings of a given index of a file and writeprotect them */ - i_mmap_lock_read(mapping); - vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) { - pfn_mkclean_range(pfn, count, index, vma); - cond_resched(); - } - i_mmap_unlock_read(mapping); - - dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE); - /* - * After we have flushed the cache, we can clear the dirty tag. There - * cannot be new dirty data in the pfn after the flush has completed as - * the pfn mappings are writeprotected and fault waits for mapping - * entry lock. - */ - xas_reset(xas); - xas_lock_irq(xas); - xas_store(xas, entry); - xas_clear_mark(xas, PAGECACHE_TAG_DIRTY); - dax_wake_entry(xas, entry, WAKE_NEXT); - - trace_dax_writeback_one(mapping->host, index, count); - return ret; - - put_unlocked: - put_unlocked_entry(xas, entry, WAKE_NEXT); - return ret; -} - /* * Flush the mapping to the persistent domain within the byte range of [start, * end]. This is required by data integrity operations to ensure file data is @@ -1219,6 +191,37 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size, return 0; } +/* + * MAP_SYNC on a dax mapping guarantees dirty metadata is + * flushed on write-faults (non-cow), but not read-faults. + */ +static bool dax_fault_is_synchronous(const struct iomap_iter *iter, + struct vm_area_struct *vma) +{ + return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) && + (iter->iomap.flags & IOMAP_F_DIRTY); +} + +static bool dax_fault_is_cow(const struct iomap_iter *iter) +{ + return (iter->flags & IOMAP_WRITE) && + (iter->iomap.flags & IOMAP_F_SHARED); +} + +static unsigned long dax_iter_flags(const struct iomap_iter *iter, + struct vm_fault *vmf) +{ + unsigned long flags = 0; + + if (!dax_fault_is_synchronous(iter, vmf->vma)) + flags |= DAX_DIRTY; + + if (dax_fault_is_cow(iter)) + flags |= DAX_COW; + + return flags; +} + /* * The user has performed a load from a hole in the file. Allocating a new * page in the file would cause excessive storage usage for workloads with @@ -1701,7 +704,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page) iter.flags |= IOMAP_WRITE; - entry = grab_mapping_entry(&xas, mapping, 0); + entry = dax_grab_mapping_entry(&xas, mapping, 0); if (xa_is_internal(entry)) { ret = xa_to_internal(entry); goto out; @@ -1818,12 +821,12 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, goto fallback; /* - * grab_mapping_entry() will make sure we get an empty PMD entry, + * dax_grab_mapping_entry() will make sure we get an empty PMD entry, * a zero PMD entry or a DAX PMD. If it can't (because a PTE * entry is already in the array, for instance), it will return * VM_FAULT_FALLBACK. 
*/ - entry = grab_mapping_entry(&xas, mapping, PMD_ORDER); + entry = dax_grab_mapping_entry(&xas, mapping, PMD_ORDER); if (xa_is_internal(entry)) { ret = xa_to_internal(entry); goto fallback; @@ -1897,50 +900,6 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size, } EXPORT_SYMBOL_GPL(dax_iomap_fault); -/* - * dax_insert_pfn_mkwrite - insert PTE or PMD entry into page tables - * @vmf: The description of the fault - * @pfn: PFN to insert - * @order: Order of entry to insert. - * - * This function inserts a writeable PTE or PMD entry into the page tables - * for an mmaped DAX file. It also marks the page cache entry as dirty. - */ -static vm_fault_t -dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order) -{ - struct address_space *mapping = vmf->vma->vm_file->f_mapping; - XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order); - void *entry; - vm_fault_t ret; - - xas_lock_irq(&xas); - entry = get_unlocked_entry(&xas, order); - /* Did we race with someone splitting entry or so? */ - if (!entry || dax_is_conflict(entry) || - (order == 0 && !dax_is_pte_entry(entry))) { - put_unlocked_entry(&xas, entry, WAKE_NEXT); - xas_unlock_irq(&xas); - trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf, - VM_FAULT_NOPAGE); - return VM_FAULT_NOPAGE; - } - xas_set_mark(&xas, PAGECACHE_TAG_DIRTY); - dax_lock_entry(&xas, entry); - xas_unlock_irq(&xas); - if (order == 0) - ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn); -#ifdef CONFIG_FS_DAX_PMD - else if (order == PMD_ORDER) - ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE); -#endif - else - ret = VM_FAULT_FALLBACK; - dax_unlock_entry(&xas, entry); - trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret); - return ret; -} - /** * dax_finish_sync_fault - finish synchronous page fault * @vmf: The description of the fault diff --git a/include/linux/dax.h b/include/linux/dax.h index f6acb4ed73cb..de60a34088bb 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -157,15 +157,33 @@ static inline void fs_put_dax(struct dax_device *dax_dev, void *holder) int dax_writeback_mapping_range(struct address_space *mapping, struct dax_device *dax_dev, struct writeback_control *wbc); -struct page *dax_zap_mappings(struct address_space *mapping); -struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start, - loff_t end); +#else +static inline int dax_writeback_mapping_range(struct address_space *mapping, + struct dax_device *dax_dev, struct writeback_control *wbc) +{ + return -EOPNOTSUPP; +} + +#endif + +int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, + const struct iomap_ops *ops); +int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero, + const struct iomap_ops *ops); + +#if IS_ENABLED(CONFIG_DAX) +int dax_read_lock(void); +void dax_read_unlock(int id); dax_entry_t dax_lock_page(struct page *page); void dax_unlock_page(struct page *page, dax_entry_t cookie); +void run_dax(struct dax_device *dax_dev); dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, unsigned long index, struct page **page); void dax_unlock_mapping_entry(struct address_space *mapping, unsigned long index, dax_entry_t cookie); +struct page *dax_zap_mappings(struct address_space *mapping); +struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start, + loff_t end); #else static inline struct page *dax_zap_mappings(struct address_space *mapping) { @@ -179,12 +197,6 @@ static inline struct page *dax_zap_mappings_range(struct 
address_space *mapping, return NULL; } -static inline int dax_writeback_mapping_range(struct address_space *mapping, - struct dax_device *dax_dev, struct writeback_control *wbc) -{ - return -EOPNOTSUPP; -} - static inline dax_entry_t dax_lock_page(struct page *page) { if (IS_DAX(page->mapping->host)) @@ -196,6 +208,15 @@ static inline void dax_unlock_page(struct page *page, dax_entry_t cookie) { } +static inline int dax_read_lock(void) +{ + return 0; +} + +static inline void dax_read_unlock(int id) +{ +} + static inline dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, unsigned long index, struct page **page) { @@ -208,11 +229,6 @@ static inline void dax_unlock_mapping_entry(struct address_space *mapping, } #endif -int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, - const struct iomap_ops *ops); -int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero, - const struct iomap_ops *ops); - /* * Document all the code locations that want know when a dax page is * unreferenced. @@ -222,19 +238,6 @@ static inline bool dax_page_idle(struct page *page) return page_ref_count(page) == 1; } -#if IS_ENABLED(CONFIG_DAX) -int dax_read_lock(void); -void dax_read_unlock(int id); -#else -static inline int dax_read_lock(void) -{ - return 0; -} - -static inline void dax_read_unlock(int id) -{ -} -#endif /* CONFIG_DAX */ bool dax_alive(struct dax_device *dax_dev); void *dax_get_private(struct dax_device *dax_dev); long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, @@ -255,6 +258,9 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size, pfn_t *pfnp, int *errp, const struct iomap_ops *ops); vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, enum page_entry_size pe_size, pfn_t pfn); +void *dax_grab_mapping_entry(struct xa_state *xas, + struct address_space *mapping, unsigned int order); +void dax_unlock_entry(struct xa_state *xas, void *entry); int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index); int dax_invalidate_mapping_entry_sync(struct address_space *mapping, pgoff_t index); @@ -271,6 +277,56 @@ static inline bool dax_mapping(struct address_space *mapping) return mapping->host && IS_DAX(mapping->host); } +/* + * DAX pagecache entries use XArray value entries so they can't be mistaken + * for pages. We use one bit for locking, one bit for the entry size (PMD) + * and two more to tell us if the entry is a zero page or an empty entry that + * is just used for locking. In total four special bits. + * + * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE + * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem + * block allocation. + */ +#define DAX_SHIFT (5) +#define DAX_MASK ((1UL << DAX_SHIFT) - 1) +#define DAX_LOCKED (1UL << 0) +#define DAX_PMD (1UL << 1) +#define DAX_ZERO_PAGE (1UL << 2) +#define DAX_EMPTY (1UL << 3) +#define DAX_ZAP (1UL << 4) + +/* + * These flags are not conveyed in Xarray value entries, they are just + * modifiers to dax_insert_entry(). 
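 *
 * Worked example (illustrative only): dax_insert_entry(xas, vmf, &entry,
 * pfn, DAX_PMD | DAX_DIRTY) encodes only (pfn_t_to_pfn(pfn) << DAX_SHIFT)
 * | DAX_PMD into the XArray value, since DAX_DIRTY is stripped by
 * DAX_MASK; the DAX_DIRTY modifier instead causes the entry to be tagged
 * PAGECACHE_TAG_DIRTY and the inode to be marked dirty.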
+ */ +#define DAX_DIRTY (1UL << (DAX_SHIFT + 0)) +#define DAX_COW (1UL << (DAX_SHIFT + 1)) + +vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, + void **pentry, pfn_t pfn, unsigned long flags); +vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, + unsigned int order); +int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev, + struct address_space *mapping, void *entry); + +/* The 'colour' (ie low bits) within a PMD of a page offset. */ +#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1) +#define PG_PMD_NR (PMD_SIZE >> PAGE_SHIFT) + +/* The order of a PMD entry */ +#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT) + +static inline unsigned int pe_order(enum page_entry_size pe_size) +{ + if (pe_size == PE_SIZE_PTE) + return PAGE_SHIFT - PAGE_SHIFT; + if (pe_size == PE_SIZE_PMD) + return PMD_SHIFT - PAGE_SHIFT; + if (pe_size == PE_SIZE_PUD) + return PUD_SHIFT - PAGE_SHIFT; + return ~0; +} + #ifdef CONFIG_DEV_DAX_HMEM_DEVICES void hmem_register_device(int target_nid, struct resource *r); #else diff --git a/include/linux/memremap.h b/include/linux/memremap.h index fd57407e7f3d..e5d30eec3bf1 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -221,6 +221,12 @@ static inline void devm_memunmap_pages(struct device *dev, { } +static inline struct dev_pagemap * +get_dev_pagemap_many(unsigned long pfn, struct dev_pagemap *pgmap, int refs) +{ + return NULL; +} + static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn, struct dev_pagemap *pgmap) { From patchwork Fri Sep 16 03:36:25 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12978065 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id AE346C6FA90 for ; Fri, 16 Sep 2022 03:36:28 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 4A70C80008; Thu, 15 Sep 2022 23:36:28 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 42F5D8D0002; Thu, 15 Sep 2022 23:36:28 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2AA0A80008; Thu, 15 Sep 2022 23:36:28 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 168458D0002 for ; Thu, 15 Sep 2022 23:36:28 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id EEE5712062F for ; Fri, 16 Sep 2022 03:36:27 +0000 (UTC) X-FDA: 79916536014.05.DF7938E Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by imf01.hostedemail.com (Postfix) with ESMTP id 71AF0400A5 for ; Fri, 16 Sep 2022 03:36:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1663299387; x=1694835387; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ajaFzvrRPG16T180cU73eT4NP/l2RlRiUDdGWZj7wgU=; b=V6Z+xkkAQCCNfLUyTnvd7MbisOcimRcfLlmE0pfW0a7kcmYplm7GQqiU TvrGoP9axXzgOPl7E1tch2zkErmrqwI7qz2C5ChliWCU/S1S07AxWi6cE 0wEjAsnr1S6M+sVSCdv1EgVbppzgwF+pkeCshbHw788oAi0c209yvrSS/ KNEiDcK9S3gmwGKz8KQXd3dKg9/AEDshLjoEnW83gLjWDG86hTE4pyfye YpIv75e0WYOjUk9D2waPNobGOwBBt/I/43e2mPfTEVuK3k2K0X0RvxjOs 
cqMgOx8BeoznitqMr0HiVxrQm7xpq2SA+9xasJ8qXyYs7NrJcEsqOksDh w==; X-IronPort-AV: E=McAfee;i="6500,9779,10471"; a="362867039" X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="362867039" Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:26 -0700 X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="679809498" Received: from colinlix-mobl.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.209.29.52]) by fmsmga008-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:25 -0700 Subject: [PATCH v2 13/18] dax: Prep mapping helpers for compound pages From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:36:25 -0700 Message-ID: <166329938508.2786261.5544204703263725154.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299387; a=rsa-sha256; cv=none; b=oWSVOvRw1giapgRSggcFQQoISYqoS/0aqMI3TqbZLQSTs5hKpvLjQQZQ3a02KEBCBIfZT7 TwKYVFr206CUh24XlgpCd6Yn9LamyuBqv2WJHK73XBrYtAxKfpktUbOM8NklJmyGjsoBYW 0SnaD/kQjasVgtJTD+0fwxjqExAfqbo= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=V6Z+xkkA; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf01.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.100 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299387; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=JfbF5N6ejI3xx1xkFO8ghG2Kg/m+EIGT4GrHdEZXoxo=; b=rNCTjlgUYdcvOzFh8Lm7wce8AFwK61g1893O/7Uj/T3DLTgDCYvWL3/KWEgXZMm4Z05M9Z XfCX7qCqh7HzcyAMkCH8OC13M5jeLxgox1AREczUMSNaWyFbovY4uPLKQ4Lp0FaczUDLyk q4AGQqdU0XVhdRfXsTyVZ1VvPIOo/C0= X-Rspam-User: X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: 71AF0400A5 Authentication-Results: imf01.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=V6Z+xkkA; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf01.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.100 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com X-Stat-Signature: g9zagnsit5jz9h11xu4oot9tu5eez4k6 X-HE-Tag: 1663299387-362092 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: In preparation for device-dax to use the same mapping machinery as fsdax, add support for device-dax compound pages. Presently this is handled by dax_set_mapping() which is careful to only update page->mapping for head pages. However, it does that by looking at properties in the 'struct dev_dax' instance associated with the page. 
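For reference, that helper looks roughly like the following (paraphrased
and abbreviated from drivers/dax/device.c, not part of this patch):

	static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn,
				    unsigned long fault_size)
	{
		unsigned long i, nr_pages = fault_size / PAGE_SIZE;
		struct file *filp = vmf->vma->vm_file;
		struct dev_dax *dev_dax = filp->private_data;
		pgoff_t pgoff = linear_page_index(vmf->vma,
				ALIGN(vmf->address, fault_size));

		/* mapping is only set on the head page of a compound page */
		if (dev_dax->pgmap->vmemmap_shift)
			nr_pages = 1;

		for (i = 0; i < nr_pages; i++) {
			struct page *page = pfn_to_page(pfn_t_to_pfn(pfn) + i);

			page = compound_head(page);
			if (page->mapping)
				continue;
			page->mapping = filp->f_mapping;
			page->index = pgoff + i;
		}
	}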
Switch to just checking PageHead() directly in the functions that iterate over pages in a large mapping. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- drivers/dax/Kconfig | 1 + drivers/dax/mapping.c | 16 ++++++++++++++++ 2 files changed, 17 insertions(+) diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig index 205e9dda8928..2eddd32c51f4 100644 --- a/drivers/dax/Kconfig +++ b/drivers/dax/Kconfig @@ -9,6 +9,7 @@ if DAX config DEV_DAX tristate "Device DAX: direct access mapping device" depends on TRANSPARENT_HUGEPAGE + depends on !FS_DAX_LIMITED help Support raw access to differentiated (persistence, bandwidth, latency...) memory via an mmap(2) capable character diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c index 70576aa02148..5d4b9601f183 100644 --- a/drivers/dax/mapping.c +++ b/drivers/dax/mapping.c @@ -345,6 +345,8 @@ static vm_fault_t dax_associate_entry(void *entry, for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn); + page = compound_head(page); + if (flags & DAX_COW) { dax_mapping_set_cow(page); } else { @@ -353,6 +355,9 @@ static vm_fault_t dax_associate_entry(void *entry, page->index = index + i++; page_ref_inc(page); } + + if (PageHead(page)) + break; } return 0; @@ -372,6 +377,9 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping, for_each_mapped_pfn(entry, pfn) { page = pfn_to_page(pfn); + + page = compound_head(page); + if (dax_mapping_is_cow(page->mapping)) { /* keep the CoW flag if this page is still shared */ if (page->index-- > 0) @@ -383,6 +391,9 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping, } page->mapping = NULL; page->index = 0; + + if (PageHead(page)) + break; } if (trunc && !dax_mapping_is_cow(page->mapping)) { @@ -660,11 +671,16 @@ static struct page *dax_zap_pages(struct xa_state *xas, void *entry) for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn); + page = compound_head(page); + if (zap) page_ref_dec(page); if (!ret && !dax_page_idle(page)) ret = page; + + if (PageHead(page)) + break; } if (zap) From patchwork Fri Sep 16 03:36:31 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12978066 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 58FD6C32771 for ; Fri, 16 Sep 2022 03:36:34 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E928E8D0003; Thu, 15 Sep 2022 23:36:33 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id E409F8D0002; Thu, 15 Sep 2022 23:36:33 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D0A0F8D0003; Thu, 15 Sep 2022 23:36:33 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id C0E098D0002 for ; Thu, 15 Sep 2022 23:36:33 -0400 (EDT) Received: from smtpin04.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 9E6B440222 for ; Fri, 16 Sep 2022 03:36:33 +0000 (UTC) X-FDA: 79916536266.04.FEDDE38 Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by imf01.hostedemail.com (Postfix) with 
ESMTP id E69CA400A6 for ; Fri, 16 Sep 2022 03:36:32 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1663299392; x=1694835392; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=r/wpsgIfHbDzQJNWJ9HD7Sb+85QkD9aLuJZ8FLqHRVk=; b=g53wPSeVkvdSF6Cs1+SbG6Lb3K55J4CV3euWiOe+LGlxvPMHRioSiiXl 6FDei0xret8EXyc/VCyzgJahC2eCftQsTs911COEQOyZ5PGiq+Eu3/ha3 FLUS2IwBX52WxKsFLF+dKUgzwL9REOZ5t5zwVmjrpPqekTvE/yRmN1oKT yDngm8tKMcJ6cVdlCZdLHUWGKov4N9anQTuE2n+nDhfXBBZh9w3qaSNYs 7MozGc2cTkMsSa6I/wqEq4LbIWu3NcYN3VWaH/vva74h234EMhLfm9l+P e6q/JXaDupx/YhJonFcbQbBsEGdE4RvXvmHbPXm4YASBFZHObKid1GT9J w==; X-IronPort-AV: E=McAfee;i="6500,9779,10471"; a="362867052" X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="362867052" Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:32 -0700 X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="679809542" Received: from colinlix-mobl.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.209.29.52]) by fmsmga008-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:31 -0700 Subject: [PATCH v2 14/18] devdax: add PUD support to the DAX mapping infrastructure From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:36:31 -0700 Message-ID: <166329939123.2786261.4488002998591622104.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299393; a=rsa-sha256; cv=none; b=S2G0AXOm+er7qb9skb0urcGbzgIBz+IySIAhDUUXItUFBQbS+e1Spe39OEAOOL0c1Bwqi0 YLZkjGeaeKVsqjj348p640pe7azIpNUdrUjt7F+F4hc3YH8FV/CGRRWtA5cfeD2VhglIeA cuZCfKtPezwZW5SeUQIGgQNQiEiZi5k= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=g53wPSeV; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf01.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.100 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299393; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=lLdggswO+TzKF5OK/HDUCENsqVoK/XyidRBlSadXhU4=; b=gdqvOF+SHnBw67nTQsJLtROS5D7RPpaCjPHJPfWxgpFS+ODFldvrbVy/18To2h5t5FuKnG oRU8ZQ9W5TaCHyFba/3Wmsp0zsra2lQ/TdRvWIKvGU/fdCJbYp01MrHgc5dk/uxq/TbhmK jlDF9QuuRVwtKxwn7vcrArKrcj8zpRU= X-Rspam-User: X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: E69CA400A6 Authentication-Results: imf01.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=g53wPSeV; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf01.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.100 as permitted sender) 
smtp.mailfrom=dan.j.williams@intel.com X-Stat-Signature: f4u345cjwz59wifxyzayxwfsk5xf6b1k X-HE-Tag: 1663299392-694611 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: In preparation for using the DAX mapping infrastructure for device-dax, update the helpers to handle PUD entries. In practice the code related to @size_downgrade will go unused for PUD entries since only devdax creates DAX PUD entries and devdax enforces aligned mappings. The conversion is included for completeness. The addition of PUD support to dax_insert_pfn_mkwrite() requires a new stub for vmf_insert_pfn_pud() in the CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=n case. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- drivers/dax/mapping.c | 50 ++++++++++++++++++++++++++++++++++++----------- include/linux/dax.h | 32 ++++++++++++++++++++---------- include/linux/huge_mm.h | 11 ++++++++-- 3 files changed, 68 insertions(+), 25 deletions(-) diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c index 5d4b9601f183..b5a5196f8831 100644 --- a/drivers/dax/mapping.c +++ b/drivers/dax/mapping.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "dax-private.h" @@ -56,6 +57,8 @@ static bool dax_is_zapped(void *entry) static unsigned int dax_entry_order(void *entry) { + if (xa_to_value(entry) & DAX_PUD) + return PUD_ORDER; if (xa_to_value(entry) & DAX_PMD) return PMD_ORDER; return 0; @@ -66,9 +69,14 @@ static unsigned long dax_is_pmd_entry(void *entry) return xa_to_value(entry) & DAX_PMD; } +static unsigned long dax_is_pud_entry(void *entry) +{ + return xa_to_value(entry) & DAX_PUD; +} + static bool dax_is_pte_entry(void *entry) { - return !(xa_to_value(entry) & DAX_PMD); + return !(xa_to_value(entry) & (DAX_PMD|DAX_PUD)); } static int dax_is_zero_entry(void *entry) @@ -277,6 +285,8 @@ static unsigned long dax_entry_size(void *entry) return 0; else if (dax_is_pmd_entry(entry)) return PMD_SIZE; + else if (dax_is_pud_entry(entry)) + return PUD_SIZE; else return PAGE_SIZE; } @@ -564,11 +574,11 @@ void *dax_grab_mapping_entry(struct xa_state *xas, struct address_space *mapping, unsigned int order) { unsigned long index = xas->xa_index; - bool pmd_downgrade; /* splitting PMD entry into PTE entries? */ + bool size_downgrade; /* splitting entry into PTE entries? */ void *entry; retry: - pmd_downgrade = false; + size_downgrade = false; xas_lock_irq(xas); entry = get_unlocked_entry(xas, order); @@ -581,15 +591,25 @@ void *dax_grab_mapping_entry(struct xa_state *xas, } if (order == 0) { - if (dax_is_pmd_entry(entry) && + if (!dax_is_pte_entry(entry) && (dax_is_zero_entry(entry) || dax_is_empty_entry(entry))) { - pmd_downgrade = true; + size_downgrade = true; } } } - if (pmd_downgrade) { + if (size_downgrade) { + unsigned long colour, nr; + + if (dax_is_pmd_entry(entry)) { + colour = PG_PMD_COLOUR; + nr = PG_PMD_NR; + } else { + colour = PG_PUD_COLOUR; + nr = PG_PUD_NR; + } + /* * Make sure 'entry' remains valid while we drop * the i_pages lock. 
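Aside for readers of the hunk above: with the new DAX_PUD bit the span that must be unmapped on a downgrade follows from the entry's size bits. A minimal sketch, where entry_span_sketch() is a hypothetical helper that only mirrors the DAX_PMD/DAX_PUD selection in this patch:

static void entry_span_sketch(void *entry, unsigned long *colour,
			      unsigned long *nr)
{
	unsigned long v = xa_to_value(entry);

	if (v & DAX_PUD) {
		*colour = (PUD_SIZE >> PAGE_SHIFT) - 1;	/* PG_PUD_COLOUR */
		*nr = PUD_SIZE >> PAGE_SHIFT;		/* PG_PUD_NR */
	} else if (v & DAX_PMD) {
		*colour = (PMD_SIZE >> PAGE_SHIFT) - 1;	/* PG_PMD_COLOUR */
		*nr = PMD_SIZE >> PAGE_SHIFT;		/* PG_PMD_NR */
	} else {
		*colour = 0;	/* PTE entry: a single page */
		*nr = 1;
	}
}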
@@ -603,9 +623,8 @@ void *dax_grab_mapping_entry(struct xa_state *xas, */ if (dax_is_zero_entry(entry)) { xas_unlock_irq(xas); - unmap_mapping_pages(mapping, - xas->xa_index & ~PG_PMD_COLOUR, - PG_PMD_NR, false); + unmap_mapping_pages(mapping, xas->xa_index & ~colour, + nr, false); xas_reset(xas); xas_lock_irq(xas); } @@ -613,7 +632,7 @@ void *dax_grab_mapping_entry(struct xa_state *xas, dax_disassociate_entry(entry, mapping, false); xas_store(xas, NULL); /* undo the PMD join */ dax_wake_entry(xas, entry, WAKE_ALL); - mapping->nrpages -= PG_PMD_NR; + mapping->nrpages -= nr; entry = NULL; xas_set(xas, index); } @@ -623,7 +642,9 @@ void *dax_grab_mapping_entry(struct xa_state *xas, } else { unsigned long flags = DAX_EMPTY; - if (order > 0) + if (order == PUD_SHIFT - PAGE_SHIFT) + flags |= DAX_PUD; + else if (order == PMD_SHIFT - PAGE_SHIFT) flags |= DAX_PMD; entry = dax_make_entry(pfn_to_pfn_t(0), flags); dax_lock_entry(xas, entry); @@ -846,7 +867,10 @@ vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) { unsigned long index = xas->xa_index; /* we are replacing a zero page with block mapping */ - if (dax_is_pmd_entry(entry)) + if (dax_is_pud_entry(entry)) + unmap_mapping_pages(mapping, index & ~PG_PUD_COLOUR, + PG_PUD_NR, false); + else if (dax_is_pmd_entry(entry)) unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR, PG_PMD_NR, false); else /* pte entry */ @@ -1018,6 +1042,8 @@ vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, else if (order == PMD_ORDER) ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE); #endif + else if (order == PUD_ORDER) + ret = vmf_insert_pfn_pud(vmf, pfn, FAULT_FLAG_WRITE); else ret = VM_FAULT_FALLBACK; dax_unlock_entry(&xas, entry); diff --git a/include/linux/dax.h b/include/linux/dax.h index de60a34088bb..3a27fecf072a 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -278,22 +278,25 @@ static inline bool dax_mapping(struct address_space *mapping) } /* - * DAX pagecache entries use XArray value entries so they can't be mistaken - * for pages. We use one bit for locking, one bit for the entry size (PMD) - * and two more to tell us if the entry is a zero page or an empty entry that - * is just used for locking. In total four special bits. + * DAX pagecache entries use XArray value entries so they can't be + * mistaken for pages. We use one bit for locking, two bits for the + * entry size (PMD, PUD) and two more to tell us if the entry is a zero + * page or an empty entry that is just used for locking. In total 5 + * special bits which limits the max pfn that can be stored as: + * (1UL << 57 - PAGE_SHIFT). 63 - DAX_SHIFT - 1 (for xa_mk_value()). * - * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE - * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem - * block allocation. + * If the P{M,U}D bits are not set the entry has size PAGE_SIZE, and if + * the ZERO_PAGE and EMPTY bits aren't set the entry is a normal DAX + * entry with a filesystem block allocation. 
*/ -#define DAX_SHIFT (5) +#define DAX_SHIFT (6) #define DAX_MASK ((1UL << DAX_SHIFT) - 1) #define DAX_LOCKED (1UL << 0) #define DAX_PMD (1UL << 1) -#define DAX_ZERO_PAGE (1UL << 2) -#define DAX_EMPTY (1UL << 3) -#define DAX_ZAP (1UL << 4) +#define DAX_PUD (1UL << 2) +#define DAX_ZERO_PAGE (1UL << 3) +#define DAX_EMPTY (1UL << 4) +#define DAX_ZAP (1UL << 5) /* * These flags are not conveyed in Xarray value entries, they are just @@ -316,6 +319,13 @@ int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev, /* The order of a PMD entry */ #define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT) +/* The 'colour' (ie low bits) within a PUD of a page offset. */ +#define PG_PUD_COLOUR ((PUD_SIZE >> PAGE_SHIFT) - 1) +#define PG_PUD_NR (PUD_SIZE >> PAGE_SHIFT) + +/* The order of a PUD entry */ +#define PUD_ORDER (PUD_SHIFT - PAGE_SHIFT) + static inline unsigned int pe_order(enum page_entry_size pe_size) { if (pe_size == PE_SIZE_PTE) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 768e5261fdae..de73f5a16252 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -18,10 +18,19 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud); +vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn, + pgprot_t pgprot, bool write); #else static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud) { } + +static inline vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, + pfn_t pfn, pgprot_t pgprot, + bool write) +{ + return VM_FAULT_SIGBUS; +} #endif vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf); @@ -58,8 +67,6 @@ static inline vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, { return vmf_insert_pfn_pmd_prot(vmf, pfn, vmf->vma->vm_page_prot, write); } -vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn, - pgprot_t pgprot, bool write); /** * vmf_insert_pfn_pud - insert a pud size pfn From patchwork Fri Sep 16 03:36:37 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12978067 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id E1B1FC32771 for ; Fri, 16 Sep 2022 03:36:40 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 7DB0B8D0008; Thu, 15 Sep 2022 23:36:40 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 78B328D0002; Thu, 15 Sep 2022 23:36:40 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 62C748D0008; Thu, 15 Sep 2022 23:36:40 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 4D30F8D0002 for ; Thu, 15 Sep 2022 23:36:40 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 2F347C06C1 for ; Fri, 16 Sep 2022 03:36:40 +0000 (UTC) X-FDA: 79916536560.05.519D0CC Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by imf14.hostedemail.com (Postfix) with ESMTP id 942761000A3 for ; Fri, 16 Sep 2022 03:36:39 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; 
q=dns/txt; s=Intel; t=1663299399; x=1694835399; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=H9Kht40qne3iP+SMB12jPN9XbzdsW4QgUBLEqx1cMBY=; b=WBx4k0XKhCXKcBorVbs4q/8RCFDmw83+sL+Uwi4/SZSg6V/CiW1mZ75u PR8UMzFVQnkto5lVccXqfi2/04nwxSlARKkZW4hG0hgSHVn+BZjVGABvN ajHLTZ3Cg3w429OAE+FzplYLV8mtjoIbnwGjR4l6Za5r18SZnCrj1Z6l6 vyeLUfRPu70ARL5VAb6wkeqvcwFPHQDnF9iuuamWk1PAmKrHtVW9pzR7B ImgdlHPyHLwc1flt7imETMNgHX43Dnp65kxEKyA2RSmmePpcLEi+41gMG Gm265+AHpp4t0hLVx6PiIwVmARQ9qRQ+ucob64JcHfzwKUJcjhxNhc6yq A==; X-IronPort-AV: E=McAfee;i="6500,9779,10471"; a="298895579" X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="298895579" Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by fmsmga103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:38 -0700 X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="679809564" Received: from colinlix-mobl.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.209.29.52]) by fmsmga008-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:37 -0700 Subject: [PATCH v2 15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry() From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:36:37 -0700 Message-ID: <166329939733.2786261.13946962468817639563.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299399; a=rsa-sha256; cv=none; b=t5QzgvWzEB8ZvU34kwfXsVZU9vbmJDxvjRiLsMHVl9xGGdq9SSdeGBql+FVsZ8v23YQG/3 O8nzBzBsQIMO1N+lMe/uE9c3gaVNUltsEMdXq/v2Z7WwdSENa46qEUAeybcZUWRbKd4mIg 172dPUIrmgT9G9TAIfIR/8UL5SqLWb0= ARC-Authentication-Results: i=1; imf14.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=WBx4k0XK; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf14.hostedemail.com: domain of dan.j.williams@intel.com designates 192.55.52.115 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299399; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=Xnt+O4Zrp/JL2BXJJwm4LEuF0OGkoQW8QEPFPZP3eUU=; b=0ivCK2uMtWmv+uQb9BhcMZqySvR9W5uzhKhahpvlnx37xEAFxnTFFmVPLjowKypEOBjevb altD4TR43IdNYj6IQHu/mRWF3yVYGJyZV4XT82P4QiZtL/Iw7v4fxxPN5FrpOmEappWsU+ jED03qgOAySf/cXjcvVLbAy0IBAWL58= X-Rspam-User: X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: 942761000A3 Authentication-Results: imf14.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=WBx4k0XK; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf14.hostedemail.com: domain of dan.j.williams@intel.com designates 192.55.52.115 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com X-Stat-Signature: 85e775e7t6byam33cgoo9pix4bo7nti5 X-HE-Tag: 1663299399-507919 X-Bogosity: Ham, tests=bogofilter, 
spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Track entries and take pgmap references at mapping insertion time. Revoke mappings (dax_zap_mappings()) and drop the associated pgmap references at device destruction or inode eviction time. With this in place, and the fsdax equivalent already in place, the gup code no longer needs to consider PTE_DEVMAP as an indicator to get a pgmap reference before taking a page reference. In other words, GUP takes additional references on mapped pages. Until now, DAX in all its forms was failing to take references at mapping time. With that fixed there is no longer a requirement for gup to manage @pgmap references. However, that cleanup is saved for a follow-on patch. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- drivers/dax/bus.c | 15 +++++++++- drivers/dax/device.c | 73 +++++++++++++++++++++++++++++-------------------- drivers/dax/mapping.c | 3 ++ 3 files changed, 60 insertions(+), 31 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 1dad813ee4a6..35a319a76c82 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -382,9 +382,22 @@ void kill_dev_dax(struct dev_dax *dev_dax) { struct dax_device *dax_dev = dev_dax->dax_dev; struct inode *inode = dax_inode(dax_dev); + struct page *page; kill_dax(dax_dev); - unmap_mapping_range(inode->i_mapping, 0, 0, 1); + + /* + * New mappings are blocked. Wait for all GUP users to release + * their pins. + */ + do { + page = dax_zap_mappings(inode->i_mapping); + if (!page) + break; + __wait_var_event(page, dax_page_idle(page)); + } while (true); + + truncate_inode_pages(inode->i_mapping, 0); /* * Dynamic dax region have the pgmap allocated via dev_kzalloc() diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 5494d745ced5..7f306939807e 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -73,38 +73,15 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, return -1; } -static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn, - unsigned long fault_size) -{ - unsigned long i, nr_pages = fault_size / PAGE_SIZE; - struct file *filp = vmf->vma->vm_file; - struct dev_dax *dev_dax = filp->private_data; - pgoff_t pgoff; - - /* mapping is only set on the head */ - if (dev_dax->pgmap->vmemmap_shift) - nr_pages = 1; - - pgoff = linear_page_index(vmf->vma, - ALIGN(vmf->address, fault_size)); - - for (i = 0; i < nr_pages; i++) { - struct page *page = pfn_to_page(pfn_t_to_pfn(pfn) + i); - - page = compound_head(page); - if (page->mapping) - continue; - - page->mapping = filp->f_mapping; - page->index = pgoff + i; - } -} - static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf) { + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + XA_STATE(xas, &mapping->i_pages, vmf->pgoff); struct device *dev = &dev_dax->dev; phys_addr_t phys; + vm_fault_t ret; + void *entry; pfn_t pfn; unsigned int fault_size = PAGE_SIZE; @@ -128,7 +105,16 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax, pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); - dax_set_mapping(vmf, pfn, fault_size); + entry = dax_grab_mapping_entry(&xas, mapping, 0); + if (xa_is_internal(entry)) + return xa_to_internal(entry); + + ret = dax_insert_entry(&xas, vmf, &entry, pfn, 0); + + dax_unlock_entry(&xas, entry); + + if (ret) + return ret; return vmf_insert_mixed(vmf->vma, vmf->address, 
pfn); } @@ -136,10 +122,14 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax, static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf) { + struct address_space *mapping = vmf->vma->vm_file->f_mapping; unsigned long pmd_addr = vmf->address & PMD_MASK; + XA_STATE(xas, &mapping->i_pages, vmf->pgoff); struct device *dev = &dev_dax->dev; phys_addr_t phys; + vm_fault_t ret; pgoff_t pgoff; + void *entry; pfn_t pfn; unsigned int fault_size = PMD_SIZE; @@ -171,7 +161,16 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); - dax_set_mapping(vmf, pfn, fault_size); + entry = dax_grab_mapping_entry(&xas, mapping, PMD_ORDER); + if (xa_is_internal(entry)) + return xa_to_internal(entry); + + ret = dax_insert_entry(&xas, vmf, &entry, pfn, DAX_PMD); + + dax_unlock_entry(&xas, entry); + + if (ret) + return ret; return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE); } @@ -180,10 +179,14 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf) { + struct address_space *mapping = vmf->vma->vm_file->f_mapping; unsigned long pud_addr = vmf->address & PUD_MASK; + XA_STATE(xas, &mapping->i_pages, vmf->pgoff); struct device *dev = &dev_dax->dev; phys_addr_t phys; + vm_fault_t ret; pgoff_t pgoff; + void *entry; pfn_t pfn; unsigned int fault_size = PUD_SIZE; @@ -216,7 +219,16 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); - dax_set_mapping(vmf, pfn, fault_size); + entry = dax_grab_mapping_entry(&xas, mapping, PUD_ORDER); + if (xa_is_internal(entry)) + return xa_to_internal(entry); + + ret = dax_insert_entry(&xas, vmf, &entry, pfn, DAX_PUD); + + dax_unlock_entry(&xas, entry); + + if (ret) + return ret; return vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE); } @@ -494,3 +506,4 @@ MODULE_LICENSE("GPL v2"); module_init(dax_init); module_exit(dax_exit); MODULE_ALIAS_DAX_DEVICE(0); +MODULE_IMPORT_NS(DAX); diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c index b5a5196f8831..9981eebb2dc5 100644 --- a/drivers/dax/mapping.c +++ b/drivers/dax/mapping.c @@ -266,6 +266,7 @@ void dax_unlock_entry(struct xa_state *xas, void *entry) WARN_ON(!dax_is_locked(old)); dax_wake_entry(xas, entry, WAKE_NEXT); } +EXPORT_SYMBOL_NS_GPL(dax_unlock_entry, DAX); /* * Return: The entry stored at this location before it was locked. 
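Taken together with the device.c changes above, the per-fault flow that device-dax now shares with fsdax looks roughly like this minimal sketch of the PTE case (dev_dax_pte_fault_sketch() is hypothetical, error paths are trimmed, and the PMD/PUD cases are analogous with DAX_PMD/DAX_PUD plus the corresponding vmf_insert_pfn_*() helper):

static vm_fault_t dev_dax_pte_fault_sketch(struct vm_fault *vmf, pfn_t pfn)
{
	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
	vm_fault_t ret;
	void *entry;

	/* reserve and lock the mapping entry for this file offset */
	entry = dax_grab_mapping_entry(&xas, mapping, 0);
	if (xa_is_internal(entry))
		return xa_to_internal(entry);

	/* record page->mapping/page->index and take the page reference */
	ret = dax_insert_entry(&xas, vmf, &entry, pfn, 0);
	dax_unlock_entry(&xas, entry);
	if (ret)
		return ret;

	/* finally install the translation */
	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
}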
@@ -666,6 +667,7 @@ void *dax_grab_mapping_entry(struct xa_state *xas, xas_unlock_irq(xas); return xa_mk_internal(VM_FAULT_FALLBACK); } +EXPORT_SYMBOL_NS_GPL(dax_grab_mapping_entry, DAX); static void *dax_zap_entry(struct xa_state *xas, void *entry) { @@ -910,6 +912,7 @@ vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, *pentry = entry; return 0; } +EXPORT_SYMBOL_NS_GPL(dax_insert_entry, DAX); int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev, struct address_space *mapping, void *entry) From patchwork Fri Sep 16 03:36:43 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12978068 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4D556C32771 for ; Fri, 16 Sep 2022 03:36:47 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id DF51480009; Thu, 15 Sep 2022 23:36:46 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id DA4E68D0002; Thu, 15 Sep 2022 23:36:46 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C1F7380009; Thu, 15 Sep 2022 23:36:46 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id AEEDA8D0002 for ; Thu, 15 Sep 2022 23:36:46 -0400 (EDT) Received: from smtpin24.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 8B8BF1A06C5 for ; Fri, 16 Sep 2022 03:36:46 +0000 (UTC) X-FDA: 79916536812.24.F58A43D Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by imf07.hostedemail.com (Postfix) with ESMTP id 085F240094 for ; Fri, 16 Sep 2022 03:36:45 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1663299406; x=1694835406; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=R5iLP9zAHTpb9OkjwMC0Bw2xXqlB6pFYCCoagiNt1Hw=; b=mnaB6T4LghDIN7deuks9qBR2WV2AWID9qCPocoVsTWLRfj3vS3hV7tR5 xNSy1mhZjMDRR3KDVrdRCwn0FPogz2sjahV0Sf/u7Rmum1WQsBiA0Bfig 3fiwDhjjfYKFn+yPX7Uk1XuANZtAw9zark8rPCMsdgrL+x1evcBer9UJD hG1mvaGP520moZkc/vhCDg2w7y5K+j5RAle2pj0/Tkcc/KCRmWoTC1M+Z AEiJZc98eO8D+JypMOF6sn62/S35bBAYPLW1l1ynTaeXfb0lWl37s6BfC CT7Zz4JAPTUL48n2RLWWgjgbc8luf6SpZwSXJDm/H2e+nGazmEUFRXIBG Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10471"; a="362867080" X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="362867080" Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:44 -0700 X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="679809589" Received: from colinlix-mobl.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.209.29.52]) by fmsmga008-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:43 -0700 Subject: [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. 
Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:36:43 -0700 Message-ID: <166329940343.2786261.6047770378829215962.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299406; a=rsa-sha256; cv=none; b=IiUlWo7GbJOrpION6Eb21mrglFNVJzWh30qq1Kb/Juk+T22oV2CVnUAFL8V22fzkw8F7ra lbMI4/JeZYdReM02Sfia3OwecVuNshqvXDq1NEe/LHFdVKQ/VlagciYVs38EJ1+lUuzdCc 1PxkdnHR+RwIQ7cs0aOLBriQEY/D3fc= ARC-Authentication-Results: i=1; imf07.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=mnaB6T4L; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf07.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.100 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299406; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=KPg/2jdS7BngvojjU/GNniiwhZh+I091wdut/ByZ5FY=; b=iWXpRN4YhtoW493vU9+k4wO38P7VKTJJnwYg4uVbuy5RS53cVKXkRQkW9UmBY/qMNfhaZh 8DF6D1H2LLTHNimgthZLUfj6Y/A3qNTrKknaZ5kP/kLFje7rNskZdy92V0xJsx0UHd7ckS SogVYDPXto1pbytAk5B8CB010hu0Fjk= X-Stat-Signature: a49gusc7tecmfpqddftqodb9p4i6n5ym X-Rspamd-Queue-Id: 085F240094 Authentication-Results: imf07.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=mnaB6T4L; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf07.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.100 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com X-Rspamd-Server: rspam01 X-Rspam-User: X-HE-Tag: 1663299405-925076 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The initial memremap_pages() implementation inherited the __init_single_page() default of pages starting life with an elevated reference count. This originally allowed for the page->pgmap pointer to alias with the storage for page->lru since a page was only allowed to be on an lru list when its reference count was zero. Since then, 'struct page' definition cleanups have arranged for dedicated space for the ZONE_DEVICE page metadata, and the MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0 page->_refcount transition to route the page to free_zone_device_page() and not the core-mm page-free. With those cleanups in place and with filesystem-dax and device-dax now converted to take and drop references at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX and MEMORY_DEVICE_GENERIC reference counts at 0. MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE pages start life at _refcount 1, so make that the default if pgmap->init_mode is left at zero. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. 
Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- drivers/dax/device.c | 1 + drivers/nvdimm/pmem.c | 2 ++ include/linux/dax.h | 2 +- include/linux/memremap.h | 5 +++++ mm/memremap.c | 15 ++++++++++----- mm/page_alloc.c | 2 ++ 6 files changed, 21 insertions(+), 6 deletions(-) diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 7f306939807e..8a7281d16c99 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -460,6 +460,7 @@ int dev_dax_probe(struct dev_dax *dev_dax) } pgmap->type = MEMORY_DEVICE_GENERIC; + pgmap->init_mode = INIT_PAGEMAP_IDLE; if (dev_dax->align > PAGE_SIZE) pgmap->vmemmap_shift = order_base_2(dev_dax->align >> PAGE_SHIFT); diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 7e88cd242380..9c98dcb9f33d 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -529,6 +529,7 @@ static int pmem_attach_disk(struct device *dev, pmem->pfn_flags = PFN_DEV; if (is_nd_pfn(dev)) { pmem->pgmap.type = MEMORY_DEVICE_FS_DAX; + pmem->pgmap.init_mode = INIT_PAGEMAP_IDLE; pmem->pgmap.ops = &fsdax_pagemap_ops; addr = devm_memremap_pages(dev, &pmem->pgmap); pfn_sb = nd_pfn->pfn_sb; @@ -543,6 +544,7 @@ static int pmem_attach_disk(struct device *dev, pmem->pgmap.range.end = res->end; pmem->pgmap.nr_range = 1; pmem->pgmap.type = MEMORY_DEVICE_FS_DAX; + pmem->pgmap.init_mode = INIT_PAGEMAP_IDLE; pmem->pgmap.ops = &fsdax_pagemap_ops; addr = devm_memremap_pages(dev, &pmem->pgmap); pmem->pfn_flags |= PFN_MAP; diff --git a/include/linux/dax.h b/include/linux/dax.h index 3a27fecf072a..b9fdd8951e06 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -235,7 +235,7 @@ static inline void dax_unlock_mapping_entry(struct address_space *mapping, */ static inline bool dax_page_idle(struct page *page) { - return page_ref_count(page) == 1; + return page_ref_count(page) == 0; } bool dax_alive(struct dax_device *dax_dev); diff --git a/include/linux/memremap.h b/include/linux/memremap.h index e5d30eec3bf1..9f1a57efd371 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -116,6 +116,7 @@ struct dev_pagemap_ops { * representation. A bigger value will set up compound struct pages * of the requested order value. * @ops: method table + * @init_mode: initial reference count mode * @owner: an opaque pointer identifying the entity that manages this * instance. Used by various helpers to make sure that no * foreign ZONE_DEVICE memory is accessed. @@ -131,6 +132,10 @@ struct dev_pagemap { unsigned int flags; unsigned long vmemmap_shift; const struct dev_pagemap_ops *ops; + enum { + INIT_PAGEMAP_BUSY = 0, /* default / historical */ + INIT_PAGEMAP_IDLE, + } init_mode; void *owner; int nr_range; union { diff --git a/mm/memremap.c b/mm/memremap.c index 83c5e6fafd84..b6a7a95339b3 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -467,8 +467,10 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap_many); void free_zone_device_page(struct page *page) { - if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free)) - return; + struct dev_pagemap *pgmap = page->pgmap; + + /* wake filesystem 'break dax layouts' waiters */ + wake_up_var(page); mem_cgroup_uncharge(page_folio(page)); @@ -503,12 +505,15 @@ void free_zone_device_page(struct page *page) * to clear page->mapping. */ page->mapping = NULL; - page->pgmap->ops->page_free(page); + if (pgmap->ops && pgmap->ops->page_free) + pgmap->ops->page_free(page); /* - * Reset the page count to 1 to prepare for handing out the page again. 
+ * Reset the page count to the @init_mode value to prepare for + * handing out the page again. */ - set_page_count(page, 1); + if (pgmap->init_mode == INIT_PAGEMAP_BUSY) + set_page_count(page, 1); } #ifdef CONFIG_FS_DAX diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e5486d47406e..8ee52992055b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6719,6 +6719,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn, { __init_single_page(page, pfn, zone_idx, nid); + if (pgmap->init_mode == INIT_PAGEMAP_IDLE) + set_page_count(page, 0); /* * Mark page reserved as it will need to wait for onlining From patchwork Fri Sep 16 03:36:49 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12978069 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 54ABEC6FA91 for ; Fri, 16 Sep 2022 03:36:54 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D80348D0002; Thu, 15 Sep 2022 23:36:53 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D30338D0001; Thu, 15 Sep 2022 23:36:53 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BAA9A8D0002; Thu, 15 Sep 2022 23:36:53 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id AB2E18D0001 for ; Thu, 15 Sep 2022 23:36:53 -0400 (EDT) Received: from smtpin27.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 8644714055B for ; Fri, 16 Sep 2022 03:36:53 +0000 (UTC) X-FDA: 79916537106.27.2A43D79 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by imf10.hostedemail.com (Postfix) with ESMTP id DFC13C0090 for ; Fri, 16 Sep 2022 03:36:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1663299413; x=1694835413; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ZO7eiqr8wx9WyqWIgXnO5hpkhzDCAJO+DedF+qkd8PM=; b=hLGkEqv4Y+ZpcMsua4m6pK7kEx9tYAffe4UdP1hbZa22BizGko4sO4nv voi9uF4lGAIJ/xKLhikwy+42cLj6YSrFJOAZUS5nDGR56MAdrbKGE78Zz Ol8zz7kPbTT7MuFEmFbxV0aGKGVMaYi/Ax2eKpsQRqaGBkQimRjsFzxP1 L80vzfa5iObgXpWEnaoec69j50sJMa1U2ZYlftr9hVYPmWRuZFsy/HHgQ waW2uLUd1q/onJFFopJdb/6LKo42Nlqt/MmwLn1O/TqONZT1qZF6tpkEB gtgLn9NL2ijA3++GTzBwrEdW3IJZ5sJ8v/HuJIlmaHNLM5Ydb1FI7l7Vz w==; X-IronPort-AV: E=McAfee;i="6500,9779,10471"; a="300264141" X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="300264141" Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:51 -0700 X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="679809618" Received: from colinlix-mobl.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.209.29.52]) by fmsmga008-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:50 -0700 Subject: [PATCH v2 17/18] fsdax: Delete put_devmap_managed_page_refs() From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. 
Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:36:49 -0700 Message-ID: <166329940976.2786261.12969674850633492781.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299413; a=rsa-sha256; cv=none; b=XG6zJAcmahwM4+JXYjAAa8Qc8aJoPEv029waMS5Ds0kUbZXAUo+ChQwYkIrLHXwsVyirJ4 Nx4jaHoBasZn+gq4uBofJAxrk5j2z0O/AoZ+7Jz5Vi5RDbAbBHvSyBsuIqEIm1C0ZwZ0z+ 4jO/8FnqTwMDEuCqBveA6cBmAUocM/w= ARC-Authentication-Results: i=1; imf10.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=hLGkEqv4; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf10.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.65 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299413; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=1OGButxsqx6Fn8Yf/d7mjKtZkM7tCMb2db6dks0U76k=; b=bNW28FcaC+6Ti1dDQ3aJQ7UJAtT1yWxmoFjAPeWvDfJc6o4c3SLHwc4Vfat06UOO0OizUy iU3LJXQcPUarsgZWafb16uWfnnloiOIqcKjWYkb4qd9zwNlefOyQ1KlJrJTr6FsQyMAvrk b+AafTcVNPy6FfCktW9NLJXCGTq2jFA= X-Rspam-User: Authentication-Results: imf10.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=hLGkEqv4; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf10.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.65 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com X-Stat-Signature: qoo9xkk45u885o8gkhah9iqgjxbkpawi X-Rspamd-Queue-Id: DFC13C0090 X-Rspamd-Server: rspam09 X-HE-Tag: 1663299412-278183 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Now that fsdax DMA-idle detection no longer depends on catching transitions of page->_refcount to 1, remove put_devmap_managed_page_refs() and associated infrastructure. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- include/linux/mm.h | 30 ------------------------------ mm/gup.c | 6 ++---- mm/memremap.c | 18 ------------------ mm/swap.c | 2 -- 4 files changed, 2 insertions(+), 54 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 3bedc449c14d..182fe336a268 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1048,30 +1048,6 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); * back into memory. 
*/ -#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX) -DECLARE_STATIC_KEY_FALSE(devmap_managed_key); - -bool __put_devmap_managed_page_refs(struct page *page, int refs); -static inline bool put_devmap_managed_page_refs(struct page *page, int refs) -{ - if (!static_branch_unlikely(&devmap_managed_key)) - return false; - if (!is_zone_device_page(page)) - return false; - return __put_devmap_managed_page_refs(page, refs); -} -#else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */ -static inline bool put_devmap_managed_page_refs(struct page *page, int refs) -{ - return false; -} -#endif /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */ - -static inline bool put_devmap_managed_page(struct page *page) -{ - return put_devmap_managed_page_refs(page, 1); -} - /* 127: arbitrary random number, small enough to assemble well */ #define folio_ref_zero_or_close_to_overflow(folio) \ ((unsigned int) folio_ref_count(folio) + 127u <= 127u) @@ -1168,12 +1144,6 @@ static inline void put_page(struct page *page) { struct folio *folio = page_folio(page); - /* - * For some devmap managed pages we need to catch refcount transition - * from 2 to 1: - */ - if (put_devmap_managed_page(&folio->page)) - return; folio_put(folio); } diff --git a/mm/gup.c b/mm/gup.c index 732825157430..c6d060dee9e0 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -87,8 +87,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs) * belongs to this folio. */ if (unlikely(page_folio(page) != folio)) { - if (!put_devmap_managed_page_refs(&folio->page, refs)) - folio_put_refs(folio, refs); + folio_put_refs(folio, refs); goto retry; } @@ -177,8 +176,7 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags) refs *= GUP_PIN_COUNTING_BIAS; } - if (!put_devmap_managed_page_refs(&folio->page, refs)) - folio_put_refs(folio, refs); + folio_put_refs(folio, refs); } /** diff --git a/mm/memremap.c b/mm/memremap.c index b6a7a95339b3..0f4a2e20c159 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -515,21 +515,3 @@ void free_zone_device_page(struct page *page) if (pgmap->init_mode == INIT_PAGEMAP_BUSY) set_page_count(page, 1); } - -#ifdef CONFIG_FS_DAX -bool __put_devmap_managed_page_refs(struct page *page, int refs) -{ - if (page->pgmap->type != MEMORY_DEVICE_FS_DAX) - return false; - - /* - * fsdax page refcounts are 1-based, rather than 0-based: if - * refcount is 1, then the page is free and the refcount is - * stable because nobody holds a reference on the page. 
- */ - if (page_ref_sub_return(page, refs) == 1) - wake_up_var(page); - return true; -} -EXPORT_SYMBOL(__put_devmap_managed_page_refs); -#endif /* CONFIG_FS_DAX */ diff --git a/mm/swap.c b/mm/swap.c index 9cee7f6a3809..b346dd24cde8 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -960,8 +960,6 @@ void release_pages(struct page **pages, int nr) unlock_page_lruvec_irqrestore(lruvec, flags); lruvec = NULL; } - if (put_devmap_managed_page(&folio->page)) - continue; if (folio_put_testzero(folio)) free_zone_device_page(&folio->page); continue; From patchwork Fri Sep 16 03:36:56 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12978070 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 873E0C32771 for ; Fri, 16 Sep 2022 03:37:02 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 20E9B8000B; Thu, 15 Sep 2022 23:37:02 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 1BE8D8D0001; Thu, 15 Sep 2022 23:37:02 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 05FA68000B; Thu, 15 Sep 2022 23:37:02 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id E5CFA8D0001 for ; Thu, 15 Sep 2022 23:37:01 -0400 (EDT) Received: from smtpin24.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id C39C7120691 for ; Fri, 16 Sep 2022 03:37:01 +0000 (UTC) X-FDA: 79916537442.24.C97BAA2 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) by imf06.hostedemail.com (Postfix) with ESMTP id 252BD180090 for ; Fri, 16 Sep 2022 03:37:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1663299421; x=1694835421; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=shTt/uNd4pi9Ct2gNVWBBoQz77p4cEL4DPuvM6dkcww=; b=CCcGuL3pcdcXbn1hQGt80Zfx1hgY/j4mKGVTUDGPdL1eDW4XxjZJxrZg yrjNCX7Rn3zbG0ogfRqUZkM/7FJ+nuIvw4uK9hzGAXfPA765Q458Zir35 K2qR9lAY0oI5fsr1WzUvRLE1Uhoznldp/OfIxPiaCRFDy0XnDslidevlx 6EDoXH8YizTDEGoKxqLf+w+7hdTG69ZowIegc5NG5khZB0BjReypFC5lQ PpZz8ETlR2/a3k1LsogbaJVB4C+x8BjoQy5XbgN5l+Be0by1sFu4643vy 2hHZ2wx88zLLu6zqFN+ozAM/0TXD/vE4eQba8Kwd4zBQsUV985beJ+nJS g==; X-IronPort-AV: E=McAfee;i="6500,9779,10471"; a="385192699" X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="385192699" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:57 -0700 X-IronPort-AV: E=Sophos;i="5.93,319,1654585200"; d="scan'208";a="743194587" Received: from colinlix-mobl.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.209.29.52]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Sep 2022 20:36:56 -0700 Subject: [PATCH v2 18/18] mm/gup: Drop DAX pgmap accounting From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. 
Wong" , Christoph Hellwig , John Hubbard , Jason Gunthorpe , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Date: Thu, 15 Sep 2022 20:36:56 -0700 Message-ID: <166329941594.2786261.16402766003789003164.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> References: <166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663299421; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=7mqOea5zvMpPnI+VofYy5ofQAxddbeNHI22e0I3ucI0=; b=Z2Bg0IPrhvnjNIAB8vnM272MQBeIZoZSYvtIIrccfrEDwra1z6xGZ+R9uoXLssT4Ccp2xw J6AGTknHPN+eRSvvfxFSVuVRlK7+pO0RkwnZCS870WRTR8UIxH6dVT4VGM10Xz+VDdpHot NHWf150kKvvQhTQuiMVjI7lcAA/9agM= ARC-Authentication-Results: i=1; imf06.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=CCcGuL3p; spf=pass (imf06.hostedemail.com: domain of dan.j.williams@intel.com designates 192.55.52.43 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663299421; a=rsa-sha256; cv=none; b=yYPvzb+3xqfYjzGmI6whTdr+3gSq78ezfL65axEeDuk9GObu9wt1aW3O/t9+6hnOH4H33K aAedlV2XFW8eU+NquWvIMr8OzKRPkVwFvyX7Vafa9oLgfMcOsZ0uBjw+Wt77T+hyMgDR3x fbMwFdfCWQ98GzH+zcVe74pg0gHfUKE= X-Stat-Signature: o6webyp3onz6fs8r1xkr1nte49mws4dr X-Rspamd-Queue-Id: 252BD180090 X-Rspam-User: Authentication-Results: imf06.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=CCcGuL3p; spf=pass (imf06.hostedemail.com: domain of dan.j.williams@intel.com designates 192.55.52.43 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com X-Rspamd-Server: rspam06 X-HE-Tag: 1663299420-134502 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Now that pgmap accounting is handled at map time, it can be dropped from gup time. A hurdle still remains that filesystem-DAX huge pages are not compound pages which still requires infrastructure like __gup_device_huge_p{m,u}d() to stick around. Additionally, ZONE_DEVICE pages with this change are still not suitable to be returned from vm_normal_page(), so this cleanup is limited to deleting pgmap reference manipulation. This is an incremental step on the path to removing pte_devmap() altogether. Note that follow_pmd_devmap() can be deleted entirely since a few additions of pmd_devmap() allows the transparent huge page path to be reused. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. 
Wong" Cc: Christoph Hellwig Cc: John Hubbard Reported-by: Jason Gunthorpe Signed-off-by: Dan Williams --- include/linux/huge_mm.h | 12 +------ mm/gup.c | 83 +++++++++++------------------------------------ mm/huge_memory.c | 54 +------------------------------ 3 files changed, 22 insertions(+), 127 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index de73f5a16252..b8ed373c6090 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -263,10 +263,8 @@ static inline bool folio_test_pmd_mappable(struct folio *folio) return folio_order(folio) >= HPAGE_PMD_ORDER; } -struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, int flags, struct dev_pagemap **pgmap); struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, - pud_t *pud, int flags, struct dev_pagemap **pgmap); + pud_t *pud, int flags); vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf); @@ -418,14 +416,8 @@ static inline void mm_put_huge_zero_page(struct mm_struct *mm) return; } -static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma, - unsigned long addr, pmd_t *pmd, int flags, struct dev_pagemap **pgmap) -{ - return NULL; -} - static inline struct page *follow_devmap_pud(struct vm_area_struct *vma, - unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap) + unsigned long addr, pud_t *pud, int flags) { return NULL; } diff --git a/mm/gup.c b/mm/gup.c index c6d060dee9e0..8e6dd4308e19 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -25,7 +25,6 @@ #include "internal.h" struct follow_page_context { - struct dev_pagemap *pgmap; unsigned int page_mask; }; @@ -487,8 +486,7 @@ static inline bool can_follow_write_pte(pte_t pte, unsigned int flags) } static struct page *follow_page_pte(struct vm_area_struct *vma, - unsigned long address, pmd_t *pmd, unsigned int flags, - struct dev_pagemap **pgmap) + unsigned long address, pmd_t *pmd, unsigned int flags) { struct mm_struct *mm = vma->vm_mm; struct page *page; @@ -532,17 +530,13 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, } page = vm_normal_page(vma, address, pte); - if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) { + if (!page && pte_devmap(pte)) { /* - * Only return device mapping pages in the FOLL_GET or FOLL_PIN - * case since they are only valid while holding the pgmap - * reference. 
+ * ZONE_DEVICE pages are not yet treated as vm_normal_page() + * instances, with respect to mapcount and compound-page + * metadata */ - *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap); - if (*pgmap) - page = pte_page(pte); - else - goto no_page; + page = pte_page(pte); } else if (unlikely(!page)) { if (flags & FOLL_DUMP) { /* Avoid special (like zero) pages in core dumps */ @@ -660,15 +654,8 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, return no_page_table(vma, flags); goto retry; } - if (pmd_devmap(pmdval)) { - ptl = pmd_lock(mm, pmd); - page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap); - spin_unlock(ptl); - if (page) - return page; - } - if (likely(!pmd_trans_huge(pmdval))) - return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); + if (likely(!(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))) + return follow_page_pte(vma, address, pmd, flags); if ((flags & FOLL_NUMA) && pmd_protnone(pmdval)) return no_page_table(vma, flags); @@ -686,9 +673,9 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, pmd_migration_entry_wait(mm, pmd); goto retry_locked; } - if (unlikely(!pmd_trans_huge(*pmd))) { + if (unlikely(!(pmd_trans_huge(*pmd) || pmd_devmap(pmdval)))) { spin_unlock(ptl); - return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); + return follow_page_pte(vma, address, pmd, flags); } if (flags & FOLL_SPLIT_PMD) { int ret; @@ -706,7 +693,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, } return ret ? ERR_PTR(ret) : - follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); + follow_page_pte(vma, address, pmd, flags); } page = follow_trans_huge_pmd(vma, address, pmd, flags); spin_unlock(ptl); @@ -743,7 +730,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma, } if (pud_devmap(*pud)) { ptl = pud_lock(mm, pud); - page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap); + page = follow_devmap_pud(vma, address, pud, flags); spin_unlock(ptl); if (page) return page; @@ -790,9 +777,6 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma, * * @flags can have FOLL_ flags set, defined in * - * When getting pages from ZONE_DEVICE memory, the @ctx->pgmap caches - * the device's dev_pagemap metadata to avoid repeating expensive lookups. - * * When getting an anonymous page and the caller has to trigger unsharing * of a shared anonymous page first, -EMLINK is returned. The caller should * trigger a fault with FAULT_FLAG_UNSHARE set. Note that unsharing is only @@ -847,7 +831,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma, struct page *follow_page(struct vm_area_struct *vma, unsigned long address, unsigned int foll_flags) { - struct follow_page_context ctx = { NULL }; + struct follow_page_context ctx = { 0 }; struct page *page; if (vma_is_secretmem(vma)) @@ -857,8 +841,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, return NULL; page = follow_page_mask(vma, address, foll_flags, &ctx); - if (ctx.pgmap) - put_dev_pagemap(ctx.pgmap); return page; } @@ -1118,7 +1100,7 @@ static long __get_user_pages(struct mm_struct *mm, { long ret = 0, i = 0; struct vm_area_struct *vma = NULL; - struct follow_page_context ctx = { NULL }; + struct follow_page_context ctx = { 0 }; if (!nr_pages) return 0; @@ -1241,8 +1223,6 @@ static long __get_user_pages(struct mm_struct *mm, nr_pages -= page_increm; } while (nr_pages); out: - if (ctx.pgmap) - put_dev_pagemap(ctx.pgmap); return i ? 
i : ret; } @@ -2322,9 +2302,8 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { - struct dev_pagemap *pgmap = NULL; - int nr_start = *nr, ret = 0; pte_t *ptep, *ptem; + int ret = 0; ptem = ptep = pte_offset_map(&pmd, addr); do { @@ -2345,12 +2324,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, if (pte_devmap(pte)) { if (unlikely(flags & FOLL_LONGTERM)) goto pte_unmap; - - pgmap = get_dev_pagemap(pte_pfn(pte), pgmap); - if (unlikely(!pgmap)) { - undo_dev_pagemap(nr, nr_start, flags, pages); - goto pte_unmap; - } } else if (pte_special(pte)) goto pte_unmap; @@ -2397,8 +2370,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, ret = 1; pte_unmap: - if (pgmap) - put_dev_pagemap(pgmap); pte_unmap(ptem); return ret; } @@ -2425,28 +2396,17 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { - int nr_start = *nr; - struct dev_pagemap *pgmap = NULL; - do { struct page *page = pfn_to_page(pfn); - pgmap = get_dev_pagemap(pfn, pgmap); - if (unlikely(!pgmap)) { - undo_dev_pagemap(nr, nr_start, flags, pages); - break; - } SetPageReferenced(page); pages[*nr] = page; - if (unlikely(!try_grab_page(page, flags))) { - undo_dev_pagemap(nr, nr_start, flags, pages); + if (unlikely(!try_grab_page(page, flags))) break; - } (*nr)++; pfn++; } while (addr += PAGE_SIZE, addr != end); - put_dev_pagemap(pgmap); return addr == end; } @@ -2455,16 +2415,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, struct page **pages, int *nr) { unsigned long fault_pfn; - int nr_start = *nr; fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr)) return 0; - if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { - undo_dev_pagemap(nr, nr_start, flags, pages); + if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) return 0; - } + return 1; } @@ -2473,16 +2431,13 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, struct page **pages, int *nr) { unsigned long fault_pfn; - int nr_start = *nr; fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr)) return 0; - if (unlikely(pud_val(orig) != pud_val(*pudp))) { - undo_dev_pagemap(nr, nr_start, flags, pages); + if (unlikely(pud_val(orig) != pud_val(*pudp))) return 0; - } return 1; } #else diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8a7c1b344abe..ef68296f2158 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1031,55 +1031,6 @@ static void touch_pmd(struct vm_area_struct *vma, unsigned long addr, update_mmu_cache_pmd(vma, addr, pmd); } -struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, int flags, struct dev_pagemap **pgmap) -{ - unsigned long pfn = pmd_pfn(*pmd); - struct mm_struct *mm = vma->vm_mm; - struct page *page; - - assert_spin_locked(pmd_lockptr(mm, pmd)); - - /* - * When we COW a devmap PMD entry, we split it into PTEs, so we should - * not be in this function with `flags & FOLL_COW` set. - */ - WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set"); - - /* FOLL_GET and FOLL_PIN are mutually exclusive. 
*/ - if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) == - (FOLL_PIN | FOLL_GET))) - return NULL; - - if (flags & FOLL_WRITE && !pmd_write(*pmd)) - return NULL; - - if (pmd_present(*pmd) && pmd_devmap(*pmd)) - /* pass */; - else - return NULL; - - if (flags & FOLL_TOUCH) - touch_pmd(vma, addr, pmd, flags & FOLL_WRITE); - - /* - * device mapped pages can only be returned if the - * caller will manage the page reference count. - */ - if (!(flags & (FOLL_GET | FOLL_PIN))) - return ERR_PTR(-EEXIST); - - pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT; - *pgmap = get_dev_pagemap(pfn, *pgmap); - if (!*pgmap) - return ERR_PTR(-EFAULT); - page = pfn_to_page(pfn); - if (!try_grab_page(page, flags)) - page = ERR_PTR(-ENOMEM); - - return page; -} - int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) @@ -1196,7 +1147,7 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr, } struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, - pud_t *pud, int flags, struct dev_pagemap **pgmap) + pud_t *pud, int flags) { unsigned long pfn = pud_pfn(*pud); struct mm_struct *mm = vma->vm_mm; @@ -1230,9 +1181,6 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, return ERR_PTR(-EEXIST); pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT; - *pgmap = get_dev_pagemap(pfn, *pgmap); - if (!*pgmap) - return ERR_PTR(-EFAULT); page = pfn_to_page(pfn); if (!try_grab_page(page, flags)) page = ERR_PTR(-ENOMEM);