From patchwork Thu Mar 18 04:08:08 2021
Subject: [PATCH 1/3] mm/memory-failure: Prepare for mass memory_failure()
From: Dan Williams
To: linux-mm@kvack.org, linux-nvdimm@lists.01.org
Cc: Naoya Horiguchi, Andrew Morton, vishal.l.verma@intel.com, david@fromorbit.com, hch@lst.de, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org
Date: Wed, 17 Mar 2021 21:08:08 -0700
Message-ID: <161604048859.1463742.10087657197118774859.stgit@dwillia2-desk3.amr.corp.intel.com>
In-Reply-To: <161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com>
References: <161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com>
Currently memory_failure() assumes an infrequent report on a handful of
pages. A new use case, surprise removal of a persistent memory device,
needs to trigger memory_failure() on a large range. Rate-limit
memory_failure() error logging, and allow the
memory_failure_dev_pagemap() helper to be called directly.

Cc: Naoya Horiguchi
Cc: Andrew Morton
Signed-off-by: Dan Williams
---
 mm/memory-failure.c |   25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..43ba4307c526 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -395,8 +395,9 @@ static void kill_procs(struct list_head *to_kill, int forcekill, bool fail,
 			 * signal and then access the memory. Just kill it.
 			 */
 			if (fail || tk->addr == -EFAULT) {
-				pr_err("Memory failure: %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
-				       pfn, tk->tsk->comm, tk->tsk->pid);
+				pr_err_ratelimited(
+					"Memory failure: %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
 				do_send_sig_info(SIGKILL, SEND_SIG_PRIV,
 						 tk->tsk, PIDTYPE_PID);
 			}
@@ -408,8 +409,9 @@ static void kill_procs(struct list_head *to_kill, int forcekill, bool fail,
 			 * process anyways.
 			 */
 			else if (kill_proc(tk, pfn, flags) < 0)
-				pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n",
-				       pfn, tk->tsk->comm, tk->tsk->pid);
+				pr_err_ratelimited(
+					"Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
 		}
 		put_task_struct(tk->tsk);
 		kfree(tk);
@@ -919,8 +921,8 @@ static void action_result(unsigned long pfn, enum mf_action_page_type type,
 {
 	trace_memory_failure_event(pfn, type, result);

-	pr_err("Memory failure: %#lx: recovery action for %s: %s\n",
-	       pfn, action_page_types[type], action_name[result]);
+	pr_err_ratelimited("Memory failure: %#lx: recovery action for %s: %s\n",
+			   pfn, action_page_types[type], action_name[result]);
 }

 static int page_action(struct page_state *ps, struct page *p,
@@ -1375,8 +1377,6 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 unlock:
 	dax_unlock_page(page, cookie);
 out:
-	/* drop pgmap ref acquired in caller */
-	put_dev_pagemap(pgmap);
 	action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
 	return rc;
 }
@@ -1415,9 +1415,12 @@ int memory_failure(unsigned long pfn, int flags)
 	if (!p) {
 		if (pfn_valid(pfn)) {
 			pgmap = get_dev_pagemap(pfn, NULL);
-			if (pgmap)
-				return memory_failure_dev_pagemap(pfn, flags,
-								  pgmap);
+			if (pgmap) {
+				res = memory_failure_dev_pagemap(pfn, flags,
+								 pgmap);
+				put_dev_pagemap(pgmap);
+				return res;
+			}
 		}
 		pr_err("Memory failure: %#lx: memory outside kernel control\n",
 		       pfn);

From patchwork Thu Mar 18 04:08:23 2021
Subject: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()
From: Dan Williams
To: linux-mm@kvack.org, linux-nvdimm@lists.01.org
Cc: Jason Gunthorpe, Dave Chinner, Christoph Hellwig, Shiyang Ruan, Vishal Verma, Dave Jiang, Ira Weiny, Matthew Wilcox, Jan Kara, Andrew Morton, Naoya Horiguchi, "Darrick J. Wong", linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org
Date: Wed, 17 Mar 2021 21:08:23 -0700
Message-ID: <161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com>
In-Reply-To: <161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com>
References: <161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com>

Jason wondered why the get_user_pages_fast() path takes references on a
@pgmap object. The rationale was to protect against accessing a 'struct
page' that might be in the process of being removed by the driver, but
he rightly points out that this should be solved the same way all
gup-fast synchronization is solved: invalidate the mapping and let the
gup slow path do @pgmap synchronization [1].

To achieve that, new user mappings need to stop being created, and all
existing user mappings need to be invalidated. For device-dax this is
already the case, as kill_dax() prevents future faults from installing
a pte, and the single device-dax inode address_space can be trivially
unmapped.

The situation is different for filesystem-dax, where device pages could
be mapped by any number of inode address_space instances. An initial
thought was to treat the device removal event like a
drop_pagecache_sb() event that walks superblocks and unmaps all inodes.
However, Dave points out that it is not just the filesystem
user-mappings that need to react to global DAX page-unmap events; it is
also filesystem metadata (proposed DAX metadata access) and other
drivers (upstream DM-writecache) that need to react to this event [2].

The only kernel facility that is meant to globally broadcast the loss
of a page (via corruption or surprise remove) is memory_failure(). The
downside of memory_failure() is that it is a pfn-at-a-time interface.
However, the events that would trigger the need to call
memory_failure() over a full PMEM device should be rare.

Remove should always be coordinated by the administrator with the
filesystem. If someone force-removes a device from underneath a mounted
filesystem, the driver assumes they have a good reason, or they
otherwise get to keep the pieces. Since ->remove() callbacks cannot
fail, the only option is to trigger the mass memory_failure().

The mechanism to determine whether memory_failure() triggers at
pmem->remove() time is whether the associated dax_device has an
elevated reference at @pgmap ->kill() time.

With this in place the get_user_pages_fast() path can drop its
half-measure synchronization with an @pgmap reference.

Link: http://lore.kernel.org/r/20210224010017.GQ2643399@ziepe.ca [1]
Link: http://lore.kernel.org/r/20210302075736.GJ4662@dread.disaster.area [2]
Reported-by: Jason Gunthorpe
Cc: Dave Chinner
Cc: Christoph Hellwig
Cc: Shiyang Ruan
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Ira Weiny
Cc: Matthew Wilcox
Cc: Jan Kara
Cc: Andrew Morton
Cc: Naoya Horiguchi
Cc: "Darrick J. Wong"
Signed-off-by: Dan Williams
---
 drivers/dax/super.c      |   15 +++++++++++++++
 drivers/nvdimm/pmem.c    |   10 +++++++++-
 drivers/nvdimm/pmem.h    |    1 +
 include/linux/dax.h      |    5 +++++
 include/linux/memremap.h |    5 +++++
 include/linux/mm.h       |    3 +++
 mm/memory-failure.c      |   11 +++++++++--
 mm/memremap.c            |   11 +++++++++++
 8 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 5fa6ae9dbc8b..5ebcedf4a68c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -624,6 +624,21 @@ void put_dax(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(put_dax);

+bool dax_is_idle(struct dax_device *dax_dev)
+{
+	struct inode *inode;
+
+	if (!dax_dev)
+		return true;
+
+	WARN_ONCE(test_bit(DAXDEV_ALIVE, &dax_dev->flags),
+		  "dax idle check on live device.\n");
+
+	inode = &dax_dev->inode;
+	return atomic_read(&inode->i_count) < 2;
+}
+EXPORT_SYMBOL_GPL(dax_is_idle);
+
 /**
  * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
  * @host: alternate name for the device registered by a dax driver
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b8a85bfb2e95..e8822c9262ee 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -348,15 +348,21 @@ static void pmem_pagemap_kill(struct dev_pagemap *pgmap)
 {
 	struct request_queue *q =
 		container_of(pgmap->ref, struct request_queue, q_usage_counter);
+	struct pmem_device *pmem = q->queuedata;

 	blk_freeze_queue_start(q);
+	kill_dax(pmem->dax_dev);
+	if (!dax_is_idle(pmem->dax_dev)) {
+		dev_warn(pmem->dev,
+			 "DAX active at remove, trigger mass memory failure\n");
+		dev_pagemap_failure(pgmap);
+	}
 }

 static void pmem_release_disk(void *__pmem)
 {
 	struct pmem_device *pmem = __pmem;

-	kill_dax(pmem->dax_dev);
 	put_dax(pmem->dax_dev);
 	del_gendisk(pmem->disk);
 	put_disk(pmem->disk);
@@ -406,6 +412,7 @@ static int pmem_attach_disk(struct device *dev,
 	devm_namespace_disable(dev, ndns);

 	dev_set_drvdata(dev, pmem);
+	pmem->dev = dev;
 	pmem->phys_addr = res->start;
 	pmem->size = resource_size(res);
 	fua = nvdimm_has_flush(nd_region);
@@ -467,6 +474,7 @@ static int pmem_attach_disk(struct device *dev,
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
 	if (pmem->pfn_flags & PFN_MAP)
 		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
+	q->queuedata = pmem;

 	disk = alloc_disk_node(0, nid);
 	if (!disk)
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 59cfe13ea8a8..1222088a569a 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -23,6 +23,7 @@ struct pmem_device {
 	struct badblocks bb;
 	struct dax_device *dax_dev;
 	struct gendisk *disk;
+	struct device *dev;
 	struct dev_pagemap pgmap;
 };

diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..015f1d9a8232 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -46,6 +46,7 @@ struct dax_device *alloc_dax(void *private, const char *host,
 		const struct dax_operations *ops, unsigned long flags);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
+bool dax_is_idle(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
 bool dax_write_cache_enabled(struct dax_device *dax_dev);
 bool __dax_synchronous(struct dax_device *dax_dev);
@@ -92,6 +93,10 @@ static inline void put_dax(struct dax_device *dax_dev)
 static inline void kill_dax(struct dax_device *dax_dev)
 {
 }
+static inline bool dax_is_idle(struct dax_device *dax_dev)
+{
+	return true;
+}
 static inline void dax_write_cache(struct dax_device *dax_dev, bool wc)
 {
 }
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f5b464daeeca..d52cdc6c5313 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -137,6 +137,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 		struct dev_pagemap *pgmap);
+void dev_pagemap_failure(struct dev_pagemap *pgmap);
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);

 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
@@ -160,6 +161,10 @@ static inline void devm_memunmap_pages(struct device *dev,
 {
 }

+static inline void dev_pagemap_failure(struct dev_pagemap *pgmap)
+{
+}
+
 static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 		struct dev_pagemap *pgmap)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..95f79f457bab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3002,8 +3002,11 @@ enum mf_flags {
 	MF_ACTION_REQUIRED = 1 << 1,
 	MF_MUST_KILL = 1 << 2,
 	MF_SOFT_OFFLINE = 1 << 3,
+	MF_MEM_REMOVE = 1 << 4,
 };
 extern int memory_failure(unsigned long pfn, int flags);
+extern int memory_failure_dev_pagemap(unsigned long pfn, int flags,
+		struct dev_pagemap *pgmap);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 43ba4307c526..8f557beb19ee 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1296,8 +1296,8 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	return res;
 }

-static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
-		struct dev_pagemap *pgmap)
+int memory_failure_dev_pagemap(unsigned long pfn, int flags,
+		struct dev_pagemap *pgmap)
 {
 	struct page *page = pfn_to_page(pfn);
 	const bool unmap_success = true;
@@ -1377,6 +1377,13 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 unlock:
 	dax_unlock_page(page, cookie);
 out:
+	/*
+	 * In the removal case, given unmap is always successful, and
+	 * the driver is responsible for the direct map, the recovery
+	 * is always successful.
+	 */
+	if (flags & MF_MEM_REMOVE)
+		rc = 0;
 	action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
 	return rc;
 }
diff --git a/mm/memremap.c b/mm/memremap.c
index 7aa7d6e80ee5..f34da1e14b52 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -165,6 +165,17 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
 	pgmap_array_delete(range);
 }

+void dev_pagemap_failure(struct dev_pagemap *pgmap)
+{
+	unsigned long pfn;
+	int i;
+
+	for (i = 0; i < pgmap->nr_range; i++)
+		for_each_device_pfn(pfn, pgmap, i)
+			memory_failure_dev_pagemap(pfn, MF_MEM_REMOVE, pgmap);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_failure);
+
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
 	unsigned long pfn;

From patchwork Thu Mar 18 04:08:28 2021
Subject: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path
From: Dan Williams
To: linux-mm@kvack.org, linux-nvdimm@lists.01.org
Cc: Jason Gunthorpe, Christoph Hellwig, Shiyang Ruan, Vishal Verma, Dave Jiang, Ira Weiny, Matthew Wilcox, Jan Kara, Andrew Morton, david@fromorbit.com, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org
Date: Wed, 17 Mar 2021 21:08:28 -0700
Message-ID: <161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com>
In-Reply-To: <161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com>
References: <161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com>

Now that device-dax and filesystem-dax are guaranteed to unmap all user
mappings of devmap / DAX pages before tearing down the 'struct page'
array, get_user_pages_fast() can rely on its traditional
synchronization method "validate_pte(); get_page(); revalidate_pte()"
to catch races with device shutdown. Specifically, the unmap guarantee
ensures that gup-fast either succeeds in taking a page reference
(lock-less), or it detects a need to fall back to the slow path, where
the device presence can be revalidated with locks held.
Reported-by: Jason Gunthorpe
Cc: Christoph Hellwig
Cc: Shiyang Ruan
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Ira Weiny
Cc: Matthew Wilcox
Cc: Jan Kara
Cc: Andrew Morton
Signed-off-by: Dan Williams
Reviewed-by: Jason Gunthorpe
Signed-off-by: Joao Martins
---
 mm/gup.c |   38 ++++++++++++++++----------------------
 1 file changed, 16 insertions(+), 22 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..dfeb47e4e8d4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1996,9 +1996,8 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
-	struct dev_pagemap *pgmap = NULL;
-	int nr_start = *nr, ret = 0;
 	pte_t *ptep, *ptem;
+	int ret = 0;

 	ptem = ptep = pte_offset_map(&pmd, addr);
 	do {
@@ -2015,16 +2014,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		if (!pte_access_permitted(pte, flags & FOLL_WRITE))
 			goto pte_unmap;

-		if (pte_devmap(pte)) {
-			if (unlikely(flags & FOLL_LONGTERM))
-				goto pte_unmap;
+		if (pte_devmap(pte) && (flags & FOLL_LONGTERM))
+			goto pte_unmap;

-			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
-			if (unlikely(!pgmap)) {
-				undo_dev_pagemap(nr, nr_start, flags, pages);
-				goto pte_unmap;
-			}
-		} else if (pte_special(pte))
+		if (pte_special(pte))
 			goto pte_unmap;

 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
@@ -2063,8 +2056,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 	ret = 1;

 pte_unmap:
-	if (pgmap)
-		put_dev_pagemap(pgmap);
 	pte_unmap(ptem);
 	return ret;
 }
@@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */

 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			     unsigned long end, unsigned int flags,
 			     struct page **pages, int *nr)
 {
 	int nr_start = *nr;
-	struct dev_pagemap *pgmap = NULL;

 	do {
-		struct page *page = pfn_to_page(pfn);
+		struct page *page;
+
+		/*
+		 * Typically pfn_to_page() on a devmap pfn is not safe
+		 * without holding a live reference on the hosting
+		 * pgmap. In the gup-fast path it is safe because any
+		 * races will be resolved by either gup-fast taking a
+		 * reference or the shutdown path unmapping the pte to
+		 * trigger gup-fast to fall back to the slow path.
+		 */
+		page = pfn_to_page(pfn);

-		pgmap = get_dev_pagemap(pfn, pgmap);
-		if (unlikely(!pgmap)) {
-			undo_dev_pagemap(nr, nr_start, flags, pages);
-			return 0;
-		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		if (unlikely(!try_grab_page(page, flags))) {
@@ -2112,8 +2108,6 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 		pfn++;
 	} while (addr += PAGE_SIZE, addr != end);

-	if (pgmap)
-		put_dev_pagemap(pgmap);
 	return 1;
 }