From patchwork Wed Jan 16 18:19:01 2019
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 10766671
Subject: [PATCH 1/4] mm/resource: return real error codes from walk failures
To: dave@sr71.net
From: Dave Hansen
Date: Wed, 16 Jan 2019 10:19:01 -0800
References: <20190116181859.D1504459@viggo.jf.intel.com>
In-Reply-To: <20190116181859.D1504459@viggo.jf.intel.com>
Message-Id: <20190116181901.CAF85066@viggo.jf.intel.com>
Cc: thomas.lendacky@amd.com, mhocko@suse.com, linux-nvdimm@lists.01.org,
 tiwai@suse.de, Dave Hansen, ying.huang@intel.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, bp@suse.de,
 baiyaowei@cmss.chinamobile.com, zwisler@kernel.org, bhelgaas@google.com,
 fengguang.wu@intel.com, akpm@linux-foundation.org

From: Dave Hansen

walk_system_ram_range() can return an error code either because *it*
failed, or because the 'func' that it calls returned an error.  The
memory hotplug code does the following:

	ret = walk_system_ram_range(..., func);
	if (ret)
		return ret;

and 'ret' makes it out to userspace, eventually.  The problem is,
walk_system_ram_range() failures that result from *it* failing (as
opposed to 'func') return -1.  That leads to a very odd -EPERM (-1)
return code out to userspace.

Make walk_system_ram_range() return -EINVAL for internal failures to
keep userspace less confused.

This return code is compatible with all the callers that I audited.

Cc: Dan Williams
Cc: Dave Jiang
Cc: Ross Zwisler
Cc: Vishal Verma
Cc: Tom Lendacky
Cc: Andrew Morton
Cc: Michal Hocko
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying
Cc: Fengguang Wu
Signed-off-by: Dave Hansen
---

 b/kernel/resource.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff -puN kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 kernel/resource.c
--- a/kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1	2018-12-20 11:48:41.810771934 -0800
+++ b/kernel/resource.c	2018-12-20 11:48:41.814771934 -0800
@@ -375,7 +375,7 @@ static int __walk_iomem_res_desc(resourc
 				 int (*func)(struct resource *, void *))
 {
 	struct resource res;
-	int ret = -1;
+	int ret = -EINVAL;
 
 	while (start < end &&
 	       !find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) {
@@ -453,7 +453,7 @@ int walk_system_ram_range(unsigned long
 	unsigned long flags;
 	struct resource res;
 	unsigned long pfn, end_pfn;
-	int ret = -1;
+	int ret = -EINVAL;
 
 	start = (u64) start_pfn << PAGE_SHIFT;
 	end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;

From patchwork Wed Jan 16 18:19:02 2019
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 10766675
Subject: [PATCH 2/4] mm/memory-hotplug: allow memory resources to be children
To: dave@sr71.net
From: Dave Hansen
Date: Wed, 16 Jan 2019 10:19:02 -0800
References: <20190116181859.D1504459@viggo.jf.intel.com>
In-Reply-To: <20190116181859.D1504459@viggo.jf.intel.com>
Message-Id: <20190116181902.670EEBC3@viggo.jf.intel.com>
Cc: thomas.lendacky@amd.com, mhocko@suse.com, linux-nvdimm@lists.01.org,
 tiwai@suse.de, Dave Hansen, ying.huang@intel.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, bp@suse.de,
 baiyaowei@cmss.chinamobile.com, zwisler@kernel.org, bhelgaas@google.com,
 fengguang.wu@intel.com, akpm@linux-foundation.org

From: Dave Hansen

The kernel/resource.c code is used to manage the physical address
space.  We can view the current resource configuration in
/proc/iomem.  An example of this is at the bottom of this
description.

The nvdimm subsystem "owns" the physical address resources which
map to persistent memory and has resources inserted for them as
"Persistent Memory".  We want to use this persistent memory, but
as volatile memory, just like RAM.  The best way to do this is
to leave the existing resource in place, but add a "System RAM"
resource underneath it.  This clearly communicates the ownership
relationship of this memory.

The request_resource_conflict() API only deals with the
top-level resources.  Replace it with __request_region() which
will search for !IORESOURCE_BUSY areas lower in the resource
tree than the top level.

We also rework the old error message a bit since we do not get
the conflicting entry back: only an indication that we *had* a
conflict.

We *could* also simply truncate the existing top-level
"Persistent Memory" resource and take over the released address
space.  But, this means that if we ever decide to hot-unplug the
"RAM" and give it back, we need to recreate the original setup,
which may mean going back to the BIOS tables.

This should have no real effect on the existing collision
detection because the areas that truly conflict should be marked
IORESOURCE_BUSY.

00000000-00000fff : Reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c97ff : Video ROM
000c9800-000ca5ff : Adapter ROM
000f0000-000fffff : Reserved
  000f0000-000fffff : System ROM
00100000-9fffffff : System RAM
  01000000-01e071d0 : Kernel code
  01e071d1-027dfdff : Kernel data
  02dc6000-0305dfff : Kernel bss
a0000000-afffffff : Persistent Memory (legacy)
  a0000000-a7ffffff : System RAM
b0000000-bffdffff : System RAM
bffe0000-bfffffff : Reserved
c0000000-febfffff : PCI Bus 0000:00

Cc: Dan Williams
Cc: Dave Jiang
Cc: Ross Zwisler
Cc: Vishal Verma
Cc: Tom Lendacky
Cc: Andrew Morton
Cc: Michal Hocko
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying
Cc: Fengguang Wu
Signed-off-by: Dave Hansen
---

 b/mm/memory_hotplug.c |   31 ++++++++++++++-----------------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff -puN mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child	2018-12-20 11:48:42.317771933 -0800
+++ b/mm/memory_hotplug.c	2018-12-20 11:48:42.322771933 -0800
@@ -98,24 +98,21 @@ void mem_hotplug_done(void)
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
-	struct resource *res, *conflict;
-	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
-	if (!res)
-		return ERR_PTR(-ENOMEM);
+	struct resource *res;
+	unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+	char *resource_name = "System RAM";
 
-	res->name = "System RAM";
-	res->start = start;
-	res->end = start + size - 1;
-	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-	conflict = request_resource_conflict(&iomem_resource, res);
-	if (conflict) {
-		if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
-			pr_debug("Device unaddressable memory block "
-				 "memory hotplug at %#010llx !\n",
-				 (unsigned long long)start);
-		}
-		pr_debug("System RAM resource %pR cannot be added\n", res);
-		kfree(res);
+	/*
+	 * Request ownership of the new memory range.  This might be
+	 * a child of an existing resource that was present but
+	 * not marked as busy.
+	 */
+	res = __request_region(&iomem_resource, start, size,
+			       resource_name, flags);
+
+	if (!res) {
+		pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
+				start, start + size);
 		return ERR_PTR(-EEXIST);
 	}
 	return res;

From patchwork Wed Jan 16 18:19:04 2019
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 10766677
Subject: [PATCH 3/4] dax/kmem: let walk_system_ram_range() search child resources
To: dave@sr71.net
From: Dave Hansen
Date: Wed, 16 Jan 2019 10:19:04 -0800
References: <20190116181859.D1504459@viggo.jf.intel.com>
In-Reply-To: <20190116181859.D1504459@viggo.jf.intel.com>
Message-Id: <20190116181904.D24AF5FE@viggo.jf.intel.com>
Cc: thomas.lendacky@amd.com, mhocko@suse.com, linux-nvdimm@lists.01.org,
 tiwai@suse.de, Dave Hansen, ying.huang@intel.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, bp@suse.de,
 baiyaowei@cmss.chinamobile.com, zwisler@kernel.org, bhelgaas@google.com,
 fengguang.wu@intel.com, akpm@linux-foundation.org

From: Dave Hansen

In the process of onlining memory, we use walk_system_ram_range()
to find the actual RAM areas inside of the area being onlined.
However, it currently only finds memory resources which are
"top-level" iomem_resources.  Children are not currently
searched which causes it to skip System RAM in areas like this
(in the format of /proc/iomem):

a0000000-bfffffff : Persistent Memory (legacy)
  a0000000-afffffff : System RAM

Changing the true->false here allows children to be searched
as well.  We need this because we add a new "System RAM"
resource underneath the "persistent memory" resource when
we use persistent memory in a volatile mode.

Cc: Dan Williams
Cc: Dave Jiang
Cc: Ross Zwisler
Cc: Vishal Verma
Cc: Tom Lendacky
Cc: Andrew Morton
Cc: Michal Hocko
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying
Cc: Fengguang Wu
Signed-off-by: Dave Hansen
---

 b/kernel/resource.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff -puN kernel/resource.c~mm-walk_system_ram_range-search-child-resources kernel/resource.c
--- a/kernel/resource.c~mm-walk_system_ram_range-search-child-resources	2018-12-20 11:48:42.824771932 -0800
+++ b/kernel/resource.c	2018-12-20 11:48:42.827771932 -0800
@@ -445,6 +445,9 @@ int walk_mem_res(u64 start, u64 end, voi
  * This function calls the @func callback against all memory ranges of type
  * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
  * It is to be used only for System RAM.
+ *
+ * This will find System RAM ranges that are children of top-level resources
+ * in addition to top-level System RAM resources.
  */
 int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
 			  void *arg, int (*func)(unsigned long, unsigned long, void *))
@@ -460,7 +463,7 @@ int walk_system_ram_range(unsigned long
 	flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
 	while (start < end &&
 	       !find_next_iomem_res(start, end, flags, IORES_DESC_NONE,
-				    true, &res)) {
+				    false, &res)) {
 		pfn = (res.start + PAGE_SIZE - 1) >> PAGE_SHIFT;
 		end_pfn = (res.end + 1) >> PAGE_SHIFT;
 		if (end_pfn > pfn)

From patchwork Wed Jan 16 18:19:05 2019
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 10766679
Subject: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM
To: dave@sr71.net
From: Dave Hansen
Date: Wed, 16 Jan 2019 10:19:05 -0800
References: <20190116181859.D1504459@viggo.jf.intel.com>
In-Reply-To: <20190116181859.D1504459@viggo.jf.intel.com>
Message-Id: <20190116181905.12E102B4@viggo.jf.intel.com>
Cc: thomas.lendacky@amd.com, mhocko@suse.com, linux-nvdimm@lists.01.org,
 tiwai@suse.de, Dave Hansen, ying.huang@intel.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, bp@suse.de,
 baiyaowei@cmss.chinamobile.com, zwisler@kernel.org, bhelgaas@google.com,
 fengguang.wu@intel.com, akpm@linux-foundation.org

From: Dave Hansen

Currently, a persistent memory region is "owned" by a device driver,
either the "Device DAX" or "Filesystem DAX" drivers.
These drivers allow applications to explicitly use persistent memory,
generally by being modified to use special, new libraries.

However, this limits persistent memory use to applications which
*have* been modified.  To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed and used just like
normal RAM would be.

To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:

	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

and then bind it to this new driver:

	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind

After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.

Note: this inherits any existing NUMA information for the newly-added
memory from the persistent memory device that came from the
firmware.  On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node.  That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.

There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches.  But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.

Signed-off-by: Dave Hansen
Cc: Dan Williams
Cc: Dave Jiang
Cc: Ross Zwisler
Cc: Vishal Verma
Cc: Tom Lendacky
Cc: Andrew Morton
Cc: Michal Hocko
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying
Cc: Fengguang Wu
Cc: Borislav Petkov
Cc: Bjorn Helgaas
Cc: Yaowei Bai
Cc: Takashi Iwai
---

 b/drivers/dax/Kconfig  |    5 ++
 b/drivers/dax/Makefile |    1
 b/drivers/dax/kmem.c   |   93 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 99 insertions(+)

diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
--- a/drivers/dax/Kconfig~dax-kmem-try-4	2019-01-08 09:54:44.051694874 -0800
+++ b/drivers/dax/Kconfig	2019-01-08 09:54:44.056694874 -0800
@@ -32,6 +32,11 @@ config DEV_DAX_PMEM
 
 	  Say M if unsure
 
+config DEV_DAX_KMEM
+	def_bool y
+	depends on DEV_DAX_PMEM   # Needs DEV_DAX_PMEM infrastructure
+	depends on MEMORY_HOTPLUG # for add_memory() and friends
+
 config DEV_DAX_PMEM_COMPAT
 	tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
 	depends on DEV_DAX_PMEM
diff -puN /dev/null drivers/dax/kmem.c
--- /dev/null	2018-12-03 08:41:47.355756491 -0800
+++ b/drivers/dax/kmem.c	2019-01-08 09:54:44.056694874 -0800
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "dax-private.h"
+#include "bus.h"
+
+int dev_dax_kmem_probe(struct device *dev)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct resource *res = &dev_dax->region->res;
+	resource_size_t kmem_start;
+	resource_size_t kmem_size;
+	struct resource *new_res;
+	int numa_node;
+	int rc;
+
+	/* Hotplug starting at the beginning of the next block: */
+	kmem_start = ALIGN(res->start, memory_block_size_bytes());
+
+	kmem_size = resource_size(res);
+	/* Adjust the size down to compensate for moving up kmem_start: */
+	kmem_size -= kmem_start - res->start;
+	/* Align the size down to cover only complete blocks: */
+	kmem_size &= ~(memory_block_size_bytes() - 1);
+
+	new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
+					  dev_name(dev));
+
+	if (!new_res) {
+		printk("could not reserve region %016llx -> %016llx\n",
+				kmem_start, kmem_start+kmem_size);
+		return -EBUSY;
+	}
+
+	/*
+	 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
+	 * so that add_memory() can add a child resource.
+	 */
+	new_res->flags = IORESOURCE_SYSTEM_RAM;
+	new_res->name = dev_name(dev);
+
+	numa_node = dev_dax->target_node;
+	if (numa_node < 0) {
+		pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
+		numa_node = 0;
+	}
+
+	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
+	if (rc)
+		return rc;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dev_dax_kmem_probe);
+
+static int dev_dax_kmem_remove(struct device *dev)
+{
+	/* Assume that hot-remove will fail for now */
+	return -EBUSY;
+}
+
+static struct dax_device_driver device_dax_kmem_driver = {
+	.drv = {
+		.probe = dev_dax_kmem_probe,
+		.remove = dev_dax_kmem_remove,
+	},
+};
+
+static int __init dax_kmem_init(void)
+{
+	return dax_driver_register(&device_dax_kmem_driver);
+}
+
+static void __exit dax_kmem_exit(void)
+{
+	dax_driver_unregister(&device_dax_kmem_driver);
+}
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_LICENSE("GPL v2");
+module_init(dax_kmem_init);
+module_exit(dax_kmem_exit);
+MODULE_ALIAS_DAX_DEVICE(0);
diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
--- a/drivers/dax/Makefile~dax-kmem-try-4	2019-01-08 09:54:44.053694874 -0800
+++ b/drivers/dax/Makefile	2019-01-08 09:54:44.056694874 -0800
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_DAX) += dax.o
 obj-$(CONFIG_DEV_DAX) += device_dax.o
+obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
 
 dax-y := super.o
 dax-y += bus.o