[5/5] dax: "Hotplug" persistent memory for use like normal RAM

From: Dave Hansen <dave.hansen@linux.intel.com>

From: Dave Hansen <dave.hansen@linux.intel.com>

This is intended for use with NVDIMMs that are physically persistent
(physically like flash) so that they can be used as a cost-effective
RAM replacement.  Intel Optane DC persistent memory is one
implementation of this kind of NVDIMM.

Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries. (DIMM-based
persistent memory hardware/software is described in great detail
here: Documentation/nvdimm/nvdimm.txt).

However, this limits persistent memory use to applications which
*have* been modified.  To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed and used just like
normal RAM would be.

To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:

	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

and then bind it to this new driver:

	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind

After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.

This rebinding procedure is currently a one-way trip.  Once memory
is bound to "kmem", it's there permanently and can not be
unbound and assigned back to device_dax.

The kmem driver will never bind to a dax device unless the device
is *explicitly* bound to the driver.  There are two reasons for
this: One, since it is a one-way trip, it can not be undone if
bound incorrectly.  Two, the kmem driver destroys data on the
device.  Think of if you had good data on a pmem device.  It
would be catastrophic if you compile-in "kmem", but leave out
the "device_dax" driver.  kmem would take over the device and
write volatile data all over your good data.

This inherits any existing NUMA information for the newly-added
memory from the persistent memory device that came from the
firmware.  On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node.  That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.

Because NUMA nodes are created, the existing NUMA APIs and tools
are sufficient to create policies for applications or memory areas
to have affinity for or an aversion to using this memory.

There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches.  But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.
This calculation is also the reason we export
memory_block_size_bytes().

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
---

 b/drivers/base/memory.c |    1 
 b/drivers/dax/Kconfig   |   16 +++++++
 b/drivers/dax/Makefile  |    1 
 b/drivers/dax/kmem.c    |  108 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 126 insertions(+)

Message ID	20190124231448.E102D18E@viggo.jf.intel.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-nvdimm-bounces@lists.01.org> Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0A5721399 for <patchwork-linux-nvdimm@patchwork.kernel.org>; Thu, 24 Jan 2019 23:22:01 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E9E6D2FDEE for <patchwork-linux-nvdimm@patchwork.kernel.org>; Thu, 24 Jan 2019 23:22:00 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id DE4DF2FE60; Thu, 24 Jan 2019 23:22:00 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1 Received: from ml01.01.org (ml01.01.org [198.145.21.10]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 35F972FDEE for <patchwork-linux-nvdimm@patchwork.kernel.org>; Thu, 24 Jan 2019 23:22:00 +0000 (UTC) Received: from [127.0.0.1] (localhost [IPv6:::1]) by ml01.01.org (Postfix) with ESMTP id CB056211BA446; Thu, 24 Jan 2019 15:21:59 -0800 (PST) X-Original-To: linux-nvdimm@lists.01.org Delivered-To: linux-nvdimm@lists.01.org Received-SPF: None (no SPF record) identity=mailfrom; client-ip=134.134.136.100; helo=mga07.intel.com; envelope-from=dave.hansen@linux.intel.com; receiver=linux-nvdimm@lists.01.org Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id B0BB52194D387 for <linux-nvdimm@lists.01.org>; Thu, 24 Jan 2019 15:21:58 -0800 (PST) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga004.jf.intel.com ([10.7.209.38]) by orsmga105.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 24 Jan 2019 15:21:58 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,518,1539673200"; d="scan'208";a="269699706" Received: from viggo.jf.intel.com (HELO localhost.localdomain) ([10.54.77.144]) by orsmga004.jf.intel.com with ESMTP; 24 Jan 2019 15:21:58 -0800 Subject: [PATCH 5/5] dax: "Hotplug" persistent memory for use like normal RAM To: linux-kernel@vger.kernel.org From: Dave Hansen <dave.hansen@linux.intel.com> Date: Thu, 24 Jan 2019 15:14:48 -0800 References: <20190124231441.37A4A305@viggo.jf.intel.com> In-Reply-To: <20190124231441.37A4A305@viggo.jf.intel.com> Message-Id: <20190124231448.E102D18E@viggo.jf.intel.com> X-BeenThere: linux-nvdimm@lists.01.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Linux-nvdimm developer list." <linux-nvdimm.lists.01.org> List-Unsubscribe: <https://lists.01.org/mailman/options/linux-nvdimm>, <mailto:linux-nvdimm-request@lists.01.org?subject=unsubscribe> List-Archive: <http://lists.01.org/pipermail/linux-nvdimm/> List-Post: <mailto:linux-nvdimm@lists.01.org> List-Help: <mailto:linux-nvdimm-request@lists.01.org?subject=help> List-Subscribe: <https://lists.01.org/mailman/listinfo/linux-nvdimm>, <mailto:linux-nvdimm-request@lists.01.org?subject=subscribe> Cc: thomas.lendacky@amd.com, mhocko@suse.com, linux-nvdimm@lists.01.org, tiwai@suse.de, Dave Hansen <dave.hansen@linux.intel.com>, ying.huang@intel.com, linux-mm@kvack.org, jglisse@redhat.com, bp@suse.de, baiyaowei@cmss.chinamobile.com, zwisler@kernel.org, bhelgaas@google.com, fengguang.wu@intel.com, akpm@linux-foundation.org MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: linux-nvdimm-bounces@lists.01.org Sender: "Linux-nvdimm" <linux-nvdimm-bounces@lists.01.org> X-Virus-Scanned: ClamAV using ClamSMTP
Series	Allow persistent memory to be used like normal RAM \| expand [0/5,v4] Allow persistent memory to be used like normal RAM [1/5] mm/resource: return real error codes from walk failures [2/5] mm/resource: move HMM pr_debug() deeper into resource code [3/5] mm/memory-hotplug: allow memory resources to be children [4/5] dax/kmem: let walk_system_ram_range() search child resources [5/5] dax: "Hotplug" persistent memory for use like normal RAM

[5/5] dax: "Hotplug" persistent memory for use like normal RAM

Commit Message

Comments

Patch