Message ID | 20190116181905.12E102B4@viggo.jf.intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | Allow persistent memory to be used like normal RAM |
On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen <dave.hansen@linux.intel.com> wrote: > > > From: Dave Hansen <dave.hansen@linux.intel.com> > > Currently, a persistent memory region is "owned" by a device driver, > either the "Direct DAX" or "Filesystem DAX" drivers. These drivers > allow applications to explicitly use persistent memory, generally > by being modified to use special, new libraries. Is there any documentation about exactly what persistent memory is? In Documentation/, I see references to pstore and pmem, which sound sort of similar, but maybe not quite the same? > However, this limits persistent memory use to applications which > *have* been modified. To make it more broadly usable, this driver > "hotplugs" memory into the kernel, to be managed ad used just like > normal RAM would be. s/ad/and/ > To make this work, management software must remove the device from > being controlled by the "Device DAX" infrastructure: > > echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id > echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind > > and then bind it to this new driver: > > echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id > echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind > > After this, there will be a number of new memory sections visible > in sysfs that can be onlined, or that may get onlined by existing > udev-initiated memory hotplug rules. > > Note: this inherits any existing NUMA information for the newly- > added memory from the persistent memory device that came from the > firmware. On Intel platforms, the firmware has guarantees that > require each socket's persistent memory to be in a separate > memory-only NUMA node. That means that this patch is not expected > to create NUMA nodes, but will simply hotplug memory into existing > nodes. > > There is currently some metadata at the beginning of pmem regions. > The section-size memory hotplug restrictions, plus this small > reserved area can cause the "loss" of a section or two of capacity. 
> This should be fixable in follow-on patches. But, as a first step, > losing 256MB of memory (worst case) out of hundreds of gigabytes > is a good tradeoff vs. the required code to fix this up precisely. > > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Dave Jiang <dave.jiang@intel.com> > Cc: Ross Zwisler <zwisler@kernel.org> > Cc: Vishal Verma <vishal.l.verma@intel.com> > Cc: Tom Lendacky <thomas.lendacky@amd.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Michal Hocko <mhocko@suse.com> > Cc: linux-nvdimm@lists.01.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org > Cc: Huang Ying <ying.huang@intel.com> > Cc: Fengguang Wu <fengguang.wu@intel.com> > Cc: Borislav Petkov <bp@suse.de> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> > Cc: Takashi Iwai <tiwai@suse.de> > --- > > b/drivers/dax/Kconfig | 5 ++ > b/drivers/dax/Makefile | 1 > b/drivers/dax/kmem.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 99 insertions(+) > > diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig > --- a/drivers/dax/Kconfig~dax-kmem-try-4 2019-01-08 09:54:44.051694874 -0800 > +++ b/drivers/dax/Kconfig 2019-01-08 09:54:44.056694874 -0800 > @@ -32,6 +32,11 @@ config DEV_DAX_PMEM > > Say M if unsure > > +config DEV_DAX_KMEM > + def_bool y Is "y" the right default here? I periodically see Linus complain about new things defaulting to "on", but I admit I haven't paid enough attention to know whether that would apply here. 
> + depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure > + depends on MEMORY_HOTPLUG # for add_memory() and friends > + > config DEV_DAX_PMEM_COMPAT > tristate "PMEM DAX: support the deprecated /sys/class/dax interface" > depends on DEV_DAX_PMEM > diff -puN /dev/null drivers/dax/kmem.c > --- /dev/null 2018-12-03 08:41:47.355756491 -0800 > +++ b/drivers/dax/kmem.c 2019-01-08 09:54:44.056694874 -0800 > @@ -0,0 +1,93 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */ > +#include <linux/memremap.h> > +#include <linux/pagemap.h> > +#include <linux/memory.h> > +#include <linux/module.h> > +#include <linux/device.h> > +#include <linux/pfn_t.h> > +#include <linux/slab.h> > +#include <linux/dax.h> > +#include <linux/fs.h> > +#include <linux/mm.h> > +#include <linux/mman.h> > +#include "dax-private.h" > +#include "bus.h" > + > +int dev_dax_kmem_probe(struct device *dev) > +{ > + struct dev_dax *dev_dax = to_dev_dax(dev); > + struct resource *res = &dev_dax->region->res; > + resource_size_t kmem_start; > + resource_size_t kmem_size; > + struct resource *new_res; > + int numa_node; > + int rc; > + > + /* Hotplug starting at the beginning of the next block: */ > + kmem_start = ALIGN(res->start, memory_block_size_bytes()); > + > + kmem_size = resource_size(res); > + /* Adjust the size down to compensate for moving up kmem_start: */ > + kmem_size -= kmem_start - res->start; > + /* Align the size down to cover only complete blocks: */ > + kmem_size &= ~(memory_block_size_bytes() - 1); > + > + new_res = devm_request_mem_region(dev, kmem_start, kmem_size, > + dev_name(dev)); > + > + if (!new_res) { > + printk("could not reserve region %016llx -> %016llx\n", > + kmem_start, kmem_start+kmem_size); 1) It'd be nice to have some sort of module tag in the output that ties it to this driver. 
2) It might be nice to print the range in the same format as %pR, i.e., "[mem %#010x-%#010x]" with the end included (start + size -1 ). > + return -EBUSY; > + } > + > + /* > + * Set flags appropriate for System RAM. Leave ..._BUSY clear > + * so that add_memory() can add a child resource. > + */ > + new_res->flags = IORESOURCE_SYSTEM_RAM; IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the devm_request_mem_region() path. I think you should keep at least IORESOURCE_MEM so the iomem_resource tree stays consistent. > + new_res->name = dev_name(dev); > + > + numa_node = dev_dax->target_node; > + if (numa_node < 0) { > + pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node); It'd be nice to again have a module tag and an indication of what range is affected, e.g., %pR of new_res. You don't save the new_res pointer anywhere, which I guess you intend for now since there's no remove or anything else to do with this resource? I thought maybe devm_request_mem_region() would implicitly save it, but it doesn't; it only saves the parent (iomem_resource, the start (kmem_start), and the size (kmem_size)). > + numa_node = 0; > + } > + > + rc = add_memory(numa_node, new_res->start, resource_size(new_res)); > + if (rc) > + return rc; > + > + return 0; Doesn't this mean "return rc" or even just "return add_memory(...)"? 
> +} > +EXPORT_SYMBOL_GPL(dev_dax_kmem_probe); > + > +static int dev_dax_kmem_remove(struct device *dev) > +{ > + /* Assume that hot-remove will fail for now */ > + return -EBUSY; > +} > + > +static struct dax_device_driver device_dax_kmem_driver = { > + .drv = { > + .probe = dev_dax_kmem_probe, > + .remove = dev_dax_kmem_remove, > + }, > +}; > + > +static int __init dax_kmem_init(void) > +{ > + return dax_driver_register(&device_dax_kmem_driver); > +} > + > +static void __exit dax_kmem_exit(void) > +{ > + dax_driver_unregister(&device_dax_kmem_driver); > +} > + > +MODULE_AUTHOR("Intel Corporation"); > +MODULE_LICENSE("GPL v2"); > +module_init(dax_kmem_init); > +module_exit(dax_kmem_exit); > +MODULE_ALIAS_DAX_DEVICE(0); > diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile > --- a/drivers/dax/Makefile~dax-kmem-try-4 2019-01-08 09:54:44.053694874 -0800 > +++ b/drivers/dax/Makefile 2019-01-08 09:54:44.056694874 -0800 > @@ -1,6 +1,7 @@ > # SPDX-License-Identifier: GPL-2.0 > obj-$(CONFIG_DAX) += dax.o > obj-$(CONFIG_DEV_DAX) += device_dax.o > +obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o > > dax-y := super.o > dax-y += bus.o > _
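Bjorn's second printk() comment asks for the range to be printed the way %pR does, with the end address included. A minimal userspace sketch of that arithmetic follows. `range_end_inclusive()` and `format_mem_range()` are hypothetical names used only for illustration, not kernel helpers, and the format string is the one quoted in the review.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: last byte covered by [start, start + size). */
uint64_t range_end_inclusive(uint64_t start, uint64_t size)
{
	assert(size > 0);	/* an empty range has no last byte */
	return start + size - 1;
}

/*
 * Hypothetical helper: format a range as "[mem 0x...-0x...]" using the
 * "%#010x" layout quoted in the review. Returns what snprintf() returns.
 */
int format_mem_range(char *buf, size_t len, uint64_t start, uint64_t size)
{
	return snprintf(buf, len, "[mem %#010" PRIx64 "-%#010" PRIx64 "]",
			start, range_end_inclusive(start, size));
}
```

For a 256 MiB region starting at 4 GiB, the inclusive end is 0x10fffffff, whereas the patch's `kmem_start+kmem_size` would print 0x110000000, one byte past the end of the region.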
On Wed, Jan 16, 2019 at 10:25 AM Dave Hansen <dave.hansen@linux.intel.com> wrote: > > > From: Dave Hansen <dave.hansen@linux.intel.com> > > Currently, a persistent memory region is "owned" by a device driver, > either the "Direct DAX" or "Filesystem DAX" drivers. These drivers > allow applications to explicitly use persistent memory, generally > by being modified to use special, new libraries. > > However, this limits persistent memory use to applications which > *have* been modified. To make it more broadly usable, this driver > "hotplugs" memory into the kernel, to be managed ad used just like > normal RAM would be. > > To make this work, management software must remove the device from > being controlled by the "Device DAX" infrastructure: > > echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id > echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind > > and then bind it to this new driver: > > echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id > echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind > > After this, there will be a number of new memory sections visible > in sysfs that can be onlined, or that may get onlined by existing > udev-initiated memory hotplug rules. > > Note: this inherits any existing NUMA information for the newly- > added memory from the persistent memory device that came from the > firmware. On Intel platforms, the firmware has guarantees that > require each socket's persistent memory to be in a separate > memory-only NUMA node. That means that this patch is not expected > to create NUMA nodes, but will simply hotplug memory into existing > nodes. > > There is currently some metadata at the beginning of pmem regions. > The section-size memory hotplug restrictions, plus this small > reserved area can cause the "loss" of a section or two of capacity. > This should be fixable in follow-on patches. But, as a first step, > losing 256MB of memory (worst case) out of hundreds of gigabytes > is a good tradeoff vs. 
the required code to fix this up precisely. > > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Dave Jiang <dave.jiang@intel.com> > Cc: Ross Zwisler <zwisler@kernel.org> > Cc: Vishal Verma <vishal.l.verma@intel.com> > Cc: Tom Lendacky <thomas.lendacky@amd.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Michal Hocko <mhocko@suse.com> > Cc: linux-nvdimm@lists.01.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org > Cc: Huang Ying <ying.huang@intel.com> > Cc: Fengguang Wu <fengguang.wu@intel.com> > Cc: Borislav Petkov <bp@suse.de> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> > Cc: Takashi Iwai <tiwai@suse.de> > --- > > b/drivers/dax/Kconfig | 5 ++ > b/drivers/dax/Makefile | 1 > b/drivers/dax/kmem.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 99 insertions(+) > > diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig > --- a/drivers/dax/Kconfig~dax-kmem-try-4 2019-01-08 09:54:44.051694874 -0800 > +++ b/drivers/dax/Kconfig 2019-01-08 09:54:44.056694874 -0800 > @@ -32,6 +32,11 @@ config DEV_DAX_PMEM > > Say M if unsure > > +config DEV_DAX_KMEM > + def_bool y > + depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure > + depends on MEMORY_HOTPLUG # for add_memory() and friends > + I think this should be: config DEV_DAX_KMEM tristate "<kmem title>" depends on DEV_DAX default DEV_DAX depends on MEMORY_HOTPLUG # for add_memory() and friends help <kmem description> ...because the DEV_DAX_KMEM implementation with the device-DAX reworks is independent of pmem. It just so happens that pmem is the only source for device-DAX instances, but that need not always be the case and kmem is device-DAX origin generic. 
> config DEV_DAX_PMEM_COMPAT > tristate "PMEM DAX: support the deprecated /sys/class/dax interface" > depends on DEV_DAX_PMEM > diff -puN /dev/null drivers/dax/kmem.c > --- /dev/null 2018-12-03 08:41:47.355756491 -0800 > +++ b/drivers/dax/kmem.c 2019-01-08 09:54:44.056694874 -0800 > @@ -0,0 +1,93 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */ > +#include <linux/memremap.h> > +#include <linux/pagemap.h> > +#include <linux/memory.h> > +#include <linux/module.h> > +#include <linux/device.h> > +#include <linux/pfn_t.h> > +#include <linux/slab.h> > +#include <linux/dax.h> > +#include <linux/fs.h> > +#include <linux/mm.h> > +#include <linux/mman.h> > +#include "dax-private.h" > +#include "bus.h" > + > +int dev_dax_kmem_probe(struct device *dev) > +{ > + struct dev_dax *dev_dax = to_dev_dax(dev); > + struct resource *res = &dev_dax->region->res; > + resource_size_t kmem_start; > + resource_size_t kmem_size; > + struct resource *new_res; > + int numa_node; > + int rc; > + > + /* Hotplug starting at the beginning of the next block: */ > + kmem_start = ALIGN(res->start, memory_block_size_bytes()); > + > + kmem_size = resource_size(res); > + /* Adjust the size down to compensate for moving up kmem_start: */ > + kmem_size -= kmem_start - res->start; > + /* Align the size down to cover only complete blocks: */ > + kmem_size &= ~(memory_block_size_bytes() - 1); > + > + new_res = devm_request_mem_region(dev, kmem_start, kmem_size, > + dev_name(dev)); > + > + if (!new_res) { > + printk("could not reserve region %016llx -> %016llx\n", > + kmem_start, kmem_start+kmem_size); dev_err() please. > + return -EBUSY; > + } > + > + /* > + * Set flags appropriate for System RAM. Leave ..._BUSY clear > + * so that add_memory() can add a child resource. 
> + */ > + new_res->flags = IORESOURCE_SYSTEM_RAM; > + new_res->name = dev_name(dev); > + > + numa_node = dev_dax->target_node; > + if (numa_node < 0) { > + pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node); I think this should be dev_info(dev, "no numa_node, defaulting to 0\n"), or dev_dbg(): 1/ so we can backtrack which device is missing numa information 2/ NUMA_NO_NODE may be a common occurrence so it's not really a "warn" level concern afaics. 3/ no real need for _once I don't see this as a log spam risk. > + numa_node = 0; > + } > + > + rc = add_memory(numa_node, new_res->start, resource_size(new_res)); > + if (rc) > + return rc; > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(dev_dax_kmem_probe); No need to export this afaics. > + > +static int dev_dax_kmem_remove(struct device *dev) > +{ > + /* Assume that hot-remove will fail for now */ > + return -EBUSY; > +} > + > +static struct dax_device_driver device_dax_kmem_driver = { > + .drv = { > + .probe = dev_dax_kmem_probe, > + .remove = dev_dax_kmem_remove, > + }, > +}; > + > +static int __init dax_kmem_init(void) > +{ > + return dax_driver_register(&device_dax_kmem_driver); > +} > + > +static void __exit dax_kmem_exit(void) > +{ > + dax_driver_unregister(&device_dax_kmem_driver); > +} > + > +MODULE_AUTHOR("Intel Corporation"); > +MODULE_LICENSE("GPL v2"); > +module_init(dax_kmem_init); > +module_exit(dax_kmem_exit); > +MODULE_ALIAS_DAX_DEVICE(0); > diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile > --- a/drivers/dax/Makefile~dax-kmem-try-4 2019-01-08 09:54:44.053694874 -0800 > +++ b/drivers/dax/Makefile 2019-01-08 09:54:44.056694874 -0800 > @@ -1,6 +1,7 @@ > # SPDX-License-Identifier: GPL-2.0 > obj-$(CONFIG_DAX) += dax.o > obj-$(CONFIG_DEV_DAX) += device_dax.o > +obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o > > dax-y := super.o > dax-y += bus.o > _
On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
> On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>> Currently, a persistent memory region is "owned" by a device driver,
>> either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
>> allow applications to explicitly use persistent memory, generally
>> by being modified to use special, new libraries.
>
> Is there any documentation about exactly what persistent memory is?
> In Documentation/, I see references to pstore and pmem, which sound
> sort of similar, but maybe not quite the same?

One instance of persistent memory is nonvolatile DIMMS. They're
described in great detail here: Documentation/nvdimm/nvdimm.txt

>> +config DEV_DAX_KMEM
>> +	def_bool y
>
> Is "y" the right default here? I periodically see Linus complain
> about new things defaulting to "on", but I admit I haven't paid enough
> attention to know whether that would apply here.
>
>> +	depends on DEV_DAX_PMEM   # Needs DEV_DAX_PMEM infrastructure
>> +	depends on MEMORY_HOTPLUG # for add_memory() and friends

Well, it doesn't default to "on for everyone". It inherits the state of
DEV_DAX_PMEM so it's only foisted on folks who have already opted in to
generic pmem support.

>> +int dev_dax_kmem_probe(struct device *dev)
>> +{
>> +	struct dev_dax *dev_dax = to_dev_dax(dev);
>> +	struct resource *res = &dev_dax->region->res;
>> +	resource_size_t kmem_start;
>> +	resource_size_t kmem_size;
>> +	struct resource *new_res;
>> +	int numa_node;
>> +	int rc;
>> +
>> +	/* Hotplug starting at the beginning of the next block: */
>> +	kmem_start = ALIGN(res->start, memory_block_size_bytes());
>> +
>> +	kmem_size = resource_size(res);
>> +	/* Adjust the size down to compensate for moving up kmem_start: */
>> +	kmem_size -= kmem_start - res->start;
>> +	/* Align the size down to cover only complete blocks: */
>> +	kmem_size &= ~(memory_block_size_bytes() - 1);
>> +
>> +	new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
>> +					  dev_name(dev));
>> +
>> +	if (!new_res) {
>> +		printk("could not reserve region %016llx -> %016llx\n",
>> +			kmem_start, kmem_start+kmem_size);
>
> 1) It'd be nice to have some sort of module tag in the output that
> ties it to this driver.

Good point. That should probably be a dev_printk().

> 2) It might be nice to print the range in the same format as %pR,
> i.e., "[mem %#010x-%#010x]" with the end included (start + size -1 ).

Sure, that sounds like a sane thing to do as well.

>> +		return -EBUSY;
>> +	}
>> +
>> +	/*
>> +	 * Set flags appropriate for System RAM. Leave ..._BUSY clear
>> +	 * so that add_memory() can add a child resource.
>> +	 */
>> +	new_res->flags = IORESOURCE_SYSTEM_RAM;
>
> IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> devm_request_mem_region() path. I think you should keep at least
> IORESOURCE_MEM so the iomem_resource tree stays consistent.
>
>> +	new_res->name = dev_name(dev);
>> +
>> +	numa_node = dev_dax->target_node;
>> +	if (numa_node < 0) {
>> +		pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
>
> It'd be nice to again have a module tag and an indication of what
> range is affected, e.g., %pR of new_res.
>
> You don't save the new_res pointer anywhere, which I guess you intend
> for now since there's no remove or anything else to do with this
> resource? I thought maybe devm_request_mem_region() would implicitly
> save it, but it doesn't; it only saves the parent (iomem_resource, the
> start (kmem_start), and the size (kmem_size)).

Yeah, that's the intention: removal is currently not supported. I'll
add a comment to clarify.

>> +		numa_node = 0;
>> +	}
>> +
>> +	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
>> +	if (rc)
>> +		return rc;
>> +
>> +	return 0;
>
> Doesn't this mean "return rc" or even just "return add_memory(...)"?

Yeah, all of those are equivalent. I guess I just prefer the explicit
error handling path.
On Wed, Jan 16, 2019 at 3:40 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
> > On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen
> > <dave.hansen@linux.intel.com> wrote:
> >> From: Dave Hansen <dave.hansen@linux.intel.com>
> >> Currently, a persistent memory region is "owned" by a device driver,
> >> either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
> >> allow applications to explicitly use persistent memory, generally
> >> by being modified to use special, new libraries.
> >
> > Is there any documentation about exactly what persistent memory is?
> > In Documentation/, I see references to pstore and pmem, which sound
> > sort of similar, but maybe not quite the same?
>
> One instance of persistent memory is nonvolatile DIMMS. They're
> described in great detail here: Documentation/nvdimm/nvdimm.txt

Thanks! Some bread crumbs in the changelog to lead there would be great.

Bjorn
On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
>> +	/*
>> +	 * Set flags appropriate for System RAM. Leave ..._BUSY clear
>> +	 * so that add_memory() can add a child resource.
>> +	 */
>> +	new_res->flags = IORESOURCE_SYSTEM_RAM;
> IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> devm_request_mem_region() path. I think you should keep at least
> IORESOURCE_MEM so the iomem_resource tree stays consistent.

I went to look at fixing this. It looks like "IORESOURCE_SYSTEM_RAM"
includes IORESOURCE_MEM:

> #define IORESOURCE_SYSTEM_RAM (IORESOURCE_MEM|IORESOURCE_SYSRAM)

Did you want the patch to expand this #define, or did you just want to
ensure that IORESOURCE_MEM got in there somehow?
On Wed, Jan 16, 2019 at 3:53 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
> >> +	/*
> >> +	 * Set flags appropriate for System RAM. Leave ..._BUSY clear
> >> +	 * so that add_memory() can add a child resource.
> >> +	 */
> >> +	new_res->flags = IORESOURCE_SYSTEM_RAM;
> > IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> > devm_request_mem_region() path. I think you should keep at least
> > IORESOURCE_MEM so the iomem_resource tree stays consistent.
>
> I went to look at fixing this. It looks like "IORESOURCE_SYSTEM_RAM"
> includes IORESOURCE_MEM:
>
> > #define IORESOURCE_SYSTEM_RAM (IORESOURCE_MEM|IORESOURCE_SYSRAM)
>
> Did you want the patch to expand this #define, or did you just want to
> ensure that IORESOURCE_MEM got in there somehow?

The latter. Since it's already included, forget I said anything :)
Although if your intent is only to clear IORESOURCE_BUSY, maybe it
would be safer to just clear that bit instead of overwriting
everything? That might also help people grepping for IORESOURCE_BUSY
usage.
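Bjorn's suggestion to clear only the BUSY bit can be compared with the patch's overwrite in a small userspace sketch. The flag values are copied from include/linux/ioport.h as an assumption about kernels of that era, `flags_overwrite()` and `flags_clear_busy()` are hypothetical names, and the starting value IORESOURCE_MEM | IORESOURCE_BUSY is likewise an assumption about what the request path leaves behind.

```c
#include <assert.h>

/* Assumed values, mirroring include/linux/ioport.h; illustrative only. */
#define IORESOURCE_MEM		0x00000200UL
#define IORESOURCE_SYSRAM	0x01000000UL
#define IORESOURCE_BUSY		0x80000000UL
#define IORESOURCE_SYSTEM_RAM	(IORESOURCE_MEM | IORESOURCE_SYSRAM)

/* What the patch does: replace every flag bit with SYSTEM_RAM. */
unsigned long flags_overwrite(unsigned long flags)
{
	(void)flags;	/* previous bits are discarded on purpose */
	return IORESOURCE_SYSTEM_RAM;
}

/*
 * The alternative raised in review: drop only BUSY and keep the rest.
 * Clearing BUSY alone would leave plain IORESOURCE_MEM, so this sketch
 * also ORs in SYSTEM_RAM to end up with the type the patch wants.
 */
unsigned long flags_clear_busy(unsigned long flags)
{
	flags &= ~IORESOURCE_BUSY;
	flags |= IORESOURCE_SYSTEM_RAM;
	return flags;
}

/* Both approaches agree when only MEM and BUSY were set beforehand. */
void flags_selfcheck(void)
{
	unsigned long requested = IORESOURCE_MEM | IORESOURCE_BUSY;

	assert(flags_overwrite(requested) == flags_clear_busy(requested));
}
```

The two only differ when some other bit was set before the assignment: the overwrite drops it, the bit-clearing version preserves it. Whether preserving unknown bits is desirable here is the judgment call the thread leaves open.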
On Wed, Jan 16, 2019 at 1:40 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
> > On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen
> > <dave.hansen@linux.intel.com> wrote:
> >> From: Dave Hansen <dave.hansen@linux.intel.com>
> >> Currently, a persistent memory region is "owned" by a device driver,
> >> either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
> >> allow applications to explicitly use persistent memory, generally
> >> by being modified to use special, new libraries.
> >
> > Is there any documentation about exactly what persistent memory is?
> > In Documentation/, I see references to pstore and pmem, which sound
> > sort of similar, but maybe not quite the same?
>
> One instance of persistent memory is nonvolatile DIMMS. They're
> described in great detail here: Documentation/nvdimm/nvdimm.txt
>
> >> +config DEV_DAX_KMEM
> >> +	def_bool y
> >
> > Is "y" the right default here? I periodically see Linus complain
> > about new things defaulting to "on", but I admit I haven't paid enough
> > attention to know whether that would apply here.
> >
> >> +	depends on DEV_DAX_PMEM   # Needs DEV_DAX_PMEM infrastructure
> >> +	depends on MEMORY_HOTPLUG # for add_memory() and friends
>
> Well, it doesn't default to "on for everyone". It inherits the state of
> DEV_DAX_PMEM so it's only foisted on folks who have already opted in to
> generic pmem support.
>
> >> +int dev_dax_kmem_probe(struct device *dev)
> >> +{
> >> +	struct dev_dax *dev_dax = to_dev_dax(dev);
> >> +	struct resource *res = &dev_dax->region->res;
> >> +	resource_size_t kmem_start;
> >> +	resource_size_t kmem_size;
> >> +	struct resource *new_res;
> >> +	int numa_node;
> >> +	int rc;
> >> +
> >> +	/* Hotplug starting at the beginning of the next block: */
> >> +	kmem_start = ALIGN(res->start, memory_block_size_bytes());
> >> +
> >> +	kmem_size = resource_size(res);
> >> +	/* Adjust the size down to compensate for moving up kmem_start: */
> >> +	kmem_size -= kmem_start - res->start;
> >> +	/* Align the size down to cover only complete blocks: */
> >> +	kmem_size &= ~(memory_block_size_bytes() - 1);
> >> +
> >> +	new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
> >> +					  dev_name(dev));
> >> +
> >> +	if (!new_res) {
> >> +		printk("could not reserve region %016llx -> %016llx\n",
> >> +			kmem_start, kmem_start+kmem_size);
> >
> > 1) It'd be nice to have some sort of module tag in the output that
> > ties it to this driver.
>
> Good point. That should probably be a dev_printk().
>
> > 2) It might be nice to print the range in the same format as %pR,
> > i.e., "[mem %#010x-%#010x]" with the end included (start + size -1 ).
>
> Sure, that sounds like a sane thing to do as well.

Does %pR protect physical address disclosure to non-root by default?
At least the pmem driver is using %pR rather than manually printing
raw physical address values, but you would need to create a local
modified version of the passed in resource.

> >> +		return -EBUSY;
> >> +	}
> >> +
> >> +	/*
> >> +	 * Set flags appropriate for System RAM. Leave ..._BUSY clear
> >> +	 * so that add_memory() can add a child resource.
> >> +	 */
> >> +	new_res->flags = IORESOURCE_SYSTEM_RAM;
> >
> > IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> > devm_request_mem_region() path. I think you should keep at least
> > IORESOURCE_MEM so the iomem_resource tree stays consistent.
> >
> >> +	new_res->name = dev_name(dev);
> >> +
> >> +	numa_node = dev_dax->target_node;
> >> +	if (numa_node < 0) {
> >> +		pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
> >
> > It'd be nice to again have a module tag and an indication of what
> > range is affected, e.g., %pR of new_res.
> >
> > You don't save the new_res pointer anywhere, which I guess you intend
> > for now since there's no remove or anything else to do with this
> > resource? I thought maybe devm_request_mem_region() would implicitly
> > save it, but it doesn't; it only saves the parent (iomem_resource, the
> > start (kmem_start), and the size (kmem_size)).
>
> Yeah, that's the intention: removal is currently not supported. I'll
> add a comment to clarify.

I would clarify that *driver* removal is supported because there's no
Linux facility for drivers to fail removal (nothing checks the return
code from ->remove()). Instead the protection is that the resource
must remain pinned forever. In that case devm_request_mem_region() is
the wrong function to use. You want to explicitly use the non-devm
request_mem_region() and purposely leak it to keep the memory reserved
indefinitely.
>-----Original Message-----
>From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf
>Of Dave Hansen
>Sent: Thursday, January 17, 2019 2:19 AM
>To: dave@sr71.net
>Cc: thomas.lendacky@amd.com; mhocko@suse.com;
>linux-nvdimm@lists.01.org; tiwai@suse.de; Dave Hansen
><dave.hansen@linux.intel.com>; Huang, Ying <ying.huang@intel.com>;
>linux-kernel@vger.kernel.org; linux-mm@kvack.org; bp@suse.de;
>baiyaowei@cmss.chinamobile.com; zwisler@kernel.org;
>bhelgaas@google.com; Wu, Fengguang <fengguang.wu@intel.com>;
>akpm@linux-foundation.org
>Subject: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal
>RAM
>
>
>From: Dave Hansen <dave.hansen@linux.intel.com>
>
>Currently, a persistent memory region is "owned" by a device driver,
>either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
>allow applications to explicitly use persistent memory, generally
>by being modified to use special, new libraries.
>
>However, this limits persistent memory use to applications which
>*have* been modified. To make it more broadly usable, this driver
>"hotplugs" memory into the kernel, to be managed ad used just like
>normal RAM would be.
>
>To make this work, management software must remove the device from
>being controlled by the "Device DAX" infrastructure:
>
>	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>
>and then bind it to this new driver:
>
>	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind

Is there any plan to introduce an additional mode, e.g. "kmem", in the
userspace ndctl tool to do the configuration?

>After this, there will be a number of new memory sections visible
>in sysfs that can be onlined, or that may get onlined by existing
>udev-initiated memory hotplug rules.
>
>Note: this inherits any existing NUMA information for the newly-
>added memory from the persistent memory device that came from the
>firmware.
On Intel platforms, the firmware has guarantees that >require each socket's persistent memory to be in a separate >memory-only NUMA node. That means that this patch is not expected >to create NUMA nodes, but will simply hotplug memory into existing >nodes. > >There is currently some metadata at the beginning of pmem regions. >The section-size memory hotplug restrictions, plus this small >reserved area can cause the "loss" of a section or two of capacity. >This should be fixable in follow-on patches. But, as a first step, >losing 256MB of memory (worst case) out of hundreds of gigabytes >is a good tradeoff vs. the required code to fix this up precisely. > >Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> >Cc: Dan Williams <dan.j.williams@intel.com> >Cc: Dave Jiang <dave.jiang@intel.com> >Cc: Ross Zwisler <zwisler@kernel.org> >Cc: Vishal Verma <vishal.l.verma@intel.com> >Cc: Tom Lendacky <thomas.lendacky@amd.com> >Cc: Andrew Morton <akpm@linux-foundation.org> >Cc: Michal Hocko <mhocko@suse.com> >Cc: linux-nvdimm@lists.01.org >Cc: linux-kernel@vger.kernel.org >Cc: linux-mm@kvack.org >Cc: Huang Ying <ying.huang@intel.com> >Cc: Fengguang Wu <fengguang.wu@intel.com> >Cc: Borislav Petkov <bp@suse.de> >Cc: Bjorn Helgaas <bhelgaas@google.com> >Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> >Cc: Takashi Iwai <tiwai@suse.de> >--- > > b/drivers/dax/Kconfig | 5 ++ > b/drivers/dax/Makefile | 1 > b/drivers/dax/kmem.c | 93 >+++++++++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 99 insertions(+) > >diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig >--- a/drivers/dax/Kconfig~dax-kmem-try-4 2019-01-08 09:54:44.051694874 >-0800 >+++ b/drivers/dax/Kconfig 2019-01-08 09:54:44.056694874 -0800 >@@ -32,6 +32,11 @@ config DEV_DAX_PMEM > > Say M if unsure > >+config DEV_DAX_KMEM >+ def_bool y >+ depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure >+ depends on MEMORY_HOTPLUG # for add_memory() and friends >+ > config DEV_DAX_PMEM_COMPAT > 
tristate "PMEM DAX: support the deprecated /sys/class/dax interface" > depends on DEV_DAX_PMEM >diff -puN /dev/null drivers/dax/kmem.c >--- /dev/null 2018-12-03 08:41:47.355756491 -0800 >+++ b/drivers/dax/kmem.c 2019-01-08 09:54:44.056694874 -0800 >@@ -0,0 +1,93 @@ >+// SPDX-License-Identifier: GPL-2.0 >+/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */ >+#include <linux/memremap.h> >+#include <linux/pagemap.h> >+#include <linux/memory.h> >+#include <linux/module.h> >+#include <linux/device.h> >+#include <linux/pfn_t.h> >+#include <linux/slab.h> >+#include <linux/dax.h> >+#include <linux/fs.h> >+#include <linux/mm.h> >+#include <linux/mman.h> >+#include "dax-private.h" >+#include "bus.h" >+ >+int dev_dax_kmem_probe(struct device *dev) >+{ >+ struct dev_dax *dev_dax = to_dev_dax(dev); >+ struct resource *res = &dev_dax->region->res; >+ resource_size_t kmem_start; >+ resource_size_t kmem_size; >+ struct resource *new_res; >+ int numa_node; >+ int rc; >+ >+ /* Hotplug starting at the beginning of the next block: */ >+ kmem_start = ALIGN(res->start, memory_block_size_bytes()); >+ >+ kmem_size = resource_size(res); >+ /* Adjust the size down to compensate for moving up kmem_start: */ >+ kmem_size -= kmem_start - res->start; >+ /* Align the size down to cover only complete blocks: */ >+ kmem_size &= ~(memory_block_size_bytes() - 1); >+ >+ new_res = devm_request_mem_region(dev, kmem_start, kmem_size, >+ dev_name(dev)); >+ >+ if (!new_res) { >+ printk("could not reserve region %016llx -> %016llx\n", >+ kmem_start, kmem_start+kmem_size); >+ return -EBUSY; >+ } >+ >+ /* >+ * Set flags appropriate for System RAM. Leave ..._BUSY clear >+ * so that add_memory() can add a child resource. 
>+ */ >+ new_res->flags = IORESOURCE_SYSTEM_RAM; >+ new_res->name = dev_name(dev); >+ >+ numa_node = dev_dax->target_node; >+ if (numa_node < 0) { >+ pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node); >+ numa_node = 0; >+ } >+ >+ rc = add_memory(numa_node, new_res->start, resource_size(new_res)); >+ if (rc) >+ return rc; >+ >+ return 0; >+} >+EXPORT_SYMBOL_GPL(dev_dax_kmem_probe); >+ >+static int dev_dax_kmem_remove(struct device *dev) >+{ >+ /* Assume that hot-remove will fail for now */ >+ return -EBUSY; >+} >+ >+static struct dax_device_driver device_dax_kmem_driver = { >+ .drv = { >+ .probe = dev_dax_kmem_probe, >+ .remove = dev_dax_kmem_remove, >+ }, >+}; >+ >+static int __init dax_kmem_init(void) >+{ >+ return dax_driver_register(&device_dax_kmem_driver); >+} >+ >+static void __exit dax_kmem_exit(void) >+{ >+ dax_driver_unregister(&device_dax_kmem_driver); >+} >+ >+MODULE_AUTHOR("Intel Corporation"); >+MODULE_LICENSE("GPL v2"); >+module_init(dax_kmem_init); >+module_exit(dax_kmem_exit); >+MODULE_ALIAS_DAX_DEVICE(0); >diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile >--- a/drivers/dax/Makefile~dax-kmem-try-4 2019-01-08 09:54:44.053694874 >-0800 >+++ b/drivers/dax/Makefile 2019-01-08 09:54:44.056694874 -0800 >@@ -1,6 +1,7 @@ > # SPDX-License-Identifier: GPL-2.0 > obj-$(CONFIG_DAX) += dax.o > obj-$(CONFIG_DEV_DAX) += device_dax.o >+obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o > > dax-y := super.o > dax-y += bus.o >_
On 2019/1/17 2:19 AM, Dave Hansen wrote: > From: Dave Hansen <dave.hansen@linux.intel.com> > > Currently, a persistent memory region is "owned" by a device driver, > either the "Direct DAX" or "Filesystem DAX" drivers. These drivers > allow applications to explicitly use persistent memory, generally > by being modified to use special, new libraries. > > However, this limits persistent memory use to applications which > *have* been modified. To make it more broadly usable, this driver > "hotplugs" memory into the kernel, to be managed and used just like > normal RAM would be. > > To make this work, management software must remove the device from > being controlled by the "Device DAX" infrastructure: > > echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id > echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind > > and then bind it to this new driver: > > echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id > echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind > > After this, there will be a number of new memory sections visible > in sysfs that can be onlined, or that may get onlined by existing > udev-initiated memory hotplug rules. > > Note: this inherits any existing NUMA information for the newly- > added memory from the persistent memory device that came from the > firmware. On Intel platforms, the firmware has guarantees that > require each socket's persistent memory to be in a separate > memory-only NUMA node. That means that this patch is not expected > to create NUMA nodes, but will simply hotplug memory into existing > nodes. > > There is currently some metadata at the beginning of pmem regions. > The section-size memory hotplug restrictions, plus this small > reserved area can cause the "loss" of a section or two of capacity. > This should be fixable in follow-on patches. But, as a first step, > losing 256MB of memory (worst case) out of hundreds of gigabytes > is a good tradeoff vs. the required code to fix this up precisely.
> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Dave Jiang <dave.jiang@intel.com> > Cc: Ross Zwisler <zwisler@kernel.org> > Cc: Vishal Verma <vishal.l.verma@intel.com> > Cc: Tom Lendacky <thomas.lendacky@amd.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Michal Hocko <mhocko@suse.com> > Cc: linux-nvdimm@lists.01.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org > Cc: Huang Ying <ying.huang@intel.com> > Cc: Fengguang Wu <fengguang.wu@intel.com> > Cc: Borislav Petkov <bp@suse.de> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> > Cc: Takashi Iwai <tiwai@suse.de> > --- > > b/drivers/dax/Kconfig | 5 ++ > b/drivers/dax/Makefile | 1 > b/drivers/dax/kmem.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 99 insertions(+) > > diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig > --- a/drivers/dax/Kconfig~dax-kmem-try-4 2019-01-08 09:54:44.051694874 -0800 > +++ b/drivers/dax/Kconfig 2019-01-08 09:54:44.056694874 -0800 > @@ -32,6 +32,11 @@ config DEV_DAX_PMEM > > Say M if unsure > > +config DEV_DAX_KMEM > + def_bool y > + depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure > + depends on MEMORY_HOTPLUG # for add_memory() and friends > + > config DEV_DAX_PMEM_COMPAT > tristate "PMEM DAX: support the deprecated /sys/class/dax interface" > depends on DEV_DAX_PMEM > diff -puN /dev/null drivers/dax/kmem.c > --- /dev/null 2018-12-03 08:41:47.355756491 -0800 > +++ b/drivers/dax/kmem.c 2019-01-08 09:54:44.056694874 -0800 > @@ -0,0 +1,93 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. 
*/ > +#include <linux/memremap.h> > +#include <linux/pagemap.h> > +#include <linux/memory.h> > +#include <linux/module.h> > +#include <linux/device.h> > +#include <linux/pfn_t.h> > +#include <linux/slab.h> > +#include <linux/dax.h> > +#include <linux/fs.h> > +#include <linux/mm.h> > +#include <linux/mman.h> > +#include "dax-private.h" > +#include "bus.h" > + > +int dev_dax_kmem_probe(struct device *dev) > +{ > + struct dev_dax *dev_dax = to_dev_dax(dev); > + struct resource *res = &dev_dax->region->res; > + resource_size_t kmem_start; > + resource_size_t kmem_size; > + struct resource *new_res; > + int numa_node; > + int rc; > + > + /* Hotplug starting at the beginning of the next block: */ > + kmem_start = ALIGN(res->start, memory_block_size_bytes()); > + > + kmem_size = resource_size(res); > + /* Adjust the size down to compensate for moving up kmem_start: */ > + kmem_size -= kmem_start - res->start; > + /* Align the size down to cover only complete blocks: */ > + kmem_size &= ~(memory_block_size_bytes() - 1); > + > + new_res = devm_request_mem_region(dev, kmem_start, kmem_size, > + dev_name(dev)); > + > + if (!new_res) { > + printk("could not reserve region %016llx -> %016llx\n", > + kmem_start, kmem_start+kmem_size); > + return -EBUSY; > + } > + > + /* > + * Set flags appropriate for System RAM. Leave ..._BUSY clear > + * so that add_memory() can add a child resource. > + */ > + new_res->flags = IORESOURCE_SYSTEM_RAM; > + new_res->name = dev_name(dev); > + > + numa_node = dev_dax->target_node; > + if (numa_node < 0) { > + pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node); > + numa_node = 0; > + } > + > + rc = add_memory(numa_node, new_res->start, resource_size(new_res));

I haven't tried pmem, and I am wondering whether it's slower than DRAM.
Should a flag, such as _GFP_PMEM, be added to distinguish it from
DRAM? If it's used for DMA, might it fail to satisfy a device's DMA
requests in time?
On 1/17/19 12:19 AM, Yanmin Zhang wrote:
> I haven't tried pmem, and I am wondering whether it's slower than DRAM.
> Should a flag, such as _GFP_PMEM, be added to distinguish it from
> DRAM?

Absolutely not. :)

We already have performance-differentiated memory, and lots of ways to
enumerate and select it in the kernel (all of our NUMA infrastructure).

PMEM is also just the first of many "kinds" of memory that folks want to
build in systems and use as "RAM". We literally don't have space to put
a flag in for each type.
On Wed, Jan 16, 2019 at 9:21 PM Du, Fan <fan.du@intel.com> wrote:
[..]
> >From: Dave Hansen <dave.hansen@linux.intel.com>
> >
> >Currently, a persistent memory region is "owned" by a device driver,
> >either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
> >allow applications to explicitly use persistent memory, generally
> >by being modified to use special, new libraries.
> >
> >However, this limits persistent memory use to applications which
> >*have* been modified. To make it more broadly usable, this driver
> >"hotplugs" memory into the kernel, to be managed and used just like
> >normal RAM would be.
> >
> >To make this work, management software must remove the device from
> >being controlled by the "Device DAX" infrastructure:
> >
> >        echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> >        echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> >
> >and then bind it to this new driver:
> >
> >        echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> >        echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
>
> Is there any plan to introduce an additional mode, e.g. "kmem", in the
> userspace ndctl tool to do the configuration?
>

Yes, but not to ndctl. The daxctl tool will grow a helper for this.
The policy for which device-dax instances should be hotplugged at system
init will be managed by a persistent configuration file and udev
rules.
On 2019/1/17 11:17 PM, Dave Hansen wrote:
> On 1/17/19 12:19 AM, Yanmin Zhang wrote:
>> I haven't tried pmem, and I am wondering whether it's slower than DRAM.
>> Should a flag, such as _GFP_PMEM, be added to distinguish it from
>> DRAM?
>
> Absolutely not. :)
Agree.

>
> We already have performance-differentiated memory, and lots of ways to
> enumerate and select it in the kernel (all of our NUMA infrastructure).
The kernel does manage memory the way you describe.
My question is: with your patch, PMEM becomes normal RAM, so there is
a chance for the kernel to allocate PMEM as a DMA buffer.
Some very fast devices, like a 10-Gigabit NIC or USB (SSIC connecting a modem),
might not work well if the DMA buffer is in PMEM, as it's slower than DRAM.

Should your patchset consider this?

>
> PMEM is also just the first of many "kinds" of memory that folks want to
> build in systems and use as "RAM". We literally don't have space to put
> a flag in for each type.
>
>
On 1/17/19 11:47 PM, Yanmin Zhang wrote:
> a chance for the kernel to allocate PMEM as a DMA buffer.
> Some very fast devices, like a 10-Gigabit NIC or USB (SSIC connecting a modem),
> might not work well if the DMA buffer is in PMEM, as it's slower than DRAM.
>
> Should your patchset consider this?

No, I don't think so. They can DMA to persistent memory whether this
patch set exists or not. So, if the hardware falls over, that's a
separate problem.

If apps want memory that performs in a particular way, then I would
suggest those apps find the NUMA nodes on the system that match their
needs with these patches:

> http://lkml.kernel.org/r/20190116175804.30196-1-keith.busch@intel.com

and use the existing NUMA APIs to select that memory.
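
As a rough illustration of the "existing NUMA APIs" mentioned above, a
userspace program could bind a mapping to a chosen node with mbind(2).
This is only a sketch under assumptions: the helper names
node_mask_single() and bind_range_to_node() and the MASK_LONGS cap are
made up for this example, and the node number has to come from the
application's own discovery of which node holds the memory it wants.
The raw syscall is used so that nothing needs to link against libnuma.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>	/* MPOL_BIND */

/* Hypothetical cap for this sketch: 16 longs = 1024 node bits on 64-bit. */
#define MASK_LONGS	16
#define BITS_PER_ULONG	(sizeof(unsigned long) * CHAR_BIT)

/*
 * Clear the mask and set only the bit for 'node'.
 * Returns 0 on success, -1 if 'node' does not fit in the mask.
 */
int node_mask_single(unsigned long *mask, size_t nlongs, int node)
{
	assert(mask != NULL);

	if (node < 0 || (size_t)node >= nlongs * BITS_PER_ULONG)
		return -1;

	memset(mask, 0, nlongs * sizeof(*mask));
	mask[(size_t)node / BITS_PER_ULONG] |=
		1UL << ((size_t)node % BITS_PER_ULONG);
	return 0;
}

/*
 * Ask the kernel to back [addr, addr + len) with pages from 'node' only.
 * Returns 0 on success, -1 with errno set on failure.
 */
int bind_range_to_node(void *addr, size_t len, int node)
{
	unsigned long mask[MASK_LONGS];

	if (node_mask_single(mask, MASK_LONGS, node) < 0) {
		errno = EINVAL;
		return -1;
	}

	/* maxnode is passed as "number of bits + 1", as libnuma does. */
	return (int)syscall(SYS_mbind, addr, (unsigned long)len, MPOL_BIND,
			    mask,
			    (unsigned long)(MASK_LONGS * BITS_PER_ULONG + 1),
			    0U);
}
```

A caller would mmap() an anonymous region, call
bind_range_to_node(addr, len, node) before first touching the pages, and
check the return value. The call can fail, for example with ENOSYS on a
kernel built without NUMA support.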
diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig --- a/drivers/dax/Kconfig~dax-kmem-try-4 2019-01-08 09:54:44.051694874 -0800 +++ b/drivers/dax/Kconfig 2019-01-08 09:54:44.056694874 -0800 @@ -32,6 +32,11 @@ config DEV_DAX_PMEM Say M if unsure +config DEV_DAX_KMEM + def_bool y + depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure + depends on MEMORY_HOTPLUG # for add_memory() and friends + config DEV_DAX_PMEM_COMPAT tristate "PMEM DAX: support the deprecated /sys/class/dax interface" depends on DEV_DAX_PMEM diff -puN /dev/null drivers/dax/kmem.c --- /dev/null 2018-12-03 08:41:47.355756491 -0800 +++ b/drivers/dax/kmem.c 2019-01-08 09:54:44.056694874 -0800 @@ -0,0 +1,93 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */ +#include <linux/memremap.h> +#include <linux/pagemap.h> +#include <linux/memory.h> +#include <linux/module.h> +#include <linux/device.h> +#include <linux/pfn_t.h> +#include <linux/slab.h> +#include <linux/dax.h> +#include <linux/fs.h> +#include <linux/mm.h> +#include <linux/mman.h> +#include "dax-private.h" +#include "bus.h" + +int dev_dax_kmem_probe(struct device *dev) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + struct resource *res = &dev_dax->region->res; + resource_size_t kmem_start; + resource_size_t kmem_size; + struct resource *new_res; + int numa_node; + int rc; + + /* Hotplug starting at the beginning of the next block: */ + kmem_start = ALIGN(res->start, memory_block_size_bytes()); + + kmem_size = resource_size(res); + /* Adjust the size down to compensate for moving up kmem_start: */ + kmem_size -= kmem_start - res->start; + /* Align the size down to cover only complete blocks: */ + kmem_size &= ~(memory_block_size_bytes() - 1); + + new_res = devm_request_mem_region(dev, kmem_start, kmem_size, + dev_name(dev)); + + if (!new_res) { + printk("could not reserve region %016llx -> %016llx\n", + kmem_start, kmem_start+kmem_size); + return -EBUSY; + } + + /* + 
* Set flags appropriate for System RAM. Leave ..._BUSY clear + * so that add_memory() can add a child resource. + */ + new_res->flags = IORESOURCE_SYSTEM_RAM; + new_res->name = dev_name(dev); + + numa_node = dev_dax->target_node; + if (numa_node < 0) { + pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node); + numa_node = 0; + } + + rc = add_memory(numa_node, new_res->start, resource_size(new_res)); + if (rc) + return rc; + + return 0; +} +EXPORT_SYMBOL_GPL(dev_dax_kmem_probe); + +static int dev_dax_kmem_remove(struct device *dev) +{ + /* Assume that hot-remove will fail for now */ + return -EBUSY; +} + +static struct dax_device_driver device_dax_kmem_driver = { + .drv = { + .probe = dev_dax_kmem_probe, + .remove = dev_dax_kmem_remove, + }, +}; + +static int __init dax_kmem_init(void) +{ + return dax_driver_register(&device_dax_kmem_driver); +} + +static void __exit dax_kmem_exit(void) +{ + dax_driver_unregister(&device_dax_kmem_driver); +} + +MODULE_AUTHOR("Intel Corporation"); +MODULE_LICENSE("GPL v2"); +module_init(dax_kmem_init); +module_exit(dax_kmem_exit); +MODULE_ALIAS_DAX_DEVICE(0); diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile --- a/drivers/dax/Makefile~dax-kmem-try-4 2019-01-08 09:54:44.053694874 -0800 +++ b/drivers/dax/Makefile 2019-01-08 09:54:44.056694874 -0800 @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_DAX) += dax.o obj-$(CONFIG_DEV_DAX) += device_dax.o +obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o dax-y := super.o dax-y += bus.o
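
The start/size arithmetic in dev_dax_kmem_probe() can be tried out in
userspace. The sketch below mirrors the three steps from the patch (align
the start up, shrink the size by the amount skipped, round the size down
to whole blocks). The struct and function names are invented for this
illustration, the block size is a plain parameter here (the kernel gets
the real value from memory_block_size_bytes()), and the early return for
a region too small to hold a block is an addition that the patch does not
have.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the range that would be handed to add_memory(). */
struct kmem_range {
	uint64_t start;
	uint64_t size;
};

/* Round x up to the next multiple of a; a must be a power of two. */
uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Mirror of the probe-time arithmetic: given a device region and a
 * memory block size, compute the block-aligned sub-range.
 */
struct kmem_range hotplug_range(uint64_t res_start, uint64_t res_size,
				uint64_t block)
{
	struct kmem_range r = { 0, 0 };
	uint64_t skipped;

	/* The masking below only works for power-of-two block sizes. */
	assert(block != 0 && (block & (block - 1)) == 0);

	/* Hotplug starting at the beginning of the next block. */
	r.start = align_up(res_start, block);
	skipped = r.start - res_start;

	/* Not in the patch: avoid underflow when the region is tiny. */
	if (skipped >= res_size)
		return r;

	/* Adjust the size down to compensate for moving up the start. */
	r.size = res_size - skipped;
	/* Align the size down to cover only complete blocks. */
	r.size &= ~(block - 1);
	return r;
}
```

For example, with an assumed 128MB block,
hotplug_range(0x100200000, 0x40000000, 0x8000000) gives start 0x108000000
and size 0x38000000: 126MB is skipped at the front and 2MB is trimmed at
the end, so one block's worth of a 1GB region is lost.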