diff mbox series

[4/4] dax: "Hotplug" persistent memory for use like normal RAM

Message ID 20190116181905.12E102B4@viggo.jf.intel.com (mailing list archive)
State New, archived
Headers show
Series Allow persistent memory to be used like normal RAM | expand

Commit Message

Dave Hansen Jan. 16, 2019, 6:19 p.m. UTC
From: Dave Hansen <dave.hansen@linux.intel.com>

Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries.

However, this limits persistent memory use to applications which
*have* been modified.  To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed ad used just like
normal RAM would be.

To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:

	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

and then bind it to this new driver:

	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind

After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.

Note: this inherits any existing NUMA information for the newly-
added memory from the persistent memory device that came from the
firmware.  On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node.  That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.

There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches.  But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
---

 b/drivers/dax/Kconfig  |    5 ++
 b/drivers/dax/Makefile |    1 
 b/drivers/dax/kmem.c   |   93 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 99 insertions(+)

Comments

Bjorn Helgaas Jan. 16, 2019, 9:16 p.m. UTC | #1
On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Currently, a persistent memory region is "owned" by a device driver,
> either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
> allow applications to explicitly use persistent memory, generally
> by being modified to use special, new libraries.

Is there any documentation about exactly what persistent memory is?
In Documentation/, I see references to pstore and pmem, which sound
sort of similar, but maybe not quite the same?

> However, this limits persistent memory use to applications which
> *have* been modified.  To make it more broadly usable, this driver
> "hotplugs" memory into the kernel, to be managed ad used just like
> normal RAM would be.

s/ad/and/

> To make this work, management software must remove the device from
> being controlled by the "Device DAX" infrastructure:
>
>         echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>         echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>
> and then bind it to this new driver:
>
>         echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>         echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
>
> After this, there will be a number of new memory sections visible
> in sysfs that can be onlined, or that may get onlined by existing
> udev-initiated memory hotplug rules.
>
> Note: this inherits any existing NUMA information for the newly-
> added memory from the persistent memory device that came from the
> firmware.  On Intel platforms, the firmware has guarantees that
> require each socket's persistent memory to be in a separate
> memory-only NUMA node.  That means that this patch is not expected
> to create NUMA nodes, but will simply hotplug memory into existing
> nodes.
>
> There is currently some metadata at the beginning of pmem regions.
> The section-size memory hotplug restrictions, plus this small
> reserved area can cause the "loss" of a section or two of capacity.
> This should be fixable in follow-on patches.  But, as a first step,
> losing 256MB of memory (worst case) out of hundreds of gigabytes
> is a good tradeoff vs. the required code to fix this up precisely.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> Cc: Takashi Iwai <tiwai@suse.de>
> ---
>
>  b/drivers/dax/Kconfig  |    5 ++
>  b/drivers/dax/Makefile |    1
>  b/drivers/dax/kmem.c   |   93 +++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 99 insertions(+)
>
> diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
> --- a/drivers/dax/Kconfig~dax-kmem-try-4        2019-01-08 09:54:44.051694874 -0800
> +++ b/drivers/dax/Kconfig       2019-01-08 09:54:44.056694874 -0800
> @@ -32,6 +32,11 @@ config DEV_DAX_PMEM
>
>           Say M if unsure
>
> +config DEV_DAX_KMEM
> +       def_bool y

Is "y" the right default here?  I periodically see Linus complain
about new things defaulting to "on", but I admit I haven't paid enough
attention to know whether that would apply here.

> +       depends on DEV_DAX_PMEM   # Needs DEV_DAX_PMEM infrastructure
> +       depends on MEMORY_HOTPLUG # for add_memory() and friends
> +
>  config DEV_DAX_PMEM_COMPAT
>         tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
>         depends on DEV_DAX_PMEM
> diff -puN /dev/null drivers/dax/kmem.c
> --- /dev/null   2018-12-03 08:41:47.355756491 -0800
> +++ b/drivers/dax/kmem.c        2019-01-08 09:54:44.056694874 -0800
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
> +#include <linux/memremap.h>
> +#include <linux/pagemap.h>
> +#include <linux/memory.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/pfn_t.h>
> +#include <linux/slab.h>
> +#include <linux/dax.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include "dax-private.h"
> +#include "bus.h"
> +
> +int dev_dax_kmem_probe(struct device *dev)
> +{
> +       struct dev_dax *dev_dax = to_dev_dax(dev);
> +       struct resource *res = &dev_dax->region->res;
> +       resource_size_t kmem_start;
> +       resource_size_t kmem_size;
> +       struct resource *new_res;
> +       int numa_node;
> +       int rc;
> +
> +       /* Hotplug starting at the beginning of the next block: */
> +       kmem_start = ALIGN(res->start, memory_block_size_bytes());
> +
> +       kmem_size = resource_size(res);
> +       /* Adjust the size down to compensate for moving up kmem_start: */
> +        kmem_size -= kmem_start - res->start;
> +       /* Align the size down to cover only complete blocks: */
> +       kmem_size &= ~(memory_block_size_bytes() - 1);
> +
> +       new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
> +                                         dev_name(dev));
> +
> +       if (!new_res) {
> +               printk("could not reserve region %016llx -> %016llx\n",
> +                               kmem_start, kmem_start+kmem_size);

1) It'd be nice to have some sort of module tag in the output that
ties it to this driver.

2) It might be nice to print the range in the same format as %pR,
i.e., "[mem %#010x-%#010x]" with the end included (start + size -1 ).

> +               return -EBUSY;
> +       }
> +
> +       /*
> +        * Set flags appropriate for System RAM.  Leave ..._BUSY clear
> +        * so that add_memory() can add a child resource.
> +        */
> +       new_res->flags = IORESOURCE_SYSTEM_RAM;

IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
devm_request_mem_region() path.  I think you should keep at least
IORESOURCE_MEM so the iomem_resource tree stays consistent.

> +       new_res->name = dev_name(dev);
> +
> +       numa_node = dev_dax->target_node;
> +       if (numa_node < 0) {
> +               pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);

It'd be nice to again have a module tag and an indication of what
range is affected, e.g., %pR of new_res.

You don't save the new_res pointer anywhere, which I guess you intend
for now since there's no remove or anything else to do with this
resource?  I thought maybe devm_request_mem_region() would implicitly
save it, but it doesn't; it only saves the parent (iomem_resource, the
start (kmem_start), and the size (kmem_size)).

> +               numa_node = 0;
> +       }
> +
> +       rc = add_memory(numa_node, new_res->start, resource_size(new_res));
> +       if (rc)
> +               return rc;
> +
> +       return 0;

Doesn't this mean "return rc" or even just "return add_memory(...)"?

> +}
> +EXPORT_SYMBOL_GPL(dev_dax_kmem_probe);
> +
> +static int dev_dax_kmem_remove(struct device *dev)
> +{
> +       /* Assume that hot-remove will fail for now */
> +       return -EBUSY;
> +}
> +
> +static struct dax_device_driver device_dax_kmem_driver = {
> +       .drv = {
> +               .probe = dev_dax_kmem_probe,
> +               .remove = dev_dax_kmem_remove,
> +       },
> +};
> +
> +static int __init dax_kmem_init(void)
> +{
> +       return dax_driver_register(&device_dax_kmem_driver);
> +}
> +
> +static void __exit dax_kmem_exit(void)
> +{
> +       dax_driver_unregister(&device_dax_kmem_driver);
> +}
> +
> +MODULE_AUTHOR("Intel Corporation");
> +MODULE_LICENSE("GPL v2");
> +module_init(dax_kmem_init);
> +module_exit(dax_kmem_exit);
> +MODULE_ALIAS_DAX_DEVICE(0);
> diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
> --- a/drivers/dax/Makefile~dax-kmem-try-4       2019-01-08 09:54:44.053694874 -0800
> +++ b/drivers/dax/Makefile      2019-01-08 09:54:44.056694874 -0800
> @@ -1,6 +1,7 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-$(CONFIG_DAX) += dax.o
>  obj-$(CONFIG_DEV_DAX) += device_dax.o
> +obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
>
>  dax-y := super.o
>  dax-y += bus.o
> _
Dan Williams Jan. 16, 2019, 9:31 p.m. UTC | #2
On Wed, Jan 16, 2019 at 10:25 AM Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Currently, a persistent memory region is "owned" by a device driver,
> either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
> allow applications to explicitly use persistent memory, generally
> by being modified to use special, new libraries.
>
> However, this limits persistent memory use to applications which
> *have* been modified.  To make it more broadly usable, this driver
> "hotplugs" memory into the kernel, to be managed ad used just like
> normal RAM would be.
>
> To make this work, management software must remove the device from
> being controlled by the "Device DAX" infrastructure:
>
>         echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>         echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>
> and then bind it to this new driver:
>
>         echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>         echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
>
> After this, there will be a number of new memory sections visible
> in sysfs that can be onlined, or that may get onlined by existing
> udev-initiated memory hotplug rules.
>
> Note: this inherits any existing NUMA information for the newly-
> added memory from the persistent memory device that came from the
> firmware.  On Intel platforms, the firmware has guarantees that
> require each socket's persistent memory to be in a separate
> memory-only NUMA node.  That means that this patch is not expected
> to create NUMA nodes, but will simply hotplug memory into existing
> nodes.
>
> There is currently some metadata at the beginning of pmem regions.
> The section-size memory hotplug restrictions, plus this small
> reserved area can cause the "loss" of a section or two of capacity.
> This should be fixable in follow-on patches.  But, as a first step,
> losing 256MB of memory (worst case) out of hundreds of gigabytes
> is a good tradeoff vs. the required code to fix this up precisely.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> Cc: Takashi Iwai <tiwai@suse.de>
> ---
>
>  b/drivers/dax/Kconfig  |    5 ++
>  b/drivers/dax/Makefile |    1
>  b/drivers/dax/kmem.c   |   93 +++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 99 insertions(+)
>
> diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
> --- a/drivers/dax/Kconfig~dax-kmem-try-4        2019-01-08 09:54:44.051694874 -0800
> +++ b/drivers/dax/Kconfig       2019-01-08 09:54:44.056694874 -0800
> @@ -32,6 +32,11 @@ config DEV_DAX_PMEM
>
>           Say M if unsure
>
> +config DEV_DAX_KMEM
> +       def_bool y
> +       depends on DEV_DAX_PMEM   # Needs DEV_DAX_PMEM infrastructure
> +       depends on MEMORY_HOTPLUG # for add_memory() and friends
> +

I think this should be:

config DEV_DAX_KMEM
       tristate "<kmem title>"
       depends on DEV_DAX
       default DEV_DAX
       depends on MEMORY_HOTPLUG # for add_memory() and friends
       help
           <kmem description>

...because the DEV_DAX_KMEM implementation with the device-DAX reworks
is independent of pmem. It just so happens that pmem is the only
source for device-DAX instances, but that need not always be the case
and kmem is device-DAX origin generic.

>  config DEV_DAX_PMEM_COMPAT
>         tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
>         depends on DEV_DAX_PMEM
> diff -puN /dev/null drivers/dax/kmem.c
> --- /dev/null   2018-12-03 08:41:47.355756491 -0800
> +++ b/drivers/dax/kmem.c        2019-01-08 09:54:44.056694874 -0800
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
> +#include <linux/memremap.h>
> +#include <linux/pagemap.h>
> +#include <linux/memory.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/pfn_t.h>
> +#include <linux/slab.h>
> +#include <linux/dax.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include "dax-private.h"
> +#include "bus.h"
> +
> +int dev_dax_kmem_probe(struct device *dev)
> +{
> +       struct dev_dax *dev_dax = to_dev_dax(dev);
> +       struct resource *res = &dev_dax->region->res;
> +       resource_size_t kmem_start;
> +       resource_size_t kmem_size;
> +       struct resource *new_res;
> +       int numa_node;
> +       int rc;
> +
> +       /* Hotplug starting at the beginning of the next block: */
> +       kmem_start = ALIGN(res->start, memory_block_size_bytes());
> +
> +       kmem_size = resource_size(res);
> +       /* Adjust the size down to compensate for moving up kmem_start: */
> +        kmem_size -= kmem_start - res->start;
> +       /* Align the size down to cover only complete blocks: */
> +       kmem_size &= ~(memory_block_size_bytes() - 1);
> +
> +       new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
> +                                         dev_name(dev));
> +
> +       if (!new_res) {
> +               printk("could not reserve region %016llx -> %016llx\n",
> +                               kmem_start, kmem_start+kmem_size);

dev_err() please.

> +               return -EBUSY;
> +       }
> +
> +       /*
> +        * Set flags appropriate for System RAM.  Leave ..._BUSY clear
> +        * so that add_memory() can add a child resource.
> +        */
> +       new_res->flags = IORESOURCE_SYSTEM_RAM;
> +       new_res->name = dev_name(dev);
> +
> +       numa_node = dev_dax->target_node;
> +       if (numa_node < 0) {
> +               pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);

I think this should be dev_info(dev, "no numa_node, defaulting to
0\n"), or dev_dbg():

1/ so we can backtrack which device is missing numa information
2/ NUMA_NO_NODE may be a common occurrence so it's not really a "warn"
level concern afaics.
3/ no real need for _once I don't see this as a log spam risk.

> +               numa_node = 0;
> +       }
> +
> +       rc = add_memory(numa_node, new_res->start, resource_size(new_res));
> +       if (rc)
> +               return rc;
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(dev_dax_kmem_probe);

No need to export this afaics.

> +
> +static int dev_dax_kmem_remove(struct device *dev)
> +{
> +       /* Assume that hot-remove will fail for now */
> +       return -EBUSY;
> +}
> +
> +static struct dax_device_driver device_dax_kmem_driver = {
> +       .drv = {
> +               .probe = dev_dax_kmem_probe,
> +               .remove = dev_dax_kmem_remove,
> +       },
> +};
> +
> +static int __init dax_kmem_init(void)
> +{
> +       return dax_driver_register(&device_dax_kmem_driver);
> +}
> +
> +static void __exit dax_kmem_exit(void)
> +{
> +       dax_driver_unregister(&device_dax_kmem_driver);
> +}
> +
> +MODULE_AUTHOR("Intel Corporation");
> +MODULE_LICENSE("GPL v2");
> +module_init(dax_kmem_init);
> +module_exit(dax_kmem_exit);
> +MODULE_ALIAS_DAX_DEVICE(0);
> diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
> --- a/drivers/dax/Makefile~dax-kmem-try-4       2019-01-08 09:54:44.053694874 -0800
> +++ b/drivers/dax/Makefile      2019-01-08 09:54:44.056694874 -0800
> @@ -1,6 +1,7 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-$(CONFIG_DAX) += dax.o
>  obj-$(CONFIG_DEV_DAX) += device_dax.o
> +obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
>
>  dax-y := super.o
>  dax-y += bus.o
> _
Dave Hansen Jan. 16, 2019, 9:40 p.m. UTC | #3
On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
> On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>> Currently, a persistent memory region is "owned" by a device driver,
>> either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
>> allow applications to explicitly use persistent memory, generally
>> by being modified to use special, new libraries.
> 
> Is there any documentation about exactly what persistent memory is?
> In Documentation/, I see references to pstore and pmem, which sound
> sort of similar, but maybe not quite the same?

One instance of persistent memory is nonvolatile DIMMS.  They're
described in great detail here: Documentation/nvdimm/nvdimm.txt

>> +config DEV_DAX_KMEM
>> +       def_bool y
> 
> Is "y" the right default here?  I periodically see Linus complain
> about new things defaulting to "on", but I admit I haven't paid enough
> attention to know whether that would apply here.
> 
>> +       depends on DEV_DAX_PMEM   # Needs DEV_DAX_PMEM infrastructure
>> +       depends on MEMORY_HOTPLUG # for add_memory() and friends

Well, it doesn't default to "on for everyone".  It inherits the state of
DEV_DAX_PMEM so it's only foisted on folks who have already opted in to
generic pmem support.

>> +int dev_dax_kmem_probe(struct device *dev)
>> +{
>> +       struct dev_dax *dev_dax = to_dev_dax(dev);
>> +       struct resource *res = &dev_dax->region->res;
>> +       resource_size_t kmem_start;
>> +       resource_size_t kmem_size;
>> +       struct resource *new_res;
>> +       int numa_node;
>> +       int rc;
>> +
>> +       /* Hotplug starting at the beginning of the next block: */
>> +       kmem_start = ALIGN(res->start, memory_block_size_bytes());
>> +
>> +       kmem_size = resource_size(res);
>> +       /* Adjust the size down to compensate for moving up kmem_start: */
>> +        kmem_size -= kmem_start - res->start;
>> +       /* Align the size down to cover only complete blocks: */
>> +       kmem_size &= ~(memory_block_size_bytes() - 1);
>> +
>> +       new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
>> +                                         dev_name(dev));
>> +
>> +       if (!new_res) {
>> +               printk("could not reserve region %016llx -> %016llx\n",
>> +                               kmem_start, kmem_start+kmem_size);
> 
> 1) It'd be nice to have some sort of module tag in the output that
> ties it to this driver.

Good point.  That should probably be a dev_printk().

> 2) It might be nice to print the range in the same format as %pR,
> i.e., "[mem %#010x-%#010x]" with the end included (start + size -1 ).

Sure, that sounds like a sane thing to do as well.

>> +               return -EBUSY;
>> +       }
>> +
>> +       /*
>> +        * Set flags appropriate for System RAM.  Leave ..._BUSY clear
>> +        * so that add_memory() can add a child resource.
>> +        */
>> +       new_res->flags = IORESOURCE_SYSTEM_RAM;
> 
> IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> devm_request_mem_region() path.  I think you should keep at least
> IORESOURCE_MEM so the iomem_resource tree stays consistent.
> 
>> +       new_res->name = dev_name(dev);
>> +
>> +       numa_node = dev_dax->target_node;
>> +       if (numa_node < 0) {
>> +               pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
> 
> It'd be nice to again have a module tag and an indication of what
> range is affected, e.g., %pR of new_res.
> 
> You don't save the new_res pointer anywhere, which I guess you intend
> for now since there's no remove or anything else to do with this
> resource?  I thought maybe devm_request_mem_region() would implicitly
> save it, but it doesn't; it only saves the parent (iomem_resource, the
> start (kmem_start), and the size (kmem_size)).

Yeah, that's the intention: removal is currently not supported.  I'll
add a comment to clarify.

>> +               numa_node = 0;
>> +       }
>> +
>> +       rc = add_memory(numa_node, new_res->start, resource_size(new_res));
>> +       if (rc)
>> +               return rc;
>> +
>> +       return 0;
> 
> Doesn't this mean "return rc" or even just "return add_memory(...)"?

Yeah, all of those are equivalent.  I guess I just prefer the explicit
error handling path.
Bjorn Helgaas Jan. 16, 2019, 9:44 p.m. UTC | #4
On Wed, Jan 16, 2019 at 3:40 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
> > On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen
> > <dave.hansen@linux.intel.com> wrote:
> >> From: Dave Hansen <dave.hansen@linux.intel.com>
> >> Currently, a persistent memory region is "owned" by a device driver,
> >> either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
> >> allow applications to explicitly use persistent memory, generally
> >> by being modified to use special, new libraries.
> >
> > Is there any documentation about exactly what persistent memory is?
> > In Documentation/, I see references to pstore and pmem, which sound
> > sort of similar, but maybe not quite the same?
>
> One instance of persistent memory is nonvolatile DIMMS.  They're
> described in great detail here: Documentation/nvdimm/nvdimm.txt

Thanks!  Some bread crumbs in the changelog to lead there would be great.

Bjorn
Dave Hansen Jan. 16, 2019, 9:53 p.m. UTC | #5
On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
>> +       /*
>> +        * Set flags appropriate for System RAM.  Leave ..._BUSY clear
>> +        * so that add_memory() can add a child resource.
>> +        */
>> +       new_res->flags = IORESOURCE_SYSTEM_RAM;
> IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> devm_request_mem_region() path.  I think you should keep at least
> IORESOURCE_MEM so the iomem_resource tree stays consistent.

I went to look at fixing this.  It looks like "IORESOURCE_SYSTEM_RAM"
includes IORESOURCE_MEM:

> #define IORESOURCE_SYSTEM_RAM           (IORESOURCE_MEM|IORESOURCE_SYSRAM)

Did you want the patch to expand this #define, or did you just want to
ensure that IORESORUCE_MEM got in there somehow?
Bjorn Helgaas Jan. 16, 2019, 9:59 p.m. UTC | #6
On Wed, Jan 16, 2019 at 3:53 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
> >> +       /*
> >> +        * Set flags appropriate for System RAM.  Leave ..._BUSY clear
> >> +        * so that add_memory() can add a child resource.
> >> +        */
> >> +       new_res->flags = IORESOURCE_SYSTEM_RAM;
> > IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> > devm_request_mem_region() path.  I think you should keep at least
> > IORESOURCE_MEM so the iomem_resource tree stays consistent.
>
> I went to look at fixing this.  It looks like "IORESOURCE_SYSTEM_RAM"
> includes IORESOURCE_MEM:
>
> > #define IORESOURCE_SYSTEM_RAM           (IORESOURCE_MEM|IORESOURCE_SYSRAM)
>
> Did you want the patch to expand this #define, or did you just want to
> ensure that IORESORUCE_MEM got in there somehow?

The latter.  Since it's already included, forget I said anything :)
Although if your intent is only to clear IORESOURCE_BUSY, maybe it
would be safer to just clear that bit instead of overwriting
everything?  That might also help people grepping for IORESOURCE_BUSY
usage.
Dan Williams Jan. 16, 2019, 10:06 p.m. UTC | #7
On Wed, Jan 16, 2019 at 1:40 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
> > On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen
> > <dave.hansen@linux.intel.com> wrote:
> >> From: Dave Hansen <dave.hansen@linux.intel.com>
> >> Currently, a persistent memory region is "owned" by a device driver,
> >> either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
> >> allow applications to explicitly use persistent memory, generally
> >> by being modified to use special, new libraries.
> >
> > Is there any documentation about exactly what persistent memory is?
> > In Documentation/, I see references to pstore and pmem, which sound
> > sort of similar, but maybe not quite the same?
>
> One instance of persistent memory is nonvolatile DIMMS.  They're
> described in great detail here: Documentation/nvdimm/nvdimm.txt
>
> >> +config DEV_DAX_KMEM
> >> +       def_bool y
> >
> > Is "y" the right default here?  I periodically see Linus complain
> > about new things defaulting to "on", but I admit I haven't paid enough
> > attention to know whether that would apply here.
> >
> >> +       depends on DEV_DAX_PMEM   # Needs DEV_DAX_PMEM infrastructure
> >> +       depends on MEMORY_HOTPLUG # for add_memory() and friends
>
> Well, it doesn't default to "on for everyone".  It inherits the state of
> DEV_DAX_PMEM so it's only foisted on folks who have already opted in to
> generic pmem support.
>
> >> +int dev_dax_kmem_probe(struct device *dev)
> >> +{
> >> +       struct dev_dax *dev_dax = to_dev_dax(dev);
> >> +       struct resource *res = &dev_dax->region->res;
> >> +       resource_size_t kmem_start;
> >> +       resource_size_t kmem_size;
> >> +       struct resource *new_res;
> >> +       int numa_node;
> >> +       int rc;
> >> +
> >> +       /* Hotplug starting at the beginning of the next block: */
> >> +       kmem_start = ALIGN(res->start, memory_block_size_bytes());
> >> +
> >> +       kmem_size = resource_size(res);
> >> +       /* Adjust the size down to compensate for moving up kmem_start: */
> >> +        kmem_size -= kmem_start - res->start;
> >> +       /* Align the size down to cover only complete blocks: */
> >> +       kmem_size &= ~(memory_block_size_bytes() - 1);
> >> +
> >> +       new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
> >> +                                         dev_name(dev));
> >> +
> >> +       if (!new_res) {
> >> +               printk("could not reserve region %016llx -> %016llx\n",
> >> +                               kmem_start, kmem_start+kmem_size);
> >
> > 1) It'd be nice to have some sort of module tag in the output that
> > ties it to this driver.
>
> Good point.  That should probably be a dev_printk().
>
> > 2) It might be nice to print the range in the same format as %pR,
> > i.e., "[mem %#010x-%#010x]" with the end included (start + size -1 ).
>
> Sure, that sounds like a sane thing to do as well.

Does %pR protect physical address disclosure to non-root by default?
At least the pmem driver is using %pR rather than manually printing
raw physical address values, but you would need to create a local
modified version of the passed in resource.

> >> +               return -EBUSY;
> >> +       }
> >> +
> >> +       /*
> >> +        * Set flags appropriate for System RAM.  Leave ..._BUSY clear
> >> +        * so that add_memory() can add a child resource.
> >> +        */
> >> +       new_res->flags = IORESOURCE_SYSTEM_RAM;
> >
> > IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> > devm_request_mem_region() path.  I think you should keep at least
> > IORESOURCE_MEM so the iomem_resource tree stays consistent.
> >
> >> +       new_res->name = dev_name(dev);
> >> +
> >> +       numa_node = dev_dax->target_node;
> >> +       if (numa_node < 0) {
> >> +               pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
> >
> > It'd be nice to again have a module tag and an indication of what
> > range is affected, e.g., %pR of new_res.
> >
> > You don't save the new_res pointer anywhere, which I guess you intend
> > for now since there's no remove or anything else to do with this
> > resource?  I thought maybe devm_request_mem_region() would implicitly
> > save it, but it doesn't; it only saves the parent (iomem_resource, the
> > start (kmem_start), and the size (kmem_size)).
>
> Yeah, that's the intention: removal is currently not supported.  I'll
> add a comment to clarify.

I would clarify that *driver* removal is supported because there's no
Linux facility for drivers to fail removal (nothing checks the return
code from ->remove()). Instead the protection is that the resource
must remain pinned forever. In that case devm_request_mem_region() is
the wrong function to use. You want to explicitly use the non-devm
request_mem_region() and purposely leak it to keep the memory reserved
indefinitely.
Du, Fan Jan. 17, 2019, 5:21 a.m. UTC | #8
>-----Original Message-----
>From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf
>Of Dave Hansen
>Sent: Thursday, January 17, 2019 2:19 AM
>To: dave@sr71.net
>Cc: thomas.lendacky@amd.com; mhocko@suse.com;
>linux-nvdimm@lists.01.org; tiwai@suse.de; Dave Hansen
><dave.hansen@linux.intel.com>; Huang, Ying <ying.huang@intel.com>;
>linux-kernel@vger.kernel.org; linux-mm@kvack.org; bp@suse.de;
>baiyaowei@cmss.chinamobile.com; zwisler@kernel.org;
>bhelgaas@google.com; Wu, Fengguang <fengguang.wu@intel.com>;
>akpm@linux-foundation.org
>Subject: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal
>RAM
>
>
>From: Dave Hansen <dave.hansen@linux.intel.com>
>
>Currently, a persistent memory region is "owned" by a device driver,
>either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
>allow applications to explicitly use persistent memory, generally
>by being modified to use special, new libraries.
>
>However, this limits persistent memory use to applications which
>*have* been modified.  To make it more broadly usable, this driver
>"hotplugs" memory into the kernel, to be managed ad used just like
>normal RAM would be.
>
>To make this work, management software must remove the device from
>being controlled by the "Device DAX" infrastructure:
>
>	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>
>and then bind it to this new driver:
>
>	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind

Is there any plan to introduce additional mode, e.g. "kmem" in the userspace
ndctl tool to do the configuration?

>After this, there will be a number of new memory sections visible
>in sysfs that can be onlined, or that may get onlined by existing
>udev-initiated memory hotplug rules.
>
>Note: this inherits any existing NUMA information for the newly-
>added memory from the persistent memory device that came from the
>firmware.  On Intel platforms, the firmware has guarantees that
>require each socket's persistent memory to be in a separate
>memory-only NUMA node.  That means that this patch is not expected
>to create NUMA nodes, but will simply hotplug memory into existing
>nodes.
>
>There is currently some metadata at the beginning of pmem regions.
>The section-size memory hotplug restrictions, plus this small
>reserved area can cause the "loss" of a section or two of capacity.
>This should be fixable in follow-on patches.  But, as a first step,
>losing 256MB of memory (worst case) out of hundreds of gigabytes
>is a good tradeoff vs. the required code to fix this up precisely.
>
>Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>Cc: Dan Williams <dan.j.williams@intel.com>
>Cc: Dave Jiang <dave.jiang@intel.com>
>Cc: Ross Zwisler <zwisler@kernel.org>
>Cc: Vishal Verma <vishal.l.verma@intel.com>
>Cc: Tom Lendacky <thomas.lendacky@amd.com>
>Cc: Andrew Morton <akpm@linux-foundation.org>
>Cc: Michal Hocko <mhocko@suse.com>
>Cc: linux-nvdimm@lists.01.org
>Cc: linux-kernel@vger.kernel.org
>Cc: linux-mm@kvack.org
>Cc: Huang Ying <ying.huang@intel.com>
>Cc: Fengguang Wu <fengguang.wu@intel.com>
>Cc: Borislav Petkov <bp@suse.de>
>Cc: Bjorn Helgaas <bhelgaas@google.com>
>Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
>Cc: Takashi Iwai <tiwai@suse.de>
>---
>
> b/drivers/dax/Kconfig  |    5 ++
> b/drivers/dax/Makefile |    1
> b/drivers/dax/kmem.c   |   93
>+++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 99 insertions(+)
>
>diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
>--- a/drivers/dax/Kconfig~dax-kmem-try-4	2019-01-08 09:54:44.051694874
>-0800
>+++ b/drivers/dax/Kconfig	2019-01-08 09:54:44.056694874 -0800
>@@ -32,6 +32,11 @@ config DEV_DAX_PMEM
>
> 	  Say M if unsure
>
>+config DEV_DAX_KMEM
>+	def_bool y
>+	depends on DEV_DAX_PMEM   # Needs DEV_DAX_PMEM infrastructure
>+	depends on MEMORY_HOTPLUG # for add_memory() and friends
>+
> config DEV_DAX_PMEM_COMPAT
> 	tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
> 	depends on DEV_DAX_PMEM
>diff -puN /dev/null drivers/dax/kmem.c
>--- /dev/null	2018-12-03 08:41:47.355756491 -0800
>+++ b/drivers/dax/kmem.c	2019-01-08 09:54:44.056694874 -0800
>@@ -0,0 +1,93 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
>+#include <linux/memremap.h>
>+#include <linux/pagemap.h>
>+#include <linux/memory.h>
>+#include <linux/module.h>
>+#include <linux/device.h>
>+#include <linux/pfn_t.h>
>+#include <linux/slab.h>
>+#include <linux/dax.h>
>+#include <linux/fs.h>
>+#include <linux/mm.h>
>+#include <linux/mman.h>
>+#include "dax-private.h"
>+#include "bus.h"
>+
>+int dev_dax_kmem_probe(struct device *dev)
>+{
>+	struct dev_dax *dev_dax = to_dev_dax(dev);
>+	struct resource *res = &dev_dax->region->res;
>+	resource_size_t kmem_start;
>+	resource_size_t kmem_size;
>+	struct resource *new_res;
>+	int numa_node;
>+	int rc;
>+
>+	/* Hotplug starting at the beginning of the next block: */
>+	kmem_start = ALIGN(res->start, memory_block_size_bytes());
>+
>+	kmem_size = resource_size(res);
>+	/* Adjust the size down to compensate for moving up kmem_start: */
>+        kmem_size -= kmem_start - res->start;
>+	/* Align the size down to cover only complete blocks: */
>+	kmem_size &= ~(memory_block_size_bytes() - 1);
>+
>+	new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
>+					  dev_name(dev));
>+
>+	if (!new_res) {
>+		printk("could not reserve region %016llx -> %016llx\n",
>+				kmem_start, kmem_start+kmem_size);
>+		return -EBUSY;
>+	}
>+
>+	/*
>+	 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
>+	 * so that add_memory() can add a child resource.
>+	 */
>+	new_res->flags = IORESOURCE_SYSTEM_RAM;
>+	new_res->name = dev_name(dev);
>+
>+	numa_node = dev_dax->target_node;
>+	if (numa_node < 0) {
>+		pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
>+		numa_node = 0;
>+	}
>+
>+	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
>+	if (rc)
>+		return rc;
>+
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(dev_dax_kmem_probe);
>+
>+static int dev_dax_kmem_remove(struct device *dev)
>+{
>+	/* Assume that hot-remove will fail for now */
>+	return -EBUSY;
>+}
>+
>+static struct dax_device_driver device_dax_kmem_driver = {
>+	.drv = {
>+		.probe = dev_dax_kmem_probe,
>+		.remove = dev_dax_kmem_remove,
>+	},
>+};
>+
>+static int __init dax_kmem_init(void)
>+{
>+	return dax_driver_register(&device_dax_kmem_driver);
>+}
>+
>+static void __exit dax_kmem_exit(void)
>+{
>+	dax_driver_unregister(&device_dax_kmem_driver);
>+}
>+
>+MODULE_AUTHOR("Intel Corporation");
>+MODULE_LICENSE("GPL v2");
>+module_init(dax_kmem_init);
>+module_exit(dax_kmem_exit);
>+MODULE_ALIAS_DAX_DEVICE(0);
>diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
>--- a/drivers/dax/Makefile~dax-kmem-try-4	2019-01-08 09:54:44.053694874
>-0800
>+++ b/drivers/dax/Makefile	2019-01-08 09:54:44.056694874 -0800
>@@ -1,6 +1,7 @@
> # SPDX-License-Identifier: GPL-2.0
> obj-$(CONFIG_DAX) += dax.o
> obj-$(CONFIG_DEV_DAX) += device_dax.o
>+obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
>
> dax-y := super.o
> dax-y += bus.o
>_
>_______________________________________________
>Linux-nvdimm mailing list
>Linux-nvdimm@lists.01.org
>https://lists.01.org/mailman/listinfo/linux-nvdimm
Yanmin Zhang Jan. 17, 2019, 8:19 a.m. UTC | #9
On 2019/1/17 上午2:19, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Currently, a persistent memory region is "owned" by a device driver,
> either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
> allow applications to explicitly use persistent memory, generally
> by being modified to use special, new libraries.
> 
> However, this limits persistent memory use to applications which
> *have* been modified.  To make it more broadly usable, this driver
> "hotplugs" memory into the kernel, to be managed ad used just like
> normal RAM would be.
> 
> To make this work, management software must remove the device from
> being controlled by the "Device DAX" infrastructure:
> 
> 	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> 	echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> 
> and then bind it to this new driver:
> 
> 	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> 	echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
> 
> After this, there will be a number of new memory sections visible
> in sysfs that can be onlined, or that may get onlined by existing
> udev-initiated memory hotplug rules.
> 
> Note: this inherits any existing NUMA information for the newly-
> added memory from the persistent memory device that came from the
> firmware.  On Intel platforms, the firmware has guarantees that
> require each socket's persistent memory to be in a separate
> memory-only NUMA node.  That means that this patch is not expected
> to create NUMA nodes, but will simply hotplug memory into existing
> nodes.
> 
> There is currently some metadata at the beginning of pmem regions.
> The section-size memory hotplug restrictions, plus this small
> reserved area can cause the "loss" of a section or two of capacity.
> This should be fixable in follow-on patches.  But, as a first step,
> losing 256MB of memory (worst case) out of hundreds of gigabytes
> is a good tradeoff vs. the required code to fix this up precisely.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> Cc: Takashi Iwai <tiwai@suse.de>
> ---
> 
>   b/drivers/dax/Kconfig  |    5 ++
>   b/drivers/dax/Makefile |    1
>   b/drivers/dax/kmem.c   |   93 +++++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 99 insertions(+)
> 
> diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
> --- a/drivers/dax/Kconfig~dax-kmem-try-4	2019-01-08 09:54:44.051694874 -0800
> +++ b/drivers/dax/Kconfig	2019-01-08 09:54:44.056694874 -0800
> @@ -32,6 +32,11 @@ config DEV_DAX_PMEM
>   
>   	  Say M if unsure
>   
> +config DEV_DAX_KMEM
> +	def_bool y
> +	depends on DEV_DAX_PMEM   # Needs DEV_DAX_PMEM infrastructure
> +	depends on MEMORY_HOTPLUG # for add_memory() and friends
> +
>   config DEV_DAX_PMEM_COMPAT
>   	tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
>   	depends on DEV_DAX_PMEM
> diff -puN /dev/null drivers/dax/kmem.c
> --- /dev/null	2018-12-03 08:41:47.355756491 -0800
> +++ b/drivers/dax/kmem.c	2019-01-08 09:54:44.056694874 -0800
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
> +#include <linux/memremap.h>
> +#include <linux/pagemap.h>
> +#include <linux/memory.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/pfn_t.h>
> +#include <linux/slab.h>
> +#include <linux/dax.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include "dax-private.h"
> +#include "bus.h"
> +
> +int dev_dax_kmem_probe(struct device *dev)
> +{
> +	struct dev_dax *dev_dax = to_dev_dax(dev);
> +	struct resource *res = &dev_dax->region->res;
> +	resource_size_t kmem_start;
> +	resource_size_t kmem_size;
> +	struct resource *new_res;
> +	int numa_node;
> +	int rc;
> +
> +	/* Hotplug starting at the beginning of the next block: */
> +	kmem_start = ALIGN(res->start, memory_block_size_bytes());
> +
> +	kmem_size = resource_size(res);
> +	/* Adjust the size down to compensate for moving up kmem_start: */
> +        kmem_size -= kmem_start - res->start;
> +	/* Align the size down to cover only complete blocks: */
> +	kmem_size &= ~(memory_block_size_bytes() - 1);
> +
> +	new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
> +					  dev_name(dev));
> +
> +	if (!new_res) {
> +		printk("could not reserve region %016llx -> %016llx\n",
> +				kmem_start, kmem_start+kmem_size);
> +		return -EBUSY;
> +	}
> +
> +	/*
> +	 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
> +	 * so that add_memory() can add a child resource.
> +	 */
> +	new_res->flags = IORESOURCE_SYSTEM_RAM;
> +	new_res->name = dev_name(dev);
> +
> +	numa_node = dev_dax->target_node;
> +	if (numa_node < 0) {
> +		pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
> +		numa_node = 0;
> +	}
> +
> +	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
I didn't try pmem and I am wondering it's slower than DRAM.
Should a flag, such like _GFP_PMEM, be added to distinguish it from
DRAM?

If it's used for DMA, perhaps it might not satisfy device DMA request on 
time?
Dave Hansen Jan. 17, 2019, 3:17 p.m. UTC | #10
On 1/17/19 12:19 AM, Yanmin Zhang wrote:
>>
> I didn't try pmem and I am wondering it's slower than DRAM.
> Should a flag, such like _GFP_PMEM, be added to distinguish it from
> DRAM?

Absolutely not. :)

We already have performance-differentiated memory, and lots of ways to
enumerate and select it in the kernel (all of our NUMA infrastructure).

PMEM is also just the first of many "kinds" of memory that folks want to
build in systems and use a "RAM".  We literally don't have space to put
a flag in for each type.
Dan Williams Jan. 17, 2019, 4:56 p.m. UTC | #11
On Wed, Jan 16, 2019 at 9:21 PM Du, Fan <fan.du@intel.com> wrote:
[..]
> >From: Dave Hansen <dave.hansen@linux.intel.com>
> >
> >Currently, a persistent memory region is "owned" by a device driver,
> >either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
> >allow applications to explicitly use persistent memory, generally
> >by being modified to use special, new libraries.
> >
> >However, this limits persistent memory use to applications which
> >*have* been modified.  To make it more broadly usable, this driver
> >"hotplugs" memory into the kernel, to be managed ad used just like
> >normal RAM would be.
> >
> >To make this work, management software must remove the device from
> >being controlled by the "Device DAX" infrastructure:
> >
> >       echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> >       echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> >
> >and then bind it to this new driver:
> >
> >       echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> >       echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
>
> Is there any plan to introduce additional mode, e.g. "kmem" in the userspace
> ndctl tool to do the configuration?
>

Yes, but not to ndctl. The daxctl tool will grow a helper for this.
The policy of what device-dax instances should be hotplugged at system
init will be managed by a persistent configuration file and udev
rules.
Yanmin Zhang Jan. 18, 2019, 7:47 a.m. UTC | #12
On 2019/1/17 下午11:17, Dave Hansen wrote:
> On 1/17/19 12:19 AM, Yanmin Zhang wrote:
>>>
>> I didn't try pmem and I am wondering it's slower than DRAM.
>> Should a flag, such like _GFP_PMEM, be added to distinguish it from
>> DRAM?
> 
> Absolutely not. :)
Agree.

> 
> We already have performance-differentiated memory, and lots of ways to
> enumerate and select it in the kernel (all of our NUMA infrastructure).
Kernel does manage memory like what you say.
My question is: with your patch, PMEM becomes normal RAM, then there is
a chance for kernel to allocate PMEM as DMA buffer.
Some super speed devices like 10Giga NIC, USB (SSIC connecting modem), 
might not work well if DMA buffer is in PMEM as it's slower than DRAM.

Should your patchset consider it?



> 
> PMEM is also just the first of many "kinds" of memory that folks want to
> build in systems and use a "RAM".  We literally don't have space to put
> a flag in for each type.
> 
>
Dave Hansen Jan. 18, 2019, 3:20 p.m. UTC | #13
On 1/17/19 11:47 PM, Yanmin Zhang wrote:
> a chance for kernel to allocate PMEM as DMA buffer.
> Some super speed devices like 10Giga NIC, USB (SSIC connecting modem),
> might not work well if DMA buffer is in PMEM as it's slower than DRAM.
> 
> Should your patchset consider it?

No, I don't think so.

They can DMA to persistent memory whether this patch set exists or not.
 So, if the hardware falls over, that's a separate problem.

If an app wants memory that performs in a particular way, then I would
suggest those app find the NUMA nodes on the system that match their
needs with these patches:

> http://lkml.kernel.org/r/20190116175804.30196-1-keith.busch@intel.com

and use the existing NUMA APIs to select that memory.
diff mbox series

Patch

diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
--- a/drivers/dax/Kconfig~dax-kmem-try-4	2019-01-08 09:54:44.051694874 -0800
+++ b/drivers/dax/Kconfig	2019-01-08 09:54:44.056694874 -0800
@@ -32,6 +32,11 @@  config DEV_DAX_PMEM
 
 	  Say M if unsure
 
+config DEV_DAX_KMEM
+	def_bool y
+	depends on DEV_DAX_PMEM   # Needs DEV_DAX_PMEM infrastructure
+	depends on MEMORY_HOTPLUG # for add_memory() and friends
+
 config DEV_DAX_PMEM_COMPAT
 	tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
 	depends on DEV_DAX_PMEM
diff -puN /dev/null drivers/dax/kmem.c
--- /dev/null	2018-12-03 08:41:47.355756491 -0800
+++ b/drivers/dax/kmem.c	2019-01-08 09:54:44.056694874 -0800
@@ -0,0 +1,93 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
+#include <linux/memremap.h>
+#include <linux/pagemap.h>
+#include <linux/memory.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/pfn_t.h>
+#include <linux/slab.h>
+#include <linux/dax.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include "dax-private.h"
+#include "bus.h"
+
+int dev_dax_kmem_probe(struct device *dev)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct resource *res = &dev_dax->region->res;
+	resource_size_t kmem_start;
+	resource_size_t kmem_size;
+	struct resource *new_res;
+	int numa_node;
+	int rc;
+
+	/* Hotplug starting at the beginning of the next block: */
+	kmem_start = ALIGN(res->start, memory_block_size_bytes());
+
+	kmem_size = resource_size(res);
+	/* Adjust the size down to compensate for moving up kmem_start: */
+        kmem_size -= kmem_start - res->start;
+	/* Align the size down to cover only complete blocks: */
+	kmem_size &= ~(memory_block_size_bytes() - 1);
+
+	new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
+					  dev_name(dev));
+
+	if (!new_res) {
+		printk("could not reserve region %016llx -> %016llx\n",
+				kmem_start, kmem_start+kmem_size);
+		return -EBUSY;
+	}
+
+	/*
+	 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
+	 * so that add_memory() can add a child resource.
+	 */
+	new_res->flags = IORESOURCE_SYSTEM_RAM;
+	new_res->name = dev_name(dev);
+
+	numa_node = dev_dax->target_node;
+	if (numa_node < 0) {
+		pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
+		numa_node = 0;
+	}
+
+	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
+	if (rc)
+		return rc;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dev_dax_kmem_probe);
+
+static int dev_dax_kmem_remove(struct device *dev)
+{
+	/* Assume that hot-remove will fail for now */
+	return -EBUSY;
+}
+
+static struct dax_device_driver device_dax_kmem_driver = {
+	.drv = {
+		.probe = dev_dax_kmem_probe,
+		.remove = dev_dax_kmem_remove,
+	},
+};
+
+static int __init dax_kmem_init(void)
+{
+	return dax_driver_register(&device_dax_kmem_driver);
+}
+
+static void __exit dax_kmem_exit(void)
+{
+	dax_driver_unregister(&device_dax_kmem_driver);
+}
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_LICENSE("GPL v2");
+module_init(dax_kmem_init);
+module_exit(dax_kmem_exit);
+MODULE_ALIAS_DAX_DEVICE(0);
diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
--- a/drivers/dax/Makefile~dax-kmem-try-4	2019-01-08 09:54:44.053694874 -0800
+++ b/drivers/dax/Makefile	2019-01-08 09:54:44.056694874 -0800
@@ -1,6 +1,7 @@ 
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_DAX) += dax.o
 obj-$(CONFIG_DEV_DAX) += device_dax.o
+obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
 
 dax-y := super.o
 dax-y += bus.o