Message ID | 20230613-vv-kmem_memmap-v1-3-f6de9c6af2c6@intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: use memmap_on_memory semantics for dax/kmem | expand |
Vishal Verma <vishal.l.verma@intel.com> writes: > With DAX memory regions originating from CXL memory expanders or > NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory > on a system without enough 'regular' main memory to support the memmap > for it. To avoid this, ensure that all kmem managed hotplugged memory is > added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the > new memory region being hot added. > > To do this, call add_memory() in chunks of memory_block_size_bytes() as > that is a requirement for memmap_on_memory. Additionally, Use the > mhp_flag to force the memmap_on_memory checks regardless of the > respective module parameter setting. > > Cc: "Rafael J. Wysocki" <rafael@kernel.org> > Cc: Len Brown <lenb@kernel.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: David Hildenbrand <david@redhat.com> > Cc: Oscar Salvador <osalvador@suse.de> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Dave Jiang <dave.jiang@intel.com> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Huang Ying <ying.huang@intel.com> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> > --- > drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++------------- > 1 file changed, 36 insertions(+), 13 deletions(-) > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index 7b36db6f1cbd..0751346193ef 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -12,6 +12,7 @@ > #include <linux/mm.h> > #include <linux/mman.h> > #include <linux/memory-tiers.h> > +#include <linux/memory_hotplug.h> > #include "dax-private.h" > #include "bus.h" > > @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > data->mgid = rc; > > for (i = 0; i < dev_dax->nr_range; i++) { > + u64 cur_start, cur_len, remaining; > struct resource *res; > struct range range; > > @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > res->flags = IORESOURCE_SYSTEM_RAM; > > /* > - * Ensure that future kexec'd kernels will not treat > - * this as RAM automatically. > + * Add memory in chunks of memory_block_size_bytes() so that > + * it is considered for MHP_MEMMAP_ON_MEMORY > + * @range has already been aligned to memory_block_size_bytes(), > + * so the following loop will always break it down cleanly. > */ > - rc = add_memory_driver_managed(data->mgid, range.start, > - range_len(&range), kmem_name, MHP_NID_IS_MGID); > + cur_start = range.start; > + cur_len = memory_block_size_bytes(); > + remaining = range_len(&range); > + while (remaining) { > + mhp_t mhp_flags = MHP_NID_IS_MGID; > > - if (rc) { > - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", > - i, range.start, range.end); > - remove_resource(res); > - kfree(res); > - data->res[i] = NULL; > - if (mapped) > - continue; > - goto err_request_mem; > + if (mhp_supports_memmap_on_memory(cur_len, > + MHP_MEMMAP_ON_MEMORY)) > + mhp_flags |= MHP_MEMMAP_ON_MEMORY; > + /* > + * Ensure that future kexec'd kernels will not treat > + * this as RAM automatically. > + */ > + rc = add_memory_driver_managed(data->mgid, cur_start, > + cur_len, kmem_name, > + mhp_flags); > + > + if (rc) { > + dev_warn(dev, > + "mapping%d: %#llx-%#llx memory add failed\n", > + i, cur_start, cur_start + cur_len - 1); > + remove_resource(res); > + kfree(res); > + data->res[i] = NULL; > + if (mapped) > + continue; > + goto err_request_mem; > + } > + > + cur_start += cur_len; > + remaining -= cur_len; > } > mapped++; > } It appears that we need to hot-remove memory in the granularity of memory_block_size_bytes() too, according to try_remove_memory(). If so, it seems better to allocate one dax_kmem_data.res[] element for each memory block instead of dax region? Best Regards, Huang, Ying
On 16.06.23 00:00, Vishal Verma wrote: > With DAX memory regions originating from CXL memory expanders or > NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory > on a system without enough 'regular' main memory to support the memmap > for it. To avoid this, ensure that all kmem managed hotplugged memory is > added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the > new memory region being hot added. > > To do this, call add_memory() in chunks of memory_block_size_bytes() as > that is a requirement for memmap_on_memory. Additionally, Use the > mhp_flag to force the memmap_on_memory checks regardless of the > respective module parameter setting. > > Cc: "Rafael J. Wysocki" <rafael@kernel.org> > Cc: Len Brown <lenb@kernel.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: David Hildenbrand <david@redhat.com> > Cc: Oscar Salvador <osalvador@suse.de> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Dave Jiang <dave.jiang@intel.com> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Huang Ying <ying.huang@intel.com> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> > --- > drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++------------- > 1 file changed, 36 insertions(+), 13 deletions(-) > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index 7b36db6f1cbd..0751346193ef 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -12,6 +12,7 @@ > #include <linux/mm.h> > #include <linux/mman.h> > #include <linux/memory-tiers.h> > +#include <linux/memory_hotplug.h> > #include "dax-private.h" > #include "bus.h" > > @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > data->mgid = rc; > > for (i = 0; i < dev_dax->nr_range; i++) { > + u64 cur_start, cur_len, remaining; > struct resource *res; > struct range range; > > @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > res->flags = IORESOURCE_SYSTEM_RAM; > > /* > - * Ensure that future kexec'd kernels will not treat > - * this as RAM automatically. > + * Add memory in chunks of memory_block_size_bytes() so that > + * it is considered for MHP_MEMMAP_ON_MEMORY > + * @range has already been aligned to memory_block_size_bytes(), > + * so the following loop will always break it down cleanly. > */ > - rc = add_memory_driver_managed(data->mgid, range.start, > - range_len(&range), kmem_name, MHP_NID_IS_MGID); > + cur_start = range.start; > + cur_len = memory_block_size_bytes(); > + remaining = range_len(&range); > + while (remaining) { > + mhp_t mhp_flags = MHP_NID_IS_MGID; > > - if (rc) { > - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", > - i, range.start, range.end); > - remove_resource(res); > - kfree(res); > - data->res[i] = NULL; > - if (mapped) > - continue; > - goto err_request_mem; > + if (mhp_supports_memmap_on_memory(cur_len, > + MHP_MEMMAP_ON_MEMORY)) > + mhp_flags |= MHP_MEMMAP_ON_MEMORY; > + /* > + * Ensure that future kexec'd kernels will not treat > + * this as RAM automatically. > + */ > + rc = add_memory_driver_managed(data->mgid, cur_start, > + cur_len, kmem_name, > + mhp_flags); > + > + if (rc) { > + dev_warn(dev, > + "mapping%d: %#llx-%#llx memory add failed\n", > + i, cur_start, cur_start + cur_len - 1); > + remove_resource(res); > + kfree(res); > + data->res[i] = NULL; > + if (mapped) > + continue; > + goto err_request_mem; > + } > + > + cur_start += cur_len; > + remaining -= cur_len; > } > mapped++; > } > Maybe the better alternative is teach add_memory_resource()/try_remove_memory() to do that internally. In the add_memory_resource() case, it might be a loop around that memmap_on_memory + arch_add_memory code path (well, and the error path also needs adjustment): /* * Self hosted memmap array */ if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { if (!mhp_supports_memmap_on_memory(size)) { ret = -EINVAL; goto error; } mhp_altmap.free = PHYS_PFN(size); mhp_altmap.base_pfn = PHYS_PFN(start); params.altmap = &mhp_altmap; } /* call arch's memory hotadd */ ret = arch_add_memory(nid, start, size, ¶ms); if (ret < 0) goto error; Note that we want to handle that on a per memory-block basis, because we don't want the vmemmap of memory block #2 to end up on memory block #1. It all gets messy with memory onlining/offlining etc otherwise ...
Hi Vishal, Vishal Verma <vishal.l.verma@intel.com> writes: > With DAX memory regions originating from CXL memory expanders or > NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory > on a system without enough 'regular' main memory to support the memmap > for it. To avoid this, ensure that all kmem managed hotplugged memory is > added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the > new memory region being hot added. > Some architectures doesn't have support for MEMMAP_ON_MEMORY, bypassing the check mhp_memmap_on_memory() might cause problems on such architectures (for e.g PPC64). > To do this, call add_memory() in chunks of memory_block_size_bytes() as > that is a requirement for memmap_on_memory. Additionally, Use the > mhp_flag to force the memmap_on_memory checks regardless of the > respective module parameter setting. > > Cc: "Rafael J. Wysocki" <rafael@kernel.org> > Cc: Len Brown <lenb@kernel.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: David Hildenbrand <david@redhat.com> > Cc: Oscar Salvador <osalvador@suse.de> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Dave Jiang <dave.jiang@intel.com> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Huang Ying <ying.huang@intel.com> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> > --- > drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++------------- > 1 file changed, 36 insertions(+), 13 deletions(-) > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index 7b36db6f1cbd..0751346193ef 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -12,6 +12,7 @@ > #include <linux/mm.h> > #include <linux/mman.h> > #include <linux/memory-tiers.h> > +#include <linux/memory_hotplug.h> > #include "dax-private.h" > #include "bus.h" > > @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > data->mgid = rc; > > for (i = 0; i < dev_dax->nr_range; i++) { > + u64 cur_start, cur_len, remaining; > struct resource *res; > struct range range; > > @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > res->flags = IORESOURCE_SYSTEM_RAM; > > /* > - * Ensure that future kexec'd kernels will not treat > - * this as RAM automatically. > + * Add memory in chunks of memory_block_size_bytes() so that > + * it is considered for MHP_MEMMAP_ON_MEMORY > + * @range has already been aligned to memory_block_size_bytes(), > + * so the following loop will always break it down cleanly. > */ > - rc = add_memory_driver_managed(data->mgid, range.start, > - range_len(&range), kmem_name, MHP_NID_IS_MGID); > + cur_start = range.start; > + cur_len = memory_block_size_bytes(); > + remaining = range_len(&range); > + while (remaining) { > + mhp_t mhp_flags = MHP_NID_IS_MGID; > > - if (rc) { > - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", > - i, range.start, range.end); > - remove_resource(res); > - kfree(res); > - data->res[i] = NULL; > - if (mapped) > - continue; > - goto err_request_mem; > + if (mhp_supports_memmap_on_memory(cur_len, > + MHP_MEMMAP_ON_MEMORY)) > + mhp_flags |= MHP_MEMMAP_ON_MEMORY; > + /* > + * Ensure that future kexec'd kernels will not treat > + * this as RAM automatically. > + */ > + rc = add_memory_driver_managed(data->mgid, cur_start, > + cur_len, kmem_name, > + mhp_flags); > + > + if (rc) { > + dev_warn(dev, > + "mapping%d: %#llx-%#llx memory add failed\n", > + i, cur_start, cur_start + cur_len - 1); > + remove_resource(res); > + kfree(res); > + data->res[i] = NULL; > + if (mapped) > + continue; > + goto err_request_mem; > + } > + > + cur_start += cur_len; > + remaining -= cur_len; > } > mapped++; > } > > -- > 2.40.1
David Hildenbrand <david@redhat.com> writes: > On 16.06.23 00:00, Vishal Verma wrote: >> With DAX memory regions originating from CXL memory expanders or >> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory >> on a system without enough 'regular' main memory to support the memmap >> for it. To avoid this, ensure that all kmem managed hotplugged memory is >> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the >> new memory region being hot added. >> >> To do this, call add_memory() in chunks of memory_block_size_bytes() as >> that is a requirement for memmap_on_memory. Additionally, Use the >> mhp_flag to force the memmap_on_memory checks regardless of the >> respective module parameter setting. >> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org> >> Cc: Len Brown <lenb@kernel.org> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Cc: David Hildenbrand <david@redhat.com> >> Cc: Oscar Salvador <osalvador@suse.de> >> Cc: Dan Williams <dan.j.williams@intel.com> >> Cc: Dave Jiang <dave.jiang@intel.com> >> Cc: Dave Hansen <dave.hansen@linux.intel.com> >> Cc: Huang Ying <ying.huang@intel.com> >> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> >> --- >> drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++------------- >> 1 file changed, 36 insertions(+), 13 deletions(-) >> >> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c >> index 7b36db6f1cbd..0751346193ef 100644 >> --- a/drivers/dax/kmem.c >> +++ b/drivers/dax/kmem.c >> @@ -12,6 +12,7 @@ >> #include <linux/mm.h> >> #include <linux/mman.h> >> #include <linux/memory-tiers.h> >> +#include <linux/memory_hotplug.h> >> #include "dax-private.h" >> #include "bus.h" >> >> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) >> data->mgid = rc; >> >> for (i = 0; i < dev_dax->nr_range; i++) { >> + u64 cur_start, cur_len, remaining; >> struct resource *res; >> struct range range; >> >> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) >> res->flags = IORESOURCE_SYSTEM_RAM; >> >> /* >> - * Ensure that future kexec'd kernels will not treat >> - * this as RAM automatically. >> + * Add memory in chunks of memory_block_size_bytes() so that >> + * it is considered for MHP_MEMMAP_ON_MEMORY >> + * @range has already been aligned to memory_block_size_bytes(), >> + * so the following loop will always break it down cleanly. >> */ >> - rc = add_memory_driver_managed(data->mgid, range.start, >> - range_len(&range), kmem_name, MHP_NID_IS_MGID); >> + cur_start = range.start; >> + cur_len = memory_block_size_bytes(); >> + remaining = range_len(&range); >> + while (remaining) { >> + mhp_t mhp_flags = MHP_NID_IS_MGID; >> >> - if (rc) { >> - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", >> - i, range.start, range.end); >> - remove_resource(res); >> - kfree(res); >> - data->res[i] = NULL; >> - if (mapped) >> - continue; >> - goto err_request_mem; >> + if (mhp_supports_memmap_on_memory(cur_len, >> + MHP_MEMMAP_ON_MEMORY)) >> + mhp_flags |= MHP_MEMMAP_ON_MEMORY; >> + /* >> + * Ensure that future kexec'd kernels will not treat >> + * this as RAM automatically. >> + */ >> + rc = add_memory_driver_managed(data->mgid, cur_start, >> + cur_len, kmem_name, >> + mhp_flags); >> + >> + if (rc) { >> + dev_warn(dev, >> + "mapping%d: %#llx-%#llx memory add failed\n", >> + i, cur_start, cur_start + cur_len - 1); >> + remove_resource(res); >> + kfree(res); >> + data->res[i] = NULL; >> + if (mapped) >> + continue; >> + goto err_request_mem; >> + } >> + >> + cur_start += cur_len; >> + remaining -= cur_len; >> } >> mapped++; >> } >> > > Maybe the better alternative is teach > add_memory_resource()/try_remove_memory() to do that internally. > > In the add_memory_resource() case, it might be a loop around that > memmap_on_memory + arch_add_memory code path (well, and the error path > also needs adjustment): > > /* > * Self hosted memmap array > */ > if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { > if (!mhp_supports_memmap_on_memory(size)) { > ret = -EINVAL; > goto error; > } > mhp_altmap.free = PHYS_PFN(size); > mhp_altmap.base_pfn = PHYS_PFN(start); > params.altmap = &mhp_altmap; > } > > /* call arch's memory hotadd */ > ret = arch_add_memory(nid, start, size, ¶ms); > if (ret < 0) > goto error; > > > Note that we want to handle that on a per memory-block basis, because we > don't want the vmemmap of memory block #2 to end up on memory block #1. > It all gets messy with memory onlining/offlining etc otherwise ... > I tried to implement this inside add_memory_driver_managed() and also within dax/kmem. IMHO doing the error handling inside dax/kmem is better. Here is how it looks: 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions 2. For each succesful request_mem_regions if any blocks got added, we keep the resource. If none got added, we will kfree the resource for (i = 0; i < dev_dax->nr_range; i++) { u64 cur_start, cur_len, remaining; struct resource *res; struct range range; bool block_added; rc = dax_kmem_range(dev_dax, i, &range); if (rc) continue; /* Region is permanently reserved if hotremove fails. */ res = request_mem_region(range.start, range_len(&range), data->res_name); if (!res) { dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n", i, range.start, range.end); /* * Once some memory has been onlined we can't * assume that it can be un-onlined safely. */ if (mapped) continue; rc = -EBUSY; goto err_request_mem; } data->res[i] = res; /* * Set flags appropriate for System RAM. Leave ..._BUSY clear * so that add_memory() can add a child resource. Do not * inherit flags from the parent since it may set new flags * unknown to us that will break add_memory() below. */ res->flags = IORESOURCE_SYSTEM_RAM; /* * Add memory in chunks of memory_block_size_bytes() so that * it is considered for MHP_MEMMAP_ON_MEMORY * @range has already been aligned to memory_block_size_bytes(), * so the following loop will always break it down cleanly. */ cur_start = range.start; cur_len = memory_block_size_bytes(); remaining = range_len(&range); block_added = false; while (remaining) { /* * If alignment rules are not satisified we will * fallback normal memmap allocation. */ mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY; /* * Ensure that future kexec'd kernels will not treat * this as RAM automatically. */ rc = add_memory_driver_managed(data->mgid, cur_start, cur_len, kmem_name, mhp_flags); if (rc) dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", i, cur_start, cur_start + cur_len - 1); else block_added = true; cur_start += cur_len; remaining -= cur_len; } if (!block_added) { /* * None of the blocks got added, remove the resource. */ remove_resource(res); kfree(res); data->res[i] = NULL; } else mapped++; } if (mapped) { dev_set_drvdata(dev, data); return 0; } err_request_mem: /* * If none of the resources got mapped. * unregister the group. */ memory_group_unregister(data->mgid); err_reg_mgid: kfree(data->res_name);
On 11.07.23 16:30, Aneesh Kumar K.V wrote: > David Hildenbrand <david@redhat.com> writes: > >> On 16.06.23 00:00, Vishal Verma wrote: >>> With DAX memory regions originating from CXL memory expanders or >>> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory >>> on a system without enough 'regular' main memory to support the memmap >>> for it. To avoid this, ensure that all kmem managed hotplugged memory is >>> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the >>> new memory region being hot added. >>> >>> To do this, call add_memory() in chunks of memory_block_size_bytes() as >>> that is a requirement for memmap_on_memory. Additionally, Use the >>> mhp_flag to force the memmap_on_memory checks regardless of the >>> respective module parameter setting. >>> >>> Cc: "Rafael J. Wysocki" <rafael@kernel.org> >>> Cc: Len Brown <lenb@kernel.org> >>> Cc: Andrew Morton <akpm@linux-foundation.org> >>> Cc: David Hildenbrand <david@redhat.com> >>> Cc: Oscar Salvador <osalvador@suse.de> >>> Cc: Dan Williams <dan.j.williams@intel.com> >>> Cc: Dave Jiang <dave.jiang@intel.com> >>> Cc: Dave Hansen <dave.hansen@linux.intel.com> >>> Cc: Huang Ying <ying.huang@intel.com> >>> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> >>> --- >>> drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++------------- >>> 1 file changed, 36 insertions(+), 13 deletions(-) >>> >>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c >>> index 7b36db6f1cbd..0751346193ef 100644 >>> --- a/drivers/dax/kmem.c >>> +++ b/drivers/dax/kmem.c >>> @@ -12,6 +12,7 @@ >>> #include <linux/mm.h> >>> #include <linux/mman.h> >>> #include <linux/memory-tiers.h> >>> +#include <linux/memory_hotplug.h> >>> #include "dax-private.h" >>> #include "bus.h" >>> >>> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) >>> data->mgid = rc; >>> >>> for (i = 0; i < dev_dax->nr_range; i++) { >>> + u64 cur_start, cur_len, remaining; >>> struct resource *res; >>> struct range range; >>> >>> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) >>> res->flags = IORESOURCE_SYSTEM_RAM; >>> >>> /* >>> - * Ensure that future kexec'd kernels will not treat >>> - * this as RAM automatically. >>> + * Add memory in chunks of memory_block_size_bytes() so that >>> + * it is considered for MHP_MEMMAP_ON_MEMORY >>> + * @range has already been aligned to memory_block_size_bytes(), >>> + * so the following loop will always break it down cleanly. >>> */ >>> - rc = add_memory_driver_managed(data->mgid, range.start, >>> - range_len(&range), kmem_name, MHP_NID_IS_MGID); >>> + cur_start = range.start; >>> + cur_len = memory_block_size_bytes(); >>> + remaining = range_len(&range); >>> + while (remaining) { >>> + mhp_t mhp_flags = MHP_NID_IS_MGID; >>> >>> - if (rc) { >>> - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", >>> - i, range.start, range.end); >>> - remove_resource(res); >>> - kfree(res); >>> - data->res[i] = NULL; >>> - if (mapped) >>> - continue; >>> - goto err_request_mem; >>> + if (mhp_supports_memmap_on_memory(cur_len, >>> + MHP_MEMMAP_ON_MEMORY)) >>> + mhp_flags |= MHP_MEMMAP_ON_MEMORY; >>> + /* >>> + * Ensure that future kexec'd kernels will not treat >>> + * this as RAM automatically. >>> + */ >>> + rc = add_memory_driver_managed(data->mgid, cur_start, >>> + cur_len, kmem_name, >>> + mhp_flags); >>> + >>> + if (rc) { >>> + dev_warn(dev, >>> + "mapping%d: %#llx-%#llx memory add failed\n", >>> + i, cur_start, cur_start + cur_len - 1); >>> + remove_resource(res); >>> + kfree(res); >>> + data->res[i] = NULL; >>> + if (mapped) >>> + continue; >>> + goto err_request_mem; >>> + } >>> + >>> + cur_start += cur_len; >>> + remaining -= cur_len; >>> } >>> mapped++; >>> } >>> >> >> Maybe the better alternative is teach >> add_memory_resource()/try_remove_memory() to do that internally. >> >> In the add_memory_resource() case, it might be a loop around that >> memmap_on_memory + arch_add_memory code path (well, and the error path >> also needs adjustment): >> >> /* >> * Self hosted memmap array >> */ >> if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { >> if (!mhp_supports_memmap_on_memory(size)) { >> ret = -EINVAL; >> goto error; >> } >> mhp_altmap.free = PHYS_PFN(size); >> mhp_altmap.base_pfn = PHYS_PFN(start); >> params.altmap = &mhp_altmap; >> } >> >> /* call arch's memory hotadd */ >> ret = arch_add_memory(nid, start, size, ¶ms); >> if (ret < 0) >> goto error; >> >> >> Note that we want to handle that on a per memory-block basis, because we >> don't want the vmemmap of memory block #2 to end up on memory block #1. >> It all gets messy with memory onlining/offlining etc otherwise ... >> > > I tried to implement this inside add_memory_driver_managed() and also > within dax/kmem. IMHO doing the error handling inside dax/kmem is > better. Here is how it looks: > > 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions > 2. For each succesful request_mem_regions if any blocks got added, we > keep the resource. If none got added, we will kfree the resource > Doing this unconditional splitting outside of add_memory_driver_managed() is undesirable for at least two reasons 1) You end up always creating individual entries in the resource tree (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective. 2) As we call arch_add_memory() in memory block granularity (e.g., 128 MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective. While you could sense for support and do the split based on that, it will be beneficial for other users (especially DIMMs) if we do that internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be effective or not. In general, we avoid placing important kernel data-structures on slow memory. That's one of the reasons why PMEM decided to mostly always use ZONE_MOVABLE such that exactly what this patch does would not happen. So I'm wondering if there would be demand for an additional toggle. Because even with memmap_on_memory enabled in general, you might not want to do that for dax/kmem. IMHO, this patch should be dropped from your ppc64 series, as it's an independent change that might be valuable for other architectures as well.
On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote: > On 11.07.23 16:30, Aneesh Kumar K.V wrote: > > David Hildenbrand <david@redhat.com> writes: > > > > > > Maybe the better alternative is teach > > > add_memory_resource()/try_remove_memory() to do that internally. > > > > > > In the add_memory_resource() case, it might be a loop around that > > > memmap_on_memory + arch_add_memory code path (well, and the error path > > > also needs adjustment): > > > > > > /* > > > * Self hosted memmap array > > > */ > > > if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { > > > if (!mhp_supports_memmap_on_memory(size)) { > > > ret = -EINVAL; > > > goto error; > > > } > > > mhp_altmap.free = PHYS_PFN(size); > > > mhp_altmap.base_pfn = PHYS_PFN(start); > > > params.altmap = &mhp_altmap; > > > } > > > > > > /* call arch's memory hotadd */ > > > ret = arch_add_memory(nid, start, size, ¶ms); > > > if (ret < 0) > > > goto error; > > > > > > > > > Note that we want to handle that on a per memory-block basis, because we > > > don't want the vmemmap of memory block #2 to end up on memory block #1. > > > It all gets messy with memory onlining/offlining etc otherwise ... > > > > > > > I tried to implement this inside add_memory_driver_managed() and also > > within dax/kmem. IMHO doing the error handling inside dax/kmem is > > better. Here is how it looks: > > > > 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions > > 2. For each succesful request_mem_regions if any blocks got added, we > > keep the resource. If none got added, we will kfree the resource > > > > Doing this unconditional splitting outside of > add_memory_driver_managed() is undesirable for at least two reasons > > 1) You end up always creating individual entries in the resource tree > (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective. > 2) As we call arch_add_memory() in memory block granularity (e.g., 128 > MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the > identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective. > > While you could sense for support and do the split based on that, it > will be beneficial for other users (especially DIMMs) if we do that > internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be > effective or not. I'm taking a shot at implementing the splitting internally in memory_hotplug.c. The caller (kmem) side does become trivial with this approach, but there's a slight complication if I don't have the module param override (patch 1 of this series). The kmem diff now looks like: diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 898ca9505754..8be932f63f90 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) data->mgid = rc; for (i = 0; i < dev_dax->nr_range; i++) { + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY | + MHP_SPLIT_MEMBLOCKS; struct resource *res; struct range range; @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) * this as RAM automatically. */ rc = add_memory_driver_managed(data->mgid, range.start, - range_len(&range), kmem_name, MHP_NID_IS_MGID); + range_len(&range), kmem_name, mhp_flags); if (rc) { dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", However this begins to fail if the memmap_on_memory modparam is not set, as add_memory_driver_managed EINVALs from the mhp_supports_memmap_on_memory() check. The way to work around this would probably include doing the mhp_supports_memmap_on_memory() check in kmem, in a loop to check for each memblock sized chunk, and that feels like a leak of the implementation details into the caller. Any suggestions on how to go about this? > > In general, we avoid placing important kernel data-structures on slow > memory. That's one of the reasons why PMEM decided to mostly always use > ZONE_MOVABLE such that exactly what this patch does would not happen. So > I'm wondering if there would be demand for an additional toggle. > > Because even with memmap_on_memory enabled in general, you might not > want to do that for dax/kmem. > > IMHO, this patch should be dropped from your ppc64 series, as it's an > independent change that might be valuable for other architectures as well. > Sure thing, I can pick this back up and Aneesh can drop this from his set.
On 13.07.23 08:45, Verma, Vishal L wrote: > On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote: >> On 11.07.23 16:30, Aneesh Kumar K.V wrote: >>> David Hildenbrand <david@redhat.com> writes: >>>> >>>> Maybe the better alternative is teach >>>> add_memory_resource()/try_remove_memory() to do that internally. >>>> >>>> In the add_memory_resource() case, it might be a loop around that >>>> memmap_on_memory + arch_add_memory code path (well, and the error path >>>> also needs adjustment): >>>> >>>> /* >>>> * Self hosted memmap array >>>> */ >>>> if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { >>>> if (!mhp_supports_memmap_on_memory(size)) { >>>> ret = -EINVAL; >>>> goto error; >>>> } >>>> mhp_altmap.free = PHYS_PFN(size); >>>> mhp_altmap.base_pfn = PHYS_PFN(start); >>>> params.altmap = &mhp_altmap; >>>> } >>>> >>>> /* call arch's memory hotadd */ >>>> ret = arch_add_memory(nid, start, size, ¶ms); >>>> if (ret < 0) >>>> goto error; >>>> >>>> >>>> Note that we want to handle that on a per memory-block basis, because we >>>> don't want the vmemmap of memory block #2 to end up on memory block #1. >>>> It all gets messy with memory onlining/offlining etc otherwise ... >>>> >>> >>> I tried to implement this inside add_memory_driver_managed() and also >>> within dax/kmem. IMHO doing the error handling inside dax/kmem is >>> better. Here is how it looks: >>> >>> 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions >>> 2. For each succesful request_mem_regions if any blocks got added, we >>> keep the resource. If none got added, we will kfree the resource >>> >> >> Doing this unconditional splitting outside of >> add_memory_driver_managed() is undesirable for at least two reasons >> >> 1) You end up always creating individual entries in the resource tree >> (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective. >> 2) As we call arch_add_memory() in memory block granularity (e.g., 128 >> MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the >> identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective. >> >> While you could sense for support and do the split based on that, it >> will be beneficial for other users (especially DIMMs) if we do that >> internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be >> effective or not. > > I'm taking a shot at implementing the splitting internally in > memory_hotplug.c. The caller (kmem) side does become trivial with this > approach, but there's a slight complication if I don't have the module > param override (patch 1 of this series). > > The kmem diff now looks like: > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index 898ca9505754..8be932f63f90 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > data->mgid = rc; > > for (i = 0; i < dev_dax->nr_range; i++) { > + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY | > + MHP_SPLIT_MEMBLOCKS; > struct resource *res; > struct range range; > > @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > * this as RAM automatically. > */ > rc = add_memory_driver_managed(data->mgid, range.start, > - range_len(&range), kmem_name, MHP_NID_IS_MGID); > + range_len(&range), kmem_name, mhp_flags); > > if (rc) { > dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", > > Why do we need the MHP_SPLIT_MEMBLOCKS? In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a single memory block, you can simply split up internally, no? Essentially in add_memory_resource() something like if (mhp_flags & MHP_MEMMAP_ON_MEMORY && mhp_supports_memmap_on_memory(memory_block_size_bytes())) { for (cur_start = start, cur_start < start + size; cur_start += memory_block_size_bytes()) { mhp_altmap.free = PHYS_PFN(memory_block_size_bytes()); mhp_altmap.base_pfn = PHYS_PFN(start); params.altmap = &mhp_altmap; ret = arch_add_memory(nid, start, memory_block_size_bytes(), ¶ms); if (ret < 0) ... ret = create_memory_block_devices(start, memory_block_size_bytes(), mhp_altmap.alloc, group); if (ret) ... } } else { /* old boring stuff */ } Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes() ... If any adding of memory failed, we remove what we already added. That works, because as long as we're holding the relevant locks, memory cannot get onlined in the meantime. Then we only have to teach remove_memory() to deal with individual blocks when finding blocks that have an altmap.
On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote: > On 13.07.23 08:45, Verma, Vishal L wrote: > > > > I'm taking a shot at implementing the splitting internally in > > memory_hotplug.c. The caller (kmem) side does become trivial with this > > approach, but there's a slight complication if I don't have the module > > param override (patch 1 of this series). > > > > The kmem diff now looks like: > > > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > > index 898ca9505754..8be932f63f90 100644 > > --- a/drivers/dax/kmem.c > > +++ b/drivers/dax/kmem.c > > @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > > data->mgid = rc; > > > > for (i = 0; i < dev_dax->nr_range; i++) { > > + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY | > > + MHP_SPLIT_MEMBLOCKS; > > struct resource *res; > > struct range range; > > > > @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > > * this as RAM automatically. > > */ > > rc = add_memory_driver_managed(data->mgid, range.start, > > - range_len(&range), kmem_name, MHP_NID_IS_MGID); > > + range_len(&range), kmem_name, mhp_flags); > > > > if (rc) { > > dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", > > > > > > Why do we need the MHP_SPLIT_MEMBLOCKS? I thought we still wanted either an opt-in or opt-out for the kmem driver to be able to do memmap_on_memory, in case there were performance implications or the lack of 1GiB PUDs. I haven't implemented that yet, but I was thinking along the lines of a sysfs knob exposed by kmem, that controls setting of this new MHP_SPLIT_MEMBLOCKS flag. > > In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a > single memory block, you can simply split up internally, no? > > Essentially in add_memory_resource() something like > > if (mhp_flags & MHP_MEMMAP_ON_MEMORY && > mhp_supports_memmap_on_memory(memory_block_size_bytes())) { > for (cur_start = start, cur_start < start + size; > cur_start += memory_block_size_bytes()) { > mhp_altmap.free = PHYS_PFN(memory_block_size_bytes()); > mhp_altmap.base_pfn = PHYS_PFN(start); > params.altmap = &mhp_altmap; > > ret = arch_add_memory(nid, start, > memory_block_size_bytes(), ¶ms); > if (ret < 0) ... > > ret = create_memory_block_devices(start, memory_block_size_bytes(), > mhp_altmap.alloc, group); > if (ret) ... > > } > } else { > /* old boring stuff */ > } > > Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into > a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes() > ... My current approach was looping a level higher, on the call to add_memory_resource, but this looks reasonable too, and I can switch to this. In fact it is better as in my case I had to loop twice, once for the regular add_memory() path and once for the _driver_managed() path. Yours should avoid that. > > If any adding of memory failed, we remove what we already added. That works, because as > long as we're holding the relevant locks, memory cannot get onlined in the meantime. > > Then we only have to teach remove_memory() to deal with individual blocks when finding > blocks that have an altmap. >
On 13.07.23 17:15, Verma, Vishal L wrote: > On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote: >> On 13.07.23 08:45, Verma, Vishal L wrote: >>> >>> I'm taking a shot at implementing the splitting internally in >>> memory_hotplug.c. The caller (kmem) side does become trivial with this >>> approach, but there's a slight complication if I don't have the module >>> param override (patch 1 of this series). >>> >>> The kmem diff now looks like: >>> >>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c >>> index 898ca9505754..8be932f63f90 100644 >>> --- a/drivers/dax/kmem.c >>> +++ b/drivers/dax/kmem.c >>> @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) >>> data->mgid = rc; >>> >>> for (i = 0; i < dev_dax->nr_range; i++) { >>> + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY | >>> + MHP_SPLIT_MEMBLOCKS; >>> struct resource *res; >>> struct range range; >>> >>> @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) >>> * this as RAM automatically. >>> */ >>> rc = add_memory_driver_managed(data->mgid, range.start, >>> - range_len(&range), kmem_name, MHP_NID_IS_MGID); >>> + range_len(&range), kmem_name, mhp_flags); >>> >>> if (rc) { >>> dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", >>> >>> >> >> Why do we need the MHP_SPLIT_MEMBLOCKS? > > I thought we still wanted either an opt-in or opt-out for the kmem > driver to be able to do memmap_on_memory, in case there were > performance implications or the lack of 1GiB PUDs. I haven't > implemented that yet, but I was thinking along the lines of a sysfs > knob exposed by kmem, that controls setting of this new > MHP_SPLIT_MEMBLOCKS flag. Why is MHP_MEMMAP_ON_MEMORY not sufficient for that? > >> >> In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a >> single memory block, you can simply split up internally, no? >> >> Essentially in add_memory_resource() something like >> >> if (mhp_flags & MHP_MEMMAP_ON_MEMORY && >> mhp_supports_memmap_on_memory(memory_block_size_bytes())) { >> for (cur_start = start, cur_start < start + size; >> cur_start += memory_block_size_bytes()) { >> mhp_altmap.free = PHYS_PFN(memory_block_size_bytes()); >> mhp_altmap.base_pfn = PHYS_PFN(start); >> params.altmap = &mhp_altmap; >> >> ret = arch_add_memory(nid, start, >> memory_block_size_bytes(), ¶ms); >> if (ret < 0) ... >> >> ret = create_memory_block_devices(start, memory_block_size_bytes(), >> mhp_altmap.alloc, group); >> if (ret) ... >> >> } >> } else { >> /* old boring stuff */ >> } >> >> Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into >> a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes() >> ... > > My current approach was looping a level higher, on the call to > add_memory_resource, but this looks reasonable too, and I can switch to > this. In fact it is better as in my case I had to loop twice, once for > the regular add_memory() path and once for the _driver_managed() path. > Yours should avoid that. As we only care about the altmap here, handling it for arch_add_memory() + create_memory_block_devices() should be sufficient.
On Thu, 2023-07-13 at 17:23 +0200, David Hildenbrand wrote: > On 13.07.23 17:15, Verma, Vishal L wrote: > > On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote: > > > On 13.07.23 08:45, Verma, Vishal L wrote: > > > > > > > > I'm taking a shot at implementing the splitting internally in > > > > memory_hotplug.c. The caller (kmem) side does become trivial with this > > > > approach, but there's a slight complication if I don't have the module > > > > param override (patch 1 of this series). > > > > > > > > The kmem diff now looks like: > > > > > > > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > > > > index 898ca9505754..8be932f63f90 100644 > > > > --- a/drivers/dax/kmem.c > > > > +++ b/drivers/dax/kmem.c > > > > @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > > > > data->mgid = rc; > > > > > > > > for (i = 0; i < dev_dax->nr_range; i++) { > > > > + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY | > > > > + MHP_SPLIT_MEMBLOCKS; > > > > struct resource *res; > > > > struct range range; > > > > > > > > @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > > > > * this as RAM automatically. > > > > */ > > > > rc = add_memory_driver_managed(data->mgid, range.start, > > > > - range_len(&range), kmem_name, MHP_NID_IS_MGID); > > > > + range_len(&range), kmem_name, mhp_flags); > > > > > > > > if (rc) { > > > > dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", > > > > > > > > > > > > > > Why do we need the MHP_SPLIT_MEMBLOCKS? > > > > I thought we still wanted either an opt-in or opt-out for the kmem > > driver to be able to do memmap_on_memory, in case there were > > performance implications or the lack of 1GiB PUDs. I haven't > > implemented that yet, but I was thinking along the lines of a sysfs > > knob exposed by kmem, that controls setting of this new > > MHP_SPLIT_MEMBLOCKS flag. > > Why is MHP_MEMMAP_ON_MEMORY not sufficient for that? > > Ah I see what you mean now - knob just controls MHP_MEMMAP_ON_MEMORY, and memory_hotplug is free to split to memblocks if it needs to to satisfy that. That sounds reasonable. Let me give this a try and see if I run into anything else. Thanks David!
On 13.07.23 17:40, Verma, Vishal L wrote: > On Thu, 2023-07-13 at 17:23 +0200, David Hildenbrand wrote: >> On 13.07.23 17:15, Verma, Vishal L wrote: >>> On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote: >>>> On 13.07.23 08:45, Verma, Vishal L wrote: >>>>> >>>>> I'm taking a shot at implementing the splitting internally in >>>>> memory_hotplug.c. The caller (kmem) side does become trivial with this >>>>> approach, but there's a slight complication if I don't have the module >>>>> param override (patch 1 of this series). >>>>> >>>>> The kmem diff now looks like: >>>>> >>>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c >>>>> index 898ca9505754..8be932f63f90 100644 >>>>> --- a/drivers/dax/kmem.c >>>>> +++ b/drivers/dax/kmem.c >>>>> @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) >>>>> data->mgid = rc; >>>>> >>>>> for (i = 0; i < dev_dax->nr_range; i++) { >>>>> + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY | >>>>> + MHP_SPLIT_MEMBLOCKS; >>>>> struct resource *res; >>>>> struct range range; >>>>> >>>>> @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) >>>>> * this as RAM automatically. >>>>> */ >>>>> rc = add_memory_driver_managed(data->mgid, range.start, >>>>> - range_len(&range), kmem_name, MHP_NID_IS_MGID); >>>>> + range_len(&range), kmem_name, mhp_flags); >>>>> >>>>> if (rc) { >>>>> dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", >>>>> >>>>> >>>> >>>> Why do we need the MHP_SPLIT_MEMBLOCKS? >>> >>> I thought we still wanted either an opt-in or opt-out for the kmem >>> driver to be able to do memmap_on_memory, in case there were >>> performance implications or the lack of 1GiB PUDs. I haven't >>> implemented that yet, but I was thinking along the lines of a sysfs >>> knob exposed by kmem, that controls setting of this new >>> MHP_SPLIT_MEMBLOCKS flag. >> >> Why is MHP_MEMMAP_ON_MEMORY not sufficient for that? >> >> > Ah I see what you mean now - knob just controls MHP_MEMMAP_ON_MEMORY, > and memory_hotplug is free to split to memblocks if it needs to to > satisfy that. And if you don't want memmap holes in a larger area you're adding (for example to runtime-allocate 1 GiB pages), simply check the size your adding, and if it's, say, less than 1 G, don't set the flag. But that's probably a corner case use case not worth considering for now. > > That sounds reasonable. Let me give this a try and see if I run into > anything else. Thanks David! Sure!
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 7b36db6f1cbd..0751346193ef 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -12,6 +12,7 @@ #include <linux/mm.h> #include <linux/mman.h> #include <linux/memory-tiers.h> +#include <linux/memory_hotplug.h> #include "dax-private.h" #include "bus.h" @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) data->mgid = rc; for (i = 0; i < dev_dax->nr_range; i++) { + u64 cur_start, cur_len, remaining; struct resource *res; struct range range; @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) res->flags = IORESOURCE_SYSTEM_RAM; /* - * Ensure that future kexec'd kernels will not treat - * this as RAM automatically. + * Add memory in chunks of memory_block_size_bytes() so that + * it is considered for MHP_MEMMAP_ON_MEMORY + * @range has already been aligned to memory_block_size_bytes(), + * so the following loop will always break it down cleanly. */ - rc = add_memory_driver_managed(data->mgid, range.start, - range_len(&range), kmem_name, MHP_NID_IS_MGID); + cur_start = range.start; + cur_len = memory_block_size_bytes(); + remaining = range_len(&range); + while (remaining) { + mhp_t mhp_flags = MHP_NID_IS_MGID; - if (rc) { - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", - i, range.start, range.end); - remove_resource(res); - kfree(res); - data->res[i] = NULL; - if (mapped) - continue; - goto err_request_mem; + if (mhp_supports_memmap_on_memory(cur_len, + MHP_MEMMAP_ON_MEMORY)) + mhp_flags |= MHP_MEMMAP_ON_MEMORY; + /* + * Ensure that future kexec'd kernels will not treat + * this as RAM automatically. + */ + rc = add_memory_driver_managed(data->mgid, cur_start, + cur_len, kmem_name, + mhp_flags); + + if (rc) { + dev_warn(dev, + "mapping%d: %#llx-%#llx memory add failed\n", + i, cur_start, cur_start + cur_len - 1); + remove_resource(res); + kfree(res); + data->res[i] = NULL; + if (mapped) + continue; + goto err_request_mem; + } + + cur_start += cur_len; + remaining -= cur_len; } mapped++; }
With DAX memory regions originating from CXL memory expanders or NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory on a system without enough 'regular' main memory to support the memmap for it. To avoid this, ensure that all kmem managed hotplugged memory is added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the new memory region being hot added. To do this, call add_memory() in chunks of memory_block_size_bytes() as that is a requirement for memmap_on_memory. Additionally, Use the mhp_flag to force the memmap_on_memory checks regardless of the respective module parameter setting. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Len Brown <lenb@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> --- drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++------------- 1 file changed, 36 insertions(+), 13 deletions(-)