Message ID | 20220527122528.129445-5-aneesh.kumar@linux.ibm.com (mailing list archive) |
---|---|
State | New |
Series | mm/demotion: Memory tiers and demotion |
On 5/27/2022 5:55 PM, Aneesh Kumar K.V wrote:
> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>
> By default, all nodes are assigned to DEFAULT_MEMORY_TIER which
> is memory tier 1 which is designated for nodes with DRAM, so it
> is not the right tier for dax devices.
>
> Set dax kmem device node's tier to MEMORY_TIER_PMEM, In future,
> support should be added to distinguish the dax-devices which should
> not be MEMORY_TIER_PMEM and right memory tier should be set for them.
>
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  drivers/dax/kmem.c | 4 ++++
>  mm/migrate.c       | 2 ++
>  2 files changed, 6 insertions(+)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index a37622060fff..991782aa2448 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -11,6 +11,7 @@
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/mman.h>
> +#include <linux/migrate.h>
>  #include "dax-private.h"
>  #include "bus.h"
>
> @@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>
>  	dev_set_drvdata(dev, data);
>
> +#ifdef CONFIG_TIERED_MEMORY
> +	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> +#endif

I was experimenting with this patchset and found this behaviour.
Here's what I did:

Boot a KVM guest with vNVDIMM device which ends up with device_dax
driver by default.

Use it as RAM by binding it to dax kmem driver. It now appears as
RAM with a new NUMA node that is put to memtier1 (the existing tier
where DRAM already exists)

I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
that expected to happen automatically when a node with dax kmem
device comes up?

Regards,
Bharata.
On 6/1/22 11:59 AM, Bharata B Rao wrote:
> On 5/27/2022 5:55 PM, Aneesh Kumar K.V wrote:
>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>
>> By default, all nodes are assigned to DEFAULT_MEMORY_TIER which
>> is memory tier 1 which is designated for nodes with DRAM, so it
>> is not the right tier for dax devices.
>>
>> Set dax kmem device node's tier to MEMORY_TIER_PMEM, In future,
>> support should be added to distinguish the dax-devices which should
>> not be MEMORY_TIER_PMEM and right memory tier should be set for them.
>>
>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  drivers/dax/kmem.c | 4 ++++
>>  mm/migrate.c       | 2 ++
>>  2 files changed, 6 insertions(+)
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index a37622060fff..991782aa2448 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/fs.h>
>>  #include <linux/mm.h>
>>  #include <linux/mman.h>
>> +#include <linux/migrate.h>
>>  #include "dax-private.h"
>>  #include "bus.h"
>>
>> @@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>
>>  	dev_set_drvdata(dev, data);
>>
>> +#ifdef CONFIG_TIERED_MEMORY
>> +	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
>> +#endif
>
> I was experimenting with this patchset and found this behaviour.
> Here's what I did:
>
> Boot a KVM guest with vNVDIMM device which ends up with device_dax
> driver by default.
>
> Use it as RAM by binding it to dax kmem driver. It now appears as
> RAM with a new NUMA node that is put to memtier1 (the existing tier
> where DRAM already exists)
>

That should have placed it in memtier2.

> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
> that expected to happen automatically when a node with dax kmem
> device comes up?
>

This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?

Thank you for testing the patchset.

-aneesh
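[As an aside for readers following along: one way to make that check visible is a one-off debug helper along the lines of the sketch below. struct memory_tier and __node_get_memory_tier() stand in for whatever lookup this series actually provides internally, so the names here are assumptions for illustration, not the patchset's API.]

/*
 * Debugging sketch only (not from the patchset): report whether a node
 * already belongs to a memory tier before dax kmem assigns
 * MEMORY_TIER_PMEM. __node_get_memory_tier() is an assumed lookup
 * helper; the series may expose this information differently.
 */
static void dump_node_tier_state(int node)
{
	struct memory_tier *memtier = __node_get_memory_tier(node);

	if (memtier)
		pr_info("node %d already belongs to a memory tier\n", node);
	else
		pr_info("node %d is not in any memory tier yet\n", node);
}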
On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>> I was experimenting with this patchset and found this behaviour.
>> Here's what I did:
>>
>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>> driver by default.
>>
>> Use it as RAM by binding it to dax kmem driver. It now appears as
>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>> where DRAM already exists)
>>
>
> That should have placed it in memtier2.
>
>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>> that expected to happen automatically when a node with dax kmem
>> device comes up?
>>
>
> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?

When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
is already part of memtier1 whose nodelist shows 0-1.

Regards,
Bharata.
On 6/2/22 12:06 PM, Bharata B Rao wrote:
> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>> I was experimenting with this patchset and found this behaviour.
>>> Here's what I did:
>>>
>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>> driver by default.
>>>
>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>> where DRAM already exists)
>>>
>>
>> That should have placed it in memtier2.
>>
>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>> that expected to happen automatically when a node with dax kmem
>>> device comes up?
>>>
>>
>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>
> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
> is already part of memtier1 whose nodelist shows 0-1.
>

can you find out which code path added node1 to memtier1?

Do you have regular memory also appearing on node1?

-aneesh
On 6/3/2022 2:34 PM, Aneesh Kumar K V wrote:
> On 6/2/22 12:06 PM, Bharata B Rao wrote:
>> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>>> I was experimenting with this patchset and found this behaviour.
>>>> Here's what I did:
>>>>
>>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>>> driver by default.
>>>>
>>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>>> where DRAM already exists)
>>>>
>>>
>>> That should have placed it in memtier2.
>>>
>>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>>> that expected to happen automatically when a node with dax kmem
>>>> device comes up?
>>>>
>>>
>>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>>
>> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
>> is already part of memtier1 whose nodelist shows 0-1.
>>
>
> can you find out which code path added node1 to memtier1?

node_set_memory_tier_rank+0x63/0x80
migrate_on_reclaim_callback+0x40/0x4d
blocking_notifier_call_chain+0x68/0x90
memory_notify+0x1b/0x20
online_pages+0x257/0x2f0
memory_subsys_online+0x99/0x150
device_online+0x65/0x90
online_memory_block+0x1b/0x20
walk_memory_blocks+0x85/0xc0
? generic_online_page+0x40/0x40
add_memory_resource+0x1fa/0x2d0
add_memory_driver_managed+0x80/0xc0
dev_dax_kmem_probe+0x1af/0x250
dax_bus_probe+0x6e/0xa0

After this the explicit call to node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM)
from dev_dax_kmem_probe() finds that the memtier is already set.

> Do you have regular memory also appearing on node1?

No, regular memory is on Node0.

Regards,
Bharata.
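[For context, the trace above bottoms out in the series' memory hotplug notifier: when the dax memory blocks are onlined, the MEM_ONLINE notification assigns the not-yet-tiered node to the default DRAM tier, which is why node1 is already in memtier1 by the time the driver's explicit call runs. The sketch below reuses the function name from the trace only to make that connection; the two-argument node_set_memory_tier_rank() form and the MEMORY_RANK_DRAM constant are assumptions based on how the thread refers to them, not code copied from the series.]

/*
 * Rough sketch of the notifier path seen in the trace (assumed
 * behaviour, not the series' actual implementation): on MEM_ONLINE,
 * a node with no tier yet is dropped into the default DRAM tier.
 */
static int migrate_on_reclaim_callback(struct notifier_block *self,
				       unsigned long action, void *arg)
{
	struct memory_notify *nfy = arg;

	if (action == MEM_ONLINE)
		node_set_memory_tier_rank(nfy->status_change_nid,
					  MEMORY_RANK_DRAM);

	return notifier_from_errno(0);
}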
On 6/6/22 3:41 PM, Bharata B Rao wrote:
> On 6/3/2022 2:34 PM, Aneesh Kumar K V wrote:
>> On 6/2/22 12:06 PM, Bharata B Rao wrote:
>>> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>>>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>>>> I was experimenting with this patchset and found this behaviour.
>>>>> Here's what I did:
>>>>>
>>>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>>>> driver by default.
>>>>>
>>>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>>>> where DRAM already exists)
>>>>>
>>>>
>>>> That should have placed it in memtier2.
>>>>
>>>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>>>> that expected to happen automatically when a node with dax kmem
>>>>> device comes up?
>>>>>
>>>>
>>>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>>>
>>> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
>>> is already part of memtier1 whose nodelist shows 0-1.
>>>
>>
>> can you find out which code path added node1 to memtier1?
>
> node_set_memory_tier_rank+0x63/0x80
> migrate_on_reclaim_callback+0x40/0x4d
> blocking_notifier_call_chain+0x68/0x90
> memory_notify+0x1b/0x20
> online_pages+0x257/0x2f0
> memory_subsys_online+0x99/0x150
> device_online+0x65/0x90
> online_memory_block+0x1b/0x20
> walk_memory_blocks+0x85/0xc0
> ? generic_online_page+0x40/0x40
> add_memory_resource+0x1fa/0x2d0
> add_memory_driver_managed+0x80/0xc0
> dev_dax_kmem_probe+0x1af/0x250
> dax_bus_probe+0x6e/0xa0
>
> After this the explicit call to node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM)
> from dev_dax_kmem_probe() finds that the memtier is already set.
>
>> Do you have regular memory also appearing on node1?
>
> No, regular memory is on Node0.
>

Thanks for the stack trace. I was getting the kvm setup on my laptop to
test this. We should move node_set_mem_tier() early. You had automatic
online on memory hotplug

	/* online pages if requested */
	if (mhp_default_online_type != MMOP_OFFLINE)
		walk_memory_blocks(start, size, NULL, online_memory_block);

which caused memory to be onlined before we could do node_set_mem_tier.
That is a bug on my side. Will send you a change after testing.

-aneesh
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 6/6/22 3:41 PM, Bharata B Rao wrote:
>> On 6/3/2022 2:34 PM, Aneesh Kumar K V wrote:
>>> On 6/2/22 12:06 PM, Bharata B Rao wrote:
>>>> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>>>>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>>>>> I was experimenting with this patchset and found this behaviour.
>>>>>> Here's what I did:
>>>>>>
>>>>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>>>>> driver by default.
>>>>>>
>>>>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>>>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>>>>> where DRAM already exists)
>>>>>>
>>>>>
>>>>> That should have placed it in memtier2.
>>>>>
>>>>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>>>>> that expected to happen automatically when a node with dax kmem
>>>>>> device comes up?
>>>>>>
>>>>>
>>>>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>>>>
>>>> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
>>>> is already part of memtier1 whose nodelist shows 0-1.
>>>>
>>>
>>> can you find out which code path added node1 to memtier1?
>>
>> node_set_memory_tier_rank+0x63/0x80
>> migrate_on_reclaim_callback+0x40/0x4d
>> blocking_notifier_call_chain+0x68/0x90
>> memory_notify+0x1b/0x20
>> online_pages+0x257/0x2f0
>> memory_subsys_online+0x99/0x150
>> device_online+0x65/0x90
>> online_memory_block+0x1b/0x20
>> walk_memory_blocks+0x85/0xc0
>> ? generic_online_page+0x40/0x40
>> add_memory_resource+0x1fa/0x2d0
>> add_memory_driver_managed+0x80/0xc0
>> dev_dax_kmem_probe+0x1af/0x250
>> dax_bus_probe+0x6e/0xa0
>>
>> After this the explicit call to node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM)
>> from dev_dax_kmem_probe() finds that the memtier is already set.
>>
>>> Do you have regular memory also appearing on node1?
>>
>> No, regular memory is on Node0.
>>
>
> Thanks for the stack trace. I was getting the kvm setup on my laptop to
> test this. We should move node_set_mem_tier() early. You had automatic
> online on memory hotplug
>
> 	/* online pages if requested */
> 	if (mhp_default_online_type != MMOP_OFFLINE)
> 		walk_memory_blocks(start, size, NULL, online_memory_block);
>
> which caused memory to be onlined before we could do node_set_mem_tier.
> That is a bug on my side. Will send you a change after testing.
>

Can you try this change?

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7a11c387fbbc..905609260dda 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -94,6 +94,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		goto err_reg_mgid;
 	data->mgid = rc;
 
+	/*
+	 * This get called before the node is brought online. That
+	 * is because depending on the value of mhp_default_online_type
+	 * the kernel will online the memory along with hotplug
+	 * operation. Add the new memory tier before we try to bring
+	 * memory blocks online. Otherwise new node will get added to
+	 * the default memory tier via hotplug callbacks.
+	 */
+#ifdef CONFIG_TIERED_MEMORY
+	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+#endif
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct resource *res;
 		struct range range;
@@ -148,9 +159,6 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 
 	dev_set_drvdata(dev, data);
 
-#ifdef CONFIG_TIERED_MEMORY
-	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
-#endif
 	return 0;
 
 err_request_mem:
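[For readers skimming the diff, here is a condensed paraphrase of the probe flow after this change. It is my sketch, not code from the series; the per-range hotplug loop and all error handling are elided, and deriving numa_node from dev_dax->target_node follows the upstream driver.]

/* Condensed sketch of dev_dax_kmem_probe() ordering after the fix. */
static int dev_dax_kmem_probe_sketch(struct dev_dax *dev_dax)
{
	int numa_node = dev_dax->target_node;

#ifdef CONFIG_TIERED_MEMORY
	/* 1. Assign the PMEM tier while the node has no onlined memory yet. */
	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
#endif
	/*
	 * 2. Hotplug the device ranges. If mhp_default_online_type is not
	 *    MMOP_OFFLINE the blocks are onlined right here, and the
	 *    MEM_ONLINE notifier now finds the node already in a tier, so
	 *    it is no longer pushed into the default DRAM tier.
	 */
	/* for each range: add_memory_driver_managed(...); */

	return 0;
}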
On 6/6/2022 5:24 PM, Aneesh Kumar K.V wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>
> Can you try this change?
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7a11c387fbbc..905609260dda 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -94,6 +94,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  		goto err_reg_mgid;
>  	data->mgid = rc;
>
> +	/*
> +	 * This get called before the node is brought online. That
> +	 * is because depending on the value of mhp_default_online_type
> +	 * the kernel will online the memory along with hotplug
> +	 * operation. Add the new memory tier before we try to bring
> +	 * memory blocks online. Otherwise new node will get added to
> +	 * the default memory tier via hotplug callbacks.
> +	 */
> +#ifdef CONFIG_TIERED_MEMORY
> +	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> +#endif
>  	for (i = 0; i < dev_dax->nr_range; i++) {
>  		struct resource *res;
>  		struct range range;
> @@ -148,9 +159,6 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>
>  	dev_set_drvdata(dev, data);
>
> -#ifdef CONFIG_TIERED_MEMORY
> -	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> -#endif
>  	return 0;
>
>  err_request_mem:

Yes, this fixes the issue for me. Thanks.

Regards,
Bharata.
On 6/6/22 5:39 PM, Bharata B Rao wrote:
> On 6/6/2022 5:24 PM, Aneesh Kumar K.V wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>
>> Can you try this change?
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index 7a11c387fbbc..905609260dda 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -94,6 +94,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  		goto err_reg_mgid;
>>  	data->mgid = rc;
>>
>> +	/*
>> +	 * This get called before the node is brought online. That
>> +	 * is because depending on the value of mhp_default_online_type
>> +	 * the kernel will online the memory along with hotplug
>> +	 * operation. Add the new memory tier before we try to bring
>> +	 * memory blocks online. Otherwise new node will get added to
>> +	 * the default memory tier via hotplug callbacks.
>> +	 */
>> +#ifdef CONFIG_TIERED_MEMORY
>> +	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
>> +#endif
>>  	for (i = 0; i < dev_dax->nr_range; i++) {
>>  		struct resource *res;
>>  		struct range range;
>> @@ -148,9 +159,6 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>
>>  	dev_set_drvdata(dev, data);
>>
>> -#ifdef CONFIG_TIERED_MEMORY
>> -	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
>> -#endif
>>  	return 0;
>>
>>  err_request_mem:
>
> Yes, this fixes the issue for me. Thanks.
>

I might put the below change instead of the above. In the end I guess it
is better to add a NUMA node to memory tier after the node is brought
online than before even though with the current code it shouldn't matter
much.

modified   drivers/dax/kmem.c
@@ -147,9 +147,15 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	}
 
 	dev_set_drvdata(dev, data);
-
+	/*
+	 * node_reset_memory_tier is used here to ensure we force
+	 * update the NUMA node memory tier. Depending on the value
+	 * of mhp_default_online_type the kernel will online the memory
+	 * blocks along with hotplug operation above. This can result in dax
+	 * kmem memory NUMA node getting added to default memory tier.
+	 */
 #ifdef CONFIG_TIERED_MEMORY
-	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+	node_reset_memory_tier(numa_node, MEMORY_TIER_PMEM);
 #endif
 	return 0;
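[The reset variant itself is not shown in this thread, so purely as an illustration of the intent: something like the sketch below, which drops the node from whatever tier the hotplug callback put it in and then adds it to the requested one. Every identifier here apart from node_reset_memory_tier() (the lock, the lookup, the low-level setter) is a hypothetical name, not the series' actual internals.]

/*
 * Hypothetical sketch of a "reset" helper: clear the node's current tier
 * membership, then assign the requested tier. Lock and helper names are
 * invented for illustration only.
 */
int node_reset_memory_tier(int node, int tier)
{
	struct memory_tier *current_tier;
	int ret;

	mutex_lock(&memory_tier_lock);
	current_tier = __node_get_memory_tier(node);
	if (current_tier)
		node_clear(node, current_tier->nodelist);
	ret = __node_set_memory_tier(node, tier);
	mutex_unlock(&memory_tier_lock);

	return ret;
}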
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..991782aa2448 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -11,6 +11,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/migrate.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 
 	dev_set_drvdata(dev, data);
 
+#ifdef CONFIG_TIERED_MEMORY
+	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+#endif
 	return 0;
 
 err_request_mem:
diff --git a/mm/migrate.c b/mm/migrate.c
index d819a64db5b1..59d8558dd2ee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2418,6 +2418,8 @@ int node_set_memory_tier(int node, int tier)
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(node_set_memory_tier);
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
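[A small possible cleanup, not part of the patch above: if the header that declares node_set_memory_tier() provided a stub for builds without CONFIG_TIERED_MEMORY, callers such as dev_dax_kmem_probe() could drop the #ifdef around the call. A sketch of that common kernel idiom follows; the int return type is taken from the mm/migrate.c hunk, while the header placement is an assumption.]

/* In the header that declares the tiering API (sketch only): */
#ifdef CONFIG_TIERED_MEMORY
int node_set_memory_tier(int node, int tier);
#else
static inline int node_set_memory_tier(int node, int tier)
{
	return 0;
}
#endif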