Message ID | 20200706011947.184166-2-justin.he@arm.com (mailing list archive) |
---|---|
State | New, archived |
Series | Fix and enable pmem as RAM on arm64 |
On 06.07.20 03:19, Jia He wrote:
> Previously, numa_off is set to true unconditionally in dummy_numa_init(),
> even if there is a fake numa node.
>
> But acpi will translate node id to NUMA_NO_NODE(-1) in acpi_map_pxm_to_node()
> because it regards numa_off as turning off the numa node.
>
> Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT table
> isn't present.
>
> $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
> kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
> kmem: probe of dax0.0 failed with error -22
>
> This fixes it by setting numa_off to false.
>
> Signed-off-by: Jia He <justin.he@arm.com>
> ---
>  arch/arm64/mm/numa.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index aafcee3e3f7e..7689986020d9 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -440,7 +440,8 @@ static int __init dummy_numa_init(void)
>  		return ret;
>  	}
>
> -	numa_off = true;
> +	/* force numa_off to be false since we have a fake numa node here */
> +	numa_off = false;
>  	return 0;
>  }
>

What would happen if we use something like this in drivers/dax/kmem.c
instead:

	numa_node = dev_dax->target_node;
	if (numa_node == NUMA_NO_NODE)
		numa_node = memory_add_physaddr_to_nid(kmem_start);

and eventually dropping the pr_warn in
arm64/memory_add_physaddr_to_nid() ? Would that work?
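[Editorial illustration] The fallback suggested in the reply above can be modelled in plain userspace C. This is only a sketch under stated assumptions: mock_physaddr_to_nid() and pick_hotadd_node() are hypothetical names, and the stub always returns node 0, standing in for what memory_add_physaddr_to_nid() might return on a system that has only the single dummy node. It is not the kernel implementation.

```c
#include <stdint.h>

#define NUMA_NO_NODE (-1)

/*
 * Hypothetical stand-in for memory_add_physaddr_to_nid().
 * Assumption: the system has a single (dummy) node, so every
 * physical address maps to node 0.
 */
static int mock_physaddr_to_nid(uint64_t start)
{
	(void)start;
	return 0;
}

/*
 * Choose the node for a hot-added range: keep the target node when
 * it is valid, otherwise fall back to an address-based lookup.
 */
static int pick_hotadd_node(int target_node, uint64_t kmem_start)
{
	int numa_node = target_node;

	if (numa_node == NUMA_NO_NODE)
		numa_node = mock_physaddr_to_nid(kmem_start);
	return numa_node;
}
```

With this shape, a region whose target node is NUMA_NO_NODE would be assigned a usable node instead of being rejected, while a region that already carries a valid node keeps it.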
On Mon, 6 Jul 2020 09:19:45 +0800
Jia He <justin.he@arm.com> wrote:

Hi,

> Previously, numa_off is set to true unconditionally in dummy_numa_init(),
> even if there is a fake numa node.
>
> But acpi will translate node id to NUMA_NO_NODE(-1) in acpi_map_pxm_to_node()
> because it regards numa_off as turning off the numa node.

That is correct. It is operating exactly as it should: if SRAT hasn't been parsed
and you are on an ACPI platform, there are no nodes. They cannot be created at
some later date. The dummy code doesn't change this. It just does enough to carry
on operating with no specified nodes.

>
> Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT table
> isn't present.
>
> $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
> kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
> kmem: probe of dax0.0 failed with error -22
>
> This fixes it by setting numa_off to false.

Without the SRAT protection patch [1] you may well run into problems
because someone somewhere will have _PXM in a DSDT but will
have a non-existent SRAT. We had this happen on an AMD platform when we
tried to introduce working _PXM support for PCI. [2]

So whilst this seems superficially safe, I'd definitely be crossing your fingers.
Note, at that time I proposed putting the numa_off = false into the x86 code
path precisely to cut out that possibility (was rejected at the time, at least
partly because the clarifications to the ACPI spec were not public.)

The patch in [1] should sort things out however by ensuring we only create
new domains where we should actually be doing so. However, in your case
it will return NUMA_NO_NODE anyway so this isn't the right way to fix things.

[1] https://patchwork.kernel.org/patch/11632063/
[2] https://patchwork.kernel.org/patch/10597777/

Thanks,

Jonathan

>
> Signed-off-by: Jia He <justin.he@arm.com>
> ---
>  arch/arm64/mm/numa.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index aafcee3e3f7e..7689986020d9 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -440,7 +440,8 @@ static int __init dummy_numa_init(void)
>  		return ret;
>  	}
>
> -	numa_off = true;
> +	/* force numa_off to be false since we have a fake numa node here */
> +	numa_off = false;
>  	return 0;
>  }
>
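[Editorial illustration] The interaction discussed in this thread, where numa_off forces every proximity domain to NUMA_NO_NODE and the kmem probe then fails with -22, can be modelled in userspace C. This is a sketch under stated assumptions, not kernel code: the mock_* functions and set_numa_off() are hypothetical, the MAX_PXM_DOMAINS value is chosen only for illustration, and the identity mapping from proximity domain to node is a simplification. Only the guard condition is taken from the check quoted elsewhere in the thread.

```c
#include <errno.h>
#include <stdbool.h>

#define NUMA_NO_NODE    (-1)
#define MAX_PXM_DOMAINS 256	/* assumed value, for illustration only */

static bool numa_off;

/* Hypothetical helper so callers can toggle the flag. */
static void set_numa_off(bool off)
{
	numa_off = off;
}

/*
 * Model of the guard in acpi_map_pxm_to_node(): with numa_off set,
 * every proximity domain maps to NUMA_NO_NODE.
 */
static int mock_map_pxm_to_node(int pxm)
{
	if (pxm < 0 || pxm >= MAX_PXM_DOMAINS || numa_off)
		return NUMA_NO_NODE;
	return pxm;	/* simplification: identity mapping */
}

/*
 * Model of the kmem probe check: a negative node id is rejected
 * with -EINVAL, which is the "error -22" seen in the log.
 */
static int mock_kmem_probe(int numa_node)
{
	if (numa_node < 0)
		return -EINVAL;
	return 0;
}
```

In this model, calling set_numa_off(true) makes mock_kmem_probe(mock_map_pxm_to_node(0)) fail, which mirrors the reported rejection of the DAX region.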
On Mon, 6 Jul 2020 11:29:21 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Mon, 6 Jul 2020 09:19:45 +0800
> Jia He <justin.he@arm.com> wrote:
>
> Hi,
>
> > Previously, numa_off is set to true unconditionally in dummy_numa_init(),
> > even if there is a fake numa node.
> >
> > But acpi will translate node id to NUMA_NO_NODE(-1) in acpi_map_pxm_to_node()
> > because it regards numa_off as turning off the numa node.
>
> That is correct. It is operating exactly as it should: if SRAT hasn't been parsed
> and you are on an ACPI platform, there are no nodes. They cannot be created at
> some later date. The dummy code doesn't change this. It just does enough to carry
> on operating with no specified nodes.
>
> >
> > Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT table
> > isn't present.
> >
> > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
> > kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
> > kmem: probe of dax0.0 failed with error -22
> >
> > This fixes it by setting numa_off to false.
>
> Without the SRAT protection patch [1] you may well run into problems
> because someone somewhere will have _PXM in a DSDT but will
> have a non-existent SRAT. We had this happen on an AMD platform when we
> tried to introduce working _PXM support for PCI. [2]
>
> So whilst this seems superficially safe, I'd definitely be crossing your fingers.
> Note, at that time I proposed putting the numa_off = false into the x86 code
> path precisely to cut out that possibility (was rejected at the time, at least
> partly because the clarifications to the ACPI spec were not public.)
>
> The patch in [1] should sort things out however by ensuring we only create
> new domains where we should actually be doing so. However, in your case
> it will return NUMA_NO_NODE anyway so this isn't the right way to fix things.
>
> [1] https://patchwork.kernel.org/patch/11632063/
> [2] https://patchwork.kernel.org/patch/10597777/

Thinking a bit more on this...

I'd like to understand more on what your use case is. Do you have an NFIT
that is setting the proximity domain for the non-volatile memory in SPA
structures? If so, the ACPI spec (6.3 makes this clear) requires those to match
domains described in SRAT. If SRAT isn't there, then we can't expect
sensible results from using these values from NFIT.

If SRAT is there and numa=off is set then we should probably also rule out
parsing NFIT, or make all nfit handling fine with NUMA_NO_NODE, preferably with
explicit checks to ensure we don't try to use the Proximity Node values as they
have no meaning with numa=off.

I note that the core NFIT parsing is fine with the value not being supplied
in the first place.

https://elixir.bootlin.com/linux/latest/source/drivers/acpi/nfit/core.c#L2947

Thanks,

Jonathan

>
> Thanks,
>
> Jonathan
>
> >
> > Signed-off-by: Jia He <justin.he@arm.com>
> > ---
> >  arch/arm64/mm/numa.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > index aafcee3e3f7e..7689986020d9 100644
> > --- a/arch/arm64/mm/numa.c
> > +++ b/arch/arm64/mm/numa.c
> > @@ -440,7 +440,8 @@ static int __init dummy_numa_init(void)
> >  		return ret;
> >  	}
> >
> > -	numa_off = true;
> > +	/* force numa_off to be false since we have a fake numa node here */
> > +	numa_off = false;
> >  	return 0;
> >  }
> >
>
Hi David, thanks for the comments. Please see my answer below:

> -----Original Message-----
> From: David Hildenbrand <david@redhat.com>
> Sent: Monday, July 6, 2020 4:03 PM
> To: Justin He <Justin.He@arm.com>; Catalin Marinas
> <Catalin.Marinas@arm.com>; Will Deacon <will@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>; Mike Rapoport
> <rppt@linux.ibm.com>; Baoquan He <bhe@redhat.com>; Chuhong Yuan
> <hslester96@gmail.com>; linux-arm-kernel@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-mm@kvack.org; Kaly Xin <Kaly.Xin@arm.com>
> Subject: Re: [PATCH 1/3] arm64/numa: set numa_off to false when numa node
> is fake
>
> On 06.07.20 03:19, Jia He wrote:
> > Previously, numa_off is set to true unconditionally in dummy_numa_init(),
> > even if there is a fake numa node.
> >
> > But acpi will translate node id to NUMA_NO_NODE(-1) in acpi_map_pxm_to_node()
> > because it regards numa_off as turning off the numa node.
> >
> > Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT table
> > isn't present.
> >
> > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
> > kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
> > kmem: probe of dax0.0 failed with error -22
> >
> > This fixes it by setting numa_off to false.
> >
> > Signed-off-by: Jia He <justin.he@arm.com>
> > ---
> >  arch/arm64/mm/numa.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > index aafcee3e3f7e..7689986020d9 100644
> > --- a/arch/arm64/mm/numa.c
> > +++ b/arch/arm64/mm/numa.c
> > @@ -440,7 +440,8 @@ static int __init dummy_numa_init(void)
> >  		return ret;
> >  	}
> >
> > -	numa_off = true;
> > +	/* force numa_off to be false since we have a fake numa node here */
> > +	numa_off = false;
> >  	return 0;
> >  }
> >
>
> What would happen if we use something like this in drivers/dax/kmem.c
> instead:
>
> 	numa_node = dev_dax->target_node;
> 	if (numa_node == NUMA_NO_NODE)
> 		numa_node = memory_add_physaddr_to_nid(kmem_start);
>
> and eventually dropping the pr_warn in
> arm64/memory_add_physaddr_to_nid() ? Would that work?

Yes, it works. I sent a similar patch [1] before, but it seems the pmem
maintainer was not satisfied with it. Do you think memory_add_physaddr_to_nid()
is better than numa_mem_id()?

[1] https://lkml.org/lkml/2019/8/16/367

--
Cheers,
Justin (Jia He)
Hi Jonathan, thanks for the comments.

> -----Original Message-----
> From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Sent: Monday, July 6, 2020 6:46 PM
> To: Justin He <Justin.He@arm.com>
> Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Will Deacon
> <will@kernel.org>; Andrew Morton <akpm@linux-foundation.org>; Mike
> Rapoport <rppt@linux.ibm.com>; Baoquan He <bhe@redhat.com>; Chuhong Yuan
> <hslester96@gmail.com>; linux-arm-kernel@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-mm@kvack.org; Kaly Xin <Kaly.Xin@arm.com>
> Subject: Re: [PATCH 1/3] arm64/numa: set numa_off to false when numa node
> is fake
>
> On Mon, 6 Jul 2020 11:29:21 +0100
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
>
> > On Mon, 6 Jul 2020 09:19:45 +0800
> > Jia He <justin.he@arm.com> wrote:
> >
> > Hi,
> >
> > > Previously, numa_off is set to true unconditionally in dummy_numa_init(),
> > > even if there is a fake numa node.
> > >
> > > But acpi will translate node id to NUMA_NO_NODE(-1) in acpi_map_pxm_to_node()
> > > because it regards numa_off as turning off the numa node.
> >
> > That is correct. It is operating exactly as it should: if SRAT hasn't been parsed
> > and you are on an ACPI platform, there are no nodes. They cannot be created at
> > some later date. The dummy code doesn't change this. It just does enough to carry
> > on operating with no specified nodes.
> >
> > >
> > > Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT table
> > > isn't present.
> > >
> > > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
> > > kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
> > > kmem: probe of dax0.0 failed with error -22
> > >
> > > This fixes it by setting numa_off to false.
> >
> > Without the SRAT protection patch [1] you may well run into problems

Sorry, I don't quite understand here. Do you mean your [1] can resolve this
issue? But acpi_map_pxm_to_node() has already returned NUMA_NO_NODE after
the following check:

	if (pxm < 0 || pxm >= MAX_PXM_DOMAINS || numa_off)
		return NUMA_NO_NODE;

It seems that even with your patch [1], this would not help? Please correct me
if my understanding is wrong.

[1] https://patchwork.kernel.org/patch/11632063/

> > because someone somewhere will have _PXM in a DSDT but will
> > have a non-existent SRAT. We had this happen on an AMD platform when we
> > tried to introduce working _PXM support for PCI. [2]
> >
> > So whilst this seems superficially safe, I'd definitely be crossing your fingers.
> > Note, at that time I proposed putting the numa_off = false into the x86 code
> > path precisely to cut out that possibility (was rejected at the time, at least
> > partly because the clarifications to the ACPI spec were not public.)
> >
> > The patch in [1] should sort things out however by ensuring we only create
> > new domains where we should actually be doing so. However, in your case
> > it will return NUMA_NO_NODE anyway so this isn't the right way to fix things.

Okay, let me try to summarize. There might be 3 possible ways to fix this:
1. this patch, which it seems you and David are not satisfied with
On Mon, 6 Jul 2020 12:47:51 +0000
Justin He <Justin.He@arm.com> wrote:

> Hi Jonathan, thanks for the comments.
>
> > -----Original Message-----
> > From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> > Sent: Monday, July 6, 2020 6:46 PM
> > To: Justin He <Justin.He@arm.com>
> > Cc: Catalin Marinas <Catalin.Marinas@arm.com>; Will Deacon
> > <will@kernel.org>; Andrew Morton <akpm@linux-foundation.org>; Mike
> > Rapoport <rppt@linux.ibm.com>; Baoquan He <bhe@redhat.com>; Chuhong Yuan
> > <hslester96@gmail.com>; linux-arm-kernel@lists.infradead.org;
> > linux-kernel@vger.kernel.org; linux-mm@kvack.org; Kaly Xin <Kaly.Xin@arm.com>
> > Subject: Re: [PATCH 1/3] arm64/numa: set numa_off to false when numa node
> > is fake
> >
> > On Mon, 6 Jul 2020 11:29:21 +0100
> > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> >
> > > On Mon, 6 Jul 2020 09:19:45 +0800
> > > Jia He <justin.he@arm.com> wrote:
> > >
> > > Hi,
> > >
> > > > Previously, numa_off is set to true unconditionally in dummy_numa_init(),
> > > > even if there is a fake numa node.
> > > >
> > > > But acpi will translate node id to NUMA_NO_NODE(-1) in acpi_map_pxm_to_node()
> > > > because it regards numa_off as turning off the numa node.
> > >
> > > That is correct. It is operating exactly as it should: if SRAT hasn't been parsed
> > > and you are on an ACPI platform, there are no nodes. They cannot be created at
> > > some later date. The dummy code doesn't change this. It just does enough to carry
> > > on operating with no specified nodes.
> > >
> > > >
> > > > Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT table
> > > > isn't present.
> > > >
> > > > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
> > > > kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
> > > > kmem: probe of dax0.0 failed with error -22
> > > >
> > > > This fixes it by setting numa_off to false.
> > >
> > > Without the SRAT protection patch [1] you may well run into problems
>
> Sorry, I don't quite understand here. Do you mean your [1] can resolve this
> issue? But acpi_map_pxm_to_node() has already returned NUMA_NO_NODE after
> the following check:
> 	if (pxm < 0 || pxm >= MAX_PXM_DOMAINS || numa_off)
> 		return NUMA_NO_NODE;

The point of that patch is that it will make it safe to remove the numa_off
check, because any later accidental reference to a non-existent node (i.e. one
not defined in SRAT) will not blow up.

It doesn't fix your original problem. What it does do is fix the new problem
case you introduce by removing numa_off below. It ensures you still return
NUMA_NO_NODE in cases which should do so (i.e. all of them if you have no SRAT
and are using ACPI).

Of course, you could just not remove the numa_off = true bit; then you won't
hit that condition anyway. There are plenty of other reasons for the SRAT patch
though, it just happens to close a problem you were introducing here as well.

For reference, we had an AMD platform that had no SRAT, but provided _PXM for
a few nodes in its DSDT. That resulted in non-booting systems. It only affected
x86 because ARM64 had that numa_off = true being set. If we change the arm64
case without the patch to ensure the underlying problem is fixed, you are very
likely to hit the equivalent problem. There may well be platforms out there
relying on that quirk of what the code currently does.

> It seems that even with your patch [1], this would not help? Please correct me
> if my understanding is wrong.
> [1] https://patchwork.kernel.org/patch/11632063/
>
> > > because someone somewhere will have _PXM in a DSDT but will
> > > have a non-existent SRAT. We had this happen on an AMD platform when we
> > > tried to introduce working _PXM support for PCI. [2]
> > >
> > > So whilst this seems superficially safe, I'd definitely be crossing your fingers.
> > > Note, at that time I proposed putting the numa_off = false into the x86 code
> > > path precisely to cut out that possibility (was rejected at the time, at least
> > > partly because the clarifications to the ACPI spec were not public.)
> > >
> > > The patch in [1] should sort things out however by ensuring we only create
> > > new domains where we should actually be doing so. However, in your case
> > > it will return NUMA_NO_NODE anyway so this isn't the right way to fix things.
>
> Okay, let me try to summarize. There might be 3 possible ways to fix this:
> 1. this patch, which it seems you and David are not satisfied with
On 06.07.20 14:36, Justin He wrote:
> Hi David, thanks for the comments. Please see my answer below:
>
>> -----Original Message-----
>> From: David Hildenbrand <david@redhat.com>
>> Sent: Monday, July 6, 2020 4:03 PM
>> To: Justin He <Justin.He@arm.com>; Catalin Marinas
>> <Catalin.Marinas@arm.com>; Will Deacon <will@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>; Mike Rapoport
>> <rppt@linux.ibm.com>; Baoquan He <bhe@redhat.com>; Chuhong Yuan
>> <hslester96@gmail.com>; linux-arm-kernel@lists.infradead.org;
>> linux-kernel@vger.kernel.org; linux-mm@kvack.org; Kaly Xin <Kaly.Xin@arm.com>
>> Subject: Re: [PATCH 1/3] arm64/numa: set numa_off to false when numa node
>> is fake
>>
>> On 06.07.20 03:19, Jia He wrote:
>>> Previously, numa_off is set to true unconditionally in dummy_numa_init(),
>>> even if there is a fake numa node.
>>>
>>> But acpi will translate node id to NUMA_NO_NODE(-1) in acpi_map_pxm_to_node()
>>> because it regards numa_off as turning off the numa node.
>>>
>>> Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT table
>>> isn't present.
>>>
>>> $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
>>> kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
>>> kmem: probe of dax0.0 failed with error -22
>>>
>>> This fixes it by setting numa_off to false.
>>>
>>> Signed-off-by: Jia He <justin.he@arm.com>
>>> ---
>>>  arch/arm64/mm/numa.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>> index aafcee3e3f7e..7689986020d9 100644
>>> --- a/arch/arm64/mm/numa.c
>>> +++ b/arch/arm64/mm/numa.c
>>> @@ -440,7 +440,8 @@ static int __init dummy_numa_init(void)
>>>  		return ret;
>>>  	}
>>>
>>> -	numa_off = true;
>>> +	/* force numa_off to be false since we have a fake numa node here */
>>> +	numa_off = false;
>>>  	return 0;
>>>  }
>>>
>>
>> What would happen if we use something like this in drivers/dax/kmem.c
>> instead:
>>
>> 	numa_node = dev_dax->target_node;
>> 	if (numa_node == NUMA_NO_NODE)
>> 		numa_node = memory_add_physaddr_to_nid(kmem_start);
>>
>> and eventually dropping the pr_warn in
>> arm64/memory_add_physaddr_to_nid() ? Would that work?
>
> Yes, it works. I sent a similar patch [1] before, but it seems the pmem
> maintainer was not satisfied with it. Do you think memory_add_physaddr_to_nid()
> is better than numa_mem_id()?

Well, it's the somewhat-common way to get a NID for memory hotadd. E.g.,

- drivers/acpi/acpi_memhotplug.c
- drivers/base/memory.c
- drivers/hv/hv_balloon.c
- drivers/virtio/virtio_mem.c
- drivers/xen/balloon.c

use it in combination with add_memory_*().

In particular, ACPI and virtio-mem use it when NUMA_NO_NODE is detected.
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index aafcee3e3f7e..7689986020d9 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -440,7 +440,8 @@ static int __init dummy_numa_init(void)
 		return ret;
 	}
 
-	numa_off = true;
+	/* force numa_off to be false since we have a fake numa node here */
+	numa_off = false;
 	return 0;
 }
Previously, numa_off is set to true unconditionally in dummy_numa_init(),
even if there is a fake numa node.

But acpi will translate node id to NUMA_NO_NODE(-1) in acpi_map_pxm_to_node()
because it regards numa_off as turning off the numa node.

Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT table
isn't present.

$ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
kmem: probe of dax0.0 failed with error -22

This fixes it by setting numa_off to false.

Signed-off-by: Jia He <justin.he@arm.com>
---
 arch/arm64/mm/numa.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)