Message ID | 20180619163256.GA18952@e107981-ln.cambridge.arm.com (mailing list archive) |
---|---|
State | New, archived |
Hi Lorenzo, Punit,

On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>>
>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>>> [...]
>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>>>> as a temporary fix (it'll also be easier to backport) while we work on
>>>> fixing these other issues and enabling memoryless nodes.
>>>
>>> Well, x86 already does that but copying this antipattern is not really
>>> nice. So it is good as a quick fix but it would be definitely much
>>> better to have a robust fix. Who knows how many other places might hit
>>> this. You certainly do not want to add a hack like this all over...
>>
>> Completely agree! I was only suggesting it as a temporary measure,
>> especially as it looked like a proper fix might be invasive.
>>
>> Another fix might be to change the node specific allocation to node
>> agnostic allocations. It isn't clear why the allocation is being
>> requested from a specific node. I think Lorenzo suggested this in one of
>> the threads.
>
> I think that code was just copypasted but it is better to fix the
> underlying issue.
>
>> I've started putting together a set fixing the issues identified in this
>> thread. It should give a better idea on the best course of action.
>
> On ACPI ARM64, this diff should do if I read the code correctly, it
> should be (famous last words) just a matter of mapping PXMs to nodes for
> every SRAT GICC entry, feel free to pick it up if it works.
>
> Yes, we can take the original patch just because it is safer for an -rc
> cycle even though if the patch below would do delaying the fix for a
> couple of -rc (to get it tested across ACPI ARM64 NUMA platforms) is
> not a disaster.

I tested this patch on my arm board, it works.
Xie XiuQi <xiexiuqi@huawei.com> writes:

> Hi Lorenzo, Punit,
>
>
> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>>> Michal Hocko <mhocko@kernel.org> writes:
>>>
>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>>>> [...]
>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>>>>> as a temporary fix (it'll also be easier to backport) while we work on
>>>>> fixing these other issues and enabling memoryless nodes.
>>>>
>>>> Well, x86 already does that but copying this antipattern is not really
>>>> nice. So it is good as a quick fix but it would be definitely much
>>>> better to have a robust fix. Who knows how many other places might hit
>>>> this. You certainly do not want to add a hack like this all over...
>>>
>>> Completely agree! I was only suggesting it as a temporary measure,
>>> especially as it looked like a proper fix might be invasive.
>>>
>>> Another fix might be to change the node specific allocation to node
>>> agnostic allocations. It isn't clear why the allocation is being
>>> requested from a specific node. I think Lorenzo suggested this in one of
>>> the threads.
>>
>> I think that code was just copypasted but it is better to fix the
>> underlying issue.
>>
>>> I've started putting together a set fixing the issues identified in this
>>> thread. It should give a better idea on the best course of action.
>>
>> On ACPI ARM64, this diff should do if I read the code correctly, it
>> should be (famous last words) just a matter of mapping PXMs to nodes for
>> every SRAT GICC entry, feel free to pick it up if it works.
>>
>> Yes, we can take the original patch just because it is safer for an -rc
>> cycle even though if the patch below would do delaying the fix for a
>> couple of -rc (to get it tested across ACPI ARM64 NUMA platforms) is
>> not a disaster.
>
> I tested this patch on my arm board, it works.

I am assuming you tried the patch without enabling support for
memory-less nodes.

The patch de-couples the onlining of numa nodes (as parsed from SRAT)
from NR_CPUS restriction. When it comes to building zonelists, the node
referenced by the PCI controller also has zonelists initialised.

So it looks like a fallback node is setup even if we don't have
memory-less nodes enabled. I need to stare some more at the code to see
why we need memory-less nodes at all then ...
On 2018/6/20 19:51, Punit Agrawal wrote:
> Xie XiuQi <xiexiuqi@huawei.com> writes:
>
>> Hi Lorenzo, Punit,
>>
>>
>> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>>>> Michal Hocko <mhocko@kernel.org> writes:
>>>>
>>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>>>>> [...]
>>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>>>>>> as a temporary fix (it'll also be easier to backport) while we work on
>>>>>> fixing these other issues and enabling memoryless nodes.
>>>>>
>>>>> Well, x86 already does that but copying this antipattern is not really
>>>>> nice. So it is good as a quick fix but it would be definitely much
>>>>> better to have a robust fix. Who knows how many other places might hit
>>>>> this. You certainly do not want to add a hack like this all over...
>>>>
>>>> Completely agree! I was only suggesting it as a temporary measure,
>>>> especially as it looked like a proper fix might be invasive.
>>>>
>>>> Another fix might be to change the node specific allocation to node
>>>> agnostic allocations. It isn't clear why the allocation is being
>>>> requested from a specific node. I think Lorenzo suggested this in one of
>>>> the threads.
>>>
>>> I think that code was just copypasted but it is better to fix the
>>> underlying issue.
>>>
>>>> I've started putting together a set fixing the issues identified in this
>>>> thread. It should give a better idea on the best course of action.
>>>
>>> On ACPI ARM64, this diff should do if I read the code correctly, it
>>> should be (famous last words) just a matter of mapping PXMs to nodes for
>>> every SRAT GICC entry, feel free to pick it up if it works.
>>>
>>> Yes, we can take the original patch just because it is safer for an -rc
>>> cycle even though if the patch below would do delaying the fix for a
>>> couple of -rc (to get it tested across ACPI ARM64 NUMA platforms) is
>>> not a disaster.
>>
>> I tested this patch on my arm board, it works.
>
> I am assuming you tried the patch without enabling support for
> memory-less nodes.
>
> The patch de-couples the onlining of numa nodes (as parsed from SRAT)
> from NR_CPUS restriction. When it comes to building zonelists, the node
> referenced by the PCI controller also has zonelists initialised.
>
> So it looks like a fallback node is setup even if we don't have
> memory-less nodes enabled. I need to stare some more at the code to see
> why we need memory-less nodes at all then ...

Yes, please. From my limited MM knowledge, zonelists should not be
initialised if no CPU and no memory on this node, correct me if I'm
wrong.

Thanks
Hanjun
On Fri 22-06-18 16:58:05, Hanjun Guo wrote:
> On 2018/6/20 19:51, Punit Agrawal wrote:
> > Xie XiuQi <xiexiuqi@huawei.com> writes:
> >
> >> Hi Lorenzo, Punit,
> >>
> >>
> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
> >>>> Michal Hocko <mhocko@kernel.org> writes:
> >>>>
> >>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> >>>>> [...]
> >>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> >>>>>> as a temporary fix (it'll also be easier to backport) while we work on
> >>>>>> fixing these other issues and enabling memoryless nodes.
> >>>>>
> >>>>> Well, x86 already does that but copying this antipattern is not really
> >>>>> nice. So it is good as a quick fix but it would be definitely much
> >>>>> better to have a robust fix. Who knows how many other places might hit
> >>>>> this. You certainly do not want to add a hack like this all over...
> >>>>
> >>>> Completely agree! I was only suggesting it as a temporary measure,
> >>>> especially as it looked like a proper fix might be invasive.
> >>>>
> >>>> Another fix might be to change the node specific allocation to node
> >>>> agnostic allocations. It isn't clear why the allocation is being
> >>>> requested from a specific node. I think Lorenzo suggested this in one of
> >>>> the threads.
> >>>
> >>> I think that code was just copypasted but it is better to fix the
> >>> underlying issue.
> >>>
> >>>> I've started putting together a set fixing the issues identified in this
> >>>> thread. It should give a better idea on the best course of action.
> >>>
> >>> On ACPI ARM64, this diff should do if I read the code correctly, it
> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
> >>> every SRAT GICC entry, feel free to pick it up if it works.
> >>>
> >>> Yes, we can take the original patch just because it is safer for an -rc
> >>> cycle even though if the patch below would do delaying the fix for a
> >>> couple of -rc (to get it tested across ACPI ARM64 NUMA platforms) is
> >>> not a disaster.
> >>
> >> I tested this patch on my arm board, it works.
> >
> > I am assuming you tried the patch without enabling support for
> > memory-less nodes.
> >
> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
> > from NR_CPUS restriction. When it comes to building zonelists, the node
> > referenced by the PCI controller also has zonelists initialised.
> >
> > So it looks like a fallback node is setup even if we don't have
> > memory-less nodes enabled. I need to stare some more at the code to see
> > why we need memory-less nodes at all then ...
>
> Yes, please. From my limited MM knowledge, zonelists should not be
> initialised if no CPU and no memory on this node, correct me if I'm
> wrong.

Well, as long as there is code which can explicitly ask for a specific
node then it is safer to have zonelists configured. Otherwise you just
force callers to add hacks and figure out the proper placement there.
Zonelists should be cheap to configure for all possible nodes. It's not
like we are talking about a huge amount of resources.
Michal Hocko <mhocko@kernel.org> writes:

> On Fri 22-06-18 16:58:05, Hanjun Guo wrote:
>> On 2018/6/20 19:51, Punit Agrawal wrote:
>> > Xie XiuQi <xiexiuqi@huawei.com> writes:
>> >
>> >> Hi Lorenzo, Punit,
>> >>
>> >>
>> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>> >>>> Michal Hocko <mhocko@kernel.org> writes:
>> >>>>
>> >>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>> >>>>> [...]
>> >>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>> >>>>>> as a temporary fix (it'll also be easier to backport) while we work on
>> >>>>>> fixing these other issues and enabling memoryless nodes.
>> >>>>>
>> >>>>> Well, x86 already does that but copying this antipattern is not really
>> >>>>> nice. So it is good as a quick fix but it would be definitely much
>> >>>>> better to have a robust fix. Who knows how many other places might hit
>> >>>>> this. You certainly do not want to add a hack like this all over...
>> >>>>
>> >>>> Completely agree! I was only suggesting it as a temporary measure,
>> >>>> especially as it looked like a proper fix might be invasive.
>> >>>>
>> >>>> Another fix might be to change the node specific allocation to node
>> >>>> agnostic allocations. It isn't clear why the allocation is being
>> >>>> requested from a specific node. I think Lorenzo suggested this in one of
>> >>>> the threads.
>> >>>
>> >>> I think that code was just copypasted but it is better to fix the
>> >>> underlying issue.
>> >>>
>> >>>> I've started putting together a set fixing the issues identified in this
>> >>>> thread. It should give a better idea on the best course of action.
>> >>>
>> >>> On ACPI ARM64, this diff should do if I read the code correctly, it
>> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
>> >>> every SRAT GICC entry, feel free to pick it up if it works.
>> >>>
>> >>> Yes, we can take the original patch just because it is safer for an -rc
>> >>> cycle even though if the patch below would do delaying the fix for a
>> >>> couple of -rc (to get it tested across ACPI ARM64 NUMA platforms) is
>> >>> not a disaster.
>> >>
>> >> I tested this patch on my arm board, it works.
>> >
>> > I am assuming you tried the patch without enabling support for
>> > memory-less nodes.
>> >
>> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
>> > from NR_CPUS restriction. When it comes to building zonelists, the node
>> > referenced by the PCI controller also has zonelists initialised.
>> >
>> > So it looks like a fallback node is setup even if we don't have
>> > memory-less nodes enabled. I need to stare some more at the code to see
>> > why we need memory-less nodes at all then ...
>>
>> Yes, please. From my limited MM knowledge, zonelists should not be
>> initialised if no CPU and no memory on this node, correct me if I'm
>> wrong.
>
> Well, as long as there is code which can explicitly ask for a specific
> node then it is safer to have zonelists configured. Otherwise you just
> force callers to add hacks and figure out the proper placement there.
> Zonelists should be cheap to configure for all possible nodes. It's not
> like we are talking about a huge amount of resources.

I agree. The current problem stems from not configuring the zonelists
for nodes that don't have onlined cpu and memory. Lorenzo's patch fixes
the configuration of such nodes.

For allocation requests targeting memory-less nodes, the allocator will
take the slow path and fall back to one of the other nodes based on the
zonelists.

I'm not sure how common such allocations are but I'll work on enabling
CONFIG_HAVE_MEMORYLESS_NODES on top of Lorenzo's patch. AIUI, this
config improves the fallback mechanism by starting the search from a
near-by node with memory.
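As a rough illustration of the fallback described in the message above, here is a minimal user-space sketch, not kernel code. It assumes a much simplified model in which each node carries an ordered list of node IDs to try; real zonelists hold zones rather than nodes and are built by the page allocator. The names `toy_node`, `toy_alloc_on_node` and `TOY_MAX_NODES` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NODES 4	/* hypothetical node count for the sketch */

/* Hypothetical, simplified stand-in for a node and its zonelist. */
struct toy_node {
	unsigned long free_pages;	/* 0 models a memory-less node */
	int fallback[TOY_MAX_NODES];	/* node IDs to try, most preferred first */
};

/*
 * Take one page on behalf of 'preferred', walking its fallback list.
 * Returns the node that supplied the page, or -1 if every node is empty.
 */
static int toy_alloc_on_node(struct toy_node *nodes, int preferred)
{
	assert(preferred >= 0 && preferred < TOY_MAX_NODES);

	for (size_t i = 0; i < TOY_MAX_NODES; i++) {
		int nid = nodes[preferred].fallback[i];

		if (nodes[nid].free_pages > 0) {
			nodes[nid].free_pages--;
			return nid;
		}
	}
	return -1;
}
```

In this model a request aimed at a memory-less node still succeeds as long as that node has a fallback list configured, which is the point made above about configuring zonelists for all possible nodes. If the list were never set up, the caller would have nothing to walk.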
On Fri, 22 Jun 2018 11:24:38 +0100
Punit Agrawal <punit.agrawal@arm.com> wrote:

> Michal Hocko <mhocko@kernel.org> writes:
>
> > On Fri 22-06-18 16:58:05, Hanjun Guo wrote:
> >> On 2018/6/20 19:51, Punit Agrawal wrote:
> >> > Xie XiuQi <xiexiuqi@huawei.com> writes:
> >> >
> >> >> Hi Lorenzo, Punit,
> >> >>
> >> >>
> >> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
> >> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
> >> >>>> Michal Hocko <mhocko@kernel.org> writes:
> >> >>>>
> >> >>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> >> >>>>> [...]
> >> >>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> >> >>>>>> as a temporary fix (it'll also be easier to backport) while we work on
> >> >>>>>> fixing these other issues and enabling memoryless nodes.
> >> >>>>>
> >> >>>>> Well, x86 already does that but copying this antipattern is not really
> >> >>>>> nice. So it is good as a quick fix but it would be definitely much
> >> >>>>> better to have a robust fix. Who knows how many other places might hit
> >> >>>>> this. You certainly do not want to add a hack like this all over...
> >> >>>>
> >> >>>> Completely agree! I was only suggesting it as a temporary measure,
> >> >>>> especially as it looked like a proper fix might be invasive.
> >> >>>>
> >> >>>> Another fix might be to change the node specific allocation to node
> >> >>>> agnostic allocations. It isn't clear why the allocation is being
> >> >>>> requested from a specific node. I think Lorenzo suggested this in one of
> >> >>>> the threads.
> >> >>>
> >> >>> I think that code was just copypasted but it is better to fix the
> >> >>> underlying issue.
> >> >>>
> >> >>>> I've started putting together a set fixing the issues identified in this
> >> >>>> thread. It should give a better idea on the best course of action.
> >> >>>
> >> >>> On ACPI ARM64, this diff should do if I read the code correctly, it
> >> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
> >> >>> every SRAT GICC entry, feel free to pick it up if it works.
> >> >>>
> >> >>> Yes, we can take the original patch just because it is safer for an -rc
> >> >>> cycle even though if the patch below would do delaying the fix for a
> >> >>> couple of -rc (to get it tested across ACPI ARM64 NUMA platforms) is
> >> >>> not a disaster.
> >> >>
> >> >> I tested this patch on my arm board, it works.
> >> >
> >> > I am assuming you tried the patch without enabling support for
> >> > memory-less nodes.
> >> >
> >> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
> >> > from NR_CPUS restriction. When it comes to building zonelists, the node
> >> > referenced by the PCI controller also has zonelists initialised.
> >> >
> >> > So it looks like a fallback node is setup even if we don't have
> >> > memory-less nodes enabled. I need to stare some more at the code to see
> >> > why we need memory-less nodes at all then ...
> >>
> >> Yes, please. From my limited MM knowledge, zonelists should not be
> >> initialised if no CPU and no memory on this node, correct me if I'm
> >> wrong.
> >
> > Well, as long as there is code which can explicitly ask for a specific
> > node then it is safer to have zonelists configured. Otherwise you just
> > force callers to add hacks and figure out the proper placement there.
> > Zonelists should be cheap to configure for all possible nodes. It's not
> > like we are talking about a huge amount of resources.
>
> I agree. The current problem stems from not configuring the zonelists
> for nodes that don't have onlined cpu and memory. Lorenzo's patch fixes
> the configuration of such nodes.
>
> For allocation requests targeting memory-less nodes, the allocator will
> take the slow path and fall back to one of the other nodes based on the
> zonelists.
>
> I'm not sure how common such allocations are but I'll work on enabling
> CONFIG_HAVE_MEMORYLESS_NODES on top of Lorenzo's patch. AIUI, this
> config improves the fallback mechanism by starting the search from a
> near-by node with memory.

I'll test it when back in the office, but I had a similar issue with
memory only nodes when I moved the SRAT listing for cpus from the
4th node to the 3rd node to fake some memory I could hot unplug.
This gave a memory only node for the last node on the system.

When I instead moved cpus from the 3rd node to the 4th (so the node
with only memory was now in the middle), everything worked.

Was odd, and I'd been meaning to chase it down but hadn't gotten to it
yet. If I get time I'll put together some test firmwares and see if there
are any other nasty corner cases we aren't handling.

Jonathan

>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Jonathan Cameron <jonathan.cameron@huawei.com> writes:

[...]

>
> I'll test it when back in the office, but I had a similar issue with
> memory only nodes when I moved the SRAT listing for cpus from the
> 4th node to the 3rd node to fake some memory I could hot unplug.
> This gave a memory only node for the last node on the system.
>
> When I instead moved cpus from the 3rd node to the 4th (so the node
> with only memory was now in the middle), everything worked.
>
> Was odd, and I'd been meaning to chase it down but hadn't gotten to it
> yet. If I get time I'll put together some test firmwares and see if there
> are any other nasty corner cases we aren't handling.

If you get a chance, it'd be really helpful to test reversing the
ordering of entries in the SRAT and booting with a restricted
NR_CPUS.

This issue was found through code inspection.

Please make sure to use the updated patch from Lorenzo for your
tests[0].

[0] https://marc.info/?l=linux-acpi&m=152998665713983&w=2

>
> Jonathan
>
>>
>> _______________________________________________
>> linux-arm-kernel mailing list
>> linux-arm-kernel@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
index d190a7b231bf..877b268ef9fa 100644
--- a/arch/arm64/kernel/acpi_numa.c
+++ b/arch/arm64/kernel/acpi_numa.c
@@ -70,12 +70,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
 	if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
 		return;
 
-	if (cpus_in_srat >= NR_CPUS) {
-		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
-			     NR_CPUS);
-		return;
-	}
-
 	pxm = pa->proximity_domain;
 	node = acpi_map_pxm_to_node(pxm);
 
@@ -85,6 +79,14 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
 		return;
 	}
 
+	node_set(node, numa_nodes_parsed);
+
+	if (cpus_in_srat >= NR_CPUS) {
+		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
+			     NR_CPUS);
+		return;
+	}
+
 	mpidr = acpi_map_madt_entry(pa->acpi_processor_uid);
 	if (mpidr == PHYS_CPUID_INVALID) {
 		pr_err("SRAT: PXM %d with ACPI ID %d has no valid MPIDR in MADT\n",
@@ -95,7 +97,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
 
 	early_node_cpu_hwid[cpus_in_srat].node_id = node;
 	early_node_cpu_hwid[cpus_in_srat].cpu_hwid = mpidr;
-	node_set(node, numa_nodes_parsed);
 	cpus_in_srat++;
 	pr_info("SRAT: PXM %d -> MPIDR 0x%Lx -> Node %d\n",
 		pxm, mpidr, node);
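
To illustrate what the reordering in the diff changes, here is a minimal user-space sketch, not kernel code. It assumes a toy SRAT in which each GICC entry only names its node, and a CPU limit of 2. `struct toy_gicc`, `parse_srat()`, `TOY_NR_CPUS` and the plain bitmask are hypothetical stand-ins for `struct acpi_srat_gicc_affinity`, `acpi_numa_gicc_affinity_init()`, `NR_CPUS` and `numa_nodes_parsed`. The PXM mapping and MPIDR lookup steps are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 2	/* hypothetical stand-in for NR_CPUS */

/* Hypothetical, simplified SRAT GICC entry: only the node it maps to. */
struct toy_gicc {
	unsigned int node;
};

/*
 * Walk the entries and return a bitmask of the nodes marked as parsed.
 * mark_before_limit == false models the old ordering (CPU limit checked
 * first); mark_before_limit == true models the ordering after the patch.
 */
static unsigned int parse_srat(const struct toy_gicc *e, size_t n,
			       bool mark_before_limit)
{
	unsigned int nodes_parsed = 0;
	size_t cpus_in_srat = 0;

	for (size_t i = 0; i < n; i++) {
		assert(e[i].node < 32);	/* keep the shift below well defined */

		if (mark_before_limit)
			nodes_parsed |= 1u << e[i].node;

		if (cpus_in_srat >= TOY_NR_CPUS)
			continue;	/* entry skipped: CPU table is full */

		if (!mark_before_limit)
			nodes_parsed |= 1u << e[i].node;

		cpus_in_srat++;
	}
	return nodes_parsed;
}
```

With four entries mapping to nodes 0, 0, 1, 1 and a limit of two CPUs, the old ordering marks only node 0, while the new ordering marks nodes 0 and 1. That is the de-coupling of node onlining from the NR_CPUS restriction discussed in the thread.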