Message ID | 20190212095343.23315-2-mhocko@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | x86, numa: always initialize all possible nodes |
Hi -

This patch triggered an oops for me (more below).

On 2/12/19 4:53 AM, Michal Hocko wrote:
[snip]
> Fix the issue by reworking how x86 initializes the memory less nodes.
> The current implementation is hacked into the workflow and it doesn't
> allow any flexibility. There is init_memory_less_node called for each
> offline node that has a CPU as already mentioned above. This will make
> sure that we will have a new online node without any memory. Much later
> on we build a zone list for this node and things seem to work, except
> they do not (e.g. due to nr_cpus). Not to mention that it doesn't really
> make much sense to consider an empty node as online because we just
> consider this node whenever we want to iterate nodes to use and empty
> node is obviously not the best candidate. This is all just too fragile.

The problem might be in here - I have a case with a 'memoryless' node
that has CPUs that get onlined during SMP boot, but that onlining
triggers a page fault during device registration.

I'm running on a NUMA machine but I marked all of the memory on node 1
as type 12 (PRAM), using the memmap arg. That makes node 1 appear to
have no memory.

During SMP boot, the fault is in bus_add_device():

	error = sysfs_create_link(&bus->p->devices_kset->kobj,

bus->p is NULL. That p is the subsys_private struct, and it should have
been set in

	postcore_initcall(register_node_type);

But that happens after SMP boot. This fault happens during SMP boot.

The old code had set this node online via alloc_node_data(), so when it
came time to do_cpu_up() -> try_online_node(), the node was already up
and nothing happened.

Now, it attempts to online the node, which registers the node with
sysfs, but that can't happen before the 'node' subsystem is registered.

My modified e820 map looks like this:

> [    0.000000] user: [mem 0x0000000000000100-0x000000000009c7ff] usable
> [    0.000000] user: [mem 0x000000000009c800-0x000000000009ffff] reserved
> [    0.000000] user: [mem 0x00000000000e0000-0x00000000000fffff] reserved
> [    0.000000] user: [mem 0x0000000000100000-0x0000000073216fff] usable
> [    0.000000] user: [mem 0x0000000073217000-0x0000000075316fff] reserved
> [    0.000000] user: [mem 0x0000000075317000-0x00000000754f8fff] ACPI data
> [    0.000000] user: [mem 0x00000000754f9000-0x0000000076057fff] ACPI NVS
> [    0.000000] user: [mem 0x0000000076058000-0x0000000077ae9fff] reserved
> [    0.000000] user: [mem 0x0000000077aea000-0x0000000077ffffff] usable
> [    0.000000] user: [mem 0x0000000078000000-0x000000008fffffff] reserved
> [    0.000000] user: [mem 0x00000000fd000000-0x00000000fe7fffff] reserved
> [    0.000000] user: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
> [    0.000000] user: [mem 0x0000000100000000-0x00000004ffffffff] usable
> [    0.000000] user: [mem 0x0000000500000000-0x000000603fffffff] persistent (type 12)

Which leads to an empty zone 1:

> [    0.016060] Initmem setup node 0 [mem 0x0000000000001000-0x00000004ffffffff]
> [    0.073310] Initmem setup node 1 [mem 0x0000000000000000-0x0000000000000000]

The backtrace:

> [    2.175327] Call Trace:
> [    2.175327]  device_add+0x43e/0x690
> [    2.175327]  device_register+0x107/0x110
> [    2.175327]  __register_one_node+0x72/0x150
> [    2.175327]  __try_online_node+0x8f/0xd0
> [    2.175327]  try_online_node+0x2b/0x50
> [    2.175327]  do_cpu_up+0x46/0xf0
> [    2.175327]  cpu_up+0x13/0x20
> [    2.175327]  smp_init+0x6e/0xd0
> [    2.175327]  kernel_init_freeable+0xe5/0x21f
> [    2.175327]  ? rest_init+0xb0/0xb0
> [    2.175327]  kernel_init+0xf/0x180
> [    2.175327]  ? rest_init+0xb0/0xb0
> [    2.175327]  ret_from_fork+0x1f/0x30

To get it booting again, I unconditionally node_set_online:

arch/x86/mm/numa.c
@@ -583,7 +583,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;

 		alloc_node_data(nid);
-		if (end)
+		//if (end)
 			node_set_online(nid);
 	}

A more elegant solution may be to avoid registering with sysfs during
early boot, or something else entirely. But I figured I'd ask for help
at this point. =)

Thanks,

Barret
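The ordering problem described above, and one possible reading of "avoid registering with sysfs during early boot", can be sketched in a small userspace model. This is only an illustration: subsys_model, model_online_node() and model_register_subsys() are hypothetical names, not kernel APIs, and the `registered` flag merely stands in for bus->p being non-NULL. Whether deferring registration like this is the right fix for the kernel is exactly the open question in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; every name below is hypothetical. */
#define MAX_NODES 4

struct subsys_model {
	bool registered;                 /* stands in for bus->p != NULL */
	bool node_registered[MAX_NODES]; /* node has its sysfs-like entry */
	bool node_pending[MAX_NODES];    /* onlined before the subsystem existed */
};

/*
 * Online a node. If the subsystem is not registered yet, only remember
 * the node instead of touching the missing subsystem state.
 * Returns true if the node was registered immediately.
 */
static bool model_online_node(struct subsys_model *s, int nid)
{
	assert(s != NULL);
	assert(nid >= 0 && nid < MAX_NODES);

	if (!s->registered) {
		s->node_pending[nid] = true;
		return false;
	}
	s->node_registered[nid] = true;
	return true;
}

/*
 * Register the subsystem, then register every node that was onlined
 * earlier. Returns the number of deferred nodes that were flushed.
 */
static int model_register_subsys(struct subsys_model *s)
{
	int flushed = 0;

	assert(s != NULL);
	s->registered = true;
	for (int nid = 0; nid < MAX_NODES; nid++) {
		if (s->node_pending[nid]) {
			s->node_pending[nid] = false;
			s->node_registered[nid] = true;
			flushed++;
		}
	}
	return flushed;
}
```

In this model, onlining node 1 before model_register_subsys() runs records the node as pending rather than dereferencing state that does not exist yet, which is the step that faults in the backtrace above.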
On Wed 01-05-19 15:12:32, Barret Rhoden wrote:
[...]
> A more elegant solution may be to avoid registering with sysfs during early
> boot, or something else entirely. But I figured I'd ask for help at this
> point. =)

Thanks for the report and an excellent analysis! This is really helpful.
I will think about this some more but I am traveling this week. It seems
really awkward to register a sysfs file for an empty range. That looks
like a bug to me.
On Thu 02-05-19 09:00:31, Michal Hocko wrote:
> On Wed 01-05-19 15:12:32, Barret Rhoden wrote:
> [...]
> > A more elegant solution may be to avoid registering with sysfs during early
> > boot, or something else entirely. But I figured I'd ask for help at this
> > point. =)
>
> Thanks for the report and an excellent analysis! This is really helpful.
> I will think about this some more but I am traveling this week. It seems
> really awkward to register a sysfs file for an empty range. That looks
> like a bug to me.

I am sorry, but I didn't get to this for a long time and I am still
busy. The patch has been dropped from the mm tree (and thus from
linux-next). I hope I can revisit this, or that somebody else will take
over and finish this work. Unfortunately, this is much trickier than I
anticipated.
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f5408bf7..b3621ee4dfe8 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -216,8 +216,6 @@ static void __init alloc_node_data(int nid)

 	node_data[nid] = nd;
 	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-	node_set_online(nid);
 }

 /**
@@ -570,7 +568,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 		return -EINVAL;

 	/* Finally register nodes. */
-	for_each_node_mask(nid, node_possible_map) {
+	for_each_node_mask(nid, numa_nodes_parsed) {
 		u64 start = PFN_PHYS(max_pfn);
 		u64 end = 0;

@@ -581,9 +579,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			end = max(mi->blk[i].end, end);
 		}

-		if (start >= end)
-			continue;
-
 		/*
 		 * Don't confuse VM with a node that doesn't have the
 		 * minimum amount of memory:
@@ -592,6 +587,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;

 		alloc_node_data(nid);
+		if (end)
+			node_set_online(nid);
 	}

 	/* Dump memblock with node info and return. */
@@ -721,21 +718,6 @@ void __init x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }

-static void __init init_memory_less_node(int nid)
-{
-	unsigned long zones_size[MAX_NR_ZONES] = {0};
-	unsigned long zholes_size[MAX_NR_ZONES] = {0};
-
-	/* Allocate and initialize node data. Memory-less node is now online.*/
-	alloc_node_data(nid);
-	free_area_init_node(nid, zones_size, 0, zholes_size);
-
-	/*
-	 * All zonelists will be built later in start_kernel() after per cpu
-	 * areas are initialized.
-	 */
-}
-
 /*
  * Setup early cpu_to_node.
  *
@@ -763,9 +745,6 @@ void __init init_cpu_to_node(void)
 		if (node == NUMA_NO_NODE)
 			continue;

-		if (!node_online(node))
-			init_memory_less_node(node);
-
 		numa_set_node(cpu, node);
 	}
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ec9cc407216..2e097f336126 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5361,10 +5361,11 @@ static void __build_all_zonelists(void *data)
 	if (self && !node_online(self->node_id)) {
 		build_zonelists(self);
 	} else {
-		for_each_online_node(nid) {
+		for_each_node(nid) {
 			pg_data_t *pgdat = NODE_DATA(nid);

-			build_zonelists(pgdat);
+			if (pgdat)
+				build_zonelists(pgdat);
 		}

 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
@@ -6644,10 +6645,8 @@ static unsigned long __init find_min_pfn_for_node(int nid)
 	for_each_mem_pfn_range(i, nid, &start_pfn, NULL, NULL)
 		min_pfn = min(min_pfn, start_pfn);

-	if (min_pfn == ULONG_MAX) {
-		pr_warn("Could not find start_pfn for node %d\n", nid);
+	if (min_pfn == ULONG_MAX)
 		return 0;
-	}

 	return min_pfn;
 }
@@ -6991,8 +6990,12 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
 	zero_resv_unavail();
-	for_each_online_node(nid) {
+	for_each_node(nid) {
 		pg_data_t *pgdat = NODE_DATA(nid);
+
+		if (!pgdat)
+			continue;
+
 		free_area_init_node(nid, NULL,
 				find_min_pfn_for_node(nid), NULL);
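The effect of the numa_register_memblks() hunks can be sketched as a userspace model: every parsed node gets node data allocated, but only a node with at least one memory block (end != 0) is marked online. This is a simplified sketch under stated assumptions, not the kernel code: model_blk, model_result and model_register_nodes() are hypothetical names, the `parsed` array stands in for numa_nodes_parsed, `max_phys` stands in for PFN_PHYS(max_pfn), and the NODE_MIN_SIZE check that the patch leaves in place is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only; every name below is hypothetical. */
#define MODEL_MAX_NODES 4

struct model_blk {
	uint64_t start; /* first byte of the memory block */
	uint64_t end;   /* one past the last byte */
	int nid;        /* node the block belongs to */
};

struct model_result {
	bool allocated[MODEL_MAX_NODES]; /* stands in for alloc_node_data() */
	bool online[MODEL_MAX_NODES];    /* stands in for node_set_online() */
};

static void model_register_nodes(const struct model_blk *blk, int nr_blks,
				 const bool *parsed, uint64_t max_phys,
				 struct model_result *res)
{
	assert(blk && parsed && res);

	for (int nid = 0; nid < MODEL_MAX_NODES; nid++) {
		uint64_t start = max_phys;
		uint64_t end = 0;

		res->allocated[nid] = false;
		res->online[nid] = false;

		/* Only parsed nodes are considered at all. */
		if (!parsed[nid])
			continue;

		/* Compute the span covered by this node's blocks. */
		for (int i = 0; i < nr_blks; i++) {
			if (blk[i].nid != nid)
				continue;
			if (blk[i].start < start)
				start = blk[i].start;
			if (blk[i].end > end)
				end = blk[i].end;
		}

		/* Node data exists even for a memoryless node... */
		res->allocated[nid] = true;
		/* ...but only a node with memory is set online. */
		if (end)
			res->online[nid] = true;
	}
}
```

With one block on node 0 and none on node 1, as in the report above, the model leaves node 1 allocated but offline. That is the state in which the later CPU bring-up tries to online the node itself.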