Message ID | 20240118031412.3300-1-shijie@os.amperecomputing.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | init: refactor the generic cpu_to_node for NUMA | expand |
On Thu, Jan 18, 2024 at 11:14:12AM +0800, Huang Shijie wrote: > (0) We list the ARCHs which support the NUMA: > arm64, loongarch, powerpc, riscv, > sparc, mips, s390, x86, I do not understand this format, what are you saying here? Have you read the kernel documentation for how to write changelog texts? It doesn't say "list a bunch of things", it's a bit more descriptive. > > (1) Some ARCHs in (0) override the generic cpu_to_node(), such as: > sparc, mips, s390, x86. > > Since these ARCHs have their own cpu_to_node(), we do not care > about them. > > (2) The ARCHs enable NUMA and use the generic cpu_to_node. > From (0) and (1), we can know that four ARCHs support NUMA and > use the generic cpu_to_node: > arm64, loongarch, powerpc, riscv, > > The generic cpu_to_node depends on percpu "numa_node". > > (2.1) The loongarch sets "numa_node" in: > start_kernel --> smp_prepare_boot_cpu() > > (2.2) The arm64, powerpc, riscv set "numa_node" in: > start_kernel --> arch_call_rest_init() --> rest_init() > --> kernel_init() --> kernel_init_freeable() > --> smp_prepare_cpus() > > (2.3) The first place calling the cpu_to_node() is early_trace_init(): > start_kernel --> early_trace_init()--> __ring_buffer_alloc() > --> rb_allocate_cpu_buffer() > > (2.4) So it safe for loongarch. But for arm64, powerpc and riscv, > there are at least four places in the common code where > the cpu_to_node() is called before it is initialized: > a.) early_trace_init() in kernel/trace/trace.c > b.) sched_init() in kernel/sched/core.c > c.) init_sched_fair_class() in kernel/sched/fair.c > d.) workqueue_init_early() in kernel/workqueue.c > > (3) In order to fix the issue, the patch refactors the generic cpu_to_node: > (3.1) change cpu_to_node to function pointer, > and export it for kernel modules. > > (3.2) introduce _cpu_to_node() which is the original cpu_to_node(). > > (3.3) introduce smp_prepare_boot_cpu_start() to wrap the original > smp_prepare_boot_cpu(), and set cpu_to_node with > early_cpu_to_node which works fine for arm64, powerpc, > riscv and loongarch. > > (3.4) introduce smp_prepare_cpus_done() to wrap the original > smp_prepare_cpus(). > The "numa_node" is ready after smp_prepare_cpus(), > then set cpu_to_node with _cpu_to_node(). When you start listing different things in a changelog, that's a hint to the reviewer to say "please break this up" as patches need to do only one thing at a time. As I can't follow the above text at all, that's all the review comments I'm able to give here, sorry. But as-is, this isn't acceptable :( thanks, greg k-h
Hi Greg, 在 2024/1/18 17:27, Greg KH 写道: > On Thu, Jan 18, 2024 at 11:14:12AM +0800, Huang Shijie wrote: >> (0) We list the ARCHs which support the NUMA: >> arm64, loongarch, powerpc, riscv, >> sparc, mips, s390, x86, > I do not understand this format, what are you saying here? Sorry for the confusing. I should put the conclusion at the beginning: The generic cpu_to_node() has bug in some situations. The generic cpu_to_node() does not work in arm64, powerpc, riscv when the CONFIG_NUMA is enabled: The cpu_to_node() is called before it is initialized. So all the four places are set with the wrong node id (get by cpu_to_node()): a.) early_trace_init() in kernel/trace/trace.c b.) sched_init() in kernel/sched/core.c c.) init_sched_fair_class() in kernel/sched/fair.c d.) workqueue_init_early() in kernel/workqueue.c Thanks Huang Shijie
Hi Huang, kernel test robot noticed the following build errors: [auto build test ERROR on driver-core/driver-core-testing] [also build test ERROR on driver-core/driver-core-next driver-core/driver-core-linus linus/master v6.8-rc1 next-20240125] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Huang-Shijie/init-refactor-the-generic-cpu_to_node-for-NUMA/20240118-111802 base: driver-core/driver-core-testing patch link: https://lore.kernel.org/r/20240118031412.3300-1-shijie%40os.amperecomputing.com patch subject: [PATCH] init: refactor the generic cpu_to_node for NUMA config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20240129/202401290116.GpUOCzGd-lkp@intel.com/config) compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240129/202401290116.GpUOCzGd-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202401290116.GpUOCzGd-lkp@intel.com/ All errors (new ones prefixed by >>): >> ld.lld: error: undefined symbol: cpu_to_node >>> referenced by main.c:880 (init/main.c:880) >>> init/main.o:(start_kernel) in archive vmlinux.a >>> referenced by main.c:1542 (init/main.c:1542) >>> init/main.o:(kernel_init_freeable) in archive vmlinux.a >>> referenced by core.c:550 (arch/x86/events/amd/core.c:550) >>> arch/x86/events/amd/core.o:(amd_pmu_cpu_prepare) in archive vmlinux.a >>> referenced 179 more times -- >> ld.lld: error: undefined symbol: _cpu_to_node >>> referenced by main.c:1542 (init/main.c:1542) >>> init/main.o:(kernel_init_freeable) in archive vmlinux.a
Hi Huang, kernel test robot noticed the following build errors: [auto build test ERROR on driver-core/driver-core-testing] [also build test ERROR on driver-core/driver-core-next driver-core/driver-core-linus linus/master v6.8-rc1 next-20240125] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Huang-Shijie/init-refactor-the-generic-cpu_to_node-for-NUMA/20240118-111802 base: driver-core/driver-core-testing patch link: https://lore.kernel.org/r/20240118031412.3300-1-shijie%40os.amperecomputing.com patch subject: [PATCH] init: refactor the generic cpu_to_node for NUMA config: x86_64-kexec (https://download.01.org/0day-ci/archive/20240129/202401290316.0eu1Mue2-lkp@intel.com/config) compiler: gcc-12 (Debian 12.2.0-14) 12.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240129/202401290316.0eu1Mue2-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202401290316.0eu1Mue2-lkp@intel.com/ All errors (new ones prefixed by >>): ld: vmlinux.o: in function `amd_pmu_cpu_prepare': >> arch/x86/events/amd/core.c:549: undefined reference to `cpu_to_node' ld: vmlinux.o: in function `amd_alloc_nb': arch/x86/events/amd/core.c:507: undefined reference to `cpu_to_node' ld: vmlinux.o: in function `amd_uncore_ctx_init': >> arch/x86/events/amd/uncore.c:476: undefined reference to `cpu_to_node' ld: vmlinux.o: in function `allocate_shared_regs': >> arch/x86/events/intel/core.c:4520: undefined reference to `cpu_to_node' ld: vmlinux.o: in function `intel_cpuc_prepare': arch/x86/events/intel/core.c:4561: undefined reference to `cpu_to_node' ld: vmlinux.o:arch/x86/events/intel/core.c:4538: more undefined references to `cpu_to_node' follow ld: vmlinux.o: in function `smp_prepare_cpus_done': >> init/main.c:1542: undefined reference to `_cpu_to_node' ld: vmlinux.o: in function `check_timer': >> arch/x86/kernel/apic/io_apic.c:2169: undefined reference to `cpu_to_node' ld: vmlinux.o: in function `kvm_alloc_cpumask': >> arch/x86/kernel/kvm.c:687: undefined reference to `cpu_to_node' ld: vmlinux.o: in function `fork_idle': >> kernel/fork.c:2826: undefined reference to `cpu_to_node' ld: vmlinux.o: in function `cpus_share_numa': >> kernel/workqueue.c:6768: undefined reference to `cpu_to_node' >> ld: kernel/workqueue.c:6768: undefined reference to `cpu_to_node' ld: vmlinux.o:kernel/workqueue.c:6748: more undefined references to `cpu_to_node' follow vim +549 arch/x86/events/amd/core.c 21d59e3e2c403c arch/x86/events/amd/core.c Sandipan Das 2022-04-21 544 b38b24ead33417 arch/x86/kernel/cpu/perf_event_amd.c Peter Zijlstra 2010-03-23 545 static int amd_pmu_cpu_prepare(int cpu) b38b24ead33417 arch/x86/kernel/cpu/perf_event_amd.c Peter Zijlstra 2010-03-23 546 { b38b24ead33417 arch/x86/kernel/cpu/perf_event_amd.c Peter Zijlstra 2010-03-23 547 struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu); b38b24ead33417 arch/x86/kernel/cpu/perf_event_amd.c Peter Zijlstra 2010-03-23 548 f4f925dae7419f arch/x86/events/amd/core.c Sandipan Das 2022-08-11 @549 cpuc->lbr_sel = kzalloc_node(sizeof(struct er_account), GFP_KERNEL, f4f925dae7419f arch/x86/events/amd/core.c Sandipan Das 2022-08-11 550 cpu_to_node(cpu)); f4f925dae7419f arch/x86/events/amd/core.c Sandipan Das 2022-08-11 551 if (!cpuc->lbr_sel) f4f925dae7419f arch/x86/events/amd/core.c Sandipan Das 2022-08-11 552 return -ENOMEM; f4f925dae7419f arch/x86/events/amd/core.c Sandipan Das 2022-08-11 553 b38b24ead33417 arch/x86/kernel/cpu/perf_event_amd.c Peter Zijlstra 2010-03-23 554 WARN_ON_ONCE(cpuc->amd_nb); b38b24ead33417 arch/x86/kernel/cpu/perf_event_amd.c Peter Zijlstra 2010-03-23 555 32b62f446827f6 arch/x86/events/amd/core.c Peter Zijlstra 2016-03-25 556 if (!x86_pmu.amd_nb_constraints) 95ca792c7582fd arch/x86/events/amd/core.c Thomas Gleixner 2016-07-13 557 return 0; b38b24ead33417 arch/x86/kernel/cpu/perf_event_amd.c Peter Zijlstra 2010-03-23 558 c079c791c5a062 arch/x86/kernel/cpu/perf_event_amd.c Peter Zijlstra 2010-11-25 559 cpuc->amd_nb = amd_alloc_nb(cpu); f4f925dae7419f arch/x86/events/amd/core.c Sandipan Das 2022-08-11 560 if (cpuc->amd_nb) 95ca792c7582fd arch/x86/events/amd/core.c Thomas Gleixner 2016-07-13 561 return 0; f4f925dae7419f arch/x86/events/amd/core.c Sandipan Das 2022-08-11 562 f4f925dae7419f arch/x86/events/amd/core.c Sandipan Das 2022-08-11 563 kfree(cpuc->lbr_sel); f4f925dae7419f arch/x86/events/amd/core.c Sandipan Das 2022-08-11 564 cpuc->lbr_sel = NULL; f4f925dae7419f arch/x86/events/amd/core.c Sandipan Das 2022-08-11 565 f4f925dae7419f arch/x86/events/amd/core.c Sandipan Das 2022-08-11 566 return -ENOMEM; b38b24ead33417 arch/x86/kernel/cpu/perf_event_amd.c Peter Zijlstra 2010-03-23 567 } b38b24ead33417 arch/x86/kernel/cpu/perf_event_amd.c Peter Zijlstra 2010-03-23 568
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c index 5b59d133b6af..867a477fa975 100644 --- a/drivers/base/arch_numa.c +++ b/drivers/base/arch_numa.c @@ -61,6 +61,17 @@ EXPORT_SYMBOL(cpumask_of_node); #endif +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID +#ifndef cpu_to_node +int _cpu_to_node(int cpu) +{ + return per_cpu(numa_node, cpu); +} +int (*cpu_to_node)(int cpu); +EXPORT_SYMBOL(cpu_to_node); +#endif +#endif + static void numa_update_cpu(unsigned int cpu, bool remove) { int nid = cpu_to_node(cpu); diff --git a/include/linux/topology.h b/include/linux/topology.h index 52f5850730b3..e7ce2bae11dd 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -91,10 +91,8 @@ static inline int numa_node_id(void) #endif #ifndef cpu_to_node -static inline int cpu_to_node(int cpu) -{ - return per_cpu(numa_node, cpu); -} +extern int (*cpu_to_node)(int cpu); +extern int _cpu_to_node(int cpu); #endif #ifndef set_numa_node diff --git a/init/main.c b/init/main.c index e24b0780fdff..b142e9c51161 100644 --- a/init/main.c +++ b/init/main.c @@ -870,6 +870,18 @@ static void __init print_unknown_bootoptions(void) memblock_free(unknown_options, len); } +static void __init smp_prepare_boot_cpu_start(void) +{ + smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */ + +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID +#ifndef cpu_to_node + /* The early_cpu_to_node should be ready now. */ + cpu_to_node = early_cpu_to_node; +#endif +#endif +} + asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector void start_kernel(void) { @@ -899,7 +911,7 @@ void start_kernel(void) setup_command_line(command_line); setup_nr_cpu_ids(); setup_per_cpu_areas(); - smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */ + smp_prepare_boot_cpu_start(); boot_cpu_hotplug_init(); pr_notice("Kernel command line: %s\n", saved_command_line); @@ -1519,6 +1531,19 @@ void __init console_on_rootfs(void) fput(file); } +static void __init smp_prepare_cpus_done(unsigned int setup_max_cpus) +{ + /* Different ARCHs may override smp_prepare_cpus() */ + smp_prepare_cpus(setup_max_cpus); + +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID +#ifndef cpu_to_node + /* Change to the formal function. */ + cpu_to_node = _cpu_to_node; +#endif +#endif +} + static noinline void __init kernel_init_freeable(void) { /* Now the scheduler is fully set up and can do blocking allocations */ @@ -1531,7 +1556,7 @@ static noinline void __init kernel_init_freeable(void) cad_pid = get_pid(task_pid(current)); - smp_prepare_cpus(setup_max_cpus); + smp_prepare_cpus_done(setup_max_cpus); workqueue_init();
(0) We list the ARCHs which support the NUMA: arm64, loongarch, powerpc, riscv, sparc, mips, s390, x86, (1) Some ARCHs in (0) override the generic cpu_to_node(), such as: sparc, mips, s390, x86. Since these ARCHs have their own cpu_to_node(), we do not care about them. (2) The ARCHs enable NUMA and use the generic cpu_to_node. From (0) and (1), we can know that four ARCHs support NUMA and use the generic cpu_to_node: arm64, loongarch, powerpc, riscv, The generic cpu_to_node depends on percpu "numa_node". (2.1) The loongarch sets "numa_node" in: start_kernel --> smp_prepare_boot_cpu() (2.2) The arm64, powerpc, riscv set "numa_node" in: start_kernel --> arch_call_rest_init() --> rest_init() --> kernel_init() --> kernel_init_freeable() --> smp_prepare_cpus() (2.3) The first place calling the cpu_to_node() is early_trace_init(): start_kernel --> early_trace_init()--> __ring_buffer_alloc() --> rb_allocate_cpu_buffer() (2.4) So it safe for loongarch. But for arm64, powerpc and riscv, there are at least four places in the common code where the cpu_to_node() is called before it is initialized: a.) early_trace_init() in kernel/trace/trace.c b.) sched_init() in kernel/sched/core.c c.) init_sched_fair_class() in kernel/sched/fair.c d.) workqueue_init_early() in kernel/workqueue.c (3) In order to fix the issue, the patch refactors the generic cpu_to_node: (3.1) change cpu_to_node to function pointer, and export it for kernel modules. (3.2) introduce _cpu_to_node() which is the original cpu_to_node(). (3.3) introduce smp_prepare_boot_cpu_start() to wrap the original smp_prepare_boot_cpu(), and set cpu_to_node with early_cpu_to_node which works fine for arm64, powerpc, riscv and loongarch. (3.4) introduce smp_prepare_cpus_done() to wrap the original smp_prepare_cpus(). The "numa_node" is ready after smp_prepare_cpus(), then set cpu_to_node with _cpu_to_node(). Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com> --- drivers/base/arch_numa.c | 11 +++++++++++ include/linux/topology.h | 6 ++---- init/main.c | 29 +++++++++++++++++++++++++++-- 3 files changed, 40 insertions(+), 6 deletions(-)