Message ID | 20200814191449.183998-2-Yazen.Ghannam@amd.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | AMD MCA Address Translation Updates | expand |
* Yazen Ghannam <Yazen.Ghannam@amd.com> wrote: > From: Yazen Ghannam <yazen.ghannam@amd.com> > > The edac_mce_amd module calls decode_dram_ecc() on AMD Family17h and > later systems. This function is used in amd64_edac_mod to do > system-specific decoding for DRAM ECC errors. The function takes a > "NodeId" as a parameter. > > In AMD documentation, NodeId is used to identify a physical die in a > system. This can be used to identify a node in the AMD_NB code and also > it is used with umc_normaddr_to_sysaddr(). > > However, the input used for decode_dram_ecc() is currently the NUMA node > of a logical CPU. In the default configuration, the NUMA node and > physical die will be equivalent, so this doesn't have an impact. But the > NUMA node configuration can be adjusted with optional memory > interleaving schemes. This will cause the NUMA node enumeration to not > match the physical die enumeration. The mismatch will cause the address > translation function to fail or report incorrect results. > > Save the "NodeId" as a percpu value during init in AMD MCE code. Export > a function to return the value which can be used from modules like > edac_mce_amd. > > Fixes: fbe63acf62f5 ("EDAC, mce_amd: Use cpu_to_node() to find the node ID") > Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> > --- > arch/x86/include/asm/mce.h | 2 ++ > arch/x86/kernel/cpu/mce/amd.c | 11 +++++++++++ > drivers/edac/mce_amd.c | 2 +- > 3 files changed, 14 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h > index cf503824529c..92527cc9ed06 100644 > --- a/arch/x86/include/asm/mce.h > +++ b/arch/x86/include/asm/mce.h > @@ -343,6 +343,8 @@ extern struct smca_bank smca_banks[MAX_NR_BANKS]; > extern const char *smca_get_long_name(enum smca_bank_types t); > extern bool amd_mce_is_memory_error(struct mce *m); > > +extern u8 amd_cpu_to_node(unsigned int cpu); > + > extern int mce_threshold_create_device(unsigned int cpu); > extern int mce_threshold_remove_device(unsigned int cpu); > > diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c > index 99be063fcb1b..524edf81e287 100644 > --- a/arch/x86/kernel/cpu/mce/amd.c > +++ b/arch/x86/kernel/cpu/mce/amd.c > @@ -202,6 +202,9 @@ static DEFINE_PER_CPU(unsigned int, bank_map); > /* Map of banks that have more than MCA_MISC0 available. */ > static DEFINE_PER_CPU(u32, smca_misc_banks_map); > > +/* CPUID_Fn8000001E_ECX[NodeId] used to identify a physical node/die. */ > +static DEFINE_PER_CPU(u8, node_id); > + > static void amd_threshold_interrupt(void); > static void amd_deferred_error_interrupt(void); > > @@ -233,6 +236,12 @@ static void smca_set_misc_banks_map(unsigned int bank, unsigned int cpu) > > } > > +u8 amd_cpu_to_node(unsigned int cpu) > +{ > + return per_cpu(node_id, cpu); > +} > +EXPORT_SYMBOL_GPL(amd_cpu_to_node); > + > static void smca_configure(unsigned int bank, unsigned int cpu) > { > unsigned int i, hwid_mcatype; > @@ -240,6 +249,8 @@ static void smca_configure(unsigned int bank, unsigned int cpu) > u32 high, low; > u32 smca_config = MSR_AMD64_SMCA_MCx_CONFIG(bank); > > + this_cpu_write(node_id, cpuid_ecx(0x8000001e) & 0xFF); So we already have this magic number used for a similar purpose, in amd_get_topology(): cpuid(0x8000001e, &eax, &ebx, &ecx, &edx); node_id = ecx & 0xff; Firstly, could we please at least give 0x8000001e a proper symbolic name, use it in hygon.c too (which AFAIK is derived from AMD anyway), and then use it in these new patches? Secondly, why not stick node_id into struct cpuinfo_x86, where the MCA code can then use it without having to introduce a new percpu data structure? There's also the underlying assumption that there's only ever going to be 256 nodes, which limitation I'm sure we'll hear about in a couple of years as not being quite enough. ;-) So less hardcoding and more generalizations please. Thanks, Ingo
On Sat, Aug 15, 2020 at 10:42:12AM +0200, Ingo Molnar wrote: > > * Yazen Ghannam <Yazen.Ghannam@amd.com> wrote: > > > From: Yazen Ghannam <yazen.ghannam@amd.com> > > > > The edac_mce_amd module calls decode_dram_ecc() on AMD Family17h and > > later systems. This function is used in amd64_edac_mod to do > > system-specific decoding for DRAM ECC errors. The function takes a > > "NodeId" as a parameter. > > > > In AMD documentation, NodeId is used to identify a physical die in a > > system. This can be used to identify a node in the AMD_NB code and also > > it is used with umc_normaddr_to_sysaddr(). > > > > However, the input used for decode_dram_ecc() is currently the NUMA node > > of a logical CPU. In the default configuration, the NUMA node and > > physical die will be equivalent, so this doesn't have an impact. But the > > NUMA node configuration can be adjusted with optional memory > > interleaving schemes. This will cause the NUMA node enumeration to not > > match the physical die enumeration. The mismatch will cause the address > > translation function to fail or report incorrect results. > > > > Save the "NodeId" as a percpu value during init in AMD MCE code. Export > > a function to return the value which can be used from modules like > > edac_mce_amd. > > > > Fixes: fbe63acf62f5 ("EDAC, mce_amd: Use cpu_to_node() to find the node ID") > > Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> > > --- > > arch/x86/include/asm/mce.h | 2 ++ > > arch/x86/kernel/cpu/mce/amd.c | 11 +++++++++++ > > drivers/edac/mce_amd.c | 2 +- > > 3 files changed, 14 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h > > index cf503824529c..92527cc9ed06 100644 > > --- a/arch/x86/include/asm/mce.h > > +++ b/arch/x86/include/asm/mce.h > > @@ -343,6 +343,8 @@ extern struct smca_bank smca_banks[MAX_NR_BANKS]; > > extern const char *smca_get_long_name(enum smca_bank_types t); > > extern bool amd_mce_is_memory_error(struct mce *m); > > > > +extern u8 amd_cpu_to_node(unsigned int cpu); > > + > > extern int mce_threshold_create_device(unsigned int cpu); > > extern int mce_threshold_remove_device(unsigned int cpu); > > > > diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c > > index 99be063fcb1b..524edf81e287 100644 > > --- a/arch/x86/kernel/cpu/mce/amd.c > > +++ b/arch/x86/kernel/cpu/mce/amd.c > > @@ -202,6 +202,9 @@ static DEFINE_PER_CPU(unsigned int, bank_map); > > /* Map of banks that have more than MCA_MISC0 available. */ > > static DEFINE_PER_CPU(u32, smca_misc_banks_map); > > > > +/* CPUID_Fn8000001E_ECX[NodeId] used to identify a physical node/die. */ > > +static DEFINE_PER_CPU(u8, node_id); > > + > > static void amd_threshold_interrupt(void); > > static void amd_deferred_error_interrupt(void); > > > > @@ -233,6 +236,12 @@ static void smca_set_misc_banks_map(unsigned int bank, unsigned int cpu) > > > > } > > > > +u8 amd_cpu_to_node(unsigned int cpu) > > +{ > > + return per_cpu(node_id, cpu); > > +} > > +EXPORT_SYMBOL_GPL(amd_cpu_to_node); > > + > > static void smca_configure(unsigned int bank, unsigned int cpu) > > { > > unsigned int i, hwid_mcatype; > > @@ -240,6 +249,8 @@ static void smca_configure(unsigned int bank, unsigned int cpu) > > u32 high, low; > > u32 smca_config = MSR_AMD64_SMCA_MCx_CONFIG(bank); > > > > + this_cpu_write(node_id, cpuid_ecx(0x8000001e) & 0xFF); > > So we already have this magic number used for a similar purpose, in > amd_get_topology(): > > cpuid(0x8000001e, &eax, &ebx, &ecx, &edx); > > node_id = ecx & 0xff; > Yes, that's right. I did have a patch that tried to leverage the existing topology variables. But it wasn't working for all targeted systems. So I thought to have something local to the AMD MCA code in order to avoid messing with the topology code just for this feature. > Firstly, could we please at least give 0x8000001e a proper symbolic > name, use it in hygon.c too (which AFAIK is derived from AMD anyway), > and then use it in these new patches? > Sure, but all places that use a symbolic name for a CPUID leaf define it locally. Should the same be done here? Or should there be common place for all the defines like in <asm/cpufeatures.h> or maybe a new header file? > Secondly, why not stick node_id into struct cpuinfo_x86, where the MCA > code can then use it without having to introduce a new percpu data > structure? > I think this would be the simplest approach. I can write it. Also, the amd_get_nb_id() function could then be replaced with this. > There's also the underlying assumption that there's only ever going to > be 256 nodes, which limitation I'm sure we'll hear about in a couple > of years as not being quite enough. ;-) > Yeah, CPU topology seems very fractal-like. :) > So less hardcoding and more generalizations please. > Will do. Thanks, Yazen
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index cf503824529c..92527cc9ed06 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -343,6 +343,8 @@ extern struct smca_bank smca_banks[MAX_NR_BANKS]; extern const char *smca_get_long_name(enum smca_bank_types t); extern bool amd_mce_is_memory_error(struct mce *m); +extern u8 amd_cpu_to_node(unsigned int cpu); + extern int mce_threshold_create_device(unsigned int cpu); extern int mce_threshold_remove_device(unsigned int cpu); diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c index 99be063fcb1b..524edf81e287 100644 --- a/arch/x86/kernel/cpu/mce/amd.c +++ b/arch/x86/kernel/cpu/mce/amd.c @@ -202,6 +202,9 @@ static DEFINE_PER_CPU(unsigned int, bank_map); /* Map of banks that have more than MCA_MISC0 available. */ static DEFINE_PER_CPU(u32, smca_misc_banks_map); +/* CPUID_Fn8000001E_ECX[NodeId] used to identify a physical node/die. */ +static DEFINE_PER_CPU(u8, node_id); + static void amd_threshold_interrupt(void); static void amd_deferred_error_interrupt(void); @@ -233,6 +236,12 @@ static void smca_set_misc_banks_map(unsigned int bank, unsigned int cpu) } +u8 amd_cpu_to_node(unsigned int cpu) +{ + return per_cpu(node_id, cpu); +} +EXPORT_SYMBOL_GPL(amd_cpu_to_node); + static void smca_configure(unsigned int bank, unsigned int cpu) { unsigned int i, hwid_mcatype; @@ -240,6 +249,8 @@ static void smca_configure(unsigned int bank, unsigned int cpu) u32 high, low; u32 smca_config = MSR_AMD64_SMCA_MCx_CONFIG(bank); + this_cpu_write(node_id, cpuid_ecx(0x8000001e) & 0xFF); + /* Set appropriate bits in MCA_CONFIG */ if (!rdmsr_safe(smca_config, &low, &high)) { /* diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c index 325aedf46ff2..9476097d0fdb 100644 --- a/drivers/edac/mce_amd.c +++ b/drivers/edac/mce_amd.c @@ -996,7 +996,7 @@ static void decode_smca_error(struct mce *m) } if (bank_type == SMCA_UMC && xec == 0 && decode_dram_ecc) - decode_dram_ecc(cpu_to_node(m->extcpu), m); + decode_dram_ecc(amd_cpu_to_node(m->extcpu), m); } static inline void amd_decode_err_code(u16 ec)