Message ID | 20230425201239.324476-1-hristo@venev.name (mailing list archive) |
---|---|
State | New, archived |
Series | EDAC/amd64: Add support for ECC on family 19h model 60h-6Fh |
On 4/25/23 4:12 PM, Hristo Venev wrote:
> Ryzen 9 7950X uses model 61h. Treat it as Epyc 9004, but with 2 channels
> instead of 12.
>
> I tested this with two 32GB dual-rank DIMMs. The sizes appear to be
> reported correctly:
>
> [ 2.122750] EDAC MC0: Giving out device to module amd64_edac controller F19h_M60h: DEV 0000:00:18.3 (INTERRUPT)
> [ 2.122751] EDAC amd64: F19h_M60h detected (node 0).
> [ 2.122754] EDAC MC: UMC0 chip selects:
> [ 2.122754] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 2.122755] EDAC amd64: MC: 2: 16384MB 3: 16384MB
> [ 2.122757] EDAC MC: UMC1 chip selects:
> [ 2.122757] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 2.122758] EDAC amd64: MC: 2: 16384MB 3: 16384MB
> [ 2.122759] AMD64 EDAC driver v3.5.0
>
> ECC errors can also be detected:
>
> [ 313.747594] mce: [Hardware Error]: Machine check events logged
> [ 313.747597] [Hardware Error]: Corrected error, no action required.
> [ 313.747613] [Hardware Error]: CPU:0 (19:61:2) MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000400011b
> [ 313.747632] [Hardware Error]: Error Addr: 0x00000007ff7e93c0
> [ 313.747639] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000100010a801203
> [ 313.747652] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
> [ 313.747669] EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x1)
> [ 313.747672] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
>
> Signed-off-by: Hristo Venev <hristo@venev.name>

Hi Hristo,

Thank you for the patch. It looks good to me.

Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>

> ---
> drivers/edac/amd64_edac.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index b55129425c81..1080784e2784 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -3816,6 +3816,10 @@ static int per_family_init(struct amd64_pvt *pvt)
>  		case 0x50 ... 0x5f:
>  			pvt->ctl_name = "F19h_M50h";
>  			break;
> +		case 0x60 ... 0x6f:
> +			pvt->ctl_name = "F19h_M60h";
> +			pvt->flags.zn_regs_v2 = 1;
> +			break;

Mario,

Are there other Client models that can leverage this change?

Thanks,
Yazen
[AMD Official Use Only - General]

> -----Original Message-----
> From: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Sent: Tuesday, May 9, 2023 9:53 AM
> To: Hristo Venev <hristo@venev.name>; Limonciello, Mario <Mario.Limonciello@amd.com>
> Cc: Ghannam, Yazen <Yazen.Ghannam@amd.com>; Borislav Petkov <bp@alien8.de>; linux-edac@vger.kernel.org
> Subject: Re: [PATCH] EDAC/amd64: Add support for ECC on family 19h model 60h-6Fh
>
> On 4/25/23 4:12 PM, Hristo Venev wrote:
> > Ryzen 9 7950X uses model 61h. Treat it as Epyc 9004, but with 2 channels
> > instead of 12.
> >
> > I tested this with two 32GB dual-rank DIMMs. The sizes appear to be
> > reported correctly:
> >
> > [ 2.122750] EDAC MC0: Giving out device to module amd64_edac controller F19h_M60h: DEV 0000:00:18.3 (INTERRUPT)
> > [ 2.122751] EDAC amd64: F19h_M60h detected (node 0).
> > [ 2.122754] EDAC MC: UMC0 chip selects:
> > [ 2.122754] EDAC amd64: MC: 0: 0MB 1: 0MB
> > [ 2.122755] EDAC amd64: MC: 2: 16384MB 3: 16384MB
> > [ 2.122757] EDAC MC: UMC1 chip selects:
> > [ 2.122757] EDAC amd64: MC: 0: 0MB 1: 0MB
> > [ 2.122758] EDAC amd64: MC: 2: 16384MB 3: 16384MB
> > [ 2.122759] AMD64 EDAC driver v3.5.0
> >
> > ECC errors can also be detected:
> >
> > [ 313.747594] mce: [Hardware Error]: Machine check events logged
> > [ 313.747597] [Hardware Error]: Corrected error, no action required.
> > [ 313.747613] [Hardware Error]: CPU:0 (19:61:2) MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000400011b
> > [ 313.747632] [Hardware Error]: Error Addr: 0x00000007ff7e93c0
> > [ 313.747639] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000100010a801203
> > [ 313.747652] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
> > [ 313.747669] EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x1)
> > [ 313.747672] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
> >
> > Signed-off-by: Hristo Venev <hristo@venev.name>
>
> Hi Hristo,
>
> Thank you for the patch. It looks good to me.
>
> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
>
> > ---
> > drivers/edac/amd64_edac.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> > index b55129425c81..1080784e2784 100644
> > --- a/drivers/edac/amd64_edac.c
> > +++ b/drivers/edac/amd64_edac.c
> > @@ -3816,6 +3816,10 @@ static int per_family_init(struct amd64_pvt *pvt)
> >  		case 0x50 ... 0x5f:
> >  			pvt->ctl_name = "F19h_M50h";
> >  			break;
> > +		case 0x60 ... 0x6f:
> > +			pvt->ctl_name = "F19h_M60h";
> > +			pvt->flags.zn_regs_v2 = 1;
> > +			break;
>
> Mario,
>
> Are there other Client models that can leverage this change?

Yes family 0x19 models 0x70... 0x7f can too, thanks!

>
> Thanks,
> Yazen
On 5/10/23 7:42 PM, Limonciello, Mario wrote:
> [AMD Official Use Only - General]
>
>> -----Original Message-----
>> From: Ghannam, Yazen <Yazen.Ghannam@amd.com>
>> Sent: Tuesday, May 9, 2023 9:53 AM
>> To: Hristo Venev <hristo@venev.name>; Limonciello, Mario <Mario.Limonciello@amd.com>
>> Cc: Ghannam, Yazen <Yazen.Ghannam@amd.com>; Borislav Petkov <bp@alien8.de>; linux-edac@vger.kernel.org
>> Subject: Re: [PATCH] EDAC/amd64: Add support for ECC on family 19h model 60h-6Fh
>>
>> On 4/25/23 4:12 PM, Hristo Venev wrote:
>>> Ryzen 9 7950X uses model 61h. Treat it as Epyc 9004, but with 2 channels
>>> instead of 12.
>>>
>>> I tested this with two 32GB dual-rank DIMMs. The sizes appear to be
>>> reported correctly:
>>>
>>> [ 2.122750] EDAC MC0: Giving out device to module amd64_edac controller F19h_M60h: DEV 0000:00:18.3 (INTERRUPT)
>>> [ 2.122751] EDAC amd64: F19h_M60h detected (node 0).
>>> [ 2.122754] EDAC MC: UMC0 chip selects:
>>> [ 2.122754] EDAC amd64: MC: 0: 0MB 1: 0MB
>>> [ 2.122755] EDAC amd64: MC: 2: 16384MB 3: 16384MB
>>> [ 2.122757] EDAC MC: UMC1 chip selects:
>>> [ 2.122757] EDAC amd64: MC: 0: 0MB 1: 0MB
>>> [ 2.122758] EDAC amd64: MC: 2: 16384MB 3: 16384MB
>>> [ 2.122759] AMD64 EDAC driver v3.5.0
>>>
>>> ECC errors can also be detected:
>>>
>>> [ 313.747594] mce: [Hardware Error]: Machine check events logged
>>> [ 313.747597] [Hardware Error]: Corrected error, no action required.
>>> [ 313.747613] [Hardware Error]: CPU:0 (19:61:2) MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000400011b
>>> [ 313.747632] [Hardware Error]: Error Addr: 0x00000007ff7e93c0
>>> [ 313.747639] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000100010a801203
>>> [ 313.747652] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
>>> [ 313.747669] EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x1)
>>> [ 313.747672] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
>>>
>>> Signed-off-by: Hristo Venev <hristo@venev.name>
>>
>> Hi Hristo,
>>
>> Thank you for the patch. It looks good to me.
>>
>> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
>>
>>> ---
>>> drivers/edac/amd64_edac.c | 4 ++++
>>> 1 file changed, 4 insertions(+)
>>>
>>> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
>>> index b55129425c81..1080784e2784 100644
>>> --- a/drivers/edac/amd64_edac.c
>>> +++ b/drivers/edac/amd64_edac.c
>>> @@ -3816,6 +3816,10 @@ static int per_family_init(struct amd64_pvt *pvt)
>>>  		case 0x50 ... 0x5f:
>>>  			pvt->ctl_name = "F19h_M50h";
>>>  			break;
>>> +		case 0x60 ... 0x6f:
>>> +			pvt->ctl_name = "F19h_M60h";
>>> +			pvt->flags.zn_regs_v2 = 1;
>>> +			break;
>>
>> Mario,
>>
>> Are there other Client models that can leverage this change?
>
> Yes family 0x19 models 0x70... 0x7f can too, thanks!
>

Thanks Mario.

Hristo,

Can you please also add those models?

Thanks,
Yazen
I'll send the updated patch. One thing I noticed is that in the ECC error I observed the address was not decoded successfully. As I don't really have good test infrastructure (getting the error involved tuning voltages over several reboots), do you think you could look into it?
On Thu, May 11, 2023 at 08:45:06PM +0300, Hristo Venev wrote:
> do you think you could look into it?
Yeah, that's being worked on but it'll take a while longer.
Thx.
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index b55129425c81..1080784e2784 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3816,6 +3816,10 @@ static int per_family_init(struct amd64_pvt *pvt)
 		case 0x50 ... 0x5f:
 			pvt->ctl_name = "F19h_M50h";
 			break;
+		case 0x60 ... 0x6f:
+			pvt->ctl_name = "F19h_M60h";
+			pvt->flags.zn_regs_v2 = 1;
+			break;
 		case 0xa0 ... 0xaf:
 			pvt->ctl_name = "F19h_MA0h";
 			pvt->max_mcs = 12;
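For readers who want to poke at the model dispatch outside the kernel tree, here is a minimal user-space sketch of the logic in this hunk. It is not the driver code: `struct pvt_sketch`, `in_range()` and `f19h_model_init_sketch()` are hypothetical names, only the fields visible in the diff context are modelled, and the default of 2 memory controllers is an assumption taken from the "2 channels" wording in the commit message. The 0x70 ... 0x7f branch and its "F19h_M70h" name are also assumptions, based only on the reply in this thread saying those models can use the same change. Plain comparisons are used in place of the GNU `case lo ... hi` range extension that the kernel relies on.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the amd64_pvt fields visible in the diff. */
struct pvt_sketch {
	const char *ctl_name;
	unsigned int zn_regs_v2;
	unsigned int max_mcs;
};

/* True when model lies in [lo, hi]; replaces GNU "case lo ... hi". */
static int in_range(uint8_t model, uint8_t lo, uint8_t hi)
{
	return model >= lo && model <= hi;
}

/*
 * Sketch of the family 19h model dispatch touched by the patch.
 * Returns 0 when the model falls in a range handled here, -1 otherwise.
 */
static int f19h_model_init_sketch(uint8_t model, struct pvt_sketch *pvt)
{
	pvt->ctl_name = NULL;
	pvt->zn_regs_v2 = 0;
	pvt->max_mcs = 2;	/* assumed default: 2 channels, per the commit message */

	if (in_range(model, 0x50, 0x5f)) {
		pvt->ctl_name = "F19h_M50h";
	} else if (in_range(model, 0x60, 0x6f)) {
		/* The range added by this patch (Ryzen 9 7950X is model 61h). */
		pvt->ctl_name = "F19h_M60h";
		pvt->zn_regs_v2 = 1;
	} else if (in_range(model, 0x70, 0x7f)) {
		/* Assumption: name and flag mirror the 60h case, per the thread. */
		pvt->ctl_name = "F19h_M70h";
		pvt->zn_regs_v2 = 1;
	} else if (in_range(model, 0xa0, 0xaf)) {
		/* Only the two assignments visible in the diff context. */
		pvt->ctl_name = "F19h_MA0h";
		pvt->max_mcs = 12;
	} else {
		return -1;
	}
	return 0;
}
```

Calling `f19h_model_init_sketch(0x61, &pvt)` on a zeroed `struct pvt_sketch` selects the "F19h_M60h" branch and leaves `max_mcs` at the assumed default of 2, which is the behaviour the commit message describes for the 7950X.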
Ryzen 9 7950X uses model 61h. Treat it as Epyc 9004, but with 2 channels
instead of 12.

I tested this with two 32GB dual-rank DIMMs. The sizes appear to be
reported correctly:

[ 2.122750] EDAC MC0: Giving out device to module amd64_edac controller F19h_M60h: DEV 0000:00:18.3 (INTERRUPT)
[ 2.122751] EDAC amd64: F19h_M60h detected (node 0).
[ 2.122754] EDAC MC: UMC0 chip selects:
[ 2.122754] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 2.122755] EDAC amd64: MC: 2: 16384MB 3: 16384MB
[ 2.122757] EDAC MC: UMC1 chip selects:
[ 2.122757] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 2.122758] EDAC amd64: MC: 2: 16384MB 3: 16384MB
[ 2.122759] AMD64 EDAC driver v3.5.0

ECC errors can also be detected:

[ 313.747594] mce: [Hardware Error]: Machine check events logged
[ 313.747597] [Hardware Error]: Corrected error, no action required.
[ 313.747613] [Hardware Error]: CPU:0 (19:61:2) MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000400011b
[ 313.747632] [Hardware Error]: Error Addr: 0x00000007ff7e93c0
[ 313.747639] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000100010a801203
[ 313.747652] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[ 313.747669] EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x1)
[ 313.747672] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Signed-off-by: Hristo Venev <hristo@venev.name>
---
 drivers/edac/amd64_edac.c | 4 ++++
 1 file changed, 4 insertions(+)
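As a cross-check of the MC21_STATUS line in the log above, the raw value 0xdc2040000400011b can be decoded by hand. The sketch below assumes the usual x86/AMD MCA status bit positions (Val 63, Over 62, UC 61, En 60, MiscV 59, AddrV 58, PCC 57, SyndV 53, CECC 46, UECC 45, Deferred 44, Poison 43); the struct and function names are hypothetical and this is not the kernel's decoder.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(n) (1ULL << (n))

/* Assumed MCA status bit positions (see the lead-in). */
#define ST_VAL      BIT_ULL(63)
#define ST_OVER     BIT_ULL(62)
#define ST_UC       BIT_ULL(61)
#define ST_EN       BIT_ULL(60)
#define ST_MISCV    BIT_ULL(59)
#define ST_ADDRV    BIT_ULL(58)
#define ST_PCC      BIT_ULL(57)
#define ST_SYNDV    BIT_ULL(53)
#define ST_CECC     BIT_ULL(46)
#define ST_UECC     BIT_ULL(45)
#define ST_DEFERRED BIT_ULL(44)
#define ST_POISON   BIT_ULL(43)

/* Hypothetical container for the decoded flags. */
struct mca_status_flags {
	bool val, over, uc, en, miscv, addrv, pcc;
	bool syndv, cecc, uecc, deferred, poison;
};

/* Split a raw MCA status value into individual flag bits. */
static struct mca_status_flags decode_mca_status(uint64_t status)
{
	struct mca_status_flags f = {
		.val      = (status & ST_VAL) != 0,
		.over     = (status & ST_OVER) != 0,
		.uc       = (status & ST_UC) != 0,
		.en       = (status & ST_EN) != 0,
		.miscv    = (status & ST_MISCV) != 0,
		.addrv    = (status & ST_ADDRV) != 0,
		.pcc      = (status & ST_PCC) != 0,
		.syndv    = (status & ST_SYNDV) != 0,
		.cecc     = (status & ST_CECC) != 0,
		.uecc     = (status & ST_UECC) != 0,
		.deferred = (status & ST_DEFERRED) != 0,
		.poison   = (status & ST_POISON) != 0,
	};
	return f;
}
```

Under those assumptions, `decode_mca_status(0xdc2040000400011bULL)` sets Val, Over, En, MiscV, AddrV, SyndV and CECC and leaves UC, PCC, UECC, Deferred and Poison clear, which agrees with the `Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-` string in the log: a corrected DRAM ECC error with a valid address and syndrome.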