
EDAC/amd64: Add support for ECC on family 19h model 60h-6Fh

Message ID 20230425201239.324476-1-hristo@venev.name (mailing list archive)
State New, archived
Series EDAC/amd64: Add support for ECC on family 19h model 60h-6Fh

Commit Message

Hristo Venev April 25, 2023, 8:12 p.m. UTC
Ryzen 9 7950X uses model 61h. Treat it as Epyc 9004, but with 2 channels
instead of 12.

I tested this with two 32GB dual-rank DIMMs. The sizes appear to be
reported correctly:

    [    2.122750] EDAC MC0: Giving out device to module amd64_edac controller F19h_M60h: DEV 0000:00:18.3 (INTERRUPT)
    [    2.122751] EDAC amd64: F19h_M60h detected (node 0).
    [    2.122754] EDAC MC: UMC0 chip selects:
    [    2.122754] EDAC amd64: MC: 0:     0MB 1:     0MB
    [    2.122755] EDAC amd64: MC: 2: 16384MB 3: 16384MB
    [    2.122757] EDAC MC: UMC1 chip selects:
    [    2.122757] EDAC amd64: MC: 0:     0MB 1:     0MB
    [    2.122758] EDAC amd64: MC: 2: 16384MB 3: 16384MB
    [    2.122759] AMD64 EDAC driver v3.5.0

ECC errors can also be detected:

    [  313.747594] mce: [Hardware Error]: Machine check events logged
    [  313.747597] [Hardware Error]: Corrected error, no action required.
    [  313.747613] [Hardware Error]: CPU:0 (19:61:2) MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000400011b
    [  313.747632] [Hardware Error]: Error Addr: 0x00000007ff7e93c0
    [  313.747639] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000100010a801203
    [  313.747652] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
    [  313.747669] EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x1)
    [  313.747672] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
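
The bracketed flag list in the MC21_STATUS line can be cross-checked against the raw value. The following is a minimal sketch, not code from the patch: the bit positions follow the MCI_STATUS_* macros in the kernel's arch/x86/include/asm/mce.h, the 6-bit extended error code at bits 21:16 is the layout assumed for this kind of bank, and the macro and helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Bit positions as in the MCI_STATUS_* macros of the Linux kernel's
 * arch/x86/include/asm/mce.h. The names below are hypothetical and
 * exist only for this illustration.
 */
#define STATUS_VAL   (1ULL << 63) /* register holds a valid error */
#define STATUS_OVER  (1ULL << 62) /* a previous error was overwritten */
#define STATUS_UC    (1ULL << 61) /* error was not corrected */
#define STATUS_MISCV (1ULL << 59) /* MISC register is valid */
#define STATUS_ADDRV (1ULL << 58) /* ADDR register is valid */
#define STATUS_SYNDV (1ULL << 53) /* SYND register is valid */
#define STATUS_CECC  (1ULL << 46) /* corrected ECC error */
#define STATUS_UECC  (1ULL << 45) /* uncorrected ECC error */

/* True if the given flag bit is set in the status value. */
static bool status_has(uint64_t status, uint64_t flag)
{
	return (status & flag) != 0;
}

/* A valid error with UC clear is what the log reports as "CE". */
static bool status_is_corrected(uint64_t status)
{
	return status_has(status, STATUS_VAL) && !status_has(status, STATUS_UC);
}

/* Extended error code, assumed here to occupy bits 21:16. */
static unsigned int status_ext_error_code(uint64_t status)
{
	return (unsigned int)((status >> 16) & 0x3f);
}
```

For the value logged above, 0xdc2040000400011b, these helpers report Over, MiscV, AddrV, SyndV and CECC set, UC clear, and an extended error code of 0, which agrees with the flag list and the "Ext. Error Code: 0" line.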

Signed-off-by: Hristo Venev <hristo@venev.name>
---
 drivers/edac/amd64_edac.c | 4 ++++
 1 file changed, 4 insertions(+)
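
To illustrate what the four added lines do, here is a standalone sketch of the model-range dispatch, written outside the kernel. The struct mc_config type and the f19h_model_config() helper are hypothetical stand-ins for struct amd64_pvt and per_family_init(). The default of 2 memory controllers is an assumption taken from the "2 channels" wording above, and only the cases visible in this patch's context are reproduced. The `case lo ... hi:` ranges are a GNU C extension, as in the driver.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields of struct amd64_pvt touched here. */
struct mc_config {
	const char *ctl_name;   /* controller name printed in the EDAC log */
	unsigned int max_mcs;   /* number of memory controllers (channels) */
	bool zn_regs_v2;        /* use the newer register layout */
};

/*
 * Fill *cfg for a family 19h model number. Returns false for models
 * that this sketch does not cover.
 */
static bool f19h_model_config(unsigned int model, struct mc_config *cfg)
{
	cfg->ctl_name = NULL;
	cfg->max_mcs = 2;        /* assumed default, per "2 channels" */
	cfg->zn_regs_v2 = false;

	switch (model) {
	case 0x50 ... 0x5f:
		cfg->ctl_name = "F19h_M50h";
		break;
	case 0x60 ... 0x6f:      /* the range added by this patch */
		cfg->ctl_name = "F19h_M60h";
		cfg->zn_regs_v2 = true;
		break;
	case 0xa0 ... 0xaf:
		cfg->ctl_name = "F19h_MA0h";
		cfg->max_mcs = 12;
		/* the rest of this case is not visible in the patch context */
		break;
	default:
		return false;
	}
	return true;
}
```

With model 0x61 (the Ryzen 9 7950X mentioned above), the sketch selects "F19h_M60h", leaves the controller count at the assumed default of 2, and sets zn_regs_v2.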

Comments

Yazen Ghannam May 9, 2023, 2:53 p.m. UTC | #1
On 4/25/23 4:12 PM, Hristo Venev wrote:
> Ryzen 9 7950X uses model 61h. Treat it as Epyc 9004, but with 2 channels
> instead of 12.
> 
> I tested this with two 32GB dual-rank DIMMs. The sizes appear to be
> reported correctly:
> 
>     [    2.122750] EDAC MC0: Giving out device to module amd64_edac controller F19h_M60h: DEV 0000:00:18.3 (INTERRUPT)
>     [    2.122751] EDAC amd64: F19h_M60h detected (node 0).
>     [    2.122754] EDAC MC: UMC0 chip selects:
>     [    2.122754] EDAC amd64: MC: 0:     0MB 1:     0MB
>     [    2.122755] EDAC amd64: MC: 2: 16384MB 3: 16384MB
>     [    2.122757] EDAC MC: UMC1 chip selects:
>     [    2.122757] EDAC amd64: MC: 0:     0MB 1:     0MB
>     [    2.122758] EDAC amd64: MC: 2: 16384MB 3: 16384MB
>     [    2.122759] AMD64 EDAC driver v3.5.0
> 
> ECC errors can also be detected:
> 
>     [  313.747594] mce: [Hardware Error]: Machine check events logged
>     [  313.747597] [Hardware Error]: Corrected error, no action required.
>     [  313.747613] [Hardware Error]: CPU:0 (19:61:2) MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000400011b
>     [  313.747632] [Hardware Error]: Error Addr: 0x00000007ff7e93c0
>     [  313.747639] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000100010a801203
>     [  313.747652] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
>     [  313.747669] EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x1)
>     [  313.747672] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
> 
> Signed-off-by: Hristo Venev <hristo@venev.name>

Hi Hristo,

Thank you for the patch. It looks good to me.

Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>

> ---
>  drivers/edac/amd64_edac.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index b55129425c81..1080784e2784 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -3816,6 +3816,10 @@ static int per_family_init(struct amd64_pvt *pvt)
>  		case 0x50 ... 0x5f:
>  			pvt->ctl_name			= "F19h_M50h";
>  			break;
> +		case 0x60 ... 0x6f:
> +			pvt->ctl_name			= "F19h_M60h";
> +			pvt->flags.zn_regs_v2		= 1;
> +			break;

Mario,

Are there other Client models that can leverage this change?

Thanks,
Yazen

Mario Limonciello May 10, 2023, 11:42 p.m. UTC | #2
[AMD Official Use Only - General]

> -----Original Message-----
> From: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Sent: Tuesday, May 9, 2023 9:53 AM
> To: Hristo Venev <hristo@venev.name>; Limonciello, Mario
> <Mario.Limonciello@amd.com>
> Cc: Ghannam, Yazen <Yazen.Ghannam@amd.com>; Borislav Petkov
> <bp@alien8.de>; linux-edac@vger.kernel.org
> Subject: Re: [PATCH] EDAC/amd64: Add support for ECC on family 19h model
> 60h-6Fh
>
> On 4/25/23 4:12 PM, Hristo Venev wrote:
> > Ryzen 9 7950X uses model 61h. Treat it as Epyc 9004, but with 2 channels
> > instead of 12.
> >
> > I tested this with two 32GB dual-rank DIMMs. The sizes appear to be
> > reported correctly:
> >
> >     [    2.122750] EDAC MC0: Giving out device to module amd64_edac
> controller F19h_M60h: DEV 0000:00:18.3 (INTERRUPT)
> >     [    2.122751] EDAC amd64: F19h_M60h detected (node 0).
> >     [    2.122754] EDAC MC: UMC0 chip selects:
> >     [    2.122754] EDAC amd64: MC: 0:     0MB 1:     0MB
> >     [    2.122755] EDAC amd64: MC: 2: 16384MB 3: 16384MB
> >     [    2.122757] EDAC MC: UMC1 chip selects:
> >     [    2.122757] EDAC amd64: MC: 0:     0MB 1:     0MB
> >     [    2.122758] EDAC amd64: MC: 2: 16384MB 3: 16384MB
> >     [    2.122759] AMD64 EDAC driver v3.5.0
> >
> > ECC errors can also be detected:
> >
> >     [  313.747594] mce: [Hardware Error]: Machine check events logged
> >     [  313.747597] [Hardware Error]: Corrected error, no action required.
> >     [  313.747613] [Hardware Error]: CPU:0 (19:61:2)
> MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]:
> 0xdc2040000400011b
> >     [  313.747632] [Hardware Error]: Error Addr: 0x00000007ff7e93c0
> >     [  313.747639] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome:
> 0x000100010a801203
> >     [  313.747652] [Hardware Error]: Unified Memory Controller Ext. Error
> Code: 0, DRAM ECC error.
> >     [  313.747669] EDAC MC0: 1 CE Cannot decode normalized address on
> mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64
> syndrome:0x1)
> >     [  313.747672] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx:
> RD
> >
> > Signed-off-by: Hristo Venev <hristo@venev.name>
>
> Hi Hristo,
>
> Thank you for the patch. It looks good to me.
>
> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
>
> > ---
> >  drivers/edac/amd64_edac.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> > index b55129425c81..1080784e2784 100644
> > --- a/drivers/edac/amd64_edac.c
> > +++ b/drivers/edac/amd64_edac.c
> > @@ -3816,6 +3816,10 @@ static int per_family_init(struct amd64_pvt *pvt)
> >             case 0x50 ... 0x5f:
> >                     pvt->ctl_name                   = "F19h_M50h";
> >                     break;
> > +           case 0x60 ... 0x6f:
> > +                   pvt->ctl_name                   = "F19h_M60h";
> > +                   pvt->flags.zn_regs_v2           = 1;
> > +                   break;
>
> Mario,
>
> Are there other Client models that can leverage this change?

Yes, family 0x19 models 0x70 ... 0x7f can too, thanks!

>
> Thanks,
> Yazen

Yazen Ghannam May 11, 2023, 1:02 p.m. UTC | #3
On 5/10/23 7:42 PM, Limonciello, Mario wrote:
> [AMD Official Use Only - General]
> 
>> -----Original Message-----
>> From: Ghannam, Yazen <Yazen.Ghannam@amd.com>
>> Sent: Tuesday, May 9, 2023 9:53 AM
>> To: Hristo Venev <hristo@venev.name>; Limonciello, Mario
>> <Mario.Limonciello@amd.com>
>> Cc: Ghannam, Yazen <Yazen.Ghannam@amd.com>; Borislav Petkov
>> <bp@alien8.de>; linux-edac@vger.kernel.org
>> Subject: Re: [PATCH] EDAC/amd64: Add support for ECC on family 19h model
>> 60h-6Fh
>>
>> On 4/25/23 4:12 PM, Hristo Venev wrote:
>>> Ryzen 9 7950X uses model 61h. Treat it as Epyc 9004, but with 2 channels
>>> instead of 12.
>>>
>>> I tested this with two 32GB dual-rank DIMMs. The sizes appear to be
>>> reported correctly:
>>>
>>>     [    2.122750] EDAC MC0: Giving out device to module amd64_edac
>> controller F19h_M60h: DEV 0000:00:18.3 (INTERRUPT)
>>>     [    2.122751] EDAC amd64: F19h_M60h detected (node 0).
>>>     [    2.122754] EDAC MC: UMC0 chip selects:
>>>     [    2.122754] EDAC amd64: MC: 0:     0MB 1:     0MB
>>>     [    2.122755] EDAC amd64: MC: 2: 16384MB 3: 16384MB
>>>     [    2.122757] EDAC MC: UMC1 chip selects:
>>>     [    2.122757] EDAC amd64: MC: 0:     0MB 1:     0MB
>>>     [    2.122758] EDAC amd64: MC: 2: 16384MB 3: 16384MB
>>>     [    2.122759] AMD64 EDAC driver v3.5.0
>>>
>>> ECC errors can also be detected:
>>>
>>>     [  313.747594] mce: [Hardware Error]: Machine check events logged
>>>     [  313.747597] [Hardware Error]: Corrected error, no action required.
>>>     [  313.747613] [Hardware Error]: CPU:0 (19:61:2)
>> MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]:
>> 0xdc2040000400011b
>>>     [  313.747632] [Hardware Error]: Error Addr: 0x00000007ff7e93c0
>>>     [  313.747639] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome:
>> 0x000100010a801203
>>>     [  313.747652] [Hardware Error]: Unified Memory Controller Ext. Error
>> Code: 0, DRAM ECC error.
>>>     [  313.747669] EDAC MC0: 1 CE Cannot decode normalized address on
>> mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64
>> syndrome:0x1)
>>>     [  313.747672] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx:
>> RD
>>>
>>> Signed-off-by: Hristo Venev <hristo@venev.name>
>>
>> Hi Hristo,
>>
>> Thank you for the patch. It looks good to me.
>>
>> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
>>
>>> ---
>>>  drivers/edac/amd64_edac.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
>>> index b55129425c81..1080784e2784 100644
>>> --- a/drivers/edac/amd64_edac.c
>>> +++ b/drivers/edac/amd64_edac.c
>>> @@ -3816,6 +3816,10 @@ static int per_family_init(struct amd64_pvt *pvt)
>>>             case 0x50 ... 0x5f:
>>>                     pvt->ctl_name                   = "F19h_M50h";
>>>                     break;
>>> +           case 0x60 ... 0x6f:
>>> +                   pvt->ctl_name                   = "F19h_M60h";
>>> +                   pvt->flags.zn_regs_v2           = 1;
>>> +                   break;
>>
>> Mario,
>>
>> Are there other Client models that can leverage this change?
> 
> Yes, family 0x19 models 0x70 ... 0x7f can too, thanks!
>

Thanks Mario.

Hristo,
Can you please also add those models?

Thanks,
Yazen

Hristo Venev May 11, 2023, 5:45 p.m. UTC | #4
I'll send the updated patch.

One thing I noticed is that in the ECC error I observed the address was
not decoded successfully. As I don't really have good test
infrastructure (getting the error involved tuning voltages over several
reboots), do you think you could look into it?
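
For context on which parts of the error were located: the EDAC line in the log still names csrow 3 and channel 0 even though the address was not decoded. The sketch below shows how those two values could be read from the logged IPID and Syndrome registers. It assumes the field layout the amd64_edac driver appears to have used for UMC errors around the time of this thread (channel from bits 31:20 of the IPID, chip select from the low three syndrome bits). Treat that layout as an assumption. The helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Assumed layout: the low 32 bits of IPID hold the instance ID, and
 * bits 31:20 of that identify the memory channel.
 */
static unsigned int umc_channel_from_ipid(uint64_t ipid)
{
	return (unsigned int)((ipid & 0xffffffffULL) >> 20);
}

/* Assumed layout: the low three syndrome bits hold the chip select. */
static unsigned int umc_csrow_from_synd(uint64_t synd)
{
	return (unsigned int)(synd & 0x7);
}
```

Applied to the logged values, IPID 0x0000009600050f00 gives channel 0 and Syndrome 0x000100010a801203 gives chip select 3, matching "csrow:3 channel:0" in the EDAC message. The page and offset fields stay at zero because they depend on the address translation that failed.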

Borislav Petkov May 15, 2023, 2:27 p.m. UTC | #5
On Thu, May 11, 2023 at 08:45:06PM +0300, Hristo Venev wrote:
> do you think you could look into it?

Yeah, that's being worked on but it'll take a while longer.

Thx.

Patch

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index b55129425c81..1080784e2784 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3816,6 +3816,10 @@  static int per_family_init(struct amd64_pvt *pvt)
 		case 0x50 ... 0x5f:
 			pvt->ctl_name			= "F19h_M50h";
 			break;
+		case 0x60 ... 0x6f:
+			pvt->ctl_name			= "F19h_M60h";
+			pvt->flags.zn_regs_v2		= 1;
+			break;
 		case 0xa0 ... 0xaf:
 			pvt->ctl_name			= "F19h_MA0h";
 			pvt->max_mcs			= 12;