[v2] EDAC/amd64: Add support for ECC on family 19h model 60h-7Fh

Message ID 20230511174506.875153-2-hristo@venev.name (mailing list archive)
State New, archived

Commit Message

Hristo Venev May 11, 2023, 5:45 p.m. UTC
Ryzen 9 7950X uses model 61h. Treat it as EPYC 9004, but with 2 channels
instead of 12.

I tested this with two 32GB dual-rank DIMMs. The sizes appear to be
reported correctly:

    [    2.122750] EDAC MC0: Giving out device to module amd64_edac controller F19h_M60h: DEV 0000:00:18.3 (INTERRUPT)
    [    2.122751] EDAC amd64: F19h_M60h detected (node 0).
    [    2.122754] EDAC MC: UMC0 chip selects:
    [    2.122754] EDAC amd64: MC: 0:     0MB 1:     0MB
    [    2.122755] EDAC amd64: MC: 2: 16384MB 3: 16384MB
    [    2.122757] EDAC MC: UMC1 chip selects:
    [    2.122757] EDAC amd64: MC: 0:     0MB 1:     0MB
    [    2.122758] EDAC amd64: MC: 2: 16384MB 3: 16384MB
    [    2.122759] AMD64 EDAC driver v3.5.0

ECC errors can also be detected:

    [  313.747594] mce: [Hardware Error]: Machine check events logged
    [  313.747597] [Hardware Error]: Corrected error, no action required.
    [  313.747613] [Hardware Error]: CPU:0 (19:61:2) MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000400011b
    [  313.747632] [Hardware Error]: Error Addr: 0x00000007ff7e93c0
    [  313.747639] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000100010a801203
    [  313.747652] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
    [  313.747669] EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x1)
    [  313.747672] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

According to Mario Limonciello, the same code should also work for
models 70h-7Fh [1].

Link: https://lore.kernel.org/linux-edac/d619252e-35c7-814b-acdb-74714619d62a@amd.com/T/#m9fc20d5dc36074048ec5f1c0a5b01b7f972a1cc7 [1]
Signed-off-by: Hristo Venev <hristo@venev.name>
---
 drivers/edac/amd64_edac.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Mario Limonciello May 11, 2023, 5:58 p.m. UTC | #1
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>

Borislav Petkov May 15, 2023, 2:39 p.m. UTC | #2
Applied, thanks.

Patch

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index b55129425c81..c00f7e4ef366 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3816,6 +3816,14 @@  static int per_family_init(struct amd64_pvt *pvt)
 		case 0x50 ... 0x5f:
 			pvt->ctl_name			= "F19h_M50h";
 			break;
+		case 0x60 ... 0x6f:
+			pvt->ctl_name			= "F19h_M60h";
+			pvt->flags.zn_regs_v2		= 1;
+			break;
+		case 0x70 ... 0x7f:
+			pvt->ctl_name			= "F19h_M70h";
+			pvt->flags.zn_regs_v2		= 1;
+			break;
 		case 0xa0 ... 0xaf:
 			pvt->ctl_name			= "F19h_MA0h";
 			pvt->max_mcs			= 12;