[v2] EDAC/amd64: Fix possible module load failure on some UMC usage combinations

Message ID 20241210212054.3895697-1-avadhut.naik@amd.com (mailing list archive)
State New
Series [v2] EDAC/amd64: Fix possible module load failure on some UMC usage combinations

Commit Message

Avadhut Naik Dec. 10, 2024, 9:20 p.m. UTC
Starting with Zen4, AMD SOCs have 12 Unified Memory Controllers (UMCs) per
socket.

When the amd64_edac module is being loaded, these UMCs are traversed to
determine if they have SdpInit (SdpCtrl[31]) and EccEnabled (UmcCapHi[30])
bits set and create masks in umc_en_mask and ecc_en_mask respectively.

However, the current data type of these variables is u8. As a result, if
only the last 4 UMCs (UMC8 - UMC11) of the system have been utilized,
umc_ecc_enabled() will return false. Consequently, the module may fail to
load on these systems.

Fixes: e2be5955a886 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh")
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Cc: stable@vger.kernel.org
---
Changes in v2:
1. Change data type of variables from u16 to int. (Boris)
2. Modify commit message per feedback. (Boris)
3. Add Fixes: and CC:stable tags. (Boris)
---
 drivers/edac/amd64_edac.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


base-commit: f84722cbed6c2b2094ad8bbe48be2c5900752935
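
The truncation described in the commit message can be reproduced in user
space. The following is a minimal sketch, not the driver code: NUM_UMCS,
build_mask_u8() and build_mask_int() are hypothetical names, and BIT() is a
local stand-in for the kernel macro. It only shows that bits 8 to 11 cannot
survive in a u8-sized mask, while an int-sized mask keeps them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_UMCS 12             /* UMCs per socket, per the commit message */
#define BIT(n)   (1UL << (n))   /* user-space stand-in for the kernel macro */

/* Pre-patch width: bits 8..11 do not fit in 8 bits and are silently dropped. */
static unsigned int build_mask_u8(const bool enabled[], int n)
{
	uint8_t mask = 0;

	assert(n <= NUM_UMCS);
	for (int i = 0; i < n; i++)
		if (enabled[i])
			mask |= BIT(i);   /* result is truncated to 8 bits */

	return mask;
}

/* v2 width: an int has room for all 12 UMC bits. */
static unsigned int build_mask_int(const bool enabled[], int n)
{
	int mask = 0;

	assert(n <= NUM_UMCS);
	for (int i = 0; i < n; i++)
		if (enabled[i])
			mask |= BIT(i);

	return (unsigned int)mask;
}
```

With only UMC8 to UMC11 marked as enabled, build_mask_u8() yields 0, which
is the "no enabled UMCs" case, whereas build_mask_int() yields 0xF00.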

Comments

Borislav Petkov Dec. 11, 2024, 11:07 a.m. UTC | #1
On Tue, Dec 10, 2024 at 09:20:00PM +0000, Avadhut Naik wrote:
> Starting Zen4, AMD SOCs have 12 Unified Memory Controllers (UMCs) per
> socket.
> 
> When the amd64_edac module is being loaded, these UMCs are traversed to
> determine if they have SdpInit (SdpCtrl[31]) and EccEnabled (UmcCapHi[30])
> bits set and create masks in umc_en_mask and ecc_en_mask respectively.
> 
> However, the current data type of these variables is u8. As a result, if
> only the last 4 UMCs (UMC8 - UMC11) of the system have been utilized,
> umc_ecc_enabled() will return false. Consequently, the module may fail to
> load on these systems.
> 
> Fixes: e2be5955a886 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh")
> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> Cc: stable@vger.kernel.org
> ---
> Changes in v2:
> 1. Change data type of variables from u16 to int. (Boris)
> 2. Modify commit message per feedback. (Boris)
> 3. Add Fixes: and CC:stable tags. (Boris)
> ---
>  drivers/edac/amd64_edac.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index ddfbdb66b794..b1c034214a8d 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -3362,7 +3362,7 @@ static bool dct_ecc_enabled(struct amd64_pvt *pvt)
>  
>  static bool umc_ecc_enabled(struct amd64_pvt *pvt)
>  {
> -	u8 umc_en_mask = 0, ecc_en_mask = 0;
> +	int umc_en_mask = 0, ecc_en_mask = 0;
>  	u16 nid = pvt->mc_node_id;
>  	struct amd64_umc *umc;
>  	u8 ecc_en = 0, i;

Hmm, looking at that whole function, it looks kinda clumsy to me. If the point
is to check whether at least one UMC is enabled, why aren't we doing simply
that instead of those silly masks?

Yazen? Did you think about checking anything else here, in addition?

Because if not, this can be written as simple as:

static bool umc_ecc_enabled(struct amd64_pvt *pvt)
{
        u16 nid = pvt->mc_node_id;
        struct amd64_umc *umc;
        bool ecc_en = false; 
        int i;

        /* Check whether at least one UMC is enabled: */
        for_each_umc(i) {
                umc = &pvt->umc[i];
                
                if (umc->sdp_ctrl & UMC_SDP_INIT && 
                    umc->umc_cap_hi & UMC_ECC_ENABLED) {
                        ecc_en = true;
                        break; 
                }       
        }

        edac_dbg(3, "Node %d: DRAM ECC %s.\n", nid, (ecc_en ? "enabled" : "disabled"));
        
        return ecc_en;
}

Thx.
Yazen Ghannam Dec. 11, 2024, 3:46 p.m. UTC | #2
On Wed, Dec 11, 2024 at 12:07:29PM +0100, Borislav Petkov wrote:
> On Tue, Dec 10, 2024 at 09:20:00PM +0000, Avadhut Naik wrote:
> > Starting Zen4, AMD SOCs have 12 Unified Memory Controllers (UMCs) per
> > socket.
> > 
> > When the amd64_edac module is being loaded, these UMCs are traversed to
> > determine if they have SdpInit (SdpCtrl[31]) and EccEnabled (UmcCapHi[30])
> > bits set and create masks in umc_en_mask and ecc_en_mask respectively.
> > 
> > However, the current data type of these variables is u8. As a result, if
> > only the last 4 UMCs (UMC8 - UMC11) of the system have been utilized,
> > umc_ecc_enabled() will return false. Consequently, the module may fail to
> > load on these systems.
> > 
> > Fixes: e2be5955a886 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh")
> > Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> > Cc: stable@vger.kernel.org
> > ---
> > Changes in v2:
> > 1. Change data type of variables from u16 to int. (Boris)
> > 2. Modify commit message per feedback. (Boris)
> > 3. Add Fixes: and CC:stable tags. (Boris)
> > ---
> >  drivers/edac/amd64_edac.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> > index ddfbdb66b794..b1c034214a8d 100644
> > --- a/drivers/edac/amd64_edac.c
> > +++ b/drivers/edac/amd64_edac.c
> > @@ -3362,7 +3362,7 @@ static bool dct_ecc_enabled(struct amd64_pvt *pvt)
> >  
> >  static bool umc_ecc_enabled(struct amd64_pvt *pvt)
> >  {
> > -	u8 umc_en_mask = 0, ecc_en_mask = 0;
> > +	int umc_en_mask = 0, ecc_en_mask = 0;
> >  	u16 nid = pvt->mc_node_id;
> >  	struct amd64_umc *umc;
> >  	u8 ecc_en = 0, i;
> 
> Hmm, looking at that whole function, it looks kinda clumsy to me. If the point
> is to check whether at least one UMC is enabled, why aren't we doing simply
> that instead of those silly masks?
> 
> Yazen? Did you think about checking anything else here, in addition?
>

I think we used the masks because we would only read registers as
needed.

  196b79fcc8ed ("EDAC, amd64: Extend ecc_enabled() to Fam17h")

Now we cache all the registers at init time. So yeah, I agree that this
can be simplified.

> Because if not, this can be written as simple as:
> 
> static bool umc_ecc_enabled(struct amd64_pvt *pvt)
> {
>         u16 nid = pvt->mc_node_id;
>         struct amd64_umc *umc;
>         bool ecc_en = false; 
>         int i;
> 
>         /* Check whether at least one UMC is enabled: */
>         for_each_umc(i) {
>                 umc = &pvt->umc[i];
>                 
>                 if (umc->sdp_ctrl & UMC_SDP_INIT && 
>                     umc->umc_cap_hi & UMC_ECC_ENABLED) {
>                         ecc_en = true;
>                         break; 
>                 }       
>         }
> 
>         edac_dbg(3, "Node %d: DRAM ECC %s.\n", nid, (ecc_en ? "enabled" : "disabled"));
>         
>         return ecc_en;
> }
>

Looks good overall. We can even remove the "nid" variable and just use
"pvt->mc_node_id" directly in the debug message. This is another remnant
from when this function did register accesses.

Thanks,
Yazen
Borislav Petkov Dec. 11, 2024, 6:51 p.m. UTC | #3
On Wed, Dec 11, 2024 at 10:46:37AM -0500, Yazen Ghannam wrote:
> Looks good overall. We can even remove the "nid" variable and just use
> "pvt->mc_node_id" directly in the debug message. This is another remnant
> from when this function did register accesses.

Ok, done.

Avadhut, can you pls verify this fixes your issue too?

I'll run it on my boxes too, to make sure nothing breaks.

Thx.

---
From: "Borislav Petkov (AMD)" <bp@alien8.de>
Date: Wed, 11 Dec 2024 12:07:42 +0100
Subject: [PATCH] EDAC/amd64: Simplify ECC check on unified memory controllers

The intent of the check is to see whether at least one UMC has ECC
enabled. So do that instead of tracking which ones are enabled in masks
which are too small in size anyway and lead to not loading the driver on
Zen4 machines with UMCs enabled over UMC8.

Fixes: e2be5955a886 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh")
Reported-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20241210212054.3895697-1-avadhut.naik@amd.com
---
 drivers/edac/amd64_edac.c | 32 ++++++++++----------------------
 1 file changed, 10 insertions(+), 22 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index ddfbdb66b794..5d356b7c4589 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3362,36 +3362,24 @@ static bool dct_ecc_enabled(struct amd64_pvt *pvt)
 
 static bool umc_ecc_enabled(struct amd64_pvt *pvt)
 {
-	u8 umc_en_mask = 0, ecc_en_mask = 0;
-	u16 nid = pvt->mc_node_id;
 	struct amd64_umc *umc;
-	u8 ecc_en = 0, i;
+	bool ecc_en = false;
+	int i;
 
+	/* Check whether at least one UMC is enabled: */
 	for_each_umc(i) {
 		umc = &pvt->umc[i];
 
-		/* Only check enabled UMCs. */
-		if (!(umc->sdp_ctrl & UMC_SDP_INIT))
-			continue;
-
-		umc_en_mask |= BIT(i);
-
-		if (umc->umc_cap_hi & UMC_ECC_ENABLED)
-			ecc_en_mask |= BIT(i);
+		if (umc->sdp_ctrl & UMC_SDP_INIT &&
+		    umc->umc_cap_hi & UMC_ECC_ENABLED) {
+			ecc_en = true;
+			break;
+		}
 	}
 
-	/* Check whether at least one UMC is enabled: */
-	if (umc_en_mask)
-		ecc_en = umc_en_mask == ecc_en_mask;
-	else
-		edac_dbg(0, "Node %d: No enabled UMCs.\n", nid);
-
-	edac_dbg(3, "Node %d: DRAM ECC %s.\n", nid, (ecc_en ? "enabled" : "disabled"));
+	edac_dbg(3, "Node %d: DRAM ECC %s.\n", pvt->mc_node_id, (ecc_en ? "enabled" : "disabled"));
 
-	if (!ecc_en)
-		return false;
-	else
-		return true;
+	return ecc_en;
 }
 
 static inline void
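
Going by the diff above, the removed code set ecc_en only when
umc_en_mask == ecc_en_mask, i.e. when every initialized UMC reported ECC,
while the new loop returns true as soon as one initialized UMC does. The
sketch below models both checks in user space so the difference can be
tried out. It is an illustration under assumptions: struct umc_model,
ecc_enabled_all() and ecc_enabled_any() are hypothetical names, and the
two booleans stand in for the SdpInit and EccEnabled register bits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct amd64_umc: just the two decoded bits. */
struct umc_model {
	bool sdp_init;      /* models SdpCtrl[31] (SdpInit) */
	bool ecc_enabled;   /* models UmcCapHi[30] (EccEnabled) */
};

/* Mask-based check with wide-enough masks: all initialized UMCs need ECC. */
static bool ecc_enabled_all(const struct umc_model *umc, int n)
{
	unsigned int umc_en_mask = 0, ecc_en_mask = 0;

	assert(n <= 32);   /* one mask bit per UMC */
	for (int i = 0; i < n; i++) {
		if (!umc[i].sdp_init)
			continue;

		umc_en_mask |= 1U << i;
		if (umc[i].ecc_enabled)
			ecc_en_mask |= 1U << i;
	}

	return umc_en_mask && umc_en_mask == ecc_en_mask;
}

/* Simplified check: one initialized UMC with ECC is enough. */
static bool ecc_enabled_any(const struct umc_model *umc, int n)
{
	for (int i = 0; i < n; i++)
		if (umc[i].sdp_init && umc[i].ecc_enabled)
			return true;

	return false;
}
```

Both functions agree when only UMC8 to UMC11 are initialized with ECC, which
is the configuration from the original report. They differ only for mixed
configurations, where some initialized UMC lacks ECC.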
Naik, Avadhut Dec. 11, 2024, 7:18 p.m. UTC | #4
On 12/11/2024 12:51, Borislav Petkov wrote:
> On Wed, Dec 11, 2024 at 10:46:37AM -0500, Yazen Ghannam wrote:
>> Looks good overall. We can even remove the "nid" variable and just use
>> "pvt->mc_node_id" directly in the debug message. This is another remnant
>> from when this function did register accesses.
> 
> Ok, done.
> 
> Avadhut, can you pls verify this fixes your issue too?
> 
Yes, this fixes the issue of module not loading with some UMC
configurations.

If relevant, then for the below patch:

Tested-by: Avadhut Naik <avadhut.naik@amd.com>
Reviewed-by: Avadhut Naik <avadhut.naik@amd.com>

> I'll run it on my boxes too, to make sure nothing breaks.
> 
> Thx.
> 
Borislav Petkov Dec. 11, 2024, 8:49 p.m. UTC | #5
On Wed, Dec 11, 2024 at 01:18:39PM -0600, Naik, Avadhut wrote:
> Yes, this fixes the issue of module not loading with some UMC
> configurations.

Thanks!

> If relevant, then for the below patch:
> 
> Tested-by: Avadhut Naik <avadhut.naik@amd.com>
> Reviewed-by: Avadhut Naik <avadhut.naik@amd.com>

Added.
Patch

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index ddfbdb66b794..b1c034214a8d 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3362,7 +3362,7 @@  static bool dct_ecc_enabled(struct amd64_pvt *pvt)
 
 static bool umc_ecc_enabled(struct amd64_pvt *pvt)
 {
-	u8 umc_en_mask = 0, ecc_en_mask = 0;
+	int umc_en_mask = 0, ecc_en_mask = 0;
 	u16 nid = pvt->mc_node_id;
 	struct amd64_umc *umc;
 	u8 ecc_en = 0, i;