ghes_edac: enable HIP08 platform edac driver

Message ID 20180516182958.GB17092@pd.tnic (mailing list archive)
State New, archived

Commit Message

Borislav Petkov May 16, 2018, 6:29 p.m. UTC
On Wed, May 16, 2018 at 02:38:38PM +0100, James Morse wrote:
> XGene has its own edac driver, but it doesn't probe when booted via ACPI so
> won't conflict with ghes_edac.

Actually it will. EDAC core can have only one EDAC driver loaded. Don't
ask me why - it has been that way since forever. We can change it some
day but frankly, I don't see reasoning for it. One driver can easily
manage *all* error sources on a system, I'd say.
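The "only one EDAC driver" rule can be pictured with a small userspace model. This is a sketch under assumptions, not the EDAC core's code: `owner`, `model_register()` and `model_unregister()` are hypothetical names, and the choice of -EPERM as the refusal code is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the core's idea of "the one loaded driver". */
static const char *owner;	/* NULL means no driver has registered yet */

/* Returns 0 on success, -EPERM if a different driver already owns the core. */
static int model_register(const char *mod_name)
{
	assert(mod_name != NULL);

	if (owner && strcmp(owner, mod_name) != 0)
		return -EPERM;

	owner = mod_name;
	return 0;
}

/* Releases ownership, but only when called by the current owner. */
static void model_unregister(const char *mod_name)
{
	if (owner && strcmp(owner, mod_name) == 0)
		owner = NULL;
}
```

In this model, whichever driver registers first wins, and a second driver is refused until the first one unregisters.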

> ... The thing has 4 dimm slots, but only two are populated. I swapped
> them round and the table was regenerated, so I don't think it's faking
> it.

Ok.

> So I think we're good to make the whitelist x86 only.
> Your diff-hunk makes 'idx=-1', so we always get the 'Unfortunately' warning. I'd
> like to suppress this unless force_load has been used.

Yeah, we should handle that differently for ARM. Toshi added the idx
thing in

  5deed6b6a479 ("EDAC, ghes: Add platform check")

to dump this when the platform is not whitelisted. So let's do that:

---

---

> What is the history behind the fake thing here? It predates 32fa1f53c2d
> "ghes_edac: do a better job of filling EDAC DIMM info", was it to support a
> valid system, or just to ease merging the driver when not all systems had the
> dmi table?

I wouldn't be surprised if there were some, TBH.

Looks to me like it used to fake DIMMs, see

-       /* FIXME: FAKE DATA */
-       dimm->nr_pages = 1000;
-       dimm->grain = 128;
-       dimm->mtype = MEM_UNKNOWN;
-       dimm->dtype = DEV_UNKNOWN;
-       dimm->edac_mode = EDAC_SECDED;

which 32fa1f53c2d removes.

$ git annotate drivers/edac/ghes_edac.c 32fa1f53c2d~1

shows you the driver before the DMI scanning so it looks like initially
it was faking stuff to satisfy EDAC core when it creates sysfs entries
using struct dimm_info descriptors.

> It looks like even the oldest Arm64 ACPI systems have dmi tables, so we can
> probably require DMI or the 'force' flag.

Well, with the hunk above it would still do ghes_edac_count_dimms() on
ARM and if it fails to find something, it will set fake, which is a good
sanity-check as it screams loudly. :)

Thx.

Comments

James Morse May 17, 2018, 6:02 p.m. UTC | #1
Hi guys,

Tyler, Zhengqiang, I assume all your shipped platforms with HEST->GHES entries
also have DMI tables.


On 16/05/18 19:29, Borislav Petkov wrote:
> On Wed, May 16, 2018 at 02:38:38PM +0100, James Morse wrote:
>> XGene has its own edac driver, but it doesn't probe when booted via ACPI so
>> won't conflict with ghes_edac.
> 
> Actually it will. EDAC core can have only one EDAC driver loaded. Don't
> ask me why - it has been that way since forever.

By won't probe I mean it only works on DT systems:

| static const struct of_device_id xgene_edac_of_match[] = {
|	{ .compatible = "apm,xgene-edac" },
|	{},
| };

|	.driver = {
|		.name = "xgene-edac",
|		.of_match_table = xgene_edac_of_match,
|	},

To work on a system with GHES it would need a 'struct acpi_device_id' to
describe the HID (?) and populate the driver's acpi_match_table.


> We can change it some
> day but frankly, I don't see reasoning for it. One driver can easily
> manage *all* error sources on a system, I'd say.

I agree, there is no reason to support two at the same time, if this happens
then there is probably something wrong with the platform (e.g. races with
firmware reading the same hardware registers), so we should make some noise.

Xgene's edac driver would be a good example of this, it looks like it reads data
from some mmio region, if something else is doing the same we're going to make a
mess.


>> So I think we're good to make the whitelist x86 only.
>> Your diff-hunk makes 'idx=-1', so we always get the 'Unfortunately' warning. I'd
>> like to suppress this unless force_load has been used.
> 
> Yeah, we should handle that differently for ARM. Toshi added the idx
> thing in
> 
>   5deed6b6a479 ("EDAC, ghes: Add platform check")
> 
> to dump this when the platform is not whitelisted. So let's do that:
> 
> ---
> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> index 863fbf3db29f..473aeec4b1da 100644
> --- a/drivers/edac/ghes_edac.c
> +++ b/drivers/edac/ghes_edac.c
> @@ -440,12 +440,16 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
>  	struct mem_ctl_info *mci;
>  	struct edac_mc_layer layers[1];
>  	struct ghes_edac_dimm_fill dimm_fill;
> -	int idx;
> +	int idx = -1;
>  
> -	/* Check if safe to enable on this system */
> -	idx = acpi_match_platform_list(plat_list);
> -	if (!force_load && idx < 0)
> -		return -ENODEV;

v4.17-rc5 has 'return 0' here. Wouldn't this change mean no ghes can be
registered unless ghes_edac is also supported by the platform?
Shouldn't this be '0' for a silent failure?


> +	if (IS_ENABLED(CONFIG_X86)) {
> +		/* Check if safe to enable on this system */
> +		idx = acpi_match_platform_list(plat_list);
> +		if (!force_load && idx < 0)
> +			return -ENODEV;
> +	} else {
> +		idx = 0;
> +	}
>  
>  	/*
>  	 * We have only one logical memory controller to which all DIMMs belong.

Tested on Seattle and some cranky homebrew-no-DMI firmware:
Tested-by: James Morse <james.morse@arm.com>

With the ENODEV/0 thing above:
Reviewed-by: James Morse <james.morse@arm.com>


>> It looks like even the oldest Arm64 ACPI systems have dmi tables, so we can
>> probably require DMI or the 'force' flag.
> 
> Well, with the hunk above it would still do ghes_edac_count_dimms() on
> ARM and if it fails to find something, it will set fake, which is a good
> sanity-check as it screams loudly. :)


Thanks,

James
Zhengqiang May 18, 2018, 7:13 a.m. UTC | #2
On 2018/5/18 2:02, James Morse wrote:
> Hi guys,
> 
> Tyler, Zhengqiang, I assume all your shipped platforms with HEST->GHES entries
> also have DMI tables.
> 

Sure, our ARM64 platforms have DMI tables. Thanks.

> 
> On 16/05/18 19:29, Borislav Petkov wrote:
>> On Wed, May 16, 2018 at 02:38:38PM +0100, James Morse wrote:
>>> XGene has its own edac driver, but it doesn't probe when booted via ACPI so
>>> won't conflict with ghes_edac.
>>
>> Actually it will. EDAC core can have only one EDAC driver loaded. Don't
>> ask me why - it has been that way since forever.
> 
> By won't probe I mean it only works on DT systems:
> 
> | static const struct of_device_id xgene_edac_of_match[] = {
> |	{ .compatible = "apm,xgene-edac" },
> |	{},
> | };
> 
> |	.driver = {
> |		.name = "xgene-edac",
> |		.of_match_table = xgene_edac_of_match,
> |	},
> 
> To work on a system with GHES it would need a 'struct acpi_device_id' to
> describe the HID (?) and populate the driver's acpi_match_table.
> 
> 
>> We can change it some
>> day but frankly, I don't see reasoning for it. One driver can easily
>> manage *all* error sources on a system, I'd say.
> 
> I agree, there is no reason to support two at the same time, if this happens
> then there is probably something wrong with the platform (e.g. races with
> firmware reading the same hardware registers), so we should make some noise.
> 
> Xgene's edac driver would be a good example of this, it looks like it reads data
> from some mmio region, if something else is doing the same we're going to make a
> mess.
> 
> 
>>> So I think we're good to make the whitelist x86 only.
>>> Your diff-hunk makes 'idx=-1', so we always get the 'Unfortunately' warning. I'd
>>> like to suppress this unless force_load has been used.
>>
>> Yeah, we should handle that differently for ARM. Toshi added the idx
>> thing in
>>
>>   5deed6b6a479 ("EDAC, ghes: Add platform check")
>>
>> to dump this when the platform is not whitelisted. So let's do that:
>>
>> ---
>> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
>> index 863fbf3db29f..473aeec4b1da 100644
>> --- a/drivers/edac/ghes_edac.c
>> +++ b/drivers/edac/ghes_edac.c
>> @@ -440,12 +440,16 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
>>  	struct mem_ctl_info *mci;
>>  	struct edac_mc_layer layers[1];
>>  	struct ghes_edac_dimm_fill dimm_fill;
>> -	int idx;
>> +	int idx = -1;
>>  
>> -	/* Check if safe to enable on this system */
>> -	idx = acpi_match_platform_list(plat_list);
>> -	if (!force_load && idx < 0)
>> -		return -ENODEV;
> 
> v4.17-rc5 has 'return 0' here. Wouldn't this change mean no ghes can be
> registered unless ghes_edac is also supported by the platform?
> Shouldn't this be '0' for a silent failure?
> 
> 
>> +	if (IS_ENABLED(CONFIG_X86)) {
>> +		/* Check if safe to enable on this system */
>> +		idx = acpi_match_platform_list(plat_list);
>> +		if (!force_load && idx < 0)
>> +			return -ENODEV;
>> +	} else {
>> +		idx = 0;
>> +	}
>>  
>>  	/*
>>  	 * We have only one logical memory controller to which all DIMMs belong.
> 
> Tested on Seattle and some cranky homebrew-no-DMI firmware:
> Tested-by: James Morse <james.morse@arm.com>
> 
> With the ENODEV/0 thing above:
> Reviewed-by: James Morse <james.morse@arm.com>
> 
> 
>>> It looks like even the oldest Arm64 ACPI systems have dmi tables, so we can
>>> probably require DMI or the 'force' flag.
>>
>> Well, with the hunk above it would still do ghes_edac_count_dimms() on
>> ARM and if it fails to find something, it will set fake, which is a good
>> sanity-check as it screams loudly. :)
> 
> 
> Thanks,
> 
> James
> 
> .
>
Borislav Petkov May 18, 2018, 11:11 a.m. UTC | #3
On Thu, May 17, 2018 at 07:02:18PM +0100, James Morse wrote:
> v4.17-rc5 has 'return 0' here. Wouldn't this change mean no ghes can be
> registered unless ghes_edac is also supported by the platform?
> Shouldn't this be '0' for a silent failure?

https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/commit/?h=for-next&id=cc7f3f132658289b6661ab8294ab08a9d32ea026

> Tested on Seattle and some cranky homebrew-no-DMI firmware:
> Tested-by: James Morse <james.morse@arm.com>
> 
> With the ENODEV/0 thing above:
> Reviewed-by: James Morse <james.morse@arm.com>

Thanks, adding.
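
The ENODEV/0 question from the review can be illustrated with a small userspace model. It is only a sketch: `model_edac_register()` and `model_ghes_probe()` are hypothetical names, and the model assumes, as the review question implies, that the GHES caller treats any negative return from the EDAC hook as fatal.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the EDAC hook: returns 0 when it loads, or
 * decline_rc when the platform check says "do not load here".
 */
static int model_edac_register(bool platform_ok, int decline_rc)
{
	assert(decline_rc <= 0);
	return platform_ok ? 0 : decline_rc;
}

/* Hypothetical model of a caller that propagates any negative return. */
static int model_ghes_probe(bool platform_ok, int decline_rc)
{
	int rc = model_edac_register(platform_ok, decline_rc);

	if (rc < 0)
		return rc;	/* the whole GHES source fails to register */

	/* ... the rest of the GHES setup would run here ... */
	return 0;
}
```

With `decline_rc` set to -ENODEV, a platform that is not whitelisted loses the GHES error source entirely. With 0, only the EDAC part is skipped and the rest of the probe continues, which is the silent failure James asks for.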
Patch

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 863fbf3db29f..473aeec4b1da 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -440,12 +440,16 @@  int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	struct mem_ctl_info *mci;
 	struct edac_mc_layer layers[1];
 	struct ghes_edac_dimm_fill dimm_fill;
-	int idx;
+	int idx = -1;
 
-	/* Check if safe to enable on this system */
-	idx = acpi_match_platform_list(plat_list);
-	if (!force_load && idx < 0)
-		return -ENODEV;
+	if (IS_ENABLED(CONFIG_X86)) {
+		/* Check if safe to enable on this system */
+		idx = acpi_match_platform_list(plat_list);
+		if (!force_load && idx < 0)
+			return -ENODEV;
+	} else {
+		idx = 0;
+	}
 
 	/*
 	 * We have only one logical memory controller to which all DIMMs belong.
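
The gating logic in the hunk above can be exercised outside the kernel with a small model. This is a sketch under assumptions: `model_gate()`, `model_warns()` and `MODEL_SKIP` are hypothetical names, `is_x86` stands in for `IS_ENABLED(CONFIG_X86)`, `match_idx` stands in for the result of `acpi_match_platform_list()`, and the warning condition `idx < 0` is inferred from the discussion of the 'Unfortunately' message.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_SKIP (-1000)	/* hypothetical sentinel: do not load ghes_edac */

/* Returns the resulting idx, or MODEL_SKIP when the function would bail out. */
static int model_gate(bool is_x86, bool force_load, int match_idx)
{
	int idx = -1;

	if (is_x86) {
		/* Only x86 consults the platform whitelist. */
		idx = match_idx;
		if (!force_load && idx < 0)
			return MODEL_SKIP;
	} else {
		/* Other architectures always proceed, without the warning. */
		idx = 0;
	}

	return idx;
}

/* Assumed warning condition: a negative idx triggers the "Unfortunately" text. */
static bool model_warns(int idx)
{
	assert(idx != MODEL_SKIP);	/* a skipped load never reaches the warning */
	return idx < 0;
}
```

In this model a non-whitelisted x86 machine loads the driver only with `force_load` set and then gets the warning, while a non-x86 machine always proceeds with `idx = 0` and stays quiet.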