[1/3] platform/x86/intel/ifs: Classify error scenarios correctly

Message ID 20240412172349.544064-2-jithu.joseph@intel.com (mailing list archive)
State Accepted, archived
Series Miscellaneous In Field Scan changes

Commit Message

Joseph, Jithu April 12, 2024, 5:23 p.m. UTC
Based on inputs from hardware architects, only "scan signature failures"
should be treated as actual hardware/cpu failure.

Current driver, in addition, classifies "scan controller error" scenario
too as a hardware/cpu failure. Modify the driver to classify this situation
with a more appropriate "untested" status instead of "fail" status.

Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/platform/x86/intel/ifs/runtest.c | 27 +++++++++++++-----------
 1 file changed, 15 insertions(+), 12 deletions(-)
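
The reclassification described in the commit message can be sketched as a
standalone C fragment. This is an illustrative model, not driver code:
struct demo_status, enum demo_result, classify_old() and classify_new() are
hypothetical names, the struct stands in for the driver's union ifs_status,
and the final "pass" branch is an assumption because the last hunk ends
before the body of the closing else.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the driver's union ifs_status. Only the
 * three fields inspected by this patch are modeled.
 */
struct demo_status {
	bool control_error;      /* scan controller error before the test ran */
	bool signature_error;    /* scan output did not match the signature */
	unsigned int error_code; /* non-zero: scan did not start or complete */
};

/* Hypothetical result codes mirroring "pass", "untested" and "fail". */
enum demo_result { DEMO_PASS, DEMO_UNTESTED, DEMO_FAIL };

/* Before the patch: a controller error counts as a CPU failure. */
static enum demo_result classify_old(struct demo_status s)
{
	if (s.control_error || s.signature_error)
		return DEMO_FAIL;
	if (s.error_code)
		return DEMO_UNTESTED;
	return DEMO_PASS; /* assumed: the else body is not shown in the hunk */
}

/* After the patch: only a signature mismatch counts as a CPU failure. */
static enum demo_result classify_new(struct demo_status s)
{
	if (s.signature_error)
		return DEMO_FAIL;
	if (s.control_error || s.error_code)
		return DEMO_UNTESTED;
	return DEMO_PASS; /* assumed: the else body is not shown in the hunk */
}

/* Small self-check showing where the two versions differ. */
static void demo_self_check(void)
{
	struct demo_status ctrl = { .control_error = true };

	assert(classify_old(ctrl) == DEMO_FAIL);
	assert(classify_new(ctrl) == DEMO_UNTESTED);
}
```

With only control_error set, classify_old() returns DEMO_FAIL while
classify_new() returns DEMO_UNTESTED. A signature mismatch is reported as a
failure in both versions, even when control_error is also set, because
signature_error is checked first.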

Comments

Kuppuswamy Sathyanarayanan April 12, 2024, 6:32 p.m. UTC | #1
On 4/12/24 10:23 AM, Jithu Joseph wrote:
> Based on inputs from hardware architects, only "scan signature failures"
> should be treated as actual hardware/cpu failure.

Instead of just saying input from hardware architects, it would be better
if you mention the rationale behind it.

> Current driver, in addition, classifies "scan controller error" scenario
> too as a hardware/cpu failure. Modify the driver to classify this situation
> with a more appropriate "untested" status instead of "fail" status.
>
> Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Ashok Raj <ashok.raj@intel.com>

Code wise it looks good to me.

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

> ---
>  drivers/platform/x86/intel/ifs/runtest.c | 27 +++++++++++++-----------
>  1 file changed, 15 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/platform/x86/intel/ifs/runtest.c b/drivers/platform/x86/intel/ifs/runtest.c
> index 95b4b71fab53..282e4bfe30da 100644
> --- a/drivers/platform/x86/intel/ifs/runtest.c
> +++ b/drivers/platform/x86/intel/ifs/runtest.c
> @@ -69,6 +69,19 @@ static const char * const scan_test_status[] = {
>  
>  static void message_not_tested(struct device *dev, int cpu, union ifs_status status)
>  {
> +	struct ifs_data *ifsd = ifs_get_data(dev);
> +
> +	/*
> +	 * control_error is set when the microcode runs into a problem
> +	 * loading the image from the reserved BIOS memory, or it has
> +	 * been corrupted. Reloading the image may fix this issue.
> +	 */
> +	if (status.control_error) {
> +		dev_warn(dev, "CPU(s) %*pbl: Scan controller error. Batch: %02x version: 0x%x\n",
> +			 cpumask_pr_args(cpu_smt_mask(cpu)), ifsd->cur_batch, ifsd->loaded_version);
> +		return;
> +	}
> +
>  	if (status.error_code < ARRAY_SIZE(scan_test_status)) {
>  		dev_info(dev, "CPU(s) %*pbl: SCAN operation did not start. %s\n",
>  			 cpumask_pr_args(cpu_smt_mask(cpu)),
> @@ -90,16 +103,6 @@ static void message_fail(struct device *dev, int cpu, union ifs_status status)
>  {
>  	struct ifs_data *ifsd = ifs_get_data(dev);
>  
> -	/*
> -	 * control_error is set when the microcode runs into a problem
> -	 * loading the image from the reserved BIOS memory, or it has
> -	 * been corrupted. Reloading the image may fix this issue.
> -	 */
> -	if (status.control_error) {
> -		dev_err(dev, "CPU(s) %*pbl: could not execute from loaded scan image. Batch: %02x version: 0x%x\n",
> -			cpumask_pr_args(cpu_smt_mask(cpu)), ifsd->cur_batch, ifsd->loaded_version);
> -	}
> -
>  	/*
>  	 * signature_error is set when the output from the scan chains does not
>  	 * match the expected signature. This might be a transient problem (e.g.
> @@ -285,10 +288,10 @@ static void ifs_test_core(int cpu, struct device *dev)
>  	/* Update status for this core */
>  	ifsd->scan_details = status.data;
>  
> -	if (status.control_error || status.signature_error) {
> +	if (status.signature_error) {
>  		ifsd->status = SCAN_TEST_FAIL;
>  		message_fail(dev, cpu, status);
> -	} else if (status.error_code) {
> +	} else if (status.control_error || status.error_code) {
>  		ifsd->status = SCAN_NOT_TESTED;
>  		message_not_tested(dev, cpu, status);
>  	} else {
Joseph, Jithu April 12, 2024, 7:31 p.m. UTC | #2
Sathya,

Thanks for reviewing this

On 4/12/2024 11:32 AM, Kuppuswamy Sathyanarayanan wrote:
> 
> On 4/12/24 10:23 AM, Jithu Joseph wrote:
>> Based on inputs from hardware architects, only "scan signature failures"
>> should be treated as actual hardware/cpu failure.
> 
> Instead of just saying input from hardware architects, it would be better
> if you mention the rationale behind it.

I can reword the first para as below:

"Scan controller error" means that scan hardware encountered an error
prior to doing an actual test on the target CPU. It does not mean that
there is an actual cpu/core failure. "scan signature failure" indicates
that the test result on the target core did not match the expected value
and should be treated as a cpu failure.

Current driver classifies both these scenarios as failures. Modify ...

> 
>> Current driver, in addition, classifies "scan controller error" scenario
>> too as a hardware/cpu failure. Modify the driver to classify this situation
>> with a more appropriate "untested" status instead of "fail" status.
>>
>> Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
>> Reviewed-by: Tony Luck <tony.luck@intel.com>
>> Reviewed-by: Ashok Raj <ashok.raj@intel.com>
> 
> Code wise it looks good to me.
> 
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 


Jithu
Kuppuswamy Sathyanarayanan April 12, 2024, 8:46 p.m. UTC | #3
On 4/12/24 12:31 PM, Joseph, Jithu wrote:
> Sathya,
>
> Thanks for reviewing this
>
> On 4/12/2024 11:32 AM, Kuppuswamy Sathyanarayanan wrote:
>> On 4/12/24 10:23 AM, Jithu Joseph wrote:
>>> Based on inputs from hardware architects, only "scan signature failures"
>>> should be treated as actual hardware/cpu failure.
>> Instead of just saying input from hardware architects, it would be better
>> if you mention the rationale behind it.
> I can reword the first para as below:
>
> "Scan controller error" means that scan hardware encountered an error
> prior to doing an actual test on the target CPU. It does not mean that
> there is an actual cpu/core failure. "scan signature failure" indicates
> that the test result on the target core did not match the expected value
> and should be treated as a cpu failure.
>
> Current driver classifies both these scenarios as failures. Modify ...

Looks good to me.

>>> Current driver, in addition, classifies "scan controller error" scenario
>>> too as a hardware/cpu failure. Modify the driver to classify this situation
>>> with a more appropriate "untested" status instead of "fail" status.
>>>
>>> Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
>>> Reviewed-by: Tony Luck <tony.luck@intel.com>
>>> Reviewed-by: Ashok Raj <ashok.raj@intel.com>
>> Code wise it looks good to me.
>>
>> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
>
> Jithu
Hans de Goede April 15, 2024, 3:10 p.m. UTC | #4
Hi,

Thank you for this patch series.

On 4/12/24 9:31 PM, Joseph, Jithu wrote:
> Sathya,
> 
> Thanks for reviewing this
> 
> On 4/12/2024 11:32 AM, Kuppuswamy Sathyanarayanan wrote:
>>
>> On 4/12/24 10:23 AM, Jithu Joseph wrote:
>>> Based on inputs from hardware architects, only "scan signature failures"
>>> should be treated as actual hardware/cpu failure.
>>
>> Instead of just saying input from hardware architects, it would be better
>> if you mention the rationale behind it.
> 
> I can reword the first para as below:
> 
> "Scan controller error" means that scan hardware encountered an error
> prior to doing an actual test on the target CPU. It does not mean that
> there is an actual cpu/core failure. "scan signature failure" indicates
> that the test result on the target core did not match the expected value
> and should be treated as a cpu failure.
> 
> Current driver classifies both these scenarios as failures. Modify ...

I've modified the commit message using the rewording suggested above
while merging this patch, and I have applied the entire series to my
review-hans branch:
https://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86.git/log/?h=review-hans

Once I've run some tests on this branch the patches there will be
added to the platform-drivers-x86/for-next branch and eventually
will be included in the pdx86 pull-request to Linus for the next
merge-window.

Regards,

Hans

>>> Current driver, in addition, classifies "scan controller error" scenario
>>> too as a hardware/cpu failure. Modify the driver to classify this situation
>>> with a more appropriate "untested" status instead of "fail" status.
>>>
>>> Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
>>> Reviewed-by: Tony Luck <tony.luck@intel.com>
>>> Reviewed-by: Ashok Raj <ashok.raj@intel.com>
>>
>> Code wise it looks good to me.
>>
>> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
> 
> 
> Jithu
>

Patch

diff --git a/drivers/platform/x86/intel/ifs/runtest.c b/drivers/platform/x86/intel/ifs/runtest.c
index 95b4b71fab53..282e4bfe30da 100644
--- a/drivers/platform/x86/intel/ifs/runtest.c
+++ b/drivers/platform/x86/intel/ifs/runtest.c
@@ -69,6 +69,19 @@ static const char * const scan_test_status[] = {
 
 static void message_not_tested(struct device *dev, int cpu, union ifs_status status)
 {
+	struct ifs_data *ifsd = ifs_get_data(dev);
+
+	/*
+	 * control_error is set when the microcode runs into a problem
+	 * loading the image from the reserved BIOS memory, or it has
+	 * been corrupted. Reloading the image may fix this issue.
+	 */
+	if (status.control_error) {
+		dev_warn(dev, "CPU(s) %*pbl: Scan controller error. Batch: %02x version: 0x%x\n",
+			 cpumask_pr_args(cpu_smt_mask(cpu)), ifsd->cur_batch, ifsd->loaded_version);
+		return;
+	}
+
 	if (status.error_code < ARRAY_SIZE(scan_test_status)) {
 		dev_info(dev, "CPU(s) %*pbl: SCAN operation did not start. %s\n",
 			 cpumask_pr_args(cpu_smt_mask(cpu)),
@@ -90,16 +103,6 @@ static void message_fail(struct device *dev, int cpu, union ifs_status status)
 {
 	struct ifs_data *ifsd = ifs_get_data(dev);
 
-	/*
-	 * control_error is set when the microcode runs into a problem
-	 * loading the image from the reserved BIOS memory, or it has
-	 * been corrupted. Reloading the image may fix this issue.
-	 */
-	if (status.control_error) {
-		dev_err(dev, "CPU(s) %*pbl: could not execute from loaded scan image. Batch: %02x version: 0x%x\n",
-			cpumask_pr_args(cpu_smt_mask(cpu)), ifsd->cur_batch, ifsd->loaded_version);
-	}
-
 	/*
 	 * signature_error is set when the output from the scan chains does not
 	 * match the expected signature. This might be a transient problem (e.g.
@@ -285,10 +288,10 @@ static void ifs_test_core(int cpu, struct device *dev)
 	/* Update status for this core */
 	ifsd->scan_details = status.data;
 
-	if (status.control_error || status.signature_error) {
+	if (status.signature_error) {
 		ifsd->status = SCAN_TEST_FAIL;
 		message_fail(dev, cpu, status);
-	} else if (status.error_code) {
+	} else if (status.control_error || status.error_code) {
 		ifsd->status = SCAN_NOT_TESTED;
 		message_not_tested(dev, cpu, status);
 	} else {