mbox series

[v12,0/3] cxl, EINJ: Update EINJ for CXL error types

Message ID 20240214200709.777166-1-Benjamin.Cheatham@amd.com (mailing list archive)
Headers show
Series cxl, EINJ: Update EINJ for CXL error types | expand

Message

Ben Cheatham Feb. 14, 2024, 8:07 p.m. UTC
v12 Changes:
	- Rebase onto v6.8-rc4
	- Squash Kconfig patch into patch 2/3 (Jonathan)
	- Change CONFIG_CXL_EINJ from "depends on ACPI_APEI_EINJ >= CXL_BUS"
	  to "depends on ACPI_APEI_EINJ = CXL_BUS"
	- Drop "ACPI, APEI" part of commit message title and use just EINJ
	instead (Dan)
	- Add protocol error types to "einj_types" documentation (Jonathan)
	- Change 0xffff... constants to use GENMASK()
	- Drop param* variables and use constants instead in cxl error
	  inject functions (Jonathan)
	- Add is_cxl_error_type() helper function in einj.c (Jonathan)
	- Remove a stray function declaration in einj-cxl.h (Jonathan)
	- Comment #else/#endifs with corresponding #if/#ifdef in
	einj-cxl.h (Jonathan)

v11 Changes:
	- Drop patch 2/6 (Add CXL protocol error defines) and put the
	  defines in patch 4/6 instead (Dan)
	- Add Dan's reviewed-by

The new CXL error types will use the Memory Address field in the
SET_ERROR_TYPE_WITH_ADDRESS structure in order to target a CXL 1.1
compliant memory-mapped downstream port. The value of the memory address
will be in the port's MMIO range, and it will not represent physical
(normal or persistent) memory.

Add the functionality for injecting CXL 1.1 errors to the EINJ module,
but not through the EINJ legacy interface under /sys/kernel/debug/apei/einj.
Instead, make the error types available under /sys/kernel/debug/cxl.
This allows for validating the MMIO address for a CXL 1.1 error type
while also not making the user responsible for finding it.
Ben Cheatham (3):
  EINJ: Migrate to a platform driver
  cxl/core, EINJ: Add EINJ CXL debugfs files and EINJ helper functions
  EINJ, Documentation: Update EINJ kernel doc

 Documentation/ABI/testing/debugfs-cxl         |  30 +++
 .../firmware-guide/acpi/apei/einj.rst         |  19 ++
 MAINTAINERS                                   |   1 +
 drivers/acpi/apei/einj.c                      | 202 ++++++++++++++++--
 drivers/cxl/Kconfig                           |  12 ++
 drivers/cxl/core/port.c                       |  41 ++++
 include/linux/einj-cxl.h                      |  40 ++++
 7 files changed, 332 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/einj-cxl.h

Comments

Tony Luck Feb. 15, 2024, 1:11 a.m. UTC | #1
On Wed, Feb 14, 2024 at 02:07:06PM -0600, Ben Cheatham wrote:
> v12 Changes:
> 	- Rebase onto v6.8-rc4
> 	- Squash Kconfig patch into patch 2/3 (Jonathan)
> 	- Change CONFIG_CXL_EINJ from "depends on ACPI_APEI_EINJ >= CXL_BUS"
> 	  to "depends on ACPI_APEI_EINJ = CXL_BUS"
> 	- Drop "ACPI, APEI" part of commit message title and use just EINJ
> 	instead (Dan)
> 	- Add protocol error types to "einj_types" documentation (Jonathan)
> 	- Change 0xffff... constants to use GENMASK()
> 	- Drop param* variables and use constants instead in cxl error
> 	  inject functions (Jonathan)
> 	- Add is_cxl_error_type() helper function in einj.c (Jonathan)
> 	- Remove a stray function declaration in einj-cxl.h (Jonathan)
> 	- Comment #else/#endifs with corresponding #if/#ifdef in
> 	einj-cxl.h (Jonathan)
> 
> v11 Changes:
> 	- Drop patch 2/6 (Add CXL protocol error defines) and put the
> 	  defines in patch 4/6 instead (Dan)
> 	- Add Dan's reviewed-by
> 
> The new CXL error types will use the Memory Address field in the
> SET_ERROR_TYPE_WITH_ADDRESS structure in order to target a CXL 1.1
> compliant memory-mapped downstream port. The value of the memory address
> will be in the port's MMIO range, and it will not represent physical
> (normal or persistent) memory.
> 
> Add the functionality for injecting CXL 1.1 errors to the EINJ module,
> but not through the EINJ legacy interface under /sys/kernel/debug/apei/einj.
> Instead, make the error types available under /sys/kernel/debug/cxl.
> This allows for validating the MMIO address for a CXL 1.1 error type
> while also not making the user responsible for finding it.

I tried this series on an Intel Icelake (which as far as I know
doesn't support CXL).

Couple of oddities:

1) I built as a module (CONFIG_ACPI_APEI_EINJ=m) like I normally do.
   But this was autoloaded and EINJ initialized during boot:

[   33.909111] EINJ: Error INJection is initialized.

I'm wondering if that might be a problem for anyone that likes to
leave the einj module not loaded until they have some need to
inject errors.


2) Even though my system doesn't have any CXL support, I found this:

# cat /sys/kernel/debug/cxl/einj_types
0x00001000      CXL.cache Protocol Correctable

What does this mean?

Using ras-tools I injected some DDR memory errors. So legacy
functionality still works OK.

-Tony
Dan Williams Feb. 15, 2024, 2:53 a.m. UTC | #2
Tony Luck wrote:
> On Wed, Feb 14, 2024 at 02:07:06PM -0600, Ben Cheatham wrote:
> > v12 Changes:
> > 	- Rebase onto v6.8-rc4
> > 	- Squash Kconfig patch into patch 2/3 (Jonathan)
> > 	- Change CONFIG_CXL_EINJ from "depends on ACPI_APEI_EINJ >= CXL_BUS"
> > 	  to "depends on ACPI_APEI_EINJ = CXL_BUS"
> > 	- Drop "ACPI, APEI" part of commit message title and use just EINJ
> > 	instead (Dan)
> > 	- Add protocol error types to "einj_types" documentation (Jonathan)
> > 	- Change 0xffff... constants to use GENMASK()
> > 	- Drop param* variables and use constants instead in cxl error
> > 	  inject functions (Jonathan)
> > 	- Add is_cxl_error_type() helper function in einj.c (Jonathan)
> > 	- Remove a stray function declaration in einj-cxl.h (Jonathan)
> > 	- Comment #else/#endifs with corresponding #if/#ifdef in
> > 	einj-cxl.h (Jonathan)
> > 
> > v11 Changes:
> > 	- Drop patch 2/6 (Add CXL protocol error defines) and put the
> > 	  defines in patch 4/6 instead (Dan)
> > 	- Add Dan's reviewed-by
> > 
> > The new CXL error types will use the Memory Address field in the
> > SET_ERROR_TYPE_WITH_ADDRESS structure in order to target a CXL 1.1
> > compliant memory-mapped downstream port. The value of the memory address
> > will be in the port's MMIO range, and it will not represent physical
> > (normal or persistent) memory.
> > 
> > Add the functionality for injecting CXL 1.1 errors to the EINJ module,
> > but not through the EINJ legacy interface under /sys/kernel/debug/apei/einj.
> > Instead, make the error types available under /sys/kernel/debug/cxl.
> > This allows for validating the MMIO address for a CXL 1.1 error type
> > while also not making the user responsible for finding it.
> 
> I tried this series on an Intel Icelake (which as far as I know
> doesn't support CXL).

Thanks Tony!

> Couple of oddities:
> 
> 1) I built as a module (CONFIG_ACPI_APEI_EINJ=m) like I normally do.
>    But this was autoloaded and EINJ initialized during boot:
> 
> [   33.909111] EINJ: Error INJection is initialized.

In the current code it should only load if cxl_core.ko is also loaded.

Can you share the output of lsmod to maybe see which module loaded that
dependency?

> I'm wondering if that might be a problem for anyone that likes to
> leave the einj module not loaded until they have some need to
> inject errors.

That is a behavior change of this approach. Is it a problem?

If it is I would say that we need to break out a new cxl_einj.ko module
that when it loads walks the CXL topology and creates the debugfs files.
Otherwise my assumption is that CONFIG_CXL_EINJ=y means that cxl_core.ko
loads einj.ko unconditionally.

I would save that work for a clear description of why einj.ko should not
be resident.

> 2) Even though my system doesn't have any CXL support, I found this:
> 
> # cat /sys/kernel/debug/cxl/einj_types
> 0x00001000      CXL.cache Protocol Correctable
> 
> What does this mean?

Strange, does:

/sys/kernel/debug/einj/available_error_type

...show the same even before these patches? I.e. maybe this pre-CXL BIOS was
using the 0x1000 encoding when it should not?

> Using ras-tools I injected some DDR memory errors. So legacy
> functionality still works OK.
> 
> -Tony
Ben Cheatham Feb. 15, 2024, 3:01 p.m. UTC | #3
On 2/14/24 8:53 PM, Dan Williams wrote:
> Tony Luck wrote:
>> On Wed, Feb 14, 2024 at 02:07:06PM -0600, Ben Cheatham wrote:
>>> v12 Changes:
>>> 	- Rebase onto v6.8-rc4
>>> 	- Squash Kconfig patch into patch 2/3 (Jonathan)
>>> 	- Change CONFIG_CXL_EINJ from "depends on ACPI_APEI_EINJ >= CXL_BUS"
>>> 	  to "depends on ACPI_APEI_EINJ = CXL_BUS"
>>> 	- Drop "ACPI, APEI" part of commit message title and use just EINJ
>>> 	instead (Dan)
>>> 	- Add protocol error types to "einj_types" documentation (Jonathan)
>>> 	- Change 0xffff... constants to use GENMASK()
>>> 	- Drop param* variables and use constants instead in cxl error
>>> 	  inject functions (Jonathan)
>>> 	- Add is_cxl_error_type() helper function in einj.c (Jonathan)
>>> 	- Remove a stray function declaration in einj-cxl.h (Jonathan)
>>> 	- Comment #else/#endifs with corresponding #if/#ifdef in
>>> 	einj-cxl.h (Jonathan)
>>>
>>> v11 Changes:
>>> 	- Drop patch 2/6 (Add CXL protocol error defines) and put the
>>> 	  defines in patch 4/6 instead (Dan)
>>> 	- Add Dan's reviewed-by
>>>
>>> The new CXL error types will use the Memory Address field in the
>>> SET_ERROR_TYPE_WITH_ADDRESS structure in order to target a CXL 1.1
>>> compliant memory-mapped downstream port. The value of the memory address
>>> will be in the port's MMIO range, and it will not represent physical
>>> (normal or persistent) memory.
>>>
>>> Add the functionality for injecting CXL 1.1 errors to the EINJ module,
>>> but not through the EINJ legacy interface under /sys/kernel/debug/apei/einj.
>>> Instead, make the error types available under /sys/kernel/debug/cxl.
>>> This allows for validating the MMIO address for a CXL 1.1 error type
>>> while also not making the user responsible for finding it.
>>
>> I tried this series on an Intel Icelake (which as far as I know
>> doesn't support CXL).
> 
> Thanks Tony!
> 
>> Couple of oddities:
>>
>> 1) I built as a module (CONFIG_ACPI_APEI_EINJ=m) like I normally do.
>>    But this was autoloaded and EINJ initialized during boot:
>>
>> [   33.909111] EINJ: Error INJection is initialized.
> 
> In the current code it should only load if cxl_core.ko is also loaded.
> 
> Can you share the output of lsmod to maybe see which module loaded that
> dependency?
> 
>> I'm wondering if that might be a problem for anyone that likes to
>> leave the einj module not loaded until they have some need to
>> inject errors.
> 
> That is a behavior change of this approach. Is it a problem?
> 
> If it is I would say that we need to break out a new cxl_einj.ko module
> that when it loads walks the CXL topology and creates the debugfs files.
> Otherwise my assumption is that CONFIG_CXL_EINJ=y means that cxl_core.ko
> loads einj.ko unconditionally.
> 
> I would save that work for a clear description of why einj.ko should not
> be resident.
> 
>> 2) Even though my system doesn't have any CXL support, I found this:
>>
>> # cat /sys/kernel/debug/cxl/einj_types
>> 0x00001000      CXL.cache Protocol Correctable
>>
>> What does this mean?
> 
> Strange, does:
> 
> /sys/kernel/debug/einj/available_error_type
> 
> ...show the same even before these patches? I.e. maybe this pre-CXL BIOS was
> using the 0x1000 encoding when it should not?
> 

Dan's already alluded to this, but to elaborate: This series doesn't do anything
different than before when getting available error types, it just puts the CXL types
into the "einj_types" file instead. So it's possible that your platform doesn't
have CXL support, but it is reporting a CXL error type for EINJ. This could be
a BIOS error, or it could be that the BIOS is pre ACPIv6.5 and was using 0x1000
for a different error type and the kernel is interpreting it as a CXL error type.
If you think something else is happening, I'd love to hear about it!

Thanks,
Ben

>> Using ras-tools I injected some DDR memory errors. So legacy
>> functionality still works OK.
>>
>> -Tony
> 
> 
>
Tony Luck Feb. 15, 2024, 5:23 p.m. UTC | #4
> > Couple of oddities:
> >
> > 1) I built as a module (CONFIG_ACPI_APEI_EINJ=m) like I normally do.
> >    But this was autoloaded and EINJ initialized during boot:
> >
> > [   33.909111] EINJ: Error INJection is initialized.
>
> In the current code it should only load if cxl_core.ko is also loaded.
>
> Can you share the output of lsmod to maybe see which module loaded that
> dependency?
>
> > I'm wondering if that might be a problem for anyone that likes to
> > leave the einj module not loaded until they have some need to
> > inject errors.
>
> That is a behavior change of this approach. Is it a problem?
>
> If it is I would say that we need to break out a new cxl_einj.ko module
> that when it loads walks the CXL topology and creates the debugfs files.
> Otherwise my assumption is that CONFIG_CXL_EINJ=y means that cxl_core.ko
> loads einj.ko unconditionally.
>
> I would save that work for a clear description of why einj.ko should not
> be resident.

Personally, it would save me having to type "modprobe einj" to run tests (and
answer e-mails from validation folks telling they missed this step).

But others might feels this is unwanted. It looks like some distros build kernels
with CONFIG_ACPI_APEI_EINJ=m.

On the other hand, EINJ should be under control of a BIOS option that
defaults to "off". So production systems won't enable it.

But perhaps there will be a pr_warn() or pr_err() during boot. One of these will likely trip:

	pr_warn("EINJ table not found.\n");
	pr_err("Failed to get EINJ table: %s\n", acpi_format_exception(status));
	pr_warn(FW_BUG "Invalid EINJ table.\n");
	pr_err("Error collecting EINJ resources.\n");

>
> > 2) Even though my system doesn't have any CXL support, I found this:
> >
> > # cat /sys/kernel/debug/cxl/einj_types
> > 0x00001000      CXL.cache Protocol Correctable
> >
> > What does this mean?
>
> Strange, does:
>
> /sys/kernel/debug/einj/available_error_type
>
> ...show the same even before these patches? I.e. maybe this pre-CXL BIOS was
> using the 0x1000 encoding when it should not?

I added a printk() to show the raw value returned by my BIOS: 0x80001038

So your guess is correct. By BIOS is setting 0x1000 bit when it shouldn't.
>
> > Using ras-tools I injected some DDR memory errors. So legacy
> > functionality still works OK.

-Tony
Tony Luck Feb. 15, 2024, 6:15 p.m. UTC | #5
> But perhaps there will be a pr_warn() or pr_err() during boot. One of these will likely trip:

>	pr_warn("EINJ table not found.\n");
>	pr_err("Failed to get EINJ table: %s\n", acpi_format_exception(status));
>	pr_warn(FW_BUG "Invalid EINJ table.\n");
>	pr_err("Error collecting EINJ resources.\n");

Just tried on my system. The winner (for me) is:

[   27.989081] EINJ: EINJ table not found.

If you decide that it is OK to auto-load, I think that needs severity downgraded to pr_info().

Users ask questions when they see warnings.

-Tony
Dan Williams Feb. 16, 2024, 12:09 a.m. UTC | #6
Luck, Tony wrote:
> > I would save that work for a clear description of why einj.ko should not
> > be resident.
> 
> Personally, it would save me having to type "modprobe einj" to run tests (and
> answer e-mails from validation folks telling they missed this step).

It would only autoload with cxl_core.ko though.

> 
> But others might feels this is unwanted. It looks like some distros build kernels
> with CONFIG_ACPI_APEI_EINJ=m.
> 
> On the other hand, EINJ should be under control of a BIOS option that
> defaults to "off". So production systems won't enable it.
> 
> But perhaps there will be a pr_warn() or pr_err() during boot. One of these will likely trip:
> 
> 	pr_warn("EINJ table not found.\n");
> 	pr_err("Failed to get EINJ table: %s\n", acpi_format_exception(status));
> 	pr_warn(FW_BUG "Invalid EINJ table.\n");
> 	pr_err("Error collecting EINJ resources.\n");

Oh, good point. However, should a debug module really be throwing
pr_err() or pr_warn()? I.e. should those all move to pr_info() or
pr_debug() since the error case is detected by the lack of debugfs files
being published.

At least that would be my preference over creating cxl_einj.ko.
Dan Williams Feb. 16, 2024, 12:11 a.m. UTC | #7
Luck, Tony wrote:
> > But perhaps there will be a pr_warn() or pr_err() during boot. One of these will likely trip:
> 
> >	pr_warn("EINJ table not found.\n");
> >	pr_err("Failed to get EINJ table: %s\n", acpi_format_exception(status));
> >	pr_warn(FW_BUG "Invalid EINJ table.\n");
> >	pr_err("Error collecting EINJ resources.\n");
> 
> Just tried on my system. The winner (for me) is:
> 
> [   27.989081] EINJ: EINJ table not found.
> 
> If you decide that it is OK to auto-load, I think that needs severity downgraded to pr_info().
> 
> Users ask questions when they see warnings.

Sounds good, I missed this before sending my last reply.