Message ID | 20240214200709.777166-1-Benjamin.Cheatham@amd.com |
---|---|
Headers | show |
Series | cxl, EINJ: Update EINJ for CXL error types | expand |
On Wed, Feb 14, 2024 at 02:07:06PM -0600, Ben Cheatham wrote: > v12 Changes: > - Rebase onto v6.8-rc4 > - Squash Kconfig patch into patch 2/3 (Jonathan) > - Change CONFIG_CXL_EINJ from "depends on ACPI_APEI_EINJ >= CXL_BUS" > to "depends on ACPI_APEI_EINJ = CXL_BUS" > - Drop "ACPI, APEI" part of commit message title and use just EINJ > instead (Dan) > - Add protocol error types to "einj_types" documentation (Jonathan) > - Change 0xffff... constants to use GENMASK() > - Drop param* variables and use constants instead in cxl error > inject functions (Jonathan) > - Add is_cxl_error_type() helper function in einj.c (Jonathan) > - Remove a stray function declaration in einj-cxl.h (Jonathan) > - Comment #else/#endifs with corresponding #if/#ifdef in > einj-cxl.h (Jonathan) > > v11 Changes: > - Drop patch 2/6 (Add CXL protocol error defines) and put the > defines in patch 4/6 instead (Dan) > - Add Dan's reviewed-by > > The new CXL error types will use the Memory Address field in the > SET_ERROR_TYPE_WITH_ADDRESS structure in order to target a CXL 1.1 > compliant memory-mapped downstream port. The value of the memory address > will be in the port's MMIO range, and it will not represent physical > (normal or persistent) memory. > > Add the functionality for injecting CXL 1.1 errors to the EINJ module, > but not through the EINJ legacy interface under /sys/kernel/debug/apei/einj. > Instead, make the error types available under /sys/kernel/debug/cxl. > This allows for validating the MMIO address for a CXL 1.1 error type > while also not making the user responsible for finding it. I tried this series on an Intel Icelake (which as far as I know doesn't support CXL). Couple of oddities: 1) I built as a module (CONFIG_ACPI_APEI_EINJ=m) like I normally do. But this was autoloaded and EINJ initialized during boot: [ 33.909111] EINJ: Error INJection is initialized. I'm wondering if that might be a problem for anyone that likes to leave the einj module not loaded until they have some need to inject errors. 2) Even though my system doesn't have any CXL support, I found this: # cat /sys/kernel/debug/cxl/einj_types 0x00001000 CXL.cache Protocol Correctable What does this mean? Using ras-tools I injected some DDR memory errors. So legacy functionality still works OK. -Tony
Tony Luck wrote: > On Wed, Feb 14, 2024 at 02:07:06PM -0600, Ben Cheatham wrote: > > v12 Changes: > > - Rebase onto v6.8-rc4 > > - Squash Kconfig patch into patch 2/3 (Jonathan) > > - Change CONFIG_CXL_EINJ from "depends on ACPI_APEI_EINJ >= CXL_BUS" > > to "depends on ACPI_APEI_EINJ = CXL_BUS" > > - Drop "ACPI, APEI" part of commit message title and use just EINJ > > instead (Dan) > > - Add protocol error types to "einj_types" documentation (Jonathan) > > - Change 0xffff... constants to use GENMASK() > > - Drop param* variables and use constants instead in cxl error > > inject functions (Jonathan) > > - Add is_cxl_error_type() helper function in einj.c (Jonathan) > > - Remove a stray function declaration in einj-cxl.h (Jonathan) > > - Comment #else/#endifs with corresponding #if/#ifdef in > > einj-cxl.h (Jonathan) > > > > v11 Changes: > > - Drop patch 2/6 (Add CXL protocol error defines) and put the > > defines in patch 4/6 instead (Dan) > > - Add Dan's reviewed-by > > > > The new CXL error types will use the Memory Address field in the > > SET_ERROR_TYPE_WITH_ADDRESS structure in order to target a CXL 1.1 > > compliant memory-mapped downstream port. The value of the memory address > > will be in the port's MMIO range, and it will not represent physical > > (normal or persistent) memory. > > > > Add the functionality for injecting CXL 1.1 errors to the EINJ module, > > but not through the EINJ legacy interface under /sys/kernel/debug/apei/einj. > > Instead, make the error types available under /sys/kernel/debug/cxl. > > This allows for validating the MMIO address for a CXL 1.1 error type > > while also not making the user responsible for finding it. > > I tried this series on an Intel Icelake (which as far as I know > doesn't support CXL). Thanks Tony! > Couple of oddities: > > 1) I built as a module (CONFIG_ACPI_APEI_EINJ=m) like I normally do. > But this was autoloaded and EINJ initialized during boot: > > [ 33.909111] EINJ: Error INJection is initialized. In the current code it should only load if cxl_core.ko is also loaded. Can you share the output of lsmod to maybe see which module loaded that dependency? > I'm wondering if that might be a problem for anyone that likes to > leave the einj module not loaded until they have some need to > inject errors. That is a behavior change of this approach. Is it a problem? If it is I would say that we need to break out a new cxl_einj.ko module that when it loads walks the CXL topology and creates the debugfs files. Otherwise my assumption is that CONFIG_CXL_EINJ=y means that cxl_core.ko loads einj.ko unconditionally. I would save that work for a clear description of why einj.ko should not be resident. > 2) Even though my system doesn't have any CXL support, I found this: > > # cat /sys/kernel/debug/cxl/einj_types > 0x00001000 CXL.cache Protocol Correctable > > What does this mean? Strange, does: /sys/kernel/debug/einj/available_error_type ...show the same even before these patches? I.e. maybe this pre-CXL BIOS was using the 0x1000 encoding when it should not? > Using ras-tools I injected some DDR memory errors. So legacy > functionality still works OK. > > -Tony
On 2/14/24 8:53 PM, Dan Williams wrote: > Tony Luck wrote: >> On Wed, Feb 14, 2024 at 02:07:06PM -0600, Ben Cheatham wrote: >>> v12 Changes: >>> - Rebase onto v6.8-rc4 >>> - Squash Kconfig patch into patch 2/3 (Jonathan) >>> - Change CONFIG_CXL_EINJ from "depends on ACPI_APEI_EINJ >= CXL_BUS" >>> to "depends on ACPI_APEI_EINJ = CXL_BUS" >>> - Drop "ACPI, APEI" part of commit message title and use just EINJ >>> instead (Dan) >>> - Add protocol error types to "einj_types" documentation (Jonathan) >>> - Change 0xffff... constants to use GENMASK() >>> - Drop param* variables and use constants instead in cxl error >>> inject functions (Jonathan) >>> - Add is_cxl_error_type() helper function in einj.c (Jonathan) >>> - Remove a stray function declaration in einj-cxl.h (Jonathan) >>> - Comment #else/#endifs with corresponding #if/#ifdef in >>> einj-cxl.h (Jonathan) >>> >>> v11 Changes: >>> - Drop patch 2/6 (Add CXL protocol error defines) and put the >>> defines in patch 4/6 instead (Dan) >>> - Add Dan's reviewed-by >>> >>> The new CXL error types will use the Memory Address field in the >>> SET_ERROR_TYPE_WITH_ADDRESS structure in order to target a CXL 1.1 >>> compliant memory-mapped downstream port. The value of the memory address >>> will be in the port's MMIO range, and it will not represent physical >>> (normal or persistent) memory. >>> >>> Add the functionality for injecting CXL 1.1 errors to the EINJ module, >>> but not through the EINJ legacy interface under /sys/kernel/debug/apei/einj. >>> Instead, make the error types available under /sys/kernel/debug/cxl. >>> This allows for validating the MMIO address for a CXL 1.1 error type >>> while also not making the user responsible for finding it. >> >> I tried this series on an Intel Icelake (which as far as I know >> doesn't support CXL). > > Thanks Tony! > >> Couple of oddities: >> >> 1) I built as a module (CONFIG_ACPI_APEI_EINJ=m) like I normally do. >> But this was autoloaded and EINJ initialized during boot: >> >> [ 33.909111] EINJ: Error INJection is initialized. > > In the current code it should only load if cxl_core.ko is also loaded. > > Can you share the output of lsmod to maybe see which module loaded that > dependency? > >> I'm wondering if that might be a problem for anyone that likes to >> leave the einj module not loaded until they have some need to >> inject errors. > > That is a behavior change of this approach. Is it a problem? > > If it is I would say that we need to break out a new cxl_einj.ko module > that when it loads walks the CXL topology and creates the debugfs files. > Otherwise my assumption is that CONFIG_CXL_EINJ=y means that cxl_core.ko > loads einj.ko unconditionally. > > I would save that work for a clear description of why einj.ko should not > be resident. > >> 2) Even though my system doesn't have any CXL support, I found this: >> >> # cat /sys/kernel/debug/cxl/einj_types >> 0x00001000 CXL.cache Protocol Correctable >> >> What does this mean? > > Strange, does: > > /sys/kernel/debug/einj/available_error_type > > ...show the same even before these patches? I.e. maybe this pre-CXL BIOS was > using the 0x1000 encoding when it should not? > Dan's already alluded to this, but to elaborate: This series doesn't do anything different than before when getting available error types, it just puts the CXL types into the "einj_types" file instead. So it's possible that your platform doesn't have CXL support, but it is reporting a CXL error type for EINJ. This could be a BIOS error, or it could be that the BIOS is pre ACPIv6.5 and was using 0x1000 for a different error type and the kernel is interpreting it as a CXL error type. If you think something else is happening, I'd love to hear about it! Thanks, Ben >> Using ras-tools I injected some DDR memory errors. So legacy >> functionality still works OK. >> >> -Tony > > >
> > Couple of oddities: > > > > 1) I built as a module (CONFIG_ACPI_APEI_EINJ=m) like I normally do. > > But this was autoloaded and EINJ initialized during boot: > > > > [ 33.909111] EINJ: Error INJection is initialized. > > In the current code it should only load if cxl_core.ko is also loaded. > > Can you share the output of lsmod to maybe see which module loaded that > dependency? > > > I'm wondering if that might be a problem for anyone that likes to > > leave the einj module not loaded until they have some need to > > inject errors. > > That is a behavior change of this approach. Is it a problem? > > If it is I would say that we need to break out a new cxl_einj.ko module > that when it loads walks the CXL topology and creates the debugfs files. > Otherwise my assumption is that CONFIG_CXL_EINJ=y means that cxl_core.ko > loads einj.ko unconditionally. > > I would save that work for a clear description of why einj.ko should not > be resident. Personally, it would save me having to type "modprobe einj" to run tests (and answer e-mails from validation folks telling they missed this step). But others might feels this is unwanted. It looks like some distros build kernels with CONFIG_ACPI_APEI_EINJ=m. On the other hand, EINJ should be under control of a BIOS option that defaults to "off". So production systems won't enable it. But perhaps there will be a pr_warn() or pr_err() during boot. One of these will likely trip: pr_warn("EINJ table not found.\n"); pr_err("Failed to get EINJ table: %s\n", acpi_format_exception(status)); pr_warn(FW_BUG "Invalid EINJ table.\n"); pr_err("Error collecting EINJ resources.\n"); > > > 2) Even though my system doesn't have any CXL support, I found this: > > > > # cat /sys/kernel/debug/cxl/einj_types > > 0x00001000 CXL.cache Protocol Correctable > > > > What does this mean? > > Strange, does: > > /sys/kernel/debug/einj/available_error_type > > ...show the same even before these patches? I.e. maybe this pre-CXL BIOS was > using the 0x1000 encoding when it should not? I added a printk() to show the raw value returned by my BIOS: 0x80001038 So your guess is correct. By BIOS is setting 0x1000 bit when it shouldn't. > > > Using ras-tools I injected some DDR memory errors. So legacy > > functionality still works OK. -Tony
> But perhaps there will be a pr_warn() or pr_err() during boot. One of these will likely trip: > pr_warn("EINJ table not found.\n"); > pr_err("Failed to get EINJ table: %s\n", acpi_format_exception(status)); > pr_warn(FW_BUG "Invalid EINJ table.\n"); > pr_err("Error collecting EINJ resources.\n"); Just tried on my system. The winner (for me) is: [ 27.989081] EINJ: EINJ table not found. If you decide that it is OK to auto-load, I think that needs severity downgraded to pr_info(). Users ask questions when they see warnings. -Tony
Luck, Tony wrote: > > I would save that work for a clear description of why einj.ko should not > > be resident. > > Personally, it would save me having to type "modprobe einj" to run tests (and > answer e-mails from validation folks telling they missed this step). It would only autoload with cxl_core.ko though. > > But others might feels this is unwanted. It looks like some distros build kernels > with CONFIG_ACPI_APEI_EINJ=m. > > On the other hand, EINJ should be under control of a BIOS option that > defaults to "off". So production systems won't enable it. > > But perhaps there will be a pr_warn() or pr_err() during boot. One of these will likely trip: > > pr_warn("EINJ table not found.\n"); > pr_err("Failed to get EINJ table: %s\n", acpi_format_exception(status)); > pr_warn(FW_BUG "Invalid EINJ table.\n"); > pr_err("Error collecting EINJ resources.\n"); Oh, good point. However, should a debug module really be throwing pr_err() or pr_warn()? I.e. should those all move to pr_info() or pr_debug() since the error case is detected by the lack of debugfs files being published. At least that would be my preference over creating cxl_einj.ko.
Luck, Tony wrote: > > But perhaps there will be a pr_warn() or pr_err() during boot. One of these will likely trip: > > > pr_warn("EINJ table not found.\n"); > > pr_err("Failed to get EINJ table: %s\n", acpi_format_exception(status)); > > pr_warn(FW_BUG "Invalid EINJ table.\n"); > > pr_err("Error collecting EINJ resources.\n"); > > Just tried on my system. The winner (for me) is: > > [ 27.989081] EINJ: EINJ table not found. > > If you decide that it is OK to auto-load, I think that needs severity downgraded to pr_info(). > > Users ask questions when they see warnings. Sounds good, I missed this before sending my last reply.