mbox series

[RFC,v7,00/12] memory: scrub: introduce subsystem + CXL/ACPI-RAS2 drivers

Message ID 20240223143723.1574-1-shiju.jose@huawei.com
Headers show
Series memory: scrub: introduce subsystem + CXL/ACPI-RAS2 drivers | expand

Message

Shiju Jose Feb. 23, 2024, 2:37 p.m. UTC
From: Shiju Jose <shiju.jose@huawei.com>

'Previously known as: cxl: Add support for CXL feature commands,
 CXL device patrol scrub control and DDR5 ECS control features'
https://lore.kernel.org/lkml/20240215111455.1462-1-shiju.jose@huawei.com/

Introduce generic memory scrub subsystem which allows user to
control underlying memory scrubbers in the system via the sysfs
scrub control interface.

Memory scrub is a feature where an ECC engine reads data from
each memory media location, corrects with an ECC if necessary and
writes the corrected data back to the same memory media location.
More details can be found in Reference [1].

CXL patrol scrub and DDR5 ECS and ACPI RAS2 HW based memory
patrol scrub features are added as use cases for the scrub
subsystem to expose the scrub controls to the user.

CXL device patrol scrub and DDR5 ECS features needs support for
the CXL feature mail box commands. CXL device scrub driver
registers with the memory scrub subsystem to expose the scrub
controls for CXL device patrol and ECS scrubs to the user.

RAS2 HW based memory patrol scrub needs RAS2 PCC interfaces
and ACPI RAS2 driver for communication b/w kernel and firmware.
ACPI RAS2 Driver adds platform device, for each memory feature,
which binds to the RAS2 memory driver.
Memory RAS2 driver registers with the memory scrub subsystem to
expose the RAS2 scrub controls to the user.

Series adds,
1. scrub subsystem driver supports configuring memory scrubs
   in the system.
2. support for CXL feature mailbox commands.
3. CXL device scrub driver supporting patrol scrub control and
   ECS control features.
4. register CXL device patrol scrub and ECS with scrub subsystem.
5. common library for RAS2 PCC interfaces.
6. ACPI RAS2 driver for ACPI RAS2 feature table (RAS2).
7. memory RAS2 driver and registers with scrub subsystem.

The QEMU series to support the CXL specific scrub features is
available here,
https://lore.kernel.org/qemu-devel/20240223085902.1549-1-shiju.jose@huawei.com/

References:
1. Discussions on kernel support of memory error detection and
   patrol scrubber can be found here.
   https://lore.kernel.org/all/20221103155029.2451105-1-jiaqiyan@google.com/

2. Discussions on RASF:
   https://lore.kernel.org/lkml/20230915172818.761-1-shiju.jose@huawei.com/#r 
   https://patchwork.kernel.org/project/linux-arm-kernel/patch/CS1PR84MB0038718F49DBC0FF03919E1184390@CS1PR84MB0038.NAMPRD84.PROD.OUTLOOK.COM/

Changes
v6 -> v7:
1. Main changes for comments from Jonathan.
1.1. CXL
 - Changes for deal with small mail box and supporting multipart
   feature data transfers.
 - Provide more specific parameters to mbox supported/get/set features
   interface functions.
 - kvmalloc -> kmalloc in CXL scrub mem allocation for feature commands.
 - Changed the way using __free(kfree)
 - Removed readback and verify for setting CXL scrub patrol and ECS
   parameters. Could be added later if needed.
 - In is_visible() callback functions for scrub control sysfs attrs
   changed to writeback the default attribute mode value instead of
   setting per attrs.
 - Add documentation for sysfs interfaces for CXL ECS scrub control. 
1.2. RAS2
 - In rasf common code, rename rasf to ras2 because RASF seems obselete.
 - Replace pr_* with dev_* log function calls from ACPI RAS2 and
   memory RAS2 drivers.
 - In rasf common code, rename rasf to ras2.
 - Removed including unnecessary .h file from memory RAS2 driver.
 - In is_visible() callback functions for scrub control sysfs attrs
   changed to writeback the default attribute mode value instead of
   setting per attribute.

2. Changes for comments from Fan.
 - Add debug message if cxl patrol scrub and ecs init function
   calls fail.
3. Updated cover letter for feedback from Dan Williams. 
   
v5 -> v6:
1. Changes for comments from Davidlohr, Thanks.
 - Update CXL feature code based on spec 3.1.
 - attrb -> attr
 - Use enums with default counting.  
2. Rebased to the latest kernel.

v4 -> v5:
1. Following are the main changes made based on the feedback from Dan Williams on v4.
1.1. In the scrub subsystem the common scrub control attributes are statically defined
     instead of dynamically created.
1.2. Add scrub subsystem support externally defined attribute group.
     Add CXL ECS driver define ECS specific attribute group and pass to
	 the scrub subsystem.
1.3. Move cxl_mem_ecs_init() to cxl/core/region.c so that the CXL region_id
     is used in the registration with the scrub subsystem. 	 
1.4. Add previously posted RASF common and RAS2 patches to this scrub series.
	 
2. Add support for the 'enable_background_scrub' attribute
   for RAS2, on request from Bill Schwartz(wschwartz@amperecomputing.com).

v3 -> v4:
1. Fixes for the warnings/errors reported by kernel test robot.
2. Add support for reading the 'enable' attribute of CXL patrol scrub.

Changes
v2 -> v3:
1. Changes for comments from Davidlohr, Thanks.
 - Updated cxl scrub kconfig
 - removed usage of the flag is_support_feature from
   the function cxl_mem_get_supported_feature_entry().
 - corrected spelling error.
 - removed unnecessary debug message.
 - removed export feature commands to the userspace.
2. Possible fix for the warnings/errors reported by kernel
   test robot.
3. Add documentation for the common scrub configure atrributes.

v1 -> v2:
1. Changes for comments from Dave Jiang, Thanks.
 - Split patches.
 - reversed xmas tree declarations.
 - declared flags as enums.
 - removed few unnecessary variable initializations.
 - replaced PTR_ERR_OR_ZERO() with IS_ERR() and PTR_ERR().
 - add auto clean declarations.
 - replaced while loop with for loop.
 - Removed allocation from cxl_get_supported_features() and
   cxl_get_feature() and make change to take allocated memory
   pointer from the caller.
 - replaced if/else with switch case.
 - replaced sprintf() with sysfs_emit() in 2 places.
 - replaced goto label with return in few functions.
2. removed unused code for supported attributes from ecs.
3. Included following common patch for scrub configure driver
   to this series.
   "memory: scrub: Add scrub driver supports configuring memory scrubbers
    in the system"

A Somasundaram (1):
  ACPI:RAS2: Add common library for RAS2 PCC interfaces

Shiju Jose (11):
  cxl/mbox: Add GET_SUPPORTED_FEATURES mailbox command
  cxl/mbox: Add GET_FEATURE mailbox command
  cxl/mbox: Add SET_FEATURE mailbox command
  cxl/memscrub: Add CXL device patrol scrub control feature
  cxl/memscrub: Add CXL device ECS control feature
  memory: scrub: Add scrub subsystem driver supports configuring memory
    scrubs in the system
  cxl/memscrub: Register CXL device patrol scrub with scrub subsystem
    driver
  cxl/memscrub: Register CXL device ECS with scrub subsystem driver
  ACPICA: ACPI 6.5: Add support for RAS2 table
  ACPI:RAS2: Add driver for ACPI RAS2 feature table (RAS2)
  memory: RAS2: Add memory RAS2 driver

 .../ABI/testing/sysfs-class-cxl-ecs-configure |  79 ++
 .../ABI/testing/sysfs-class-scrub-configure   |  91 ++
 drivers/acpi/Kconfig                          |  14 +
 drivers/acpi/Makefile                         |   1 +
 drivers/acpi/ras2_acpi.c                      |  97 ++
 drivers/acpi/ras2_acpi_common.c               | 272 +++++
 drivers/cxl/Kconfig                           |  21 +
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/mbox.c                       | 143 +++
 drivers/cxl/core/memscrub.c                   | 954 ++++++++++++++++++
 drivers/cxl/core/region.c                     |   3 +
 drivers/cxl/cxlmem.h                          | 124 +++
 drivers/cxl/pci.c                             |   4 +
 drivers/memory/Kconfig                        |  15 +
 drivers/memory/Makefile                       |   3 +
 drivers/memory/ras2.c                         | 364 +++++++
 drivers/memory/ras2_common.c                  | 282 ++++++
 drivers/memory/scrub/Kconfig                  |  11 +
 drivers/memory/scrub/Makefile                 |   6 +
 drivers/memory/scrub/memory-scrub.c           | 369 +++++++
 include/acpi/actbl2.h                         | 137 +++
 include/acpi/ras2_acpi.h                      |  59 ++
 include/memory/memory-scrub.h                 |  79 ++
 include/memory/ras2.h                         |  88 ++
 24 files changed, 3217 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-cxl-ecs-configure
 create mode 100644 Documentation/ABI/testing/sysfs-class-scrub-configure
 create mode 100755 drivers/acpi/ras2_acpi.c
 create mode 100755 drivers/acpi/ras2_acpi_common.c
 create mode 100644 drivers/cxl/core/memscrub.c
 create mode 100644 drivers/memory/ras2.c
 create mode 100644 drivers/memory/ras2_common.c
 create mode 100644 drivers/memory/scrub/Kconfig
 create mode 100644 drivers/memory/scrub/Makefile
 create mode 100755 drivers/memory/scrub/memory-scrub.c
 create mode 100644 include/acpi/ras2_acpi.h
 create mode 100755 include/memory/memory-scrub.h
 create mode 100755 include/memory/ras2.h

Comments

Borislav Petkov Feb. 23, 2024, 3:42 p.m. UTC | #1
On Fri, Feb 23, 2024 at 10:37:11PM +0800, shiju.jose@huawei.com wrote:
> CXL patrol scrub and DDR5 ECS and ACPI RAS2 HW based memory
> patrol scrub features are added as use cases for the scrub
> subsystem to expose the scrub controls to the user.

WTH is a RAS2 driver and where is the RAS1 driver?
Jonathan Cameron Feb. 23, 2024, 4:25 p.m. UTC | #2
On Fri, 23 Feb 2024 16:42:51 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Feb 23, 2024 at 10:37:11PM +0800, shiju.jose@huawei.com wrote:
> > CXL patrol scrub and DDR5 ECS and ACPI RAS2 HW based memory
> > patrol scrub features are added as use cases for the scrub
> > subsystem to expose the scrub controls to the user.  
> 
> WTH is a RAS2 driver and where is the RAS1 driver?
> 
Odd ACPI naming.  RASF (which is very limited) was replaced by RAS2.

Both are ACPI tables that describe the Platform Communication Channels
(mailbox) + provide a definition of the protocol that is sent over
it in order to control memory scrub.  RAS2 has some other uses this
driver isn't touching on.

So in short, it's an ACPI spec defined interface for controlling 
Memory Scrub engines.

https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#acpi-ras2-feature-table-ras2

Jonathan
Borislav Petkov Feb. 23, 2024, 5:51 p.m. UTC | #3
On Fri, Feb 23, 2024 at 04:25:51PM +0000, Jonathan Cameron wrote:
> So in short, it's an ACPI spec defined interface for controlling 
> Memory Scrub engines.

Ok, let's call it acpi_ras2 then so that it is clear. We shouldn't be
perpetuating the ACPI poor naming in the kernel. This way at least it is
namespaced and says exactly where it is coming from.

Thx.
Duran, Leo Feb. 23, 2024, 5:57 p.m. UTC | #4
[AMD Official Use Only - General]

BTW, what is the encoding (units) of the 8-bit "scrub rate"?
(the ACPI spec does not seem to define that detail)

Thanks,
Leo.

-----Original Message-----
From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Sent: Friday, February 23, 2024 11:26 AM
To: Borislav Petkov <bp@alien8.de>
Cc: shiju.jose@huawei.com; linux-cxl@vger.kernel.org; linux-acpi@vger.kernel.org; linux-mm@kvack.org; dan.j.williams@intel.com; dave@stgolabs.net; dave.jiang@intel.com; alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org; david@redhat.com; Sridharan, Vilas <Vilas.Sridharan@amd.com>; Duran, Leo <leo.duran@amd.com>; Ghannam, Yazen <Yazen.Ghannam@amd.com>; rientjes@google.com; jiaqiyan@google.com; tony.luck@intel.com; Grimm, Jon <Jon.Grimm@amd.com>; dave.hansen@linux.intel.com; rafael@kernel.org; lenb@kernel.org; naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com; somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com; duenwen@google.com; mike.malvestuto@intel.com; gthelen@google.com; wschwartz@amperecomputing.com; dferguson@amperecomputing.com; tanxiaofei@huawei.com; prime.zeng@hisilicon.com; kangkang.shen@futurewei.com; wanghuiqiang@huawei.com; linuxarm@huawei.com
Subject: Re: [RFC PATCH v7 00/12] memory: scrub: introduce subsystem + CXL/ACPI-RAS2 drivers

On Fri, 23 Feb 2024 16:42:51 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Feb 23, 2024 at 10:37:11PM +0800, shiju.jose@huawei.com wrote:
> > CXL patrol scrub and DDR5 ECS and ACPI RAS2 HW based memory patrol
> > scrub features are added as use cases for the scrub subsystem to
> > expose the scrub controls to the user.
>
> WTH is a RAS2 driver and where is the RAS1 driver?
>
Odd ACPI naming.  RASF (which is very limited) was replaced by RAS2.

Both are ACPI tables that describe the Platform Communication Channels
(mailbox) + provide a definition of the protocol that is sent over it in order to control memory scrub.  RAS2 has some other uses this driver isn't touching on.

So in short, it's an ACPI spec defined interface for controlling Memory Scrub engines.

https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#acpi-ras2-feature-table-ras2

Jonathan
Daniel Ferguson March 28, 2024, 11:39 p.m. UTC | #5
> RAS2 HW based memory patrol scrub needs RAS2 PCC interfaces
> and ACPI RAS2 driver for communication b/w kernel and firmware.
> ACPI RAS2 Driver adds platform device, for each memory feature,
> which binds to the RAS2 memory driver.
> Memory RAS2 driver registers with the memory scrub subsystem to
> expose the RAS2 scrub controls to the user.

Hi Shiju,
Thanks for this work. This has been very useful for us, as we've
been using it to test a RAS2 implementation here at Ampere
Computing.

In general, the pieces implementing RAS2 are useful and functional. We did,
however, encounter a few places where we had to make some bug fixes.
I'll reply to specific patches with changes I can recommend. Also, there are
a few implementation choices that I will have questions about.

Here is an outline containing the generalized comments I plan on making
within/near relevant patches:
1) Found a bug in ras2.c in how the pcc_desc_list pointer is incremented.
2) Executing a RAS2 command seems unnecessary for feature detection.
specifically, in the functions ras2_is_patrol_scrub_support() and 
ras2_get_patrol_scrub_params()
3) Consider adding more error detection/reporting in ras2_check_pcc_chan()
4) Concerns with multiple scrub devices attempting to map the same PCC 
channel.

-Daniel
Shiju Jose April 3, 2024, 1:52 p.m. UTC | #6
Hi Daniel,

Thanks for the feedbacks.

>-----Original Message-----
>From: Daniel Ferguson <danielf@os.amperecomputing.com>
>Sent: 28 March 2024 23:40
>To: Shiju Jose <shiju.jose@huawei.com>; linux-cxl@vger.kernel.org; linux-
>acpi@vger.kernel.org; linux-mm@kvack.org; dan.j.williams@intel.com;
>dave@stgolabs.net; Jonathan Cameron <jonathan.cameron@huawei.com>;
>dave.jiang@intel.com; alison.schofield@intel.com; vishal.l.verma@intel.com;
>ira.weiny@intel.com
>Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org;
>david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com;
>Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com;
>tony.luck@intel.com; Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
>rafael@kernel.org; lenb@kernel.org; naoya.horiguchi@nec.com;
>james.morse@arm.com; jthoughton@google.com; somasundaram.a@hpe.com;
>erdemaktas@google.com; pgonda@google.com; duenwen@google.com;
>mike.malvestuto@intel.com; gthelen@google.com;
>wschwartz@amperecomputing.com; dferguson@amperecomputing.com;
>tanxiaofei <tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
>kangkang.shen@futurewei.com; wanghuiqiang <wanghuiqiang@huawei.com>;
>Linuxarm <linuxarm@huawei.com>; wbs@os.amperecomputing.com
>Subject: Re: [RFC PATCH v7 00/12] memory: scrub: introduce subsystem +
>CXL/ACPI-RAS2 drivers
>
>> RAS2 HW based memory patrol scrub needs RAS2 PCC interfaces and ACPI
>> RAS2 driver for communication b/w kernel and firmware.
>> ACPI RAS2 Driver adds platform device, for each memory feature, which
>> binds to the RAS2 memory driver.
>> Memory RAS2 driver registers with the memory scrub subsystem to expose
>> the RAS2 scrub controls to the user.
>
>Hi Shiju,
>Thanks for this work. This has been very useful for us, as we've been using it to
>test a RAS2 implementation here at Ampere Computing.
>
>In general, the pieces implementing RAS2 are useful and functional. We did,
>however, encounter a few places where we had to make some bug fixes.
>I'll reply to specific patches with changes I can recommend. Also, there are a
>few implementation choices that I will have questions about.
>
>Here is an outline containing the generalized comments I plan on making
>within/near relevant patches:
>1) Found a bug in ras2.c in how the pcc_desc_list pointer is incremented.
This bug was fixed in the local v8 code.

>2) Executing a RAS2 command seems unnecessary for feature detection.
>specifically, in the functions ras2_is_patrol_scrub_support() and
>ras2_get_patrol_scrub_params()
I will check and modify in the next version.

>3) Consider adding more error detection/reporting in ras2_check_pcc_chan()
Sure.

>4) Concerns with multiple scrub devices attempting to map the same PCC
>channel.
I will reply in the patch. 
>
>-Daniel
>
Thanks,
Shiju