Message ID | 20241025171356.1377-4-shiju.jose@huawei.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Series | EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers | expand |
On Fri, Oct 25, 2024 at 06:13:44PM +0100, shiju.jose@huawei.com wrote: > +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/log_entry_type > +Date: Jan 2025 > +KernelVersion: 6.13 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RW) The log entry type of how the DDR5 ECS log is reported. > + 00b - per DRAM. > + 01b - per memory media FRU. If the conversion function here is kstrtoul(), why are those values not "0" and "1" but in binary format? > + > +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/log_entry_type_per_dram > +Date: Jan 2025 > +KernelVersion: 6.13 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RO) True if current log entry type is per DRAM. > + > +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/log_entry_type_per_memory_media > +Date: Jan 2025 > +KernelVersion: 6.13 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RO) True if current log entry type is per memory media FRU. What's the point of those two if log_entry_type already gives you the same info? And the filename length is a bit too much... > + > +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/mode > +Date: Jan 2025 > +KernelVersion: 6.13 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RW) The mode of how the DDR5 ECS counts the errors. > + 0 - ECS counts rows with errors. > + 1 - ECS counts codewords with errors. Now we have "0" and "1"s. Oh well. What are "rows", what are "codewords"? Explain them here pls for the user. > +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/mode_counts_rows > +Date: Jan 2025 > +KernelVersion: 6.13 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RO) True if current mode is ECS counts rows with errors. > + > +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/mode_counts_codewords > +Date: Jan 2025 > +KernelVersion: 6.13 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RO) True if current mode is ECS counts codewords with errors. Same question as above - redundant files. > +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/reset > +Date: Jan 2025 > +KernelVersion: 6.13 > +Contact: linux-edac@vger.kernel.org > +Description: > + (WO) ECS reset ECC counter. > + 1 - reset ECC counter to the default value. 1 or any value? Looks like any to me... You should restrict it to "1" in case you want to extend this interface with "2" in the future, for example, doing something a bit different. > + > +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/threshold > +Date: Jan 2025 > +KernelVersion: 6.13 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RW) ECS threshold count per gigabits of memory cells. That definitely needs more explanation. > diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile > index 188501e676c7..b24c2c112d9c 100644 > --- a/drivers/edac/Makefile > +++ b/drivers/edac/Makefile > @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC) := edac_core.o > > edac_core-y := edac_mc.o edac_device.o edac_mc_sysfs.o > edac_core-y += edac_module.o edac_device_sysfs.o wq.o > -edac_core-y += scrub.o > +edac_core-y += scrub.o ecs.o > > edac_core-$(CONFIG_EDAC_DEBUG) += debugfs.o > > diff --git a/drivers/edac/ecs.c b/drivers/edac/ecs.c > new file mode 100755 > index 000000000000..a2b64d7bf6b6 > --- /dev/null > +++ b/drivers/edac/ecs.c > @@ -0,0 +1,240 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Generic ECS driver in order to support control the on die > + * error check scrub (e.g. DDR5 ECS). This sentence needs grammar check. > The common sysfs ECS > + * interface abstracts the control of an arbitrary ECS > + * functionality to a common set of functions. > + * > + * Copyright (c) 2024 HiSilicon Limited. > + */ > + #undef pr_fmt > +#define pr_fmt(fmt) "EDAC ECS: " fmt Grep the tree for examples how to do that properly. Also, this pr_fmt looks unused. > +static umode_t ecs_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id) > +{ > + struct device *ras_feat_dev = kobj_to_dev(kobj); > + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); > + const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops; > + > + switch (attr_id) { > + case ECS_LOG_ENTRY_TYPE: > + if (ops->get_log_entry_type) { > + if (ops->set_log_entry_type) > + return a->mode; > + else > + return 0444; What is the goal for the access mode of all those sysfs entries? I sure hope it is going to be root-only no-matter what. I don't want normal users to cause scrub activity. Please make sure your whole set does that. > + } > + break; > + case ECS_LOG_ENTRY_TYPE_PER_DRAM: > + if (ops->get_log_entry_type_per_dram) > + return a->mode; > + break; > + case ECS_LOG_ENTRY_TYPE_PER_MEMORY_MEDIA: > + if (ops->get_log_entry_type_per_memory_media) > + return a->mode; > + break; > + case ECS_MODE: > + if (ops->get_mode) { > + if (ops->set_mode) > + return a->mode; > + else > + return 0444; > + } > + break; > + case ECS_MODE_COUNTS_ROWS: > + if (ops->get_mode_counts_rows) > + return a->mode; > + break; > + case ECS_MODE_COUNTS_CODEWORDS: > + if (ops->get_mode_counts_codewords) > + return a->mode; > + break; > + case ECS_RESET: > + if (ops->reset) > + return a->mode; > + break; > + case ECS_THRESHOLD: > + if (ops->get_threshold) { > + if (ops->set_threshold) > + return a->mode; > + else > + return 0444; > + } > + break; > + default: > + break; > + } > + > + return 0; > +} > + > +#define EDAC_ECS_ATTR_RO(_name, _fru_id) \ > + ((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_RO(_name), \ > + .fru_id = _fru_id }) > + > +#define EDAC_ECS_ATTR_WO(_name, _fru_id) \ > + ((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_WO(_name), \ > + .fru_id = _fru_id }) > + > +#define EDAC_ECS_ATTR_RW(_name, _fru_id) \ > + ((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_RW(_name), \ > + .fru_id = _fru_id }) > + > +static int ecs_create_desc(struct device *ecs_dev, > + const struct attribute_group **attr_groups, u16 num_media_frus) > +{ > + struct edac_ecs_context *ecs_ctx; > + u32 fru; > + > + ecs_ctx = devm_kzalloc(ecs_dev, sizeof(*ecs_ctx), GFP_KERNEL); > + if (!ecs_ctx) > + return -ENOMEM; > + > + ecs_ctx->num_media_frus = num_media_frus; > + ecs_ctx->fru_ctxs = devm_kcalloc(ecs_dev, num_media_frus, > + sizeof(*ecs_ctx->fru_ctxs), > + GFP_KERNEL); > + if (!ecs_ctx->fru_ctxs) > + return -ENOMEM; > + > + for (fru = 0; fru < num_media_frus; fru++) { > + struct edac_ecs_fru_context *fru_ctx = &ecs_ctx->fru_ctxs[fru]; > + struct attribute_group *group = &fru_ctx->group; > + int i; > + > + fru_ctx->ecs_dev_attr[ECS_LOG_ENTRY_TYPE] = EDAC_ECS_ATTR_RW(log_entry_type, fru); > + fru_ctx->ecs_dev_attr[ECS_LOG_ENTRY_TYPE_PER_DRAM] = > + EDAC_ECS_ATTR_RO(log_entry_type_per_dram, fru); > + fru_ctx->ecs_dev_attr[ECS_LOG_ENTRY_TYPE_PER_MEMORY_MEDIA] = > + EDAC_ECS_ATTR_RO(log_entry_type_per_memory_media, fru); > + fru_ctx->ecs_dev_attr[ECS_MODE] = EDAC_ECS_ATTR_RW(mode, fru); > + fru_ctx->ecs_dev_attr[ECS_MODE_COUNTS_ROWS] = > + EDAC_ECS_ATTR_RO(mode_counts_rows, fru); > + fru_ctx->ecs_dev_attr[ECS_MODE_COUNTS_CODEWORDS] = > + EDAC_ECS_ATTR_RO(mode_counts_codewords, fru); > + fru_ctx->ecs_dev_attr[ECS_RESET] = EDAC_ECS_ATTR_WO(reset, fru); > + fru_ctx->ecs_dev_attr[ECS_THRESHOLD] = EDAC_ECS_ATTR_RW(threshold, fru); Clearly too long variable and define names. Shorten pls. Also, a new line here: <--- > + for (i = 0; i < ECS_MAX_ATTRS; i++) > + fru_ctx->ecs_attrs[i] = &fru_ctx->ecs_dev_attr[i].dev_attr.attr; > + > + sprintf(fru_ctx->name, "%s%d", EDAC_ECS_FRU_NAME, fru); > + group->name = fru_ctx->name; > + group->attrs = fru_ctx->ecs_attrs; > + group->is_visible = ecs_attr_visible; > + > + attr_groups[fru] = group; > + } > + > + return 0; > +} > + > +/** > + * edac_ecs_get_desc - get EDAC ECS descriptors > + * @ecs_dev: client device, supports ECS feature > + * @attr_groups: pointer to attribute group container > + * @num_media_frus: number of media FRUs in the device > + * > + * Return: > + * * %0 - Success. > + * * %-EINVAL - Invalid parameters passed. > + * * %-ENOMEM - Dynamic memory allocation failed. > + */ > +int edac_ecs_get_desc(struct device *ecs_dev, > + const struct attribute_group **attr_groups, u16 num_media_frus) > +{ > + if (!ecs_dev || !attr_groups || !num_media_frus) > + return -EINVAL; > + > + return ecs_create_desc(ecs_dev, attr_groups, num_media_frus); > +} > diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c > index 91552271b34a..5fc3ec7f25eb 100644 > --- a/drivers/edac/edac_device.c > +++ b/drivers/edac/edac_device.c > @@ -626,6 +626,9 @@ int edac_dev_register(struct device *parent, char *name, > attr_gcnt++; > scrub_cnt++; > break; > + case RAS_FEAT_ECS: > + attr_gcnt += ras_features[feat].ecs_info.num_media_frus; > + break; > default: > return -EINVAL; > } > @@ -667,6 +670,18 @@ int edac_dev_register(struct device *parent, char *name, > scrub_inst++; > attr_gcnt++; > break; > + case RAS_FEAT_ECS: > + if (!ras_features->ecs_ops) > + goto data_mem_free; <---- newline here. > + dev_data = &ctx->ecs; > + dev_data->ecs_ops = ras_features->ecs_ops; > + dev_data->private = ras_features->ctx; > + ret = edac_ecs_get_desc(parent, &ras_attr_groups[attr_gcnt], > + ras_features->ecs_info.num_media_frus); > + if (ret) > + goto data_mem_free; Ditto. > + attr_gcnt += ras_features->ecs_info.num_media_frus; > + break; > default: > ret = -EINVAL; > goto data_mem_free;
Thanks for the comments. >-----Original Message----- >From: Borislav Petkov <bp@alien8.de> >Sent: 28 October 2024 11:17 >To: Shiju Jose <shiju.jose@huawei.com> >Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org; linux- >acpi@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org; >tony.luck@intel.com; rafael@kernel.org; lenb@kernel.org; >mchehab@kernel.org; dan.j.williams@intel.com; dave@stgolabs.net; Jonathan >Cameron <jonathan.cameron@huawei.com>; gregkh@linuxfoundation.org; >sudeep.holla@arm.com; jassisinghbrar@gmail.com; dave.jiang@intel.com; >alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com; >david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com; >Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com; >Jon.Grimm@amd.com; dave.hansen@linux.intel.com; >naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com; >somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com; >duenwen@google.com; gthelen@google.com; >wschwartz@amperecomputing.com; dferguson@amperecomputing.com; >wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei ><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto >Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com; >wanghuiqiang <wanghuiqiang@huawei.com>; Linuxarm ><linuxarm@huawei.com> >Subject: Re: [PATCH v14 03/14] EDAC: Add ECS control feature > >On Fri, Oct 25, 2024 at 06:13:44PM +0100, shiju.jose@huawei.com wrote: >> +What: /sys/bus/edac/devices/<dev- >name>/ecs_fruX/log_entry_type >> +Date: Jan 2025 >> +KernelVersion: 6.13 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RW) The log entry type of how the DDR5 ECS log is reported. >> + 00b - per DRAM. >> + 01b - per memory media FRU. > >If the conversion function here is kstrtoul(), why are those values not "0" >and "1" but in binary format? Followed the description for the " Log Entry Type " in the CXL spec rev 3.1 Table 8-210. DDR5 ECS Control Feature Readable Attributes and Table 8-211. DDR5 ECS Control Feature Writable Attributes. This was written as " Common DDR5 ECS Log Capabilities " though actually returns the status of the log_entry_type. Common DDR5 ECS Log Capabilities • Bits[1:0]: Log Entry Type: The log entry type of how the ECS log is reported. The entry type is defined commonly for all memory media FRUs within the device. — 00b = Per DRAM — 01b = Per Memory Media FRU — All other encodings are reserved • Bits[7:2]: Reserved. I will update to 0 and 1. > >> + >> +What: /sys/bus/edac/devices/<dev- >name>/ecs_fruX/log_entry_type_per_dram >> +Date: Jan 2025 >> +KernelVersion: 6.13 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RO) True if current log entry type is per DRAM. >> + >> +What: /sys/bus/edac/devices/<dev- >name>/ecs_fruX/log_entry_type_per_memory_media >> +Date: Jan 2025 >> +KernelVersion: 6.13 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RO) True if current log entry type is per memory media FRU. > >What's the point of those two if log_entry_type already gives you the same info? > This was written as " Common DDR5 ECS Log Capabilities " in the CXL spec rev 3.1 Table 8-210. DDR5 ECS Control Feature Readable Attributes, though actually returns the status of the log_entry_type. I will remove these extra log type attributes. >And the filename length is a bit too much... > >> + >> +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/mode >> +Date: Jan 2025 >> +KernelVersion: 6.13 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RW) The mode of how the DDR5 ECS counts the errors. >> + 0 - ECS counts rows with errors. >> + 1 - ECS counts codewords with errors. > >Now we have "0" and "1"s. Oh well. > >What are "rows", what are "codewords"? Explain them here pls for the user. Will do. > >> +What: /sys/bus/edac/devices/<dev- >name>/ecs_fruX/mode_counts_rows >> +Date: Jan 2025 >> +KernelVersion: 6.13 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RO) True if current mode is ECS counts rows with errors. >> + >> +What: /sys/bus/edac/devices/<dev- >name>/ecs_fruX/mode_counts_codewords >> +Date: Jan 2025 >> +KernelVersion: 6.13 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RO) True if current mode is ECS counts codewords with errors. > >Same question as above - redundant files. I will remove redundant files. > >> +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/reset >> +Date: Jan 2025 >> +KernelVersion: 6.13 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (WO) ECS reset ECC counter. >> + 1 - reset ECC counter to the default value. > >1 or any value? > >Looks like any to me... > >You should restrict it to "1" in case you want to extend this interface with "2" in >the future, for example, doing something a bit different. I will update. Returns error for any other value set than '1'. > >> + >> +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/threshold >> +Date: Jan 2025 >> +KernelVersion: 6.13 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RW) ECS threshold count per gigabits of memory cells. > >That definitely needs more explanation. Sure. > >> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile index >> 188501e676c7..b24c2c112d9c 100644 >> --- a/drivers/edac/Makefile >> +++ b/drivers/edac/Makefile >> @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC) := edac_core.o >> >> edac_core-y := edac_mc.o edac_device.o edac_mc_sysfs.o >> edac_core-y += edac_module.o edac_device_sysfs.o wq.o >> -edac_core-y += scrub.o >> +edac_core-y += scrub.o ecs.o >> >> edac_core-$(CONFIG_EDAC_DEBUG) += debugfs.o >> >> diff --git a/drivers/edac/ecs.c b/drivers/edac/ecs.c new file mode >> 100755 index 000000000000..a2b64d7bf6b6 >> --- /dev/null >> +++ b/drivers/edac/ecs.c >> @@ -0,0 +1,240 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Generic ECS driver in order to support control the on die >> + * error check scrub (e.g. DDR5 ECS). > >This sentence needs grammar check. Will do. > >> The common sysfs ECS >> + * interface abstracts the control of an arbitrary ECS >> + * functionality to a common set of functions. >> + * >> + * Copyright (c) 2024 HiSilicon Limited. >> + */ >> + > >#undef pr_fmt > >> +#define pr_fmt(fmt) "EDAC ECS: " fmt > >Grep the tree for examples how to do that properly. > >Also, this pr_fmt looks unused. I will remove. > >> +static umode_t ecs_attr_visible(struct kobject *kobj, struct >> +attribute *a, int attr_id) { >> + struct device *ras_feat_dev = kobj_to_dev(kobj); >> + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); >> + const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops; >> + >> + switch (attr_id) { >> + case ECS_LOG_ENTRY_TYPE: >> + if (ops->get_log_entry_type) { >> + if (ops->set_log_entry_type) >> + return a->mode; >> + else >> + return 0444; > >What is the goal for the access mode of all those sysfs entries? I sure hope it is >going to be root-only no-matter what. I don't want normal users to cause scrub >activity. Please make sure your whole set does that. This is the attribute_group's is_visible() callback for controlling the visibility of the attributes based on whether attribute is added or not by the parent driver. Yes. Writable only by the root. > >> + } >> + break; >> + case ECS_LOG_ENTRY_TYPE_PER_DRAM: >> + if (ops->get_log_entry_type_per_dram) >> + return a->mode; >> + break; [...] > EDAC_ECS_ATTR_RO(log_entry_type_per_dram, fru); >> + fru_ctx- >>ecs_dev_attr[ECS_LOG_ENTRY_TYPE_PER_MEMORY_MEDIA] = >> + > EDAC_ECS_ATTR_RO(log_entry_type_per_memory_media, fru); >> + fru_ctx->ecs_dev_attr[ECS_MODE] = >EDAC_ECS_ATTR_RW(mode, fru); >> + fru_ctx->ecs_dev_attr[ECS_MODE_COUNTS_ROWS] = >> + > EDAC_ECS_ATTR_RO(mode_counts_rows, fru); >> + fru_ctx->ecs_dev_attr[ECS_MODE_COUNTS_CODEWORDS] = >> + > EDAC_ECS_ATTR_RO(mode_counts_codewords, fru); >> + fru_ctx->ecs_dev_attr[ECS_RESET] = >EDAC_ECS_ATTR_WO(reset, fru); >> + fru_ctx->ecs_dev_attr[ECS_THRESHOLD] = >EDAC_ECS_ATTR_RW(threshold, >> +fru); > >Clearly too long variable and define names. Shorten pls. Will do > >Also, a new line here: Ok. > ><--- > > [...] >> + case RAS_FEAT_ECS: >> + if (!ras_features->ecs_ops) >> + goto data_mem_free; > ><---- newline here. Wil do. > >> + dev_data = &ctx->ecs; >> + dev_data->ecs_ops = ras_features->ecs_ops; >> + dev_data->private = ras_features->ctx; >> + ret = edac_ecs_get_desc(parent, >&ras_attr_groups[attr_gcnt], >> + ras_features- >>ecs_info.num_media_frus); >> + if (ret) >> + goto data_mem_free; > >Ditto. Wil do. > >> + attr_gcnt += ras_features->ecs_info.num_media_frus; >> + break; >> default: >> ret = -EINVAL; >> goto data_mem_free; > >-- >Regards/Gruss, > Boris. Thanks, Shiju
On Mon, Oct 28, 2024 at 04:03:31PM +0000, Shiju Jose wrote:
> I will update to 0 and 1.
Yes please. Just because some spec is written by hw people, doesn't mean we
should use this braindead format.
Thx.
diff --git a/Documentation/ABI/testing/sysfs-edac-ecs b/Documentation/ABI/testing/sysfs-edac-ecs new file mode 100644 index 000000000000..b94cc09f9222 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-edac-ecs @@ -0,0 +1,76 @@ +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX +Date: Jan 2025 +KernelVersion: 6.13 +Contact: linux-edac@vger.kernel.org +Description: + The sysfs EDAC bus devices /<dev-name>/ecs_fruX subdirectory + belongs to the memory media ECS (Error Check Scrub) control + feature, where <dev-name> directory corresponds to a device + registered with the EDAC device driver for the ECS feature. + /ecs_fruX belongs to the media FRUs (Field Replaceable Unit) + under the memory device. + The sysfs ECS attr nodes are only present if the client + driver has implemented the corresponding attr callback + function and passed in ops to the EDAC RAS feature driver + during registration. + +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/log_entry_type +Date: Jan 2025 +KernelVersion: 6.13 +Contact: linux-edac@vger.kernel.org +Description: + (RW) The log entry type of how the DDR5 ECS log is reported. + 00b - per DRAM. + 01b - per memory media FRU. + +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/log_entry_type_per_dram +Date: Jan 2025 +KernelVersion: 6.13 +Contact: linux-edac@vger.kernel.org +Description: + (RO) True if current log entry type is per DRAM. + +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/log_entry_type_per_memory_media +Date: Jan 2025 +KernelVersion: 6.13 +Contact: linux-edac@vger.kernel.org +Description: + (RO) True if current log entry type is per memory media FRU. + +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/mode +Date: Jan 2025 +KernelVersion: 6.13 +Contact: linux-edac@vger.kernel.org +Description: + (RW) The mode of how the DDR5 ECS counts the errors. + 0 - ECS counts rows with errors. + 1 - ECS counts codewords with errors. + +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/mode_counts_rows +Date: Jan 2025 +KernelVersion: 6.13 +Contact: linux-edac@vger.kernel.org +Description: + (RO) True if current mode is ECS counts rows with errors. + +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/mode_counts_codewords +Date: Jan 2025 +KernelVersion: 6.13 +Contact: linux-edac@vger.kernel.org +Description: + (RO) True if current mode is ECS counts codewords with errors. + +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/reset +Date: Jan 2025 +KernelVersion: 6.13 +Contact: linux-edac@vger.kernel.org +Description: + (WO) ECS reset ECC counter. + 1 - reset ECC counter to the default value. + +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/threshold +Date: Jan 2025 +KernelVersion: 6.13 +Contact: linux-edac@vger.kernel.org +Description: + (RW) ECS threshold count per gigabits of memory cells. diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile index 188501e676c7..b24c2c112d9c 100644 --- a/drivers/edac/Makefile +++ b/drivers/edac/Makefile @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC) := edac_core.o edac_core-y := edac_mc.o edac_device.o edac_mc_sysfs.o edac_core-y += edac_module.o edac_device_sysfs.o wq.o -edac_core-y += scrub.o +edac_core-y += scrub.o ecs.o edac_core-$(CONFIG_EDAC_DEBUG) += debugfs.o diff --git a/drivers/edac/ecs.c b/drivers/edac/ecs.c new file mode 100755 index 000000000000..a2b64d7bf6b6 --- /dev/null +++ b/drivers/edac/ecs.c @@ -0,0 +1,240 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Generic ECS driver in order to support control the on die + * error check scrub (e.g. DDR5 ECS). The common sysfs ECS + * interface abstracts the control of an arbitrary ECS + * functionality to a common set of functions. + * + * Copyright (c) 2024 HiSilicon Limited. + */ + +#define pr_fmt(fmt) "EDAC ECS: " fmt + +#include <linux/edac.h> + +#define EDAC_ECS_FRU_NAME "ecs_fru" + +enum edac_ecs_attributes { + ECS_LOG_ENTRY_TYPE, + ECS_LOG_ENTRY_TYPE_PER_DRAM, + ECS_LOG_ENTRY_TYPE_PER_MEMORY_MEDIA, + ECS_MODE, + ECS_MODE_COUNTS_ROWS, + ECS_MODE_COUNTS_CODEWORDS, + ECS_RESET, + ECS_THRESHOLD, + ECS_MAX_ATTRS +}; + +struct edac_ecs_dev_attr { + struct device_attribute dev_attr; + int fru_id; +}; + +struct edac_ecs_fru_context { + char name[EDAC_FEAT_NAME_LEN]; + struct edac_ecs_dev_attr ecs_dev_attr[ECS_MAX_ATTRS]; + struct attribute *ecs_attrs[ECS_MAX_ATTRS + 1]; + struct attribute_group group; +}; + +struct edac_ecs_context { + u16 num_media_frus; + struct edac_ecs_fru_context *fru_ctxs; +}; + +#define TO_ECS_DEV_ATTR(_dev_attr) \ + container_of(_dev_attr, struct edac_ecs_dev_attr, dev_attr) + +#define EDAC_ECS_ATTR_SHOW(attrib, cb, type, format) \ +static ssize_t attrib##_show(struct device *ras_feat_dev, \ + struct device_attribute *attr, char *buf) \ +{ \ + struct edac_ecs_dev_attr *ecs_dev_attr = TO_ECS_DEV_ATTR(attr); \ + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \ + const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops; \ + type data; \ + int ret; \ + \ + ret = ops->cb(ras_feat_dev->parent, ctx->ecs.private, \ + ecs_dev_attr->fru_id, &data); \ + if (ret) \ + return ret; \ + \ + return sysfs_emit(buf, format, data); \ +} + +EDAC_ECS_ATTR_SHOW(log_entry_type, get_log_entry_type, u32, "%u\n") +EDAC_ECS_ATTR_SHOW(log_entry_type_per_dram, get_log_entry_type_per_dram, u32, "%u\n") +EDAC_ECS_ATTR_SHOW(log_entry_type_per_memory_media, get_log_entry_type_per_memory_media, + u32, "%u\n") +EDAC_ECS_ATTR_SHOW(mode, get_mode, u32, "%u\n") +EDAC_ECS_ATTR_SHOW(mode_counts_rows, get_mode_counts_rows, u32, "%u\n") +EDAC_ECS_ATTR_SHOW(mode_counts_codewords, get_mode_counts_codewords, u32, "%u\n") +EDAC_ECS_ATTR_SHOW(threshold, get_threshold, u32, "%u\n") + +#define EDAC_ECS_ATTR_STORE(attrib, cb, type, conv_func) \ +static ssize_t attrib##_store(struct device *ras_feat_dev, \ + struct device_attribute *attr, \ + const char *buf, size_t len) \ +{ \ + struct edac_ecs_dev_attr *ecs_dev_attr = TO_ECS_DEV_ATTR(attr); \ + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \ + const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops; \ + type data; \ + int ret; \ + \ + ret = conv_func(buf, 0, &data); \ + if (ret < 0) \ + return ret; \ + \ + ret = ops->cb(ras_feat_dev->parent, ctx->ecs.private, \ + ecs_dev_attr->fru_id, data); \ + if (ret) \ + return ret; \ + \ + return len; \ +} + +EDAC_ECS_ATTR_STORE(log_entry_type, set_log_entry_type, unsigned long, kstrtoul) +EDAC_ECS_ATTR_STORE(mode, set_mode, unsigned long, kstrtoul) +EDAC_ECS_ATTR_STORE(reset, reset, unsigned long, kstrtoul) +EDAC_ECS_ATTR_STORE(threshold, set_threshold, unsigned long, kstrtoul) + +static umode_t ecs_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id) +{ + struct device *ras_feat_dev = kobj_to_dev(kobj); + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); + const struct edac_ecs_ops *ops = ctx->ecs.ecs_ops; + + switch (attr_id) { + case ECS_LOG_ENTRY_TYPE: + if (ops->get_log_entry_type) { + if (ops->set_log_entry_type) + return a->mode; + else + return 0444; + } + break; + case ECS_LOG_ENTRY_TYPE_PER_DRAM: + if (ops->get_log_entry_type_per_dram) + return a->mode; + break; + case ECS_LOG_ENTRY_TYPE_PER_MEMORY_MEDIA: + if (ops->get_log_entry_type_per_memory_media) + return a->mode; + break; + case ECS_MODE: + if (ops->get_mode) { + if (ops->set_mode) + return a->mode; + else + return 0444; + } + break; + case ECS_MODE_COUNTS_ROWS: + if (ops->get_mode_counts_rows) + return a->mode; + break; + case ECS_MODE_COUNTS_CODEWORDS: + if (ops->get_mode_counts_codewords) + return a->mode; + break; + case ECS_RESET: + if (ops->reset) + return a->mode; + break; + case ECS_THRESHOLD: + if (ops->get_threshold) { + if (ops->set_threshold) + return a->mode; + else + return 0444; + } + break; + default: + break; + } + + return 0; +} + +#define EDAC_ECS_ATTR_RO(_name, _fru_id) \ + ((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_RO(_name), \ + .fru_id = _fru_id }) + +#define EDAC_ECS_ATTR_WO(_name, _fru_id) \ + ((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_WO(_name), \ + .fru_id = _fru_id }) + +#define EDAC_ECS_ATTR_RW(_name, _fru_id) \ + ((struct edac_ecs_dev_attr) { .dev_attr = __ATTR_RW(_name), \ + .fru_id = _fru_id }) + +static int ecs_create_desc(struct device *ecs_dev, + const struct attribute_group **attr_groups, u16 num_media_frus) +{ + struct edac_ecs_context *ecs_ctx; + u32 fru; + + ecs_ctx = devm_kzalloc(ecs_dev, sizeof(*ecs_ctx), GFP_KERNEL); + if (!ecs_ctx) + return -ENOMEM; + + ecs_ctx->num_media_frus = num_media_frus; + ecs_ctx->fru_ctxs = devm_kcalloc(ecs_dev, num_media_frus, + sizeof(*ecs_ctx->fru_ctxs), + GFP_KERNEL); + if (!ecs_ctx->fru_ctxs) + return -ENOMEM; + + for (fru = 0; fru < num_media_frus; fru++) { + struct edac_ecs_fru_context *fru_ctx = &ecs_ctx->fru_ctxs[fru]; + struct attribute_group *group = &fru_ctx->group; + int i; + + fru_ctx->ecs_dev_attr[ECS_LOG_ENTRY_TYPE] = EDAC_ECS_ATTR_RW(log_entry_type, fru); + fru_ctx->ecs_dev_attr[ECS_LOG_ENTRY_TYPE_PER_DRAM] = + EDAC_ECS_ATTR_RO(log_entry_type_per_dram, fru); + fru_ctx->ecs_dev_attr[ECS_LOG_ENTRY_TYPE_PER_MEMORY_MEDIA] = + EDAC_ECS_ATTR_RO(log_entry_type_per_memory_media, fru); + fru_ctx->ecs_dev_attr[ECS_MODE] = EDAC_ECS_ATTR_RW(mode, fru); + fru_ctx->ecs_dev_attr[ECS_MODE_COUNTS_ROWS] = + EDAC_ECS_ATTR_RO(mode_counts_rows, fru); + fru_ctx->ecs_dev_attr[ECS_MODE_COUNTS_CODEWORDS] = + EDAC_ECS_ATTR_RO(mode_counts_codewords, fru); + fru_ctx->ecs_dev_attr[ECS_RESET] = EDAC_ECS_ATTR_WO(reset, fru); + fru_ctx->ecs_dev_attr[ECS_THRESHOLD] = EDAC_ECS_ATTR_RW(threshold, fru); + for (i = 0; i < ECS_MAX_ATTRS; i++) + fru_ctx->ecs_attrs[i] = &fru_ctx->ecs_dev_attr[i].dev_attr.attr; + + sprintf(fru_ctx->name, "%s%d", EDAC_ECS_FRU_NAME, fru); + group->name = fru_ctx->name; + group->attrs = fru_ctx->ecs_attrs; + group->is_visible = ecs_attr_visible; + + attr_groups[fru] = group; + } + + return 0; +} + +/** + * edac_ecs_get_desc - get EDAC ECS descriptors + * @ecs_dev: client device, supports ECS feature + * @attr_groups: pointer to attribute group container + * @num_media_frus: number of media FRUs in the device + * + * Return: + * * %0 - Success. + * * %-EINVAL - Invalid parameters passed. + * * %-ENOMEM - Dynamic memory allocation failed. + */ +int edac_ecs_get_desc(struct device *ecs_dev, + const struct attribute_group **attr_groups, u16 num_media_frus) +{ + if (!ecs_dev || !attr_groups || !num_media_frus) + return -EINVAL; + + return ecs_create_desc(ecs_dev, attr_groups, num_media_frus); +} diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c index 91552271b34a..5fc3ec7f25eb 100644 --- a/drivers/edac/edac_device.c +++ b/drivers/edac/edac_device.c @@ -626,6 +626,9 @@ int edac_dev_register(struct device *parent, char *name, attr_gcnt++; scrub_cnt++; break; + case RAS_FEAT_ECS: + attr_gcnt += ras_features[feat].ecs_info.num_media_frus; + break; default: return -EINVAL; } @@ -667,6 +670,18 @@ int edac_dev_register(struct device *parent, char *name, scrub_inst++; attr_gcnt++; break; + case RAS_FEAT_ECS: + if (!ras_features->ecs_ops) + goto data_mem_free; + dev_data = &ctx->ecs; + dev_data->ecs_ops = ras_features->ecs_ops; + dev_data->private = ras_features->ctx; + ret = edac_ecs_get_desc(parent, &ras_attr_groups[attr_gcnt], + ras_features->ecs_info.num_media_frus); + if (ret) + goto data_mem_free; + attr_gcnt += ras_features->ecs_info.num_media_frus; + break; default: ret = -EINVAL; goto data_mem_free; diff --git a/include/linux/edac.h b/include/linux/edac.h index 3620a09c0476..077d3c252e99 100644 --- a/include/linux/edac.h +++ b/include/linux/edac.h @@ -669,6 +669,7 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci, /* RAS feature type */ enum edac_dev_feat { RAS_FEAT_SCRUB, + RAS_FEAT_ECS, RAS_FEAT_MAX }; @@ -702,9 +703,50 @@ int edac_scrub_get_desc(struct device *scrub_dev, const struct attribute_group **attr_groups, u8 instance); +/** + * struct edac_ecs_ops - ECS device operations (all elements optional) + * @get_log_entry_type: read the log entry type value. + * @set_log_entry_type: set the log entry type value. + * @get_log_entry_type_per_dram: read the log entry type per dram value. + * @get_log_entry_type_memory_media: read the log entry type per memory media value. + * @get_mode: read the mode value. + * @set_mode: set the mode value. + * @get_mode_counts_rows: read the mode counts rows value. + * @get_mode_counts_codewords: read the mode counts codewords value. + * @reset: reset the ECS counter. + * @get_threshold: read the threshold count per gigabits of memory cells. + * @set_threshold: set the threshold count per gigabits of memory cells. + */ +struct edac_ecs_ops { + int (*get_log_entry_type)(struct device *dev, void *drv_data, int fru_id, u32 *val); + int (*set_log_entry_type)(struct device *dev, void *drv_data, int fru_id, u32 val); + int (*get_log_entry_type_per_dram)(struct device *dev, void *drv_data, + int fru_id, u32 *val); + int (*get_log_entry_type_per_memory_media)(struct device *dev, void *drv_data, + int fru_id, u32 *val); + int (*get_mode)(struct device *dev, void *drv_data, int fru_id, u32 *val); + int (*set_mode)(struct device *dev, void *drv_data, int fru_id, u32 val); + int (*get_mode_counts_rows)(struct device *dev, void *drv_data, int fru_id, u32 *val); + int (*get_mode_counts_codewords)(struct device *dev, void *drv_data, int fru_id, u32 *val); + int (*reset)(struct device *dev, void *drv_data, int fru_id, u32 val); + int (*get_threshold)(struct device *dev, void *drv_data, int fru_id, u32 *threshold); + int (*set_threshold)(struct device *dev, void *drv_data, int fru_id, u32 threshold); +}; + +struct edac_ecs_ex_info { + u16 num_media_frus; +}; + +int edac_ecs_get_desc(struct device *ecs_dev, + const struct attribute_group **attr_groups, + u16 num_media_frus); + /* EDAC device feature information structure */ struct edac_dev_data { - const struct edac_scrub_ops *scrub_ops; + union { + const struct edac_scrub_ops *scrub_ops; + const struct edac_ecs_ops *ecs_ops; + }; u8 instance; void *private; }; @@ -713,13 +755,18 @@ struct edac_dev_feat_ctx { struct device dev; void *private; struct edac_dev_data *scrub; + struct edac_dev_data ecs; }; struct edac_dev_feature { enum edac_dev_feat ft_type; u8 instance; - const struct edac_scrub_ops *scrub_ops; + union { + const struct edac_scrub_ops *scrub_ops; + const struct edac_ecs_ops *ecs_ops; + }; void *ctx; + struct edac_ecs_ex_info ecs_info; }; int edac_dev_register(struct device *parent, char *dev_name,