Message ID | 216ab396ab0c34fc391d1c3d3797a0d832a8d563.1700615159.git.alison.schofield@intel.com |
---|---|
State | Superseded |
Commit | f996c3806c8daabad889e9565da0f4e87628ad2e |
Headers | show |
Series | Support poison list retrieval | expand |
alison.schofield@ wrote: > From: Alison Schofield <alison.schofield@intel.com> > > The --poison option to 'cxl list' retrieves poison lists from > memory devices supporting the capability and displays the > returned poison records in the cxl list json. This option can > apply to memdevs or regions. > > Example usage in the Documentation/cxl/cxl-list.txt update. > > Signed-off-by: Alison Schofield <alison.schofield@intel.com> > --- > Documentation/cxl/cxl-list.txt | 58 ++++++++++++++++++++++++++++++++++ > cxl/filter.h | 3 ++ > cxl/list.c | 2 ++ > 3 files changed, 63 insertions(+) > > diff --git a/Documentation/cxl/cxl-list.txt b/Documentation/cxl/cxl-list.txt > index 838de4086678..ee2f1b2d9fae 100644 > --- a/Documentation/cxl/cxl-list.txt > +++ b/Documentation/cxl/cxl-list.txt > @@ -415,6 +415,64 @@ OPTIONS > --region:: > Specify CXL region device name(s), or device id(s), to filter the listing. > > +-L:: > +--poison:: > + Include poison information. The poison list is retrieved from the > + device(s) and poison records are added to the listing. Apply this > + option to memdevs and regions where devices support the poison > + list capability. While CXL calls it "poison" I am not convinced that's the term that end users can universally use for this. This is why "ndctl list" uses -M, but yeah, -M and -P are already taken. Even -E is taken for "errors". > + > +---- > +# cxl list -m mem11 --poison > +[ > + { > + "memdev":"mem11", > + "pmem_size":268435456, > + "ram_size":0, > + "serial":0, > + "host":"0000:37:00.0", > + "poison":{ > + "nr_records":1, > + "records":[ One cleanup I want to see before this goes live... drop nr_records and just make "poison" an array object directly. The number of records is trivially determined by the jq "len" operator. Also, per above rename "poison" to "media_errors". I believe "poison" is an x86'ism where "media_error" is a more generic term. > + { > + "dpa":0, > + "dpa_length":64, > + "source":"Internal", > + } > + ] > + } > + } > +] > +# cxl list -r region5 --poison > +[ > + { > + "region":"region5", > + "resource":1035623989248, > + "size":2147483648, > + "interleave_ways":2, > + "interleave_granularity":4096, > + "decode_state":"commit", > + "poison":{ > + "nr_records":2, > + "records":[ > + { > + "memdev":"mem2", > + "dpa":0, > + "dpa_length":64, Does length need to be prefixed with "dpa_"? > + "source":"Internal", I am not sure what the end user can do with "source"? I have tended to not emit things if I can't think of a use case for the field to be there.
Some comments inline -----Original Message----- From: Dan Williams <dan.j.williams@intel.com> Sent: Wednesday, December 6, 2023 8:23 PM To: Schofield, Alison <alison.schofield@intel.com>; Verma, Vishal L <vishal.l.verma@intel.com> Cc: Schofield, Alison <alison.schofield@intel.com>; nvdimm@lists.linux.dev; linux-cxl@vger.kernel.org Subject: Re: [ndctl PATCH v5 4/5] cxl/list: add --poison option to cxl list alison.schofield@ wrote: > From: Alison Schofield <alison.schofield@intel.com> > > The --poison option to 'cxl list' retrieves poison lists from memory > devices supporting the capability and displays the returned poison > records in the cxl list json. This option can apply to memdevs or > regions. > > Example usage in the Documentation/cxl/cxl-list.txt update. > > Signed-off-by: Alison Schofield <alison.schofield@intel.com> > --- > Documentation/cxl/cxl-list.txt | 58 ++++++++++++++++++++++++++++++++++ > cxl/filter.h | 3 ++ > cxl/list.c | 2 ++ > 3 files changed, 63 insertions(+) > > diff --git a/Documentation/cxl/cxl-list.txt > b/Documentation/cxl/cxl-list.txt index 838de4086678..ee2f1b2d9fae > 100644 > --- a/Documentation/cxl/cxl-list.txt > +++ b/Documentation/cxl/cxl-list.txt > @@ -415,6 +415,64 @@ OPTIONS > --region:: > Specify CXL region device name(s), or device id(s), to filter the listing. > > +-L:: > +--poison:: > + Include poison information. The poison list is retrieved from the > + device(s) and poison records are added to the listing. Apply this > + option to memdevs and regions where devices support the poison > + list capability. While CXL calls it "poison" I am not convinced that's the term that end users can universally use for this. This is why "ndctl list" uses -M, but yeah, -M and -P are already taken. Even -E is taken for "errors". > + > +---- > +# cxl list -m mem11 --poison > +[ > + { > + "memdev":"mem11", > + "pmem_size":268435456, > + "ram_size":0, > + "serial":0, > + "host":"0000:37:00.0", > + "poison":{ > + "nr_records":1, > + "records":[ One cleanup I want to see before this goes live... drop nr_records and just make "poison" an array object directly. The number of records is trivially determined by the jq "len" operator. Also, per above rename "poison" to "media_errors". I believe "poison" is an x86'ism where "media_error" is a more generic term. [ETT] The problem is that a poison can be written into CXL and is NOT a media error. Generally... the only way a customer should interpret that a DPA is "poisoned" is that if you consume that address, you'll end up in a MCA-Recovery path. It does not indicate media health. But see discussion later about "source" > + { > + "dpa":0, > + "dpa_length":64, > + "source":"Internal", > + } > + ] > + } > + } > +] > +# cxl list -r region5 --poison > +[ > + { > + "region":"region5", > + "resource":1035623989248, > + "size":2147483648, > + "interleave_ways":2, > + "interleave_granularity":4096, > + "decode_state":"commit", > + "poison":{ > + "nr_records":2, > + "records":[ > + { > + "memdev":"mem2", > + "dpa":0, > + "dpa_length":64, Does length need to be prefixed with "dpa_"? > + "source":"Internal", I am not sure what the end user can do with "source"? I have tended to not emit things if I can't think of a use case for the field to be there. [ETT] The sources are "external" (poison written into CXL), "internal" (the device generated the poison), "injected" (ie via poison inject) and "vendor specific". This is how I would use source. "external" = don't expect to see a cxl media error, look elsewhere like a UCNA or a mem_data error in the RP's CXL.CM RAS. "internal" = expect to see a media error for more information. "injected" = somebody injected the error, no service action needed except to maybe tighten up your security. "vendor" = see vendor How about a HPA? :D
On Wed, Dec 06, 2023 at 08:23:10PM -0800, Dan Williams wrote: > alison.schofield@ wrote: > > From: Alison Schofield <alison.schofield@intel.com> > > > > The --poison option to 'cxl list' retrieves poison lists from > > memory devices supporting the capability and displays the > > returned poison records in the cxl list json. This option can > > apply to memdevs or regions. > > > > Example usage in the Documentation/cxl/cxl-list.txt update. > > > > Signed-off-by: Alison Schofield <alison.schofield@intel.com> > > --- > > Documentation/cxl/cxl-list.txt | 58 ++++++++++++++++++++++++++++++++++ > > cxl/filter.h | 3 ++ > > cxl/list.c | 2 ++ > > 3 files changed, 63 insertions(+) > > > > diff --git a/Documentation/cxl/cxl-list.txt b/Documentation/cxl/cxl-list.txt > > index 838de4086678..ee2f1b2d9fae 100644 > > --- a/Documentation/cxl/cxl-list.txt > > +++ b/Documentation/cxl/cxl-list.txt > > @@ -415,6 +415,64 @@ OPTIONS > > --region:: > > Specify CXL region device name(s), or device id(s), to filter the listing. > > > > +-L:: > > +--poison:: > > + Include poison information. The poison list is retrieved from the > > + device(s) and poison records are added to the listing. Apply this > > + option to memdevs and regions where devices support the poison > > + list capability. > > While CXL calls it "poison" I am not convinced that's the term that end > users can universally use for this. This is why "ndctl list" uses -M, > but yeah, -M and -P are already taken. Even -E is taken for "errors". > > > + > > +---- > > +# cxl list -m mem11 --poison > > +[ > > + { > > + "memdev":"mem11", > > + "pmem_size":268435456, > > + "ram_size":0, > > + "serial":0, > > + "host":"0000:37:00.0", > > + "poison":{ > > + "nr_records":1, > > + "records":[ > > One cleanup I want to see before this goes live... drop nr_records and > just make "poison" an array object directly. The number of records is > trivially determined by the jq "len" operator. The poison nr_records count is a convenience for the cmdline user, not for the user doing their own jq parsing. It also intends to explicitly show nr_records:0 when no poison is found. Say NAK again and I'll rm it all. > > Also, per above rename "poison" to "media_errors". I believe "poison" is > an x86'ism where "media_error" is a more generic term. OK > > > + { > > + "dpa":0, > > + "dpa_length":64, > > + "source":"Internal", > > + } > > + ] > > + } > > + } > > +] > > +# cxl list -r region5 --poison > > +[ > > + { > > + "region":"region5", > > + "resource":1035623989248, > > + "size":2147483648, > > + "interleave_ways":2, > > + "interleave_granularity":4096, > > + "decode_state":"commit", > > + "poison":{ > > + "nr_records":2, > > + "records":[ > > + { > > + "memdev":"mem2", > > + "dpa":0, > > + "dpa_length":64, > > Does length need to be prefixed with "dpa_"? > Nope. Will remove. > > + "source":"Internal", > > I am not sure what the end user can do with "source"? I have tended to > not emit things if I can't think of a use case for the field to be > there. Erwin followed w a comment on this one. Yeah, I think folks want to know if error was injected at least. Alison
diff --git a/Documentation/cxl/cxl-list.txt b/Documentation/cxl/cxl-list.txt index 838de4086678..ee2f1b2d9fae 100644 --- a/Documentation/cxl/cxl-list.txt +++ b/Documentation/cxl/cxl-list.txt @@ -415,6 +415,64 @@ OPTIONS --region:: Specify CXL region device name(s), or device id(s), to filter the listing. +-L:: +--poison:: + Include poison information. The poison list is retrieved from the + device(s) and poison records are added to the listing. Apply this + option to memdevs and regions where devices support the poison + list capability. + +---- +# cxl list -m mem11 --poison +[ + { + "memdev":"mem11", + "pmem_size":268435456, + "ram_size":0, + "serial":0, + "host":"0000:37:00.0", + "poison":{ + "nr_records":1, + "records":[ + { + "dpa":0, + "dpa_length":64, + "source":"Internal", + } + ] + } + } +] +# cxl list -r region5 --poison +[ + { + "region":"region5", + "resource":1035623989248, + "size":2147483648, + "interleave_ways":2, + "interleave_granularity":4096, + "decode_state":"commit", + "poison":{ + "nr_records":2, + "records":[ + { + "memdev":"mem2", + "dpa":0, + "dpa_length":64, + "source":"Internal", + }, + { + "memdev":"mem5", + "dpa":0, + "length":512, + "source":"Vendor", + } + ] + } + } +] +---- + -v:: --verbose:: Increase verbosity of the output. This can be specified diff --git a/cxl/filter.h b/cxl/filter.h index 3f65990f835a..1241f72ccf62 100644 --- a/cxl/filter.h +++ b/cxl/filter.h @@ -30,6 +30,7 @@ struct cxl_filter_params { bool fw; bool alert_config; bool dax; + bool poison; int verbose; struct log_ctx ctx; }; @@ -88,6 +89,8 @@ static inline unsigned long cxl_filter_to_flags(struct cxl_filter_params *param) flags |= UTIL_JSON_ALERT_CONFIG; if (param->dax) flags |= UTIL_JSON_DAX | UTIL_JSON_DAX_DEVS; + if (param->poison) + flags |= UTIL_JSON_MEDIA_ERRORS; return flags; } diff --git a/cxl/list.c b/cxl/list.c index 93ba51ef895c..13fef8569340 100644 --- a/cxl/list.c +++ b/cxl/list.c @@ -57,6 +57,8 @@ static const struct option options[] = { "include memory device firmware information"), OPT_BOOLEAN('A', "alert-config", ¶m.alert_config, "include alert configuration information"), + OPT_BOOLEAN('L', "poison", ¶m.poison, + "include poison information "), OPT_INCR('v', "verbose", ¶m.verbose, "increase output detail"), #ifdef ENABLE_DEBUG OPT_BOOLEAN(0, "debug", &debug, "debug list walk"),