Message ID | cover.1674070170.git.alison.schofield@intel.com |
---|---|
Series | CXL Poison List Retrieval & Tracing |
On Thu, Jan 26, 2023 at 05:59:03PM -0800, Dan Williams wrote:
> alison.schofield@ wrote:
> > From: Alison Schofield <alison.schofield@intel.com>
> >
> > Subject: [PATCH v5 0/5] CXL Poison List Retrieval & Tracing
> >
> > Changes in v5:
> > - Rebase on cxl/next
> > - Use struct_size() to calc mbox cmd payload .min_out
> > - s/INTERNAL/INJECTED mocked poison record source
> > - Added Jonathan Reviewed-by tag on Patch 3
> >
> > Link to v4:
> > https://lore.kernel.org/linux-cxl/cover.1671135967.git.alison.schofield@intel.com/
> >
> > Add support for retrieving device poison lists and store the returned
> > error records as kernel trace events.
> >
> > The handling of the poison list is guided by the CXL 3.0 Specification
> > Section 8.2.9.8.4.1. [1]
> >
> > Example, triggered by memdev:
> > $ echo 1 > /sys/bus/cxl/devices/mem3/trigger_poison_list
> > cxl_poison: memdev=mem3 pcidev=cxl_mem.3 region= region_uuid=00000000-0000-0000-0000-000000000000 dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
>
> I think the pcidev= field wants to be called something like "host" or
> "parent", because there is no strict requirement that a 'struct
> cxl_memdev' is related to a 'struct pci_dev'. In fact in that example
> "cxl_mem.3" is a 'struct platform_device'. Now that I think about it, I
> think all CXL device events should be emitting the PCIe serial number
> for the memdev.

Will do, 'host' and add PCIe serial no.

>
> I will look in the implementation, but do region= and region_uuid= get
> populated when mem3 is a member of the region?

Not always.
In the case above, where the trigger was by memdev, no.
Region= and region_uuid= (and in the follow-on patch, hpa=) only get
populated if the poison was triggered by region, like the case below.

It could be looked up for the by memdev cases. Is that wanted?

Thanks for the reviews Dan!
> >
> > Example, triggered by region:
> > $ echo 1 > /sys/bus/cxl/devices/region5/trigger_poison_list
> > cxl_poison: memdev=mem0 pcidev=cxl_mem.0 region=region5 region_uuid=bfcb7a29-890e-4a41-8236-fe22221fc75c dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
> > cxl_poison: memdev=mem1 pcidev=cxl_mem.1 region=region5 region_uuid=bfcb7a29-890e-4a41-8236-fe22221fc75c dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
> >
> > [1]: https://www.computeexpresslink.org/download-the-specification
> >
> > Alison Schofield (5):
> >   cxl/mbox: Add GET_POISON_LIST mailbox command
> >   cxl/trace: Add TRACE support for CXL media-error records
> >   cxl/memdev: Add trigger_poison_list sysfs attribute
> >   cxl/region: Add trigger_poison_list sysfs attribute
> >   tools/testing/cxl: Mock support for Get Poison List
> >
> >  Documentation/ABI/testing/sysfs-bus-cxl | 28 +++++++++
> >  drivers/cxl/core/mbox.c                 | 78 +++++++++++++++++++++++
> >  drivers/cxl/core/memdev.c               | 45 ++++++++++++++
> >  drivers/cxl/core/region.c               | 33 ++++++++++
> >  drivers/cxl/core/trace.h                | 83 +++++++++++++++++++++++++
> >  drivers/cxl/cxlmem.h                    | 69 +++++++++++++++++++-
> >  drivers/cxl/pci.c                       |  4 ++
> >  tools/testing/cxl/test/mem.c            | 42 +++++++++++++
> >  8 files changed, 381 insertions(+), 1 deletion(-)
> >
> >
> > base-commit: 589c3357370a596ef7c99c00baca8ac799fce531
> > --
> > 2.37.3
> >
Alison Schofield wrote:
> On Thu, Jan 26, 2023 at 05:59:03PM -0800, Dan Williams wrote:
> > alison.schofield@ wrote:
> > > From: Alison Schofield <alison.schofield@intel.com>
> > >
> > > Subject: [PATCH v5 0/5] CXL Poison List Retrieval & Tracing
> > >
> > > Changes in v5:
> > > - Rebase on cxl/next
> > > - Use struct_size() to calc mbox cmd payload .min_out
> > > - s/INTERNAL/INJECTED mocked poison record source
> > > - Added Jonathan Reviewed-by tag on Patch 3
> > >
> > > Link to v4:
> > > https://lore.kernel.org/linux-cxl/cover.1671135967.git.alison.schofield@intel.com/
> > >
> > > Add support for retrieving device poison lists and store the returned
> > > error records as kernel trace events.
> > >
> > > The handling of the poison list is guided by the CXL 3.0 Specification
> > > Section 8.2.9.8.4.1. [1]
> > >
> > > Example, triggered by memdev:
> > > $ echo 1 > /sys/bus/cxl/devices/mem3/trigger_poison_list
> > > cxl_poison: memdev=mem3 pcidev=cxl_mem.3 region= region_uuid=00000000-0000-0000-0000-000000000000 dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
> >
> > I think the pcidev= field wants to be called something like "host" or
> > "parent", because there is no strict requirement that a 'struct
> > cxl_memdev' is related to a 'struct pci_dev'. In fact in that example
> > "cxl_mem.3" is a 'struct platform_device'. Now that I think about it, I
> > think all CXL device events should be emitting the PCIe serial number
> > for the memdev.
>
> Will do, 'host' and add PCIe serial no.
>
> >
> > I will look in the implementation, but do region= and region_uuid= get
> > populated when mem3 is a member of the region?
>
> Not always.
> In the case above, where the trigger was by memdev, no.
> Region= and region_uuid= (and in the follow-on patch, hpa=) only get
> populated if the poison was triggered by region, like the case below.
>
> It could be looked up for the by memdev cases. Is that wanted?

Just trying to understand the semantics. However, I do think it makes sense
for a memdev trigger to lookup information on all impacted regions
across all of the device's DPA and the region trigger makes sense to
lookup all memdevs, but bounded by the DPA that contributes to that
region. I just want to avoid someone having to trigger the region to get
extra information that was readily available from a memdev listing.

>
> Thanks for the reviews Dan!
>
> > >
> > > Example, triggered by region:
> > > $ echo 1 > /sys/bus/cxl/devices/region5/trigger_poison_list
> > > cxl_poison: memdev=mem0 pcidev=cxl_mem.0 region=region5 region_uuid=bfcb7a29-890e-4a41-8236-fe22221fc75c dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
> > > cxl_poison: memdev=mem1 pcidev=cxl_mem.1 region=region5 region_uuid=bfcb7a29-890e-4a41-8236-fe22221fc75c dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
> > >
> > > [1]: https://www.computeexpresslink.org/download-the-specification
> > >
> > > Alison Schofield (5):
> > >   cxl/mbox: Add GET_POISON_LIST mailbox command
> > >   cxl/trace: Add TRACE support for CXL media-error records
> > >   cxl/memdev: Add trigger_poison_list sysfs attribute
> > >   cxl/region: Add trigger_poison_list sysfs attribute
> > >   tools/testing/cxl: Mock support for Get Poison List
> > >
> > >  Documentation/ABI/testing/sysfs-bus-cxl | 28 +++++++++
> > >  drivers/cxl/core/mbox.c                 | 78 +++++++++++++++++++++++
> > >  drivers/cxl/core/memdev.c               | 45 ++++++++++++++
> > >  drivers/cxl/core/region.c               | 33 ++++++++++
> > >  drivers/cxl/core/trace.h                | 83 +++++++++++++++++++++++++
> > >  drivers/cxl/cxlmem.h                    | 69 +++++++++++++++++++-
> > >  drivers/cxl/pci.c                       |  4 ++
> > >  tools/testing/cxl/test/mem.c            | 42 +++++++++++++
> > >  8 files changed, 381 insertions(+), 1 deletion(-)
> > >
> > >
> > > base-commit: 589c3357370a596ef7c99c00baca8ac799fce531
> > > --
> > > 2.37.3
> > >
On Fri, Jan 27, 2023 at 11:16:49AM -0800, Dan Williams wrote:
> Alison Schofield wrote:
> > On Thu, Jan 26, 2023 at 05:59:03PM -0800, Dan Williams wrote:
> > > alison.schofield@ wrote:
> > > > From: Alison Schofield <alison.schofield@intel.com>
> > > >
> > > > Subject: [PATCH v5 0/5] CXL Poison List Retrieval & Tracing
> > > >
> > > > Changes in v5:
> > > > - Rebase on cxl/next
> > > > - Use struct_size() to calc mbox cmd payload .min_out
> > > > - s/INTERNAL/INJECTED mocked poison record source
> > > > - Added Jonathan Reviewed-by tag on Patch 3
> > > >
> > > > Link to v4:
> > > > https://lore.kernel.org/linux-cxl/cover.1671135967.git.alison.schofield@intel.com/
> > > >
> > > > Add support for retrieving device poison lists and store the returned
> > > > error records as kernel trace events.
> > > >
> > > > The handling of the poison list is guided by the CXL 3.0 Specification
> > > > Section 8.2.9.8.4.1. [1]
> > > >
> > > > Example, triggered by memdev:
> > > > $ echo 1 > /sys/bus/cxl/devices/mem3/trigger_poison_list
> > > > cxl_poison: memdev=mem3 pcidev=cxl_mem.3 region= region_uuid=00000000-0000-0000-0000-000000000000 dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
> > >
> > > I think the pcidev= field wants to be called something like "host" or
> > > "parent", because there is no strict requirement that a 'struct
> > > cxl_memdev' is related to a 'struct pci_dev'. In fact in that example
> > > "cxl_mem.3" is a 'struct platform_device'. Now that I think about it, I
> > > think all CXL device events should be emitting the PCIe serial number
> > > for the memdev.
> >
> > Will do, 'host' and add PCIe serial no.
> >
> > >
> > > I will look in the implementation, but do region= and region_uuid= get
> > > populated when mem3 is a member of the region?
> >
> > Not always.
> > In the case above, where the trigger was by memdev, no.
> > Region= and region_uuid= (and in the follow-on patch, hpa=) only get
> > populated if the poison was triggered by region, like the case below.
> >
> > It could be looked up for the by memdev cases. Is that wanted?
>
> Just trying to understand the semantics. However, I do think it makes sense
> for a memdev trigger to lookup information on all impacted regions
> across all of the device's DPA and the region trigger makes sense to
> lookup all memdevs, but bounded by the DPA that contributes to that
> region. I just want to avoid someone having to trigger the region to get
> extra information that was readily available from a memdev listing.
>

Dan -

Confirming my take-away from this email, and our chat:

Remove the by-region trigger_poison_list option entirely. User space
needs to trigger by-memdev the memdevs participating in the region and
filter those events by region.

Add the region info (region name, uuid) to the TRACE_EVENTs when the
poisoned DPA is part of any region.

Alison

> >
> > Thanks for the reviews Dan!
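The user-space flow described in the take-away above (trigger each participating memdev, then filter the resulting events by region) could look roughly like the sketch below. This is only an illustration under assumptions: the sysfs path follows the v5 cover letter and may change in later revisions, the list of memdevs in a region is taken as given, and the helper names `trigger_poison_lists` and `events_for_region` are hypothetical.

```python
import re
from pathlib import Path

# Matches the region= field but not region_uuid=; the value may be empty.
REGION_FIELD = re.compile(r"(?<!\S)region=(\S*)")


def trigger_poison_lists(memdevs, sysfs_root="/sys/bus/cxl/devices"):
    """Write 1 to trigger_poison_list for each named memdev (path per the v5 series)."""
    for name in memdevs:
        (Path(sysfs_root) / name / "trigger_poison_list").write_text("1\n")


def events_for_region(trace_lines, region):
    """Keep only cxl_poison events whose region= field equals `region`."""
    selected = []
    for line in trace_lines:
        if "cxl_poison:" not in line:
            continue  # not a poison event
        match = REGION_FIELD.search(line)
        if match and match.group(1) == region:
            selected.append(line)
    return selected
```

A caller would pass the memdev names that back the region of interest, collect the trace output however it normally does, and hand those lines to `events_for_region`.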
Alison Schofield wrote:
> On Fri, Jan 27, 2023 at 11:16:49AM -0800, Dan Williams wrote:
> > Alison Schofield wrote:
> > > On Thu, Jan 26, 2023 at 05:59:03PM -0800, Dan Williams wrote:
> > > > alison.schofield@ wrote:
> > > > > From: Alison Schofield <alison.schofield@intel.com>
> > > > >
> > > > > Subject: [PATCH v5 0/5] CXL Poison List Retrieval & Tracing
> > > > >
> > > > > Changes in v5:
> > > > > - Rebase on cxl/next
> > > > > - Use struct_size() to calc mbox cmd payload .min_out
> > > > > - s/INTERNAL/INJECTED mocked poison record source
> > > > > - Added Jonathan Reviewed-by tag on Patch 3
> > > > >
> > > > > Link to v4:
> > > > > https://lore.kernel.org/linux-cxl/cover.1671135967.git.alison.schofield@intel.com/
> > > > >
> > > > > Add support for retrieving device poison lists and store the returned
> > > > > error records as kernel trace events.
> > > > >
> > > > > The handling of the poison list is guided by the CXL 3.0 Specification
> > > > > Section 8.2.9.8.4.1. [1]
> > > > >
> > > > > Example, triggered by memdev:
> > > > > $ echo 1 > /sys/bus/cxl/devices/mem3/trigger_poison_list
> > > > > cxl_poison: memdev=mem3 pcidev=cxl_mem.3 region= region_uuid=00000000-0000-0000-0000-000000000000 dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
> > > >
> > > > I think the pcidev= field wants to be called something like "host" or
> > > > "parent", because there is no strict requirement that a 'struct
> > > > cxl_memdev' is related to a 'struct pci_dev'. In fact in that example
> > > > "cxl_mem.3" is a 'struct platform_device'. Now that I think about it, I
> > > > think all CXL device events should be emitting the PCIe serial number
> > > > for the memdev.
> > >
> > > Will do, 'host' and add PCIe serial no.
> > >
> > > >
> > > > I will look in the implementation, but do region= and region_uuid= get
> > > > populated when mem3 is a member of the region?
> > >
> > > Not always.
> > > In the case above, where the trigger was by memdev, no.
> > > Region= and region_uuid= (and in the follow-on patch, hpa=) only get
> > > populated if the poison was triggered by region, like the case below.
> > >
> > > It could be looked up for the by memdev cases. Is that wanted?
> >
> > Just trying to understand the semantics. However, I do think it makes sense
> > for a memdev trigger to lookup information on all impacted regions
> > across all of the device's DPA and the region trigger makes sense to
> > lookup all memdevs, but bounded by the DPA that contributes to that
> > region. I just want to avoid someone having to trigger the region to get
> > extra information that was readily available from a memdev listing.
> >
>
> Dan -
>
> Confirming my take-away from this email, and our chat:
>
> Remove the by-region trigger_poison_list option entirely. User space
> needs to trigger by-memdev the memdevs participating in the region and
> filter those events by region.
>
> Add the region info (region name, uuid) to the TRACE_EVENTs when the
> poisoned DPA is part of any region.

That's what I was thinking, yes. So the internals of
cxl_mem_get_poison() will take the cxl_region_rwsem for read and compare
the device's endpoint decoder settings against the media error records
to do the region (and later HPA) lookup.
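The lookup described here, matching a media error record against the device's endpoint decoder settings, comes down to a range comparison on DPA. The following is a toy user-space model of that comparison, not the kernel implementation: `DecoderRange` and `region_for_poison` are hypothetical names, the decoder table is invented for illustration, and the cxl_region_rwsem locking is not modelled at all.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DecoderRange:
    """Hypothetical stand-in for one endpoint decoder's DPA window."""
    dpa_start: int
    dpa_len: int
    region: str


def region_for_poison(dpa: int, length: int,
                      decoders: Iterable[DecoderRange]) -> Optional[str]:
    """Return the region whose DPA window overlaps [dpa, dpa + length), if any."""
    end = dpa + length
    for dec in decoders:
        dec_end = dec.dpa_start + dec.dpa_len
        # Two half-open ranges overlap when each starts before the other ends.
        if dpa < dec_end and dec.dpa_start < end:
            return dec.region
    return None  # poisoned DPA is not mapped into any region
```

This model returns the first overlapping window, so a record that straddles two decoder windows would be attributed only to the first one; how the real code treats that case is not stated in the thread.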
From: Alison Schofield <alison.schofield@intel.com>

**RESENDING this cover letter previously mis-threaded.

Changes in v5:
- Rebase on cxl/next
- Use struct_size() to calc mbox cmd payload .min_out
- s/INTERNAL/INJECTED mocked poison record source
- Added Jonathan Reviewed-by tag on Patch 3

Link to v4:
https://lore.kernel.org/linux-cxl/cover.1671135967.git.alison.schofield@intel.com/

Add support for retrieving device poison lists and store the returned
error records as kernel trace events.

The handling of the poison list is guided by the CXL 3.0 Specification
Section 8.2.9.8.4.1. [1]

Example, triggered by memdev:
$ echo 1 > /sys/bus/cxl/devices/mem3/trigger_poison_list
cxl_poison: memdev=mem3 pcidev=cxl_mem.3 region= region_uuid=00000000-0000-0000-0000-000000000000 dpa=0x0 length=0x40 source=Internal flags= overflow_time=0

Example, triggered by region:
$ echo 1 > /sys/bus/cxl/devices/region5/trigger_poison_list
cxl_poison: memdev=mem0 pcidev=cxl_mem.0 region=region5 region_uuid=bfcb7a29-890e-4a41-8236-fe22221fc75c dpa=0x0 length=0x40 source=Internal flags= overflow_time=0
cxl_poison: memdev=mem1 pcidev=cxl_mem.1 region=region5 region_uuid=bfcb7a29-890e-4a41-8236-fe22221fc75c dpa=0x0 length=0x40 source=Internal flags= overflow_time=0

[1]: https://www.computeexpresslink.org/download-the-specification

Alison Schofield (5):
  cxl/mbox: Add GET_POISON_LIST mailbox command
  cxl/trace: Add TRACE support for CXL media-error records
  cxl/memdev: Add trigger_poison_list sysfs attribute
  cxl/region: Add trigger_poison_list sysfs attribute
  tools/testing/cxl: Mock support for Get Poison List

 Documentation/ABI/testing/sysfs-bus-cxl | 28 +++++++++
 drivers/cxl/core/mbox.c                 | 78 +++++++++++++++++++++++
 drivers/cxl/core/memdev.c               | 45 ++++++++++++++
 drivers/cxl/core/region.c               | 33 ++++++++++
 drivers/cxl/core/trace.h                | 83 +++++++++++++++++++++++++
 drivers/cxl/cxlmem.h                    | 69 +++++++++++++++++++-
 drivers/cxl/pci.c                       |  4 ++
 tools/testing/cxl/test/mem.c            | 42 +++++++++++++
 8 files changed, 381 insertions(+), 1 deletion(-)

base-commit: 589c3357370a596ef7c99c00baca8ac799fce531
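For anyone consuming these events from user space, the `key=value` layout in the examples above is simple to split into fields. The sketch below is a hedged illustration only: the function name `parse_cxl_poison` is hypothetical, and the field set is taken from the v5 examples, which the review discussion suggests will change (for instance, `pcidev=` being renamed), so the parser deliberately accepts whatever keys are present.

```python
from typing import Dict, Optional, Union


def parse_cxl_poison(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Split one cxl_poison trace line into a dict of its key=value fields.

    Returns None when the line is not a cxl_poison event. Empty values
    (for example "region=" or "flags=") are kept as empty strings.
    """
    _, marker, rest = line.partition("cxl_poison:")
    if not marker:
        return None
    fields: Dict[str, Union[str, int]] = {}
    for token in rest.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    # dpa and length are printed as hex in the examples; convert when present.
    for key in ("dpa", "length"):
        if key in fields:
            fields[key] = int(fields[key], 16)
    return fields
```

Text before the `cxl_poison:` marker, such as the task, CPU, and timestamp columns a trace buffer normally prints, is ignored by the partition step.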