[v6,0/12] cxl: Add support to report region access coordinates to numa nodes

Message ID 20240220231402.3156281-1-dave.jiang@intel.com (mailing list archive)

Message

Dave Jiang Feb. 20, 2024, 11:12 p.m. UTC
Hi Rafael,
Please review patches 1-4,10,11 and ack if they look ok to you. Thank you!

Hi Greg,
Please review patches 2 and 11 and ack the numa node bits if they look ok to you. Thank you!

v6:
- Enhance macros used to reduce code for cxl access coordinates sysfs attrs (Jonathan)
- Various minor updates and fixes, see per commit details. (Jonathan)
- Added review tags from Jonathan.

v5:
- Fix various 0-day issues
- Remove EXPORT_SYMBOL for cxl_coords_combine() (Dan)
- Rebased against fixes series for qos_class [1].

v4:
- Introduce access class 0 and 1 for CXL access coordinates.
- See individual patches for detailed change log if applicable.

v3:
- Make attributes not visible if no data. (Jonathan)
- Fix documentation verbiage. (Jonathan)
- Check against read bandwidth instead of write bandwidth due to future RO devices. (Jonathan)
- Export node_set_perf_attrs() to all namespaces. (Jonathan)
- Remove setting of coordinate access level 1. (Jonathan)

v2:
- Move calculation function to core/cdat.c due to QTG series changes
- Make cxlr->coord static (Dan)
- Move calculation to cxl_region_attach to be under cxl_dpa_rwsem (Dan)
- Normalize perf latency numbers to nanoseconds (Brice)
- Update documentation with units and initiator details (Brice, Dan)
- Fix notifier return values (Dan)
- Use devm_add_action_or_reset() to unregister memory notifier (Dan)

This series adds support for computing the performance data of a CXL region
and for reporting that data to the NUMA node. This series depends on the
CXL QOS class series that is pending in the 6.8 pull request.

CXL memory devices already attached before boot are enumerated by the BIOS.
The SRAT and HMAT tables are set up to include the memory regions
enumerated from those CXL memory devices. For regions not programmed by the
BIOS, or for a hot-plugged CXL memory device, the BIOS does not have the
relevant information and the performance data has to be calculated by the
driver after region assembly.

According to the numaperf documentation [2] there are two access classes
defined for performance between an initiator node and a memory target node.
Access class "0" describes attributes between a memory target and the
highest performing initiator local to the target. In this case the initiator
can be a CPU or an I/O initiator such as a GPU or NIC. Access class "1"
describes attributes between a memory target and the nearest CPU node. Both
access classes are calculated for the CXL memory target and updated for NUMA
nodes either through the HMAT_REPORTING code or directly, depending on
whether the NUMA node is described by the ACPI SRAT table.
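
For readers who want to inspect the two access classes from userspace, here
is a minimal sketch in C. It assumes the sysfs layout described in the
numaperf documentation [2] (node directories with access0/access1
subdirectories holding read/write bandwidth and latency attributes). The
function names numaperf_attr_path() and numaperf_read_attr() are hypothetical
and are not part of this series.

```c
#include <stdio.h>

/*
 * Build the sysfs path of one numaperf attribute. The directory layout
 * follows the numaperf documentation; the function name is hypothetical.
 * Returns the snprintf() result, i.e. the length the full path needs.
 */
static int numaperf_attr_path(char *buf, size_t len, int node,
			      int access_class, const char *attr)
{
	return snprintf(buf, len,
			"/sys/devices/system/node/node%d/access%d/initiators/%s",
			node, access_class, attr);
}

/*
 * Read one attribute as an unsigned value. Returns 0 on success, or -1 if
 * the path does not fit or the attribute is missing or unreadable.
 */
static int numaperf_read_attr(int node, int access_class, const char *attr,
			      unsigned long *val)
{
	char path[128];
	FILE *f;
	int n, ok;

	n = numaperf_attr_path(path, sizeof(path), node, access_class, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	ok = fscanf(f, "%lu", val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

Calling numaperf_read_attr(node, 0, "read_latency", &val) and then the same
with access class 1 would show whether the two classes report different
values for a given CXL-backed node. A missing attribute simply yields -1,
which matches the "attributes not visible if no data" behaviour noted in the
v3 changelog.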

Recall from the qos_class series (v6.8) that the performance data for the
ranges of a CXL memory device is computed and cached. A CXL memory region
can be backed by one or more devices. Thus the performance data of a region
is the aggregated bandwidth of all devices that back the region and the
worst latency out of all devices backing the region.
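
The aggregation rule above can be illustrated with a simplified sketch in C.
This is not the code from the patches: struct coord_sketch and
region_coord_sketch() are hypothetical stand-ins, and the sketch only shows
the stated rule (sum the bandwidth, keep the worst latency), assuming every
device reports its values in the same units.

```c
#include <stddef.h>

/* Hypothetical stand-in for one set of access coordinates. */
struct coord_sketch {
	unsigned int read_bandwidth;
	unsigned int write_bandwidth;
	unsigned int read_latency;	/* nanoseconds */
	unsigned int write_latency;	/* nanoseconds */
};

static unsigned int max_uint(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/*
 * Fold the cached per-device coordinates into region coordinates:
 * bandwidth is summed across devices, latency is the largest (worst)
 * value seen on any device.
 */
static struct coord_sketch region_coord_sketch(const struct coord_sketch *devs,
						size_t ndevs)
{
	struct coord_sketch r = { 0 };
	size_t i;

	for (i = 0; i < ndevs; i++) {
		r.read_bandwidth += devs[i].read_bandwidth;
		r.write_bandwidth += devs[i].write_bandwidth;
		r.read_latency = max_uint(r.read_latency, devs[i].read_latency);
		r.write_latency = max_uint(r.write_latency, devs[i].write_latency);
	}
	return r;
}
```

For example, two devices with read bandwidth 100 and 150 and read latency
200 ns and 300 ns would give a region with read bandwidth 250 and read
latency 300 ns.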

See kernel git branch [3] for convenience.

[1]: https://lore.kernel.org/linux-cxl/20240206190431.1810289-1-dave.jiang@intel.com/T/#t 
[2]: https://www.kernel.org/doc/Documentation/admin-guide/mm/numaperf.rst
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/djiang/linux.git/log/?h=cxl-hmem-report

Comments

Jonathan Cameron March 6, 2024, 2:55 p.m. UTC | #1
On Tue, 20 Feb 2024 16:12:29 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> Hi Rafael,
> Please review patches 1-4,10,11 and ack if they look ok to you. Thank you!
> 
> Hi Greg,
> Please review patch 2 and 11 and ack the numa node bits if they look ok to you. Thank you!

Whilst currently a bit lightweight, I poked at this along with the QEMU Generic Port emulation
on gitlab.com/jic23/qemu cxl-2024-03-05 and some pathological cases from the host side.

It works, so

Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
