[v5,00/27] DCD: Add support for Dynamic Capacity Devices (DCD)

Message ID 20241029-dcd-type2-upstream-v5-0-8739cb67c374@intel.com

Message

Ira Weiny Oct. 29, 2024, 8:34 p.m. UTC
A git tree of this series can be found here:

	https://github.com/weiny2/linux-kernel/tree/dcd-v4-2024-10-29

Series info
===========

This series has 4 parts:

Patch 1: Add core range_overlaps() function
Patch 2-6: CXL clean up/prelim patches
Patch 7-25: Core DCD support
Patch 26-27: cxl_test support
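
For readers unfamiliar with the helper named in patch 1, the following is
a minimal userspace sketch of an inclusive-range overlap test.  The
struct here is a stand-in for the kernel's struct range and the body is
an assumption about the helper's semantics, not a copy of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct range; both ends inclusive. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* True if any part of r1 overlaps r2 (assumed semantics for this sketch). */
static bool range_overlaps(const struct range *r1, const struct range *r2)
{
	/* Both ranges must be well formed: start must not exceed end. */
	assert(r1->start <= r1->end && r2->start <= r2->end);

	return r1->start <= r2->end && r1->end >= r2->start;
}
```

Because both ends are inclusive, two ranges that share only a single
boundary address count as overlapping.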

Background
==========

A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
device that allows memory capacity within a region to change
dynamically without the need for resetting the device, reconfiguring
HDM decoders, or reconfiguring software DAX regions.

One of the biggest use cases for Dynamic Capacity is to allow hosts to
share memory dynamically within a data center without increasing the
per-host attached memory.

The general flow for the addition or removal of memory is to have an
orchestrator coordinate the use of the memory.  Generally there are 5
actors in such a system: the Orchestrator, the Fabric Manager (FM), the
Logical Device, the Host Kernel, and a Host User.

Typical workflows are shown below.

Orchestrator      FM         Device       Host Kernel    Host User

    |             |           |            |              |
    |-------------- Create region ----------------------->|
    |             |           |            |              |
    |             |           |            |<-- Create ---|
    |             |           |            |    Region    |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |-- Add ----->|-- Add --->|--- Add --->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Accept -|<- Accept  -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |<- Create ----|
    |             |           |            |   DAX dev    |-- Use memory
    |             |           |            |              |   |
    |             |           |            |              |   |
    |             |           |            |<- Release ---| <-+
    |             |           |            |   DAX dev    |
    |             |           |            |              |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |-- Remove -->|- Release->|- Release ->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Release-|<- Release -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |              |
    |-- Add ----->|-- Add --->|--- Add --->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Accept -|<- Accept  -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |<- Create ----|
    |             |           |            |   DAX dev    |-- Use memory
    |             |           |            |              |   |
    |             |           |            |<- Release ---| <-+
    |             |           |            |   DAX dev    |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |-- Remove -->|- Release->|- Release ->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Release-|<- Release -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |              |
    |-- Add ----->|-- Add --->|--- Add --->|              |
    |  Capacity   |  Extent   |   Extent   |              |
    |             |           |            |<- Create ----|
    |             |           |            |   DAX dev    |-- Use memory
    |             |           |            |              |   |
    |-- Remove -->|- Release->|- Release ->|              |   |
    |  Capacity   |  Extent   |   Extent   |              |   |
    |             |           |            |              |   |
    |             |           |     (Release Ignored)     |   |
    |             |           |            |              |   |
    |             |           |            |<- Release ---| <-+
    |             |           |            |   DAX dev    |
    |<------------- Signal done --------------------------|
    |             |           |            |              |
    |             |- Release->|- Release ->|              |
    |             |  Extent   |   Extent   |              |
    |             |           |            |              |
    |             |<- Release-|<- Release -|              |
    |             |   Extent  |   Extent   |              |
    |             |           |            |<- Destroy ---|
    |             |           |            |   Region     |
    |             |           |            |              |

Implementation
==============

The series still requires the creation of regions and DAX devices to be
closely synchronized with the Orchestrator and Fabric Manager.  The host
kernel will reject extents if a region is not yet created.  It also
ignores extent release if memory is in use (DAX device created).  These
synchronizations are not anticipated to be an issue with real
applications.
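
The two synchronization rules above can be sketched as a pair of
decision functions.  This is only an illustrative model: the structures,
enum, and function names are hypothetical and do not correspond to code
in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of host state; not the kernel's data structures. */
struct region_model {
	bool created;		/* region has been created on the host */
};

struct extent_model {
	unsigned int dax_refs;	/* DAX devices currently using the extent */
};

enum extent_verdict {
	EXTENT_ACCEPT,
	EXTENT_REJECT,
	EXTENT_RELEASE,
	EXTENT_IGNORE,
};

/* Add capacity: reject the extent unless a region already exists. */
static enum extent_verdict on_add_capacity(const struct region_model *region)
{
	if (!region || !region->created)
		return EXTENT_REJECT;
	return EXTENT_ACCEPT;
}

/* Release capacity: ignore the request while memory is in use. */
static enum extent_verdict on_release_capacity(const struct extent_model *ext)
{
	assert(ext);
	return ext->dax_refs ? EXTENT_IGNORE : EXTENT_RELEASE;
}
```

An ignored release leaves the extent in place, so in this model the
release has to be requested again once the DAX device is gone.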

To allow capacity to be added and removed, a new concept of a sparse DAX
region is introduced.  A sparse DAX region may have 0 or more bytes of
available space.  The total space depends on the number and size of the
extents which have been added.
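
As a rough illustration of that sizing rule, the sketch below sums the
lengths of the extents added so far.  The struct and function names are
hypothetical and the flat array is an assumption made for brevity; the
series itself tracks extents differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record for this sketch only. */
struct sparse_extent {
	uint64_t start;	/* offset of the extent within the region */
	uint64_t len;	/* length of the extent in bytes */
};

/* Total space of a sparse region: the sum of its extent lengths. */
static uint64_t sparse_region_total(const struct sparse_extent *exts, size_t n)
{
	uint64_t total = 0;

	/* A NULL array is only valid when there are no extents. */
	assert(exts || n == 0);

	for (size_t i = 0; i < n; i++)
		total += exts[i].len;

	return total;
}
```

A region with no extents reports zero bytes, which matches the "0 or
more bytes" wording above.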

Initially it is anticipated that users of the memory will carefully
coordinate the surfacing of additional capacity with the creation of DAX
devices which use that capacity.  Therefore, the allocation of the
memory to DAX devices does not allow for specific associations between
DAX device and extent.  This keeps allocations very similar to existing
DAX region behavior.

To keep the DAX memory allocation aligned with the existing DAX devices,
which do not have tags, extents are not allowed to have tags.  Future
support for tags is planned.

Great care was taken to keep the extent tracking simple.  Some xarrays
needed to be added but extra software objects were kept to a minimum.

Region extents continue to be tracked as sub-devices of the DAX region.
This ensures that region destruction cleans up all extent allocations
properly.

Some review tags were kept if a patch did not change.

The major functionality of this series includes:

- Getting the dynamic capacity (DC) configuration information from CXL
  devices

- Configuring the DC partitions reported by hardware

- Enhancing the CXL and DAX regions for dynamic capacity support
	a. Maintain a logical separation between hardware extents and
	   software managed region extents.  This provides an
	   abstraction between the layers and should allow for
	   interleaving in the future

- Get hardware extent lists for endpoint decoders upon
  region creation.

- Adjust extent/region memory available on the following events.
	a. Add capacity events
	b. Release capacity events

- Host response for add capacity
	a. Do not accept the extent if the region does not exist or an
	   error occurs realizing the extent
	b. If the region does exist, realize a DAX region extent with a
	   1:1 mapping (no interleave yet)
	c. Support the event more bit by processing a list of extents
	   marked with the more bit together before setting up a
	   response.

- Host response for remove capacity
	a. If no DAX device references the extent, release the extent
	b. If a reference does exist, ignore the request.
	   (Require FM to issue release again.)

- Modify DAX device creation/resize to account for extents within a
  sparse DAX region

- Trace Dynamic Capacity events for debugging

- Add cxl-test infrastructure to allow for faster unit testing
  (See new ndctl branch for cxl-dcd.sh test[1])

- Only support 0 value extent tags
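
The more bit handling listed above can be pictured as grouping events
until one arrives without the bit set.  The sketch below is a simplified
model under that assumption; the names, the fixed-size array, and the
MAX_PENDING bound are all hypothetical and not taken from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16	/* arbitrary bound chosen for this sketch */

/* Hypothetical model of one add capacity event. */
struct dc_event_model {
	uint64_t start;
	uint64_t len;
	bool more;	/* further extents of the same group follow */
};

struct pending_group {
	struct dc_event_model ext[MAX_PENDING];
	size_t count;
};

/*
 * Queue one event.  Returns 0 while the group is still open.  When an
 * event without the more bit closes the group, returns the number of
 * extents to answer in a single response and resets the count; the
 * extents remain readable in grp->ext[0..ready-1] until the next call.
 */
static size_t queue_event(struct pending_group *grp,
			  const struct dc_event_model *ev)
{
	size_t ready;

	assert(grp && ev && grp->count < MAX_PENDING);

	grp->ext[grp->count++] = *ev;
	if (ev->more)
		return 0;

	ready = grp->count;
	grp->count = 0;
	return ready;
}
```

Deferring the response until the group closes lets the host answer for
the whole set of extents at once rather than one event at a time.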

Fan Ni's upstream QEMU DCD support was used for testing.

Remaining work:

	1) Allow mapping to specific extents (perhaps based on
	   label/tag)
	   1a) devise region size reporting based on tags
	2) Interleave support

Possible additional work depending on requirements:

	1) Accept a new extent which extends (but overlaps) one or more
	   existing extents
	2) Release extents when DAX devices are released if a release
	   was previously seen from the device
	3) Rework DAX device interfaces, memfd has been explored a bit

[1] https://github.com/weiny2/ndctl/tree/dcd-region2-2024-10-01

---
Major Changes in v5:
- Clean up more bit processing with bug fixes
- Add cache flush on extent removal path
- Split out %pra print specifier
	Link: https://lore.kernel.org/all/20241025-cxl-pra-v2-0-123a825daba2@intel.com/
- Split out ACPI flags additions
- Address comments on code format/spelling etc.
- Link to v4: https://patch.msgid.link/20241007-dcd-type2-upstream-v4-0-c261ee6eeded@intel.com

---
Ira Weiny (13):
      range: Add range_overlaps()
      ACPI/CDAT: Add CDAT/DSMAS shared and read only flag values
      dax: Document struct dev_dax_range
      cxl/pci: Delay event buffer allocation
      cxl/hdm: Use guard() in cxl_dpa_set_mode()
      cxl/region: Refactor common create region code
      cxl/cdat: Gather DSMAS data for DCD regions
      cxl/events: Split event msgnum configuration from irq setup
      cxl/pci: Factor out interrupt policy check
      cxl/core: Return endpoint decoder information from region search
      dax/bus: Factor out dev dax resize logic
      tools/testing/cxl: Make event logs dynamic
      tools/testing/cxl: Add DC Regions to mock mem data

Navneet Singh (14):
      cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
      cxl/mem: Read dynamic capacity configuration from the device
      cxl/core: Separate region mode from decoder mode
      cxl/region: Add dynamic capacity decoder and region modes
      cxl/hdm: Add dynamic capacity size support to endpoint decoders
      cxl/mem: Expose DCD partition capabilities in sysfs
      cxl/port: Add endpoint decoder DC mode support to sysfs
      cxl/region: Add sparse DAX region support
      cxl/mem: Configure dynamic capacity interrupts
      cxl/extent: Process DCD events and realize region extents
      cxl/region/extent: Expose region extent information in sysfs
      dax/region: Create resources on sparse DAX regions
      cxl/region: Read existing extents on region creation
      cxl/mem: Trace Dynamic capacity Event Record

 Documentation/ABI/testing/sysfs-bus-cxl | 125 ++++-
 drivers/cxl/core/Makefile               |   2 +-
 drivers/cxl/core/cdat.c                 |  45 +-
 drivers/cxl/core/core.h                 |  34 +-
 drivers/cxl/core/extent.c               | 500 +++++++++++++++++
 drivers/cxl/core/hdm.c                  | 231 ++++++--
 drivers/cxl/core/mbox.c                 | 610 +++++++++++++++++++-
 drivers/cxl/core/memdev.c               | 128 ++++-
 drivers/cxl/core/port.c                 |  19 +-
 drivers/cxl/core/region.c               | 185 ++++--
 drivers/cxl/core/trace.h                |  65 +++
 drivers/cxl/cxl.h                       | 120 +++-
 drivers/cxl/cxlmem.h                    | 131 ++++-
 drivers/cxl/pci.c                       | 122 ++--
 drivers/dax/bus.c                       | 356 ++++++++++--
 drivers/dax/bus.h                       |   4 +-
 drivers/dax/cxl.c                       |  71 ++-
 drivers/dax/dax-private.h               |  66 ++-
 drivers/dax/hmem/hmem.c                 |   2 +-
 drivers/dax/pmem.c                      |   2 +-
 fs/btrfs/ordered-data.c                 |  10 +-
 include/acpi/actbl1.h                   |   2 +
 include/cxl/event.h                     |  32 ++
 include/linux/ioport.h                  |   3 +
 include/linux/range.h                   |   8 +
 tools/testing/cxl/Kbuild                |   3 +-
 tools/testing/cxl/test/mem.c            | 958 ++++++++++++++++++++++++++++----
 27 files changed, 3502 insertions(+), 332 deletions(-)
---
base-commit: c2ee9f594da826bea183ed14f2cc029c719bf4da
change-id: 20230604-dcd-type2-upstream-0cd15f6216fd

Best regards,

Comments

Jonathan Cameron Oct. 30, 2024, 2:48 p.m. UTC | #1
On Tue, 29 Oct 2024 15:34:35 -0500
Ira Weiny <ira.weiny@intel.com> wrote:

> A git tree of this series can be found here:
> 
> 	https://github.com/weiny2/linux-kernel/tree/dcd-v4-2024-10-29
> 
> Series info
> ===========
> 
> This series has 4 parts:
> 
> Patch 1: Add core range_overlaps() function
> Patch 2-6: CXL clean up/prelim patches
> Patch 7-25: Core DCD support
> Patch 26-27: cxl_test support

Other than a few trivial comments and that one build bot reported
issue all looks good to me. Nice work Ira, Navneet etc.

Maybe optimistic to hit 6.13, but I'd love it if it did.
If not, Dave, how about shaving a few off the front so at least
there is less to remember for v6 onwards :)

Jonathan
Dave Jiang Oct. 31, 2024, 3:55 p.m. UTC | #2
On 10/30/24 7:48 AM, Jonathan Cameron wrote:
> On Tue, 29 Oct 2024 15:34:35 -0500
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
>> A git tree of this series can be found here:
>>
>> 	https://github.com/weiny2/linux-kernel/tree/dcd-v4-2024-10-29
>>
>> Series info
>> ===========
>>
>> This series has 4 parts:
>>
>> Patch 1: Add core range_overlaps() function
>> Patch 2-6: CXL clean up/prelim patches
>> Patch 7-25: Core DCD support
>> Patch 26-27: cxl_test support
> 
> Other than a few trivial comments and that one build bot reported
> issue all looks good to me. Nice work Ira, Navneet etc.
> 
> Maybe optimistic to hit 6.13, but I'd love it if it did.
> If not, Dave, how about shaving a few off the front so at least
> there is less to remember for v6 onwards :)

I'd like to take it for 6.13. Just seeing if Dan has any last minute complaints :) We should be able to take 1-6 at least.

> 
> Jonathan