diff mbox series

[07/26] cxl/port: Add dynamic capacity size support to endpoint decoders

Message ID 20240324-dcd-type2-upstream-v1-7-b7b00d623625@intel.com (mailing list archive)
State New, archived
Headers show
Series DCD: Add support for Dynamic Capacity Devices (DCD) | expand

Commit Message

Ira Weiny March 24, 2024, 11:18 p.m. UTC
From: Navneet Singh <navneet.singh@intel.com>

To support Dynamic Capacity Devices (DCD) endpoint decoders will need to
map DC partitions (regions).  In addition to assigning the size of the
DC partition, the decoder must assign any skip value from the previous
decoder.  This must be done within a contiguous DPA space.

Two complications arise with Dynamic Capacity regions which did not
exist with Ram and PMEM partitions.  First, gaps in the DPA space can
exist between and around the DC Regions.  Second, the Linux resource
tree does not allow a resource to be marked across existing nodes within
a tree.

For clarity, below is an example of an 60GB device with 10GB of RAM,
10GB of PMEM and 10GB for each of 2 DC Regions.  The desired CXL mapping
is 5GB of RAM, 5GB of PMEM, and all 10GB of DC1.

     DPA RANGE
     (dpa_res)
0GB        10GB       20GB       30GB       40GB       50GB       60GB
|----------|----------|----------|----------|----------|----------|

RAM         PMEM                  DC0                   DC1
 (ram_res)  (pmem_res)            (dc_res[0])           (dc_res[1])
|----------|----------|   <gap>  |----------|   <gap>  |----------|

 RAM        PMEM                                        DC1
|XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB       60GB

The previous skip resource between RAM and PMEM was always a child of
the RAM resource and fit nicely [see (S) below].  Because of this
simplicity this skip resource reference was not stored in any CXL state.
On release the skip range could be calculated based on the endpoint
decoders stored values.

Now when DC1 is being mapped 4 skip resources must be created as
children.  One for the PMEM resource (A), two of the parent DPA resource
(B,D), and one more child of the DC0 resource (C).

0GB        10GB       20GB       30GB       40GB       50GB       60GB
|----------|----------|----------|----------|----------|----------|
                           |                     |
|----------|----------|    |     |----------|    |     |----------|
        |          |       |          |          |
       (S)        (A)     (B)        (C)        (D)
	v          v       v          v          v
|XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
       skip       skip  skip        skip      skip

Expand the calculation of DPA freespace and enhance the logic to support
mapping/unmapping DC DPA space.  To track the potential of multiple skip
resources an xarray is attached to the endpoint decoder.  The existing
algorithm between RAM and PMEM is consolidated within the new one to
streamline the code even though the result is the storage of a single
skip resource in the xarray.

Signed-off-by: Navneet Singh <navneet.singh@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for v1:
[iweiny: Update cover letter]
---
 drivers/cxl/core/hdm.c  | 192 +++++++++++++++++++++++++++++++++++++++++++-----
 drivers/cxl/core/port.c |   2 +
 drivers/cxl/cxl.h       |   2 +
 3 files changed, 179 insertions(+), 17 deletions(-)

Comments

Jonathan Cameron April 5, 2024, 1:54 p.m. UTC | #1
On Sun, 24 Mar 2024 16:18:10 -0700
ira.weiny@intel.com wrote:

> From: Navneet Singh <navneet.singh@intel.com>
> 
> To support Dynamic Capacity Devices (DCD) endpoint decoders will need to
> map DC partitions (regions).  In addition to assigning the size of the
> DC partition, the decoder must assign any skip value from the previous
> decoder.  This must be done within a contiguous DPA space.
> 
> Two complications arise with Dynamic Capacity regions which did not
> exist with Ram and PMEM partitions.  First, gaps in the DPA space can
> exist between and around the DC Regions.  Second, the Linux resource
> tree does not allow a resource to be marked across existing nodes within
> a tree.
> 
> For clarity, below is an example of an 60GB device with 10GB of RAM,
> 10GB of PMEM and 10GB for each of 2 DC Regions.  The desired CXL mapping
> is 5GB of RAM, 5GB of PMEM, and all 10GB of DC1.
> 
>      DPA RANGE
>      (dpa_res)
> 0GB        10GB       20GB       30GB       40GB       50GB       60GB
> |----------|----------|----------|----------|----------|----------|
> 
> RAM         PMEM                  DC0                   DC1
>  (ram_res)  (pmem_res)            (dc_res[0])           (dc_res[1])
> |----------|----------|   <gap>  |----------|   <gap>  |----------|
> 
>  RAM        PMEM                                        DC1
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
> 0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB       60GB


To add another corner to the example, maybe map only part of DC1?

> 
> The previous skip resource between RAM and PMEM was always a child of
> the RAM resource and fit nicely [see (S) below].  Because of this
> simplicity this skip resource reference was not stored in any CXL state.
> On release the skip range could be calculated based on the endpoint
> decoders stored values.
> 
> Now when DC1 is being mapped 4 skip resources must be created as
> children.  One for the PMEM resource (A), two of the parent DPA resource
> (B,D), and one more child of the DC0 resource (C).
> 
> 0GB        10GB       20GB       30GB       40GB       50GB       60GB
> |----------|----------|----------|----------|----------|----------|
>                            |                     |
> |----------|----------|    |     |----------|    |     |----------|
>         |          |       |          |          |
>        (S)        (A)     (B)        (C)        (D)
> 	v          v       v          v          v
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
>        skip       skip  skip        skip      skip
> 
> Expand the calculation of DPA freespace and enhance the logic to support
> mapping/unmapping DC DPA space.  To track the potential of multiple skip
> resources an xarray is attached to the endpoint decoder.  The existing
> algorithm between RAM and PMEM is consolidated within the new one to
> streamline the code even though the result is the storage of a single
> skip resource in the xarray.
> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes for v1:
> [iweiny: Update cover letter]
> ---
>  drivers/cxl/core/hdm.c  | 192 +++++++++++++++++++++++++++++++++++++++++++-----
>  drivers/cxl/core/port.c |   2 +
>  drivers/cxl/cxl.h       |   2 +
>  3 files changed, 179 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index e22b6f4f7145..da7d58184490 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -210,6 +210,25 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, CXL);
>  
> +static void cxl_skip_release(struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
> +	struct cxl_port *port = cxled_to_port(cxled);
> +	struct device *dev = &port->dev;
> +	unsigned long index;
> +	void *entry;
> +
> +	xa_for_each(&cxled->skip_res, index, entry) {
> +		struct resource *res = entry;
> +
> +		dev_dbg(dev, "decoder%d.%d: releasing skipped space; %pr\n",
> +			port->id, cxled->cxld.id, res);
> +		__release_region(&cxlds->dpa_res, res->start,
> +				 resource_size(res));
> +		xa_erase(&cxled->skip_res, index);
> +	}
> +}
> +
>  /*
>   * Must be called in a context that synchronizes against this decoder's
>   * port ->remove() callback (like an endpoint decoder sysfs attribute)
> @@ -220,15 +239,11 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	struct cxl_port *port = cxled_to_port(cxled);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct resource *res = cxled->dpa_res;
> -	resource_size_t skip_start;
>  
>  	lockdep_assert_held_write(&cxl_dpa_rwsem);
>  
> -	/* save @skip_start, before @res is released */
> -	skip_start = res->start - cxled->skip;
>  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> -	if (cxled->skip)
> -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> +	cxl_skip_release(cxled);
>  	cxled->skip = 0;
>  	cxled->dpa_res = NULL;
>  	put_device(&cxled->cxld.dev);
> @@ -263,6 +278,100 @@ static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
>  	return mode - CXL_DECODER_DC0;
>  }
>  
> +static int cxl_request_skip(struct cxl_endpoint_decoder *cxled,
> +			    resource_size_t skip_base, resource_size_t skip_len)
> +{
> +	struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
> +	const char *name = dev_name(&cxled->cxld.dev);
> +	struct cxl_port *port = cxled_to_port(cxled);
> +	struct resource *dpa_res = &cxlds->dpa_res;
> +	struct device *dev = &port->dev;
> +	struct resource *res;
> +	int rc;
> +
> +	res = __request_region(dpa_res, skip_base, skip_len, name, 0);
> +	if (!res)
> +		return -EBUSY;
> +
> +	rc = xa_insert(&cxled->skip_res, skip_base, res, GFP_KERNEL);
> +	if (rc) {
> +		__release_region(dpa_res, skip_base, skip_len);
> +		return rc;
> +	}
> +
> +	dev_dbg(dev, "decoder%d.%d: skipped space; %pr\n",
> +		port->id, cxled->cxld.id, res);
> +	return 0;
> +}
> +
> +static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
> +				resource_size_t base, resource_size_t skipped)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_port *port = cxled_to_port(cxled);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	resource_size_t skip_base = base - skipped;
> +	struct device *dev = &port->dev;
> +	resource_size_t skip_len = 0;
> +	int rc, index;
> +
> +	if (resource_size(&cxlds->ram_res) && skip_base <= cxlds->ram_res.end) {
> +		skip_len = cxlds->ram_res.end - skip_base + 1;
> +		rc = cxl_request_skip(cxled, skip_base, skip_len);
> +		if (rc)
> +			return rc;
> +		skip_base += skip_len;
> +	}
> +
> +	if (skip_base == base) {
> +		dev_dbg(dev, "skip done ram!\n");
> +		return 0;
> +	}
> +
> +	if (resource_size(&cxlds->pmem_res) &&
> +	    skip_base <= cxlds->pmem_res.end) {
> +		skip_len = cxlds->pmem_res.end - skip_base + 1;
> +		rc = cxl_request_skip(cxled, skip_base, skip_len);
> +		if (rc)
> +			return rc;
> +		skip_base += skip_len;
> +	}
> +
> +	index = dc_mode_to_region_index(cxled->mode);
> +	for (int i = 0; i <= index; i++) {
> +		struct resource *dcr = &cxlds->dc_res[i];
> +
> +		if (skip_base < dcr->start) {
> +			skip_len = dcr->start - skip_base;
> +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> +			if (rc)
> +				return rc;
> +			skip_base += skip_len;
> +		}
> +
> +		if (skip_base == base) {
> +			dev_dbg(dev, "skip done DC region %d!\n", i);
> +			break;
> +		}
> +
> +		if (resource_size(dcr) && skip_base <= dcr->end) {
> +			if (skip_base > base) {
> +				dev_err(dev, "Skip error DC region %d; skip_base %pa; base %pa\n",
> +					i, &skip_base, &base);
> +				return -ENXIO;
> +			}
> +
> +			skip_len = dcr->end - skip_base + 1;
> +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> +			if (rc)
> +				return rc;
> +			skip_base += skip_len;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  			     resource_size_t base, resource_size_t len,
>  			     resource_size_t skipped)
> @@ -300,13 +409,12 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	}
>  
>  	if (skipped) {
> -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> -				       dev_name(&cxled->cxld.dev), 0);
> -		if (!res) {
> -			dev_dbg(dev,
> -				"decoder%d.%d: failed to reserve skipped space\n",
> -				port->id, cxled->cxld.id);
> -			return -EBUSY;
> +		int rc = cxl_reserve_dpa_skip(cxled, base, skipped);
> +
> +		if (rc) {
> +			dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space; %pa - %pa\n",
> +				port->id, cxled->cxld.id, &base, &skipped);
> +			return rc;
>  		}
>  	}
>  	res = __request_region(&cxlds->dpa_res, base, len,
> @@ -314,14 +422,20 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	if (!res) {
>  		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
>  			port->id, cxled->cxld.id);
> -		if (skipped)
> -			__release_region(&cxlds->dpa_res, base - skipped,
> -					 skipped);
> +		cxl_skip_release(cxled);
>  		return -EBUSY;
>  	}
>  	cxled->dpa_res = res;
>  	cxled->skip = skipped;
>  
> +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> +		int index = dc_mode_to_region_index(mode);
> +
> +		if (resource_contains(&cxlds->dc_res[index], res)) {
> +			cxled->mode = mode;
> +			goto success;
> +		}
> +	}
>  	if (resource_contains(&cxlds->pmem_res, res))
>  		cxled->mode = CXL_DECODER_PMEM;
>  	else if (resource_contains(&cxlds->ram_res, res))
> @@ -332,6 +446,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  		cxled->mode = CXL_DECODER_MIXED;
>  	}
>  
> +success:
> +	dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id, cxled->cxld.id,
> +		cxled->dpa_res, cxled->mode);
>  	port->hdm_end++;
>  	get_device(&cxled->cxld.dev);
>  	return 0;
> @@ -463,14 +580,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  
>  int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  {
> +	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
>  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> -	resource_size_t free_ram_start, free_pmem_start;
>  	struct cxl_port *port = cxled_to_port(cxled);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct device *dev = &cxled->cxld.dev;
>  	resource_size_t start, avail, skip;
>  	struct resource *p, *last;
> -	int rc;
> +	int rc, dc_index;
>  
>  	down_write(&cxl_dpa_rwsem);

Obviously not related to this patch as such, but maybe a good place
for scoped_guard() to avoid the dance around unlocking the rwsem
and allow some early returns on the error paths.


>  	if (cxled->cxld.region) {
> @@ -500,6 +617,21 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  	else
>  		free_pmem_start = cxlds->pmem_res.start;
>  
> +	/*
> +	 * Limit each decoder to a single DC region to map memory with
> +	 * different DSMAS entry.
> +	 */
> +	dc_index = dc_mode_to_region_index(cxled->mode);
> +	if (dc_index >= 0) {
> +		if (cxlds->dc_res[dc_index].child) {
> +			dev_err(dev, "Cannot allocate DPA from DC Region: %d\n",
> +				dc_index);
> +			rc = -EINVAL;
> +			goto out;
> +		}
> +		free_dc_start = cxlds->dc_res[dc_index].start;
> +	}
> +
>  	if (cxled->mode == CXL_DECODER_RAM) {
>  		start = free_ram_start;
>  		avail = cxlds->ram_res.end - start + 1;
> @@ -521,12 +653,38 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  		else
>  			skip_end = start - 1;
>  		skip = skip_end - skip_start + 1;
> +	} else if (cxl_decoder_mode_is_dc(cxled->mode)) {
> +		resource_size_t skip_start, skip_end;
> +
> +		start = free_dc_start;
> +		avail = cxlds->dc_res[dc_index].end - start + 1;
> +		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> +			skip_start = free_ram_start;
> +		else
> +			skip_start = free_pmem_start;
> +		/*
> +		 * If any dc region is already mapped, then that allocation
> +		 * already handled the RAM and PMEM skip.  Check for DC region
> +		 * skip.
> +		 */
> +		for (int i = dc_index - 1; i >= 0 ; i--) {
> +			if (cxlds->dc_res[i].child) {
> +				skip_start = cxlds->dc_res[i].child->end + 1;
> +				break;
> +			}
> +		}
> +
> +		skip_end = start - 1;
> +		skip = skip_end - skip_start + 1;

I notice in the pmem equivalent there is a case for part of the region already mapped.
Can that not happen for a DC region as well?

>  	} else {
>  		dev_dbg(dev, "mode not set\n");
>  		rc = -EINVAL;
>  		goto out;
>  	}
>  
> +	dev_dbg(dev, "DPA Allocation start: %pa len: %#llx Skip: %pa\n",
> +		&start, size, &skip);
> +
>  	if (size > avail) {
>  		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
>  			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
Alison Schofield April 10, 2024, 10:50 p.m. UTC | #2
On Sun, Mar 24, 2024 at 04:18:10PM -0700, Ira Weiny wrote:
> From: Navneet Singh <navneet.singh@intel.com>
> 
> To support Dynamic Capacity Devices (DCD) endpoint decoders will need to
> map DC partitions (regions).  In addition to assigning the size of the
> DC partition, the decoder must assign any skip value from the previous
> decoder.  This must be done within a contiguous DPA space.
> 
> Two complications arise with Dynamic Capacity regions which did not
> exist with Ram and PMEM partitions.  First, gaps in the DPA space can

RAM

> exist between and around the DC Regions.  Second, the Linux resource
> tree does not allow a resource to be marked across existing nodes within
> a tree.
> 
> For clarity, below is an example of an 60GB device with 10GB of RAM,
> 10GB of PMEM and 10GB for each of 2 DC Regions.  The desired CXL mapping
> is 5GB of RAM, 5GB of PMEM, and all 10GB of DC1.
> 
>      DPA RANGE
>      (dpa_res)
> 0GB        10GB       20GB       30GB       40GB       50GB       60GB
> |----------|----------|----------|----------|----------|----------|
> 
> RAM         PMEM                  DC0                   DC1
>  (ram_res)  (pmem_res)            (dc_res[0])           (dc_res[1])
> |----------|----------|   <gap>  |----------|   <gap>  |----------|
> 
>  RAM        PMEM                                        DC1
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
> 0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB       60GB
> 
> The previous skip resource between RAM and PMEM was always a child of
> the RAM resource and fit nicely [see (S) below].  Because of this
> simplicity this skip resource reference was not stored in any CXL state.
> On release the skip range could be calculated based on the endpoint
> decoders stored values.
> 
> Now when DC1 is being mapped 4 skip resources must be created as
> children.  One for the PMEM resource (A), two of the parent DPA resource
> (B,D), and one more child of the DC0 resource (C).
> 
> 0GB        10GB       20GB       30GB       40GB       50GB       60GB
> |----------|----------|----------|----------|----------|----------|
>                            |                     |
> |----------|----------|    |     |----------|    |     |----------|
>         |          |       |          |          |
>        (S)        (A)     (B)        (C)        (D)
> 	v          v       v          v          v
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
>        skip       skip  skip        skip      skip
> 

Nice art!


> Expand the calculation of DPA freespace and enhance the logic to support
> mapping/unmapping DC DPA space.  To track the potential of multiple skip
> resources an xarray is attached to the endpoint decoder.  The existing
> algorithm between RAM and PMEM is consolidated within the new one to
> streamline the code even though the result is the storage of a single
> skip resource in the xarray.

This passed the unit test cxl-poison.sh that relies on you not totally
breaking the cxled->skip here. Not exactly a tested by, but something!


> 
> Signed-off-by: Navneet Singh <navneet.singh@intel.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes for v1:
> [iweiny: Update cover letter]
> ---
>  drivers/cxl/core/hdm.c  | 192 +++++++++++++++++++++++++++++++++++++++++++-----
>  drivers/cxl/core/port.c |   2 +
>  drivers/cxl/cxl.h       |   2 +
>  3 files changed, 179 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index e22b6f4f7145..da7d58184490 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -210,6 +210,25 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, CXL);
>  
> +static void cxl_skip_release(struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
> +	struct cxl_port *port = cxled_to_port(cxled);
> +	struct device *dev = &port->dev;

Here and below it's probably needless to define dev. 
Just &port->dev in your single dev_dbg()
This is something to check for across the patchset.


> +	unsigned long index;
> +	void *entry;
> +
> +	xa_for_each(&cxled->skip_res, index, entry) {
> +		struct resource *res = entry;
> +
> +		dev_dbg(dev, "decoder%d.%d: releasing skipped space; %pr\n",
> +			port->id, cxled->cxld.id, res);
> +		__release_region(&cxlds->dpa_res, res->start,
> +				 resource_size(res));
> +		xa_erase(&cxled->skip_res, index);
> +	}
> +}
> +
>  /*
>   * Must be called in a context that synchronizes against this decoder's
>   * port ->remove() callback (like an endpoint decoder sysfs attribute)
> @@ -220,15 +239,11 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	struct cxl_port *port = cxled_to_port(cxled);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct resource *res = cxled->dpa_res;
> -	resource_size_t skip_start;
>  
>  	lockdep_assert_held_write(&cxl_dpa_rwsem);
>  
> -	/* save @skip_start, before @res is released */
> -	skip_start = res->start - cxled->skip;
>  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> -	if (cxled->skip)
> -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> +	cxl_skip_release(cxled);
>  	cxled->skip = 0;
>  	cxled->dpa_res = NULL;
>  	put_device(&cxled->cxld.dev);
> @@ -263,6 +278,100 @@ static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
>  	return mode - CXL_DECODER_DC0;
>  }
>  
> +static int cxl_request_skip(struct cxl_endpoint_decoder *cxled,
> +			    resource_size_t skip_base, resource_size_t skip_len)
> +{
> +	struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
> +	const char *name = dev_name(&cxled->cxld.dev);
> +	struct cxl_port *port = cxled_to_port(cxled);
> +	struct resource *dpa_res = &cxlds->dpa_res;
> +	struct device *dev = &port->dev;

again

> +	struct resource *res;
> +	int rc;
> +
> +	res = __request_region(dpa_res, skip_base, skip_len, name, 0);
> +	if (!res)
> +		return -EBUSY;
> +
> +	rc = xa_insert(&cxled->skip_res, skip_base, res, GFP_KERNEL);
> +	if (rc) {
> +		__release_region(dpa_res, skip_base, skip_len);
> +		return rc;
> +	}
> +
> +	dev_dbg(dev, "decoder%d.%d: skipped space; %pr\n",
> +		port->id, cxled->cxld.id, res);
> +	return 0;
> +}
> +
> +static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
> +				resource_size_t base, resource_size_t skipped)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_port *port = cxled_to_port(cxled);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	resource_size_t skip_base = base - skipped;
> +	struct device *dev = &port->dev;
> +	resource_size_t skip_len = 0;
> +	int rc, index;
> +
> +	if (resource_size(&cxlds->ram_res) && skip_base <= cxlds->ram_res.end) {
> +		skip_len = cxlds->ram_res.end - skip_base + 1;
> +		rc = cxl_request_skip(cxled, skip_base, skip_len);
> +		if (rc)
> +			return rc;
> +		skip_base += skip_len;
> +	}
> +
> +	if (skip_base == base) {
> +		dev_dbg(dev, "skip done ram!\n");
> +		return 0;
> +	}
> +
> +	if (resource_size(&cxlds->pmem_res) &&
> +	    skip_base <= cxlds->pmem_res.end) {
> +		skip_len = cxlds->pmem_res.end - skip_base + 1;
> +		rc = cxl_request_skip(cxled, skip_base, skip_len);
> +		if (rc)
> +			return rc;
> +		skip_base += skip_len;
> +	}
> +
> +	index = dc_mode_to_region_index(cxled->mode);
> +	for (int i = 0; i <= index; i++) {
> +		struct resource *dcr = &cxlds->dc_res[i];
> +
> +		if (skip_base < dcr->start) {
> +			skip_len = dcr->start - skip_base;
> +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> +			if (rc)
> +				return rc;
> +			skip_base += skip_len;
> +		}
> +
> +		if (skip_base == base) {
> +			dev_dbg(dev, "skip done DC region %d!\n", i);
> +			break;
> +		}
> +
> +		if (resource_size(dcr) && skip_base <= dcr->end) {
> +			if (skip_base > base) {
> +				dev_err(dev, "Skip error DC region %d; skip_base %pa; base %pa\n",
> +					i, &skip_base, &base);
> +				return -ENXIO;
> +			}
> +
> +			skip_len = dcr->end - skip_base + 1;
> +			rc = cxl_request_skip(cxled, skip_base, skip_len);
> +			if (rc)
> +				return rc;
> +			skip_base += skip_len;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  			     resource_size_t base, resource_size_t len,
>  			     resource_size_t skipped)
> @@ -300,13 +409,12 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	}
>  
>  	if (skipped) {
> -		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> -				       dev_name(&cxled->cxld.dev), 0);
> -		if (!res) {
> -			dev_dbg(dev,
> -				"decoder%d.%d: failed to reserve skipped space\n",
> -				port->id, cxled->cxld.id);
> -			return -EBUSY;
> +		int rc = cxl_reserve_dpa_skip(cxled, base, skipped);
> +
> +		if (rc) {
> +			dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space; %pa - %pa\n",
> +				port->id, cxled->cxld.id, &base, &skipped);
> +			return rc;
>  		}
>  	}
>  	res = __request_region(&cxlds->dpa_res, base, len,
> @@ -314,14 +422,20 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  	if (!res) {
>  		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
>  			port->id, cxled->cxld.id);
> -		if (skipped)
> -			__release_region(&cxlds->dpa_res, base - skipped,
> -					 skipped);
> +		cxl_skip_release(cxled);
>  		return -EBUSY;
>  	}
>  	cxled->dpa_res = res;
>  	cxled->skip = skipped;
>  
> +	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> +		int index = dc_mode_to_region_index(mode);
> +
> +		if (resource_contains(&cxlds->dc_res[index], res)) {
> +			cxled->mode = mode;
> +			goto success;
> +		}
> +	}
>  	if (resource_contains(&cxlds->pmem_res, res))
>  		cxled->mode = CXL_DECODER_PMEM;
>  	else if (resource_contains(&cxlds->ram_res, res))
> @@ -332,6 +446,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>  		cxled->mode = CXL_DECODER_MIXED;
>  	}
>  
> +success:
> +	dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id, cxled->cxld.id,
> +		cxled->dpa_res, cxled->mode);
>  	port->hdm_end++;
>  	get_device(&cxled->cxld.dev);
>  	return 0;
> @@ -463,14 +580,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  
>  int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  {
> +	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
>  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> -	resource_size_t free_ram_start, free_pmem_start;
>  	struct cxl_port *port = cxled_to_port(cxled);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct device *dev = &cxled->cxld.dev;
>  	resource_size_t start, avail, skip;
>  	struct resource *p, *last;
> -	int rc;
> +	int rc, dc_index;
>  
>  	down_write(&cxl_dpa_rwsem);
>  	if (cxled->cxld.region) {
> @@ -500,6 +617,21 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  	else
>  		free_pmem_start = cxlds->pmem_res.start;
>  
> +	/*
> +	 * Limit each decoder to a single DC region to map memory with
> +	 * different DSMAS entry.
> +	 */
> +	dc_index = dc_mode_to_region_index(cxled->mode);
> +	if (dc_index >= 0) {
> +		if (cxlds->dc_res[dc_index].child) {
> +			dev_err(dev, "Cannot allocate DPA from DC Region: %d\n",
> +				dc_index);
> +			rc = -EINVAL;
> +			goto out;
> +		}
> +		free_dc_start = cxlds->dc_res[dc_index].start;
> +	}

From the "Limit each decoder" comment to here please explain.
I'm reading we cannot alloc dpa from this DC region because
is has a child?  And a child is a region?  Maybe I got it ;)


snip to end

--Alison
Ira Weiny May 3, 2024, 5:09 p.m. UTC | #3
Jonathan Cameron wrote:
> On Sun, 24 Mar 2024 16:18:10 -0700
> ira.weiny@intel.com wrote:
> 
> > From: Navneet Singh <navneet.singh@intel.com>
> > 
> > To support Dynamic Capacity Devices (DCD) endpoint decoders will need to
> > map DC partitions (regions).  In addition to assigning the size of the
> > DC partition, the decoder must assign any skip value from the previous
> > decoder.  This must be done within a contiguous DPA space.
> > 
> > Two complications arise with Dynamic Capacity regions which did not
> > exist with Ram and PMEM partitions.  First, gaps in the DPA space can
> > exist between and around the DC Regions.  Second, the Linux resource
> > tree does not allow a resource to be marked across existing nodes within
> > a tree.
> > 
> > For clarity, below is an example of an 60GB device with 10GB of RAM,
> > 10GB of PMEM and 10GB for each of 2 DC Regions.  The desired CXL mapping
> > is 5GB of RAM, 5GB of PMEM, and all 10GB of DC1.
> > 
> >      DPA RANGE
> >      (dpa_res)
> > 0GB        10GB       20GB       30GB       40GB       50GB       60GB
> > |----------|----------|----------|----------|----------|----------|
> > 
> > RAM         PMEM                  DC0                   DC1
> >  (ram_res)  (pmem_res)            (dc_res[0])           (dc_res[1])
> > |----------|----------|   <gap>  |----------|   <gap>  |----------|
> > 
> >  RAM        PMEM                                        DC1
> > |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
> > 0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB       60GB
> 
> 
> To add another corner to the example, maybe map only part of DC1?

Maybe.  See below.


[snip]

> > @@ -500,6 +617,21 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> >  	else
> >  		free_pmem_start = cxlds->pmem_res.start;
> >  
> > +	/*
> > +	 * Limit each decoder to a single DC region to map memory with
> > +	 * different DSMAS entry.

This prevents more than 1 region per DC partition (region).

> > +	 */
> > +	dc_index = dc_mode_to_region_index(cxled->mode);
> > +	if (dc_index >= 0) {
> > +		if (cxlds->dc_res[dc_index].child) {
> > +			dev_err(dev, "Cannot allocate DPA from DC Region: %d\n",
> > +				dc_index);
> > +			rc = -EINVAL;
> > +			goto out;
> > +		}
> > +		free_dc_start = cxlds->dc_res[dc_index].start;
> > +	}
> > +
> >  	if (cxled->mode == CXL_DECODER_RAM) {
> >  		start = free_ram_start;
> >  		avail = cxlds->ram_res.end - start + 1;
> > @@ -521,12 +653,38 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> >  		else
> >  			skip_end = start - 1;
> >  		skip = skip_end - skip_start + 1;
> > +	} else if (cxl_decoder_mode_is_dc(cxled->mode)) {
> > +		resource_size_t skip_start, skip_end;
> > +
> > +		start = free_dc_start;
> > +		avail = cxlds->dc_res[dc_index].end - start + 1;
> > +		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> > +			skip_start = free_ram_start;
> > +		else
> > +			skip_start = free_pmem_start;
> > +		/*
> > +		 * If any dc region is already mapped, then that allocation
> > +		 * already handled the RAM and PMEM skip.  Check for DC region
> > +		 * skip.
> > +		 */
> > +		for (int i = dc_index - 1; i >= 0 ; i--) {
> > +			if (cxlds->dc_res[i].child) {
> > +				skip_start = cxlds->dc_res[i].child->end + 1;
> > +				break;
> > +			}
> > +		}
> > +
> > +		skip_end = start - 1;
> > +		skip = skip_end - skip_start + 1;
> 
> I notice in the pmem equivalent there is a case for part of the region already mapped.
> Can that not happen for a DC region as well?

See above check.  Each DC region (partition) was to be associated with a
single DSMAS entry.  I'm unclear now why that decision was made.

It does not seem hard to add this though.  Do we really need that ability
considering dax devices are likely going to be the main boundry for users
of a DC region?

Ira
Dan Williams May 3, 2024, 5:21 p.m. UTC | #4
Ira Weiny wrote:
> > >  	else
> > >  		free_pmem_start = cxlds->pmem_res.start;
> > >  
> > > +	/*
> > > +	 * Limit each decoder to a single DC region to map memory with
> > > +	 * different DSMAS entry.
> 
> This prevents more than 1 region per DC partition (region).

Why? Multiple regions per partition is the current status quo for other
partition types.

> > I notice in the pmem equivalent there is a case for part of the region already mapped.
> > Can that not happen for a DC region as well?
> 
> See above check.  Each DC region (partition) was to be associated with a
> single DSMAS entry.  I'm unclear now why that decision was made.

The limitation of one DSMAS per partition makes sense otherwise that
would indicate performance across different spans of the partition.

> It does not seem hard to add this though.  Do we really need that ability
> considering dax devices are likely going to be the main boundry for users
> of a DC region?

It seems like extra work to make DCD a special case compared to the
other partition types, so the burden of proof is the other way. Why
tolerate DCD divergence from the status quo?
Ira Weiny May 6, 2024, 4:07 a.m. UTC | #5
Dan Williams wrote:
> Ira Weiny wrote:
> > > >  	else
> > > >  		free_pmem_start = cxlds->pmem_res.start;
> > > >  
> > > > +	/*
> > > > +	 * Limit each decoder to a single DC region to map memory with
> > > > +	 * different DSMAS entry.
> > 
> > This prevents more than 1 region per DC partition (region).
> 
> Why? Multiple regions per partition is the current status quo for other
> partition types.
> 
> > > I notice in the pmem equivalent there is a case for part of the region already mapped.
> > > Can that not happen for a DC region as well?
> > 
> > See above check.  Each DC region (partition) was to be associated with a
> > single DSMAS entry.  I'm unclear now why that decision was made.
> 
> The limitation of one DSMAS per partition makes sense otherwise that
> would indicate performance across different spans of the partition.
> 
> > It does not seem hard to add this though.  Do we really need that ability
> > considering dax devices are likely going to be the main boundry for users
> > of a DC region?
> 
> It seems like extra work to make DCD a special case compared to the
> other partition types, so the burden of proof is the other way. Why
> tolerate DCD divergence from the status quo?

I don't remember the details.  But this was discussed before in a call.  I'm ok
adding this support.  I have a test for it now.  Just need to tweek a couple of
things.

Ira
diff mbox series

Patch

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index e22b6f4f7145..da7d58184490 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -210,6 +210,25 @@  void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, CXL);
 
+static void cxl_skip_release(struct cxl_endpoint_decoder *cxled)
+{
+	struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
+	struct cxl_port *port = cxled_to_port(cxled);
+	struct device *dev = &port->dev;
+	unsigned long index;
+	void *entry;
+
+	xa_for_each(&cxled->skip_res, index, entry) {
+		struct resource *res = entry;
+
+		dev_dbg(dev, "decoder%d.%d: releasing skipped space; %pr\n",
+			port->id, cxled->cxld.id, res);
+		__release_region(&cxlds->dpa_res, res->start,
+				 resource_size(res));
+		xa_erase(&cxled->skip_res, index);
+	}
+}
+
 /*
  * Must be called in a context that synchronizes against this decoder's
  * port ->remove() callback (like an endpoint decoder sysfs attribute)
@@ -220,15 +239,11 @@  static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
 	struct cxl_port *port = cxled_to_port(cxled);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct resource *res = cxled->dpa_res;
-	resource_size_t skip_start;
 
 	lockdep_assert_held_write(&cxl_dpa_rwsem);
 
-	/* save @skip_start, before @res is released */
-	skip_start = res->start - cxled->skip;
 	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
-	if (cxled->skip)
-		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
+	cxl_skip_release(cxled);
 	cxled->skip = 0;
 	cxled->dpa_res = NULL;
 	put_device(&cxled->cxld.dev);
@@ -263,6 +278,100 @@  static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
 	return mode - CXL_DECODER_DC0;
 }
 
+static int cxl_request_skip(struct cxl_endpoint_decoder *cxled,
+			    resource_size_t skip_base, resource_size_t skip_len)
+{
+	struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
+	const char *name = dev_name(&cxled->cxld.dev);
+	struct cxl_port *port = cxled_to_port(cxled);
+	struct resource *dpa_res = &cxlds->dpa_res;
+	struct device *dev = &port->dev;
+	struct resource *res;
+	int rc;
+
+	res = __request_region(dpa_res, skip_base, skip_len, name, 0);
+	if (!res)
+		return -EBUSY;
+
+	rc = xa_insert(&cxled->skip_res, skip_base, res, GFP_KERNEL);
+	if (rc) {
+		__release_region(dpa_res, skip_base, skip_len);
+		return rc;
+	}
+
+	dev_dbg(dev, "decoder%d.%d: skipped space; %pr\n",
+		port->id, cxled->cxld.id, res);
+	return 0;
+}
+
+static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
+				resource_size_t base, resource_size_t skipped)
+{
+	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+	struct cxl_port *port = cxled_to_port(cxled);
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	resource_size_t skip_base = base - skipped;
+	struct device *dev = &port->dev;
+	resource_size_t skip_len = 0;
+	int rc, index;
+
+	if (resource_size(&cxlds->ram_res) && skip_base <= cxlds->ram_res.end) {
+		skip_len = cxlds->ram_res.end - skip_base + 1;
+		rc = cxl_request_skip(cxled, skip_base, skip_len);
+		if (rc)
+			return rc;
+		skip_base += skip_len;
+	}
+
+	if (skip_base == base) {
+		dev_dbg(dev, "skip done ram!\n");
+		return 0;
+	}
+
+	if (resource_size(&cxlds->pmem_res) &&
+	    skip_base <= cxlds->pmem_res.end) {
+		skip_len = cxlds->pmem_res.end - skip_base + 1;
+		rc = cxl_request_skip(cxled, skip_base, skip_len);
+		if (rc)
+			return rc;
+		skip_base += skip_len;
+	}
+
+	index = dc_mode_to_region_index(cxled->mode);
+	for (int i = 0; i <= index; i++) {
+		struct resource *dcr = &cxlds->dc_res[i];
+
+		if (skip_base < dcr->start) {
+			skip_len = dcr->start - skip_base;
+			rc = cxl_request_skip(cxled, skip_base, skip_len);
+			if (rc)
+				return rc;
+			skip_base += skip_len;
+		}
+
+		if (skip_base == base) {
+			dev_dbg(dev, "skip done DC region %d!\n", i);
+			break;
+		}
+
+		if (resource_size(dcr) && skip_base <= dcr->end) {
+			if (skip_base > base) {
+				dev_err(dev, "Skip error DC region %d; skip_base %pa; base %pa\n",
+					i, &skip_base, &base);
+				return -ENXIO;
+			}
+
+			skip_len = dcr->end - skip_base + 1;
+			rc = cxl_request_skip(cxled, skip_base, skip_len);
+			if (rc)
+				return rc;
+			skip_base += skip_len;
+		}
+	}
+
+	return 0;
+}
+
 static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 			     resource_size_t base, resource_size_t len,
 			     resource_size_t skipped)
@@ -300,13 +409,12 @@  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	}
 
 	if (skipped) {
-		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
-				       dev_name(&cxled->cxld.dev), 0);
-		if (!res) {
-			dev_dbg(dev,
-				"decoder%d.%d: failed to reserve skipped space\n",
-				port->id, cxled->cxld.id);
-			return -EBUSY;
+		int rc = cxl_reserve_dpa_skip(cxled, base, skipped);
+
+		if (rc) {
+			dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space; %pa - %pa\n",
+				port->id, cxled->cxld.id, &base, &skipped);
+			return rc;
 		}
 	}
 	res = __request_region(&cxlds->dpa_res, base, len,
@@ -314,14 +422,20 @@  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	if (!res) {
 		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
 			port->id, cxled->cxld.id);
-		if (skipped)
-			__release_region(&cxlds->dpa_res, base - skipped,
-					 skipped);
+		cxl_skip_release(cxled);
 		return -EBUSY;
 	}
 	cxled->dpa_res = res;
 	cxled->skip = skipped;
 
+	for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
+		int index = dc_mode_to_region_index(mode);
+
+		if (resource_contains(&cxlds->dc_res[index], res)) {
+			cxled->mode = mode;
+			goto success;
+		}
+	}
 	if (resource_contains(&cxlds->pmem_res, res))
 		cxled->mode = CXL_DECODER_PMEM;
 	else if (resource_contains(&cxlds->ram_res, res))
@@ -332,6 +446,9 @@  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 		cxled->mode = CXL_DECODER_MIXED;
 	}
 
+success:
+	dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id, cxled->cxld.id,
+		cxled->dpa_res, cxled->mode);
 	port->hdm_end++;
 	get_device(&cxled->cxld.dev);
 	return 0;
@@ -463,14 +580,14 @@  int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
 
 int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 {
+	resource_size_t free_ram_start, free_pmem_start, free_dc_start;
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
-	resource_size_t free_ram_start, free_pmem_start;
 	struct cxl_port *port = cxled_to_port(cxled);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct device *dev = &cxled->cxld.dev;
 	resource_size_t start, avail, skip;
 	struct resource *p, *last;
-	int rc;
+	int rc, dc_index;
 
 	down_write(&cxl_dpa_rwsem);
 	if (cxled->cxld.region) {
@@ -500,6 +617,21 @@  int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 	else
 		free_pmem_start = cxlds->pmem_res.start;
 
+	/*
+	 * Limit each decoder to a single DC region to map memory with
+	 * different DSMAS entry.
+	 */
+	dc_index = dc_mode_to_region_index(cxled->mode);
+	if (dc_index >= 0) {
+		if (cxlds->dc_res[dc_index].child) {
+			dev_err(dev, "Cannot allocate DPA from DC Region: %d\n",
+				dc_index);
+			rc = -EINVAL;
+			goto out;
+		}
+		free_dc_start = cxlds->dc_res[dc_index].start;
+	}
+
 	if (cxled->mode == CXL_DECODER_RAM) {
 		start = free_ram_start;
 		avail = cxlds->ram_res.end - start + 1;
@@ -521,12 +653,38 @@  int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 		else
 			skip_end = start - 1;
 		skip = skip_end - skip_start + 1;
+	} else if (cxl_decoder_mode_is_dc(cxled->mode)) {
+		resource_size_t skip_start, skip_end;
+
+		start = free_dc_start;
+		avail = cxlds->dc_res[dc_index].end - start + 1;
+		if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
+			skip_start = free_ram_start;
+		else
+			skip_start = free_pmem_start;
+		/*
+		 * If any dc region is already mapped, then that allocation
+		 * already handled the RAM and PMEM skip.  Check for DC region
+		 * skip.
+		 */
+		for (int i = dc_index - 1; i >= 0 ; i--) {
+			if (cxlds->dc_res[i].child) {
+				skip_start = cxlds->dc_res[i].child->end + 1;
+				break;
+			}
+		}
+
+		skip_end = start - 1;
+		skip = skip_end - skip_start + 1;
 	} else {
 		dev_dbg(dev, "mode not set\n");
 		rc = -EINVAL;
 		goto out;
 	}
 
+	dev_dbg(dev, "DPA Allocation start: %pa len: %#llx Skip: %pa\n",
+		&start, size, &skip);
+
 	if (size > avail) {
 		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
 			cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 80c0651794eb..036b61cb3007 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -434,6 +434,7 @@  static void cxl_endpoint_decoder_release(struct device *dev)
 	struct cxl_endpoint_decoder *cxled = to_cxl_endpoint_decoder(dev);
 
 	__cxl_decoder_release(&cxled->cxld);
+	xa_destroy(&cxled->skip_res);
 	kfree(cxled);
 }
 
@@ -1896,6 +1897,7 @@  struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
 		return ERR_PTR(-ENOMEM);
 
 	cxled->pos = -1;
+	xa_init(&cxled->skip_res);
 	cxld = &cxled->cxld;
 	rc = cxl_decoder_init(port, cxld);
 	if (rc)	 {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 3b8935089c0c..15d418b3bc9b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -441,6 +441,7 @@  enum cxl_decoder_state {
  * @cxld: base cxl_decoder_object
  * @dpa_res: actively claimed DPA span of this decoder
  * @skip: offset into @dpa_res where @cxld.hpa_range maps
+ * @skip_res: array of skipped resources from the previous decoder end
  * @mode: which memory type / access-mode-partition this decoder targets
  * @state: autodiscovery state
  * @pos: interleave position in @cxld.region
@@ -449,6 +450,7 @@  struct cxl_endpoint_decoder {
 	struct cxl_decoder cxld;
 	struct resource *dpa_res;
 	resource_size_t skip;
+	struct xarray skip_res;
 	enum cxl_decoder_mode mode;
 	enum cxl_decoder_state state;
 	int pos;