diff mbox series

[v2,4/5] cxl: Make cxl_dpa_alloc() DPA partition number agnostic

Message ID 173753637297.3849855.5217976225600372473.stgit@dwillia2-xfh.jf.intel.com
State New
Series cxl: DPA partition metadata is a mess...

Commit Message

Dan Williams Jan. 22, 2025, 8:59 a.m. UTC
cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
allocations being distinct from RAM allocations in specific ways when in
practice the allocation rules are only relative to DPA partition index.

The rules for cxl_dpa_alloc() are:

- allocations can only come from 1 partition

- if allocating at partition-index-N, all free space in partitions less
  than partition-index-N must be skipped over

Use the new 'struct cxl_dpa_partition' array to support allocation with
an arbitrary number of DPA partitions on the device.

A follow-on patch can go further to clean up the 'enum cxl_decoder_mode'
concept and supersede it by looking up the memory properties from
partition metadata. Until then cxl_part_mode() temporarily bridges code
that looks up partitions by @cxled->mode.
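The two allocation rules above can be illustrated with a small userspace sketch (hypothetical types and helper names, not the kernel implementation): one logical skip span is clipped against each partition in index order, producing one region operation per partition it touches, the same way release_skip()/request_skip() divide the skip against @cxlds->dpa_res.

```c
#include <assert.h>

/*
 * Userspace model of per-partition skip clipping. 'struct part_range'
 * and split_skip() are illustrative only; the inclusive-end convention
 * mirrors 'struct resource'.
 */
struct part_range {
	unsigned long long start;
	unsigned long long end;		/* inclusive, like struct resource */
};

/*
 * Split the span [skip_base, skip_base + skip_len) across @parts,
 * recording each per-partition piece length in @pieces. Returns the
 * number of pieces produced.
 */
static int split_skip(const struct part_range *parts, int nr_parts,
		      unsigned long long skip_base,
		      unsigned long long skip_len,
		      unsigned long long *pieces)
{
	unsigned long long skip_start = skip_base, skip_rem = skip_len;
	int n = 0;

	for (int i = 0; i < nr_parts && skip_rem; i++) {
		unsigned long long skip_end, skip_size;

		if (skip_start < parts[i].start || skip_start > parts[i].end)
			continue;

		skip_end = parts[i].end;
		if (skip_start + skip_rem - 1 < skip_end)
			skip_end = skip_start + skip_rem - 1;
		skip_size = skip_end - skip_start + 1;
		pieces[n++] = skip_size;	/* one region op per partition */
		skip_start += skip_size;
		skip_rem -= skip_size;
	}
	return n;
}
```

For example, a skip starting at DPA 5 with length 20 over three contiguous 10-unit partitions splits into pieces of 5, 10, and 5, i.e. three separate region operations against the same resource tree.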

Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alejandro Lucero <alucerop@amd.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/hdm.c |  215 +++++++++++++++++++++++++++++++++++-------------
 drivers/cxl/cxlmem.h   |   14 +++
 2 files changed, 172 insertions(+), 57 deletions(-)

Comments

Ira Weiny Jan. 22, 2025, 4:29 p.m. UTC | #1
Dan Williams wrote:
> cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> allocations being distinct from RAM allocations in specific ways when in
> practice the allocation rules are only relative to DPA partition index.
> 
> The rules for cxl_dpa_alloc() are:
> 
> - allocations can only come from 1 partition
> 
> - if allocating at partition-index-N, all free space in partitions less
>   than partition-index-N must be skipped over

I think this is a bit deeper.  The partition index must also correspond to
the DPA order.  The DCD code verifies that the partition indexes are in DPA
order when reading them from the device, so that code will add them to
cxl_dpa_info in order.  But general device driver writers may miss this
point.
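The paranoia being suggested could be as small as the following userspace sketch (struct and function names are hypothetical): verify that partition index order matches DPA order before accepting the array.

```c
#include <assert.h>

/*
 * Illustrative check that a partition array is sorted, non-overlapping,
 * by DPA -- the invariant the allocation rules above depend on. Names
 * are hypothetical, not the kernel's.
 */
struct dpa_part {
	unsigned long long start;
	unsigned long long end;		/* inclusive */
};

/* Return 1 if index order matches increasing, non-overlapping DPA order */
static int dpa_parts_ordered(const struct dpa_part *part, int nr)
{
	for (int i = 1; i < nr; i++)
		if (part[i].start <= part[i - 1].end)
			return 0;	/* out of order or overlapping */
	return 1;
}
```

A setup path could reject a device (or a buggy driver) that presents partitions out of DPA order, rather than silently computing nonsense skips.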

[snip]

> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 3f8a54ca4624..591aeb26c9e1 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -223,6 +223,31 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
>  
> +/* See request_skip() kernel-doc */
> +static void release_skip(struct cxl_dev_state *cxlds,
> +			 const resource_size_t skip_base,
> +			 const resource_size_t skip_len)
> +{
> +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> +
> +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> +		const struct resource *part_res = &cxlds->part[i].res;
> +		resource_size_t skip_end, skip_size;
> +
> +		if (skip_start < part_res->start || skip_start > part_res->end)
> +			continue;
> +
> +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> +		skip_size = skip_end - skip_start + 1;
> +		__release_region(&cxlds->dpa_res, skip_start, skip_size);
> +		skip_start += skip_size;
> +		skip_rem -= skip_size;
> +
> +		if (!skip_rem)
> +			break;
> +	}
> +}
> +
>  /*
>   * Must be called in a context that synchronizes against this decoder's
>   * port ->remove() callback (like an endpoint decoder sysfs attribute)
> @@ -241,7 +266,7 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	skip_start = res->start - cxled->skip;
>  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
>  	if (cxled->skip)
> -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> +		release_skip(cxlds, skip_start, cxled->skip);
>  	cxled->skip = 0;
>  	cxled->dpa_res = NULL;
>  	put_device(&cxled->cxld.dev);
> @@ -268,6 +293,79 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
>  	__cxl_dpa_release(cxled);
>  }
>  
> +/**
> + * request_skip() - Track DPA 'skip' in @cxlds->dpa_res resource tree
> + * @cxlds: CXL.mem device context that parents @cxled
> + * @cxled: Endpoint decoder establishing new allocation that skips lower DPA
> + * @skip_base: DPA < start of new DPA allocation (DPAnew)
> + * @skip_len: @skip_base + @skip_len == DPAnew
> + *
> + * DPA 'skip' arises from out-of-sequence DPA allocation events relative
> + * to free capacity across multiple partitions. It is a wasteful event
> + * as usable DPA gets thrown away, but if a deployment has, for example,
> + * a dual RAM+PMEM device, wants to use PMEM, and has unallocated RAM
> + * DPA, the free RAM DPA must be sacrificed to start allocating PMEM.
> + * See third "Implementation Note" in CXL 3.1 8.2.4.19.13 "Decoder
> + * Protection" for more details.

I think this is a great comment here.

> + *
> + * A 'skip' always covers the last allocated DPA in a previous partition
> + * to the start of the current partition to allocate.  Allocations never
> + * start in the middle of a partition, and allocations are always
> + * de-allocated in reverse order (see cxl_dpa_free(), or natural devm
> + * unwind order from forced in-order allocation).
> + *
> + * If @cxlds->nr_partitions was guaranteed to be <= 2 then the 'skip'
> + * would always be contained to a single partition. Given
> + * @cxlds->nr_partitions may be > 2 it results in cases where the 'skip'
> + * might span "tail capacity of partition[0], all of partition[1], ...,
> + * all of partition[N-1]" to support allocating from partition[N]. That
> + * in turn interacts with the partition 'struct resource' boundaries
> + * within @cxlds->dpa_res whereby 'skip' requests need to be divided by
> + * partition. I.e. this is a quirk of using a 'struct resource' tree to
> + * detect range conflicts while also tracking partition boundaries in
> + * @cxlds->dpa_res.

Another great comment, but it does not actually cover the DCD case,
because with DCD the partitions might also have skips (gaps) between them.

That said, the update should come with DCD, or with type 2 devices if they
end up with the same loosening of device partitions.

This is a good clean up though,

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]
Dan Williams Jan. 22, 2025, 10:35 p.m. UTC | #2
Ira Weiny wrote:
> Dan Williams wrote:
> > cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> > allocations being distinct from RAM allocations in specific ways when in
> > practice the allocation rules are only relative to DPA partition index.
> > 
> > The rules for cxl_dpa_alloc() are:
> > 
> > - allocations can only come from 1 partition
> > 
> > - if allocating at partition-index-N, all free space in partitions less
> >   than partition-index-N must be skipped over
> 
> I think this is a bit deeper.  The partition index must also correspond to
> the DPA order.  The DCD code verifies the partition index's are in DPA
> order when reading them from the device.  Therefore, that code will add
> them to cxl_dpa_info in order.  But general device driver writers may miss
> this point.

We could save them from themselves with some paranoia in
cxl_dpa_setup(), but as Alejandro said accelerators are typically
single-static-RAM-partition devices. The risk is low that someone builds
a multi-partition accelerator *and* builds a driver that messes that up,
but I would not say no to a comment that notes that expectation.

> [snip]
> 
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 3f8a54ca4624..591aeb26c9e1 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -223,6 +223,31 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
> >  }
> >  EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
> >  
> > +/* See request_skip() kernel-doc */
> > +static void release_skip(struct cxl_dev_state *cxlds,
> > +			 const resource_size_t skip_base,
> > +			 const resource_size_t skip_len)
> > +{
> > +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> > +
> > +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> > +		const struct resource *part_res = &cxlds->part[i].res;
> > +		resource_size_t skip_end, skip_size;
> > +
> > +		if (skip_start < part_res->start || skip_start > part_res->end)
> > +			continue;
> > +
> > +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> > +		skip_size = skip_end - skip_start + 1;
> > +		__release_region(&cxlds->dpa_res, skip_start, skip_size);
> > +		skip_start += skip_size;
> > +		skip_rem -= skip_size;
> > +
> > +		if (!skip_rem)
> > +			break;
> > +	}
> > +}
> > +
> >  /*
> >   * Must be called in a context that synchronizes against this decoder's
> >   * port ->remove() callback (like an endpoint decoder sysfs attribute)
> > @@ -241,7 +266,7 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> >  	skip_start = res->start - cxled->skip;
> >  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> >  	if (cxled->skip)
> > -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> > +		release_skip(cxlds, skip_start, cxled->skip);
> >  	cxled->skip = 0;
> >  	cxled->dpa_res = NULL;
> >  	put_device(&cxled->cxld.dev);
> > @@ -268,6 +293,79 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> >  	__cxl_dpa_release(cxled);
> >  }
> >  
> > +/**
> > + * request_skip() - Track DPA 'skip' in @cxlds->dpa_res resource tree
> > + * @cxlds: CXL.mem device context that parents @cxled
> > + * @cxled: Endpoint decoder establishing new allocation that skips lower DPA
> > + * @skip_base: DPA < start of new DPA allocation (DPAnew)
> > + * @skip_len: @skip_base + @skip_len == DPAnew
> > + *
> > + * DPA 'skip' arises from out-of-sequence DPA allocation events relative
> > + * to free capacity across multiple partitions. It is a wasteful event
> > + * as usable DPA gets thrown away, but if a deployment has, for example,
> > + * a dual RAM+PMEM device, wants to use PMEM, and has unallocated RAM
> > + * DPA, the free RAM DPA must be sacrificed to start allocating PMEM.
> > + * See third "Implementation Note" in CXL 3.1 8.2.4.19.13 "Decoder
> > + * Protection" for more details.
> 
> I think this is a great comment here.

Appreciate that, never know how these things are going to translate.

> 
> > + *
> > + * A 'skip' always covers the last allocated DPA in a previous partition
> > + * to the start of the current partition to allocate.  Allocations never
> > + * start in the middle of a partition, and allocations are always
> > + * de-allocated in reverse order (see cxl_dpa_free(), or natural devm
> > + * unwind order from forced in-order allocation).
> > + *
> > + * If @cxlds->nr_partitions was guaranteed to be <= 2 then the 'skip'
> > + * would always be contained to a single partition. Given
> > + * @cxlds->nr_partitions may be > 2 it results in cases where the 'skip'
> > + * might span "tail capacity of partition[0], all of partition[1], ...,
> > + * all of partition[N-1]" to support allocating from partition[N]. That
> > + * in turn interacts with the partition 'struct resource' boundaries
> > + * within @cxlds->dpa_res whereby 'skip' requests need to be divided by
> > + * partition. I.e. this is a quirk of using a 'struct resource' tree to
> > + * detect range conflicts while also tracking partition boundaries in
> > + * @cxlds->dpa_res.
> 
> Another great comment but it does not actually cover the DCD case.  This
> is because in DCD the partitions might also have skips between them.

I think that "just works". The allocation will be bound by the
partition, and the skip is calculated from the "end of last allocation
in a previous partition". So, the distance between "end of last" and
"allocation start" will naturally include inter-partition holes, right?

> That said the update should come with DCD or if type 2 devices may have
> the same loosening of device partitions.
> 
> This is a good clean up though,
> 
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>

Appreciate the quick turnaround... I will endeavor to do the same with
the next DCD posting.
Ira Weiny Jan. 23, 2025, 3:14 a.m. UTC | #3
Dan Williams wrote:
> Ira Weiny wrote:
> > Dan Williams wrote:
> > > cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> > > allocations being distinct from RAM allocations in specific ways when in
> > > practice the allocation rules are only relative to DPA partition index.
> > > 
> > > The rules for cxl_dpa_alloc() are:
> > > 
> > > - allocations can only come from 1 partition
> > > 
> > > - if allocating at partition-index-N, all free space in partitions less
> > >   than partition-index-N must be skipped over
> > 
> > I think this is a bit deeper.  The partition index must also correspond to
> > the DPA order.  The DCD code verifies the partition index's are in DPA
> > order when reading them from the device.  Therefore, that code will add
> > them to cxl_dpa_info in order.  But general device driver writers may miss
> > this point.
> 
> We could save them from themselves with some paranoia in
> cxl_dpa_setup(), but as Alejandro said accelerators are typically
> single-static-RAM-partition devices. The risk is low that someone builds
> a multi-partition accelerator *and* builds a driver that messes that up,
> but I would not say no to a comment that notes that expectation.
> 
> > [snip]
> > 
> > > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > > index 3f8a54ca4624..591aeb26c9e1 100644
> > > --- a/drivers/cxl/core/hdm.c
> > > +++ b/drivers/cxl/core/hdm.c
> > > @@ -223,6 +223,31 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
> > >  }
> > >  EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
> > >  
> > > +/* See request_skip() kernel-doc */
> > > +static void release_skip(struct cxl_dev_state *cxlds,
> > > +			 const resource_size_t skip_base,
> > > +			 const resource_size_t skip_len)
> > > +{
> > > +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> > > +
> > > +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> > > +		const struct resource *part_res = &cxlds->part[i].res;
> > > +		resource_size_t skip_end, skip_size;
> > > +
> > > +		if (skip_start < part_res->start || skip_start > part_res->end)
> > > +			continue;
> > > +
> > > +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> > > +		skip_size = skip_end - skip_start + 1;
> > > +		__release_region(&cxlds->dpa_res, skip_start, skip_size);
> > > +		skip_start += skip_size;
> > > +		skip_rem -= skip_size;
> > > +
> > > +		if (!skip_rem)
> > > +			break;
> > > +	}
> > > +}
> > > +
> > >  /*
> > >   * Must be called in a context that synchronizes against this decoder's
> > >   * port ->remove() callback (like an endpoint decoder sysfs attribute)
> > > @@ -241,7 +266,7 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> > >  	skip_start = res->start - cxled->skip;
> > >  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> > >  	if (cxled->skip)
> > > -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> > > +		release_skip(cxlds, skip_start, cxled->skip);
> > >  	cxled->skip = 0;
> > >  	cxled->dpa_res = NULL;
> > >  	put_device(&cxled->cxld.dev);
> > > @@ -268,6 +293,79 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> > >  	__cxl_dpa_release(cxled);
> > >  }
> > >  
> > > +/**
> > > + * request_skip() - Track DPA 'skip' in @cxlds->dpa_res resource tree
> > > + * @cxlds: CXL.mem device context that parents @cxled
> > > + * @cxled: Endpoint decoder establishing new allocation that skips lower DPA
> > > + * @skip_base: DPA < start of new DPA allocation (DPAnew)
> > > + * @skip_len: @skip_base + @skip_len == DPAnew
> > > + *
> > > + * DPA 'skip' arises from out-of-sequence DPA allocation events relative
> > > + * to free capacity across multiple partitions. It is a wasteful event
> > > + * as usable DPA gets thrown away, but if a deployment has, for example,
> > > + * a dual RAM+PMEM device, wants to use PMEM, and has unallocated RAM
> > > + * DPA, the free RAM DPA must be sacrificed to start allocating PMEM.
> > > + * See third "Implementation Note" in CXL 3.1 8.2.4.19.13 "Decoder
> > > + * Protection" for more details.
> > 
> > I think this is a great comment here.
> 
> Appreciate that, never know how these things are going to translate.
> 
> > 
> > > + *
> > > + * A 'skip' always covers the last allocated DPA in a previous partition
> > > + * to the start of the current partition to allocate.  Allocations never
> > > + * start in the middle of a partition, and allocations are always
> > > + * de-allocated in reverse order (see cxl_dpa_free(), or natural devm
> > > + * unwind order from forced in-order allocation).
> > > + *
> > > + * If @cxlds->nr_partitions was guaranteed to be <= 2 then the 'skip'
> > > + * would always be contained to a single partition. Given
> > > + * @cxlds->nr_partitions may be > 2 it results in cases where the 'skip'
> > > + * might span "tail capacity of partition[0], all of partition[1], ...,
> > > + * all of partition[N-1]" to support allocating from partition[N]. That
> > > + * in turn interacts with the partition 'struct resource' boundaries
> > > + * within @cxlds->dpa_res whereby 'skip' requests need to be divided by
> > > + * partition. I.e. this is a quirk of using a 'struct resource' tree to
> > > + * detect range conflicts while also tracking partition boundaries in
> > > + * @cxlds->dpa_res.
> > 
> > Another great comment but it does not actually cover the DCD case.  This
> > is because in DCD the partitions might also have skips between them.
> 
> I think that "just works". The allocation will be bound by the
> partition, and the skip is calculated from the "end of last allocation
> in a previous partition". So, the distance between "end of last" and
> "allocation start" will naturally include inter-partition holes, right?

Not without a change to the algorithm I came up with.  We could create
phantom partitions which represent the skips between partitions.
Otherwise the skip resources need a different parent.

From my commit message:

    Two complications arise with Dynamic Capacity regions which did not         
    exist with Ram and PMEM partitions.  First, gaps in the DPA space can       
    exist between and around the DC partitions.  Second, the Linux resource     
    tree does not allow a resource to be marked across existing nodes within    
    a tree.                                                                     
                                                                                 
    For clarity, below is an example of a 60GB device with 10GB of RAM,
    10GB of PMEM and 10GB for each of 2 DC partitions.  The desired CXL         
    mapping is 5GB of RAM, 5GB of PMEM, and 5GB of DC1.                         
                                                                                
         DPA RANGE                                                               
         (dpa_res)                                                               
    0GB        10GB       20GB       30GB       40GB       50GB       60GB      
    |----------|----------|----------|----------|----------|----------|          
                                                                                
    RAM         PMEM                  DC0                   DC1                 
     (ram_res)  (pmem_res)            (dc_res[0])           (dc_res[1])         
    |----------|----------|   <gap>  |----------|   <gap>  |----------|         
                                                                                
     RAM        PMEM                                        DC1                 
    |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|         
    0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB       60GB      
                                                                                
    The previous skip resource between RAM and PMEM was always a child of       
    the RAM resource and fit nicely [see (S) below].  Because of this           
    simplicity this skip resource reference was not stored in any CXL state.    
    On release the skip range could be calculated based on the endpoint         
    decoders stored values.                                                      
                                                                                
    Now when DC1 is being mapped 4 skip resources must be created as            
    children.  One for the PMEM resource (A), two of the parent DPA resource    
    (B,D), and one more child of the DC0 resource (C).                          
                                                                                
    0GB        10GB       20GB       30GB       40GB       50GB       60GB      
    |----------|----------|----------|----------|----------|----------|         
                               |                     |                          
    |----------|----------|    |     |----------|    |     |----------|         
            |          |       |          |          |                          
           (S)        (A)     (B)        (C)        (D)                         
            v          v       v          v          v                          
    |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|         
           skip       skip  skip        skip      skip
Dan Williams Jan. 23, 2025, 3:28 a.m. UTC | #4
Ira Weiny wrote:
[..]
> > > Another great comment but it does not actually cover the DCD case.  This
> > > is because in DCD the partitions might also have skips between them.
> > 
> > I think that "just works". The allocation will be bound by the
> > partition, and the skip is calculated from the "end of last allocation
> > in a previous partition". So, the distance between "end of last" and
> > "allocation start" will naturally include inter-partition holes, right?
> 
> Not without a change to the algorithm I came up with.  We could create
> phantom partitions which represent the skips between partitions.
> Otherwise the skip resources need a different parent.

Oh, that is true. Same problem as tracking skips across more than 2
partitions.

However, I highly doubt anyone is going to build a device with
inter-partition skips, to the point where I am comfortable with the
driver rejecting the possibility of such devices. Effectively, dare
someone to build such a needlessly complicated device, especially when
Capacity Configuration supports the concept of decode-length vs usable
capacity.


> 
> From my commit message:
> 
>     Two complications arise with Dynamic Capacity regions which did not         
>     exist with Ram and PMEM partitions.  First, gaps in the DPA space can       
>     exist between and around the DC partitions.  Second, the Linux resource     
>     tree does not allow a resource to be marked across existing nodes within    
>     a tree.                                                                     
>                                                                                  
>     For clarity, below is an example of an 60GB device with 10GB of RAM,        
>     10GB of PMEM and 10GB for each of 2 DC partitions.  The desired CXL         
>     mapping is 5GB of RAM, 5GB of PMEM, and 5GB of DC1.                         
>                                                                                 
>          DPA RANGE                                                               
>          (dpa_res)                                                               
>     0GB        10GB       20GB       30GB       40GB       50GB       60GB      
>     |----------|----------|----------|----------|----------|----------|          
>                                                                                 
>     RAM         PMEM                  DC0                   DC1                 
>      (ram_res)  (pmem_res)            (dc_res[0])           (dc_res[1])         
>     |----------|----------|   <gap>  |----------|   <gap>  |----------|         
>                                                                                 
>      RAM        PMEM                                        DC1                 
>     |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|         
>     0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB       60GB      
>                                                                                 
>     The previous skip resource between RAM and PMEM was always a child of       
>     the RAM resource and fit nicely [see (S) below].  Because of this           
>     simplicity this skip resource reference was not stored in any CXL state.    
>     On release the skip range could be calculated based on the endpoint         
>     decoders stored values.                                                      
>                                                                                 
>     Now when DC1 is being mapped 4 skip resources must be created as            
>     children.  One for the PMEM resource (A), two of the parent DPA resource    
>     (B,D), and one more child of the DC0 resource (C).                          
>                                                                                 
>     0GB        10GB       20GB       30GB       40GB       50GB       60GB      
>     |----------|----------|----------|----------|----------|----------|         
>                                |                     |                          
>     |----------|----------|    |     |----------|    |     |----------|         
>             |          |       |          |          |                          
>            (S)        (A)     (B)        (C)        (D)                         
>             v          v       v          v          v                          
>     |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|         
>            skip       skip  skip        skip      skip                          

Yeah, I simply have low interest in reviewing a patch implementing that
scheme unless and until someone says "our device wants to do that, and
it is too late to change".
Jonathan Cameron Jan. 23, 2025, 4:41 p.m. UTC | #5
On Wed, 22 Jan 2025 00:59:33 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> allocations being distinct from RAM allocations in specific ways when in
> practice the allocation rules are only relative to DPA partition index.
> 
> The rules for cxl_dpa_alloc() are:
> 
> - allocations can only come from 1 partition
> 
> - if allocating at partition-index-N, all free space in partitions less
>   than partition-index-N must be skipped over
> 
> Use the new 'struct cxl_dpa_partition' array to support allocation with
> an arbitrary number of DPA partitions on the device.
> 
> A follow-on patch can go further to cleanup 'enum cxl_decoder_mode'
> concept and supersede it with looking up the memory properties from
> partition metadata. Until then cxl_part_mode() temporarily bridges code
> that looks up partitions by @cxled->mode.
> 
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alejandro Lucero <alucerop@amd.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

A few possible simplifications below, plus a trivial comment about a
debug message printing a useful value.

Jonathan

> ---
>  drivers/cxl/core/hdm.c |  215 +++++++++++++++++++++++++++++++++++-------------
>  drivers/cxl/cxlmem.h   |   14 +++
>  2 files changed, 172 insertions(+), 57 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 3f8a54ca4624..591aeb26c9e1 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -223,6 +223,31 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
>  
> +/* See request_skip() kernel-doc */
> +static void release_skip(struct cxl_dev_state *cxlds,
> +			 const resource_size_t skip_base,
> +			 const resource_size_t skip_len)
> +{
> +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> +
> +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> +		const struct resource *part_res = &cxlds->part[i].res;
> +		resource_size_t skip_end, skip_size;
> +
> +		if (skip_start < part_res->start || skip_start > part_res->end)
> +			continue;
> +
> +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> +		skip_size = skip_end - skip_start + 1;
> +		__release_region(&cxlds->dpa_res, skip_start, skip_size);
> +		skip_start += skip_size;
> +		skip_rem -= skip_size;
> +
> +		if (!skip_rem)
> +			break;
> +	}
> +}

We could ignore all explicit ordering constraints and have something
perhaps simpler (even simpler if there is an overlap helper we can use).
The assumption is that we want to blow away anything in the skip range,
whatever partition it is in.

	const resource_size_t skip_end = skip_base + skip_len - 1;

	for (int i = 0; i < cxlds->nr_partitions; i++) {
		const struct resource *part_res = &cxlds->part[i].res;
		resource_size_t toremove_start, toremove_end;

		toremove_start = max(skip_base, part_res->start);
		toremove_end = min(skip_end, part_res->end);
		if (toremove_end >= toremove_start) {
			resource_size_t rem_size = toremove_end - toremove_start + 1;

			__release_region(&cxlds->dpa_res, toremove_start, rem_size);
		}
	}
We can track skip_rem, or not bother with that optimization.

Mind you, your code is fine so I don't really mind.
I think we can build something similar for request_skip() based on the
ordering assumption, though there we do need to keep track of how far we
got so as to unwind only that bit.
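That "keep track of how far we got" unwind can be sketched in userspace (hypothetical names; a flat byte map stands in for the resource tree): reserve the skip piece-by-piece across partitions and, on a mid-way conflict, release only the prefix that was already reserved.

```c
#include <assert.h>
#include <string.h>

#define DPA_UNITS 30

struct part_span { unsigned long long start, end; };	/* inclusive end */

/* 1 == reserved; stands in for the struct resource tree */
static unsigned char dpa_map[DPA_UNITS];

static int reserve_span(unsigned long long start, unsigned long long len)
{
	for (unsigned long long i = start; i < start + len; i++)
		if (dpa_map[i])
			return -1;	/* conflict, like __request_region() failing */
	memset(dpa_map + start, 1, len);
	return 0;
}

static void release_span(unsigned long long start, unsigned long long len)
{
	memset(dpa_map + start, 0, len);
}

/*
 * Reserve [skip_base, skip_base + skip_len) clipped per partition; on
 * failure unwind only the skip_len - skip_rem units reserved so far.
 */
static int request_skip_model(const struct part_span *parts, int nr,
			      unsigned long long skip_base,
			      unsigned long long skip_len)
{
	unsigned long long skip_start = skip_base, skip_rem = skip_len;

	for (int i = 0; i < nr && skip_rem; i++) {
		unsigned long long end, size;

		if (skip_start < parts[i].start || skip_start > parts[i].end)
			continue;
		end = parts[i].end;
		if (skip_start + skip_rem - 1 < end)
			end = skip_start + skip_rem - 1;
		size = end - skip_start + 1;
		if (reserve_span(skip_start, size))
			break;
		skip_start += size;
		skip_rem -= size;
	}
	if (!skip_rem)
		return 0;
	release_span(skip_base, skip_len - skip_rem);	/* unwind the prefix */
	return -1;
}
```

If a conflict sits in the second of three partitions, only the first partition's piece was reserved, so only that piece is released on the error path.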




> +

> +static int request_skip(struct cxl_dev_state *cxlds,
> +			struct cxl_endpoint_decoder *cxled,
> +			const resource_size_t skip_base,
> +			const resource_size_t skip_len)
> +{
> +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> +
> +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> +		const struct resource *part_res = &cxlds->part[i].res;
> +		struct cxl_port *port = cxled_to_port(cxled);
> +		resource_size_t skip_end, skip_size;
> +		struct resource *res;
> +
> +		if (skip_start < part_res->start || skip_start > part_res->end)
> +			continue;
> +
> +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> +		skip_size = skip_end - skip_start + 1;
> +
> +		res = __request_region(&cxlds->dpa_res, skip_start, skip_size,
> +				       dev_name(&cxled->cxld.dev), 0);
> +		if (!res) {
> +			dev_dbg(cxlds->dev,
> +				"decoder%d.%d: failed to reserve skipped space\n",
> +				port->id, cxled->cxld.id);
> +			break;
> +		}
> +		skip_start += skip_size;
> +		skip_rem -= skip_size;
> +		if (!skip_rem)
> +			break;
> +	}
> +
> +	if (skip_rem == 0)
> +		return 0;
> +
> +	release_skip(cxlds, skip_base, skip_len - skip_rem);
> +
> +	return -EBUSY;
> +}


> @@ -529,15 +625,13 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>  int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  {
>  	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> -	resource_size_t free_ram_start, free_pmem_start;
>  	struct cxl_port *port = cxled_to_port(cxled);
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  	struct device *dev = &cxled->cxld.dev;
> -	resource_size_t start, avail, skip;
> +	struct resource *res, *prev = NULL;
> +	resource_size_t start, avail, skip, skip_start;
>  	struct resource *p, *last;
> -	const struct resource *ram_res = to_ram_res(cxlds);
> -	const struct resource *pmem_res = to_pmem_res(cxlds);
> -	int rc;
> +	int part, rc;
>  
>  	down_write(&cxl_dpa_rwsem);
>  	if (cxled->cxld.region) {
> @@ -553,47 +647,54 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>  		goto out;
>  	}
>  
> -	for (p = ram_res->child, last = NULL; p; p = p->sibling)
> -		last = p;
> -	if (last)
> -		free_ram_start = last->end + 1;
> -	else
> -		free_ram_start = ram_res->start;
> +	part = -1;
> +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> +		if (cxled->mode == cxl_part_mode(cxlds->part[i].mode)) {
> +			part = i;
> +			break;
> +		}
> +	}
>  
> -	for (p = pmem_res->child, last = NULL; p; p = p->sibling)
> +	if (part < 0) {
> +		dev_dbg(dev, "partition %d not found\n", part);

how is part useful to print here? it's -1

> +		rc = -EBUSY;
> +		goto out;
> +	}

Maybe tidier as a check on loop exiting early.

	for (part = 0; part < cxlds->nr_partitions; part++) {
		if (cxled->mode == cxl_part_mode(cxlds->part[part].mode))
			break;
	}
	if (part == cxlds->nr_partitions) {
		dev_dbg(dev, "partition mode %d not found\n", cxled->mode);
		rc = -EBUSY;
		goto out;
	}
Alejandro Lucero Palau Jan. 23, 2025, 5:21 p.m. UTC | #6
On 1/22/25 08:59, Dan Williams wrote:
> cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> allocations being distinct from RAM allocations in specific ways when in
> practice the allocation rules are only relative to DPA partition index.
>
> The rules for cxl_dpa_alloc() are:
>
> - allocations can only come from 1 partition
>
> - if allocating at partition-index-N, all free space in partitions less
>    than partition-index-N must be skipped over
>
> Use the new 'struct cxl_dpa_partition' array to support allocation with
> an arbitrary number of DPA partitions on the device.
>
> A follow-on patch can go further to cleanup 'enum cxl_decoder_mode'
> concept and supersede it with looking up the memory properties from
> partition metadata. Until then cxl_part_mode() temporarily bridges code
> that looks up partitions by @cxled->mode.
>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alejandro Lucero <alucerop@amd.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>


Reviewed-by: Alejandro Lucero <alucerop@amd.com>


Dave Jiang Jan. 23, 2025, 8:52 p.m. UTC | #7
On 1/22/25 1:59 AM, Dan Williams wrote:
> cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> allocations being distinct from RAM allocations in specific ways when in
> practice the allocation rules are only relative to DPA partition index.
> 
> The rules for cxl_dpa_alloc() are:
> 
> - allocations can only come from 1 partition
> 
> - if allocating at partition-index-N, all free space in partitions less
>   than partition-index-N must be skipped over
> 
> Use the new 'struct cxl_dpa_partition' array to support allocation with
> an arbitrary number of DPA partitions on the device.
> 
> A follow-on patch can go further to cleanup 'enum cxl_decoder_mode'
> concept and supersede it with looking up the memory properties from
> partition metadata. Until then cxl_part_mode() temporarily bridges code
> that looks up partitions by @cxled->mode.
> 
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alejandro Lucero <alucerop@amd.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Dan Williams Jan. 23, 2025, 9:34 p.m. UTC | #8
Jonathan Cameron wrote:
> On Wed, 22 Jan 2025 00:59:33 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> > allocations being distinct from RAM allocations in specific ways when in
> > practice the allocation rules are only relative to DPA partition index.
> > 
> > The rules for cxl_dpa_alloc() are:
> > 
> > - allocations can only come from 1 partition
> > 
> > - if allocating at partition-index-N, all free space in partitions less
> >   than partition-index-N must be skipped over
> > 
> > Use the new 'struct cxl_dpa_partition' array to support allocation with
> > an arbitrary number of DPA partitions on the device.
> > 
> > A follow-on patch can go further to cleanup 'enum cxl_decoder_mode'
> > concept and supersede it with looking up the memory properties from
> > partition metadata. Until then cxl_part_mode() temporarily bridges code
> > that looks up partitions by @cxled->mode.
> > 
> > Cc: Dave Jiang <dave.jiang@intel.com>
> > Cc: Alejandro Lucero <alucerop@amd.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> 
> A few possible simplifications below + a trivial debug message printing
> a useful value comment.
> 
> Jonathan
> 
> > ---
> >  drivers/cxl/core/hdm.c |  215 +++++++++++++++++++++++++++++++++++-------------
> >  drivers/cxl/cxlmem.h   |   14 +++
> >  2 files changed, 172 insertions(+), 57 deletions(-)
> > 
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 3f8a54ca4624..591aeb26c9e1 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -223,6 +223,31 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
> >  }
> >  EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
> >  
> > +/* See request_skip() kernel-doc */
> > +static void release_skip(struct cxl_dev_state *cxlds,
> > +			 const resource_size_t skip_base,
> > +			 const resource_size_t skip_len)
> > +{
> > +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> > +
> > +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> > +		const struct resource *part_res = &cxlds->part[i].res;
> > +		resource_size_t skip_end, skip_size;
> > +
> > +		if (skip_start < part_res->start || skip_start > part_res->end)
> > +			continue;
> > +
> > +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> > +		skip_size = skip_end - skip_start + 1;
> > +		__release_region(&cxlds->dpa_res, skip_start, skip_size);
> > +		skip_start += skip_size;
> > +		skip_rem -= skip_size;
> > +
> > +		if (!skip_rem)
> > +			break;
> > +	}
> > +}
> 
> Could ignore all explicit ordering constraints and have perhaps simpler
> (Even simpler if there is an overlap helper we can use)
> Assumption is we want to blow away anything in the skip range, whatever
> partition it is in.
> 
> 	for (int i = 0; i < cxlds->nr_partitions; i++) {
> 		const struct resource *part_res = &cxlds->part[i].res;
> 		resource_size_t toremove_start, toremove_end;
> 
> 		toremove_start = max(skip_start, part_res->start);
> 		toremove_end = min(skip_end, part_res->end);
> 		if (toremove_end >= toremove_start) {
> 			resource_size_t rem_size = toremove_end - toremove_start + 1;
> 			__release_region(&cxlds->dpa_res, toremove_start, rem_size);
> 		}
> 	}
> Can track skip_rem or not bother with that optimization.

I like it, I'll switch to this.

> 
> Mind you your code is fine so I don't really mind.
> I think we can build similar for request_skip based on ordering assumption, though
> there we do need to keep track of how far we got so as to unwind only
> that bit.

Will give it a spin and see how it looks.

[..]
> > @@ -553,47 +647,54 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> >  		goto out;
> >  	}
> >  
> > -	for (p = ram_res->child, last = NULL; p; p = p->sibling)
> > -		last = p;
> > -	if (last)
> > -		free_ram_start = last->end + 1;
> > -	else
> > -		free_ram_start = ram_res->start;
> > +	part = -1;
> > +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> > +		if (cxled->mode == cxl_part_mode(cxlds->part[i].mode)) {
> > +			part = i;
> > +			break;
> > +		}
> > +	}
> >  
> > -	for (p = pmem_res->child, last = NULL; p; p = p->sibling)
> > +	if (part < 0) {
> > +		dev_dbg(dev, "partition %d not found\n", part);
> 
> how is part useful to print here? it's -1

Yeah, another thinko likely because I was already thinking about how
cxled->part should already be set before entering this function in the
next patch.

For this one, I'll just delete the message because in the next patch the
loop is gone and only the following remains:

        part = cxled->part;
        if (part < 0) {
                dev_dbg(dev, "partition not set\n");
                rc = -EBUSY;
                goto out;
        }
Patch

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 3f8a54ca4624..591aeb26c9e1 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -223,6 +223,31 @@  void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
 
+/* See request_skip() kernel-doc */
+static void release_skip(struct cxl_dev_state *cxlds,
+			 const resource_size_t skip_base,
+			 const resource_size_t skip_len)
+{
+	resource_size_t skip_start = skip_base, skip_rem = skip_len;
+
+	for (int i = 0; i < cxlds->nr_partitions; i++) {
+		const struct resource *part_res = &cxlds->part[i].res;
+		resource_size_t skip_end, skip_size;
+
+		if (skip_start < part_res->start || skip_start > part_res->end)
+			continue;
+
+		skip_end = min(part_res->end, skip_start + skip_rem - 1);
+		skip_size = skip_end - skip_start + 1;
+		__release_region(&cxlds->dpa_res, skip_start, skip_size);
+		skip_start += skip_size;
+		skip_rem -= skip_size;
+
+		if (!skip_rem)
+			break;
+	}
+}
+
 /*
  * Must be called in a context that synchronizes against this decoder's
  * port ->remove() callback (like an endpoint decoder sysfs attribute)
@@ -241,7 +266,7 @@  static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
 	skip_start = res->start - cxled->skip;
 	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
 	if (cxled->skip)
-		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
+		release_skip(cxlds, skip_start, cxled->skip);
 	cxled->skip = 0;
 	cxled->dpa_res = NULL;
 	put_device(&cxled->cxld.dev);
@@ -268,6 +293,79 @@  static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
 	__cxl_dpa_release(cxled);
 }
 
+/**
+ * request_skip() - Track DPA 'skip' in @cxlds->dpa_res resource tree
+ * @cxlds: CXL.mem device context that parents @cxled
+ * @cxled: Endpoint decoder establishing new allocation that skips lower DPA
+ * @skip_base: DPA < start of new DPA allocation (DPAnew)
+ * @skip_len: @skip_base + @skip_len == DPAnew
+ *
+ * DPA 'skip' arises from out-of-sequence DPA allocation events relative
+ * to free capacity across multiple partitions. It is a wasteful event
+ * as usable DPA gets thrown away, but if a deployment has, for example,
+ * a dual RAM+PMEM device, wants to use PMEM, and has unallocated RAM
+ * DPA, the free RAM DPA must be sacrificed to start allocating PMEM.
+ * See third "Implementation Note" in CXL 3.1 8.2.4.19.13 "Decoder
+ * Protection" for more details.
+ *
+ * A 'skip' always covers the last allocated DPA in a previous partition
+ * to the start of the current partition to allocate.  Allocations never
+ * start in the middle of a partition, and allocations are always
+ * de-allocated in reverse order (see cxl_dpa_free(), or natural devm
+ * unwind order from forced in-order allocation).
+ *
+ * If @cxlds->nr_partitions was guaranteed to be <= 2 then the 'skip'
+ * would always be contained to a single partition. Given
+ * @cxlds->nr_partitions may be > 2 it results in cases where the 'skip'
+ * might span "tail capacity of partition[0], all of partition[1], ...,
+ * all of partition[N-1]" to support allocating from partition[N]. That
+ * in turn interacts with the partition 'struct resource' boundaries
+ * within @cxlds->dpa_res whereby 'skip' requests need to be divided by
+ * partition. I.e. this is a quirk of using a 'struct resource' tree to
+ * detect range conflicts while also tracking partition boundaries in
+ * @cxlds->dpa_res.
+ */
+static int request_skip(struct cxl_dev_state *cxlds,
+			struct cxl_endpoint_decoder *cxled,
+			const resource_size_t skip_base,
+			const resource_size_t skip_len)
+{
+	resource_size_t skip_start = skip_base, skip_rem = skip_len;
+
+	for (int i = 0; i < cxlds->nr_partitions; i++) {
+		const struct resource *part_res = &cxlds->part[i].res;
+		struct cxl_port *port = cxled_to_port(cxled);
+		resource_size_t skip_end, skip_size;
+		struct resource *res;
+
+		if (skip_start < part_res->start || skip_start > part_res->end)
+			continue;
+
+		skip_end = min(part_res->end, skip_start + skip_rem - 1);
+		skip_size = skip_end - skip_start + 1;
+
+		res = __request_region(&cxlds->dpa_res, skip_start, skip_size,
+				       dev_name(&cxled->cxld.dev), 0);
+		if (!res) {
+			dev_dbg(cxlds->dev,
+				"decoder%d.%d: failed to reserve skipped space\n",
+				port->id, cxled->cxld.id);
+			break;
+		}
+		skip_start += skip_size;
+		skip_rem -= skip_size;
+		if (!skip_rem)
+			break;
+	}
+
+	if (skip_rem == 0)
+		return 0;
+
+	release_skip(cxlds, skip_base, skip_len - skip_rem);
+
+	return -EBUSY;
+}
+
 static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 			     resource_size_t base, resource_size_t len,
 			     resource_size_t skipped)
@@ -276,7 +374,9 @@  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	struct cxl_port *port = cxled_to_port(cxled);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct device *dev = &port->dev;
+	enum cxl_decoder_mode mode;
 	struct resource *res;
+	int rc;
 
 	lockdep_assert_held_write(&cxl_dpa_rwsem);
 
@@ -305,14 +405,9 @@  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	}
 
 	if (skipped) {
-		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
-				       dev_name(&cxled->cxld.dev), 0);
-		if (!res) {
-			dev_dbg(dev,
-				"decoder%d.%d: failed to reserve skipped space\n",
-				port->id, cxled->cxld.id);
-			return -EBUSY;
-		}
+		rc = request_skip(cxlds, cxled, base - skipped, skipped);
+		if (rc)
+			return rc;
 	}
 	res = __request_region(&cxlds->dpa_res, base, len,
 			       dev_name(&cxled->cxld.dev), 0);
@@ -320,22 +415,23 @@  static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
 			port->id, cxled->cxld.id);
 		if (skipped)
-			__release_region(&cxlds->dpa_res, base - skipped,
-					 skipped);
+			release_skip(cxlds, base - skipped, skipped);
 		return -EBUSY;
 	}
 	cxled->dpa_res = res;
 	cxled->skip = skipped;
 
-	if (to_pmem_res(cxlds) && resource_contains(to_pmem_res(cxlds), res))
-		cxled->mode = CXL_DECODER_PMEM;
-	else if (to_ram_res(cxlds) && resource_contains(to_ram_res(cxlds), res))
-		cxled->mode = CXL_DECODER_RAM;
-	else {
+	mode = CXL_DECODER_NONE;
+	for (int i = 0; i < cxlds->nr_partitions; i++)
+		if (resource_contains(&cxlds->part[i].res, res)) {
+			mode = cxl_part_mode(cxlds->part[i].mode);
+			break;
+		}
+
+	if (mode == CXL_DECODER_NONE)
 		dev_warn(dev, "decoder%d.%d: %pr does not map any partition\n",
 			 port->id, cxled->cxld.id, res);
-		cxled->mode = CXL_DECODER_NONE;
-	}
+	cxled->mode = mode;
 
 	port->hdm_end++;
 	get_device(&cxled->cxld.dev);
@@ -529,15 +625,13 @@  int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
 int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 {
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
-	resource_size_t free_ram_start, free_pmem_start;
 	struct cxl_port *port = cxled_to_port(cxled);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct device *dev = &cxled->cxld.dev;
-	resource_size_t start, avail, skip;
+	struct resource *res, *prev = NULL;
+	resource_size_t start, avail, skip, skip_start;
 	struct resource *p, *last;
-	const struct resource *ram_res = to_ram_res(cxlds);
-	const struct resource *pmem_res = to_pmem_res(cxlds);
-	int rc;
+	int part, rc;
 
 	down_write(&cxl_dpa_rwsem);
 	if (cxled->cxld.region) {
@@ -553,47 +647,54 @@  int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 		goto out;
 	}
 
-	for (p = ram_res->child, last = NULL; p; p = p->sibling)
-		last = p;
-	if (last)
-		free_ram_start = last->end + 1;
-	else
-		free_ram_start = ram_res->start;
+	part = -1;
+	for (int i = 0; i < cxlds->nr_partitions; i++) {
+		if (cxled->mode == cxl_part_mode(cxlds->part[i].mode)) {
+			part = i;
+			break;
+		}
+	}
 
-	for (p = pmem_res->child, last = NULL; p; p = p->sibling)
+	if (part < 0) {
+		dev_dbg(dev, "partition %d not found\n", part);
+		rc = -EBUSY;
+		goto out;
+	}
+
+	res = &cxlds->part[part].res;
+	for (p = res->child, last = NULL; p; p = p->sibling)
 		last = p;
 	if (last)
-		free_pmem_start = last->end + 1;
+		start = last->end + 1;
 	else
-		free_pmem_start = pmem_res->start;
-
-	if (cxled->mode == CXL_DECODER_RAM) {
-		start = free_ram_start;
-		avail = ram_res->end - start + 1;
-		skip = 0;
-	} else if (cxled->mode == CXL_DECODER_PMEM) {
-		resource_size_t skip_start, skip_end;
+		start = res->start;
 
-		start = free_pmem_start;
-		avail = pmem_res->end - start + 1;
-		skip_start = free_ram_start;
-
-		/*
-		 * If some pmem is already allocated, then that allocation
-		 * already handled the skip.
-		 */
-		if (pmem_res->child &&
-		    skip_start == pmem_res->child->start)
-			skip_end = skip_start - 1;
-		else
-			skip_end = start - 1;
-		skip = skip_end - skip_start + 1;
-	} else {
-		dev_dbg(dev, "mode not set\n");
-		rc = -EINVAL;
-		goto out;
+	/*
+	 * To allocate at partition N, a skip needs to be calculated for all
+	 * unallocated space at lower partition indices.
+	 *
+	 * If a partition has any allocations, the search can end because a
+	 * previous cxl_dpa_alloc() invocation is assumed to have accounted for
+	 * all previous partitions.
+	 */
+	skip_start = CXL_RESOURCE_NONE;
+	for (int i = part; i; i--) {
+		prev = &cxlds->part[i - 1].res;
+		for (p = prev->child, last = NULL; p; p = p->sibling)
+			last = p;
+		if (last) {
+			skip_start = last->end + 1;
+			break;
+		}
+		skip_start = prev->start;
 	}
 
+	avail = res->end - start + 1;
+	if (skip_start == CXL_RESOURCE_NONE)
+		skip = 0;
+	else
+		skip = res->start - skip_start;
+
 	if (size > avail) {
 		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
 			cxl_decoder_mode_name(cxled->mode), &avail);
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 15f549afab7c..bad99456e901 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -530,6 +530,20 @@  static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
 	return resource_size(res);
 }
 
+/*
+ * Translate the operational mode of memory capacity into the
+ * operational mode of a decoder
+ * TODO: kill 'enum cxl_decoder_mode' to obviate this helper
+ */
+static inline enum cxl_decoder_mode cxl_part_mode(enum cxl_partition_mode mode)
+{
+	if (mode == CXL_PARTMODE_RAM)
+		return CXL_DECODER_RAM;
+	if (mode == CXL_PARTMODE_PMEM)
+		return CXL_DECODER_PMEM;
+	return CXL_DECODER_NONE;
+}
+
 static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
 {
 	return dev_get_drvdata(cxl_mbox->host);