Message ID: 173753637297.3849855.5217976225600372473.stgit@dwillia2-xfh.jf.intel.com
State: New
Series: cxl: DPA partition metadata is a mess...
Dan Williams wrote:
> cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> allocations being distinct from RAM allocations in specific ways when in
> practice the allocation rules are only relative to DPA partition index.
>
> The rules for cxl_dpa_alloc() are:
>
> - allocations can only come from 1 partition
>
> - if allocating at partition-index-N, all free space in partitions less
>   than partition-index-N must be skipped over

I think this is a bit deeper. The partition index must also correspond to
the DPA order. The DCD code verifies the partition indices are in DPA
order when reading them from the device. Therefore, that code will add
them to cxl_dpa_info in order. But general device driver writers may miss
this point.

[snip]

> +/**
> + * request_skip() - Track DPA 'skip' in @cxlds->dpa_res resource tree
> + * @cxlds: CXL.mem device context that parents @cxled
> + * @cxled: Endpoint decoder establishing new allocation that skips lower DPA
> + * @skip_base: DPA < start of new DPA allocation (DPAnew)
> + * @skip_len: @skip_base + @skip_len == DPAnew
> + *
> + * DPA 'skip' arises from out-of-sequence DPA allocation events relative
> + * to free capacity across multiple partitions. It is a wasteful event
> + * as usable DPA gets thrown away, but if a deployment has, for example,
> + * a dual RAM+PMEM device, wants to use PMEM, and has unallocated RAM
> + * DPA, the free RAM DPA must be sacrificed to start allocating PMEM.
> + * See third "Implementation Note" in CXL 3.1 8.2.4.19.13 "Decoder
> + * Protection" for more details.

I think this is a great comment here.

> + *
> + * A 'skip' always covers the last allocated DPA in a previous partition
> + * to the start of the current partition to allocate. Allocations never
> + * start in the middle of a partition, and allocations are always
> + * de-allocated in reverse order (see cxl_dpa_free(), or natural devm
> + * unwind order from forced in-order allocation).
> + *
> + * If @cxlds->nr_partitions was guaranteed to be <= 2 then the 'skip'
> + * would always be contained to a single partition. Given
> + * @cxlds->nr_partitions may be > 2 it results in cases where the 'skip'
> + * might span "tail capacity of partition[0], all of partition[1], ...,
> + * all of partition[N-1]" to support allocating from partition[N]. That
> + * in turn interacts with the partition 'struct resource' boundaries
> + * within @cxlds->dpa_res whereby 'skip' requests need to be divided by
> + * partition. I.e. this is a quirk of using a 'struct resource' tree to
> + * detect range conflicts while also tracking partition boundaries in
> + * @cxlds->dpa_res.

Another great comment, but it does not actually cover the DCD case. This
is because in DCD the partitions might also have skips between them.

That said, the update should come with DCD, or with type 2 devices if
they have the same loosening of device partitions.

This is a good clean up though,

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]
Ira Weiny wrote:
> Dan Williams wrote:
> > cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> > allocations being distinct from RAM allocations in specific ways when in
> > practice the allocation rules are only relative to DPA partition index.
> >
> > The rules for cxl_dpa_alloc() are:
> >
> > - allocations can only come from 1 partition
> >
> > - if allocating at partition-index-N, all free space in partitions less
> >   than partition-index-N must be skipped over
>
> I think this is a bit deeper. The partition index must also correspond to
> the DPA order. The DCD code verifies the partition indices are in DPA
> order when reading them from the device. Therefore, that code will add
> them to cxl_dpa_info in order. But general device driver writers may miss
> this point.

We could save them from themselves with some paranoia in
cxl_dpa_setup(), but as Alejandro said accelerators are typically
single-static-RAM-partition devices. The risk is low that someone builds
a multi-partition accelerator *and* builds a driver that messes that up,
but I would not say no to a comment that notes that expectation.

> [snip]

> I think this is a great comment here.

Appreciate that, never know how these things are going to translate.

[snip]

> Another great comment, but it does not actually cover the DCD case. This
> is because in DCD the partitions might also have skips between them.

I think that "just works". The allocation will be bound by the
partition, and the skip is calculated from the "end of last allocation
in a previous partition". So, the distance between "end of last" and
"allocation start" will naturally include inter-partition holes, right?

> That said, the update should come with DCD, or with type 2 devices if
> they have the same loosening of device partitions.
>
> This is a good clean up though,
>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>

Appreciate the quick turnaround... I will endeavor to do the same with
the next DCD posting.
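For illustration, a minimal sketch of the kind of cxl_dpa_setup() paranoia
discussed above. The function name and the check itself are hypothetical
(not part of the posted series); only the cxlds->part[] / nr_partitions
layout is taken from the patch:

	/*
	 * Hypothetical sanity check: reject partition layouts that are
	 * not in increasing DPA order, since cxl_dpa_alloc() computes
	 * 'skip' on the assumption that partition index order matches
	 * DPA order.
	 */
	static bool cxl_dpa_partitions_ordered(struct cxl_dev_state *cxlds)
	{
		for (int i = 1; i < cxlds->nr_partitions; i++) {
			const struct resource *prev = &cxlds->part[i - 1].res;
			const struct resource *cur = &cxlds->part[i].res;

			/* each partition must start above its predecessor */
			if (cur->start <= prev->end) {
				dev_err(cxlds->dev,
					"partition%d %pr not above partition%d %pr\n",
					i, cur, i - 1, prev);
				return false;
			}
		}
		return true;
	}

A driver that fails such a check could then have its DPA setup rejected
before any decoder allocation is attempted.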
Dan Williams wrote:
> Ira Weiny wrote:

[snip]

> > Another great comment, but it does not actually cover the DCD case. This
> > is because in DCD the partitions might also have skips between them.
>
> I think that "just works". The allocation will be bound by the
> partition, and the skip is calculated from the "end of last allocation
> in a previous partition". So, the distance between "end of last" and
> "allocation start" will naturally include inter-partition holes, right?

Not without a change to the algorithm I came up with. We could create
phantom partitions which represent the skips between partitions.
Otherwise the skip resources need a different parent.

From my commit message:

Two complications arise with Dynamic Capacity regions which did not
exist with RAM and PMEM partitions. First, gaps in the DPA space can
exist between and around the DC partitions. Second, the Linux resource
tree does not allow a resource to be marked across existing nodes within
a tree.

For clarity, below is an example of a 60GB device with 10GB of RAM,
10GB of PMEM and 10GB for each of 2 DC partitions. The desired CXL
mapping is 5GB of RAM, 5GB of PMEM, and 5GB of DC1.

                                 DPA RANGE
                                 (dpa_res)
0GB        10GB       20GB       30GB       40GB       50GB       60GB
|----------|----------|----------|----------|----------|----------|

    RAM        PMEM                  DC0                   DC1
 (ram_res)  (pmem_res)           (dc_res[0])           (dc_res[1])
|----------|----------|  <gap>   |----------|  <gap>   |----------|

   RAM        PMEM                                        DC1
|XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|
0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB    60GB

The previous skip resource between RAM and PMEM was always a child of
the RAM resource and fit nicely [see (S) below]. Because of this
simplicity this skip resource reference was not stored in any CXL state.
On release the skip range could be calculated based on the endpoint
decoder's stored values.

Now when DC1 is being mapped 4 skip resources must be created as
children. One for the PMEM resource (A), two of the parent DPA resource
(B,D), and one more child of the DC0 resource (C).

0GB        10GB       20GB       30GB       40GB       50GB       60GB
|----------|----------|----------|----------|----------|----------|
                           |                     |
|----------|----------|    |     |----------|    |     |----------|
        |          |       |          |          |
       (S)        (A)     (B)        (C)        (D)
        v          v       v          v          v
|XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|
      skip      skip      skip       skip       skip
Ira Weiny wrote:
[..]
> > > Another great comment, but it does not actually cover the DCD case. This
> > > is because in DCD the partitions might also have skips between them.
> >
> > I think that "just works". The allocation will be bound by the
> > partition, and the skip is calculated from the "end of last allocation
> > in a previous partition". So, the distance between "end of last" and
> > "allocation start" will naturally include inter-partition holes, right?
>
> Not without a change to the algorithm I came up with. We could create
> phantom partitions which represent the skips between partitions.
> Otherwise the skip resources need a different parent.

Oh, that is true. Same problem as tracking skips across more than 2
partitions.

However, I highly doubt anyone is going to build a device with
inter-partition skips, to the point where I am comfortable with the
driver rejecting the possibility of such devices. Effectively dare
someone to build such a needlessly complicated device, especially when
Capacity Configuration supports the concept of decode-length vs usable
capacity.

> From my commit message:
[..]

Yeah, I simply have low interest to review a patch implementing that
scheme unless and until someone says "our device wants to do that, and
it is too late to change".
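To make the "different parent" constraint concrete, here is a hedged
sketch, not proposed in this thread, of what a gap-aware request_skip()
could look like. Slices of the skip that land inside a partition nest
under that partition's resource; slices that land in an inter-partition
gap can only be claimed directly against the dpa_res root, which is
exactly Ira's (B) and (D) resources in the diagram above. The function
name and logic are hypothetical:

	static int request_skip_with_gaps(struct cxl_dev_state *cxlds,
					  struct cxl_endpoint_decoder *cxled,
					  resource_size_t skip_base,
					  resource_size_t skip_len)
	{
		resource_size_t cursor = skip_base;
		resource_size_t end = skip_base + skip_len - 1;

		while (cursor <= end) {
			resource_size_t slice_end = end;
			struct resource *res;

			/* partitions are assumed in increasing DPA order */
			for (int i = 0; i < cxlds->nr_partitions; i++) {
				const struct resource *p = &cxlds->part[i].res;

				if (cursor >= p->start && cursor <= p->end) {
					/* in-partition slice, clamp to it */
					slice_end = min(slice_end, p->end);
					break;
				}
				if (cursor < p->start) {
					/* gap slice, stop at next partition */
					slice_end = min(slice_end, p->start - 1);
					break;
				}
			}

			res = __request_region(&cxlds->dpa_res, cursor,
					       slice_end - cursor + 1,
					       dev_name(&cxled->cxld.dev), 0);
			if (!res)
				return -EBUSY; /* a real version must unwind prior slices */
			cursor = slice_end + 1;
		}

		return 0;
	}

Whether that complexity is ever justified is exactly the question above;
the posted request_skip() instead fails the allocation if any part of the
skip falls outside a known partition.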
On Wed, 22 Jan 2025 00:59:33 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> allocations being distinct from RAM allocations in specific ways when in
> practice the allocation rules are only relative to DPA partition index.
>
> The rules for cxl_dpa_alloc() are:
>
> - allocations can only come from 1 partition
>
> - if allocating at partition-index-N, all free space in partitions less
>   than partition-index-N must be skipped over
>
> Use the new 'struct cxl_dpa_partition' array to support allocation with
> an arbitrary number of DPA partitions on the device.
>
> A follow-on patch can go further to cleanup 'enum cxl_decoder_mode'
> concept and supersede it with looking up the memory properties from
> partition metadata. Until then cxl_part_mode() temporarily bridges code
> that looks up partitions by @cxled->mode.
>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alejandro Lucero <alucerop@amd.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

A few possible simplifications below, plus a trivial comment about a
debug message that prints a not-so-useful value.

Jonathan

> ---
>  drivers/cxl/core/hdm.c | 215 +++++++++++++++++++++++++++++++++++-------------
>  drivers/cxl/cxlmem.h   |  14 +++
>  2 files changed, 172 insertions(+), 57 deletions(-)

[snip]

> +/* See request_skip() kernel-doc */
> +static void release_skip(struct cxl_dev_state *cxlds,
> +			 const resource_size_t skip_base,
> +			 const resource_size_t skip_len)
> +{
> +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> +
> +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> +		const struct resource *part_res = &cxlds->part[i].res;
> +		resource_size_t skip_end, skip_size;
> +
> +		if (skip_start < part_res->start || skip_start > part_res->end)
> +			continue;
> +
> +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> +		skip_size = skip_end - skip_start + 1;
> +		__release_region(&cxlds->dpa_res, skip_start, skip_size);
> +		skip_start += skip_size;
> +		skip_rem -= skip_size;
> +
> +		if (!skip_rem)
> +			break;
> +	}
> +}

Could ignore all explicit ordering constraints and have something perhaps
simpler (even simpler if there is an overlap helper we can use).
Assumption is we want to blow away anything in the skip range, whatever
partition it is in.

	for (int i = 0; i < cxlds->nr_partitions; i++) {
		const struct resource *part_res = &cxlds->part[i].res;
		resource_size_t toremove_start, toremove_end;

		toremove_start = max(skip_start, part_res->start);
		toremove_end = min(skip_end, part_res->end);
		if (toremove_end >= toremove_start) {
			resource_size_t rem_size =
				toremove_end - toremove_start + 1;

			__release_region(&cxlds->dpa_res, toremove_start,
					 rem_size);
		}
	}

Can track skip_rem or not bother with that optimization.

Mind you your code is fine so I don't really mind.

I think we can build similar for request_skip based on the ordering
assumption, though there we do need to keep track of how far we got so as
to unwind only that bit.

[snip]

> +	part = -1;
> +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> +		if (cxled->mode == cxl_part_mode(cxlds->part[i].mode)) {
> +			part = i;
> +			break;
> +		}
> +	}
>
> +	if (part < 0) {
> +		dev_dbg(dev, "partition %d not found\n", part);

How is part useful to print here? It's -1.

> +		rc = -EBUSY;
> +		goto out;
> +	}

Maybe tidier as a check on the loop exiting early:

	for (part = 0; part < cxlds->nr_partitions; part++) {
		if (cxled->mode == cxl_part_mode(cxlds->part[part].mode))
			break;
	}
	if (part == cxlds->nr_partitions) {
		dev_dbg(dev, "partition mode %d not found\n", cxled->mode);
		rc = -EBUSY;
		goto out;
	}
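On the "overlap helper" aside: <linux/ioport.h> does provide
resource_overlaps(). A sketch of release_skip() written on top of it,
under the same assumption as the loop above (release whatever slice of
the skip range lands in each partition, with no ordering constraints):

	static void release_skip(struct cxl_dev_state *cxlds,
				 resource_size_t skip_base,
				 resource_size_t skip_len)
	{
		struct resource skip = DEFINE_RES_MEM(skip_base, skip_len);

		for (int i = 0; i < cxlds->nr_partitions; i++) {
			struct resource *part_res = &cxlds->part[i].res;
			resource_size_t start, end;

			if (!resource_overlaps(&skip, part_res))
				continue;

			/* release only the overlapping slice */
			start = max(skip.start, part_res->start);
			end = min(skip.end, part_res->end);
			__release_region(&cxlds->dpa_res, start,
					 end - start + 1);
		}
	}

Like the original, this assumes every overlapping slice was previously
reserved by request_skip().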
On 1/22/25 08:59, Dan Williams wrote:
> cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> allocations being distinct from RAM allocations in specific ways when in
> practice the allocation rules are only relative to DPA partition index.

[snip]

> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alejandro Lucero <alucerop@amd.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Alejandro Lucero <alucerop@amd.com>

[snip]
On 1/22/25 1:59 AM, Dan Williams wrote:
> cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> allocations being distinct from RAM allocations in specific ways when in
> practice the allocation rules are only relative to DPA partition index.

[snip]

> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alejandro Lucero <alucerop@amd.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

[snip]
Jonathan Cameron wrote:
> On Wed, 22 Jan 2025 00:59:33 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:

[..]

> Could ignore all explicit ordering constraints and have something perhaps
> simpler (even simpler if there is an overlap helper we can use).
> Assumption is we want to blow away anything in the skip range, whatever
> partition it is in.
>
> 	for (int i = 0; i < cxlds->nr_partitions; i++) {
> 		const struct resource *part_res = &cxlds->part[i].res;
> 		resource_size_t toremove_start, toremove_end;
>
> 		toremove_start = max(skip_start, part_res->start);
> 		toremove_end = min(skip_end, part_res->end);
> 		if (toremove_end >= toremove_start) {
> 			resource_size_t rem_size =
> 				toremove_end - toremove_start + 1;
>
> 			__release_region(&cxlds->dpa_res, toremove_start,
> 					 rem_size);
> 		}
> 	}
>
> Can track skip_rem or not bother with that optimization.

I like it, I'll switch to this.

> Mind you your code is fine so I don't really mind.
>
> I think we can build similar for request_skip based on the ordering
> assumption, though there we do need to keep track of how far we got so as
> to unwind only that bit.

Will give it a spin and see how it looks.

[..]

> > +	if (part < 0) {
> > +		dev_dbg(dev, "partition %d not found\n", part);
>
> How is part useful to print here? It's -1.

Yeah, another thinko, likely because I was already thinking about how
cxled->part should already be set before entering this function in the
next patch. For this one, I'll just delete the message because in the
next patch the loop is gone and only the following remains:

	part = cxled->part;
	if (part < 0) {
		dev_dbg(dev, "partition not set\n");
		rc = -EBUSY;
		goto out;
	}
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c index 3f8a54ca4624..591aeb26c9e1 100644 --- a/drivers/cxl/core/hdm.c +++ b/drivers/cxl/core/hdm.c @@ -223,6 +223,31 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds) } EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL"); +/* See request_skip() kernel-doc */ +static void release_skip(struct cxl_dev_state *cxlds, + const resource_size_t skip_base, + const resource_size_t skip_len) +{ + resource_size_t skip_start = skip_base, skip_rem = skip_len; + + for (int i = 0; i < cxlds->nr_partitions; i++) { + const struct resource *part_res = &cxlds->part[i].res; + resource_size_t skip_end, skip_size; + + if (skip_start < part_res->start || skip_start > part_res->end) + continue; + + skip_end = min(part_res->end, skip_start + skip_rem - 1); + skip_size = skip_end - skip_start + 1; + __release_region(&cxlds->dpa_res, skip_start, skip_size); + skip_start += skip_size; + skip_rem -= skip_size; + + if (!skip_rem) + break; + } +} + /* * Must be called in a context that synchronizes against this decoder's * port ->remove() callback (like an endpoint decoder sysfs attribute) @@ -241,7 +266,7 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled) skip_start = res->start - cxled->skip; __release_region(&cxlds->dpa_res, res->start, resource_size(res)); if (cxled->skip) - __release_region(&cxlds->dpa_res, skip_start, cxled->skip); + release_skip(cxlds, skip_start, cxled->skip); cxled->skip = 0; cxled->dpa_res = NULL; put_device(&cxled->cxld.dev); @@ -268,6 +293,79 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled) __cxl_dpa_release(cxled); } +/** + * request_skip() - Track DPA 'skip' in @cxlds->dpa_res resource tree + * @cxlds: CXL.mem device context that parents @cxled + * @cxled: Endpoint decoder establishing new allocation that skips lower DPA + * @skip_base: DPA < start of new DPA allocation (DPAnew) + * @skip_len: @skip_base + @skip_len == DPAnew + * + * DPA 'skip' arises from out-of-sequence DPA allocation events relative + * to free capacity across multiple partitions. It is a wasteful event + * as usable DPA gets thrown away, but if a deployment has, for example, + * a dual RAM+PMEM device, wants to use PMEM, and has unallocated RAM + * DPA, the free RAM DPA must be sacrificed to start allocating PMEM. + * See third "Implementation Note" in CXL 3.1 8.2.4.19.13 "Decoder + * Protection" for more details. + * + * A 'skip' always covers the last allocated DPA in a previous partition + * to the start of the current partition to allocate. Allocations never + * start in the middle of a partition, and allocations are always + * de-allocated in reverse order (see cxl_dpa_free(), or natural devm + * unwind order from forced in-order allocation). + * + * If @cxlds->nr_partitions was guaranteed to be <= 2 then the 'skip' + * would always be contained to a single partition. Given + * @cxlds->nr_partitions may be > 2 it results in cases where the 'skip' + * might span "tail capacity of partition[0], all of partition[1], ..., + * all of partition[N-1]" to support allocating from partition[N]. That + * in turn interacts with the partition 'struct resource' boundaries + * within @cxlds->dpa_res whereby 'skip' requests need to be divided by + * partition. I.e. this is a quirk of using a 'struct resource' tree to + * detect range conflicts while also tracking partition boundaries in + * @cxlds->dpa_res. 
+ */
+static int request_skip(struct cxl_dev_state *cxlds,
+			struct cxl_endpoint_decoder *cxled,
+			const resource_size_t skip_base,
+			const resource_size_t skip_len)
+{
+	resource_size_t skip_start = skip_base, skip_rem = skip_len;
+
+	for (int i = 0; i < cxlds->nr_partitions; i++) {
+		const struct resource *part_res = &cxlds->part[i].res;
+		struct cxl_port *port = cxled_to_port(cxled);
+		resource_size_t skip_end, skip_size;
+		struct resource *res;
+
+		if (skip_start < part_res->start || skip_start > part_res->end)
+			continue;
+
+		skip_end = min(part_res->end, skip_start + skip_rem - 1);
+		skip_size = skip_end - skip_start + 1;
+
+		res = __request_region(&cxlds->dpa_res, skip_start, skip_size,
+				       dev_name(&cxled->cxld.dev), 0);
+		if (!res) {
+			dev_dbg(cxlds->dev,
+				"decoder%d.%d: failed to reserve skipped space\n",
+				port->id, cxled->cxld.id);
+			break;
+		}
+		skip_start += skip_size;
+		skip_rem -= skip_size;
+		if (!skip_rem)
+			break;
+	}
+
+	if (skip_rem == 0)
+		return 0;
+
+	release_skip(cxlds, skip_base, skip_len - skip_rem);
+
+	return -EBUSY;
+}
+
 static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 			     resource_size_t base, resource_size_t len,
 			     resource_size_t skipped)
@@ -276,7 +374,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	struct cxl_port *port = cxled_to_port(cxled);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct device *dev = &port->dev;
+	enum cxl_decoder_mode mode;
 	struct resource *res;
+	int rc;
 
 	lockdep_assert_held_write(&cxl_dpa_rwsem);
 
@@ -305,14 +405,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 	}
 
 	if (skipped) {
-		res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
-				       dev_name(&cxled->cxld.dev), 0);
-		if (!res) {
-			dev_dbg(dev,
-				"decoder%d.%d: failed to reserve skipped space\n",
-				port->id, cxled->cxld.id);
-			return -EBUSY;
-		}
+		rc = request_skip(cxlds, cxled, base - skipped, skipped);
+		if (rc)
+			return rc;
 	}
 	res = __request_region(&cxlds->dpa_res, base, len,
 			       dev_name(&cxled->cxld.dev), 0);
@@ -320,22 +415,23 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
 		dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
 			port->id, cxled->cxld.id);
 		if (skipped)
-			__release_region(&cxlds->dpa_res, base - skipped,
-					 skipped);
+			release_skip(cxlds, base - skipped, skipped);
 		return -EBUSY;
 	}
 	cxled->dpa_res = res;
 	cxled->skip = skipped;
 
-	if (to_pmem_res(cxlds) && resource_contains(to_pmem_res(cxlds), res))
-		cxled->mode = CXL_DECODER_PMEM;
-	else if (to_ram_res(cxlds) && resource_contains(to_ram_res(cxlds), res))
-		cxled->mode = CXL_DECODER_RAM;
-	else {
+	mode = CXL_DECODER_NONE;
+	for (int i = 0; i < cxlds->nr_partitions; i++)
+		if (resource_contains(&cxlds->part[i].res, res)) {
+			mode = cxl_part_mode(cxlds->part[i].mode);
+			break;
+		}
+
+	if (mode == CXL_DECODER_NONE)
 		dev_warn(dev, "decoder%d.%d: %pr does not map any partition\n",
 			 port->id, cxled->cxld.id, res);
-		cxled->mode = CXL_DECODER_NONE;
-	}
+	cxled->mode = mode;
 
 	port->hdm_end++;
 	get_device(&cxled->cxld.dev);
@@ -529,15 +625,13 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
 int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 {
 	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
-	resource_size_t free_ram_start, free_pmem_start;
 	struct cxl_port *port = cxled_to_port(cxled);
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	struct device *dev = &cxled->cxld.dev;
-	resource_size_t start, avail, skip;
+	struct resource *res, *prev = NULL;
+	resource_size_t start, avail, skip, skip_start;
 	struct resource *p, *last;
-	const struct resource *ram_res = to_ram_res(cxlds);
-	const struct resource *pmem_res = to_pmem_res(cxlds);
-	int rc;
+	int part, rc;
 
 	down_write(&cxl_dpa_rwsem);
 	if (cxled->cxld.region) {
@@ -553,47 +647,54 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
 		goto out;
 	}
 
-	for (p = ram_res->child, last = NULL; p; p = p->sibling)
-		last = p;
-	if (last)
-		free_ram_start = last->end + 1;
-	else
-		free_ram_start = ram_res->start;
+	part = -1;
+	for (int i = 0; i < cxlds->nr_partitions; i++) {
+		if (cxled->mode == cxl_part_mode(cxlds->part[i].mode)) {
+			part = i;
+			break;
+		}
+	}
 
-	for (p = pmem_res->child, last = NULL; p; p = p->sibling)
+	if (part < 0) {
+		dev_dbg(dev, "partition %d not found\n", part);
+		rc = -EBUSY;
+		goto out;
+	}
+
+	res = &cxlds->part[part].res;
+	for (p = res->child, last = NULL; p; p = p->sibling)
 		last = p;
 	if (last)
-		free_pmem_start = last->end + 1;
+		start = last->end + 1;
 	else
-		free_pmem_start = pmem_res->start;
-
-	if (cxled->mode == CXL_DECODER_RAM) {
-		start = free_ram_start;
-		avail = ram_res->end - start + 1;
-		skip = 0;
-	} else if (cxled->mode == CXL_DECODER_PMEM) {
-		resource_size_t skip_start, skip_end;
+		start = res->start;
 
-		start = free_pmem_start;
-		avail = pmem_res->end - start + 1;
-		skip_start = free_ram_start;
-
-		/*
-		 * If some pmem is already allocated, then that allocation
-		 * already handled the skip.
-		 */
-		if (pmem_res->child &&
-		    skip_start == pmem_res->child->start)
-			skip_end = skip_start - 1;
-		else
-			skip_end = start - 1;
-		skip = skip_end - skip_start + 1;
-	} else {
-		dev_dbg(dev, "mode not set\n");
-		rc = -EINVAL;
-		goto out;
+	/*
+	 * To allocate at partition N, a skip needs to be calculated for all
+	 * unallocated space at lower partition indices.
+	 *
+	 * If a partition has any allocations, the search can end because a
+	 * previous cxl_dpa_alloc() invocation is assumed to have accounted for
+	 * all previous partitions.
+	 */
+	skip_start = CXL_RESOURCE_NONE;
+	for (int i = part; i; i--) {
+		prev = &cxlds->part[i - 1].res;
+		for (p = prev->child, last = NULL; p; p = p->sibling)
+			last = p;
+		if (last) {
+			skip_start = last->end + 1;
+			break;
+		}
+		skip_start = prev->start;
 	}
 
+	avail = res->end - start + 1;
+	if (skip_start == CXL_RESOURCE_NONE)
+		skip = 0;
+	else
+		skip = res->start - skip_start;
+
 	if (size > avail) {
 		dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n",
 			&size, cxl_decoder_mode_name(cxled->mode), &avail);
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 15f549afab7c..bad99456e901 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -530,6 +530,20 @@ static inline resource_size_t cxl_pmem_size(struct cxl_dev_state *cxlds)
 	return resource_size(res);
 }
 
+/*
+ * Translate the operational mode of memory capacity to the
+ * operational mode of a decoder
+ * TODO: kill 'enum cxl_decoder_mode' to obviate this helper
+ */
+static inline enum cxl_decoder_mode cxl_part_mode(enum cxl_partition_mode mode)
+{
+	if (mode == CXL_PARTMODE_RAM)
+		return CXL_DECODER_RAM;
+	if (mode == CXL_PARTMODE_PMEM)
+		return CXL_DECODER_PMEM;
+	return CXL_DECODER_NONE;
+}
+
 static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
 {
 	return dev_get_drvdata(cxl_mbox->host);
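To make the skip arithmetic concrete, consider a hypothetical
three-partition device. The following standalone sketch (simplified
types, illustrative DPA values, no kernel APIs) applies the same
min()/remainder walk as release_skip()/request_skip() to show how one
logical skip divides at partition boundaries:

/*
 * Standalone illustration (hypothetical partition layout, simplified
 * types, no kernel APIs) of how one logical 'skip' range divides at
 * partition boundaries, using the same min()/remainder arithmetic as
 * release_skip()/request_skip().
 */
#include <stdio.h>

struct part {
	unsigned long long start, end;	/* inclusive bounds */
};

int main(void)
{
	/* partition 0: RAM, partition 1: RAM, partition 2: PMEM */
	struct part parts[] = {
		{ 0x0000, 0x0fff },
		{ 0x1000, 0x1fff },
		{ 0x2000, 0x2fff },
	};
	/* skip from the first free RAM DPA up to the PMEM start */
	unsigned long long skip_start = 0x800;
	unsigned long long skip_rem = 0x2000 - 0x800;

	for (int i = 0; i < 3 && skip_rem; i++) {
		const struct part *p = &parts[i];
		unsigned long long skip_end, skip_size;

		if (skip_start < p->start || skip_start > p->end)
			continue;

		/* clamp this piece of the skip to the partition's end */
		skip_end = p->end < skip_start + skip_rem - 1 ?
			   p->end : skip_start + skip_rem - 1;
		skip_size = skip_end - skip_start + 1;
		printf("partition %d: skip [%#llx, %#llx] (%#llx bytes)\n",
		       i, skip_start, skip_end, skip_size);
		skip_start += skip_size;
		skip_rem -= skip_size;
	}
	return 0;
}

With these values, partition 0 contributes [0x800, 0xfff] and partition
1 contributes [0x1000, 0x1fff]: the skip spans the tail of partition 0
plus all of partition 1, exactly the > 2 partition case the
request_skip() kernel-doc describes.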
cxl_dpa_alloc() is a hard-coded nest of assumptions around PMEM
allocations being distinct from RAM allocations in specific ways, when
in practice the allocation rules are only relative to DPA partition
index.

The rules for cxl_dpa_alloc() are:

- allocations can only come from 1 partition

- if allocating at partition-index-N, all free space in partitions less
  than partition-index-N must be skipped over

Use the new 'struct cxl_dpa_partition' array to support allocation with
an arbitrary number of DPA partitions on the device.

A follow-on patch can go further to clean up the 'enum cxl_decoder_mode'
concept and supersede it with looking up the memory properties from
partition metadata. Until then, cxl_part_mode() temporarily bridges code
that looks up partitions by @cxled->mode.

Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alejandro Lucero <alucerop@amd.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/hdm.c | 215 +++++++++++++++++++++++++++++++++++-------------
 drivers/cxl/cxlmem.h   |  14 +++
 2 files changed, 172 insertions(+), 57 deletions(-)
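For completeness, a standalone sketch (simplified enums and a
hypothetical two-partition RAM+PMEM layout; not the kernel's types) of
the bridge that cxl_part_mode() provides: map a partition's operational
mode to a decoder mode, then find the partition index whose mode matches
the decoder, as cxl_dpa_alloc() does above.

/*
 * Standalone sketch (simplified enums, hypothetical two-partition
 * device; not the kernel's types) of the cxl_part_mode() bridge and
 * the mode-to-partition-index lookup in cxl_dpa_alloc().
 */
#include <stdio.h>

enum part_mode { PARTMODE_RAM, PARTMODE_PMEM };
enum decoder_mode { DECODER_NONE, DECODER_RAM, DECODER_PMEM };

static enum decoder_mode part_to_decoder_mode(enum part_mode mode)
{
	if (mode == PARTMODE_RAM)
		return DECODER_RAM;
	if (mode == PARTMODE_PMEM)
		return DECODER_PMEM;
	return DECODER_NONE;
}

int main(void)
{
	enum part_mode parts[] = { PARTMODE_RAM, PARTMODE_PMEM };
	enum decoder_mode want = DECODER_PMEM;	/* decoder's requested mode */
	int part = -1;

	/* scan partitions in DPA order for the first matching mode */
	for (int i = 0; i < 2; i++) {
		if (want == part_to_decoder_mode(parts[i])) {
			part = i;
			break;
		}
	}
	printf("decoder mode %d -> partition index %d\n", want, part);
	return 0;
}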