diff mbox series

[RFC,05/15] cxl/acpi: Reserve CXL resources from request_free_mem_region

Message ID 20220413183720.2444089-6-ben.widawsky@intel.com
State New, archived
Headers show
Series Region driver | expand

Commit Message

Ben Widawsky April 13, 2022, 6:37 p.m. UTC
Define an API which allows CXL drivers to manage CXL address space.
CXL is unique in that the address space and various properties are only
known after CXL drivers come up, and therefore cannot be part of core
memory enumeration.

Compute Express Link 2.0 [ECN] defines a concept called CXL Fixed Memory
Window Structures (CFMWS). Each CFMWS conveys a region of host physical
address (HPA) space which has certain properties that are familiar to
CXL, mainly interleave properties, and restrictions, such as
persistence. The HPA ranges therefore should be owned, or at least
guided by the relevant CXL driver, cxl_acpi [1].

It would be desirable to simply insert this address space into
iomem_resource with a new flag to denote this is CXL memory. This would
permit request_free_mem_region() to be reused for CXL memory provided it
learned some new tricks. For that, it is tempting to simply use
insert_resource(). The API was designed specifically for cases where new
devices may offer new address space. This cannot work in the general
case. Boot firmware can pass, some, none, or all of the CFMWS range as
various types of memory to the kernel, and this may be left alone,
merged, or even expanded. As a result iomem_resource may intersect CFMWS
regions in ways insert_resource cannot handle [2]. Similar reasoning
applies to allocate_resource().

With the insert_resource option out, the only reasonable approach left
is to let the CXL driver manage the address space independently of
iomem_resource and attempt to prevent users of device private memory
APIs from using CXL memory. In the case where cxl_acpi comes up first,
the new API allows cxl to block use of any CFMWS defined address space
by assuming everything above the highest CFMWS entry is fair game. It is
expected that this effectively will prevent usage of device private
memory, but if such behavior is undesired, cxl_acpi can be blocked from
loading, or unloaded. When device private memory is used before CXL
comes up, or, there are intersections as described above, the CXL driver
will have to make sure to not reuse sysram that is BUSY.

[1]: The specification defines enumeration via ACPI, however, one could
envision devicetree, or some other hardcoded mechanisms for doing the
same thing.

[2]: A common way to hit this case is when BIOS creates a volatile
region with extra space for hotplug. In this case, you're likely to have

|<--------------HPA space---------------------->|
|<---iomem_resource -->|
| DDR  | CXL Volatile  |
|      | CFMWS for volatile w/ hotplug |

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 drivers/cxl/acpi.c     | 26 ++++++++++++++++++++++++++
 include/linux/ioport.h |  1 +
 kernel/resource.c      | 11 ++++++++++-
 3 files changed, 37 insertions(+), 1 deletion(-)

Comments

Dan Williams April 18, 2022, 4:42 p.m. UTC | #1
[ add the usual HMM suspects Christoph, Jason, and John ]

On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> Define an API which allows CXL drivers to manage CXL address space.
> CXL is unique in that the address space and various properties are only
> known after CXL drivers come up, and therefore cannot be part of core
> memory enumeration.

I think this buries the lead on the problem introduced by
MEMORY_DEVICE_PRIVATE in the first place. Let's revisit that history
before diving into what CXL needs.

---

Commit 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using
ZONE_DEVICE") introduced the concept of MEMORY_DEVICE_PRIVATE. At its
core MEMORY_DEVICE_PRIVATE uses the ZONE_DEVICE capability to annotate
an "unused" physical address range with 'struct page' for the purpose
of coordinating migration of buffers onto and off of a GPU /
accelerator. The determination of "unused" was based on a heuristic,
not a guarantee, that any address range not expressly conveyed in the
platform firmware map of the system can be repurposed for software
use. The CXL Fixed Memory Windows Structure  (CFMWS) definition
explicitly breaks the assumptions of that heuristic.

---

...and then jump into what CFMWS is and the proposal to coordinate
with request_free_mem_region().


>
> Compute Express Link 2.0 [ECN] defines a concept called CXL Fixed Memory
> Window Structures (CFMWS). Each CFMWS conveys a region of host physical
> address (HPA) space which has certain properties that are familiar to
> CXL, mainly interleave properties, and restrictions, such as
> persistence. The HPA ranges therefore should be owned, or at least
> guided by the relevant CXL driver, cxl_acpi [1].
>
> It would be desirable to simply insert this address space into
> iomem_resource with a new flag to denote this is CXL memory. This would
> permit request_free_mem_region() to be reused for CXL memory provided it
> learned some new tricks. For that, it is tempting to simply use
> insert_resource(). The API was designed specifically for cases where new
> devices may offer new address space. This cannot work in the general
> case. Boot firmware can pass, some, none, or all of the CFMWS range as
> various types of memory to the kernel, and this may be left alone,
> merged, or even expanded.

s/expanded/expanded as the memory map is parsed and reconciled/

> As a result iomem_resource may intersect CFMWS
> regions in ways insert_resource cannot handle [2]. Similar reasoning
> applies to allocate_resource().
>
> With the insert_resource option out, the only reasonable approach left
> is to let the CXL driver manage the address space independently of
> iomem_resource and attempt to prevent users of device private memory

s/device private memory/MEMORY_DEVICE_PRIVATE/

> APIs from using CXL memory. In the case where cxl_acpi comes up first,
> the new API allows cxl to block use of any CFMWS defined address space
> by assuming everything above the highest CFMWS entry is fair game. It is
> expected that this effectively will prevent usage of device private
> memory,

No, only if CFMWS consumes the full 64-bit address space which is
unlikely. It's also unlikely going forward to need
MEMORY_DEVICE_PRIVATE when hardware supports CXL for fully coherent
migration of buffers onto and off of an accelearator.

> but if such behavior is undesired, cxl_acpi can be blocked from
> loading, or unloaded.

I would just say if MEMORY_DEVICE_PRIVATE needs exceed the memory
space left over by CXL then the loading of the dynamic CXL address
space allocation infrastructure can be deferred until after
MEMORY_DEVICE_PRIVATE consumers have

> When device private memory is used before CXL
> comes up, or, there are intersections as described above, the CXL driver
> will have to make sure to not reuse sysram that is BUSY.
>
> [1]: The specification defines enumeration via ACPI, however, one could
> envision devicetree, or some other hardcoded mechanisms for doing the
> same thing.
>
> [2]: A common way to hit this case is when BIOS creates a volatile
> region with extra space for hotplug. In this case, you're likely to have
>
> |<--------------HPA space---------------------->|
> |<---iomem_resource -->|
> | DDR  | CXL Volatile  |
> |      | CFMWS for volatile w/ hotplug |
>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
>  drivers/cxl/acpi.c     | 26 ++++++++++++++++++++++++++
>  include/linux/ioport.h |  1 +
>  kernel/resource.c      | 11 ++++++++++-
>  3 files changed, 37 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index 9b69955b90cb..0870904fe4b5 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -76,6 +76,7 @@ static int cxl_acpi_cfmws_verify(struct device *dev,
>  struct cxl_cfmws_context {
>         struct device *dev;
>         struct cxl_port *root_port;
> +       struct acpi_cedt_cfmws *high_cfmws;

Seems more straightforward to track the max 'end' address seen so far
rather than the "highest" cfmws entry.

>  };
>
>  static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
> @@ -126,6 +127,14 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
>                         cfmws->base_hpa + cfmws->window_size - 1);
>                 return 0;
>         }
> +
> +       if (ctx->high_cfmws) {
> +               if (cfmws->base_hpa > ctx->high_cfmws->base_hpa)
> +                       ctx->high_cfmws = cfmws;

I'd expect:

end = cfmws->base_hpa + window_size;
if (ctx->cfmws_max < end)
   ctx->cfmws_max = end;

> +       } else {
> +               ctx->high_cfmws = cfmws;
> +       }
> +
>         dev_dbg(dev, "add: %s node: %d range %#llx-%#llx\n",
>                 dev_name(&cxld->dev), phys_to_target_node(cxld->range.start),
>                 cfmws->base_hpa, cfmws->base_hpa + cfmws->window_size - 1);
> @@ -299,6 +308,7 @@ static int cxl_acpi_probe(struct platform_device *pdev)
>         ctx = (struct cxl_cfmws_context) {
>                 .dev = host,
>                 .root_port = root_port,
> +               .high_cfmws = NULL,
>         };
>         acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, cxl_parse_cfmws, &ctx);
>
> @@ -317,10 +327,25 @@ static int cxl_acpi_probe(struct platform_device *pdev)
>         if (rc < 0)
>                 return rc;
>
> +       if (ctx.high_cfmws) {

Even if there are zero CFMWS entries there will always be a max end
address to call set_request_free_min_base().

> +               resource_size_t end =
> +                       ctx.high_cfmws->base_hpa + ctx.high_cfmws->window_size;
> +               dev_dbg(host,
> +                       "Disabling free device private regions below %#llx\n",
> +                       end);
> +               set_request_free_min_base(end);
> +       }
> +
>         /* In case PCI is scanned before ACPI re-trigger memdev attach */
>         return cxl_bus_rescan();
>  }
>
> +static int cxl_acpi_remove(struct platform_device *pdev)

No need for a .remove() method, just use devm_add_action_or_reset() to
unreserve CXL address space as cxl_acpi unloads.

> +{
> +       set_request_free_min_base(0);
> +       return 0;
> +}
> +
>  static const struct acpi_device_id cxl_acpi_ids[] = {
>         { "ACPI0017" },
>         { },
> @@ -329,6 +354,7 @@ MODULE_DEVICE_TABLE(acpi, cxl_acpi_ids);
>
>  static struct platform_driver cxl_acpi_driver = {
>         .probe = cxl_acpi_probe,
> +       .remove = cxl_acpi_remove,
>         .driver = {
>                 .name = KBUILD_MODNAME,
>                 .acpi_match_table = cxl_acpi_ids,
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index ec5f71f7135b..dc41e4be5635 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -325,6 +325,7 @@ extern int
>  walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
>                     void *arg, int (*func)(struct resource *, void *));
>
> +void set_request_free_min_base(resource_size_t val);

Shouldn't there also be a static inline empty routine in the
CONFIG_DEVICE_PRIVATE=n case?

>  struct resource *devm_request_free_mem_region(struct device *dev,
>                 struct resource *base, unsigned long size);
>  struct resource *request_free_mem_region(struct resource *base,
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 34eaee179689..a4750689e529 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1774,6 +1774,14 @@ void resource_list_free(struct list_head *head)
>  EXPORT_SYMBOL(resource_list_free);
>
>  #ifdef CONFIG_DEVICE_PRIVATE
> +static resource_size_t request_free_min_base;
> +
> +void set_request_free_min_base(resource_size_t val)
> +{
> +       request_free_min_base = val;
> +}
> +EXPORT_SYMBOL_GPL(set_request_free_min_base);
> +
>  static struct resource *__request_free_mem_region(struct device *dev,
>                 struct resource *base, unsigned long size, const char *name)
>  {
> @@ -1799,7 +1807,8 @@ static struct resource *__request_free_mem_region(struct device *dev,
>         }
>
>         write_lock(&resource_lock);
> -       for (; addr > size && addr >= base->start; addr -= size) {
> +       for (; addr > size && addr >= max(base->start, request_free_min_base);
> +            addr -= size) {
>                 if (__region_intersects(addr, size, 0, IORES_DESC_NONE) !=
>                                 REGION_DISJOINT)
>                         continue;
> --
> 2.35.1
>
Jason Gunthorpe April 19, 2022, 4:43 p.m. UTC | #2
On Mon, Apr 18, 2022 at 09:42:00AM -0700, Dan Williams wrote:
> [ add the usual HMM suspects Christoph, Jason, and John ]
> 
> On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> >
> > Define an API which allows CXL drivers to manage CXL address space.
> > CXL is unique in that the address space and various properties are only
> > known after CXL drivers come up, and therefore cannot be part of core
> > memory enumeration.
> 
> I think this buries the lead on the problem introduced by
> MEMORY_DEVICE_PRIVATE in the first place. Let's revisit that history
> before diving into what CXL needs.
> 
> 
> Commit 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using
> ZONE_DEVICE") introduced the concept of MEMORY_DEVICE_PRIVATE. At its
> core MEMORY_DEVICE_PRIVATE uses the ZONE_DEVICE capability to annotate
> an "unused" physical address range with 'struct page' for the purpose
> of coordinating migration of buffers onto and off of a GPU /
> accelerator. The determination of "unused" was based on a heuristic,
> not a guarantee, that any address range not expressly conveyed in the
> platform firmware map of the system can be repurposed for software
> use. The CXL Fixed Memory Windows Structure  (CFMWS) definition
> explicitly breaks the assumptions of that heuristic.

So CXL defines an address map that is not part of the FW list?

> > It would be desirable to simply insert this address space into
> > iomem_resource with a new flag to denote this is CXL memory. This would
> > permit request_free_mem_region() to be reused for CXL memory provided it
> > learned some new tricks. For that, it is tempting to simply use
> > insert_resource(). The API was designed specifically for cases where new
> > devices may offer new address space. This cannot work in the general
> > case. Boot firmware can pass, some, none, or all of the CFMWS range as
> > various types of memory to the kernel, and this may be left alone,
> > merged, or even expanded.

And then we understand that on CXL the FW is might pass stuff that
intersects with the actual CXL ranges?

> > As a result iomem_resource may intersect CFMWS
> > regions in ways insert_resource cannot handle [2]. Similar reasoning
> > applies to allocate_resource().
> >
> > With the insert_resource option out, the only reasonable approach left
> > is to let the CXL driver manage the address space independently of
> > iomem_resource and attempt to prevent users of device private memory

And finally due to all these FW problems we are going to make a 2nd
allocator for physical address space and just disable the normal one?

Then since DEVICE_PRIVATE is a notable user of this allocator we now
understand it becomes broken?

Sounds horrible. IMHO you should fix the normal allocator somehow to
understand that the ranges from FW have been reprogrammed by Linux and
not try to build a whole different allocator in CXL code.

Jason
Dan Williams April 19, 2022, 9:50 p.m. UTC | #3
On Tue, Apr 19, 2022 at 9:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Apr 18, 2022 at 09:42:00AM -0700, Dan Williams wrote:
> > [ add the usual HMM suspects Christoph, Jason, and John ]
> >
> > On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > >
> > > Define an API which allows CXL drivers to manage CXL address space.
> > > CXL is unique in that the address space and various properties are only
> > > known after CXL drivers come up, and therefore cannot be part of core
> > > memory enumeration.
> >
> > I think this buries the lead on the problem introduced by
> > MEMORY_DEVICE_PRIVATE in the first place. Let's revisit that history
> > before diving into what CXL needs.
> >
> >
> > Commit 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using
> > ZONE_DEVICE") introduced the concept of MEMORY_DEVICE_PRIVATE. At its
> > core MEMORY_DEVICE_PRIVATE uses the ZONE_DEVICE capability to annotate
> > an "unused" physical address range with 'struct page' for the purpose
> > of coordinating migration of buffers onto and off of a GPU /
> > accelerator. The determination of "unused" was based on a heuristic,
> > not a guarantee, that any address range not expressly conveyed in the
> > platform firmware map of the system can be repurposed for software
> > use. The CXL Fixed Memory Windows Structure  (CFMWS) definition
> > explicitly breaks the assumptions of that heuristic.
>
> So CXL defines an address map that is not part of the FW list?

It defines a super-set of 'potential' address space and a subset that
is active in the FW list. It's similar to memory hotplug where an
address range may come online after the fact, but unlike ACPI memory
hotplug, FW is not involved in the hotplug path, and FW cannot predict
what address ranges will come online. For example ACPI hotplug knows
in advance to publish the ranges that can experience an online /
insert event, CXL has many more degrees of freedom.

>
> > > It would be desirable to simply insert this address space into
> > > iomem_resource with a new flag to denote this is CXL memory. This would
> > > permit request_free_mem_region() to be reused for CXL memory provided it
> > > learned some new tricks. For that, it is tempting to simply use
> > > insert_resource(). The API was designed specifically for cases where new
> > > devices may offer new address space. This cannot work in the general
> > > case. Boot firmware can pass, some, none, or all of the CFMWS range as
> > > various types of memory to the kernel, and this may be left alone,
> > > merged, or even expanded.
>
> And then we understand that on CXL the FW is might pass stuff that
> intersects with the actual CXL ranges?
>
> > > As a result iomem_resource may intersect CFMWS
> > > regions in ways insert_resource cannot handle [2]. Similar reasoning
> > > applies to allocate_resource().
> > >
> > > With the insert_resource option out, the only reasonable approach left
> > > is to let the CXL driver manage the address space independently of
> > > iomem_resource and attempt to prevent users of device private memory
>
> And finally due to all these FW problems we are going to make a 2nd
> allocator for physical address space and just disable the normal one?

No, or I am misunderstanding this comment. The CXL address space
allocator is managing space that can be populated and become an
iomem_resource. So it's not supplanting iomem_resource it is
coordinating dynamic extensions to the FW map.

> Then since DEVICE_PRIVATE is a notable user of this allocator we now
> understand it becomes broken?
>
> Sounds horrible. IMHO you should fix the normal allocator somehow to
> understand that the ranges from FW have been reprogrammed by Linux

There is no reprogramming of the ranges from FW. CXL memory that is
mapped as System RAM at boot will have the CXL decode configuration
locked in all the participating devices. The remaining CXL decode
space is then available for dynamic reconfiguration of CXL resources
from the devices that the FW explicitly ignores, which is all
hot-added devices and all persistent-memory capacity.

> and
> not try to build a whole different allocator in CXL code.

I am not seeing much overlap for DEVICE_PRIVATE and CXL to share an
allocator. CXL explicitly wants ranges that have been set aside for
CXL and are related to 1 or more CXL host bridges. DEVICE_PRIVATE
wants to consume an unused physical address range to proxy
device-local-memory with no requirements on what range is chosen as
long as it does not collide with anything else.
Dan Williams April 19, 2022, 9:59 p.m. UTC | #4
On Tue, Apr 19, 2022 at 2:50 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Tue, Apr 19, 2022 at 9:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Mon, Apr 18, 2022 at 09:42:00AM -0700, Dan Williams wrote:
> > > [ add the usual HMM suspects Christoph, Jason, and John ]
> > >
> > > On Wed, Apr 13, 2022 at 11:38 AM Ben Widawsky <ben.widawsky@intel.com> wrote:
> > > >
> > > > Define an API which allows CXL drivers to manage CXL address space.
> > > > CXL is unique in that the address space and various properties are only
> > > > known after CXL drivers come up, and therefore cannot be part of core
> > > > memory enumeration.
> > >
> > > I think this buries the lead on the problem introduced by
> > > MEMORY_DEVICE_PRIVATE in the first place. Let's revisit that history
> > > before diving into what CXL needs.
> > >
> > >
> > > Commit 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using
> > > ZONE_DEVICE") introduced the concept of MEMORY_DEVICE_PRIVATE. At its
> > > core MEMORY_DEVICE_PRIVATE uses the ZONE_DEVICE capability to annotate
> > > an "unused" physical address range with 'struct page' for the purpose
> > > of coordinating migration of buffers onto and off of a GPU /
> > > accelerator. The determination of "unused" was based on a heuristic,
> > > not a guarantee, that any address range not expressly conveyed in the
> > > platform firmware map of the system can be repurposed for software
> > > use. The CXL Fixed Memory Windows Structure  (CFMWS) definition
> > > explicitly breaks the assumptions of that heuristic.
> >
> > So CXL defines an address map that is not part of the FW list?
>
> It defines a super-set of 'potential' address space and a subset that
> is active in the FW list. It's similar to memory hotplug where an
> address range may come online after the fact, but unlike ACPI memory
> hotplug, FW is not involved in the hotplug path, and FW cannot predict
> what address ranges will come online. For example ACPI hotplug knows
> in advance to publish the ranges that can experience an online /
> insert event, CXL has many more degrees of freedom.
>
> >
> > > > It would be desirable to simply insert this address space into
> > > > iomem_resource with a new flag to denote this is CXL memory. This would
> > > > permit request_free_mem_region() to be reused for CXL memory provided it
> > > > learned some new tricks. For that, it is tempting to simply use
> > > > insert_resource(). The API was designed specifically for cases where new
> > > > devices may offer new address space. This cannot work in the general
> > > > case. Boot firmware can pass, some, none, or all of the CFMWS range as
> > > > various types of memory to the kernel, and this may be left alone,
> > > > merged, or even expanded.
> >
> > And then we understand that on CXL the FW is might pass stuff that
> > intersects with the actual CXL ranges?
> >
> > > > As a result iomem_resource may intersect CFMWS
> > > > regions in ways insert_resource cannot handle [2]. Similar reasoning
> > > > applies to allocate_resource().
> > > >
> > > > With the insert_resource option out, the only reasonable approach left
> > > > is to let the CXL driver manage the address space independently of
> > > > iomem_resource and attempt to prevent users of device private memory
> >
> > And finally due to all these FW problems we are going to make a 2nd
> > allocator for physical address space and just disable the normal one?
>
> No, or I am misunderstanding this comment. The CXL address space
> allocator is managing space that can be populated and become an
> iomem_resource. So it's not supplanting iomem_resource it is
> coordinating dynamic extensions to the FW map.
>
> > Then since DEVICE_PRIVATE is a notable user of this allocator we now
> > understand it becomes broken?
> >
> > Sounds horrible. IMHO you should fix the normal allocator somehow to
> > understand that the ranges from FW have been reprogrammed by Linux
>
> There is no reprogramming of the ranges from FW. CXL memory that is
> mapped as System RAM at boot will have the CXL decode configuration
> locked in all the participating devices. The remaining CXL decode
> space is then available for dynamic reconfiguration of CXL resources
> from the devices that the FW explicitly ignores, which is all
> hot-added devices and all persistent-memory capacity.
>
> > and
> > not try to build a whole different allocator in CXL code.
>
> I am not seeing much overlap for DEVICE_PRIVATE and CXL to share an
> allocator. CXL explicitly wants ranges that have been set aside for
> CXL and are related to 1 or more CXL host bridges. DEVICE_PRIVATE
> wants to consume an unused physical address range to proxy
> device-local-memory with no requirements on what range is chosen as
> long as it does not collide with anything else.

...or are you suggesting to represent CXL free memory capacity in
iomem_resource and augment the FW list early with CXL ranges. That
seems doable, but it would only represent the free CXL ranges in
iomem_resource as the populated CXL ranges cannot have their resources
reparented after the fact, and there is plenty of code that expects
"System RAM" to be a top-level resource.
Jason Gunthorpe April 19, 2022, 11:04 p.m. UTC | #5
On Tue, Apr 19, 2022 at 02:59:46PM -0700, Dan Williams wrote:

> ...or are you suggesting to represent CXL free memory capacity in
> iomem_resource and augment the FW list early with CXL ranges. That
> seems doable, but it would only represent the free CXL ranges in
> iomem_resource as the populated CXL ranges cannot have their resources
> reparented after the fact, and there is plenty of code that expects
> "System RAM" to be a top-level resource.

Yes, something more like this. iomem_resource should represent stuff
actually in use and CXL shouldn't leave behind an 'IOW' for address
space it isn't actually able to currently use.

Your whole description sounds like the same problems PCI hotplug has
adjusting the bridge windows.

Jason
Dan Williams April 20, 2022, 12:47 a.m. UTC | #6
On Tue, Apr 19, 2022 at 4:04 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Apr 19, 2022 at 02:59:46PM -0700, Dan Williams wrote:
>
> > ...or are you suggesting to represent CXL free memory capacity in
> > iomem_resource and augment the FW list early with CXL ranges. That
> > seems doable, but it would only represent the free CXL ranges in
> > iomem_resource as the populated CXL ranges cannot have their resources
> > reparented after the fact, and there is plenty of code that expects
> > "System RAM" to be a top-level resource.
>
> Yes, something more like this. iomem_resource should represent stuff
> actually in use and CXL shouldn't leave behind an 'IOW' for address
> space it isn't actually able to currently use.

So that's the problem, these gigantic windows need to support someone
showing up unannounced with a stack of multi-terabyte devices to add
to the system. The address space is idle before that event, but it
needs to be reserved for CXL because the top-level system decode makes
mandates like "CXL cards of type X performance Y inserted underneath
CXL host-bridge Z can only use CXL address ranges 1, 4 and 5".

> Your whole description sounds like the same problems PCI hotplug has
> adjusting the bridge windows.

...but even there the base bounds (AFAICS) are coming from FW (_CRS
entries for ACPI described PCIe host bridges). So if CXL follows that
model then the entire unmapped portion of the CXL ranges should be
marked as an idle resource in iomem_resource.

The improvement that offers over this current proposal is that it
allows for global visibility of CXL hotplug resources, but it does set
up a discontinuity between FW mapped and OS mapped CXL. FW mapped will
have top-level "System RAM" resources indistinguishable from typical
DRAM while OS mapped CXL will look like this:

100000000-1ffffffff : CXL Range 0
  108000000-1ffffffff : region5
    108000000-1ffffffff : System RAM (CXL)

...even though to FW "range 0" spans across a BIOS mapped portion and
"free for OS to use" portion.
Jason Gunthorpe April 20, 2022, 2:34 p.m. UTC | #7
On Tue, Apr 19, 2022 at 05:47:56PM -0700, Dan Williams wrote:
> On Tue, Apr 19, 2022 at 4:04 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Apr 19, 2022 at 02:59:46PM -0700, Dan Williams wrote:
> >
> > > ...or are you suggesting to represent CXL free memory capacity in
> > > iomem_resource and augment the FW list early with CXL ranges. That
> > > seems doable, but it would only represent the free CXL ranges in
> > > iomem_resource as the populated CXL ranges cannot have their resources
> > > reparented after the fact, and there is plenty of code that expects
> > > "System RAM" to be a top-level resource.
> >
> > Yes, something more like this. iomem_resource should represent stuff
> > actually in use and CXL shouldn't leave behind an 'IOW' for address
> > space it isn't actually able to currently use.
> 
> So that's the problem, these gigantic windows need to support someone
> showing up unannounced with a stack of multi-terabyte devices to add
> to the system.

In my experience PCIe hotplug is already extremely rare, you may need
to do this reservation on systems with hotplug slots, but not
generally. In PCIe world the BIOS often figures this out and bridge
windows are not significantly over allocated on non-hotplug HW.

(though even PCIe has the resizable bar extension and other things
that are quite like hotplug and do trigger huge resource requirements)

> > Your whole description sounds like the same problems PCI hotplug has
> > adjusting the bridge windows.
> 
> ...but even there the base bounds (AFAICS) are coming from FW (_CRS
> entries for ACPI described PCIe host bridges). So if CXL follows that
> model then the entire unmapped portion of the CXL ranges should be
> marked as an idle resource in iomem_resource.

And possibly yes, because part of the point of this stuff is to
declare where HW is actually using the address space. So if FW has
left a host bridge decoder setup to actually consume this space then
it really has to be set aside to prevent hotplug of other bus types
from trying to claim the same address space for their own usages.

If no actual decoder is setup then it maybe it shouldn't be left as an
IOW in the resource tree. In this case it might be better to teach the
io resource allocator to leave gaps for future hotplug.

> The improvement that offers over this current proposal is that it
> allows for global visibility of CXL hotplug resources, but it does set
> up a discontinuity between FW mapped and OS mapped CXL. FW mapped will
> have top-level "System RAM" resources indistinguishable from typical
> DRAM while OS mapped CXL will look like this:

Maybe this can be reotractively fixed up in the resource tree?

Jason
Dan Williams April 20, 2022, 3:32 p.m. UTC | #8
On Wed, Apr 20, 2022 at 7:35 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Apr 19, 2022 at 05:47:56PM -0700, Dan Williams wrote:
> > On Tue, Apr 19, 2022 at 4:04 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Tue, Apr 19, 2022 at 02:59:46PM -0700, Dan Williams wrote:
> > >
> > > > ...or are you suggesting to represent CXL free memory capacity in
> > > > iomem_resource and augment the FW list early with CXL ranges. That
> > > > seems doable, but it would only represent the free CXL ranges in
> > > > iomem_resource as the populated CXL ranges cannot have their resources
> > > > reparented after the fact, and there is plenty of code that expects
> > > > "System RAM" to be a top-level resource.
> > >
> > > Yes, something more like this. iomem_resource should represent stuff
> > > actually in use and CXL shouldn't leave behind an 'IOW' for address
> > > space it isn't actually able to currently use.
> >
> > So that's the problem, these gigantic windows need to support someone
> > showing up unannounced with a stack of multi-terabyte devices to add
> > to the system.
>
> In my experience PCIe hotplug is already extremely rare, you may need
> to do this reservation on systems with hotplug slots, but not
> generally. In PCIe world the BIOS often figures this out and bridge
> windows are not significantly over allocated on non-hotplug HW.
>
> (though even PCIe has the resizable bar extension and other things
> that are quite like hotplug and do trigger huge resource requirements)
>
> > > Your whole description sounds like the same problems PCI hotplug has
> > > adjusting the bridge windows.
> >
> > ...but even there the base bounds (AFAICS) are coming from FW (_CRS
> > entries for ACPI described PCIe host bridges). So if CXL follows that
> > model then the entire unmapped portion of the CXL ranges should be
> > marked as an idle resource in iomem_resource.
>
> And possibly yes, because part of the point of this stuff is to
> declare where HW is actually using the address space. So if FW has
> left a host bridge decoder setup to actually consume this space then
> it really has to be set aside to prevent hotplug of other bus types
> from trying to claim the same address space for their own usages.
>
> If no actual decoder is setup then it maybe it shouldn't be left as an
> IOW in the resource tree. In this case it might be better to teach the
> io resource allocator to leave gaps for future hotplug.

Yeah, it is the former. These CXL ranges are all actively decoded by
the CPU complex memory controller as "this range goes to DDR and this
other range is interleaved across this set of CXL host bridges". Even
if there is nothing behind those host bridges there is hardware
actively routing requests that fall into those ranges to those
downstream devices.

> > The improvement that offers over this current proposal is that it
> > allows for global visibility of CXL hotplug resources, but it does set
> > up a discontinuity between FW mapped and OS mapped CXL. FW mapped will
> > have top-level "System RAM" resources indistinguishable from typical
> > DRAM while OS mapped CXL will look like this:
>
> Maybe this can be reotractively fixed up in the resource tree?

I had been discouraged to go that route considering some code only
scans top-level iomem_resource entries, but it is probably better to
try to fix that legacy code to operate correctly when System RAM is
parented by a CXL Range.
diff mbox series

Patch

diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index 9b69955b90cb..0870904fe4b5 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -76,6 +76,7 @@  static int cxl_acpi_cfmws_verify(struct device *dev,
 struct cxl_cfmws_context {
 	struct device *dev;
 	struct cxl_port *root_port;
+	struct acpi_cedt_cfmws *high_cfmws;
 };
 
 static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
@@ -126,6 +127,14 @@  static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
 			cfmws->base_hpa + cfmws->window_size - 1);
 		return 0;
 	}
+
+	if (ctx->high_cfmws) {
+		if (cfmws->base_hpa > ctx->high_cfmws->base_hpa)
+			ctx->high_cfmws = cfmws;
+	} else {
+		ctx->high_cfmws = cfmws;
+	}
+
 	dev_dbg(dev, "add: %s node: %d range %#llx-%#llx\n",
 		dev_name(&cxld->dev), phys_to_target_node(cxld->range.start),
 		cfmws->base_hpa, cfmws->base_hpa + cfmws->window_size - 1);
@@ -299,6 +308,7 @@  static int cxl_acpi_probe(struct platform_device *pdev)
 	ctx = (struct cxl_cfmws_context) {
 		.dev = host,
 		.root_port = root_port,
+		.high_cfmws = NULL,
 	};
 	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, cxl_parse_cfmws, &ctx);
 
@@ -317,10 +327,25 @@  static int cxl_acpi_probe(struct platform_device *pdev)
 	if (rc < 0)
 		return rc;
 
+	if (ctx.high_cfmws) {
+		resource_size_t end =
+			ctx.high_cfmws->base_hpa + ctx.high_cfmws->window_size;
+		dev_dbg(host,
+			"Disabling free device private regions below %#llx\n",
+			end);
+		set_request_free_min_base(end);
+	}
+
 	/* In case PCI is scanned before ACPI re-trigger memdev attach */
 	return cxl_bus_rescan();
 }
 
+static int cxl_acpi_remove(struct platform_device *pdev)
+{
+	set_request_free_min_base(0);
+	return 0;
+}
+
 static const struct acpi_device_id cxl_acpi_ids[] = {
 	{ "ACPI0017" },
 	{ },
@@ -329,6 +354,7 @@  MODULE_DEVICE_TABLE(acpi, cxl_acpi_ids);
 
 static struct platform_driver cxl_acpi_driver = {
 	.probe = cxl_acpi_probe,
+	.remove = cxl_acpi_remove,
 	.driver = {
 		.name = KBUILD_MODNAME,
 		.acpi_match_table = cxl_acpi_ids,
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index ec5f71f7135b..dc41e4be5635 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -325,6 +325,7 @@  extern int
 walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
 		    void *arg, int (*func)(struct resource *, void *));
 
+void set_request_free_min_base(resource_size_t val);
 struct resource *devm_request_free_mem_region(struct device *dev,
 		struct resource *base, unsigned long size);
 struct resource *request_free_mem_region(struct resource *base,
diff --git a/kernel/resource.c b/kernel/resource.c
index 34eaee179689..a4750689e529 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1774,6 +1774,14 @@  void resource_list_free(struct list_head *head)
 EXPORT_SYMBOL(resource_list_free);
 
 #ifdef CONFIG_DEVICE_PRIVATE
+static resource_size_t request_free_min_base;
+
+void set_request_free_min_base(resource_size_t val)
+{
+	request_free_min_base = val;
+}
+EXPORT_SYMBOL_GPL(set_request_free_min_base);
+
 static struct resource *__request_free_mem_region(struct device *dev,
 		struct resource *base, unsigned long size, const char *name)
 {
@@ -1799,7 +1807,8 @@  static struct resource *__request_free_mem_region(struct device *dev,
 	}
 
 	write_lock(&resource_lock);
-	for (; addr > size && addr >= base->start; addr -= size) {
+	for (; addr > size && addr >= max(base->start, request_free_min_base);
+	     addr -= size) {
 		if (__region_intersects(addr, size, 0, IORES_DESC_NONE) !=
 				REGION_DISJOINT)
 			continue;