Message ID | 20221026004737.3646-1-gregory.price@memverge.com |
---|---|
Headers | show |
Series | Multi-Region and Volatile Memory support for CXL Type-3 Devices | expand |
On Tue, Oct 25, 2022 at 08:47:33PM -0400, Gregory Price wrote: > Submitted as an extention to the multi-feature branch maintained > by Jonathan Cameron at: > https://urldefense.com/v3/__https://gitlab.com/jic23/qemu/-/tree/cxl-2022-10-24__;!!EwVzqGoTKBqv-0DWAJBm!RyiGL5B1XmQnVFwgxikKJeosPMKtoO1cTr61gIq8fwqfju8l4cbGZGwAEkKXIJB-Dbkfi_LNN2rGCbzMISz65cTxpAxI9pQ$ > > > Summary of Changes: > 1) E820 CFMW Bug fix. > 2) Add CXL_CAPACITY_MULTIPLIER definition to replace magic numbers > 3) Multi-Region and Volatile Memory support for CXL Type-3 Devices > 4) CXL Type-3 SRAT Generation when NUMA node is attached to memdev > > > Regarding the E820 fix > * This bugfix is required for memory regions to work on x86 > * input from Dan Williams and others suggest that E820 entry for > the CFMW should not exist, as it is expected to be dynamically > assigned at runtime. If this entry exists, it instead blocks > region creation by nature of the memory region being marked as > reserved. For CXL 2.0 it is my understanding that volatile capacity present at boot will be advertised by the firmware. In the absence of EFI I would assume this would be provided in the e820 map. Is the region driver meant to cover volatile capacity present at boot? I was under the impression that it would be used for hot added volatile memory. It would be good to cover all of these assumptions for the e820 fix. Lastly it is my understanding that the region driver does not have support for volatile memory. It would be great to add that functionality if we make this change in QEMU. > > Regarding Multi-Region and Volatile Memory > * Developed with input from Jonathan Cameron and Davidlohr Bueso. > > Regarding SRAT Generation for Type-3 Devices > * Co-Developed by Davidlohr Bueso. Built from his base patch and > extended to work with both volatile and persistent regions. > * This can be used to demonstrate static type-3 device mapping and > testing numa-access to type-3 device memory regions. > > > This series brings 3 features to CXL Type-3 Devices: > 1) Volatile Memory Region support > 2) Multi-Region support (1 Volatile, 1 Persistent) > 3) (optional) SRAT Entry generation for type-3 device regions > > In this series we implement multi-region and volatile region support > through 7 major changes to CXL devices > 1) The HostMemoryBackend [hostmem] has been replaced by two > [hostvmem] and [hostpmem] to store volatile and persistent > memory respectively > 2) The single AddressSpace has been replaced by two AddressSpaces > [hostvmem_as] and [hostpmem_as] to map respective memdevs. > 3) Each memory region size and total region are stored separately > 4) The CDAT and DVSEC memory map entries have been updated: > a) if vmem is present, vmem is mapped at DPA(0) > b) if pmem is present > i) and vmem is present, pmem is mapped at DPA(vmem->size) > ii) else, pmem is mapped at DPA(0) > c) partitioning of pmem is not supported in this patch set but > has been discussed and this design should suffice. > 5) Read/Write functions have been updated to access AddressSpaces > according to the mapping described in #4 > 6) cxl-mailbox has been updated to report the respective size of > volatile and persistent memory regions > 7) SRAT entries may optionally be generated by manually assigning > memdevs to a cpuless numa node > > To support the Device Physical Address (DPA) Mapping decisions, see > CXL Spec (3.0) Section 8.2.9.8.2.0 - Get Partition Info: > Active Volatile Memory > The device shall provide this volatile capacity starting at DPA 0 > Active Persistent Memory > The device shall provide this persistent capacity starting at the > DPA immediately following the volatile capacity > > Partitioning of Persistent Memory regions may be supported on > following patch sets. > > > Gregory Price (4): > hw/i386/pc.c: CXL Fixed Memory Window should not reserve e820 in bios > hw/cxl: Add CXL_CAPACITY_MULTIPLIER definition > hw/cxl: Multi-Region CXL Type-3 Devices (Volatile and Persistent) > hw/acpi/cxl.c: Fill in SRAT for vmem/pmem if NUMA node is assigned > > docs/system/devices/cxl.rst | 74 ++++++++-- > hw/acpi/cxl.c | 67 +++++++++ > hw/cxl/cxl-mailbox-utils.c | 23 +-- > hw/i386/acpi-build.c | 4 + > hw/i386/pc.c | 2 - > hw/mem/cxl_type3.c | 274 +++++++++++++++++++++++++++--------- > include/hw/acpi/cxl.h | 1 + > include/hw/cxl/cxl_device.h | 11 +- > tests/qtest/cxl-test.c | 111 +++++++++++---- > 9 files changed, 443 insertions(+), 124 deletions(-) > > -- > 2.37.3 >
On Tue, Oct 25, 2022 at 08:47:33PM -0400, Gregory Price wrote: > Submitted as an extention to the multi-feature branch maintained > by Jonathan Cameron at: > https://gitlab.com/jic23/qemu/-/tree/cxl-2022-10-24 BTW pls set subject prefix for all patches, and put it before the patch #. -v parameter to format-patch will do this for you. > > Summary of Changes: > 1) E820 CFMW Bug fix. > 2) Add CXL_CAPACITY_MULTIPLIER definition to replace magic numbers > 3) Multi-Region and Volatile Memory support for CXL Type-3 Devices > 4) CXL Type-3 SRAT Generation when NUMA node is attached to memdev > > > Regarding the E820 fix > * This bugfix is required for memory regions to work on x86 > * input from Dan Williams and others suggest that E820 entry for > the CFMW should not exist, as it is expected to be dynamically > assigned at runtime. If this entry exists, it instead blocks > region creation by nature of the memory region being marked as > reserved. > > Regarding Multi-Region and Volatile Memory > * Developed with input from Jonathan Cameron and Davidlohr Bueso. > > Regarding SRAT Generation for Type-3 Devices > * Co-Developed by Davidlohr Bueso. Built from his base patch and > extended to work with both volatile and persistent regions. > * This can be used to demonstrate static type-3 device mapping and > testing numa-access to type-3 device memory regions. > > > This series brings 3 features to CXL Type-3 Devices: > 1) Volatile Memory Region support > 2) Multi-Region support (1 Volatile, 1 Persistent) > 3) (optional) SRAT Entry generation for type-3 device regions > > In this series we implement multi-region and volatile region support > through 7 major changes to CXL devices > 1) The HostMemoryBackend [hostmem] has been replaced by two > [hostvmem] and [hostpmem] to store volatile and persistent > memory respectively > 2) The single AddressSpace has been replaced by two AddressSpaces > [hostvmem_as] and [hostpmem_as] to map respective memdevs. > 3) Each memory region size and total region are stored separately > 4) The CDAT and DVSEC memory map entries have been updated: > a) if vmem is present, vmem is mapped at DPA(0) > b) if pmem is present > i) and vmem is present, pmem is mapped at DPA(vmem->size) > ii) else, pmem is mapped at DPA(0) > c) partitioning of pmem is not supported in this patch set but > has been discussed and this design should suffice. > 5) Read/Write functions have been updated to access AddressSpaces > according to the mapping described in #4 > 6) cxl-mailbox has been updated to report the respective size of > volatile and persistent memory regions > 7) SRAT entries may optionally be generated by manually assigning > memdevs to a cpuless numa node > > To support the Device Physical Address (DPA) Mapping decisions, see > CXL Spec (3.0) Section 8.2.9.8.2.0 - Get Partition Info: > Active Volatile Memory > The device shall provide this volatile capacity starting at DPA 0 > Active Persistent Memory > The device shall provide this persistent capacity starting at the > DPA immediately following the volatile capacity > > Partitioning of Persistent Memory regions may be supported on > following patch sets. > > > Gregory Price (4): > hw/i386/pc.c: CXL Fixed Memory Window should not reserve e820 in bios > hw/cxl: Add CXL_CAPACITY_MULTIPLIER definition > hw/cxl: Multi-Region CXL Type-3 Devices (Volatile and Persistent) > hw/acpi/cxl.c: Fill in SRAT for vmem/pmem if NUMA node is assigned > > docs/system/devices/cxl.rst | 74 ++++++++-- > hw/acpi/cxl.c | 67 +++++++++ > hw/cxl/cxl-mailbox-utils.c | 23 +-- > hw/i386/acpi-build.c | 4 + > hw/i386/pc.c | 2 - > hw/mem/cxl_type3.c | 274 +++++++++++++++++++++++++++--------- > include/hw/acpi/cxl.h | 1 + > include/hw/cxl/cxl_device.h | 11 +- > tests/qtest/cxl-test.c | 111 +++++++++++---- > 9 files changed, 443 insertions(+), 124 deletions(-) > > -- > 2.37.3
On Tue, Oct 25, 2022 at 08:47:33PM -0400, Gregory Price wrote: > Submitted as an extention to the multi-feature branch maintained > by Jonathan Cameron at: > https://gitlab.com/jic23/qemu/-/tree/cxl-2022-10-24 > I am not supposed to merge this patchset yet, right? That branch has a bunch of patches not yet posted for review. Pls add "RFC" in the subject when that is the case. Thanks! > Summary of Changes: > 1) E820 CFMW Bug fix. > 2) Add CXL_CAPACITY_MULTIPLIER definition to replace magic numbers > 3) Multi-Region and Volatile Memory support for CXL Type-3 Devices > 4) CXL Type-3 SRAT Generation when NUMA node is attached to memdev > > > Regarding the E820 fix > * This bugfix is required for memory regions to work on x86 > * input from Dan Williams and others suggest that E820 entry for > the CFMW should not exist, as it is expected to be dynamically > assigned at runtime. If this entry exists, it instead blocks > region creation by nature of the memory region being marked as > reserved. > > Regarding Multi-Region and Volatile Memory > * Developed with input from Jonathan Cameron and Davidlohr Bueso. > > Regarding SRAT Generation for Type-3 Devices > * Co-Developed by Davidlohr Bueso. Built from his base patch and > extended to work with both volatile and persistent regions. > * This can be used to demonstrate static type-3 device mapping and > testing numa-access to type-3 device memory regions. > > > This series brings 3 features to CXL Type-3 Devices: > 1) Volatile Memory Region support > 2) Multi-Region support (1 Volatile, 1 Persistent) > 3) (optional) SRAT Entry generation for type-3 device regions > > In this series we implement multi-region and volatile region support > through 7 major changes to CXL devices > 1) The HostMemoryBackend [hostmem] has been replaced by two > [hostvmem] and [hostpmem] to store volatile and persistent > memory respectively > 2) The single AddressSpace has been replaced by two AddressSpaces > [hostvmem_as] and [hostpmem_as] to map respective memdevs. > 3) Each memory region size and total region are stored separately > 4) The CDAT and DVSEC memory map entries have been updated: > a) if vmem is present, vmem is mapped at DPA(0) > b) if pmem is present > i) and vmem is present, pmem is mapped at DPA(vmem->size) > ii) else, pmem is mapped at DPA(0) > c) partitioning of pmem is not supported in this patch set but > has been discussed and this design should suffice. > 5) Read/Write functions have been updated to access AddressSpaces > according to the mapping described in #4 > 6) cxl-mailbox has been updated to report the respective size of > volatile and persistent memory regions > 7) SRAT entries may optionally be generated by manually assigning > memdevs to a cpuless numa node > > To support the Device Physical Address (DPA) Mapping decisions, see > CXL Spec (3.0) Section 8.2.9.8.2.0 - Get Partition Info: > Active Volatile Memory > The device shall provide this volatile capacity starting at DPA 0 > Active Persistent Memory > The device shall provide this persistent capacity starting at the > DPA immediately following the volatile capacity > > Partitioning of Persistent Memory regions may be supported on > following patch sets. > > > Gregory Price (4): > hw/i386/pc.c: CXL Fixed Memory Window should not reserve e820 in bios > hw/cxl: Add CXL_CAPACITY_MULTIPLIER definition > hw/cxl: Multi-Region CXL Type-3 Devices (Volatile and Persistent) > hw/acpi/cxl.c: Fill in SRAT for vmem/pmem if NUMA node is assigned > > docs/system/devices/cxl.rst | 74 ++++++++-- > hw/acpi/cxl.c | 67 +++++++++ > hw/cxl/cxl-mailbox-utils.c | 23 +-- > hw/i386/acpi-build.c | 4 + > hw/i386/pc.c | 2 - > hw/mem/cxl_type3.c | 274 +++++++++++++++++++++++++++--------- > include/hw/acpi/cxl.h | 1 + > include/hw/cxl/cxl_device.h | 11 +- > tests/qtest/cxl-test.c | 111 +++++++++++---- > 9 files changed, 443 insertions(+), 124 deletions(-) > > -- > 2.37.3
On Wed, Oct 26, 2022 at 08:13:24PM +0000, Adam Manzanares wrote: > On Tue, Oct 25, 2022 at 08:47:33PM -0400, Gregory Price wrote: > > Submitted as an extention to the multi-feature branch maintained > > by Jonathan Cameron at: > > https://urldefense.com/v3/__https://gitlab.com/jic23/qemu/-/tree/cxl-2022-10-24__;!!EwVzqGoTKBqv-0DWAJBm!RyiGL5B1XmQnVFwgxikKJeosPMKtoO1cTr61gIq8fwqfju8l4cbGZGwAEkKXIJB-Dbkfi_LNN2rGCbzMISz65cTxpAxI9pQ$ > > > > > > Summary of Changes: > > 1) E820 CFMW Bug fix. > > 2) Add CXL_CAPACITY_MULTIPLIER definition to replace magic numbers > > 3) Multi-Region and Volatile Memory support for CXL Type-3 Devices > > 4) CXL Type-3 SRAT Generation when NUMA node is attached to memdev > > > > > > Regarding the E820 fix > > * This bugfix is required for memory regions to work on x86 > > * input from Dan Williams and others suggest that E820 entry for > > the CFMW should not exist, as it is expected to be dynamically > > assigned at runtime. If this entry exists, it instead blocks > > region creation by nature of the memory region being marked as > > reserved. > > For CXL 2.0 it is my understanding that volatile capacity present at boot will > be advertised by the firmware. In the absence of EFI I would assume this would > be provided in the e820 map. The issue in this case is very explicitly that a double-mapping occurs for the same region. An E820 mapping for RESERVED is set *and* the region driver allocates a CXL CFMW mapping. As a result the region drive straight up fails to allocate regions. So in either case - at least from my view - the entry added as RESERVED is just wrong. This is separate from type-3 device SRAT entries and default mappings for volatile regions. For this situation, if you explicitly assign the memdev backing a type-3 device to a numa node, then an SRAT area is generated and an explicit e820 entry is generated and marked usable - though I think there are likely issues with this kind of double-referencing. > > Is the region driver meant to cover volatile capacity present at boot? I was > under the impression that it would be used for hot added volatile memory. It > would be good to cover all of these assumptions for the e820 fix. This region appears to cover hotplug memory behind the CFMW. The problem is that this e820 RESERVED mapping blocks the CFMW region from being used at all. Without this, you can't use a type-3 persistent region, even with support, let alone a volatile region. In attempting to use a persistent region as volatile via ndctl and friends, I'm seeing further issues (it cannot be assigned to a numa node successfully), but that's a separate issue. > > Lastly it is my understanding that the region driver does not have support for > volatile memory. It would be great to add that functionality if we make this > change in QEMU. > Right now this is true, but it seems a bit of a chicken/egg scenario. Nothing to test against vs no support. Nudging this along such that we can at least report an (unusable) hot-add volatile memory region would provide someone working with the region driver something to poke and prod at. > > Regarding SRAT Generation for Type-3 Devices > > * Co-Developed by Davidlohr Bueso. Built from his base patch and > > extended to work with both volatile and persistent regions. > > * This can be used to demonstrate static type-3 device mapping and > > testing numa-access to type-3 device memory regions. Regarding "volatile memory present at boot" - there is still two ways for that memory to be onlined: Statically (entered as an explicit e820 region after reading the SRAT), or Dynamically (hot-add by the region driver). This patch would at least allow an SRAT to be generated if you explicitly add a NUMA node mapping to it. Although I concede that I'm not entirely sure what is "correct" here. What this ends up looking like is mapping a memdev to both a numa node and to a type-3 device. Though that seems wrong. After further testing it seems like creating a CPU-less, Memory-less NUMA node with the intent of mapping volatile memory regions to it is not supported (yet?).
On Wed, Oct 26, 2022 at 04:20:40PM -0400, Michael S. Tsirkin wrote: > On Tue, Oct 25, 2022 at 08:47:33PM -0400, Gregory Price wrote: > > Submitted as an extention to the multi-feature branch maintained > > by Jonathan Cameron at: > > https://gitlab.com/jic23/qemu/-/tree/cxl-2022-10-24 > > > > I am not supposed to merge this patchset yet, right? > That branch has a bunch of patches not yet posted for review. > Pls add "RFC" in the subject when that is the case. > > Thanks! > > Correct, sorry, Jonathan asked me to send out a new round to incorporate into his branch, I should have marked it RFC. I will push up separate patch requests for the E820 and PCI_MEMORY_CXL bug fixes.
On Wed, 26 Oct 2022 16:47:18 -0400 Gregory Price <gregory.price@memverge.com> wrote: > On Wed, Oct 26, 2022 at 08:13:24PM +0000, Adam Manzanares wrote: > > On Tue, Oct 25, 2022 at 08:47:33PM -0400, Gregory Price wrote: > > > Submitted as an extention to the multi-feature branch maintained > > > by Jonathan Cameron at: > > > https://urldefense.com/v3/__https://gitlab.com/jic23/qemu/-/tree/cxl-2022-10-24__;!!EwVzqGoTKBqv-0DWAJBm!RyiGL5B1XmQnVFwgxikKJeosPMKtoO1cTr61gIq8fwqfju8l4cbGZGwAEkKXIJB-Dbkfi_LNN2rGCbzMISz65cTxpAxI9pQ$ > > > > > > > > > Summary of Changes: > > > 1) E820 CFMW Bug fix. > > > 2) Add CXL_CAPACITY_MULTIPLIER definition to replace magic numbers > > > 3) Multi-Region and Volatile Memory support for CXL Type-3 Devices > > > 4) CXL Type-3 SRAT Generation when NUMA node is attached to memdev +CC Dan for a question on status of Generic Ports ACPI code first ECN. Given that was done on the mailing list I can find any public tracking of whether it was accepted or not - hence whether we can get on with implementation. There hasn't been a release ACPI spec since before that was proposed so we need that confirmation of the code first proposal being accepted to get things moving. > > > > > > > > > Regarding the E820 fix > > > * This bugfix is required for memory regions to work on x86 > > > * input from Dan Williams and others suggest that E820 entry for > > > the CFMW should not exist, as it is expected to be dynamically > > > assigned at runtime. If this entry exists, it instead blocks > > > region creation by nature of the memory region being marked as > > > reserved. > > > > For CXL 2.0 it is my understanding that volatile capacity present at boot will > > be advertised by the firmware. In the absence of EFI I would assume this would > > be provided in the e820 map. Whilst this is one option, it is certainly not the case that all CXL 2.0 platforms will decide to do any setup of CXL memory (volatile or not) in the firmware. They can leave it entirely to the OS, so a cold plug flow. Early platforms will do the setup in BIOS to support unware OSes, once that's not a problem any more the only reason you'd want to do this is if you don't have other RAM to boot the system, or for some reason you want your host kernel etc in the CXL attached memory. I'd expect to see BIOS having OS managed configuration as an option in the intermediate period where some OSes are aware, others not. OS knows more about usecase / policy so can make better choices on interleaving etc of volatile CXL type 3 memory (let alone the fun corner of devices where you can dynamically change the mix of volatile and non volatile memory). > > The issue in this case is very explicitly that a double-mapping occurs > for the same region. An E820 mapping for RESERVED is set *and* the > region driver allocates a CXL CFMW mapping. As a result the region > drive straight up fails to allocate regions. > > So in either case - at least from my view - the entry added as RESERVED > is just wrong. > > This is separate from type-3 device SRAT entries and default mappings > for volatile regions. For this situation, if you explicitly assign the > memdev backing a type-3 device to a numa node, then an SRAT area is > generated and an explicit e820 entry is generated and marked usable - > though I think there are likely issues with this kind of > double-referencing. SRAT setup for CXL type 3 devices is to my mind is a job for a full BIOS, not QEMU level of faking a few things. That BIOS should also be doing the full configuration (HDM Decoders and all the rest). ARM has a prototype for one of the fixed virtual platforms that does this (talk at Plumbers Uconf), we should look to do the same if we want to test those flows using QEMU via appropriate changes in EDK2 to walk topology and configure everything. Until the devices are configured there is no way to configure the SLIT, HMAT entries that align with the SRAT ones (in theory those can be updated at runtime via _SLI, _HMA but in practice, I'm fairly sure Linux doesn't support doing that.) > > > > > Is the region driver meant to cover volatile capacity present at boot? I was > > under the impression that it would be used for hot added volatile memory. It > > would be good to cover all of these assumptions for the e820 fix. > > This region appears to cover hotplug memory behind the CFMW. The > problem is that this e820 RESERVED mapping blocks the CFMW region from > being used at all. > > Without this, you can't use a type-3 persistent region, even with > support, let alone a volatile region. In attempting to use a persistent > region as volatile via ndctl and friends, I'm seeing further issues (it > cannot be assigned to a numa node successfully), but that's a separate > issue. For the Numa node bit... So far, the CDAT table isn't used in Linux (read out for debug purposes is supported only). That all needs to be added yet to get the NUMA node allocations to work correctly. It shouldn't have anything to do with SRAT unless the BIOS has done the full full configuration (same way CXL will work with a legacy OS). Come to think of it, that should definitely be on the priority list for kernel support (don't think it was on the list on Tuesday?) > > > > > Lastly it is my understanding that the region driver does not have support for > > volatile memory. It would be great to add that functionality if we make this > > change in QEMU. > > > > Right now this is true, but it seems a bit of a chicken/egg scenario. > Nothing to test against vs no support. Nudging this along such that we > can at least report an (unusable) hot-add volatile memory region would > provide someone working with the region driver something to poke and > prod at. Agreed. This is same place as Ben's original CXL QEMU work on non volatile. Need something to develop against so we'll at least have QEMU patches available whilst the kernel side is in development (hopefully this cycle) > > > > Regarding SRAT Generation for Type-3 Devices > > > * Co-Developed by Davidlohr Bueso. Built from his base patch and > > > extended to work with both volatile and persistent regions. > > > * This can be used to demonstrate static type-3 device mapping and > > > testing numa-access to type-3 device memory regions. > > Regarding "volatile memory present at boot" - there is still two ways > for that memory to be onlined: Statically (entered as an explicit e820 > region after reading the SRAT), or Dynamically (hot-add by the region > driver). > > This patch would at least allow an SRAT to be generated if you > explicitly add a NUMA node mapping to it. Although I concede that I'm > not entirely sure what is "correct" here. For hotplug, needs to be the region driver, not here (or BIOS if you are doing a BIOS driven flow - in which case the whole topology is discovered and mostly configured by the BIOS before the OS reaches it - including filling in SRAT, SLIT, HMAT etCc). > > What this ends up looking like is mapping a memdev to both a numa node > and to a type-3 device. Though that seems wrong. > > After further testing it seems like creating a CPU-less, Memory-less > NUMA node with the intent of mapping volatile memory regions to it is > not supported (yet?). Yup, and I doubt it ever will be for reasons above. BIOS does it all, or OS does it all. Mixing and matching is a mess (there is an exception - Generic Port entries in SRAT - those we do need for the OS driven flow and possibly also the BIOS flow - patches welcome! I'd forgotten that on my list of QEMU stuff that needs doing.) https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@intel.com/ If anyone is implementing that, also good to do Generic Initiators as they are very similar. Jonathan
ok to summarize then: patch 1) e820 - submitted as a separate patch/bugfix for mst to pick up patch 2&3) Pickup by Jonathan for his branch as it depends on DOE and other changes. patch 4) incorrect, this should be done in bios/efi, drop entirely On Thu, Oct 27, 2022 at 11:58:54AM +0100, Jonathan Cameron wrote: > On Wed, 26 Oct 2022 16:47:18 -0400 > Gregory Price <gregory.price@memverge.com> wrote: > > > On Wed, Oct 26, 2022 at 08:13:24PM +0000, Adam Manzanares wrote: > > > On Tue, Oct 25, 2022 at 08:47:33PM -0400, Gregory Price wrote: > > > > Submitted as an extention to the multi-feature branch maintained > > > > by Jonathan Cameron at: > > > > https://urldefense.com/v3/__https://gitlab.com/jic23/qemu/-/tree/cxl-2022-10-24__;!!EwVzqGoTKBqv-0DWAJBm!RyiGL5B1XmQnVFwgxikKJeosPMKtoO1cTr61gIq8fwqfju8l4cbGZGwAEkKXIJB-Dbkfi_LNN2rGCbzMISz65cTxpAxI9pQ$ > > > > > > > > > > > > Summary of Changes: > > > > 1) E820 CFMW Bug fix. > > > > 2) Add CXL_CAPACITY_MULTIPLIER definition to replace magic numbers > > > > 3) Multi-Region and Volatile Memory support for CXL Type-3 Devices > > > > 4) CXL Type-3 SRAT Generation when NUMA node is attached to memdev > > +CC Dan for a question on status of Generic Ports ACPI code first ECN. > Given that was done on the mailing list I can find any public tracking > of whether it was accepted or not - hence whether we can get on with > implementation. There hasn't been a release ACPI spec since before > that was proposed so we need that confirmation of the code first proposal > being accepted to get things moving. > /* snip for brevity */
On Thu, Oct 27, 2022 at 11:58:54AM +0100, Jonathan Cameron wrote: > On Wed, 26 Oct 2022 16:47:18 -0400 > Gregory Price <gregory.price@memverge.com> wrote: > > > On Wed, Oct 26, 2022 at 08:13:24PM +0000, Adam Manzanares wrote: > > > On Tue, Oct 25, 2022 at 08:47:33PM -0400, Gregory Price wrote: > > > > Submitted as an extention to the multi-feature branch maintained > > > > by Jonathan Cameron at: > > > > https://urldefense.com/v3/__https://gitlab.com/jic23/qemu/-/tree/cxl-2022-10-24__;!!EwVzqGoTKBqv-0DWAJBm!RyiGL5B1XmQnVFwgxikKJeosPMKtoO1cTr61gIq8fwqfju8l4cbGZGwAEkKXIJB-Dbkfi_LNN2rGCbzMISz65cTxpAxI9pQ$ > > > > > > > > > > > > Summary of Changes: > > > > 1) E820 CFMW Bug fix. > > > > 2) Add CXL_CAPACITY_MULTIPLIER definition to replace magic numbers > > > > 3) Multi-Region and Volatile Memory support for CXL Type-3 Devices > > > > 4) CXL Type-3 SRAT Generation when NUMA node is attached to memdev > > +CC Dan for a question on status of Generic Ports ACPI code first ECN. > Given that was done on the mailing list I can find any public tracking > of whether it was accepted or not - hence whether we can get on with > implementation. There hasn't been a release ACPI spec since before > that was proposed so we need that confirmation of the code first proposal > being accepted to get things moving. > > > > > > > > > > > > > Regarding the E820 fix > > > > * This bugfix is required for memory regions to work on x86 > > > > * input from Dan Williams and others suggest that E820 entry for > > > > the CFMW should not exist, as it is expected to be dynamically > > > > assigned at runtime. If this entry exists, it instead blocks > > > > region creation by nature of the memory region being marked as > > > > reserved. > > > > > > For CXL 2.0 it is my understanding that volatile capacity present at boot will > > > be advertised by the firmware. In the absence of EFI I would assume this would > > > be provided in the e820 map. > > Whilst this is one option, it is certainly not the case that all CXL 2.0 > platforms will decide to do any setup of CXL memory (volatile or not) in the > firmware. They can leave it entirely to the OS, so a cold plug flow. > Early platforms will do the setup in BIOS to support unware OSes, once > that's not a problem any more the only reason you'd want to do this is if > you don't have other RAM to boot the system, or for some reason you want > your host kernel etc in the CXL attached memory. > > I'd expect to see BIOS having OS managed configuration as an option in the > intermediate period where some OSes are aware, others not. > OS knows more about usecase / policy so can make better choices on interleaving > etc of volatile CXL type 3 memory (let alone the fun corner of devices > where you can dynamically change the mix of volatile and non volatile > memory). > > > > > > The issue in this case is very explicitly that a double-mapping occurs > > for the same region. An E820 mapping for RESERVED is set *and* the > > region driver allocates a CXL CFMW mapping. As a result the region > > drive straight up fails to allocate regions. > > > > So in either case - at least from my view - the entry added as RESERVED > > is just wrong. > > > > This is separate from type-3 device SRAT entries and default mappings > > for volatile regions. For this situation, if you explicitly assign the > > memdev backing a type-3 device to a numa node, then an SRAT area is > > generated and an explicit e820 entry is generated and marked usable - > > though I think there are likely issues with this kind of > > double-referencing. > > SRAT setup for CXL type 3 devices is to my mind is a job for a full BIOS, > not QEMU level of faking a few things. That BIOS should also > be doing the full configuration (HDM Decoders and all the rest). ARM has > a prototype for one of the fixed virtual platforms that does this (talk > at Plumbers Uconf), we should look to do the same if we want to test > those flows using QEMU via appropriate changes in EDK2 to walk topology > and configure everything. Until the devices are configured there is no > way to configure the SLIT, HMAT entries that align with the SRAT ones > (in theory those can be updated at runtime via _SLI, _HMA but in > practice, I'm fairly sure Linux doesn't support doing that.) > > > > > > > > > > Is the region driver meant to cover volatile capacity present at boot? I was > > > under the impression that it would be used for hot added volatile memory. It > > > would be good to cover all of these assumptions for the e820 fix. > > > > This region appears to cover hotplug memory behind the CFMW. The > > problem is that this e820 RESERVED mapping blocks the CFMW region from > > being used at all. > > > > Without this, you can't use a type-3 persistent region, even with > > support, let alone a volatile region. In attempting to use a persistent > > region as volatile via ndctl and friends, I'm seeing further issues (it > > cannot be assigned to a numa node successfully), but that's a separate > > issue. > For the Numa node bit... > > So far, the CDAT table isn't used in Linux (read out for debug purposes > is supported only). That all needs to be added yet to get the NUMA node > allocations to work correctly. It shouldn't have anything to do with SRAT > unless the BIOS has done the full full configuration (same way CXL will work > with a legacy OS). Come to think of it, that should definitely be on the > priority list for kernel support (don't think it was on the list on Tuesday?) > > > > > > > > > Lastly it is my understanding that the region driver does not have support for > > > volatile memory. It would be great to add that functionality if we make this > > > change in QEMU. > > > > > > > Right now this is true, but it seems a bit of a chicken/egg scenario. > > Nothing to test against vs no support. Nudging this along such that we > > can at least report an (unusable) hot-add volatile memory region would > > provide someone working with the region driver something to poke and > > prod at. > > Agreed. This is same place as Ben's original CXL QEMU work on non volatile. > Need something to develop against so we'll at least have QEMU patches > available whilst the kernel side is in development (hopefully this cycle) > > > > > > Regarding SRAT Generation for Type-3 Devices > > > > * Co-Developed by Davidlohr Bueso. Built from his base patch and > > > > extended to work with both volatile and persistent regions. > > > > * This can be used to demonstrate static type-3 device mapping and > > > > testing numa-access to type-3 device memory regions. > > > > Regarding "volatile memory present at boot" - there is still two ways > > for that memory to be onlined: Statically (entered as an explicit e820 > > region after reading the SRAT), or Dynamically (hot-add by the region > > driver). > > > > This patch would at least allow an SRAT to be generated if you > > explicitly add a NUMA node mapping to it. Although I concede that I'm > > not entirely sure what is "correct" here. > > For hotplug, needs to be the region driver, not here (or BIOS if you > are doing a BIOS driven flow - in which case the whole topology is > discovered and mostly configured by the BIOS before the OS reaches it > - including filling in SRAT, SLIT, HMAT etCc). > > > > What this ends up looking like is mapping a memdev to both a numa node > > and to a type-3 device. Though that seems wrong. > > > > After further testing it seems like creating a CPU-less, Memory-less > > NUMA node with the intent of mapping volatile memory regions to it is > > not supported (yet?). > > Yup, and I doubt it ever will be for reasons above. > > BIOS does it all, or OS does it all. Mixing and matching is a mess > (there is an exception - Generic Port entries in SRAT - those we do > need for the OS driven flow and possibly also the BIOS flow > - patches welcome! I'd forgotten that on my list of QEMU stuff that > needs doing.) Based on the discussions is it safe to say we have settled on an OS driven flow for the current QEMU CXL emulation. > > https://urldefense.com/v3/__https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@intel.com/__;!!EwVzqGoTKBqv-0DWAJBm!XLqjPd1aXt3i5NXrZhakQlGdgJ5o4tcfV_5iUEp8vwBLv8T1Ft1OVHPI0p7KpYSFDYzhAGj_ulMz1LfZVmJgrOvrO-_v7udl$ > > If anyone is implementing that, also good to do Generic Initiators > as they are very similar. > > Jonathan > >