[00/18] CXL RAM and the 'Soft Reserved' => 'System RAM' default

Message ID 167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com

Message

Dan Williams Feb. 6, 2023, 1:02 a.m. UTC
Summary:
--------

CXL RAM support allows for the dynamic provisioning of new CXL RAM
regions, and more routinely, assembling a region from an existing
configuration established by platform-firmware. The latter is motivated
by CXL memory RAS (Reliability, Availability and Serviceability)
support, which requires associating device events with System Physical
Address ranges and vice versa.

The 'Soft Reserved' policy rework arranges for performance
differentiated memory like CXL attached DRAM, or high-bandwidth memory,
to be designated for 'System RAM' by default, rather than the device-dax
dedicated access mode. That current device-dax default is confusing and
surprising for the Pareto majority of users who do not expect memory to
be quarantined for dedicated access by default. Most users expect all
'System RAM'-capable memory to show up in free(1).


Details:
--------

Recall that the Linux 'Soft Reserved' designation for memory is a
reaction to platform-firmware, like EFI EDK2, delineating memory with
the EFI Specific Purpose Memory attribute (EFI_MEMORY_SP). An
alternative way to think of that attribute is that it specifies the
*not* general-purpose memory pool. It is memory that may be too precious
for general usage or not performant enough for some hot data structures.
However, in the absence of explicit policy it should just be 'System
RAM' by default.
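
For reference, ranges tagged with EFI_MEMORY_SP surface in the kernel's
resource tree as 'Soft Reserved' and are easy to spot; a minimal sketch,
with made-up addresses for illustration:

    # addresses below are illustrative only
    grep "Soft Reserved" /proc/iomem
      1050000000-304fffffff : Soft Reserved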

Rather than require every distribution to ship a udev policy to assign
dax devices to dax_kmem (the device-memory hotplug driver), just make
that the kernel default. This is similar to the rationale in:

commit 8604d9e534a3 ("memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE")
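
For comparison, a rough sketch of the kind of udev rule a distribution
would otherwise need to ship (the rule text is an illustration, not the
literal rule that daxctl installs):

    # sketch only; the match and RUN command are assumptions
    ACTION=="add", SUBSYSTEM=="dax", RUN+="/usr/bin/daxctl reconfigure-device -m system-ram %k"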

With this change, the relatively niche use case of accessing this memory
via mapping a device-dax instance can be achieved by building with
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, or specifying
memhp_default_state=offline at boot, and then running:

    daxctl reconfigure-device $device -m devdax --force

...to shift the corresponding address range to device-dax access.
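
A quick way to double-check the result afterwards (the device name
dax0.0 is an assumption):

    daxctl list -d dax0.0    # expect "mode":"devdax" in the output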

The process of assembling a device-dax instance for a given CXL region
device configuration is similar to the process of assembling a
Device-Mapper or MDRAID storage-device array. Specifically, asynchronous
probing by the PCI and driver core enumerates all CXL endpoints and
their decoders. Then, once enough decoders have arrived to describe a
given region, that region is passed to the device-dax subsystem where it
is subject to the above 'dax_kmem' policy. This assignment and policy
choice is only possible if memory is set aside by the 'Soft Reserved'
designation. Otherwise, CXL memory that is mapped as 'System RAM' becomes
immutable to CXL driver mechanisms, but is still enumerated for RAS
purposes.
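
For the dynamic-provisioning case, region assembly can also be driven
from userspace; a minimal sketch with the cxl tool, where the decoder
and memdev names are assumptions:

    # single-device, ram-type region; the resulting regionN is then
    # handed to device-dax / dax_kmem per the policy above
    cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0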

This series is also available via:

https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region

...and has gone through some preview testing in various forms.

---

Dan Williams (18):
      cxl/Documentation: Update references to attributes added in v6.0
      cxl/region: Add a mode attribute for regions
      cxl/region: Support empty uuids for non-pmem regions
      cxl/region: Validate region mode vs decoder mode
      cxl/region: Add volatile region creation support
      cxl/region: Refactor attach_target() for autodiscovery
      cxl/region: Move region-position validation to a helper
      kernel/range: Uplevel the cxl subsystem's range_contains() helper
      cxl/region: Enable CONFIG_CXL_REGION to be toggled
      cxl/region: Fix passthrough-decoder detection
      cxl/region: Add region autodiscovery
      tools/testing/cxl: Define a fixed volatile configuration to parse
      dax/hmem: Move HMAT and Soft reservation probe initcall level
      dax/hmem: Drop unnecessary dax_hmem_remove()
      dax/hmem: Convey the dax range via memregion_info()
      dax/hmem: Move hmem device registration to dax_hmem.ko
      dax: Assign RAM regions to memory-hotplug by default
      cxl/dax: Create dax devices for CXL RAM regions


 Documentation/ABI/testing/sysfs-bus-cxl |   64 +-
 MAINTAINERS                             |    1 
 drivers/acpi/numa/hmat.c                |    4 
 drivers/cxl/Kconfig                     |   12 
 drivers/cxl/acpi.c                      |    3 
 drivers/cxl/core/core.h                 |    7 
 drivers/cxl/core/hdm.c                  |    8 
 drivers/cxl/core/pci.c                  |    5 
 drivers/cxl/core/port.c                 |   34 +
 drivers/cxl/core/region.c               |  848 ++++++++++++++++++++++++++++---
 drivers/cxl/cxl.h                       |   46 ++
 drivers/cxl/cxlmem.h                    |    3 
 drivers/cxl/port.c                      |   26 +
 drivers/dax/Kconfig                     |   17 +
 drivers/dax/Makefile                    |    2 
 drivers/dax/bus.c                       |   53 +-
 drivers/dax/bus.h                       |   12 
 drivers/dax/cxl.c                       |   53 ++
 drivers/dax/device.c                    |    3 
 drivers/dax/hmem/Makefile               |    3 
 drivers/dax/hmem/device.c               |  102 ++--
 drivers/dax/hmem/hmem.c                 |  148 +++++
 drivers/dax/kmem.c                      |    1 
 include/linux/dax.h                     |    7 
 include/linux/memregion.h               |    2 
 include/linux/range.h                   |    5 
 lib/stackinit_kunit.c                   |    6 
 tools/testing/cxl/test/cxl.c            |  146 +++++
 28 files changed, 1355 insertions(+), 266 deletions(-)
 create mode 100644 drivers/dax/cxl.c

base-commit: 172738bbccdb4ef76bdd72fc72a315c741c39161

Comments

Gregory Price Feb. 6, 2023, 5:36 a.m. UTC | #1
On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
> Summary:
> --------
> 
> CXL RAM support allows for the dynamic provisioning of new CXL RAM
> regions, and more routinely, assembling a region from an existing
> configuration established by platform-firmware. The latter is motivated
> by CXL memory RAS (Reliability, Availability and Serviceability)
> support, that requires associating device events with System Physical
> Address ranges and vice versa.
> 
> The 'Soft Reserved' policy rework arranges for performance
> differentiated memory like CXL attached DRAM, or high-bandwidth memory,
> to be designated for 'System RAM' by default, rather than the device-dax
> dedicated access mode. That current device-dax default is confusing and
> surprising for the Pareto of users that do not expect memory to be
> quarantined for dedicated access by default. Most users expect all
> 'System RAM'-capable memory to show up in FREE(1).

Leveraging the same QEMU branch, machine, and configuration as my prior
tests, I'm now experiencing a kernel panic on boot.  I'll debug a bit
in the morning, but here is the stack trace I'm seeing.

Saw this in both 1 and 2 root port configurations.

(note: I also have the region reset issue previously discussed on top of
your branch).  

QEMU configuration:

sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 2G,slots=4,maxmem=4G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
-object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=1G,share=true \
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G

[   13.936817] Call Trace:
[   13.970691]  <TASK>
[   13.990690]  device_add+0x39d/0x9a0
[   14.024690]  ? kobject_set_name_vargs+0x6d/0x90
[   14.066690]  ? dev_set_name+0x4b/0x60
[   14.090691]  devm_cxl_add_port+0x29a/0x4d0
[   14.135946]  cxl_acpi_probe+0xd9/0x2f0
[   14.167691]  ? device_pm_check_callbacks+0x36/0x100
[   14.203691]  platform_probe+0x44/0x90
[   14.247691]  really_probe+0xde/0x380
[   14.277690]  ? pm_runtime_barrier+0x50/0x90
[   14.324693]  __driver_probe_device+0x78/0x170
[   14.356694]  driver_probe_device+0x1f/0x90
[   14.396692]  __driver_attach+0xce/0x1c0
[   14.435691]  ? __pfx___driver_attach+0x10/0x10
[   14.471692]  bus_for_each_dev+0x73/0xa0
[   14.508693]  bus_add_driver+0x1ae/0x200
[   14.551691]  driver_register+0x89/0xe0
[   14.587691]  ? __pfx_cxl_acpi_init+0x10/0x10
[   14.625690]  do_one_initcall+0x59/0x230
[   14.814691]  kernel_init_freeable+0x204/0x24e
[   14.846710]  ? __pfx_kernel_init+0x10/0x10
[   14.899692]  kernel_init+0x16/0x140
[   14.954691]  ret_from_fork+0x2c/0x50
[   14.986692]  </TASK>
[   15.023689] Modules linked in:
[   15.057693] CR2: 0000000000000060
[   15.105691] ---[ end trace 0000000000000000 ]---
[   15.162837] RIP: 0010:bus_add_device+0x5b/0x150
[   15.217693] Code: 49 8b 74 24 20 48 89 df e8 92 88 ff ff 89 c5 85 c0 75 3b 48 8b 53 50 48 85 d2 75 03 48 8b 13 49 8b 84 24 a8 00 00 00 48 89 0
[   15.427859] RSP: 0000:ffffbd2a40013bc0 EFLAGS: 00010246
[   15.475692] RAX: 0000000000000000 RBX: ffff955c419e1800 RCX: 0000000000000000
[   15.591691] RDX: ffff955c41921778 RSI: ffff955c419e1800 RDI: ffff955c419e1800
[   15.703693] RBP: 0000000000000000 R08: 0000000000000228 R09: ffff955c4119e550
[   15.802711] R10: ffff955c414bedc8 R11: 0000000000000000 R12: ffffffff9e259d60
[   15.907692] R13: ffffffff9d3cee40 R14: ffff955c419e1800 R15: ffff955c41f06010
[   15.983693] FS:  0000000000000000(0000) GS:ffff955cbdd00000(0000) knlGS:0000000000000000
[   16.126698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   16.206694] CR2: 0000000000000060 CR3: 0000000036010000 CR4: 00000000000006e0
[   16.347694] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[   16.348686] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---
Davidlohr Bueso Feb. 6, 2023, 4:40 p.m. UTC | #2
On Mon, 06 Feb 2023, Gregory Price wrote:

>On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
>> Summary:
>> --------
>>
>> CXL RAM support allows for the dynamic provisioning of new CXL RAM
>> regions, and more routinely, assembling a region from an existing
>> configuration established by platform-firmware. The latter is motivated
>> by CXL memory RAS (Reliability, Availability and Serviceability)
>> support, that requires associating device events with System Physical
>> Address ranges and vice versa.
>>
>> The 'Soft Reserved' policy rework arranges for performance
>> differentiated memory like CXL attached DRAM, or high-bandwidth memory,
>> to be designated for 'System RAM' by default, rather than the device-dax
>> dedicated access mode. That current device-dax default is confusing and
>> surprising for the Pareto of users that do not expect memory to be
>> quarantined for dedicated access by default. Most users expect all
>> 'System RAM'-capable memory to show up in FREE(1).
>
>Leverage the same QEMU branch, machine, and configuration as my prior
>tests, i'm now experiencing a kernel panic on boot.  Will debug a bit
>in the morning, but here is the stack trace i'm seeing
>
>Saw this in both 1 and 2 root port configurations

I also see it in "regular" pmem setups, and narrowed it down to this
change in the last patch:

-module_init(cxl_acpi_init);
+/* load before dax_hmem sees 'Soft Reserved' CXL ranges */
+subsys_initcall(cxl_acpi_init);
Davidlohr Bueso Feb. 6, 2023, 5:18 p.m. UTC | #3
On Mon, 06 Feb 2023, Dan Williams wrote:

>Gregory Price wrote:
>[..]
>> Leverage the same QEMU branch, machine, and configuration as my prior
>> tests, i'm now experiencing a kernel panic on boot.  Will debug a bit
>> in the morning, but here is the stack trace i'm seeing
>>
>> Saw this in both 1 and 2 root port configurations
>>
>> (note: I also have the region reset issue previously discussed on top of
>> your branch).
>>
>> QEMU configuration:
>>
>> sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
>> -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
>> -m 2G,slots=4,maxmem=4G \
>> -smp 4 \
>> -machine type=q35,accel=kvm,cxl=on \
>> -enable-kvm \
>> -nographic \
>> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
>> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
>> -object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=1G,share=true \
>> -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
>> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
>>
>[..]
>> [   15.162837] RIP: 0010:bus_add_device+0x5b/0x150
>
>I suspect cxl_bus_type is not intialized yet. I think this should
>address it:

Yep, thanks.

>
>diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
>index 0faeb1ffc212..6fe00702327f 100644
>--- a/drivers/cxl/core/port.c
>+++ b/drivers/cxl/core/port.c
>@@ -2011,6 +2011,6 @@ static void cxl_core_exit(void)
>	debugfs_remove_recursive(cxl_debugfs);
> }
>
>-module_init(cxl_core_init);
>+subsys_initcall(cxl_core_init);
> module_exit(cxl_core_exit);
> MODULE_LICENSE("GPL v2");
Dan Williams Feb. 6, 2023, 5:29 p.m. UTC | #4
Gregory Price wrote:
[..]
> Leverage the same QEMU branch, machine, and configuration as my prior
> tests, i'm now experiencing a kernel panic on boot.  Will debug a bit
> in the morning, but here is the stack trace i'm seeing
> 
> Saw this in both 1 and 2 root port configurations
> 
> (note: I also have the region reset issue previously discussed on top of
> your branch).  
> 
> QEMU configuration:
> 
> sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 2G,slots=4,maxmem=4G \
> -smp 4 \
> -machine type=q35,accel=kvm,cxl=on \
> -enable-kvm \
> -nographic \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
> -object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=1G,share=true \
> -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> 
[..]
> [   15.162837] RIP: 0010:bus_add_device+0x5b/0x150

I suspect cxl_bus_type is not initialized yet. I think this should
address it:

diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 0faeb1ffc212..6fe00702327f 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -2011,6 +2011,6 @@ static void cxl_core_exit(void)
 	debugfs_remove_recursive(cxl_debugfs);
 }
 
-module_init(cxl_core_init);
+subsys_initcall(cxl_core_init);
 module_exit(cxl_core_exit);
 MODULE_LICENSE("GPL v2");
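
(For background: when built in, module_init() lands at device_initcall()
level, which runs after subsys_initcall(). So with the last patch moving
cxl_acpi_init() to subsys_initcall(), cxl_acpi was probing before cxl_core
had registered cxl_bus_type, hence the NULL dereference in bus_add_device().
A rough sketch of the built-in initcall ordering:)

    # built-in initcall levels, in execution order (abridged):
    #   core -> postcore -> arch -> subsys -> fs -> device (module_init) -> late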
Dan Williams Feb. 6, 2023, 6:23 p.m. UTC | #5
Davidlohr Bueso wrote:
> On Mon, 06 Feb 2023, Gregory Price wrote:
> 
> >On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
> >> Summary:
> >> --------
> >>
> >> CXL RAM support allows for the dynamic provisioning of new CXL RAM
> >> regions, and more routinely, assembling a region from an existing
> >> configuration established by platform-firmware. The latter is motivated
> >> by CXL memory RAS (Reliability, Availability and Serviceability)
> >> support, that requires associating device events with System Physical
> >> Address ranges and vice versa.
> >>
> >> The 'Soft Reserved' policy rework arranges for performance
> >> differentiated memory like CXL attached DRAM, or high-bandwidth memory,
> >> to be designated for 'System RAM' by default, rather than the device-dax
> >> dedicated access mode. That current device-dax default is confusing and
> >> surprising for the Pareto of users that do not expect memory to be
> >> quarantined for dedicated access by default. Most users expect all
> >> 'System RAM'-capable memory to show up in FREE(1).
> >
> >Leverage the same QEMU branch, machine, and configuration as my prior
> >tests, i'm now experiencing a kernel panic on boot.  Will debug a bit
> >in the morning, but here is the stack trace i'm seeing
> >
> >Saw this in both 1 and 2 root port configurations
> 
> I also see it in "regular" pmem setups, and narrowed it down to this
> change in the last patch:
> 
> -module_init(cxl_acpi_init);
> +/* load before dax_hmem sees 'Soft Reserved' CXL ranges */
> +subsys_initcall(cxl_acpi_init);

Yeah, I need to add some CONFIG_CXL_ACPI=y and CONFIG_CXL_BUS=y configs
to my tests. Typically I only run module builds.
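
Something like the following built-in fragment would cover it (a sketch;
the exact set of CXL options is an assumption):

    # built-in rather than =m, so boot-time initcall ordering is exercised
    CONFIG_CXL_BUS=y
    CONFIG_CXL_ACPI=y
    CONFIG_CXL_PCI=y
    CONFIG_CXL_MEM=y
    CONFIG_CXL_PORT=y
    CONFIG_CXL_REGION=y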
Fan Ni Feb. 8, 2023, 5:37 p.m. UTC | #6
On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:


> Summary:
> --------
> 
> CXL RAM support allows for the dynamic provisioning of new CXL RAM
> regions, and more routinely, assembling a region from an existing
> configuration established by platform-firmware. The latter is motivated
> by CXL memory RAS (Reliability, Availability and Serviceability)
> support, that requires associating device events with System Physical
> Address ranges and vice versa.
> 
> The 'Soft Reserved' policy rework arranges for performance
> differentiated memory like CXL attached DRAM, or high-bandwidth memory,
> to be designated for 'System RAM' by default, rather than the device-dax
> dedicated access mode. That current device-dax default is confusing and
> surprising for the Pareto of users that do not expect memory to be
> quarantined for dedicated access by default. Most users expect all
> 'System RAM'-capable memory to show up in FREE(1).
> 
> 
> Details:
> --------
> 
> Recall that the Linux 'Soft Reserved' designation for memory is a
> reaction to platform-firmware, like EFI EDK2, delineating memory with
> the EFI Specific Purpose Memory attribute (EFI_MEMORY_SP). An
> alternative way to think of that attribute is that it specifies the
> *not* general-purpose memory pool. It is memory that may be too precious
> for general usage or not performant enough for some hot data structures.
> However, in the absence of explicit policy it should just be 'System
> RAM' by default.
> 
> Rather than require every distribution to ship a udev policy to assign
> dax devices to dax_kmem (the device-memory hotplug driver) just make
> that the kernel default. This is similar to the rationale in:
> 
> commit 8604d9e534a3 ("memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE")
> 
> With this change the relatively niche use case of accessing this memory
> via mapping a device-dax instance can be achieved by building with
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, or specifying
> memhp_default_state=offline at boot, and then use:
> 
>     daxctl reconfigure-device $device -m devdax --force
> 
> ...to shift the corresponding address range to device-dax access.
> 
> The process of assembling a device-dax instance for a given CXL region
> device configuration is similar to the process of assembling a
> Device-Mapper or MDRAID storage-device array. Specifically, asynchronous
> probing by the PCI and driver core enumerates all CXL endpoints and
> their decoders. Then, once enough decoders have arrived to a describe a
> given region, that region is passed to the device-dax subsystem where it
> is subject to the above 'dax_kmem' policy. This assignment and policy
> choice is only possible if memory is set aside by the 'Soft Reserved'
> designation. Otherwise, CXL that is mapped as 'System RAM' becomes
> immutable by CXL driver mechanisms, but is still enumerated for RAS
> purposes.
> 
> This series is also available via:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region
> 
> ...and has gone through some preview testing in various forms.
> 
> ---

Tested-by: Fan Ni <fan.ni@samsung.com>


Ran the following tests with the patch (with the volatile-memory support in
QEMU). Note: CXL-related code is compiled as modules and loaded before use.

For the pmem setup, I tried three topologies (1HB1RP1Mem, 1HB2RP2Mem, and
1HB2RP4Mem with a CXL switch). The memdev is either provided on the command
line when launching QEMU or hot-added to the guest with the device_add
command in the QEMU monitor.

The following operations were performed (see the command sketch below):
1. create a region with the cxl command
2. create a namespace with the ndctl command
3. convert the CXL memory to ram with the daxctl command
4. online the memory with the daxctl command
5. let an app use the memory (numactl --membind=1 htop)

Results: No regression.
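
A minimal sketch of that sequence (decoder, region, namespace, and dax
device names are assumptions):

    cxl create-region -m -t pmem -d decoder0.0 mem0              # 1. create the region
    ndctl create-namespace -m devdax -r region0                  # 2. create a devdax namespace
    daxctl reconfigure-device --no-online -m system-ram dax0.0   # 3. convert to ram
    daxctl online-memory dax0.0                                  # 4. online the memory
    numactl --membind=1 htop                                     # 5. bind an app to the new node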

For volatile memory (hot-added with the device_add command), I mainly tested
the 1HB1RP1Mem case (passthrough):
1. the device is correctly discovered after hot add (cxl list; may need
cxl enable-memdev)
2. creating a ram region (cxl create-region) succeeds; after creating the
region, a dax device shows up under /dev/
3. onlining the memory passes, and the memory shows up on another NUMA node
4. letting an app use the memory (numactl --membind=1 htop) passes




> 
> Dan Williams (18):
>       cxl/Documentation: Update references to attributes added in v6.0
>       cxl/region: Add a mode attribute for regions
>       cxl/region: Support empty uuids for non-pmem regions
>       cxl/region: Validate region mode vs decoder mode
>       cxl/region: Add volatile region creation support
>       cxl/region: Refactor attach_target() for autodiscovery
>       cxl/region: Move region-position validation to a helper
>       kernel/range: Uplevel the cxl subsystem's range_contains() helper
>       cxl/region: Enable CONFIG_CXL_REGION to be toggled
>       cxl/region: Fix passthrough-decoder detection
>       cxl/region: Add region autodiscovery
>       tools/testing/cxl: Define a fixed volatile configuration to parse
>       dax/hmem: Move HMAT and Soft reservation probe initcall level
>       dax/hmem: Drop unnecessary dax_hmem_remove()
>       dax/hmem: Convey the dax range via memregion_info()
>       dax/hmem: Move hmem device registration to dax_hmem.ko
>       dax: Assign RAM regions to memory-hotplug by default
>       cxl/dax: Create dax devices for CXL RAM regions
> 
> 
>  Documentation/ABI/testing/sysfs-bus-cxl |   64 +-
>  MAINTAINERS                             |    1 
>  drivers/acpi/numa/hmat.c                |    4 
>  drivers/cxl/Kconfig                     |   12 
>  drivers/cxl/acpi.c                      |    3 
>  drivers/cxl/core/core.h                 |    7 
>  drivers/cxl/core/hdm.c                  |    8 
>  drivers/cxl/core/pci.c                  |    5 
>  drivers/cxl/core/port.c                 |   34 +
>  drivers/cxl/core/region.c               |  848 ++++++++++++++++++++++++++++---
>  drivers/cxl/cxl.h                       |   46 ++
>  drivers/cxl/cxlmem.h                    |    3 
>  drivers/cxl/port.c                      |   26 +
>  drivers/dax/Kconfig                     |   17 +
>  drivers/dax/Makefile                    |    2 
>  drivers/dax/bus.c                       |   53 +-
>  drivers/dax/bus.h                       |   12 
>  drivers/dax/cxl.c                       |   53 ++
>  drivers/dax/device.c                    |    3 
>  drivers/dax/hmem/Makefile               |    3 
>  drivers/dax/hmem/device.c               |  102 ++--
>  drivers/dax/hmem/hmem.c                 |  148 +++++
>  drivers/dax/kmem.c                      |    1 
>  include/linux/dax.h                     |    7 
>  include/linux/memregion.h               |    2 
>  include/linux/range.h                   |    5 
>  lib/stackinit_kunit.c                   |    6 
>  tools/testing/cxl/test/cxl.c            |  146 +++++
>  28 files changed, 1355 insertions(+), 266 deletions(-)
>  create mode 100644 drivers/dax/cxl.c
> 
> base-commit: 172738bbccdb4ef76bdd72fc72a315c741c39161
>
Dan Williams Feb. 9, 2023, 4:56 a.m. UTC | #7
Fan Ni wrote:
> On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
> 
> 
> > Summary:
> > --------
> > 
> > CXL RAM support allows for the dynamic provisioning of new CXL RAM
> > regions, and more routinely, assembling a region from an existing
> > configuration established by platform-firmware. The latter is motivated
> > by CXL memory RAS (Reliability, Availability and Serviceability)
> > support, that requires associating device events with System Physical
> > Address ranges and vice versa.
> > 
> > The 'Soft Reserved' policy rework arranges for performance
> > differentiated memory like CXL attached DRAM, or high-bandwidth memory,
> > to be designated for 'System RAM' by default, rather than the device-dax
> > dedicated access mode. That current device-dax default is confusing and
> > surprising for the Pareto of users that do not expect memory to be
> > quarantined for dedicated access by default. Most users expect all
> > 'System RAM'-capable memory to show up in FREE(1).
> > 
> > 
> > Details:
> > --------
> > 
> > Recall that the Linux 'Soft Reserved' designation for memory is a
> > reaction to platform-firmware, like EFI EDK2, delineating memory with
> > the EFI Specific Purpose Memory attribute (EFI_MEMORY_SP). An
> > alternative way to think of that attribute is that it specifies the
> > *not* general-purpose memory pool. It is memory that may be too precious
> > for general usage or not performant enough for some hot data structures.
> > However, in the absence of explicit policy it should just be 'System
> > RAM' by default.
> > 
> > Rather than require every distribution to ship a udev policy to assign
> > dax devices to dax_kmem (the device-memory hotplug driver) just make
> > that the kernel default. This is similar to the rationale in:
> > 
> > commit 8604d9e534a3 ("memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE")
> > 
> > With this change the relatively niche use case of accessing this memory
> > via mapping a device-dax instance can be achieved by building with
> > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, or specifying
> > memhp_default_state=offline at boot, and then use:
> > 
> >     daxctl reconfigure-device $device -m devdax --force
> > 
> > ...to shift the corresponding address range to device-dax access.
> > 
> > The process of assembling a device-dax instance for a given CXL region
> > device configuration is similar to the process of assembling a
> > Device-Mapper or MDRAID storage-device array. Specifically, asynchronous
> > probing by the PCI and driver core enumerates all CXL endpoints and
> > their decoders. Then, once enough decoders have arrived to a describe a
> > given region, that region is passed to the device-dax subsystem where it
> > is subject to the above 'dax_kmem' policy. This assignment and policy
> > choice is only possible if memory is set aside by the 'Soft Reserved'
> > designation. Otherwise, CXL that is mapped as 'System RAM' becomes
> > immutable by CXL driver mechanisms, but is still enumerated for RAS
> > purposes.
> > 
> > This series is also available via:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region
> > 
> > ...and has gone through some preview testing in various forms.
> > 
> > ---
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> 
> 
> Run the following tests with the patch (with the volatile support at qemu).
> Note: cxl related code are compiled as modules and loaded before used.
> 
> For pmem setup, tried three topologies (1HB1RP1Mem, 1HB2RP2Mem, 1HB2RP4Mem with
> a cxl switch). The memdev is either provided in the command line when launching
> qemu or hot added to the guest with device_add command in qemu monitor.
> 
> The following operations are performed,
> 1. create-region with cxl cmd
> 2. create name-space with ndctl cmd
> 3. convert cxl mem to ram with daxctl cmd
> 4. online the memory with daxctl cmd
> 5. Let app use the memory (numactl --membind=1 htop)
> 
> Results: No regression.
> 
> For volatile memory (hot add with device_add command), mainly tested 1HB1RP1Mem
> case (passthrough).
> 1. the device can be correctly discovered after hot add (cxl list, may need
> cxl enable-memdev)
> 2. creating ram region (cxl create-region) succeeded, after creating the
> region, a dax device under /dev/ is shown.
> 3. online the memory passes, and the memory is shown on another NUMA node.
> 4. Let app use the memory (numactl --membind=1 htop) passed.

Thank you, Fan!
David Hildenbrand Feb. 13, 2023, 12:13 p.m. UTC | #8
On 06.02.23 02:02, Dan Williams wrote:
> Summary:
> --------
> 
> CXL RAM support allows for the dynamic provisioning of new CXL RAM
> regions, and more routinely, assembling a region from an existing
> configuration established by platform-firmware. The latter is motivated
> by CXL memory RAS (Reliability, Availability and Serviceability)
> support, that requires associating device events with System Physical
> Address ranges and vice versa.
> 
> The 'Soft Reserved' policy rework arranges for performance
> differentiated memory like CXL attached DRAM, or high-bandwidth memory,
> to be designated for 'System RAM' by default, rather than the device-dax
> dedicated access mode. That current device-dax default is confusing and
> surprising for the Pareto of users that do not expect memory to be
> quarantined for dedicated access by default. Most users expect all
> 'System RAM'-capable memory to show up in FREE(1).
> 
> 
> Details:
> --------
> 
> Recall that the Linux 'Soft Reserved' designation for memory is a
> reaction to platform-firmware, like EFI EDK2, delineating memory with
> the EFI Specific Purpose Memory attribute (EFI_MEMORY_SP). An
> alternative way to think of that attribute is that it specifies the
> *not* general-purpose memory pool. It is memory that may be too precious
> for general usage or not performant enough for some hot data structures.
> However, in the absence of explicit policy it should just be 'System
> RAM' by default.
> 
> Rather than require every distribution to ship a udev policy to assign
> dax devices to dax_kmem (the device-memory hotplug driver) just make
> that the kernel default. This is similar to the rationale in:
> 
> commit 8604d9e534a3 ("memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE")
> 
> With this change the relatively niche use case of accessing this memory
> via mapping a device-dax instance can be achieved by building with
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, or specifying
> memhp_default_state=offline at boot, and then use:
> 
>      daxctl reconfigure-device $device -m devdax --force
> 
> ...to shift the corresponding address range to device-dax access.
> 
> The process of assembling a device-dax instance for a given CXL region
> device configuration is similar to the process of assembling a
> Device-Mapper or MDRAID storage-device array. Specifically, asynchronous
> probing by the PCI and driver core enumerates all CXL endpoints and
> their decoders. Then, once enough decoders have arrived to a describe a
> given region, that region is passed to the device-dax subsystem where it
> is subject to the above 'dax_kmem' policy. This assignment and policy
> choice is only possible if memory is set aside by the 'Soft Reserved'
> designation. Otherwise, CXL that is mapped as 'System RAM' becomes
> immutable by CXL driver mechanisms, but is still enumerated for RAS
> purposes.
> 
> This series is also available via:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region
> 
> ...and has gone through some preview testing in various forms.
> 

My concern would be that in setups with a lot of CXL memory 
(soft-reserved), having that much offline memory during boot might make 
the kernel run out of memory. After all, offline memory consumes memory 
for the memmap.
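
(For scale, a back-of-the-envelope estimate, assuming 64 bytes of struct
page per 4 KiB base page:)

    # 64 B / 4 KiB = ~1.6% of the offline range goes to memmap,
    # e.g. ~16 GiB of memmap for 1 TiB of offline 'Soft Reserved' memory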

Is the assumption that something like that cannot happen because we'll 
never ever have that much soft-reserved memory? :)

Note that this concern only applies when not using auto-onlining in
the kernel during boot, which (IMHO) is or will be the default in the 
future.
Gregory Price Feb. 14, 2023, 6:27 p.m. UTC | #9
On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
> Summary:
> --------
> 
> CXL RAM support allows for the dynamic provisioning of new CXL RAM
> regions, and more routinely, assembling a region from an existing
> configuration established by platform-firmware. The latter is motivated
> by CXL memory RAS (Reliability, Availability and Serviceability)
> support, that requires associating device events with System Physical
> Address ranges and vice versa.
> 

Ok, I simplified down my tests and reverted a bunch of stuff, figured I
should report this before I dive further in.

Earlier I was carrying the DOE patches and others; I've dropped most of
that to make sure I could replicate on the base kernel and QEMU images.

QEMU branch: 
https://gitlab.com/jic23/qemu/-/tree/cxl-2023-01-26
This is a little out of date at this point, I think, but it shouldn't
matter; the results are the same regardless of what else I pull in.

Kernel branch:
https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region
I had been carrying a bunch of other hotfixes and work like the DOE
patches, but I wanted to know if the base branch saw the same
behavior.

Config:
sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/data/qemu/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 4G,slots=4,maxmem=8G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-netdev bridge,id=hn0,br=virbr0 \
-device virtio-net-pci,netdev=hn0,id=nic1 \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
-object memory-backend-ram,id=mem0,size=4G \
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G


boot is fine
# ./ndctl/build/cxl/cxl create-region -m -t ram -d decoder0.0 \
  -w 1 -g 4096 mem0

[   28.183276] cxl_region region0: Bypassing cpu_cache_invalidate_memregion() for testing!
{
  "region":"region0",
  "resource":"0x390000000",
  "size":"4.00 GiB (4.29 GB)",
  "type":"ram",
  "interleave_ways":1,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder2.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region
[   28.247144] Built 1 zonelists, mobility grouping on.  Total pages: 979754
[   28.247844] Policy zone: Normal
[   28.258449] Fallback order for Node 0: 0 1
[   28.258945] Fallback order for Node 1: 1 0
[   28.259422] Built 2 zonelists, mobility grouping on.  Total pages: 1012522
[   28.260087] Policy zone: Normal


top shows the memory has been onlined
MiB Mem :   8022.1 total,   7631.6 free,    200.9 used,    189.7 buff/cache
MiB Swap:   3926.0 total,   3926.0 free,      0.0 used.   7567.8 avail Mem


Let's attempt to use the memory:
[root@fedora ~]# numactl --membind=1 python
KVM internal error. Suberror: 3
extra data[0]: 0x0000000080000b0e
extra data[1]: 0x0000000000000031
extra data[2]: 0x0000000000000d81
extra data[3]: 0x0000000390074ac0
extra data[4]: 0x0000000000000010
RAX=0000000080000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000001
RSI=0000000000000000 RDI=0000000390074000 RBP=ffffac1c4067bca0 RSP=ffffac1c4067bc88
R8 =0000000000000000 R9 =0000000000000001 R10=0000000000000000 R11=0000000000000000
R12=0000000000000000 R13=ffff99eed0074000 R14=0000000000000000 R15=0000000000000000
RIP=ffffffff812b3d62 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 0000000000000000 ffffffff 00c00000
GS =0000 ffff99ec3bc00000 ffffffff 00c00000
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 fffffe1d13135000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe1d13133000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=ffffffff812b3d62 CR3=0000000390074000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=5d 9c 01 0f b7 db 48 09 df 48 0f ba ef 3f 0f 22 df 0f 1f 00 <5b> 41 5c 41 5d 5d c3 cc cc cc cc 48 c7 c0 00 00 00 80 48 2b 05 cd 0d 76 01



I also tested lowering the ram sizes (2GB ram, 1GB "CXL") to see if
there's something going on with the PCI hole or something, but no, same
results.

Double-checked whether there was an issue using a single root port, so I
registered a second one - same results.


In prior tests I accessed the memory directly via devmem2.

This still works when mapping the memory manually:

[root@fedora map] ./map_memory.sh
echo ram > /sys/bus/cxl/devices/decoder2.0/mode
echo 0x40000000 > /sys/bus/cxl/devices/decoder2.0/dpa_size
echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
echo 4096 > /sys/bus/cxl/devices/region0/interleave_granularity
echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
echo 0x40000000 > /sys/bus/cxl/devices/region0/size
echo decoder2.0 > /sys/bus/cxl/devices/region0/target0
echo 1 > /sys/bus/cxl/devices/region0/commit


[root@fedora devmem]# ./devmem2 0x290000000 w 0x12345678
/dev/mem opened.
Memory mapped at address 0x7fb4d4ed3000.
Value at address 0x290000000 (0x7fb4d4ed3000): 0x0
Written 0x12345678; readback 0x12345678



This kind of implies there's a disagreement about the state of memory
between Linux and QEMU.


but even just onlining a region produces memory usage:

[root@fedora ~]# cat /sys/bus/node/devices/node1/meminfo
Node 1 MemTotal:        1048576 kB
Node 1 MemFree:         1048112 kB
Node 1 MemUsed:             464 kB


Which I would expect to set off some fireworks.

Maybe an issue at the NUMA level? I just... I have no idea.


I will need to dig through the email chains to figure out what others
have been doing that I'm missing.  Everything *looks* nominal, but the
reactors are exploding so... ¯\_(ツ)_/¯

I'm not sure where to start here, but I'll bash my face on the keyboard
for a bit until I have some ideas.


~Gregory
Dan Williams Feb. 14, 2023, 6:39 p.m. UTC | #10
Gregory Price wrote:
> On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
> > Summary:
> > --------
> > 
> > CXL RAM support allows for the dynamic provisioning of new CXL RAM
> > regions, and more routinely, assembling a region from an existing
> > configuration established by platform-firmware. The latter is motivated
> > by CXL memory RAS (Reliability, Availability and Serviceability)
> > support, that requires associating device events with System Physical
> > Address ranges and vice versa.
> > 
> 
> Ok, I simplified down my tests and reverted a bunch of stuff, figured i
> should report this before I dive further in.
> 
> Earlier i was carrying the DOE patches and others, I've dropped most of
> that to make sure i could replicate on the base kernel and qemu images
> 
> QEMU branch: 
> https://gitlab.com/jic23/qemu/-/tree/cxl-2023-01-26
> this is a little out of date at this point i think? but it shouldn't
> matter, the results are the same regardless of what else i pull in.
> 
> Kernel branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region

Note that I acted on this feedback from Greg to break out a fix and
merge it for v6.2-final

http://lore.kernel.org/r/Y+CSOeHVLKudN0A6@kroah.com

...i.e. you are missing at least the passthrough decoder fix, but that
would show up as a region creation failure not a QEMU crash.

So I would move to testing cxl/next.

[..]
> Lets attempt to use the memory
> [root@fedora ~]# numactl --membind=1 python
> KVM internal error. Suberror: 3
> extra data[0]: 0x0000000080000b0e
> extra data[1]: 0x0000000000000031
> extra data[2]: 0x0000000000000d81
> extra data[3]: 0x0000000390074ac0
> extra data[4]: 0x0000000000000010
> RAX=0000000080000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000001
> RSI=0000000000000000 RDI=0000000390074000 RBP=ffffac1c4067bca0 RSP=ffffac1c4067bc88
> R8 =0000000000000000 R9 =0000000000000001 R10=0000000000000000 R11=0000000000000000
> R12=0000000000000000 R13=ffff99eed0074000 R14=0000000000000000 R15=0000000000000000
> RIP=ffffffff812b3d62 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0000 0000000000000000 ffffffff 00c00000
> CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> DS =0000 0000000000000000 ffffffff 00c00000
> FS =0000 0000000000000000 ffffffff 00c00000
> GS =0000 ffff99ec3bc00000 ffffffff 00c00000
> LDT=0000 0000000000000000 ffffffff 00c00000
> TR =0040 fffffe1d13135000 00004087 00008b00 DPL=0 TSS64-busy
> GDT=     fffffe1d13133000 0000007f
> IDT=     fffffe0000000000 00000fff
> CR0=80050033 CR2=ffffffff812b3d62 CR3=0000000390074000 CR4=000006f0
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000fffe0ff0 DR7=0000000000000400
> EFER=0000000000000d01
> Code=5d 9c 01 0f b7 db 48 09 df 48 0f ba ef 3f 0f 22 df 0f 1f 00 <5b> 41 5c 41 5d 5d c3 cc cc cc cc 48 c7 c0 00 00 00 80 48 2b 05 cd 0d 76 01
> 

At first glance that looks like a QEMU issue, but I would capture a cxl
list -vvv before attempting to use the memory just to verify the decoder
setup looks sane.

> 
> 
> I also tested lowering the ram sizes (2GB ram, 1GB "CXL") to see if
> there's something going on with the PCI hole or something, but no, same
> results.
> 
> Double checked if there was an issue using a single root port so i
> registered a second one - same results.
> 
> 
> In prior tests i accessed the memory directly via devmem2
> 
> This still works when mapping the memory manually
> 
> [root@fedora map] ./map_memory.sh
> echo ram > /sys/bus/cxl/devices/decoder2.0/mode
> echo 0x40000000 > /sys/bus/cxl/devices/decoder2.0/dpa_size
> echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
> echo 4096 > /sys/bus/cxl/devices/region0/interleave_granularity
> echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
> echo 0x40000000 > /sys/bus/cxl/devices/region0/size
> echo decoder2.0 > /sys/bus/cxl/devices/region0/target0
> echo 1 > /sys/bus/cxl/devices/region0/commit
> 
> 
> [root@fedora devmem]# ./devmem2 0x290000000 w 0x12345678
> /dev/mem opened.
> Memory mapped at address 0x7fb4d4ed3000.
> Value at address 0x290000000 (0x7fb4d4ed3000): 0x0
> Written 0x12345678; readback 0x12345678

Likely it is sensitive to crossing an interleave threshold.

> This kind of implies there's a disagreement about the state of memory
> between linux and qemu.
> 
> 
> but even just onlining a region produces memory usage:
> 
> [root@fedora ~]# cat /sys/bus/node/devices/node1/meminfo
> Node 1 MemTotal:        1048576 kB
> Node 1 MemFree:         1048112 kB
> Node 1 MemUsed:             464 kB
> 
> 
> Which I would expect to set off some fireworks.
> 
> Maybe an issue at the NUMA level? I just... i have no idea.
> 
> 
> I will need to dig through the email chains to figure out what others
> have been doing that i'm missing.  Everything *looks* nominal, but the
> reactors are exploding so... ¯\_(ツ)_/¯
> 
> I'm not sure where to start here, but i'll bash my face on the keyboard
> for a bit until i have some ideas.

Not ruling out the driver yet, but Fan's tests with hardware have me
leaning more towards QEMU.
Dan Williams Feb. 14, 2023, 6:45 p.m. UTC | #11
David Hildenbrand wrote:
[..]
> My concern would be that in setups with a lot of CXL memory 
> (soft-reserved), having that much offline memory during boot might make 
> the kernel run out of memory. After all, offline memory consumes memory 
> for the memmap.
> 
> Is the assumption that something like that cannot happen because we'll 
> never ever have that much soft-reserved memory? :)

Right, the assumption is that platform firmware is going to pick a
reasonable general purpose pool that lets an OS boot. In the near term
it is difficult to get that wrong since it will encompass all locally
attached DDR-DRAM. In the longer term, when it becomes possible that
all memory is CXL-attached, platform firmware will need to mark
some or all of the CXL memory as general purpose (not Soft Reserved).

> Note that this is a concern only applies when not using auto-onlining in 
> the kernel during boot, which (IMHO) is or will be the default in the 
> future.

Yes, I expect that to be the default as well, and to let the small few
who have custom requirements figure out how to override that default.
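
(For reference, the effective default can be inspected at runtime; a small
sketch:)

    # reflects CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE / memhp_default_state=
    # possible values: offline, online, online_kernel, online_movable
    cat /sys/devices/system/memory/auto_online_blocks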
Gregory Price Feb. 14, 2023, 7:01 p.m. UTC | #12
On Tue, Feb 14, 2023 at 10:39:42AM -0800, Dan Williams wrote:
> Gregory Price wrote:
> > On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
> > > Summary:
> > > --------
> > > 
> > > CXL RAM support allows for the dynamic provisioning of new CXL RAM
> > > regions, and more routinely, assembling a region from an existing
> > > configuration established by platform-firmware. The latter is motivated
> > > by CXL memory RAS (Reliability, Availability and Serviceability)
> > > support, that requires associating device events with System Physical
> > > Address ranges and vice versa.
> > > 
> > 
> > Ok, I simplified down my tests and reverted a bunch of stuff, figured i
> > should report this before I dive further in.
> > 
> > Earlier i was carrying the DOE patches and others, I've dropped most of
> > that to make sure i could replicate on the base kernel and qemu images
> > 
> > QEMU branch: 
> > https://gitlab.com/jic23/qemu/-/tree/cxl-2023-01-26
> > this is a little out of date at this point i think? but it shouldn't
> > matter, the results are the same regardless of what else i pull in.
> > 
> > Kernel branch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region
> 
> Note that I acted on this feedback from Greg to break out a fix and
> merge it for v6.2-final
> 
> http://lore.kernel.org/r/Y+CSOeHVLKudN0A6@kroah.com
> 
> ...i.e. you are missing at least the passthrough decoder fix, but that
> would show up as a region creation failure not a QEMU crash.
> 
> So I would move to testing cxl/next.
> 

I just noticed this; already spinning a new kernel.  Will report back.

> Not ruling out the driver yet, but Fan's tests with hardware has me
> leaning more towards QEMU.

Same, not much has changed and I haven't tested with hardware yet. Was
planning to install it on our local boxes sometime later this week.

Was just so close to setting up a virtual memory pool in the lab, was
getting antsy :]
Jonathan Cameron Feb. 14, 2023, 9:18 p.m. UTC | #13
On Tue, 14 Feb 2023 14:01:23 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Tue, Feb 14, 2023 at 10:39:42AM -0800, Dan Williams wrote:
> > Gregory Price wrote:  
> > > On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:  
> > > > Summary:
> > > > --------
> > > > 
> > > > CXL RAM support allows for the dynamic provisioning of new CXL RAM
> > > > regions, and more routinely, assembling a region from an existing
> > > > configuration established by platform-firmware. The latter is motivated
> > > > by CXL memory RAS (Reliability, Availability and Serviceability)
> > > > support, that requires associating device events with System Physical
> > > > Address ranges and vice versa.
> > > >   
> > > 
> > > Ok, I simplified down my tests and reverted a bunch of stuff, figured i
> > > should report this before I dive further in.
> > > 
> > > Earlier i was carrying the DOE patches and others, I've dropped most of
> > > that to make sure i could replicate on the base kernel and qemu images
> > > 
> > > QEMU branch: 
> > > https://gitlab.com/jic23/qemu/-/tree/cxl-2023-01-26
> > > this is a little out of date at this point i think? but it shouldn't
> > > matter, the results are the same regardless of what else i pull in.
> > > 
> > > Kernel branch:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region  
> > 
> > Note that I acted on this feedback from Greg to break out a fix and
> > merge it for v6.2-final
> > 
> > http://lore.kernel.org/r/Y+CSOeHVLKudN0A6@kroah.com
> > 
> > ...i.e. you are missing at least the passthrough decoder fix, but that
> > would show up as a region creation failure not a QEMU crash.
> > 
> > So I would move to testing cxl/next.
> >   
> 
> I just noticed this, already spinning a new kernel.  Will report back
> 
> > Not ruling out the driver yet, but Fan's tests with hardware has me
> > leaning more towards QEMU.  
> 
> Same, not much has changed and I haven't tested with hardware yet. Was
> planning to install it on our local boxes sometime later this week.
> 
> Was just so close to setting up a virtual memory pool in the lab, was
> getting antsy :]

Could you test it with TCG (just drop --enable-kvm)?  We have a known
limitation with x86 instructions running out of CXL emulated memory
(side effect of emulating the interleave).  You'll need a fix even on TCG
for the corner case of an instruction bridging from normal ram to cxl memory.
https://lore.kernel.org/qemu-devel/20230206193809.1153124-1-richard.henderson@linaro.org/

Performance will be bad, but so far this is the only way we can do it correctly.

Jonathan
Gregory Price Feb. 14, 2023, 9:51 p.m. UTC | #14
On Tue, Feb 14, 2023 at 09:18:24PM +0000, Jonathan Cameron wrote:
> On Tue, 14 Feb 2023 14:01:23 -0500
> Gregory Price <gregory.price@memverge.com> wrote:
> 
> > On Tue, Feb 14, 2023 at 10:39:42AM -0800, Dan Williams wrote:
> > > Gregory Price wrote:  
> > > > On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:  
> > > > > Summary:
> > > > > --------
> > > > > 
> > > > > CXL RAM support allows for the dynamic provisioning of new CXL RAM
> > > > > regions, and more routinely, assembling a region from an existing
> > > > > configuration established by platform-firmware. The latter is motivated
> > > > > by CXL memory RAS (Reliability, Availability and Serviceability)
> > > > > support, that requires associating device events with System Physical
> > > > > Address ranges and vice versa.
> > > > >   
> > > > 
> > > > Ok, I simplified down my tests and reverted a bunch of stuff, figured i
> > > > should report this before I dive further in.
> > > > 
> > > > Earlier i was carrying the DOE patches and others, I've dropped most of
> > > > that to make sure i could replicate on the base kernel and qemu images
> > > > 
> > > > QEMU branch: 
> > > > https://gitlab.com/jic23/qemu/-/tree/cxl-2023-01-26
> > > > this is a little out of date at this point i think? but it shouldn't
> > > > matter, the results are the same regardless of what else i pull in.
> > > > 
> > > > Kernel branch:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region  
> > > 
> > > Note that I acted on this feedback from Greg to break out a fix and
> > > merge it for v6.2-final
> > > 
> > > http://lore.kernel.org/r/Y+CSOeHVLKudN0A6@kroah.com
> > > 
> > > ...i.e. you are missing at least the passthrough decoder fix, but that
> > > would show up as a region creation failure not a QEMU crash.
> > > 
> > > So I would move to testing cxl/next.
> > >   
> > 
> > I just noticed this, already spinning a new kernel.  Will report back
> > 
> > > Not ruling out the driver yet, but Fan's tests with hardware has me
> > > leaning more towards QEMU.  
> > 
> > Same, not much has changed and I haven't tested with hardware yet. Was
> > planning to install it on our local boxes sometime later this week.
> > 
> > Was just so close to setting up a virtual memory pool in the lab, was
> > getting antsy :]
> 
> Could you test it with TCG (just drop --enable-kvm)?  We have a known
> limitation with x86 instructions running out of CXL emulated memory
> (side effect of emulating the interleave).  You'll need a fix even on TCG
> for the corner case of an instruction bridging from normal ram to cxl memory.
> https://lore.kernel.org/qemu-devel/20230206193809.1153124-1-richard.henderson@linaro.org/
> 
> Performance will be bad, but so far this is only way we can do it correctly.
> 
> Jonathan
> 

Siiiggghh... I had this patch and dropped --enable-kvm, but forgot to
drop "accel=kvm" from the -machine line.

This was the issue.
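
(i.e., for TCG both knobs have to go; with the rest of the command line
unchanged, the machine line ends up as something like:)

    -machine type=q35,cxl=on \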

And let me tell you, if you run numactl --membind=1 python, it is
IMPRESSIVELY slow.  I wonder if it's even hitting a few 100k
instructions a second.


This appears to be the issue.  When I get a bit more time, I'll try to dive
into the deep dark depths of the QEMU memory regions to see how difficult
a non-MMIO fork might be, unless someone else is already looking at it.

~Gregory
Gregory Price Feb. 14, 2023, 9:54 p.m. UTC | #15
On Tue, Feb 14, 2023 at 04:51:53PM -0500, Gregory Price wrote:
> On Tue, Feb 14, 2023 at 09:18:24PM +0000, Jonathan Cameron wrote:
> > On Tue, 14 Feb 2023 14:01:23 -0500
> > Gregory Price <gregory.price@memverge.com> wrote:
> > 
> > Could you test it with TCG (just drop --enable-kvm)?  We have a known
> > limitation with x86 instructions running out of CXL emulated memory
> > (side effect of emulating the interleave).  You'll need a fix even on TCG
> > for the corner case of an instruction bridging from normal ram to cxl memory.
> > https://lore.kernel.org/qemu-devel/20230206193809.1153124-1-richard.henderson@linaro.org/
> > 
> > Performance will be bad, but so far this is only way we can do it correctly.
> > 
> > Jonathan
> > 
> 
> Siiiggghh... i had this patch and dropped --enable-kvm, but forgot to
> drop "accel=kvm" from the -machine line
> 
> This was the issue.
> 
> And let me tell you, if you numactl --membind=1 python, it is
> IMPRESSIVELY slow.  I wonder if it's even hitting a few 100k
> instructions a second.
> 
> 
> This appears to be the issue.  When I get a bit more time, try to dive
> into the deep dark depths of qemu memory regions to see how difficult
> a non-mmio fork might be, unless someone else is already looking at it.
> 
> ~Gregory

Just clarifying one thing:  Even with the patch, KVM blows up.
Disabling KVM fixes this entirely.  I haven't yet tested without KVM but
with the patch; I will do that now.
Jonathan Cameron Feb. 15, 2023, 10:03 a.m. UTC | #16
On Tue, 14 Feb 2023 16:54:02 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Tue, Feb 14, 2023 at 04:51:53PM -0500, Gregory Price wrote:
> > On Tue, Feb 14, 2023 at 09:18:24PM +0000, Jonathan Cameron wrote:  
> > > On Tue, 14 Feb 2023 14:01:23 -0500
> > > Gregory Price <gregory.price@memverge.com> wrote:
> > > 
> > > Could you test it with TCG (just drop --enable-kvm)?  We have a known
> > > limitation with x86 instructions running out of CXL emulated memory
> > > (side effect of emulating the interleave).  You'll need a fix even on TCG
> > > for the corner case of an instruction bridging from normal ram to cxl memory.
> > > https://lore.kernel.org/qemu-devel/20230206193809.1153124-1-richard.henderson@linaro.org/
> > > 
> > > Performance will be bad, but so far this is only way we can do it correctly.
> > > 
> > > Jonathan
> > >   
> > 
> > Siiiggghh... i had this patch and dropped --enable-kvm, but forgot to
> > drop "accel=kvm" from the -machine line
> > 
> > This was the issue.
> > 
> > And let me tell you, if you numactl --membind=1 python, it is
> > IMPRESSIVELY slow.  I wonder if it's even hitting a few 100k
> > instructions a second.
> > 
> > 
> > This appears to be the issue.  When I get a bit more time, try to dive
> > into the deep dark depths of qemu memory regions to see how difficult
> > a non-mmio fork might be, unless someone else is already looking at it.
> > 
> > ~Gregory  
> 
> Just clarifying one thing:  Even with the patch, KVM blows up.
> Disabling KVM fixes this entirely.  I haven't tested without KVM but
> with the patch, i will do that now.

Yup.  The patch only fixes TCG, so that's expected behavior.

Fingers crossed on this 'working'.

I'm open to suggestions on how to work around the problem with KVM
or indeed allow TCG to cache the instructions (right now it has
to fetch and emulate each instruction on its own).

I can envision how we might do it for KVM with userspace page fault handling
used to get a fault up to QEMU, which can then stitch in a cache
of the underlying memory as a stage-2 translation for the page (a little
bit like how post-copy migration works), though I've not prototyped
anything...

I think it would be complex code that would be little used
so we may just have to cope with the emulation being slow.

The intent is very much to be able to test the kernel code etc., not to
test it quickly :)

Jonathan
Gregory Price Feb. 18, 2023, 9:47 a.m. UTC | #17
On Wed, Feb 15, 2023 at 10:03:27AM +0000, Jonathan Cameron wrote:
> On Tue, 14 Feb 2023 16:54:02 -0500
> Gregory Price <gregory.price@memverge.com> wrote:
> 
> > Just clarifying one thing:  Even with the patch, KVM blows up.
> > Disabling KVM fixes this entirely.  I haven't tested without KVM but
> > with the patch, i will do that now.
> 
> yup.  The patch only fixes TCG so that's expected behavior.
> 
> Fingers crossed on this 'working'.
> 
> I'm open to suggestions on how to work around the problem with KVM
> or indeed allow TCG to cache the instructions (right not it has
> to fetch and emulate each instruction on it's own).
> 
> I can envision how we might do it for KVM with userspace page fault handling
> used to get a fault up to QEMU which can then stitch in a cache
> of the underlying memory as a stage 2 translation to the page (a little
> bit like how post migration copy works) though I've not prototyped
> anything...
> 

Just following up.  With the patch applied and KVM turned off, no crash.
I've been working with this for a while.

We should move the instruction alignment issue into a separate
discussion thread.