Message ID | 20230111142440.24771-1-Jonathan.Cameron@huawei.com |
---|---|
Headers | show |
Series | hw/cxl: CXL emulation cleanups and minor fixes for upstream | expand |
On Wed, Jan 11, 2023 at 02:24:32PM +0000, Jonathan Cameron via wrote: > Gregory's patches were posted as part of his work on adding volatile support. > https://lore.kernel.org/linux-cxl/20221006233702.18532-1-gregory.price@memverge.com/ > https://lore.kernel.org/linux-cxl/20221128150157.97724-2-gregory.price@memverge.com/ > I might propose this for upstream inclusion this cycle, but testing is > currently limited by lack of suitable kernel support. fwiw the testing i've done suggests the problem isn't necessarily the implementation so much as either the EFI support or the ACPI tables. For example, we see memory expanders come up no problem and turn into volatile memory on real hardware, with the same kernels with just a few commands. My gut feeling is that either a mailbox command is missing or that the ACPI tables are missing/significantly different. I haven't been able to investigate further at this point, but that's my current state with the voltile type-3 device testing.
On Thu, 12 Jan 2023 10:39:17 -0500 Gregory Price <gregory.price@memverge.com> wrote: > On Wed, Jan 11, 2023 at 02:24:32PM +0000, Jonathan Cameron via wrote: > > Gregory's patches were posted as part of his work on adding volatile support. > > https://lore.kernel.org/linux-cxl/20221006233702.18532-1-gregory.price@memverge.com/ > > https://lore.kernel.org/linux-cxl/20221128150157.97724-2-gregory.price@memverge.com/ > > I might propose this for upstream inclusion this cycle, but testing is > > currently limited by lack of suitable kernel support. > > fwiw the testing i've done suggests the problem isn't necessarily the > implementation so much as either the EFI support or the ACPI tables. > > For example, we see memory expanders come up no problem and turn into > volatile memory on real hardware, with the same kernels with just a few > commands. My gut feeling is that either a mailbox command is missing or > that the ACPI tables are missing/significantly different. > > I haven't been able to investigate further at this point, but that's my > current state with the voltile type-3 device testing. My assumption was that all shipping hardware platforms were doing the enumeration and bring up of memory expanders in the BIOS / firmware. Those are then presented to the OS already set up exactly as if they were normal memory. We could do the same on QEMU but that means a lot of work in EDK2. Note that it makes no sense to do the enumeration and creation of ACPI tables in QEMU itself though could hack it like that. This stuff is done in firmware because that enables it for legacy OSes. Everything is more or less presented to the OS like you would present RAM (EFI memory map, ACPI tables etc). Firmware enumeration doesn't typically support hotplug, so if we add support for hotplug of volatile memory type 3 devices to the kernel we will also be able to do 'cold plug' and have the kernel bring them up in a similar fashion to what we do for non-volatile (for non volatile there is typically no real support in firmware as there is a bunch of policy to deal with that doesn't belong in firmware). (simplifying heavily ;) So I don't think we are missing anything in the emulation, just in the software layers above it. Could be wrong though ;) Jonathan
On Thu, Jan 12, 2023 at 05:21:30PM +0000, Jonathan Cameron wrote: > On Thu, 12 Jan 2023 10:39:17 -0500 > Gregory Price <gregory.price@memverge.com> wrote: > > > On Wed, Jan 11, 2023 at 02:24:32PM +0000, Jonathan Cameron via wrote: > > > Gregory's patches were posted as part of his work on adding volatile support. > > > https://lore.kernel.org/linux-cxl/20221006233702.18532-1-gregory.price@memverge.com/ > > > https://lore.kernel.org/linux-cxl/20221128150157.97724-2-gregory.price@memverge.com/ > > > I might propose this for upstream inclusion this cycle, but testing is > > > currently limited by lack of suitable kernel support. > > > > fwiw the testing i've done suggests the problem isn't necessarily the > > implementation so much as either the EFI support or the ACPI tables. > > > > For example, we see memory expanders come up no problem and turn into > > volatile memory on real hardware, with the same kernels with just a few > > commands. My gut feeling is that either a mailbox command is missing or > > that the ACPI tables are missing/significantly different. > > > > I haven't been able to investigate further at this point, but that's my > > current state with the voltile type-3 device testing. > > My assumption was that all shipping hardware platforms were doing the > enumeration and bring up of memory expanders in the BIOS / firmware. > Those are then presented to the OS already set up exactly as if they were > normal memory. We could do the same on QEMU but that means a lot of > work in EDK2. Note that it makes no sense to do the enumeration and > creation of ACPI tables in QEMU itself though could hack it like that. > This stuff is done in firmware because that enables it for legacy > OSes. Everything is more or less presented to the OS like you would > present RAM (EFI memory map, ACPI tables etc). > > Firmware enumeration doesn't typically support hotplug, so if we add > support for hotplug of volatile memory type 3 devices to the kernel > we will also be able to do 'cold plug' and have the kernel bring them up > in a similar fashion to what we do for non-volatile (for non volatile there > is typically no real support in firmware as there is a bunch of policy to > deal with that doesn't belong in firmware). (simplifying heavily ;) > > So I don't think we are missing anything in the emulation, just in the > software layers above it. Could be wrong though ;) > > Jonathan > > I'm not so sure something is missing so much as something seems incorrect in either the ACPI table structure definitions, the mailbox, or even the doe emulation. I took your branch and reverted to just prior to the volatile patch refernce: 59a59ef725699e0efb3e9e31a7f8d246de7286ed QEMU configuration for boot (Please let me know if something is wrong) sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ -m 2G,slots=4,maxmem=4G \ -smp 4 \ -machine type=q35,accel=kvm,cxl=on \ -enable-kvm \ -nographic \ -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ -object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \ -object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \ -device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \ -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G After boot we find: [root@fedora ~]# ls /sys/bus/cxl/devices/ decoder0.0 decoder2.0 mem0 pmem0 root0 decoder1.0 endpoint2 nvdimm-bridge0 port1 [root@fedora ~]# ls -al /sys/bus/dax/devices/ total 0 drwxr-xr-x. 2 root root 0 Jan 12 22:44 . drwxr-xr-x. 4 root root 0 Jan 12 22:44 .. During boot, I am seeing three separate call traces, all of which appear to be related to PCI DOE and/or getting CDAT information. [ 3.916900] Call Trace: [ 3.916906] <TASK> [ 3.931217] pci_doe_submit_task+0x5d/0xd0 [ 3.936609] pci_doe_discovery+0xb4/0x100 [ 3.936627] ? pci_doe_xa_destroy+0x10/0x10 [ 3.942675] pcim_doe_create_mb+0x219/0x290 [ 3.950506] cxl_pci_probe+0x192/0x430 [ 3.960248] local_pci_probe+0x41/0x80 [ 3.966564] pci_device_probe+0xb3/0x220 [ 3.966579] really_probe+0xde/0x380 [ 3.966583] ? pm_runtime_barrier+0x50/0x90 [ 3.969158] __driver_probe_device+0x78/0x170 [ 3.969167] driver_probe_device+0x1f/0x90 [ 3.978264] __driver_attach_async_helper+0x5c/0xe0 [ 3.983953] async_run_entry_fn+0x30/0x130 [ 3.991084] process_one_work+0x294/0x5b0 [ 4.004458] worker_thread+0x4f/0x3a0 [ 4.012612] ? process_one_work+0x5b0/0x5b0 [ 4.019114] kthread+0xf5/0x120 [ 4.025133] ? kthread_complete_and_exit+0x20/0x20 [ 4.031327] ret_from_fork+0x22/0x30 [ 4.038969] </TASK> [ 16.047704] pci_doe_submit_task+0x5d/0xd0 [ 16.047713] cxl_cdat_get_length+0xb8/0x110 [ 16.047779] ? dvsec_range_allowed+0x60/0x60 [ 16.047803] read_cdat_data+0xaf/0x1a0 [ 16.047814] cxl_port_probe+0x80/0x120 [ 16.047824] cxl_bus_probe+0x17/0x50 [ 16.047830] really_probe+0xde/0x380 [ 16.047835] ? pm_runtime_barrier+0x50/0x90 [ 16.047843] __driver_probe_device+0x78/0x170 [ 16.047851] driver_probe_device+0x1f/0x90 [ 16.047858] __device_attach_driver+0x85/0x110 [ 16.047881] ? driver_allows_async_probing+0x70/0x70 [ 16.047884] bus_for_each_drv+0x7a/0xb0 [ 16.047896] __device_attach+0xb3/0x1d0 [ 16.047907] bus_probe_device+0x9f/0xc0 [ 16.047913] device_add+0x41e/0x9b0 [ 16.047918] ? kobject_set_name_vargs+0x6d/0x90 [ 16.047928] ? dev_set_name+0x4b/0x60 [ 16.047944] devm_cxl_add_port+0x27b/0x3b0 [ 16.047970] devm_cxl_add_endpoint+0x82/0x130 [ 16.047982] cxl_mem_probe+0xc4/0x11d [cxl_mem] [ 16.047997] cxl_bus_probe+0x17/0x50 [ 16.048003] really_probe+0xde/0x380 [ 16.048007] ? pm_runtime_barrier+0x50/0x90 [ 16.048014] __driver_probe_device+0x78/0x170 [ 16.048022] driver_probe_device+0x1f/0x90 [ 16.048029] __driver_attach+0xd5/0x1d0 [ 16.048036] ? __device_attach_driver+0x110/0x110 [ 16.048040] bus_for_each_dev+0x76/0xa0 [ 16.048051] bus_add_driver+0x1b1/0x200 [ 16.048061] driver_register+0x89/0xe0 [ 16.048066] ? 0xffffffffc056e000 [ 16.048070] do_one_initcall+0x6e/0x320 [ 16.048091] do_init_module+0x4a/0x200 [ 16.048099] __do_sys_init_module+0x16a/0x1a0 [ 16.048132] do_syscall_64+0x5b/0x80 [ 16.048138] ? lock_is_held_type+0xe8/0x140 [ 16.048148] ? asm_exc_page_fault+0x22/0x30 [ 16.048156] ? lockdep_hardirqs_on+0x7d/0x100 [ 16.048162] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 16.054601] pci_doe_submit_task+0x5d/0xd0 [ 16.054610] cxl_cdat_read_table.isra.0+0x141/0x190 [ 16.054660] ? dvsec_range_allowed+0x60/0x60 [ 16.054685] read_cdat_data+0xfc/0x1a0 [ 16.054695] cxl_port_probe+0x80/0x120 [ 16.054706] cxl_bus_probe+0x17/0x50 [ 16.054712] really_probe+0xde/0x380 [ 16.054717] ? pm_runtime_barrier+0x50/0x90 [ 16.054725] __driver_probe_device+0x78/0x170 [ 16.054733] driver_probe_device+0x1f/0x90 [ 16.054739] __device_attach_driver+0x85/0x110 [ 16.054747] ? driver_allows_async_probing+0x70/0x70 [ 16.054751] bus_for_each_drv+0x7a/0xb0 [ 16.054767] __device_attach+0xb3/0x1d0 [ 16.054782] bus_probe_device+0x9f/0xc0 [ 16.054791] device_add+0x41e/0x9b0 [ 16.054798] ? kobject_set_name_vargs+0x6d/0x90 [ 16.054811] ? dev_set_name+0x4b/0x60 [ 16.054831] devm_cxl_add_port+0x27b/0x3b0 [ 16.054843] devm_cxl_add_endpoint+0x82/0x130 [ 16.054854] cxl_mem_probe+0xc4/0x11d [cxl_mem] [ 16.054869] cxl_bus_probe+0x17/0x50 [ 16.054875] really_probe+0xde/0x380 [ 16.054879] ? pm_runtime_barrier+0x50/0x90 [ 16.054887] __driver_probe_device+0x78/0x170 [ 16.054894] driver_probe_device+0x1f/0x90 [ 16.054901] __driver_attach+0xd5/0x1d0 [ 16.054908] ? __device_attach_driver+0x110/0x110 [ 16.054912] bus_for_each_dev+0x76/0xa0 [ 16.054923] bus_add_driver+0x1b1/0x200 [ 16.055204] driver_register+0x89/0xe0 [ 16.055211] ? 0xffffffffc056e000 [ 16.055215] do_one_initcall+0x6e/0x320 [ 16.055237] do_init_module+0x4a/0x200 [ 16.055245] __do_sys_init_module+0x16a/0x1a0 [ 16.055277] do_syscall_64+0x5b/0x80 [ 16.055283] ? lock_is_held_type+0xe8/0x140 [ 16.055294] ? asm_exc_page_fault+0x22/0x30 [ 16.055301] ? lockdep_hardirqs_on+0x7d/0x100 [ 16.055307] entry_SYSCALL_64_after_hwframe+0x63/0xcd
On Thu, 12 Jan 2023 17:46:27 -0500 Gregory Price <gregory.price@memverge.com> wrote: > On Thu, Jan 12, 2023 at 05:21:30PM +0000, Jonathan Cameron wrote: > > On Thu, 12 Jan 2023 10:39:17 -0500 > > Gregory Price <gregory.price@memverge.com> wrote: > > > > > On Wed, Jan 11, 2023 at 02:24:32PM +0000, Jonathan Cameron via wrote: > > > > Gregory's patches were posted as part of his work on adding volatile support. > > > > https://lore.kernel.org/linux-cxl/20221006233702.18532-1-gregory.price@memverge.com/ > > > > https://lore.kernel.org/linux-cxl/20221128150157.97724-2-gregory.price@memverge.com/ > > > > I might propose this for upstream inclusion this cycle, but testing is > > > > currently limited by lack of suitable kernel support. > > > > > > fwiw the testing i've done suggests the problem isn't necessarily the > > > implementation so much as either the EFI support or the ACPI tables. > > > > > > For example, we see memory expanders come up no problem and turn into > > > volatile memory on real hardware, with the same kernels with just a few > > > commands. My gut feeling is that either a mailbox command is missing or > > > that the ACPI tables are missing/significantly different. > > > > > > I haven't been able to investigate further at this point, but that's my > > > current state with the voltile type-3 device testing. > > > > My assumption was that all shipping hardware platforms were doing the > > enumeration and bring up of memory expanders in the BIOS / firmware. > > Those are then presented to the OS already set up exactly as if they were > > normal memory. We could do the same on QEMU but that means a lot of > > work in EDK2. Note that it makes no sense to do the enumeration and > > creation of ACPI tables in QEMU itself though could hack it like that. > > This stuff is done in firmware because that enables it for legacy > > OSes. Everything is more or less presented to the OS like you would > > present RAM (EFI memory map, ACPI tables etc). > > > > Firmware enumeration doesn't typically support hotplug, so if we add > > support for hotplug of volatile memory type 3 devices to the kernel > > we will also be able to do 'cold plug' and have the kernel bring them up > > in a similar fashion to what we do for non-volatile (for non volatile there > > is typically no real support in firmware as there is a bunch of policy to > > deal with that doesn't belong in firmware). (simplifying heavily ;) > > > > So I don't think we are missing anything in the emulation, just in the > > software layers above it. Could be wrong though ;) > > > > Jonathan > > > > > > I'm not so sure something is missing so much as something seems > incorrect in either the ACPI table structure definitions, the mailbox, > or even the doe emulation. > > I took your branch and reverted to just prior to the volatile patch > refernce: 59a59ef725699e0efb3e9e31a7f8d246de7286ed > > > QEMU configuration for boot (Please let me know if something is wrong) > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > -m 2G,slots=4,maxmem=4G \ > -smp 4 \ > -machine type=q35,accel=kvm,cxl=on \ > -enable-kvm \ > -nographic \ > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > -object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \ > -object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \ > -device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \ > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > After boot we find: > > [root@fedora ~]# ls /sys/bus/cxl/devices/ > decoder0.0 decoder2.0 mem0 pmem0 root0 > decoder1.0 endpoint2 nvdimm-bridge0 port1 > > [root@fedora ~]# ls -al /sys/bus/dax/devices/ > total 0 > drwxr-xr-x. 2 root root 0 Jan 12 22:44 . > drwxr-xr-x. 4 root root 0 Jan 12 22:44 .. > > > During boot, I am seeing three separate call traces, all of which appear > to be related to PCI DOE and/or getting CDAT information. Just to check, are these different from the on stack problem you reported previously? Doesn't look like the fix for that has made it upstream yet. What kernel are you running? > > [ 3.916900] Call Trace: > [ 3.916906] <TASK> > [ 3.931217] pci_doe_submit_task+0x5d/0xd0 > [ 3.936609] pci_doe_discovery+0xb4/0x100 > [ 3.936627] ? pci_doe_xa_destroy+0x10/0x10 > [ 3.942675] pcim_doe_create_mb+0x219/0x290 > [ 3.950506] cxl_pci_probe+0x192/0x430 > [ 3.960248] local_pci_probe+0x41/0x80 > [ 3.966564] pci_device_probe+0xb3/0x220 > [ 3.966579] really_probe+0xde/0x380 > [ 3.966583] ? pm_runtime_barrier+0x50/0x90 > [ 3.969158] __driver_probe_device+0x78/0x170 > [ 3.969167] driver_probe_device+0x1f/0x90 > [ 3.978264] __driver_attach_async_helper+0x5c/0xe0 > [ 3.983953] async_run_entry_fn+0x30/0x130 > [ 3.991084] process_one_work+0x294/0x5b0 > [ 4.004458] worker_thread+0x4f/0x3a0 > [ 4.012612] ? process_one_work+0x5b0/0x5b0 > [ 4.019114] kthread+0xf5/0x120 > [ 4.025133] ? kthread_complete_and_exit+0x20/0x20 > [ 4.031327] ret_from_fork+0x22/0x30 > [ 4.038969] </TASK> > > [ 16.047704] pci_doe_submit_task+0x5d/0xd0 > [ 16.047713] cxl_cdat_get_length+0xb8/0x110 > [ 16.047779] ? dvsec_range_allowed+0x60/0x60 > [ 16.047803] read_cdat_data+0xaf/0x1a0 > [ 16.047814] cxl_port_probe+0x80/0x120 > [ 16.047824] cxl_bus_probe+0x17/0x50 > [ 16.047830] really_probe+0xde/0x380 > [ 16.047835] ? pm_runtime_barrier+0x50/0x90 > [ 16.047843] __driver_probe_device+0x78/0x170 > [ 16.047851] driver_probe_device+0x1f/0x90 > [ 16.047858] __device_attach_driver+0x85/0x110 > [ 16.047881] ? driver_allows_async_probing+0x70/0x70 > [ 16.047884] bus_for_each_drv+0x7a/0xb0 > [ 16.047896] __device_attach+0xb3/0x1d0 > [ 16.047907] bus_probe_device+0x9f/0xc0 > [ 16.047913] device_add+0x41e/0x9b0 > [ 16.047918] ? kobject_set_name_vargs+0x6d/0x90 > [ 16.047928] ? dev_set_name+0x4b/0x60 > [ 16.047944] devm_cxl_add_port+0x27b/0x3b0 > [ 16.047970] devm_cxl_add_endpoint+0x82/0x130 > [ 16.047982] cxl_mem_probe+0xc4/0x11d [cxl_mem] > [ 16.047997] cxl_bus_probe+0x17/0x50 > [ 16.048003] really_probe+0xde/0x380 > [ 16.048007] ? pm_runtime_barrier+0x50/0x90 > [ 16.048014] __driver_probe_device+0x78/0x170 > [ 16.048022] driver_probe_device+0x1f/0x90 > [ 16.048029] __driver_attach+0xd5/0x1d0 > [ 16.048036] ? __device_attach_driver+0x110/0x110 > [ 16.048040] bus_for_each_dev+0x76/0xa0 > [ 16.048051] bus_add_driver+0x1b1/0x200 > [ 16.048061] driver_register+0x89/0xe0 > [ 16.048066] ? 0xffffffffc056e000 > [ 16.048070] do_one_initcall+0x6e/0x320 > [ 16.048091] do_init_module+0x4a/0x200 > [ 16.048099] __do_sys_init_module+0x16a/0x1a0 > [ 16.048132] do_syscall_64+0x5b/0x80 > [ 16.048138] ? lock_is_held_type+0xe8/0x140 > [ 16.048148] ? asm_exc_page_fault+0x22/0x30 > [ 16.048156] ? lockdep_hardirqs_on+0x7d/0x100 > [ 16.048162] entry_SYSCALL_64_after_hwframe+0x63/0xcd > > [ 16.054601] pci_doe_submit_task+0x5d/0xd0 > [ 16.054610] cxl_cdat_read_table.isra.0+0x141/0x190 > [ 16.054660] ? dvsec_range_allowed+0x60/0x60 > [ 16.054685] read_cdat_data+0xfc/0x1a0 > [ 16.054695] cxl_port_probe+0x80/0x120 > [ 16.054706] cxl_bus_probe+0x17/0x50 > [ 16.054712] really_probe+0xde/0x380 > [ 16.054717] ? pm_runtime_barrier+0x50/0x90 > [ 16.054725] __driver_probe_device+0x78/0x170 > [ 16.054733] driver_probe_device+0x1f/0x90 > [ 16.054739] __device_attach_driver+0x85/0x110 > [ 16.054747] ? driver_allows_async_probing+0x70/0x70 > [ 16.054751] bus_for_each_drv+0x7a/0xb0 > [ 16.054767] __device_attach+0xb3/0x1d0 > [ 16.054782] bus_probe_device+0x9f/0xc0 > [ 16.054791] device_add+0x41e/0x9b0 > [ 16.054798] ? kobject_set_name_vargs+0x6d/0x90 > [ 16.054811] ? dev_set_name+0x4b/0x60 > [ 16.054831] devm_cxl_add_port+0x27b/0x3b0 > [ 16.054843] devm_cxl_add_endpoint+0x82/0x130 > [ 16.054854] cxl_mem_probe+0xc4/0x11d [cxl_mem] > [ 16.054869] cxl_bus_probe+0x17/0x50 > [ 16.054875] really_probe+0xde/0x380 > [ 16.054879] ? pm_runtime_barrier+0x50/0x90 > [ 16.054887] __driver_probe_device+0x78/0x170 > [ 16.054894] driver_probe_device+0x1f/0x90 > [ 16.054901] __driver_attach+0xd5/0x1d0 > [ 16.054908] ? __device_attach_driver+0x110/0x110 > [ 16.054912] bus_for_each_dev+0x76/0xa0 > [ 16.054923] bus_add_driver+0x1b1/0x200 > [ 16.055204] driver_register+0x89/0xe0 > [ 16.055211] ? 0xffffffffc056e000 > [ 16.055215] do_one_initcall+0x6e/0x320 > [ 16.055237] do_init_module+0x4a/0x200 > [ 16.055245] __do_sys_init_module+0x16a/0x1a0 > [ 16.055277] do_syscall_64+0x5b/0x80 > [ 16.055283] ? lock_is_held_type+0xe8/0x140 > [ 16.055294] ? asm_exc_page_fault+0x22/0x30 > [ 16.055301] ? lockdep_hardirqs_on+0x7d/0x100 > [ 16.055307] entry_SYSCALL_64_after_hwframe+0x63/0xcd
On Fri, Jan 13, 2023 at 09:12:13AM +0000, Jonathan Cameron wrote: > > Just to check, are these different from the on stack problem you reported > previously? Doesn't look like the fix for that has made it upstream yet. > > What kernel are you running? > > The prior issue I saw was related to the CXL Fixed Memory Window having an e820 region registered during machine initialization. That fix is upstream. On 2023-1-11 branch it is commit 2486dd045794d65598fbca9fd1224c27b9732dce This one appears when registering any kind of type-3 device, during boot.
On Fri, 13 Jan 2023 09:19:59 -0500 Gregory Price <gregory.price@memverge.com> wrote: > On Fri, Jan 13, 2023 at 09:12:13AM +0000, Jonathan Cameron wrote: > > > > Just to check, are these different from the on stack problem you reported > > previously? Doesn't look like the fix for that has made it upstream yet. > > > > What kernel are you running? > > > > > > The prior issue I saw was related to the CXL Fixed Memory Window having > an e820 region registered during machine initialization. That fix is > upstream. > > On 2023-1-11 branch it is commit 2486dd045794d65598fbca9fd1224c27b9732dce > > This one appears when registering any kind of type-3 device, during > boot. I meant this one https://lore.kernel.org/all/20221118000524.1477383-1-ira.weiny@intel.com/ Sorry, should have dug out a link in earlier reply and save a round trip. Jonathan
On Fri, 13 Jan 2023 14:40:26 +0000 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > On Fri, 13 Jan 2023 09:19:59 -0500 > Gregory Price <gregory.price@memverge.com> wrote: > > > On Fri, Jan 13, 2023 at 09:12:13AM +0000, Jonathan Cameron wrote: > > > > > > Just to check, are these different from the on stack problem you reported > > > previously? Doesn't look like the fix for that has made it upstream yet. > > > > > > What kernel are you running? > > > > > > > > > > The prior issue I saw was related to the CXL Fixed Memory Window having > > an e820 region registered during machine initialization. That fix is > > upstream. > > > > On 2023-1-11 branch it is commit 2486dd045794d65598fbca9fd1224c27b9732dce > > > > This one appears when registering any kind of type-3 device, during > > boot. > > I meant this one > > https://lore.kernel.org/all/20221118000524.1477383-1-ira.weiny@intel.com/ > > Sorry, should have dug out a link in earlier reply and save a round trip. Ah I'd forgotten we were going round the houses somewhat on the right fix... https://lore.kernel.org/all/20221128040338.1936529-3-ira.weiny@intel.com/ was another proposal and Lukas had yet another. https://lore.kernel.org/all/cover.1669608950.git.lukas@wunner.de/ > > Jonathan
On Fri, Jan 13, 2023 at 02:45:11PM +0000, Jonathan Cameron wrote: > On Fri, 13 Jan 2023 14:40:26 +0000 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > > On Fri, 13 Jan 2023 09:19:59 -0500 Gregory Price <gregory.price@memverge.com> wrote: > > > On Fri, Jan 13, 2023 at 09:12:13AM +0000, Jonathan Cameron wrote: > > > > Just to check, are these different from the on stack problem you reported > > > > previously? Doesn't look like the fix for that has made it upstream yet. > > > > > > > > What kernel are you running? > > > > > > The prior issue I saw was related to the CXL Fixed Memory Window having > > > an e820 region registered during machine initialization. That fix is > > > upstream. > > > > > > On 2023-1-11 branch it is commit 2486dd045794d65598fbca9fd1224c27b9732dce > > > > > > This one appears when registering any kind of type-3 device, during > > > boot. > > > > I meant this one > > > > https://lore.kernel.org/all/20221118000524.1477383-1-ira.weiny@intel.com/ > > > > Sorry, should have dug out a link in earlier reply and save a round trip. > > Ah I'd forgotten we were going round the houses somewhat on the right fix... > https://lore.kernel.org/all/20221128040338.1936529-3-ira.weiny@intel.com/ > was another proposal and Lukas had yet another. > https://lore.kernel.org/all/cover.1669608950.git.lukas@wunner.de/ I'll respin those fixes shortly, together with the other DOE patches I've accumulated on my development branch for DOE+SPDM: https://github.com/l1k/linux/commits/doe Ira kindly tested them this week and reports that they don't regress CDAT retrieval for him. Thanks, Lukas
On Fri, Jan 13, 2023 at 04:12:06PM +0100, Lukas Wunner wrote: > On Fri, Jan 13, 2023 at 02:45:11PM +0000, Jonathan Cameron wrote: > > On Fri, 13 Jan 2023 14:40:26 +0000 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > > > On Fri, 13 Jan 2023 09:19:59 -0500 Gregory Price <gregory.price@memverge.com> wrote: > > > > On Fri, Jan 13, 2023 at 09:12:13AM +0000, Jonathan Cameron wrote: > > > > > Just to check, are these different from the on stack problem you reported > > > > > previously? Doesn't look like the fix for that has made it upstream yet. > > > > > > > > > > What kernel are you running? > > > > > > > > The prior issue I saw was related to the CXL Fixed Memory Window having > > > > an e820 region registered during machine initialization. That fix is > > > > upstream. > > > > > > > > On 2023-1-11 branch it is commit 2486dd045794d65598fbca9fd1224c27b9732dce > > > > > > > > This one appears when registering any kind of type-3 device, during > > > > boot. > > > > > > I meant this one > > > > > > https://lore.kernel.org/all/20221118000524.1477383-1-ira.weiny@intel.com/ > > > > > > Sorry, should have dug out a link in earlier reply and save a round trip. > > > > Ah I'd forgotten we were going round the houses somewhat on the right fix... > > https://lore.kernel.org/all/20221128040338.1936529-3-ira.weiny@intel.com/ > > was another proposal and Lukas had yet another. > > https://lore.kernel.org/all/cover.1669608950.git.lukas@wunner.de/ > > I'll respin those fixes shortly, together with the other DOE patches > I've accumulated on my development branch for DOE+SPDM: > > https://github.com/l1k/linux/commits/doe > > Ira kindly tested them this week and reports that they don't regress > CDAT retrieval for him. > > Thanks, > > Lukas I'll see if I can't get this branch installed and test if this fixes the issue i'm seeing. It may take me a bit to get back to. Thank you!
1) No stack traces present 2) Device usage appears to work, but cxl-cli fails to create a region, i haven't checked why yet (also tried ndctl-75, same results) 3) There seems to be some other regression with the cxl_pmem_init routine, because I get a stack trace in this setup regardless of whether I apply the type-3 device commit. All tests below with the previously posted DOE linux branch. Base QEMU branch was Jonathan's 2023-1-11 DOE Branch - 2023-1-11 (HEAD) (all commits) QEMU Config: sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ -m 3G,slots=4,maxmem=8G \ -smp 4 \ -machine type=q35,accel=kvm,cxl=on \ -enable-kvm \ -nographic \ -object memory-backend-ram,id=mem0,size=1G,share=on \ -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \ -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G Result: This worked, but cxl-cli could not create a region (will look into this further later). When running with a persistent memory configuration, I'm seeing a kernel stack trace on cxl_pmem_init Config: sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ -m 3G,slots=4,maxmem=4G \ -smp 4 \ -machine type=q35,accel=kvm,cxl=on \ -enable-kvm \ -nographic \ -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \ -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \ -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \ -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \ -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G [ 62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0 [ 62.185069] #PF: supervisor read access in kernel mode [ 62.198502] #PF: error_code(0x0000) - not-present page [ 62.211019] PGD 0 P4D 0 [ 62.220521] Oops: 0000 [#1] PREEMPT SMP PTI [ 62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1 [ 62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 [ 62.258432] Adding 2939900k swap on /dev/zram0. Priority:100 extents:1 across:2939900k SSDscFS [ 62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] [ 62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 [ 62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 [ 62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 [ 62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff [ 62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 [ 62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 [ 62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 [ 62.285541] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 [ 62.285542] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 [ 62.285653] Call Trace: [ 62.285656] <TASK> [ 62.285660] cxl_bus_probe+0x17/0x50 [ 62.285691] really_probe+0xde/0x380 [ 62.285695] ? pm_runtime_barrier+0x50/0x90 [ 62.285700] __driver_probe_device+0x78/0x170 [ 62.285846] driver_probe_device+0x1f/0x90 [ 62.285850] __driver_attach+0xd2/0x1c0 [ 62.285853] ? __pfx___driver_attach+0x10/0x10 [ 62.285856] bus_for_each_dev+0x76/0xa0 [ 62.285860] bus_add_driver+0x1b1/0x200 [ 62.285863] driver_register+0x89/0xe0 [ 62.285868] ? __pfx_init_module+0x10/0x10 [cxl_pmem] [ 62.285874] cxl_pmem_init+0x50/0xff0 [cxl_pmem] [ 62.285880] do_one_initcall+0x6e/0x330 [ 62.285888] do_init_module+0x4a/0x200 [ 62.285892] __do_sys_finit_module+0x93/0xf0 [ 62.285899] do_syscall_64+0x5b/0x80 [ 62.285904] ? do_syscall_64+0x67/0x80 [ 62.285906] ? asm_exc_page_fault+0x22/0x30 [ 62.285910] ? lockdep_hardirqs_on+0x7d/0x100 [ 62.285914] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 62.285917] RIP: 0033:0x7f2619b0afbd [ 62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8 [ 62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [ 62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd [ 62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006 [ 62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52 [ 62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000 [ 62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0 [ 62.285932] </TASK> [ 62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g [ 62.285954] CR2: 00000000000004c0 [ 62.288385] ---[ end trace 0000000000000000 ]--- [ 63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] [ 63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 [ 63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 [ 63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 [ 63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff [ 63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 [ 63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 [ 63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 [ 63.203580] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 [ 63.203583] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 Next i reverted the QEMU branch to the commit just before the type-3 volatile commit and used the old method of launching with a type-3 pmem device Config: sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ -m 2G,slots=4,maxmem=4G \ -smp 4 \ -machine type=q35,accel=kvm,cxl=on \ -enable-kvm \ -nographic \ -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ -object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \ -object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \ -device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \ -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G Result: Similar stack trace [ 29.850023] BUG: kernel NULL pointer dereference, address: 00000000000004c0 [ 29.882400] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] [ 29.957485] Call Trace: [ 29.959067] <TASK> [ 29.962176] cxl_bus_probe+0x17/0x50 [ 29.964940] really_probe+0xde/0x380 [ 29.969065] ? pm_runtime_barrier+0x50/0x90 [ 29.973419] __driver_probe_device+0x78/0x170 [ 29.977183] driver_probe_device+0x1f/0x90 [ 29.984212] __driver_attach+0xd2/0x1c0 [ 29.988463] ? __pfx___driver_attach+0x10/0x10 [ 29.992379] bus_for_each_dev+0x76/0xa0 [ 29.997040] bus_add_driver+0x1b1/0x200 [ 30.000368] driver_register+0x89/0xe0 [ 30.004579] ? __pfx_init_module+0x10/0x10 [cxl_pmem] [ 30.012403] cxl_pmem_init+0x50/0xff0 [cxl_pmem] [ 30.019394] do_one_initcall+0x6e/0x330 [ 30.024028] do_init_module+0x4a/0x200 [ 30.029243] __do_sys_finit_module+0x93/0xf0 [ 30.034943] do_syscall_64+0x5b/0x80 [ 30.039844] ? do_syscall_64+0x67/0x80 [ 30.045163] ? do_syscall_64+0x67/0x80 [ 30.049729] ? lock_release+0x14b/0x440 [ 30.054055] ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90 [ 30.061039] ? lock_is_held_type+0xe8/0x140 [ 30.067625] ? do_syscall_64+0x67/0x80 [ 30.071909] ? lockdep_hardirqs_on+0x7d/0x100 [ 30.079037] ? do_syscall_64+0x67/0x80 [ 30.084537] ? do_syscall_64+0x67/0x80 [ 30.089091] ? do_syscall_64+0x67/0x80 [ 30.094174] ? lockdep_hardirqs_on+0x7d/0x100 [ 30.099224] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 30.104446] RIP: 0033:0x7f000550afbd
I apparently forgot an intro lol I tested the DOE linux branch with the 2023-1-11 QEMU branch with both volatile, non-volatile, and "legacy" (pre-my-patch) non-volatile mode. 1) *In volatile mode, there are no stack traces present (during boot*) On Wed, Jan 18, 2023 at 02:22:08PM -0500, Gregory Price wrote: > > 1) No stack traces present > 2) Device usage appears to work, but cxl-cli fails to create a region, i > haven't checked why yet (also tried ndctl-75, same results) > 3) There seems to be some other regression with the cxl_pmem_init > routine, because I get a stack trace in this setup regardless of whether > I apply the type-3 device commit. > > > All tests below with the previously posted DOE linux branch. > Base QEMU branch was Jonathan's 2023-1-11 > > > DOE Branch - 2023-1-11 (HEAD) (all commits) > > QEMU Config: > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > -m 3G,slots=4,maxmem=8G \ > -smp 4 \ > -machine type=q35,accel=kvm,cxl=on \ > -enable-kvm \ > -nographic \ > -object memory-backend-ram,id=mem0,size=1G,share=on \ > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \ > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > Result: This worked, but cxl-cli could not create a region (will look > into this further later). > > > > > When running with a persistent memory configuration, I'm seeing a > kernel stack trace on cxl_pmem_init > > Config: > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > -m 3G,slots=4,maxmem=4G \ > -smp 4 \ > -machine type=q35,accel=kvm,cxl=on \ > -enable-kvm \ > -nographic \ > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \ > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \ > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \ > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > [ 62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0 > [ 62.185069] #PF: supervisor read access in kernel mode > [ 62.198502] #PF: error_code(0x0000) - not-present page > [ 62.211019] PGD 0 P4D 0 > [ 62.220521] Oops: 0000 [#1] PREEMPT SMP PTI > [ 62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1 > [ 62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 > [ 62.258432] Adding 2939900k swap on /dev/zram0. Priority:100 extents:1 across:2939900k SSDscFS > [ 62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > [ 62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > [ 62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > [ 62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > [ 62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > [ 62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > [ 62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > [ 62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > [ 62.285541] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > [ 62.285542] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > [ 62.285653] Call Trace: > [ 62.285656] <TASK> > [ 62.285660] cxl_bus_probe+0x17/0x50 > [ 62.285691] really_probe+0xde/0x380 > [ 62.285695] ? pm_runtime_barrier+0x50/0x90 > [ 62.285700] __driver_probe_device+0x78/0x170 > [ 62.285846] driver_probe_device+0x1f/0x90 > [ 62.285850] __driver_attach+0xd2/0x1c0 > [ 62.285853] ? __pfx___driver_attach+0x10/0x10 > [ 62.285856] bus_for_each_dev+0x76/0xa0 > [ 62.285860] bus_add_driver+0x1b1/0x200 > [ 62.285863] driver_register+0x89/0xe0 > [ 62.285868] ? __pfx_init_module+0x10/0x10 [cxl_pmem] > [ 62.285874] cxl_pmem_init+0x50/0xff0 [cxl_pmem] > [ 62.285880] do_one_initcall+0x6e/0x330 > [ 62.285888] do_init_module+0x4a/0x200 > [ 62.285892] __do_sys_finit_module+0x93/0xf0 > [ 62.285899] do_syscall_64+0x5b/0x80 > [ 62.285904] ? do_syscall_64+0x67/0x80 > [ 62.285906] ? asm_exc_page_fault+0x22/0x30 > [ 62.285910] ? lockdep_hardirqs_on+0x7d/0x100 > [ 62.285914] entry_SYSCALL_64_after_hwframe+0x72/0xdc > [ 62.285917] RIP: 0033:0x7f2619b0afbd > [ 62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8 > [ 62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > [ 62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd > [ 62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006 > [ 62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52 > [ 62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000 > [ 62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0 > [ 62.285932] </TASK> > [ 62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g > [ 62.285954] CR2: 00000000000004c0 > [ 62.288385] ---[ end trace 0000000000000000 ]--- > [ 63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > [ 63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > [ 63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > [ 63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > [ 63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > [ 63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > [ 63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > [ 63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > [ 63.203580] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > [ 63.203583] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > > > > Next i reverted the QEMU branch to the commit just before the type-3 > volatile commit and used the old method of launching with a type-3 pmem > device > > Config: > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > -m 2G,slots=4,maxmem=4G \ > -smp 4 \ > -machine type=q35,accel=kvm,cxl=on \ > -enable-kvm \ > -nographic \ > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > -object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \ > -object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \ > -device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \ > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > Result: Similar stack trace > [ 29.850023] BUG: kernel NULL pointer dereference, address: 00000000000004c0 > [ 29.882400] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > [ 29.957485] Call Trace: > [ 29.959067] <TASK> > [ 29.962176] cxl_bus_probe+0x17/0x50 > [ 29.964940] really_probe+0xde/0x380 > [ 29.969065] ? pm_runtime_barrier+0x50/0x90 > [ 29.973419] __driver_probe_device+0x78/0x170 > [ 29.977183] driver_probe_device+0x1f/0x90 > [ 29.984212] __driver_attach+0xd2/0x1c0 > [ 29.988463] ? __pfx___driver_attach+0x10/0x10 > [ 29.992379] bus_for_each_dev+0x76/0xa0 > [ 29.997040] bus_add_driver+0x1b1/0x200 > [ 30.000368] driver_register+0x89/0xe0 > [ 30.004579] ? __pfx_init_module+0x10/0x10 [cxl_pmem] > [ 30.012403] cxl_pmem_init+0x50/0xff0 [cxl_pmem] > [ 30.019394] do_one_initcall+0x6e/0x330 > [ 30.024028] do_init_module+0x4a/0x200 > [ 30.029243] __do_sys_finit_module+0x93/0xf0 > [ 30.034943] do_syscall_64+0x5b/0x80 > [ 30.039844] ? do_syscall_64+0x67/0x80 > [ 30.045163] ? do_syscall_64+0x67/0x80 > [ 30.049729] ? lock_release+0x14b/0x440 > [ 30.054055] ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90 > [ 30.061039] ? lock_is_held_type+0xe8/0x140 > [ 30.067625] ? do_syscall_64+0x67/0x80 > [ 30.071909] ? lockdep_hardirqs_on+0x7d/0x100 > [ 30.079037] ? do_syscall_64+0x67/0x80 > [ 30.084537] ? do_syscall_64+0x67/0x80 > [ 30.089091] ? do_syscall_64+0x67/0x80 > [ 30.094174] ? lockdep_hardirqs_on+0x7d/0x100 > [ 30.099224] entry_SYSCALL_64_after_hwframe+0x72/0xdc > [ 30.104446] RIP: 0033:0x7f000550afbd
On Wed, 18 Jan 2023 14:22:08 -0500 Gregory Price <gregory.price@memverge.com> wrote: > 1) No stack traces present > 2) Device usage appears to work, but cxl-cli fails to create a region, i > haven't checked why yet (also tried ndctl-75, same results) > 3) There seems to be some other regression with the cxl_pmem_init > routine, because I get a stack trace in this setup regardless of whether > I apply the type-3 device commit. > > > All tests below with the previously posted DOE linux branch. > Base QEMU branch was Jonathan's 2023-1-11 > > > DOE Branch - 2023-1-11 (HEAD) (all commits) > > QEMU Config: > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > -m 3G,slots=4,maxmem=8G \ > -smp 4 \ > -machine type=q35,accel=kvm,cxl=on \ > -enable-kvm \ > -nographic \ > -object memory-backend-ram,id=mem0,size=1G,share=on \ > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \ > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > Result: This worked, but cxl-cli could not create a region (will look > into this further later). > Even if everything else worked, it will currently fail because of the issue with pass through decoders. (Kernel assumes always pass through for single rp, qemu assumes never pass through - both are valid under spec). Add a second rp for now. I should be able to post some patches to work around that shortly. > > > > When running with a persistent memory configuration, I'm seeing a > kernel stack trace on cxl_pmem_init > > Config: > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > -m 3G,slots=4,maxmem=4G \ > -smp 4 \ > -machine type=q35,accel=kvm,cxl=on \ > -enable-kvm \ > -nographic \ > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \ > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \ > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \ > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > [ 62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0 > [ 62.185069] #PF: supervisor read access in kernel mode > [ 62.198502] #PF: error_code(0x0000) - not-present page > [ 62.211019] PGD 0 P4D 0 > [ 62.220521] Oops: 0000 [#1] PREEMPT SMP PTI > [ 62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1 > [ 62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 > [ 62.258432] Adding 2939900k swap on /dev/zram0. Priority:100 extents:1 across:2939900k SSDscFS > [ 62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > [ 62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > [ 62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > [ 62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > [ 62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > [ 62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > [ 62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > [ 62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > [ 62.285541] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > [ 62.285542] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > [ 62.285653] Call Trace: > [ 62.285656] <TASK> > [ 62.285660] cxl_bus_probe+0x17/0x50 > [ 62.285691] really_probe+0xde/0x380 > [ 62.285695] ? pm_runtime_barrier+0x50/0x90 > [ 62.285700] __driver_probe_device+0x78/0x170 > [ 62.285846] driver_probe_device+0x1f/0x90 > [ 62.285850] __driver_attach+0xd2/0x1c0 > [ 62.285853] ? __pfx___driver_attach+0x10/0x10 > [ 62.285856] bus_for_each_dev+0x76/0xa0 > [ 62.285860] bus_add_driver+0x1b1/0x200 > [ 62.285863] driver_register+0x89/0xe0 > [ 62.285868] ? __pfx_init_module+0x10/0x10 [cxl_pmem] > [ 62.285874] cxl_pmem_init+0x50/0xff0 [cxl_pmem] > [ 62.285880] do_one_initcall+0x6e/0x330 > [ 62.285888] do_init_module+0x4a/0x200 > [ 62.285892] __do_sys_finit_module+0x93/0xf0 > [ 62.285899] do_syscall_64+0x5b/0x80 > [ 62.285904] ? do_syscall_64+0x67/0x80 > [ 62.285906] ? asm_exc_page_fault+0x22/0x30 > [ 62.285910] ? lockdep_hardirqs_on+0x7d/0x100 > [ 62.285914] entry_SYSCALL_64_after_hwframe+0x72/0xdc > [ 62.285917] RIP: 0033:0x7f2619b0afbd > [ 62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8 > [ 62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > [ 62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd > [ 62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006 > [ 62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52 > [ 62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000 > [ 62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0 > [ 62.285932] </TASK> > [ 62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g > [ 62.285954] CR2: 00000000000004c0 > [ 62.288385] ---[ end trace 0000000000000000 ]--- > [ 63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > [ 63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > [ 63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > [ 63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > [ 63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > [ 63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > [ 63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > [ 63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > [ 63.203580] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > [ 63.203583] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > I clearly need to run some more rests as not seen this issue (and I've had a couple of reports of it now). I guess I never tend to be running a completely clean tree on either side + testing is mostly on arm64 (though I doubt that matters). It's not the pass through decoder issues as that is visible only when bringing up a region. > > > Next i reverted the QEMU branch to the commit just before the type-3 > volatile commit and used the old method of launching with a type-3 pmem > device > > Config: > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > -m 2G,slots=4,maxmem=4G \ > -smp 4 \ > -machine type=q35,accel=kvm,cxl=on \ > -enable-kvm \ > -nographic \ > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > -object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \ > -object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \ > -device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \ > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > Result: Similar stack trace > [ 29.850023] BUG: kernel NULL pointer dereference, address: 00000000000004c0 > [ 29.882400] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > [ 29.957485] Call Trace: > [ 29.959067] <TASK> > [ 29.962176] cxl_bus_probe+0x17/0x50 > [ 29.964940] really_probe+0xde/0x380 > [ 29.969065] ? pm_runtime_barrier+0x50/0x90 > [ 29.973419] __driver_probe_device+0x78/0x170 > [ 29.977183] driver_probe_device+0x1f/0x90 > [ 29.984212] __driver_attach+0xd2/0x1c0 > [ 29.988463] ? __pfx___driver_attach+0x10/0x10 > [ 29.992379] bus_for_each_dev+0x76/0xa0 > [ 29.997040] bus_add_driver+0x1b1/0x200 > [ 30.000368] driver_register+0x89/0xe0 > [ 30.004579] ? __pfx_init_module+0x10/0x10 [cxl_pmem] > [ 30.012403] cxl_pmem_init+0x50/0xff0 [cxl_pmem] > [ 30.019394] do_one_initcall+0x6e/0x330 > [ 30.024028] do_init_module+0x4a/0x200 > [ 30.029243] __do_sys_finit_module+0x93/0xf0 > [ 30.034943] do_syscall_64+0x5b/0x80 > [ 30.039844] ? do_syscall_64+0x67/0x80 > [ 30.045163] ? do_syscall_64+0x67/0x80 > [ 30.049729] ? lock_release+0x14b/0x440 > [ 30.054055] ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90 > [ 30.061039] ? lock_is_held_type+0xe8/0x140 > [ 30.067625] ? do_syscall_64+0x67/0x80 > [ 30.071909] ? lockdep_hardirqs_on+0x7d/0x100 > [ 30.079037] ? do_syscall_64+0x67/0x80 > [ 30.084537] ? do_syscall_64+0x67/0x80 > [ 30.089091] ? do_syscall_64+0x67/0x80 > [ 30.094174] ? lockdep_hardirqs_on+0x7d/0x100 > [ 30.099224] entry_SYSCALL_64_after_hwframe+0x72/0xdc > [ 30.104446] RIP: 0033:0x7f000550afbd
On Thu, Jan 19, 2023 at 10:19:46AM +0000, Jonathan Cameron wrote: > On Wed, 18 Jan 2023 14:22:08 -0500 > Gregory Price <gregory.price@memverge.com> wrote: > > > 1) No stack traces present > > 2) Device usage appears to work, but cxl-cli fails to create a region, i > > haven't checked why yet (also tried ndctl-75, same results) > > 3) There seems to be some other regression with the cxl_pmem_init > > routine, because I get a stack trace in this setup regardless of whether > > I apply the type-3 device commit. > > > > > > All tests below with the previously posted DOE linux branch. > > Base QEMU branch was Jonathan's 2023-1-11 > > > > > > DOE Branch - 2023-1-11 (HEAD) (all commits) > > > > QEMU Config: > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > -m 3G,slots=4,maxmem=8G \ > > -smp 4 \ > > -machine type=q35,accel=kvm,cxl=on \ > > -enable-kvm \ > > -nographic \ > > -object memory-backend-ram,id=mem0,size=1G,share=on \ > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \ > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > Result: This worked, but cxl-cli could not create a region (will look > > into this further later). > > > > Even if everything else worked, it will currently fail because of the > issue with pass through decoders. > (Kernel assumes always pass through for single rp, qemu assumes never > pass through - both are valid under spec). > Add a second rp for now. I should be able to post some patches > to work around that shortly. > > > > > > > > > > > > > When running with a persistent memory configuration, I'm seeing a > > kernel stack trace on cxl_pmem_init > > > > Config: > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > -m 3G,slots=4,maxmem=4G \ > > -smp 4 \ > > -machine type=q35,accel=kvm,cxl=on \ > > -enable-kvm \ > > -nographic \ > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \ > > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \ > > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \ > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > [ 62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0 > > [ 62.185069] #PF: supervisor read access in kernel mode > > [ 62.198502] #PF: error_code(0x0000) - not-present page > > [ 62.211019] PGD 0 P4D 0 > > [ 62.220521] Oops: 0000 [#1] PREEMPT SMP PTI > > [ 62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1 > > [ 62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 > > [ 62.258432] Adding 2939900k swap on /dev/zram0. Priority:100 extents:1 across:2939900k SSDscFS > > [ 62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > [ 62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > > [ 62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > > [ 62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > > [ 62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > > [ 62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > > [ 62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > > [ 62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > > [ 62.285541] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > > [ 62.285542] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > > [ 62.285653] Call Trace: > > [ 62.285656] <TASK> > > [ 62.285660] cxl_bus_probe+0x17/0x50 > > [ 62.285691] really_probe+0xde/0x380 > > [ 62.285695] ? pm_runtime_barrier+0x50/0x90 > > [ 62.285700] __driver_probe_device+0x78/0x170 > > [ 62.285846] driver_probe_device+0x1f/0x90 > > [ 62.285850] __driver_attach+0xd2/0x1c0 > > [ 62.285853] ? __pfx___driver_attach+0x10/0x10 > > [ 62.285856] bus_for_each_dev+0x76/0xa0 > > [ 62.285860] bus_add_driver+0x1b1/0x200 > > [ 62.285863] driver_register+0x89/0xe0 > > [ 62.285868] ? __pfx_init_module+0x10/0x10 [cxl_pmem] > > [ 62.285874] cxl_pmem_init+0x50/0xff0 [cxl_pmem] > > [ 62.285880] do_one_initcall+0x6e/0x330 > > [ 62.285888] do_init_module+0x4a/0x200 > > [ 62.285892] __do_sys_finit_module+0x93/0xf0 > > [ 62.285899] do_syscall_64+0x5b/0x80 > > [ 62.285904] ? do_syscall_64+0x67/0x80 > > [ 62.285906] ? asm_exc_page_fault+0x22/0x30 > > [ 62.285910] ? lockdep_hardirqs_on+0x7d/0x100 > > [ 62.285914] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > [ 62.285917] RIP: 0033:0x7f2619b0afbd > > [ 62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8 > > [ 62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > > [ 62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd > > [ 62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006 > > [ 62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52 > > [ 62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000 > > [ 62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0 > > [ 62.285932] </TASK> > > [ 62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g > > [ 62.285954] CR2: 00000000000004c0 > > [ 62.288385] ---[ end trace 0000000000000000 ]--- > > [ 63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > [ 63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > > [ 63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > > [ 63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > > [ 63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > > [ 63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > > [ 63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > > [ 63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > > [ 63.203580] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > > [ 63.203583] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > > > > I clearly need to run some more rests as not seen this issue (and I've > had a couple of reports of it now). I guess I never tend to be running > a completely clean tree on either side + testing is mostly on arm64 > (though I doubt that matters). It's not the pass through decoder issues as > that is visible only when bringing up a region. OK so I should drop this from my queue for now? > > > > > > Next i reverted the QEMU branch to the commit just before the type-3 > > volatile commit and used the old method of launching with a type-3 pmem > > device > > > > Config: > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > -m 2G,slots=4,maxmem=4G \ > > -smp 4 \ > > -machine type=q35,accel=kvm,cxl=on \ > > -enable-kvm \ > > -nographic \ > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > -object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \ > > -object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \ > > -device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \ > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > Result: Similar stack trace > > [ 29.850023] BUG: kernel NULL pointer dereference, address: 00000000000004c0 > > [ 29.882400] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > [ 29.957485] Call Trace: > > [ 29.959067] <TASK> > > [ 29.962176] cxl_bus_probe+0x17/0x50 > > [ 29.964940] really_probe+0xde/0x380 > > [ 29.969065] ? pm_runtime_barrier+0x50/0x90 > > [ 29.973419] __driver_probe_device+0x78/0x170 > > [ 29.977183] driver_probe_device+0x1f/0x90 > > [ 29.984212] __driver_attach+0xd2/0x1c0 > > [ 29.988463] ? __pfx___driver_attach+0x10/0x10 > > [ 29.992379] bus_for_each_dev+0x76/0xa0 > > [ 29.997040] bus_add_driver+0x1b1/0x200 > > [ 30.000368] driver_register+0x89/0xe0 > > [ 30.004579] ? __pfx_init_module+0x10/0x10 [cxl_pmem] > > [ 30.012403] cxl_pmem_init+0x50/0xff0 [cxl_pmem] > > [ 30.019394] do_one_initcall+0x6e/0x330 > > [ 30.024028] do_init_module+0x4a/0x200 > > [ 30.029243] __do_sys_finit_module+0x93/0xf0 > > [ 30.034943] do_syscall_64+0x5b/0x80 > > [ 30.039844] ? do_syscall_64+0x67/0x80 > > [ 30.045163] ? do_syscall_64+0x67/0x80 > > [ 30.049729] ? lock_release+0x14b/0x440 > > [ 30.054055] ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90 > > [ 30.061039] ? lock_is_held_type+0xe8/0x140 > > [ 30.067625] ? do_syscall_64+0x67/0x80 > > [ 30.071909] ? lockdep_hardirqs_on+0x7d/0x100 > > [ 30.079037] ? do_syscall_64+0x67/0x80 > > [ 30.084537] ? do_syscall_64+0x67/0x80 > > [ 30.089091] ? do_syscall_64+0x67/0x80 > > [ 30.094174] ? lockdep_hardirqs_on+0x7d/0x100 > > [ 30.099224] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > [ 30.104446] RIP: 0033:0x7f000550afbd
On Thu, 19 Jan 2023 06:48:11 -0500 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Thu, Jan 19, 2023 at 10:19:46AM +0000, Jonathan Cameron wrote: > > On Wed, 18 Jan 2023 14:22:08 -0500 > > Gregory Price <gregory.price@memverge.com> wrote: > > > > > 1) No stack traces present > > > 2) Device usage appears to work, but cxl-cli fails to create a region, i > > > haven't checked why yet (also tried ndctl-75, same results) > > > 3) There seems to be some other regression with the cxl_pmem_init > > > routine, because I get a stack trace in this setup regardless of whether > > > I apply the type-3 device commit. > > > > > > > > > All tests below with the previously posted DOE linux branch. > > > Base QEMU branch was Jonathan's 2023-1-11 > > > > > > > > > DOE Branch - 2023-1-11 (HEAD) (all commits) > > > > > > QEMU Config: > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > > -m 3G,slots=4,maxmem=8G \ > > > -smp 4 \ > > > -machine type=q35,accel=kvm,cxl=on \ > > > -enable-kvm \ > > > -nographic \ > > > -object memory-backend-ram,id=mem0,size=1G,share=on \ > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \ > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > Result: This worked, but cxl-cli could not create a region (will look > > > into this further later). > > > > > > > Even if everything else worked, it will currently fail because of the > > issue with pass through decoders. > > (Kernel assumes always pass through for single rp, qemu assumes never > > pass through - both are valid under spec). > > Add a second rp for now. I should be able to post some patches > > to work around that shortly. > > > > > > > > > > > > > > > > > > > > > > When running with a persistent memory configuration, I'm seeing a > > > kernel stack trace on cxl_pmem_init > > > > > > Config: > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > > -m 3G,slots=4,maxmem=4G \ > > > -smp 4 \ > > > -machine type=q35,accel=kvm,cxl=on \ > > > -enable-kvm \ > > > -nographic \ > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \ > > > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \ > > > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \ > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > > > > [ 62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0 > > > [ 62.185069] #PF: supervisor read access in kernel mode > > > [ 62.198502] #PF: error_code(0x0000) - not-present page > > > [ 62.211019] PGD 0 P4D 0 > > > [ 62.220521] Oops: 0000 [#1] PREEMPT SMP PTI > > > [ 62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1 > > > [ 62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 > > > [ 62.258432] Adding 2939900k swap on /dev/zram0. Priority:100 extents:1 across:2939900k SSDscFS > > > [ 62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > > [ 62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > > > [ 62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > > > [ 62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > > > [ 62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > > > [ 62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > > > [ 62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > > > [ 62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > > > [ 62.285541] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > > > [ 62.285542] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > [ 62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > > > [ 62.285653] Call Trace: > > > [ 62.285656] <TASK> > > > [ 62.285660] cxl_bus_probe+0x17/0x50 > > > [ 62.285691] really_probe+0xde/0x380 > > > [ 62.285695] ? pm_runtime_barrier+0x50/0x90 > > > [ 62.285700] __driver_probe_device+0x78/0x170 > > > [ 62.285846] driver_probe_device+0x1f/0x90 > > > [ 62.285850] __driver_attach+0xd2/0x1c0 > > > [ 62.285853] ? __pfx___driver_attach+0x10/0x10 > > > [ 62.285856] bus_for_each_dev+0x76/0xa0 > > > [ 62.285860] bus_add_driver+0x1b1/0x200 > > > [ 62.285863] driver_register+0x89/0xe0 > > > [ 62.285868] ? __pfx_init_module+0x10/0x10 [cxl_pmem] > > > [ 62.285874] cxl_pmem_init+0x50/0xff0 [cxl_pmem] > > > [ 62.285880] do_one_initcall+0x6e/0x330 > > > [ 62.285888] do_init_module+0x4a/0x200 > > > [ 62.285892] __do_sys_finit_module+0x93/0xf0 > > > [ 62.285899] do_syscall_64+0x5b/0x80 > > > [ 62.285904] ? do_syscall_64+0x67/0x80 > > > [ 62.285906] ? asm_exc_page_fault+0x22/0x30 > > > [ 62.285910] ? lockdep_hardirqs_on+0x7d/0x100 > > > [ 62.285914] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > > [ 62.285917] RIP: 0033:0x7f2619b0afbd > > > [ 62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8 > > > [ 62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > > > [ 62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd > > > [ 62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006 > > > [ 62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52 > > > [ 62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000 > > > [ 62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0 > > > [ 62.285932] </TASK> > > > [ 62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g > > > [ 62.285954] CR2: 00000000000004c0 > > > [ 62.288385] ---[ end trace 0000000000000000 ]--- > > > [ 63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > > [ 63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > > > [ 63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > > > [ 63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > > > [ 63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > > > [ 63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > > > [ 63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > > > [ 63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > > > [ 63.203580] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > > > [ 63.203583] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > [ 63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > > > > > > > I clearly need to run some more rests as not seen this issue (and I've > > had a couple of reports of it now). I guess I never tend to be running > > a completely clean tree on either side + testing is mostly on arm64 > > (though I doubt that matters). It's not the pass through decoder issues as > > that is visible only when bringing up a region. > > OK so I should drop this from my queue for now? Whilst discussion is in this thread, it is not related to this patch set (at least I've seen no indication that it is!) The revert referred to is down to a point with a whole load of other stuff on top of upstream QEMU. However I think that whatever this is effects upstream as well - though it seems more likely to be a kernel bug or possibly something we simply aren't emulating yet. The trace is probably pointing to a dereference of (struct device)->driver as there is very little else in that function, which is very suspicious. So keeping this set queued up would be great. Jonathan
On Wed, 18 Jan 2023 14:31:53 -0500 Gregory Price <gregory.price@memverge.com> wrote: > I apparently forgot an intro lol > > I tested the DOE linux branch with the 2023-1-11 QEMU branch with both > volatile, non-volatile, and "legacy" (pre-my-patch) non-volatile mode. > > 1) *In volatile mode, there are no stack traces present (during boot*) > > On Wed, Jan 18, 2023 at 02:22:08PM -0500, Gregory Price wrote: > > > > 1) No stack traces present > > 2) Device usage appears to work, but cxl-cli fails to create a region, i > > haven't checked why yet (also tried ndctl-75, same results) > > 3) There seems to be some other regression with the cxl_pmem_init > > routine, because I get a stack trace in this setup regardless of whether > > I apply the type-3 device commit. > > > > > > All tests below with the previously posted DOE linux branch. > > Base QEMU branch was Jonathan's 2023-1-11 > > > > > > DOE Branch - 2023-1-11 (HEAD) (all commits) > > > > QEMU Config: > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > -m 3G,slots=4,maxmem=8G \ > > -smp 4 \ > > -machine type=q35,accel=kvm,cxl=on \ > > -enable-kvm \ > > -nographic \ > > -object memory-backend-ram,id=mem0,size=1G,share=on \ > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \ > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > Result: This worked, but cxl-cli could not create a region (will look > > into this further later). > > > > > > > > > > When running with a persistent memory configuration, I'm seeing a > > kernel stack trace on cxl_pmem_init > > > > Config: > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > -m 3G,slots=4,maxmem=4G \ > > -smp 4 \ > > -machine type=q35,accel=kvm,cxl=on \ > > -enable-kvm \ > > -nographic \ > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \ > > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \ > > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \ > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > [ 62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0 > > [ 62.185069] #PF: supervisor read access in kernel mode > > [ 62.198502] #PF: error_code(0x0000) - not-present page > > [ 62.211019] PGD 0 P4D 0 > > [ 62.220521] Oops: 0000 [#1] PREEMPT SMP PTI > > [ 62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1 > > [ 62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 > > [ 62.258432] Adding 2939900k swap on /dev/zram0. Priority:100 extents:1 across:2939900k SSDscFS > > [ 62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > [ 62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > > [ 62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > > [ 62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > > [ 62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > > [ 62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > > [ 62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > > [ 62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > > [ 62.285541] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > > [ 62.285542] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > > [ 62.285653] Call Trace: > > [ 62.285656] <TASK> > > [ 62.285660] cxl_bus_probe+0x17/0x50 > > [ 62.285691] really_probe+0xde/0x380 > > [ 62.285695] ? pm_runtime_barrier+0x50/0x90 > > [ 62.285700] __driver_probe_device+0x78/0x170 > > [ 62.285846] driver_probe_device+0x1f/0x90 > > [ 62.285850] __driver_attach+0xd2/0x1c0 > > [ 62.285853] ? __pfx___driver_attach+0x10/0x10 > > [ 62.285856] bus_for_each_dev+0x76/0xa0 > > [ 62.285860] bus_add_driver+0x1b1/0x200 > > [ 62.285863] driver_register+0x89/0xe0 > > [ 62.285868] ? __pfx_init_module+0x10/0x10 [cxl_pmem] > > [ 62.285874] cxl_pmem_init+0x50/0xff0 [cxl_pmem] > > [ 62.285880] do_one_initcall+0x6e/0x330 > > [ 62.285888] do_init_module+0x4a/0x200 > > [ 62.285892] __do_sys_finit_module+0x93/0xf0 > > [ 62.285899] do_syscall_64+0x5b/0x80 > > [ 62.285904] ? do_syscall_64+0x67/0x80 > > [ 62.285906] ? asm_exc_page_fault+0x22/0x30 > > [ 62.285910] ? lockdep_hardirqs_on+0x7d/0x100 > > [ 62.285914] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > [ 62.285917] RIP: 0033:0x7f2619b0afbd > > [ 62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8 > > [ 62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > > [ 62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd > > [ 62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006 > > [ 62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52 > > [ 62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000 > > [ 62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0 > > [ 62.285932] </TASK> > > [ 62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g > > [ 62.285954] CR2: 00000000000004c0 > > [ 62.288385] ---[ end trace 0000000000000000 ]--- > > [ 63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > [ 63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > > [ 63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > > [ 63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > > [ 63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > > [ 63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > > [ 63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > > [ 63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > > [ 63.203580] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > > [ 63.203583] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 Possibly replicated. What I did was stop cxl_pmem.ko being probed automatically and added it manually later. Trace that results is certainly similar to yours. Now the MODULE_SOFTDEP() in drivers/cxl/acpi.c should stop that happening assuming you are letting autoloading run. I wonder if there is a path in which it doesn't? Dan, any thoughts? There is another race that I can trigger by repeatedly injecting errors and causing resets, but the trace for that is very different and points at cxl_pmem_ctl() called via nvdimm_probe(). I was going to try and pin that one down a little more before posting a report but might as well muddy the waters :) Unable to handle kernel NULL pointer dereference at virtual address 0000000000000358 Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=0000000102e12000 [0000000000000358] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP Modules linked in: cxl_mem cxl_port cxl_acpi cxl_pmem cxl_pci cxl_core CPU: 0 PID: 236 Comm: kworker/u8:3 Not tainted 6.2.0-rc3+ #598 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 Workqueue: events_unbound async_run_entry_fn pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : cxl_pmem_ctl+0x74/0x244 [cxl_pmem] lr : cxl_pmem_ctl+0x60/0x244 [cxl_pmem] sp : ffff8000089239d0 x29: ffff8000089239d0 x28: 0000000000000000 x27: 0000000000000000 x26: ffffcd4f6b263000 x25: ffffcd4f6a25d9c8 x24: 0000000000000000 kernel: Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP x23: 0000000000000000 x22: 000000000000000c x21: ffff0000c5b95400 x20: ffff0000c0d9ce0c x19: 0000000000000004 x18: 0000000000000000 x17: 0000000000000000 x16: ffffcd4f698a5fa0 x15: 0000000000000000 x14: 0000000000000002 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000001 x10: 409e5dd45a38ef72 x9 : ffffcd4f607531f0 x8 : ffff0000c0d9ce80 x7 : 0000000000000000 x6 : 0000000000000000 x5 : ffff800008923a84 x4 : 000000000000000c x3 : ffff800008923a10 x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000070 Call trace: cxl_pmem_ctl+0x74/0x244 [cxl_pmem] nvdimm_init_nsarea+0xb8/0xdc nvdimm_probe+0xc0/0x1d0 nvdimm_bus_probe+0x90/0x200 really_probe+0xc8/0x3e0 __driver_probe_device+0x84/0x190 driver_probe_device+0x44/0x120 __device_attach_driver+0xc4/0x160 bus_for_each_drv+0x80/0xe0 __device_attach+0xa4/0x1cc device_initial_probe+0x1c/0x2c bus_probe_device+0xa4/0xb0 device_add+0x404/0x920 nd_async_device_register+0x20/0x70 async_run_entry_fn+0x3c/0x154 process_one_work+0x200/0x474 worker_thread+0x74/0x43c kthread+0x110/0x114 ret_from_fork+0x10/0x20 Code: 53067e61 f90023e0 f9417aa2 f8617860 (f941ac57) ---[ end trace 0000000000000000 ]--- Note this seems to have gotten harder to hit for some reason - took about 50 resets. I'll keep digging Jonathan > > > > > > > > Next i reverted the QEMU branch to the commit just before the type-3 > > volatile commit and used the old method of launching with a type-3 pmem > > device > > > > Config: > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > -m 2G,slots=4,maxmem=4G \ > > -smp 4 \ > > -machine type=q35,accel=kvm,cxl=on \ > > -enable-kvm \ > > -nographic \ > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > -object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \ > > -object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \ > > -device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \ > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > Result: Similar stack trace > > [ 29.850023] BUG: kernel NULL pointer dereference, address: 00000000000004c0 > > [ 29.882400] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > [ 29.957485] Call Trace: > > [ 29.959067] <TASK> > > [ 29.962176] cxl_bus_probe+0x17/0x50 > > [ 29.964940] really_probe+0xde/0x380 > > [ 29.969065] ? pm_runtime_barrier+0x50/0x90 > > [ 29.973419] __driver_probe_device+0x78/0x170 > > [ 29.977183] driver_probe_device+0x1f/0x90 > > [ 29.984212] __driver_attach+0xd2/0x1c0 > > [ 29.988463] ? __pfx___driver_attach+0x10/0x10 > > [ 29.992379] bus_for_each_dev+0x76/0xa0 > > [ 29.997040] bus_add_driver+0x1b1/0x200 > > [ 30.000368] driver_register+0x89/0xe0 > > [ 30.004579] ? __pfx_init_module+0x10/0x10 [cxl_pmem] > > [ 30.012403] cxl_pmem_init+0x50/0xff0 [cxl_pmem] > > [ 30.019394] do_one_initcall+0x6e/0x330 > > [ 30.024028] do_init_module+0x4a/0x200 > > [ 30.029243] __do_sys_finit_module+0x93/0xf0 > > [ 30.034943] do_syscall_64+0x5b/0x80 > > [ 30.039844] ? do_syscall_64+0x67/0x80 > > [ 30.045163] ? do_syscall_64+0x67/0x80 > > [ 30.049729] ? lock_release+0x14b/0x440 > > [ 30.054055] ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90 > > [ 30.061039] ? lock_is_held_type+0xe8/0x140 > > [ 30.067625] ? do_syscall_64+0x67/0x80 > > [ 30.071909] ? lockdep_hardirqs_on+0x7d/0x100 > > [ 30.079037] ? do_syscall_64+0x67/0x80 > > [ 30.084537] ? do_syscall_64+0x67/0x80 > > [ 30.089091] ? do_syscall_64+0x67/0x80 > > [ 30.094174] ? lockdep_hardirqs_on+0x7d/0x100 > > [ 30.099224] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > [ 30.104446] RIP: 0033:0x7f000550afbd
On Thu, Jan 19, 2023 at 10:19:46AM +0000, Jonathan Cameron wrote: > Even if everything else worked, it will currently fail because of the > issue with pass through decoders. > (Kernel assumes always pass through for single rp, qemu assumes never > pass through - both are valid under spec). > Add a second rp for now. I should be able to post some patches > to work around that shortly. > *facepalm* i knew this and yet i still kept old configs around. I'll do some more testing today with 2 root ports
On Thu, Jan 19, 2023 at 06:48:11AM -0500, Michael S. Tsirkin wrote: > > > > I clearly need to run some more rests as not seen this issue (and I've > > had a couple of reports of it now). I guess I never tend to be running > > a completely clean tree on either side + testing is mostly on arm64 > > (though I doubt that matters). It's not the pass through decoder issues as > > that is visible only when bringing up a region. > > OK so I should drop this from my queue for now? > Even with this trace, I'm fairly confident that the DOE line + this have fixed other issues, i'm seeing less general instability on the x86 than i did before, and when loading in non-volatile mode i don't see any stability issues.
On Thu, 19 Jan 2023 12:42:44 +0000 Jonathan Cameron via <qemu-devel@nongnu.org> wrote: > On Wed, 18 Jan 2023 14:31:53 -0500 > Gregory Price <gregory.price@memverge.com> wrote: > > > I apparently forgot an intro lol > > > > I tested the DOE linux branch with the 2023-1-11 QEMU branch with both > > volatile, non-volatile, and "legacy" (pre-my-patch) non-volatile mode. > > > > 1) *In volatile mode, there are no stack traces present (during boot*) > > > > On Wed, Jan 18, 2023 at 02:22:08PM -0500, Gregory Price wrote: > > > > > > 1) No stack traces present > > > 2) Device usage appears to work, but cxl-cli fails to create a region, i > > > haven't checked why yet (also tried ndctl-75, same results) > > > 3) There seems to be some other regression with the cxl_pmem_init > > > routine, because I get a stack trace in this setup regardless of whether > > > I apply the type-3 device commit. > > > > > > > > > All tests below with the previously posted DOE linux branch. > > > Base QEMU branch was Jonathan's 2023-1-11 > > > > > > > > > DOE Branch - 2023-1-11 (HEAD) (all commits) > > > > > > QEMU Config: > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > > -m 3G,slots=4,maxmem=8G \ > > > -smp 4 \ > > > -machine type=q35,accel=kvm,cxl=on \ > > > -enable-kvm \ > > > -nographic \ > > > -object memory-backend-ram,id=mem0,size=1G,share=on \ > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \ > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > Result: This worked, but cxl-cli could not create a region (will look > > > into this further later). > > > > > > > > > > > > > > > When running with a persistent memory configuration, I'm seeing a > > > kernel stack trace on cxl_pmem_init > > > > > > Config: > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > > -m 3G,slots=4,maxmem=4G \ > > > -smp 4 \ > > > -machine type=q35,accel=kvm,cxl=on \ > > > -enable-kvm \ > > > -nographic \ > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \ > > > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \ > > > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \ > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > > > > [ 62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0 > > > [ 62.185069] #PF: supervisor read access in kernel mode > > > [ 62.198502] #PF: error_code(0x0000) - not-present page > > > [ 62.211019] PGD 0 P4D 0 > > > [ 62.220521] Oops: 0000 [#1] PREEMPT SMP PTI > > > [ 62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1 > > > [ 62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 > > > [ 62.258432] Adding 2939900k swap on /dev/zram0. Priority:100 extents:1 across:2939900k SSDscFS > > > [ 62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > > [ 62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > > > [ 62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > > > [ 62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > > > [ 62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > > > [ 62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > > > [ 62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > > > [ 62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > > > [ 62.285541] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > > > [ 62.285542] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > [ 62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > > > [ 62.285653] Call Trace: > > > [ 62.285656] <TASK> > > > [ 62.285660] cxl_bus_probe+0x17/0x50 > > > [ 62.285691] really_probe+0xde/0x380 > > > [ 62.285695] ? pm_runtime_barrier+0x50/0x90 > > > [ 62.285700] __driver_probe_device+0x78/0x170 > > > [ 62.285846] driver_probe_device+0x1f/0x90 > > > [ 62.285850] __driver_attach+0xd2/0x1c0 > > > [ 62.285853] ? __pfx___driver_attach+0x10/0x10 > > > [ 62.285856] bus_for_each_dev+0x76/0xa0 > > > [ 62.285860] bus_add_driver+0x1b1/0x200 > > > [ 62.285863] driver_register+0x89/0xe0 > > > [ 62.285868] ? __pfx_init_module+0x10/0x10 [cxl_pmem] > > > [ 62.285874] cxl_pmem_init+0x50/0xff0 [cxl_pmem] > > > [ 62.285880] do_one_initcall+0x6e/0x330 > > > [ 62.285888] do_init_module+0x4a/0x200 > > > [ 62.285892] __do_sys_finit_module+0x93/0xf0 > > > [ 62.285899] do_syscall_64+0x5b/0x80 > > > [ 62.285904] ? do_syscall_64+0x67/0x80 > > > [ 62.285906] ? asm_exc_page_fault+0x22/0x30 > > > [ 62.285910] ? lockdep_hardirqs_on+0x7d/0x100 > > > [ 62.285914] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > > [ 62.285917] RIP: 0033:0x7f2619b0afbd > > > [ 62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8 > > > [ 62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > > > [ 62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd > > > [ 62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006 > > > [ 62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52 > > > [ 62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000 > > > [ 62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0 > > > [ 62.285932] </TASK> > > > [ 62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g > > > [ 62.285954] CR2: 00000000000004c0 > > > [ 62.288385] ---[ end trace 0000000000000000 ]--- > > > [ 63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > > [ 63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > > > [ 63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > > > [ 63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > > > [ 63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > > > [ 63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > > > [ 63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > > > [ 63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > > > [ 63.203580] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > > > [ 63.203583] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > [ 63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > > Possibly replicated. What I did was stop cxl_pmem.ko being probed automatically and > added it manually later. Trace that results is certainly similar to yours. > > Now the MODULE_SOFTDEP() in drivers/cxl/acpi.c should stop that happening > assuming you are letting autoloading run. > I wonder if there is a path in which it doesn't? Gregory, would you mind checking if cxl_nvb is NULL here... https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67 (printk before it is used should work). Might also be worth checking cxl_nvd and cxl_ds but my guess is cxl_nvb is our problem (it is when I deliberate change the load order). Jonathan > > Dan, any thoughts? > > There is another race that I can trigger by repeatedly injecting errors and > causing resets, but the trace for that is very different and > points at cxl_pmem_ctl() called via nvdimm_probe(). I was going to try > and pin that one down a little more before posting a report but might > as well muddy the waters :) > > Unable to handle kernel NULL pointer dereference at virtual address 0000000000000358 > Mem abort info: > ESR = 0x0000000096000004 > EC = 0x25: DABT (current EL), IL = 32 bits > SET = 0, FnV = 0 > EA = 0, S1PTW = 0 > FSC = 0x04: level 0 translation fault > Data abort info: > ISV = 0, ISS = 0x00000004 > CM = 0, WnR = 0 > user pgtable: 4k pages, 48-bit VAs, pgdp=0000000102e12000 > [0000000000000358] pgd=0000000000000000, p4d=0000000000000000 > Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP > Modules linked in: cxl_mem cxl_port cxl_acpi cxl_pmem cxl_pci cxl_core > CPU: 0 PID: 236 Comm: kworker/u8:3 Not tainted 6.2.0-rc3+ #598 > Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 > Workqueue: events_unbound async_run_entry_fn > pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) > pc : cxl_pmem_ctl+0x74/0x244 [cxl_pmem] > lr : cxl_pmem_ctl+0x60/0x244 [cxl_pmem] > sp : ffff8000089239d0 > x29: ffff8000089239d0 x28: 0000000000000000 x27: 0000000000000000 > x26: ffffcd4f6b263000 x25: ffffcd4f6a25d9c8 x24: 0000000000000000 > kernel: Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP > x23: 0000000000000000 x22: 000000000000000c x21: ffff0000c5b95400 > x20: ffff0000c0d9ce0c x19: 0000000000000004 x18: 0000000000000000 > x17: 0000000000000000 x16: ffffcd4f698a5fa0 x15: 0000000000000000 > x14: 0000000000000002 x13: 0000000000000000 x12: 0000000000000000 > x11: 0000000000000001 x10: 409e5dd45a38ef72 x9 : ffffcd4f607531f0 > x8 : ffff0000c0d9ce80 x7 : 0000000000000000 x6 : 0000000000000000 > x5 : ffff800008923a84 x4 : 000000000000000c x3 : ffff800008923a10 > x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000070 > Call trace: > cxl_pmem_ctl+0x74/0x244 [cxl_pmem] > nvdimm_init_nsarea+0xb8/0xdc > nvdimm_probe+0xc0/0x1d0 > nvdimm_bus_probe+0x90/0x200 > really_probe+0xc8/0x3e0 > __driver_probe_device+0x84/0x190 > driver_probe_device+0x44/0x120 > __device_attach_driver+0xc4/0x160 > bus_for_each_drv+0x80/0xe0 > __device_attach+0xa4/0x1cc > device_initial_probe+0x1c/0x2c > bus_probe_device+0xa4/0xb0 > device_add+0x404/0x920 > nd_async_device_register+0x20/0x70 > async_run_entry_fn+0x3c/0x154 > process_one_work+0x200/0x474 > worker_thread+0x74/0x43c > kthread+0x110/0x114 > ret_from_fork+0x10/0x20 > Code: 53067e61 f90023e0 f9417aa2 f8617860 (f941ac57) > ---[ end trace 0000000000000000 ]--- > > Note this seems to have gotten harder to hit for some reason - took > about 50 resets. > > I'll keep digging > > Jonathan >
On Thu, 19 Jan 2023 15:04:49 +0000 Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > On Thu, 19 Jan 2023 12:42:44 +0000 > Jonathan Cameron via <qemu-devel@nongnu.org> wrote: > > > On Wed, 18 Jan 2023 14:31:53 -0500 > > Gregory Price <gregory.price@memverge.com> wrote: > > > > > I apparently forgot an intro lol > > > > > > I tested the DOE linux branch with the 2023-1-11 QEMU branch with both > > > volatile, non-volatile, and "legacy" (pre-my-patch) non-volatile mode. > > > > > > 1) *In volatile mode, there are no stack traces present (during boot*) > > > > > > On Wed, Jan 18, 2023 at 02:22:08PM -0500, Gregory Price wrote: > > > > > > > > 1) No stack traces present > > > > 2) Device usage appears to work, but cxl-cli fails to create a region, i > > > > haven't checked why yet (also tried ndctl-75, same results) > > > > 3) There seems to be some other regression with the cxl_pmem_init > > > > routine, because I get a stack trace in this setup regardless of whether > > > > I apply the type-3 device commit. > > > > > > > > > > > > All tests below with the previously posted DOE linux branch. > > > > Base QEMU branch was Jonathan's 2023-1-11 > > > > > > > > > > > > DOE Branch - 2023-1-11 (HEAD) (all commits) > > > > > > > > QEMU Config: > > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > > > -m 3G,slots=4,maxmem=8G \ > > > > -smp 4 \ > > > > -machine type=q35,accel=kvm,cxl=on \ > > > > -enable-kvm \ > > > > -nographic \ > > > > -object memory-backend-ram,id=mem0,size=1G,share=on \ > > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \ > > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > > > Result: This worked, but cxl-cli could not create a region (will look > > > > into this further later). > > > > > > > > > > > > > > > > > > > > When running with a persistent memory configuration, I'm seeing a > > > > kernel stack trace on cxl_pmem_init > > > > > > > > Config: > > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > > > -m 3G,slots=4,maxmem=4G \ > > > > -smp 4 \ > > > > -machine type=q35,accel=kvm,cxl=on \ > > > > -enable-kvm \ > > > > -nographic \ > > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > > > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > > > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \ > > > > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \ > > > > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \ > > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > > > > > > > [ 62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0 > > > > [ 62.185069] #PF: supervisor read access in kernel mode > > > > [ 62.198502] #PF: error_code(0x0000) - not-present page > > > > [ 62.211019] PGD 0 P4D 0 > > > > [ 62.220521] Oops: 0000 [#1] PREEMPT SMP PTI > > > > [ 62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1 > > > > [ 62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 > > > > [ 62.258432] Adding 2939900k swap on /dev/zram0. Priority:100 extents:1 across:2939900k SSDscFS > > > > [ 62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > > > [ 62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > > > > [ 62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > > > > [ 62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > > > > [ 62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > > > > [ 62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > > > > [ 62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > > > > [ 62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > > > > [ 62.285541] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > > > > [ 62.285542] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [ 62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > > > > [ 62.285653] Call Trace: > > > > [ 62.285656] <TASK> > > > > [ 62.285660] cxl_bus_probe+0x17/0x50 > > > > [ 62.285691] really_probe+0xde/0x380 > > > > [ 62.285695] ? pm_runtime_barrier+0x50/0x90 > > > > [ 62.285700] __driver_probe_device+0x78/0x170 > > > > [ 62.285846] driver_probe_device+0x1f/0x90 > > > > [ 62.285850] __driver_attach+0xd2/0x1c0 > > > > [ 62.285853] ? __pfx___driver_attach+0x10/0x10 > > > > [ 62.285856] bus_for_each_dev+0x76/0xa0 > > > > [ 62.285860] bus_add_driver+0x1b1/0x200 > > > > [ 62.285863] driver_register+0x89/0xe0 > > > > [ 62.285868] ? __pfx_init_module+0x10/0x10 [cxl_pmem] > > > > [ 62.285874] cxl_pmem_init+0x50/0xff0 [cxl_pmem] > > > > [ 62.285880] do_one_initcall+0x6e/0x330 > > > > [ 62.285888] do_init_module+0x4a/0x200 > > > > [ 62.285892] __do_sys_finit_module+0x93/0xf0 > > > > [ 62.285899] do_syscall_64+0x5b/0x80 > > > > [ 62.285904] ? do_syscall_64+0x67/0x80 > > > > [ 62.285906] ? asm_exc_page_fault+0x22/0x30 > > > > [ 62.285910] ? lockdep_hardirqs_on+0x7d/0x100 > > > > [ 62.285914] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > > > [ 62.285917] RIP: 0033:0x7f2619b0afbd > > > > [ 62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8 > > > > [ 62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > > > > [ 62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd > > > > [ 62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006 > > > > [ 62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52 > > > > [ 62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000 > > > > [ 62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0 > > > > [ 62.285932] </TASK> > > > > [ 62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g > > > > [ 62.285954] CR2: 00000000000004c0 > > > > [ 62.288385] ---[ end trace 0000000000000000 ]--- > > > > [ 63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > > > [ 63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > > > > [ 63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > > > > [ 63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > > > > [ 63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > > > > [ 63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > > > > [ 63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > > > > [ 63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > > > > [ 63.203580] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > > > > [ 63.203583] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [ 63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > > > > Possibly replicated. What I did was stop cxl_pmem.ko being probed automatically and > > added it manually later. Trace that results is certainly similar to yours. > > > > Now the MODULE_SOFTDEP() in drivers/cxl/acpi.c should stop that happening > > assuming you are letting autoloading run. > > I wonder if there is a path in which it doesn't? > > Gregory, would you mind checking if > cxl_nvb is NULL here... > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67 > (printk before it is used should work). > > Might also be worth checking cxl_nvd and cxl_ds > but my guess is cxl_nvb is our problem (it is when I deliberate change > the load order). Whilst I still have no idea if this is the same problem, I have identified what goes wrong if there is a module probe ordering issue. https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/core/pmem.c#L306 /* * The two actions below arrange for @cxl_nvd to be deleted when either * the top-level PMEM bridge goes down, or the endpoint device goes * through ->remove(). */ device_lock(&cxl_nvb->dev); if (cxl_nvb->dev.driver) rc = devm_add_action_or_reset(&cxl_nvb->dev, cxl_nvd_unregister, cxl_nvd); else // bridge driver not loaded, so we hit this path. rc = -ENXIO; device_unlock(&cxl_nvb->dev); if (rc) /// and this one goto err_alloc; /* @cxlmd carries a reference on @cxl_nvb until cxlmd_release_nvdimm */ return devm_add_action_or_reset(&cxlmd->dev, cxlmd_release_nvdimm, cxlmd); err: put_device(dev); err_alloc: cxlmd->cxl_nvb = NULL; cxlmd->cxl_nvd = NULL; put_device(&cxl_nvb->dev); // whilst we scrub the pointers we don't actually get rid of the // cxl_nvd that we registered. Hence later load of the driver tries to // attach to that and boom because we've scrubbed these pointers here. // A quick hack is to just call device_del(&cxl_nvd->dev) if rc = -ENXIO here. // There may well be a races though.... return rc; } EXPORT_SYMBOL_NS_GPL(devm_cxl_add_nvdimm, CXL); Of course this "fix" just stops things blowing up, it doesn't leave things in a remotely useful state. If it's triggered because someone is messing with the load order that's fine. If the same issue is occurring for Gregory, not so much. Jonathan > > Jonathan > > > > > > Dan, any thoughts? > > > > There is another race that I can trigger by repeatedly injecting errors and > > causing resets, but the trace for that is very different and > > points at cxl_pmem_ctl() called via nvdimm_probe(). I was going to try > > and pin that one down a little more before posting a report but might > > as well muddy the waters :) > > > > Unable to handle kernel NULL pointer dereference at virtual address 0000000000000358 > > Mem abort info: > > ESR = 0x0000000096000004 > > EC = 0x25: DABT (current EL), IL = 32 bits > > SET = 0, FnV = 0 > > EA = 0, S1PTW = 0 > > FSC = 0x04: level 0 translation fault > > Data abort info: > > ISV = 0, ISS = 0x00000004 > > CM = 0, WnR = 0 > > user pgtable: 4k pages, 48-bit VAs, pgdp=0000000102e12000 > > [0000000000000358] pgd=0000000000000000, p4d=0000000000000000 > > Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP > > Modules linked in: cxl_mem cxl_port cxl_acpi cxl_pmem cxl_pci cxl_core > > CPU: 0 PID: 236 Comm: kworker/u8:3 Not tainted 6.2.0-rc3+ #598 > > Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 > > Workqueue: events_unbound async_run_entry_fn > > pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) > > pc : cxl_pmem_ctl+0x74/0x244 [cxl_pmem] > > lr : cxl_pmem_ctl+0x60/0x244 [cxl_pmem] > > sp : ffff8000089239d0 > > x29: ffff8000089239d0 x28: 0000000000000000 x27: 0000000000000000 > > x26: ffffcd4f6b263000 x25: ffffcd4f6a25d9c8 x24: 0000000000000000 > > kernel: Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP > > x23: 0000000000000000 x22: 000000000000000c x21: ffff0000c5b95400 > > x20: ffff0000c0d9ce0c x19: 0000000000000004 x18: 0000000000000000 > > x17: 0000000000000000 x16: ffffcd4f698a5fa0 x15: 0000000000000000 > > x14: 0000000000000002 x13: 0000000000000000 x12: 0000000000000000 > > x11: 0000000000000001 x10: 409e5dd45a38ef72 x9 : ffffcd4f607531f0 > > x8 : ffff0000c0d9ce80 x7 : 0000000000000000 x6 : 0000000000000000 > > x5 : ffff800008923a84 x4 : 000000000000000c x3 : ffff800008923a10 > > x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000070 > > Call trace: > > cxl_pmem_ctl+0x74/0x244 [cxl_pmem] > > nvdimm_init_nsarea+0xb8/0xdc > > nvdimm_probe+0xc0/0x1d0 > > nvdimm_bus_probe+0x90/0x200 > > really_probe+0xc8/0x3e0 > > __driver_probe_device+0x84/0x190 > > driver_probe_device+0x44/0x120 > > __device_attach_driver+0xc4/0x160 > > bus_for_each_drv+0x80/0xe0 > > __device_attach+0xa4/0x1cc > > device_initial_probe+0x1c/0x2c > > bus_probe_device+0xa4/0xb0 > > device_add+0x404/0x920 > > nd_async_device_register+0x20/0x70 > > async_run_entry_fn+0x3c/0x154 > > process_one_work+0x200/0x474 > > worker_thread+0x74/0x43c > > kthread+0x110/0x114 > > ret_from_fork+0x10/0x20 > > Code: 53067e61 f90023e0 f9417aa2 f8617860 (f941ac57) > > ---[ end trace 0000000000000000 ]--- > > > > Note this seems to have gotten harder to hit for some reason - took > > about 50 resets. > > > > I'll keep digging > > > > Jonathan > >
On Thu, Jan 19, 2023 at 03:04:49PM +0000, Jonathan Cameron wrote: > Gregory, would you mind checking if > cxl_nvb is NULL here... > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67 > (printk before it is used should work). > > Might also be worth checking cxl_nvd and cxl_ds > but my guess is cxl_nvb is our problem (it is when I deliberate change > the load order). > > Jonathan > This is exactly the issue. cxl_nvb is null, the rest appear fine. Also, note, that weirdly the non-volatile bridge shows up when launching this in volatile mode, but no stack trace appears. ¯\_(ツ)_/¯ After spending way too much time tracing through the current cxl driver code, i have only really determined that 1) The code is very pmem oriented, and it's unclear to me how the driver as-is differentiates a persistent device from a volatile device. That code path still completely escapes me. The only differentiating code i see is in the memdev probe path that creates mem#/pmem and mem#/ram 2) The code successfully manages probe, enable, and mount a REAL device - cxl memdev appears (/sys/bus/cxl/devices/mem0) - a dax device appears (/sys/bus/dax/devices/) This happens at boot, which I assume must be bios related - The memory *does not* auto-online, instead the dax device can be onlined as system-ram *manually* via ndctl and friends 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless of the type-3 device configuration (pmem-only or vmem-only) # CFMW defined [root@fedora ~]# ls /sys/bus/cxl/devices/ decoder0.0 decoder2.0 mem0 port1 decoder1.0 endpoint2 nvdimm-bridge0 root0 # CFMW not defined [root@fedora ~]# ls /sys/bus/cxl/devices/ decoder1.0 decoder2.0 endpoint2 mem0 port1 root0 4) As you can see above, multiple decoders are registered. I'm not sure if that's correct or not, but it does seem odd given there's only one cxl type-3 device. Odd that decoder0.0 shows up when CFMW is there, but not when it isn't. Note: All these tests have two root ports: -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \ -device cxl-rp,id=rp1,bus=cxl.0,chassis=0,port=1,slot=1 \ Don't know why I haven't thought of this until now, but is the CFMW code reporting something odd about what's behind it? Is it assuming the devices are pmem?
On Thu, Jan 19, 2023 at 04:17:11PM +0000, Jonathan Cameron wrote: > > Whilst I still have no idea if this is the same problem, I have identified > what goes wrong if there is a module probe ordering issue. > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/core/pmem.c#L306 > > /* > * The two actions below arrange for @cxl_nvd to be deleted when either > * the top-level PMEM bridge goes down, or the endpoint device goes > * through ->remove(). > */ > device_lock(&cxl_nvb->dev); > if (cxl_nvb->dev.driver) > rc = devm_add_action_or_reset(&cxl_nvb->dev, cxl_nvd_unregister, > cxl_nvd); > else > // bridge driver not loaded, so we hit this path. > rc = -ENXIO; > device_unlock(&cxl_nvb->dev); > > if (rc) > /// and this one > goto err_alloc; > > /* @cxlmd carries a reference on @cxl_nvb until cxlmd_release_nvdimm */ > return devm_add_action_or_reset(&cxlmd->dev, cxlmd_release_nvdimm, cxlmd); > > err: > put_device(dev); > err_alloc: > cxlmd->cxl_nvb = NULL; > cxlmd->cxl_nvd = NULL; > put_device(&cxl_nvb->dev); > // whilst we scrub the pointers we don't actually get rid of the > // cxl_nvd that we registered. Hence later load of the driver tries to > // attach to that and boom because we've scrubbed these pointers here. > // A quick hack is to just call device_del(&cxl_nvd->dev) if rc = -ENXIO here. > // There may well be a races though.... > return rc; > } > EXPORT_SYMBOL_NS_GPL(devm_cxl_add_nvdimm, CXL); > > > Of course this "fix" just stops things blowing up, it doesn't leave things > in a remotely useful state. If it's triggered because someone > is messing with the load order that's fine. If the same issue > is occurring for Gregory, not so much. > > Jonathan > mild hint in the dev_cxl_add_nvdimm_bridge path driver/cxl/acpi.c static int cxl_acpi_probe(struct platform_device *pdev) { ... snip ... if (IS_ENABLED(CONFIG_CXL_PMEM)) rc = device_for_each_child(&root_port->dev, root_port, add_root_nvdimm_bridge); if (rc < 0) return rc; /* In case PCI is scanned before ACPI re-trigger memdev attach */ cxl_bus_rescan(); return 0; } if PCI is presently written in a way that it's expecting nvdimm_bridge to be present (via acpi_probe), then clearly this would break. From the other discussion here... that seems to be the issue? If that's an issue, I also imagine there are other parts that may be subject to the same problem. static int cxl_pmem_region_probe(struct device *dev) { struct nd_mapping_desc mappings[CXL_DECODER_MAX_INTERLEAVE]; struct cxl_pmem_region *cxlr_pmem = to_cxl_pmem_region(dev); struct cxl_region *cxlr = cxlr_pmem->cxlr; struct cxl_nvdimm_bridge *cxl_nvb = cxlr->cxl_nvb; this may be unreachable due to prior stack traces, but you get the point. Reiterating my confusion a bit: I don't have an nvdimm, why am i getting an nvdimm_bridge? The reason it no longer appears to trigger on my memexp example is because it doesnt go down this path: static int cxl_mem_probe(struct device *dev) { ... snip ... // resource size is 0 here due to type3dev->persistent_capacity=0 if (resource_size(&cxlds->pmem_res) && IS_ENABLED(CONFIG_CXL_PMEM)) { rc = devm_cxl_add_nvdimm(cxlmd); if (rc == -ENODEV) dev_info(dev, "PMEM disabled by platform\n"); else return rc; } ... snip ... } This seems like more than an ordering issue.
On Thu, 19 Jan 2023 23:53:53 -0500 Gregory Price <gregory.price@memverge.com> wrote: > On Thu, Jan 19, 2023 at 03:04:49PM +0000, Jonathan Cameron wrote: > > Gregory, would you mind checking if > > cxl_nvb is NULL here... > > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67 > > (printk before it is used should work). > > > > Might also be worth checking cxl_nvd and cxl_ds > > but my guess is cxl_nvb is our problem (it is when I deliberate change > > the load order). > > > > Jonathan > > > > This is exactly the issue. cxl_nvb is null, the rest appear fine. > > Also, note, that weirdly the non-volatile bridge shows up when launching > this in volatile mode, but no stack trace appears. > > ¯\_(ツ)_/¯ > > After spending way too much time tracing through the current cxl driver > code, i have only really determined that > > 1) The code is very pmem oriented, and it's unclear to me how the driver > as-is differentiates a persistent device from a volatile device. That > code path still completely escapes me. The only differentiating code > i see is in the memdev probe path that creates mem#/pmem and mem#/ram Absolutely on pmem. Target for kernel side of things was always pmem first. Volatile has been on roadmap / todo list for a few kernel cycles but I haven't seen any code yet. > > 2) The code successfully manages probe, enable, and mount a REAL device > - cxl memdev appears (/sys/bus/cxl/devices/mem0) > - a dax device appears (/sys/bus/dax/devices/) > This happens at boot, which I assume must be bios related > - The memory *does not* auto-online, instead the dax device can be > onlined as system-ram *manually* via ndctl and friends Interesting. Just curious, is the host a CXL 1.1 host or a CXL 2.0 host? > > 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless > of the type-3 device configuration (pmem-only or vmem-only) > > # CFMW defined > [root@fedora ~]# ls /sys/bus/cxl/devices/ > decoder0.0 decoder2.0 mem0 port1 > decoder1.0 endpoint2 nvdimm-bridge0 root0 > > # CFMW not defined > [root@fedora ~]# ls /sys/bus/cxl/devices/ > decoder1.0 decoder2.0 endpoint2 mem0 port1 root0 That should be harmless and may be needed to tie everything through to DAX. > > 4) As you can see above, multiple decoders are registered. I'm not sure > if that's correct or not, but it does seem odd given there's only one > cxl type-3 device. Odd that decoder0.0 shows up when CFMW is there, > but not when it isn't. > > Note: All these tests have two root ports: > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \ > -device cxl-rp,id=rp1,bus=cxl.0,chassis=0,port=1,slot=1 \ IIRC decoder0.0 represents the fixed routing in the host as defined by the CFMWS - not an actual programmable decoder. decoder1.0 is the routing in the host bridge - may be pass through decoder if there is only one root port. decoder2.0 is the one is the endpoint itself. > > > Don't know why I haven't thought of this until now, but is the CFMW code > reporting something odd about what's behind it? Is it assuming the > devices are pmem? It reports the ability to support pmem or support volatile or support both. Currently https://elixir.bootlin.com/qemu/latest/source/hw/acpi/cxl.c#L107 qemu reports that all CFMWS windows support everything except "Fixed Device Configuration (Bit[4])" which would tell the OS not to move devices that are already programmed out of this window and doesn't really make sense for QEMU to ever set. That is we support all of: Device Coherent (type 2 and back invalidate flows on type 3, though we aren't emulating the back invalidate stuff yet on the EP) Host only coherent. [thinking about it we should probably not support both this and device coherent as they would be mutually incompatible on a real host] Volatile Persistent Jonathan
Jonathan Cameron wrote: > On Thu, 19 Jan 2023 15:04:49 +0000 > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > > > On Thu, 19 Jan 2023 12:42:44 +0000 > > Jonathan Cameron via <qemu-devel@nongnu.org> wrote: > > > > > On Wed, 18 Jan 2023 14:31:53 -0500 > > > Gregory Price <gregory.price@memverge.com> wrote: > > > > > > > I apparently forgot an intro lol > > > > > > > > I tested the DOE linux branch with the 2023-1-11 QEMU branch with both > > > > volatile, non-volatile, and "legacy" (pre-my-patch) non-volatile mode. > > > > > > > > 1) *In volatile mode, there are no stack traces present (during boot*) > > > > > > > > On Wed, Jan 18, 2023 at 02:22:08PM -0500, Gregory Price wrote: > > > > > > > > > > 1) No stack traces present > > > > > 2) Device usage appears to work, but cxl-cli fails to create a region, i > > > > > haven't checked why yet (also tried ndctl-75, same results) > > > > > 3) There seems to be some other regression with the cxl_pmem_init > > > > > routine, because I get a stack trace in this setup regardless of whether > > > > > I apply the type-3 device commit. > > > > > > > > > > > > > > > All tests below with the previously posted DOE linux branch. > > > > > Base QEMU branch was Jonathan's 2023-1-11 > > > > > > > > > > > > > > > DOE Branch - 2023-1-11 (HEAD) (all commits) > > > > > > > > > > QEMU Config: > > > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > > > > -m 3G,slots=4,maxmem=8G \ > > > > > -smp 4 \ > > > > > -machine type=q35,accel=kvm,cxl=on \ > > > > > -enable-kvm \ > > > > > -nographic \ > > > > > -object memory-backend-ram,id=mem0,size=1G,share=on \ > > > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > > > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > > > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \ > > > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > > > > > Result: This worked, but cxl-cli could not create a region (will look > > > > > into this further later). > > > > > > > > > > > > > > > > > > > > > > > > > When running with a persistent memory configuration, I'm seeing a > > > > > kernel stack trace on cxl_pmem_init > > > > > > > > > > Config: > > > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \ > > > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \ > > > > > -m 3G,slots=4,maxmem=4G \ > > > > > -smp 4 \ > > > > > -machine type=q35,accel=kvm,cxl=on \ > > > > > -enable-kvm \ > > > > > -nographic \ > > > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > > > > > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \ > > > > > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \ > > > > > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \ > > > > > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \ > > > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G > > > > > > > > > > > > > > > [ 62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0 > > > > > [ 62.185069] #PF: supervisor read access in kernel mode > > > > > [ 62.198502] #PF: error_code(0x0000) - not-present page > > > > > [ 62.211019] PGD 0 P4D 0 > > > > > [ 62.220521] Oops: 0000 [#1] PREEMPT SMP PTI > > > > > [ 62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1 > > > > > [ 62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 > > > > > [ 62.258432] Adding 2939900k swap on /dev/zram0. Priority:100 extents:1 across:2939900k SSDscFS > > > > > [ 62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > > > > [ 62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > > > > > [ 62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > > > > > [ 62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > > > > > [ 62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > > > > > [ 62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > > > > > [ 62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > > > > > [ 62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > > > > > [ 62.285541] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > > > > > [ 62.285542] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > > [ 62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > > > > > [ 62.285653] Call Trace: > > > > > [ 62.285656] <TASK> > > > > > [ 62.285660] cxl_bus_probe+0x17/0x50 > > > > > [ 62.285691] really_probe+0xde/0x380 > > > > > [ 62.285695] ? pm_runtime_barrier+0x50/0x90 > > > > > [ 62.285700] __driver_probe_device+0x78/0x170 > > > > > [ 62.285846] driver_probe_device+0x1f/0x90 > > > > > [ 62.285850] __driver_attach+0xd2/0x1c0 > > > > > [ 62.285853] ? __pfx___driver_attach+0x10/0x10 > > > > > [ 62.285856] bus_for_each_dev+0x76/0xa0 > > > > > [ 62.285860] bus_add_driver+0x1b1/0x200 > > > > > [ 62.285863] driver_register+0x89/0xe0 > > > > > [ 62.285868] ? __pfx_init_module+0x10/0x10 [cxl_pmem] > > > > > [ 62.285874] cxl_pmem_init+0x50/0xff0 [cxl_pmem] > > > > > [ 62.285880] do_one_initcall+0x6e/0x330 > > > > > [ 62.285888] do_init_module+0x4a/0x200 > > > > > [ 62.285892] __do_sys_finit_module+0x93/0xf0 > > > > > [ 62.285899] do_syscall_64+0x5b/0x80 > > > > > [ 62.285904] ? do_syscall_64+0x67/0x80 > > > > > [ 62.285906] ? asm_exc_page_fault+0x22/0x30 > > > > > [ 62.285910] ? lockdep_hardirqs_on+0x7d/0x100 > > > > > [ 62.285914] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > > > > [ 62.285917] RIP: 0033:0x7f2619b0afbd > > > > > [ 62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8 > > > > > [ 62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > > > > > [ 62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd > > > > > [ 62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006 > > > > > [ 62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52 > > > > > [ 62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000 > > > > > [ 62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0 > > > > > [ 62.285932] </TASK> > > > > > [ 62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g > > > > > [ 62.285954] CR2: 00000000000004c0 > > > > > [ 62.288385] ---[ end trace 0000000000000000 ]--- > > > > > [ 63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem] > > > > > [ 63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0 > > > > > [ 63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202 > > > > > [ 63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000 > > > > > [ 63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff > > > > > [ 63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001 > > > > > [ 63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000 > > > > > [ 63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000 > > > > > [ 63.203580] FS: 00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000 > > > > > [ 63.203583] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > > [ 63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0 > > > > > > Possibly replicated. What I did was stop cxl_pmem.ko being probed automatically and > > > added it manually later. Trace that results is certainly similar to yours. > > > > > > Now the MODULE_SOFTDEP() in drivers/cxl/acpi.c should stop that happening > > > assuming you are letting autoloading run. > > > I wonder if there is a path in which it doesn't? > > > > Gregory, would you mind checking if > > cxl_nvb is NULL here... > > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67 > > (printk before it is used should work). > > > > Might also be worth checking cxl_nvd and cxl_ds > > but my guess is cxl_nvb is our problem (it is when I deliberate change > > the load order). > > Whilst I still have no idea if this is the same problem, I have identified > what goes wrong if there is a module probe ordering issue. > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/core/pmem.c#L306 > > /* > * The two actions below arrange for @cxl_nvd to be deleted when either > * the top-level PMEM bridge goes down, or the endpoint device goes > * through ->remove(). > */ > device_lock(&cxl_nvb->dev); > if (cxl_nvb->dev.driver) > rc = devm_add_action_or_reset(&cxl_nvb->dev, cxl_nvd_unregister, > cxl_nvd); > else > // bridge driver not loaded, so we hit this path. > rc = -ENXIO; > device_unlock(&cxl_nvb->dev); > > if (rc) > /// and this one > goto err_alloc; > > /* @cxlmd carries a reference on @cxl_nvb until cxlmd_release_nvdimm */ > return devm_add_action_or_reset(&cxlmd->dev, cxlmd_release_nvdimm, cxlmd); > > err: > put_device(dev); > err_alloc: > cxlmd->cxl_nvb = NULL; > cxlmd->cxl_nvd = NULL; > put_device(&cxl_nvb->dev); > // whilst we scrub the pointers we don't actually get rid of the > // cxl_nvd that we registered. Hence later load of the driver tries to > // attach to that and boom because we've scrubbed these pointers here. > // A quick hack is to just call device_del(&cxl_nvd->dev) if rc = -ENXIO here. > // There may well be a races though.... > return rc; > } > EXPORT_SYMBOL_NS_GPL(devm_cxl_add_nvdimm, CXL); > > > Of course this "fix" just stops things blowing up, it doesn't leave things > in a remotely useful state. If it's triggered because someone > is messing with the load order that's fine. If the same issue > is occurring for Gregory, not so much. Apologies for not engaging on this sooner, I have been heads down on trying to get pre-existing region enumeration on its feet. One workaround for this is to just preclude the load order scenario from happening by making the pmem enabling part of the core module. There is not much reason for it to exist indepdendently. ...but I think it is useful to be able to have some policy for disabling all pmem enabling in an emergency. So I think fixing the cxl_nvd unregistration is the right way to go.
Gregory Price wrote: > On Thu, Jan 19, 2023 at 03:04:49PM +0000, Jonathan Cameron wrote: > > Gregory, would you mind checking if > > cxl_nvb is NULL here... > > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67 > > (printk before it is used should work). > > > > Might also be worth checking cxl_nvd and cxl_ds > > but my guess is cxl_nvb is our problem (it is when I deliberate change > > the load order). > > > > Jonathan > > > > This is exactly the issue. cxl_nvb is null, the rest appear fine. > > Also, note, that weirdly the non-volatile bridge shows up when launching > this in volatile mode, but no stack trace appears. > > ¯\_(ツ)_/¯ > > After spending way too much time tracing through the current cxl driver > code, i have only really determined that > > 1) The code is very pmem oriented, and it's unclear to me how the driver > as-is differentiates a persistent device from a volatile device. That > code path still completely escapes me. The only differentiating code > i see is in the memdev probe path that creates mem#/pmem and mem#/ram Yes, pmem was the initial focus because it had the most dependency on the OS to setup vs BIOS, but the ram enabling is at the top of the queue now. > > 2) The code successfully manages probe, enable, and mount a REAL device > - cxl memdev appears (/sys/bus/cxl/devices/mem0) > - a dax device appears (/sys/bus/dax/devices/) > This happens at boot, which I assume must be bios related As it stands currently that dax device and the cxl device are not related since a default dax-device is loaded just based on the presence of an EFI_MEMORY_SP address range in the address map. With the new ram enabling that default device will be elided and CXL will register a dax-device parented by a cxl region. > - The memory *does not* auto-online, instead the dax device can be > onlined as system-ram *manually* via ndctl and friends That *manually* part is the problem that needs distro help to solve. It should be the case that by default all Linux distributions auto-online all dax-devices. If that happens to online memory that is too slow for general use, or too high-performance / precious for general purpose use then the administrator can set policy after the fact. Unfortunately user policy can not be applied if these memory ranges were onlined by the kernel at boot , so that's why the kernel policy defaults to not-online. In other words, there is no guarantee that memory that was assigned to the general purpose pool at boot can be removed. The only guaranteed behavior is to never give the memory to the core kernel in the first instance and always let user policy route the memory. > 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless > of the type-3 device configuration (pmem-only or vmem-only) Correct, the top-level bus code (cxl_acpi) and the endpoint code (cxl_mem, cxl_port) need to handshake before establishing regions. For pmem regions the platform needs to claim the availability of a pmem capable CXL window. > > # CFMW defined > [root@fedora ~]# ls /sys/bus/cxl/devices/ > decoder0.0 decoder2.0 mem0 port1 > decoder1.0 endpoint2 nvdimm-bridge0 root0 > > # CFMW not defined > [root@fedora ~]# ls /sys/bus/cxl/devices/ > decoder1.0 decoder2.0 endpoint2 mem0 port1 root0 > > 4) As you can see above, multiple decoders are registered. I'm not sure > if that's correct or not, but it does seem odd given there's only one > cxl type-3 device. Odd that decoder0.0 shows up when CFMW is there, > but not when it isn't. CXL windows are modeled as decoders hanging off the the CXL root device (ACPI0017 on ACPI based platforms). An endpoint decoder can then map a selection of that window. > > Note: All these tests have two root ports: > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \ > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \ > -device cxl-rp,id=rp1,bus=cxl.0,chassis=0,port=1,slot=1 \ > > > Don't know why I haven't thought of this until now, but is the CFMW code > reporting something odd about what's behind it? Is it assuming the > devices are pmem? No, the cxl_acpi code is just advertising platform decode possibilities independent of what devices show up. Think of this like the PCI MMIO space that gets allocated to a root bridge at the beginning of time. That space may or may not get consumed based on what devices show up downstream.
On Fri, Jan 20, 2023 at 09:38:13AM -0800, Dan Williams wrote: > As it stands currently that dax device and the cxl device are not > related since a default dax-device is loaded just based on the presence > of an EFI_MEMORY_SP address range in the address map. With the new ram > enabling that default device will be elided and CXL will register a > dax-device parented by a cxl region. > > > - The memory *does not* auto-online, instead the dax device can be > > onlined as system-ram *manually* via ndctl and friends > > That *manually* part is the problem that needs distro help to solve. It > should be the case that by default all Linux distributions auto-online > all dax-devices. If that happens to online memory that is too slow for > general use, or too high-performance / precious for general purpose use > then the administrator can set policy after the fact. Unfortunately user > policy can not be applied if these memory ranges were onlined by the > kernel at boot , so that's why the kernel policy defaults to not-online. > > In other words, there is no guarantee that memory that was assigned to > the general purpose pool at boot can be removed. The only guaranteed > behavior is to never give the memory to the core kernel in the first > instance and always let user policy route the memory. > > > 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless > > of the type-3 device configuration (pmem-only or vmem-only) > > Correct, the top-level bus code (cxl_acpi) and the endpoint code > (cxl_mem, cxl_port) need to handshake before establishing regions. For > pmem regions the platform needs to claim the availability of a pmem > capable CXL window. > > > 4) As you can see above, multiple decoders are registered. I'm not sure > > if that's correct or not, but it does seem odd given there's only one > > cxl type-3 device. Odd that decoder0.0 shows up when CFMW is there, > > but not when it isn't. > > CXL windows are modeled as decoders hanging off the the CXL root device > (ACPI0017 on ACPI based platforms). An endpoint decoder can then map a > selection of that window. > > > Don't know why I haven't thought of this until now, but is the CFMW code > > reporting something odd about what's behind it? Is it assuming the > > devices are pmem? > > No, the cxl_acpi code is just advertising platform decode possibilities > independent of what devices show up. Think of this like the PCI MMIO > space that gets allocated to a root bridge at the beginning of time. > That space may or may not get consumed based on what devices show up > downstream. Thank you for the explanation Dan, and thank you for you patience @JCameron. I'm fairly sure I grok it now. Summarizing to make sure: the cxl driver is providing what would be the CXL.io (control) path, and the CXL.mem path is basically being simulated by what otherwise would be a traditional PCI memory region. This explains why turning off Legacy mode drops the dax devices, and why the topology looks strange - the devices are basically attached in 2 different ways. Might there be interest from the QEMU community to implement this legacy-style setup in the short term, in an effort to test the the control path of type-3 devices while we wait for the kernel to catch up? Or should we forget this mode ever existed and just barrel forward with HDM decoders and writing the kernel code to hook up the underlying devices in drivers/cxl?
Gregory Price wrote: > On Fri, Jan 20, 2023 at 09:38:13AM -0800, Dan Williams wrote: > > As it stands currently that dax device and the cxl device are not > > related since a default dax-device is loaded just based on the presence > > of an EFI_MEMORY_SP address range in the address map. With the new ram > > enabling that default device will be elided and CXL will register a > > dax-device parented by a cxl region. > > > > > - The memory *does not* auto-online, instead the dax device can be > > > onlined as system-ram *manually* via ndctl and friends > > > > That *manually* part is the problem that needs distro help to solve. It > > should be the case that by default all Linux distributions auto-online > > all dax-devices. If that happens to online memory that is too slow for > > general use, or too high-performance / precious for general purpose use > > then the administrator can set policy after the fact. Unfortunately user > > policy can not be applied if these memory ranges were onlined by the > > kernel at boot , so that's why the kernel policy defaults to not-online. > > > > In other words, there is no guarantee that memory that was assigned to > > the general purpose pool at boot can be removed. The only guaranteed > > behavior is to never give the memory to the core kernel in the first > > instance and always let user policy route the memory. > > > > > 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless > > > of the type-3 device configuration (pmem-only or vmem-only) > > > > Correct, the top-level bus code (cxl_acpi) and the endpoint code > > (cxl_mem, cxl_port) need to handshake before establishing regions. For > > pmem regions the platform needs to claim the availability of a pmem > > capable CXL window. > > > > > 4) As you can see above, multiple decoders are registered. I'm not sure > > > if that's correct or not, but it does seem odd given there's only one > > > cxl type-3 device. Odd that decoder0.0 shows up when CFMW is there, > > > but not when it isn't. > > > > CXL windows are modeled as decoders hanging off the the CXL root device > > (ACPI0017 on ACPI based platforms). An endpoint decoder can then map a > > selection of that window. > > > > > Don't know why I haven't thought of this until now, but is the CFMW code > > > reporting something odd about what's behind it? Is it assuming the > > > devices are pmem? > > > > No, the cxl_acpi code is just advertising platform decode possibilities > > independent of what devices show up. Think of this like the PCI MMIO > > space that gets allocated to a root bridge at the beginning of time. > > That space may or may not get consumed based on what devices show up > > downstream. > > Thank you for the explanation Dan, and thank you for you patience > @JCameron. I'm fairly sure I grok it now. > > Summarizing to make sure: the cxl driver is providing what would be the > CXL.io (control) path, and the CXL.mem path is basically being simulated > by what otherwise would be a traditional PCI memory region. This explains > why turning off Legacy mode drops the dax devices, and why the topology > looks strange - the devices are basically attached in 2 different ways. > > Might there be interest from the QEMU community to implement this > legacy-style setup in the short term, in an effort to test the the > control path of type-3 devices while we wait for the kernel to catch up? > > Or should we forget this mode ever existed and just barrel forward > with HDM decoders and writing the kernel code to hook up the underlying > devices in drivers/cxl? Which mode are you referring? The next steps for the kernel enabling relevant to this thread are: * ram region discovery (platform firmware or kexec established) * ram region creation * pmem region discovery (from labels)
On Fri, 20 Jan 2023 14:41:05 -0800 Dan Williams <dan.j.williams@intel.com> wrote: > Gregory Price wrote: > > On Fri, Jan 20, 2023 at 09:38:13AM -0800, Dan Williams wrote: > > > As it stands currently that dax device and the cxl device are not > > > related since a default dax-device is loaded just based on the presence > > > of an EFI_MEMORY_SP address range in the address map. With the new ram > > > enabling that default device will be elided and CXL will register a > > > dax-device parented by a cxl region. > > > > > > > - The memory *does not* auto-online, instead the dax device can be > > > > onlined as system-ram *manually* via ndctl and friends > > > > > > That *manually* part is the problem that needs distro help to solve. It > > > should be the case that by default all Linux distributions auto-online > > > all dax-devices. If that happens to online memory that is too slow for > > > general use, or too high-performance / precious for general purpose use > > > then the administrator can set policy after the fact. Unfortunately user > > > policy can not be applied if these memory ranges were onlined by the > > > kernel at boot , so that's why the kernel policy defaults to not-online. > > > > > > In other words, there is no guarantee that memory that was assigned to > > > the general purpose pool at boot can be removed. The only guaranteed > > > behavior is to never give the memory to the core kernel in the first > > > instance and always let user policy route the memory. > > > > > > > 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless > > > > of the type-3 device configuration (pmem-only or vmem-only) > > > > > > Correct, the top-level bus code (cxl_acpi) and the endpoint code > > > (cxl_mem, cxl_port) need to handshake before establishing regions. For > > > pmem regions the platform needs to claim the availability of a pmem > > > capable CXL window. > > > > > > > 4) As you can see above, multiple decoders are registered. I'm not sure > > > > if that's correct or not, but it does seem odd given there's only one > > > > cxl type-3 device. Odd that decoder0.0 shows up when CFMW is there, > > > > but not when it isn't. > > > > > > CXL windows are modeled as decoders hanging off the the CXL root device > > > (ACPI0017 on ACPI based platforms). An endpoint decoder can then map a > > > selection of that window. > > > > > > > Don't know why I haven't thought of this until now, but is the CFMW code > > > > reporting something odd about what's behind it? Is it assuming the > > > > devices are pmem? > > > > > > No, the cxl_acpi code is just advertising platform decode possibilities > > > independent of what devices show up. Think of this like the PCI MMIO > > > space that gets allocated to a root bridge at the beginning of time. > > > That space may or may not get consumed based on what devices show up > > > downstream. > > > > Thank you for the explanation Dan, and thank you for you patience > > @JCameron. I'm fairly sure I grok it now. > > > > Summarizing to make sure: the cxl driver is providing what would be the > > CXL.io (control) path, and the CXL.mem path is basically being simulated > > by what otherwise would be a traditional PCI memory region. This explains > > why turning off Legacy mode drops the dax devices, and why the topology > > looks strange - the devices are basically attached in 2 different ways. > > > > Might there be interest from the QEMU community to implement this > > legacy-style setup in the short term, in an effort to test the the > > control path of type-3 devices while we wait for the kernel to catch up? I'd happily review such code, but it's not on my list of things to work on otherwise. Too many other things to support! Jonathan > > > > Or should we forget this mode ever existed and just barrel forward > > with HDM decoders and writing the kernel code to hook up the underlying > > devices in drivers/cxl? > > Which mode are you referring? > > The next steps for the kernel enabling relevant to this thread are: > > * ram region discovery (platform firmware or kexec established) > * ram region creation > * pmem region discovery (from labels)
On Fri, Jan 20, 2023 at 02:41:05PM -0800, Dan Williams wrote: > > Which mode are you referring? > > The next steps for the kernel enabling relevant to this thread are: > > * ram region discovery (platform firmware or kexec established) > * ram region creation > * pmem region discovery (from labels) On some bios there is an option for "legacy cxl" which produces the nvdimm-esque behavior described above. After further investigation, supporting this would require a bigger lift than i expected since the existing emulated CXL devices don't carry any kind of node assignment and we'd have to plumb that through Not worth it, better off working towards the new stuff.