[0/8] hw/cxl: CXL emulation cleanups and minor fixes for upstream

Message ID	20230111142440.24771-1-Jonathan.Cameron@huawei.com
Headers	show Return-Path: <linux-cxl-owner@vger.kernel.org> From: Jonathan Cameron <Jonathan.Cameron@huawei.com> To: <qemu-devel@nongnu.org>, Michael Tsirkin <mst@redhat.com> CC: Ben Widawsky <bwidawsk@kernel.org>, <linux-cxl@vger.kernel.org>, <linuxarm@huawei.com>, Ira Weiny <ira.weiny@intel.com>, Gregory Price <gourry.memverge@gmail.com> Subject: [PATCH 0/8] hw/cxl: CXL emulation cleanups and minor fixes for upstream Date: Wed, 11 Jan 2023 14:24:32 +0000 Message-ID: <20230111142440.24771-1-Jonathan.Cameron@huawei.com> MIME-Version: 1.0 Content-Transfer-Encoding: 7BIT Content-Type: text/plain; charset=US-ASCII Precedence: bulk
Series	hw/cxl: CXL emulation cleanups and minor fixes for upstream \| expand [0/8] hw/cxl: CXL emulation cleanups and minor fixes for upstream [1/8] hw/mem/cxl_type3: Improve error handling in realize() [2/8] hw/pci-bridge/cxl_downstream: Fix type naming mismatch [3/8] hw/cxl: set cxl-type3 device type to PCI_CLASS_MEMORY_CXL [4/8] hw/cxl: Add CXL_CAPACITY_MULTIPLIER definition [5/8] hw/i386/acpi: Drop duplicate _UID entry for CXL root bridge [6/8] qemu/bswap: Add const_le64() [7/8] qemu/uuid: Add UUID static initializer [8/8] hw/cxl/mailbox: Use new UUID network order define for cel_uuid

Jonathan Cameron Jan. 11, 2023, 2:24 p.m. UTC

A small collection of misc fixes and tidying up pulled out from various
series. I've pulled this to the top of my queue of CXL related work
as they stand fine on their own and it will reduce the noise in
the larger patch sets if these go upstream first.

Gregory's patches were posted as part of his work on adding volatile support.
https://lore.kernel.org/linux-cxl/20221006233702.18532-1-gregory.price@memverge.com/
https://lore.kernel.org/linux-cxl/20221128150157.97724-2-gregory.price@memverge.com/
I might propose this for upstream inclusion this cycle, but testing is
currently limited by lack of suitable kernel support.

Ira's patches were part of his event injection series.
https://lore.kernel.org/linux-cxl/20221221-ira-cxl-events-2022-11-17-v2-0-2ce2ecc06219@intel.com/
Intent is to propose for upstream the rest of that series shortly after
some minor changes from earlier review.

My three patches have not previously been posted.

For the curious, the current state of QEMU CXL emulation that we are working
through the backlog wrt to final cleanup before proposing for upstreaming can be found at.

https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-11

Gregory Price (2):
  hw/cxl: set cxl-type3 device type to PCI_CLASS_MEMORY_CXL
  hw/cxl: Add CXL_CAPACITY_MULTIPLIER definition

Ira Weiny (3):
  qemu/bswap: Add const_le64()
  qemu/uuid: Add UUID static initializer
  hw/cxl/mailbox: Use new UUID network order define for cel_uuid

Jonathan Cameron (3):
  hw/mem/cxl_type3: Improve error handling in realize()
  hw/pci-bridge/cxl_downstream: Fix type naming mismatch
  hw/i386/acpi: Drop duplicate _UID entry for CXL root bridge

 hw/cxl/cxl-device-utils.c      |  2 +-
 hw/cxl/cxl-mailbox-utils.c     | 27 ++++++++++++++-------------
 hw/i386/acpi-build.c           |  1 -
 hw/mem/cxl_type3.c             | 15 +++++++++++----
 hw/pci-bridge/cxl_downstream.c |  2 +-
 include/hw/cxl/cxl_device.h    |  2 +-
 include/qemu/bswap.h           | 10 ++++++++++
 include/qemu/uuid.h            | 12 ++++++++++++
 8 files changed, 50 insertions(+), 21 deletions(-)

Gregory Price Jan. 12, 2023, 3:39 p.m. UTC | #1

On Wed, Jan 11, 2023 at 02:24:32PM +0000, Jonathan Cameron via wrote:
> Gregory's patches were posted as part of his work on adding volatile support.
> https://lore.kernel.org/linux-cxl/20221006233702.18532-1-gregory.price@memverge.com/
> https://lore.kernel.org/linux-cxl/20221128150157.97724-2-gregory.price@memverge.com/
> I might propose this for upstream inclusion this cycle, but testing is
> currently limited by lack of suitable kernel support.

fwiw the testing i've done suggests the problem isn't necessarily the
implementation so much as either the EFI support or the ACPI tables.

For example, we see memory expanders come up no problem and turn into
volatile memory on real hardware, with the same kernels with just a few
commands.  My gut feeling is that either a mailbox command is missing or
that the ACPI tables are missing/significantly different.

I haven't been able to investigate further at this point, but that's my
current state with the voltile type-3 device testing.

Jonathan Cameron Jan. 12, 2023, 5:21 p.m. UTC | #2

On Thu, 12 Jan 2023 10:39:17 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Wed, Jan 11, 2023 at 02:24:32PM +0000, Jonathan Cameron via wrote:
> > Gregory's patches were posted as part of his work on adding volatile support.
> > https://lore.kernel.org/linux-cxl/20221006233702.18532-1-gregory.price@memverge.com/
> > https://lore.kernel.org/linux-cxl/20221128150157.97724-2-gregory.price@memverge.com/
> > I might propose this for upstream inclusion this cycle, but testing is
> > currently limited by lack of suitable kernel support.  
> 
> fwiw the testing i've done suggests the problem isn't necessarily the
> implementation so much as either the EFI support or the ACPI tables.
> 
> For example, we see memory expanders come up no problem and turn into
> volatile memory on real hardware, with the same kernels with just a few
> commands.  My gut feeling is that either a mailbox command is missing or
> that the ACPI tables are missing/significantly different.
> 
> I haven't been able to investigate further at this point, but that's my
> current state with the voltile type-3 device testing.

My assumption was that all shipping hardware platforms were doing the
enumeration and bring up of memory expanders in the BIOS / firmware.
Those are then presented to the OS already set up exactly as if they were
normal memory.  We could do the same on QEMU but that means a lot of
work in EDK2. Note that it makes no sense to do the enumeration and
creation of ACPI tables in QEMU itself though could hack it like that.
This stuff is done in firmware because that enables it for legacy
OSes. Everything is more or less presented to the OS like you would
present RAM (EFI memory map, ACPI tables etc).

Firmware enumeration doesn't typically support hotplug, so if we add
support for hotplug of volatile memory type 3 devices to the kernel
we will also be able to do 'cold plug' and have the kernel bring them up
in a similar fashion to what we do for non-volatile (for non volatile there
is typically no real support in firmware as there is a bunch of policy to
deal with that doesn't belong in firmware). (simplifying heavily ;)

So I don't think we are missing anything in the emulation, just in the
software layers above it.  Could be wrong though ;)

Jonathan

Gregory Price Jan. 12, 2023, 10:46 p.m. UTC | #3

On Thu, Jan 12, 2023 at 05:21:30PM +0000, Jonathan Cameron wrote:
> On Thu, 12 Jan 2023 10:39:17 -0500
> Gregory Price <gregory.price@memverge.com> wrote:
> 
> > On Wed, Jan 11, 2023 at 02:24:32PM +0000, Jonathan Cameron via wrote:
> > > Gregory's patches were posted as part of his work on adding volatile support.
> > > https://lore.kernel.org/linux-cxl/20221006233702.18532-1-gregory.price@memverge.com/
> > > https://lore.kernel.org/linux-cxl/20221128150157.97724-2-gregory.price@memverge.com/
> > > I might propose this for upstream inclusion this cycle, but testing is
> > > currently limited by lack of suitable kernel support.  
> > 
> > fwiw the testing i've done suggests the problem isn't necessarily the
> > implementation so much as either the EFI support or the ACPI tables.
> > 
> > For example, we see memory expanders come up no problem and turn into
> > volatile memory on real hardware, with the same kernels with just a few
> > commands.  My gut feeling is that either a mailbox command is missing or
> > that the ACPI tables are missing/significantly different.
> > 
> > I haven't been able to investigate further at this point, but that's my
> > current state with the voltile type-3 device testing.
> 
> My assumption was that all shipping hardware platforms were doing the
> enumeration and bring up of memory expanders in the BIOS / firmware.
> Those are then presented to the OS already set up exactly as if they were
> normal memory.  We could do the same on QEMU but that means a lot of
> work in EDK2. Note that it makes no sense to do the enumeration and
> creation of ACPI tables in QEMU itself though could hack it like that.
> This stuff is done in firmware because that enables it for legacy
> OSes. Everything is more or less presented to the OS like you would
> present RAM (EFI memory map, ACPI tables etc).
> 
> Firmware enumeration doesn't typically support hotplug, so if we add
> support for hotplug of volatile memory type 3 devices to the kernel
> we will also be able to do 'cold plug' and have the kernel bring them up
> in a similar fashion to what we do for non-volatile (for non volatile there
> is typically no real support in firmware as there is a bunch of policy to
> deal with that doesn't belong in firmware). (simplifying heavily ;)
> 
> So I don't think we are missing anything in the emulation, just in the
> software layers above it.  Could be wrong though ;)
> 
> Jonathan
> 
>

I'm not so sure something is missing so much as something seems
incorrect in either the ACPI table structure definitions, the mailbox,
or even the doe emulation.

I took your branch and reverted to just prior to the volatile patch
refernce: 59a59ef725699e0efb3e9e31a7f8d246de7286ed


QEMU configuration for boot (Please let me know if something is wrong)

sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 2G,slots=4,maxmem=4G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
-object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \
-object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \
-device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G

After boot we find:

[root@fedora ~]# ls /sys/bus/cxl/devices/
decoder0.0  decoder2.0  mem0            pmem0  root0
decoder1.0  endpoint2   nvdimm-bridge0  port1

[root@fedora ~]# ls -al /sys/bus/dax/devices/
total 0
drwxr-xr-x. 2 root root 0 Jan 12 22:44 .
drwxr-xr-x. 4 root root 0 Jan 12 22:44 ..


During boot, I am seeing three separate call traces, all of which appear
to be related to PCI DOE and/or getting CDAT information.

[    3.916900] Call Trace:
[    3.916906]  <TASK>
[    3.931217]  pci_doe_submit_task+0x5d/0xd0
[    3.936609]  pci_doe_discovery+0xb4/0x100
[    3.936627]  ? pci_doe_xa_destroy+0x10/0x10
[    3.942675]  pcim_doe_create_mb+0x219/0x290
[    3.950506]  cxl_pci_probe+0x192/0x430
[    3.960248]  local_pci_probe+0x41/0x80
[    3.966564]  pci_device_probe+0xb3/0x220
[    3.966579]  really_probe+0xde/0x380
[    3.966583]  ? pm_runtime_barrier+0x50/0x90
[    3.969158]  __driver_probe_device+0x78/0x170
[    3.969167]  driver_probe_device+0x1f/0x90
[    3.978264]  __driver_attach_async_helper+0x5c/0xe0
[    3.983953]  async_run_entry_fn+0x30/0x130
[    3.991084]  process_one_work+0x294/0x5b0
[    4.004458]  worker_thread+0x4f/0x3a0
[    4.012612]  ? process_one_work+0x5b0/0x5b0
[    4.019114]  kthread+0xf5/0x120
[    4.025133]  ? kthread_complete_and_exit+0x20/0x20
[    4.031327]  ret_from_fork+0x22/0x30
[    4.038969]  </TASK>

[   16.047704]  pci_doe_submit_task+0x5d/0xd0
[   16.047713]  cxl_cdat_get_length+0xb8/0x110
[   16.047779]  ? dvsec_range_allowed+0x60/0x60
[   16.047803]  read_cdat_data+0xaf/0x1a0
[   16.047814]  cxl_port_probe+0x80/0x120
[   16.047824]  cxl_bus_probe+0x17/0x50
[   16.047830]  really_probe+0xde/0x380
[   16.047835]  ? pm_runtime_barrier+0x50/0x90
[   16.047843]  __driver_probe_device+0x78/0x170
[   16.047851]  driver_probe_device+0x1f/0x90
[   16.047858]  __device_attach_driver+0x85/0x110
[   16.047881]  ? driver_allows_async_probing+0x70/0x70
[   16.047884]  bus_for_each_drv+0x7a/0xb0
[   16.047896]  __device_attach+0xb3/0x1d0
[   16.047907]  bus_probe_device+0x9f/0xc0
[   16.047913]  device_add+0x41e/0x9b0
[   16.047918]  ? kobject_set_name_vargs+0x6d/0x90
[   16.047928]  ? dev_set_name+0x4b/0x60
[   16.047944]  devm_cxl_add_port+0x27b/0x3b0
[   16.047970]  devm_cxl_add_endpoint+0x82/0x130
[   16.047982]  cxl_mem_probe+0xc4/0x11d [cxl_mem]
[   16.047997]  cxl_bus_probe+0x17/0x50
[   16.048003]  really_probe+0xde/0x380
[   16.048007]  ? pm_runtime_barrier+0x50/0x90
[   16.048014]  __driver_probe_device+0x78/0x170
[   16.048022]  driver_probe_device+0x1f/0x90
[   16.048029]  __driver_attach+0xd5/0x1d0
[   16.048036]  ? __device_attach_driver+0x110/0x110
[   16.048040]  bus_for_each_dev+0x76/0xa0
[   16.048051]  bus_add_driver+0x1b1/0x200
[   16.048061]  driver_register+0x89/0xe0
[   16.048066]  ? 0xffffffffc056e000
[   16.048070]  do_one_initcall+0x6e/0x320
[   16.048091]  do_init_module+0x4a/0x200
[   16.048099]  __do_sys_init_module+0x16a/0x1a0
[   16.048132]  do_syscall_64+0x5b/0x80
[   16.048138]  ? lock_is_held_type+0xe8/0x140
[   16.048148]  ? asm_exc_page_fault+0x22/0x30
[   16.048156]  ? lockdep_hardirqs_on+0x7d/0x100
[   16.048162]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

[   16.054601]  pci_doe_submit_task+0x5d/0xd0
[   16.054610]  cxl_cdat_read_table.isra.0+0x141/0x190
[   16.054660]  ? dvsec_range_allowed+0x60/0x60
[   16.054685]  read_cdat_data+0xfc/0x1a0
[   16.054695]  cxl_port_probe+0x80/0x120
[   16.054706]  cxl_bus_probe+0x17/0x50
[   16.054712]  really_probe+0xde/0x380
[   16.054717]  ? pm_runtime_barrier+0x50/0x90
[   16.054725]  __driver_probe_device+0x78/0x170
[   16.054733]  driver_probe_device+0x1f/0x90
[   16.054739]  __device_attach_driver+0x85/0x110
[   16.054747]  ? driver_allows_async_probing+0x70/0x70
[   16.054751]  bus_for_each_drv+0x7a/0xb0
[   16.054767]  __device_attach+0xb3/0x1d0
[   16.054782]  bus_probe_device+0x9f/0xc0
[   16.054791]  device_add+0x41e/0x9b0
[   16.054798]  ? kobject_set_name_vargs+0x6d/0x90
[   16.054811]  ? dev_set_name+0x4b/0x60
[   16.054831]  devm_cxl_add_port+0x27b/0x3b0
[   16.054843]  devm_cxl_add_endpoint+0x82/0x130
[   16.054854]  cxl_mem_probe+0xc4/0x11d [cxl_mem]
[   16.054869]  cxl_bus_probe+0x17/0x50
[   16.054875]  really_probe+0xde/0x380
[   16.054879]  ? pm_runtime_barrier+0x50/0x90
[   16.054887]  __driver_probe_device+0x78/0x170
[   16.054894]  driver_probe_device+0x1f/0x90
[   16.054901]  __driver_attach+0xd5/0x1d0
[   16.054908]  ? __device_attach_driver+0x110/0x110
[   16.054912]  bus_for_each_dev+0x76/0xa0
[   16.054923]  bus_add_driver+0x1b1/0x200
[   16.055204]  driver_register+0x89/0xe0
[   16.055211]  ? 0xffffffffc056e000
[   16.055215]  do_one_initcall+0x6e/0x320
[   16.055237]  do_init_module+0x4a/0x200
[   16.055245]  __do_sys_init_module+0x16a/0x1a0
[   16.055277]  do_syscall_64+0x5b/0x80
[   16.055283]  ? lock_is_held_type+0xe8/0x140
[   16.055294]  ? asm_exc_page_fault+0x22/0x30
[   16.055301]  ? lockdep_hardirqs_on+0x7d/0x100
[   16.055307]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

Jonathan Cameron Jan. 13, 2023, 9:12 a.m. UTC | #4

On Thu, 12 Jan 2023 17:46:27 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Thu, Jan 12, 2023 at 05:21:30PM +0000, Jonathan Cameron wrote:
> > On Thu, 12 Jan 2023 10:39:17 -0500
> > Gregory Price <gregory.price@memverge.com> wrote:
> >   
> > > On Wed, Jan 11, 2023 at 02:24:32PM +0000, Jonathan Cameron via wrote:  
> > > > Gregory's patches were posted as part of his work on adding volatile support.
> > > > https://lore.kernel.org/linux-cxl/20221006233702.18532-1-gregory.price@memverge.com/
> > > > https://lore.kernel.org/linux-cxl/20221128150157.97724-2-gregory.price@memverge.com/
> > > > I might propose this for upstream inclusion this cycle, but testing is
> > > > currently limited by lack of suitable kernel support.    
> > > 
> > > fwiw the testing i've done suggests the problem isn't necessarily the
> > > implementation so much as either the EFI support or the ACPI tables.
> > > 
> > > For example, we see memory expanders come up no problem and turn into
> > > volatile memory on real hardware, with the same kernels with just a few
> > > commands.  My gut feeling is that either a mailbox command is missing or
> > > that the ACPI tables are missing/significantly different.
> > > 
> > > I haven't been able to investigate further at this point, but that's my
> > > current state with the voltile type-3 device testing.  
> > 
> > My assumption was that all shipping hardware platforms were doing the
> > enumeration and bring up of memory expanders in the BIOS / firmware.
> > Those are then presented to the OS already set up exactly as if they were
> > normal memory.  We could do the same on QEMU but that means a lot of
> > work in EDK2. Note that it makes no sense to do the enumeration and
> > creation of ACPI tables in QEMU itself though could hack it like that.
> > This stuff is done in firmware because that enables it for legacy
> > OSes. Everything is more or less presented to the OS like you would
> > present RAM (EFI memory map, ACPI tables etc).
> > 
> > Firmware enumeration doesn't typically support hotplug, so if we add
> > support for hotplug of volatile memory type 3 devices to the kernel
> > we will also be able to do 'cold plug' and have the kernel bring them up
> > in a similar fashion to what we do for non-volatile (for non volatile there
> > is typically no real support in firmware as there is a bunch of policy to
> > deal with that doesn't belong in firmware). (simplifying heavily ;)
> > 
> > So I don't think we are missing anything in the emulation, just in the
> > software layers above it.  Could be wrong though ;)
> > 
> > Jonathan
> > 
> >  
> 
> I'm not so sure something is missing so much as something seems
> incorrect in either the ACPI table structure definitions, the mailbox,
> or even the doe emulation.
> 
> I took your branch and reverted to just prior to the volatile patch
> refernce: 59a59ef725699e0efb3e9e31a7f8d246de7286ed
> 
> 
> QEMU configuration for boot (Please let me know if something is wrong)
> 
> sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 2G,slots=4,maxmem=4G \
> -smp 4 \
> -machine type=q35,accel=kvm,cxl=on \
> -enable-kvm \
> -nographic \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> -object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \
> -object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \
> -device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> 
> After boot we find:
> 
> [root@fedora ~]# ls /sys/bus/cxl/devices/
> decoder0.0  decoder2.0  mem0            pmem0  root0
> decoder1.0  endpoint2   nvdimm-bridge0  port1
> 
> [root@fedora ~]# ls -al /sys/bus/dax/devices/
> total 0
> drwxr-xr-x. 2 root root 0 Jan 12 22:44 .
> drwxr-xr-x. 4 root root 0 Jan 12 22:44 ..
> 
> 
> During boot, I am seeing three separate call traces, all of which appear
> to be related to PCI DOE and/or getting CDAT information.

Just to check, are these different from the on stack problem you reported
previously?  Doesn't look like the fix for that has made it upstream yet.

What kernel are you running?


> 
> [    3.916900] Call Trace:
> [    3.916906]  <TASK>
> [    3.931217]  pci_doe_submit_task+0x5d/0xd0
> [    3.936609]  pci_doe_discovery+0xb4/0x100
> [    3.936627]  ? pci_doe_xa_destroy+0x10/0x10
> [    3.942675]  pcim_doe_create_mb+0x219/0x290
> [    3.950506]  cxl_pci_probe+0x192/0x430
> [    3.960248]  local_pci_probe+0x41/0x80
> [    3.966564]  pci_device_probe+0xb3/0x220
> [    3.966579]  really_probe+0xde/0x380
> [    3.966583]  ? pm_runtime_barrier+0x50/0x90
> [    3.969158]  __driver_probe_device+0x78/0x170
> [    3.969167]  driver_probe_device+0x1f/0x90
> [    3.978264]  __driver_attach_async_helper+0x5c/0xe0
> [    3.983953]  async_run_entry_fn+0x30/0x130
> [    3.991084]  process_one_work+0x294/0x5b0
> [    4.004458]  worker_thread+0x4f/0x3a0
> [    4.012612]  ? process_one_work+0x5b0/0x5b0
> [    4.019114]  kthread+0xf5/0x120
> [    4.025133]  ? kthread_complete_and_exit+0x20/0x20
> [    4.031327]  ret_from_fork+0x22/0x30
> [    4.038969]  </TASK>
> 
> [   16.047704]  pci_doe_submit_task+0x5d/0xd0
> [   16.047713]  cxl_cdat_get_length+0xb8/0x110
> [   16.047779]  ? dvsec_range_allowed+0x60/0x60
> [   16.047803]  read_cdat_data+0xaf/0x1a0
> [   16.047814]  cxl_port_probe+0x80/0x120
> [   16.047824]  cxl_bus_probe+0x17/0x50
> [   16.047830]  really_probe+0xde/0x380
> [   16.047835]  ? pm_runtime_barrier+0x50/0x90
> [   16.047843]  __driver_probe_device+0x78/0x170
> [   16.047851]  driver_probe_device+0x1f/0x90
> [   16.047858]  __device_attach_driver+0x85/0x110
> [   16.047881]  ? driver_allows_async_probing+0x70/0x70
> [   16.047884]  bus_for_each_drv+0x7a/0xb0
> [   16.047896]  __device_attach+0xb3/0x1d0
> [   16.047907]  bus_probe_device+0x9f/0xc0
> [   16.047913]  device_add+0x41e/0x9b0
> [   16.047918]  ? kobject_set_name_vargs+0x6d/0x90
> [   16.047928]  ? dev_set_name+0x4b/0x60
> [   16.047944]  devm_cxl_add_port+0x27b/0x3b0
> [   16.047970]  devm_cxl_add_endpoint+0x82/0x130
> [   16.047982]  cxl_mem_probe+0xc4/0x11d [cxl_mem]
> [   16.047997]  cxl_bus_probe+0x17/0x50
> [   16.048003]  really_probe+0xde/0x380
> [   16.048007]  ? pm_runtime_barrier+0x50/0x90
> [   16.048014]  __driver_probe_device+0x78/0x170
> [   16.048022]  driver_probe_device+0x1f/0x90
> [   16.048029]  __driver_attach+0xd5/0x1d0
> [   16.048036]  ? __device_attach_driver+0x110/0x110
> [   16.048040]  bus_for_each_dev+0x76/0xa0
> [   16.048051]  bus_add_driver+0x1b1/0x200
> [   16.048061]  driver_register+0x89/0xe0
> [   16.048066]  ? 0xffffffffc056e000
> [   16.048070]  do_one_initcall+0x6e/0x320
> [   16.048091]  do_init_module+0x4a/0x200
> [   16.048099]  __do_sys_init_module+0x16a/0x1a0
> [   16.048132]  do_syscall_64+0x5b/0x80
> [   16.048138]  ? lock_is_held_type+0xe8/0x140
> [   16.048148]  ? asm_exc_page_fault+0x22/0x30
> [   16.048156]  ? lockdep_hardirqs_on+0x7d/0x100
> [   16.048162]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> [   16.054601]  pci_doe_submit_task+0x5d/0xd0
> [   16.054610]  cxl_cdat_read_table.isra.0+0x141/0x190
> [   16.054660]  ? dvsec_range_allowed+0x60/0x60
> [   16.054685]  read_cdat_data+0xfc/0x1a0
> [   16.054695]  cxl_port_probe+0x80/0x120
> [   16.054706]  cxl_bus_probe+0x17/0x50
> [   16.054712]  really_probe+0xde/0x380
> [   16.054717]  ? pm_runtime_barrier+0x50/0x90
> [   16.054725]  __driver_probe_device+0x78/0x170
> [   16.054733]  driver_probe_device+0x1f/0x90
> [   16.054739]  __device_attach_driver+0x85/0x110
> [   16.054747]  ? driver_allows_async_probing+0x70/0x70
> [   16.054751]  bus_for_each_drv+0x7a/0xb0
> [   16.054767]  __device_attach+0xb3/0x1d0
> [   16.054782]  bus_probe_device+0x9f/0xc0
> [   16.054791]  device_add+0x41e/0x9b0
> [   16.054798]  ? kobject_set_name_vargs+0x6d/0x90
> [   16.054811]  ? dev_set_name+0x4b/0x60
> [   16.054831]  devm_cxl_add_port+0x27b/0x3b0
> [   16.054843]  devm_cxl_add_endpoint+0x82/0x130
> [   16.054854]  cxl_mem_probe+0xc4/0x11d [cxl_mem]
> [   16.054869]  cxl_bus_probe+0x17/0x50
> [   16.054875]  really_probe+0xde/0x380
> [   16.054879]  ? pm_runtime_barrier+0x50/0x90
> [   16.054887]  __driver_probe_device+0x78/0x170
> [   16.054894]  driver_probe_device+0x1f/0x90
> [   16.054901]  __driver_attach+0xd5/0x1d0
> [   16.054908]  ? __device_attach_driver+0x110/0x110
> [   16.054912]  bus_for_each_dev+0x76/0xa0
> [   16.054923]  bus_add_driver+0x1b1/0x200
> [   16.055204]  driver_register+0x89/0xe0
> [   16.055211]  ? 0xffffffffc056e000
> [   16.055215]  do_one_initcall+0x6e/0x320
> [   16.055237]  do_init_module+0x4a/0x200
> [   16.055245]  __do_sys_init_module+0x16a/0x1a0
> [   16.055277]  do_syscall_64+0x5b/0x80
> [   16.055283]  ? lock_is_held_type+0xe8/0x140
> [   16.055294]  ? asm_exc_page_fault+0x22/0x30
> [   16.055301]  ? lockdep_hardirqs_on+0x7d/0x100
> [   16.055307]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

Gregory Price Jan. 13, 2023, 2:19 p.m. UTC | #5

On Fri, Jan 13, 2023 at 09:12:13AM +0000, Jonathan Cameron wrote:
> 
> Just to check, are these different from the on stack problem you reported
> previously?  Doesn't look like the fix for that has made it upstream yet.
> 
> What kernel are you running?
> 
> 

The prior issue I saw was related to the CXL Fixed Memory Window having
an e820 region registered during machine initialization.  That fix is
upstream.

On 2023-1-11 branch it is commit 2486dd045794d65598fbca9fd1224c27b9732dce

This one appears when registering any kind of type-3 device, during
boot.

Jonathan Cameron Jan. 13, 2023, 2:40 p.m. UTC | #6

On Fri, 13 Jan 2023 09:19:59 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Fri, Jan 13, 2023 at 09:12:13AM +0000, Jonathan Cameron wrote:
> > 
> > Just to check, are these different from the on stack problem you reported
> > previously?  Doesn't look like the fix for that has made it upstream yet.
> > 
> > What kernel are you running?
> > 
> >   
> 
> The prior issue I saw was related to the CXL Fixed Memory Window having
> an e820 region registered during machine initialization.  That fix is
> upstream.
> 
> On 2023-1-11 branch it is commit 2486dd045794d65598fbca9fd1224c27b9732dce
> 
> This one appears when registering any kind of type-3 device, during
> boot.

I meant this one

https://lore.kernel.org/all/20221118000524.1477383-1-ira.weiny@intel.com/

Sorry, should have dug out a link in earlier reply and save a round trip.

Jonathan

Jonathan Cameron Jan. 13, 2023, 2:45 p.m. UTC | #7

On Fri, 13 Jan 2023 14:40:26 +0000
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Fri, 13 Jan 2023 09:19:59 -0500
> Gregory Price <gregory.price@memverge.com> wrote:
> 
> > On Fri, Jan 13, 2023 at 09:12:13AM +0000, Jonathan Cameron wrote:  
> > > 
> > > Just to check, are these different from the on stack problem you reported
> > > previously?  Doesn't look like the fix for that has made it upstream yet.
> > > 
> > > What kernel are you running?
> > > 
> > >     
> > 
> > The prior issue I saw was related to the CXL Fixed Memory Window having
> > an e820 region registered during machine initialization.  That fix is
> > upstream.
> > 
> > On 2023-1-11 branch it is commit 2486dd045794d65598fbca9fd1224c27b9732dce
> > 
> > This one appears when registering any kind of type-3 device, during
> > boot.  
> 
> I meant this one
> 
> https://lore.kernel.org/all/20221118000524.1477383-1-ira.weiny@intel.com/
> 
> Sorry, should have dug out a link in earlier reply and save a round trip.

Ah I'd forgotten we were going round the houses somewhat on the right fix...
https://lore.kernel.org/all/20221128040338.1936529-3-ira.weiny@intel.com/
was another proposal and Lukas had yet another.
https://lore.kernel.org/all/cover.1669608950.git.lukas@wunner.de/

> 
> Jonathan

Lukas Wunner Jan. 13, 2023, 3:12 p.m. UTC | #8

On Fri, Jan 13, 2023 at 02:45:11PM +0000, Jonathan Cameron wrote:
> On Fri, 13 Jan 2023 14:40:26 +0000 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> > On Fri, 13 Jan 2023 09:19:59 -0500 Gregory Price <gregory.price@memverge.com> wrote:
> > > On Fri, Jan 13, 2023 at 09:12:13AM +0000, Jonathan Cameron wrote:  
> > > > Just to check, are these different from the on stack problem you reported
> > > > previously?  Doesn't look like the fix for that has made it upstream yet.
> > > > 
> > > > What kernel are you running?
> > > 
> > > The prior issue I saw was related to the CXL Fixed Memory Window having
> > > an e820 region registered during machine initialization.  That fix is
> > > upstream.
> > > 
> > > On 2023-1-11 branch it is commit 2486dd045794d65598fbca9fd1224c27b9732dce
> > > 
> > > This one appears when registering any kind of type-3 device, during
> > > boot.  
> > 
> > I meant this one
> > 
> > https://lore.kernel.org/all/20221118000524.1477383-1-ira.weiny@intel.com/
> > 
> > Sorry, should have dug out a link in earlier reply and save a round trip.
> 
> Ah I'd forgotten we were going round the houses somewhat on the right fix...
> https://lore.kernel.org/all/20221128040338.1936529-3-ira.weiny@intel.com/
> was another proposal and Lukas had yet another.
> https://lore.kernel.org/all/cover.1669608950.git.lukas@wunner.de/

I'll respin those fixes shortly, together with the other DOE patches
I've accumulated on my development branch for DOE+SPDM:

https://github.com/l1k/linux/commits/doe

Ira kindly tested them this week and reports that they don't regress
CDAT retrieval for him.

Thanks,

Lukas

Gregory Price Jan. 13, 2023, 3:42 p.m. UTC | #9

On Fri, Jan 13, 2023 at 04:12:06PM +0100, Lukas Wunner wrote:
> On Fri, Jan 13, 2023 at 02:45:11PM +0000, Jonathan Cameron wrote:
> > On Fri, 13 Jan 2023 14:40:26 +0000 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> > > On Fri, 13 Jan 2023 09:19:59 -0500 Gregory Price <gregory.price@memverge.com> wrote:
> > > > On Fri, Jan 13, 2023 at 09:12:13AM +0000, Jonathan Cameron wrote:  
> > > > > Just to check, are these different from the on stack problem you reported
> > > > > previously?  Doesn't look like the fix for that has made it upstream yet.
> > > > > 
> > > > > What kernel are you running?
> > > > 
> > > > The prior issue I saw was related to the CXL Fixed Memory Window having
> > > > an e820 region registered during machine initialization.  That fix is
> > > > upstream.
> > > > 
> > > > On 2023-1-11 branch it is commit 2486dd045794d65598fbca9fd1224c27b9732dce
> > > > 
> > > > This one appears when registering any kind of type-3 device, during
> > > > boot.  
> > > 
> > > I meant this one
> > > 
> > > https://lore.kernel.org/all/20221118000524.1477383-1-ira.weiny@intel.com/
> > > 
> > > Sorry, should have dug out a link in earlier reply and save a round trip.
> > 
> > Ah I'd forgotten we were going round the houses somewhat on the right fix...
> > https://lore.kernel.org/all/20221128040338.1936529-3-ira.weiny@intel.com/
> > was another proposal and Lukas had yet another.
> > https://lore.kernel.org/all/cover.1669608950.git.lukas@wunner.de/
> 
> I'll respin those fixes shortly, together with the other DOE patches
> I've accumulated on my development branch for DOE+SPDM:
> 
> https://github.com/l1k/linux/commits/doe
> 
> Ira kindly tested them this week and reports that they don't regress
> CDAT retrieval for him.
> 
> Thanks,
> 
> Lukas

I'll see if I can't get this branch installed and test if this fixes the
issue i'm seeing.  It may take me a bit to get back to.

Thank you!

Gregory Price Jan. 18, 2023, 7:22 p.m. UTC | #10

1) No stack traces present
2) Device usage appears to work, but cxl-cli fails to create a region, i
haven't checked why yet (also tried ndctl-75, same results)
3) There seems to be some other regression with the cxl_pmem_init
routine, because I get a stack trace in this setup regardless of whether
I apply the type-3 device commit.


All tests below with the previously posted DOE linux branch.
Base QEMU branch was Jonathan's 2023-1-11


DOE Branch - 2023-1-11 (HEAD) (all commits)

QEMU Config:
sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 3G,slots=4,maxmem=8G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-object memory-backend-ram,id=mem0,size=1G,share=on \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G

Result:  This worked, but cxl-cli could not create a region (will look
into this further later).




When running with a persistent memory configuration, I'm seeing a
kernel stack trace on cxl_pmem_init

Config:
sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 3G,slots=4,maxmem=4G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \
-object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \
-object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \
-device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G


[   62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0
[   62.185069] #PF: supervisor read access in kernel mode
[   62.198502] #PF: error_code(0x0000) - not-present page
[   62.211019] PGD 0 P4D 0
[   62.220521] Oops: 0000 [#1] PREEMPT SMP PTI
[   62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1
[   62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[   62.258432] Adding 2939900k swap on /dev/zram0.  Priority:100 extents:1 across:2939900k SSDscFS
[   62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
[   62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
[   62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
[   62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
[   62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
[   62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
[   62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
[   62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
[   62.285541] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
[   62.285542] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0
[   62.285653] Call Trace:
[   62.285656]  <TASK>
[   62.285660]  cxl_bus_probe+0x17/0x50
[   62.285691]  really_probe+0xde/0x380
[   62.285695]  ? pm_runtime_barrier+0x50/0x90
[   62.285700]  __driver_probe_device+0x78/0x170
[   62.285846]  driver_probe_device+0x1f/0x90
[   62.285850]  __driver_attach+0xd2/0x1c0
[   62.285853]  ? __pfx___driver_attach+0x10/0x10
[   62.285856]  bus_for_each_dev+0x76/0xa0
[   62.285860]  bus_add_driver+0x1b1/0x200
[   62.285863]  driver_register+0x89/0xe0
[   62.285868]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
[   62.285874]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
[   62.285880]  do_one_initcall+0x6e/0x330
[   62.285888]  do_init_module+0x4a/0x200
[   62.285892]  __do_sys_finit_module+0x93/0xf0
[   62.285899]  do_syscall_64+0x5b/0x80
[   62.285904]  ? do_syscall_64+0x67/0x80
[   62.285906]  ? asm_exc_page_fault+0x22/0x30
[   62.285910]  ? lockdep_hardirqs_on+0x7d/0x100
[   62.285914]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   62.285917] RIP: 0033:0x7f2619b0afbd
[   62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8
[   62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd
[   62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006
[   62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52
[   62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000
[   62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0
[   62.285932]  </TASK>
[   62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g
[   62.285954] CR2: 00000000000004c0
[   62.288385] ---[ end trace 0000000000000000 ]---
[   63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
[   63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
[   63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
[   63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
[   63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
[   63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
[   63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
[   63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
[   63.203580] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
[   63.203583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0



Next i reverted the QEMU branch to the commit just before the type-3
volatile commit and used the old method of launching with a type-3 pmem
device

Config:
sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 2G,slots=4,maxmem=4G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
-object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \
-object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \
-device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G


Result: Similar stack trace
[   29.850023] BUG: kernel NULL pointer dereference, address: 00000000000004c0
[   29.882400] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
[   29.957485] Call Trace:
[   29.959067]  <TASK>
[   29.962176]  cxl_bus_probe+0x17/0x50
[   29.964940]  really_probe+0xde/0x380
[   29.969065]  ? pm_runtime_barrier+0x50/0x90
[   29.973419]  __driver_probe_device+0x78/0x170
[   29.977183]  driver_probe_device+0x1f/0x90
[   29.984212]  __driver_attach+0xd2/0x1c0
[   29.988463]  ? __pfx___driver_attach+0x10/0x10
[   29.992379]  bus_for_each_dev+0x76/0xa0
[   29.997040]  bus_add_driver+0x1b1/0x200
[   30.000368]  driver_register+0x89/0xe0
[   30.004579]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
[   30.012403]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
[   30.019394]  do_one_initcall+0x6e/0x330
[   30.024028]  do_init_module+0x4a/0x200
[   30.029243]  __do_sys_finit_module+0x93/0xf0
[   30.034943]  do_syscall_64+0x5b/0x80
[   30.039844]  ? do_syscall_64+0x67/0x80
[   30.045163]  ? do_syscall_64+0x67/0x80
[   30.049729]  ? lock_release+0x14b/0x440
[   30.054055]  ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90
[   30.061039]  ? lock_is_held_type+0xe8/0x140
[   30.067625]  ? do_syscall_64+0x67/0x80
[   30.071909]  ? lockdep_hardirqs_on+0x7d/0x100
[   30.079037]  ? do_syscall_64+0x67/0x80
[   30.084537]  ? do_syscall_64+0x67/0x80
[   30.089091]  ? do_syscall_64+0x67/0x80
[   30.094174]  ? lockdep_hardirqs_on+0x7d/0x100
[   30.099224]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   30.104446] RIP: 0033:0x7f000550afbd

Gregory Price Jan. 18, 2023, 7:31 p.m. UTC | #11

I apparently forgot an intro lol

I tested the DOE linux branch with the 2023-1-11 QEMU branch with both
volatile, non-volatile, and "legacy" (pre-my-patch) non-volatile mode.

1) *In volatile mode, there are no stack traces present (during boot*)

On Wed, Jan 18, 2023 at 02:22:08PM -0500, Gregory Price wrote:
> 
> 1) No stack traces present
> 2) Device usage appears to work, but cxl-cli fails to create a region, i
> haven't checked why yet (also tried ndctl-75, same results)
> 3) There seems to be some other regression with the cxl_pmem_init
> routine, because I get a stack trace in this setup regardless of whether
> I apply the type-3 device commit.
> 
> 
> All tests below with the previously posted DOE linux branch.
> Base QEMU branch was Jonathan's 2023-1-11
> 
> 
> DOE Branch - 2023-1-11 (HEAD) (all commits)
> 
> QEMU Config:
> sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 3G,slots=4,maxmem=8G \
> -smp 4 \
> -machine type=q35,accel=kvm,cxl=on \
> -enable-kvm \
> -nographic \
> -object memory-backend-ram,id=mem0,size=1G,share=on \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> 
> Result:  This worked, but cxl-cli could not create a region (will look
> into this further later).
> 
> 
> 
> 
> When running with a persistent memory configuration, I'm seeing a
> kernel stack trace on cxl_pmem_init
> 
> Config:
> sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 3G,slots=4,maxmem=4G \
> -smp 4 \
> -machine type=q35,accel=kvm,cxl=on \
> -enable-kvm \
> -nographic \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \
> -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \
> -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> 
> 
> [   62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0
> [   62.185069] #PF: supervisor read access in kernel mode
> [   62.198502] #PF: error_code(0x0000) - not-present page
> [   62.211019] PGD 0 P4D 0
> [   62.220521] Oops: 0000 [#1] PREEMPT SMP PTI
> [   62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1
> [   62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> [   62.258432] Adding 2939900k swap on /dev/zram0.  Priority:100 extents:1 across:2939900k SSDscFS
> [   62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> [   62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> [   62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> [   62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> [   62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> [   62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> [   62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> [   62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> [   62.285541] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> [   62.285542] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0
> [   62.285653] Call Trace:
> [   62.285656]  <TASK>
> [   62.285660]  cxl_bus_probe+0x17/0x50
> [   62.285691]  really_probe+0xde/0x380
> [   62.285695]  ? pm_runtime_barrier+0x50/0x90
> [   62.285700]  __driver_probe_device+0x78/0x170
> [   62.285846]  driver_probe_device+0x1f/0x90
> [   62.285850]  __driver_attach+0xd2/0x1c0
> [   62.285853]  ? __pfx___driver_attach+0x10/0x10
> [   62.285856]  bus_for_each_dev+0x76/0xa0
> [   62.285860]  bus_add_driver+0x1b1/0x200
> [   62.285863]  driver_register+0x89/0xe0
> [   62.285868]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
> [   62.285874]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
> [   62.285880]  do_one_initcall+0x6e/0x330
> [   62.285888]  do_init_module+0x4a/0x200
> [   62.285892]  __do_sys_finit_module+0x93/0xf0
> [   62.285899]  do_syscall_64+0x5b/0x80
> [   62.285904]  ? do_syscall_64+0x67/0x80
> [   62.285906]  ? asm_exc_page_fault+0x22/0x30
> [   62.285910]  ? lockdep_hardirqs_on+0x7d/0x100
> [   62.285914]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [   62.285917] RIP: 0033:0x7f2619b0afbd
> [   62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8
> [   62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> [   62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd
> [   62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006
> [   62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52
> [   62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000
> [   62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0
> [   62.285932]  </TASK>
> [   62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g
> [   62.285954] CR2: 00000000000004c0
> [   62.288385] ---[ end trace 0000000000000000 ]---
> [   63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> [   63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> [   63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> [   63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> [   63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> [   63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> [   63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> [   63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> [   63.203580] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> [   63.203583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0
> 
> 
> 
> Next i reverted the QEMU branch to the commit just before the type-3
> volatile commit and used the old method of launching with a type-3 pmem
> device
> 
> Config:
> sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 2G,slots=4,maxmem=4G \
> -smp 4 \
> -machine type=q35,accel=kvm,cxl=on \
> -enable-kvm \
> -nographic \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> -object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \
> -object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \
> -device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> 
> 
> Result: Similar stack trace
> [   29.850023] BUG: kernel NULL pointer dereference, address: 00000000000004c0
> [   29.882400] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> [   29.957485] Call Trace:
> [   29.959067]  <TASK>
> [   29.962176]  cxl_bus_probe+0x17/0x50
> [   29.964940]  really_probe+0xde/0x380
> [   29.969065]  ? pm_runtime_barrier+0x50/0x90
> [   29.973419]  __driver_probe_device+0x78/0x170
> [   29.977183]  driver_probe_device+0x1f/0x90
> [   29.984212]  __driver_attach+0xd2/0x1c0
> [   29.988463]  ? __pfx___driver_attach+0x10/0x10
> [   29.992379]  bus_for_each_dev+0x76/0xa0
> [   29.997040]  bus_add_driver+0x1b1/0x200
> [   30.000368]  driver_register+0x89/0xe0
> [   30.004579]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
> [   30.012403]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
> [   30.019394]  do_one_initcall+0x6e/0x330
> [   30.024028]  do_init_module+0x4a/0x200
> [   30.029243]  __do_sys_finit_module+0x93/0xf0
> [   30.034943]  do_syscall_64+0x5b/0x80
> [   30.039844]  ? do_syscall_64+0x67/0x80
> [   30.045163]  ? do_syscall_64+0x67/0x80
> [   30.049729]  ? lock_release+0x14b/0x440
> [   30.054055]  ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90
> [   30.061039]  ? lock_is_held_type+0xe8/0x140
> [   30.067625]  ? do_syscall_64+0x67/0x80
> [   30.071909]  ? lockdep_hardirqs_on+0x7d/0x100
> [   30.079037]  ? do_syscall_64+0x67/0x80
> [   30.084537]  ? do_syscall_64+0x67/0x80
> [   30.089091]  ? do_syscall_64+0x67/0x80
> [   30.094174]  ? lockdep_hardirqs_on+0x7d/0x100
> [   30.099224]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [   30.104446] RIP: 0033:0x7f000550afbd

Jonathan Cameron Jan. 19, 2023, 10:19 a.m. UTC | #12

On Wed, 18 Jan 2023 14:22:08 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> 1) No stack traces present
> 2) Device usage appears to work, but cxl-cli fails to create a region, i
> haven't checked why yet (also tried ndctl-75, same results)
> 3) There seems to be some other regression with the cxl_pmem_init
> routine, because I get a stack trace in this setup regardless of whether
> I apply the type-3 device commit.
> 
> 
> All tests below with the previously posted DOE linux branch.
> Base QEMU branch was Jonathan's 2023-1-11
> 
> 
> DOE Branch - 2023-1-11 (HEAD) (all commits)
> 
> QEMU Config:
> sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 3G,slots=4,maxmem=8G \
> -smp 4 \
> -machine type=q35,accel=kvm,cxl=on \
> -enable-kvm \
> -nographic \
> -object memory-backend-ram,id=mem0,size=1G,share=on \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> 
> Result:  This worked, but cxl-cli could not create a region (will look
> into this further later).
> 

Even if everything else worked, it will currently fail because of the
issue with pass through decoders.
(Kernel assumes always pass through for single rp, qemu assumes never
 pass through - both are valid under spec).
Add a second rp for now.  I should be able to post some patches
to work around that shortly. 




 
> 
> 
> 
> When running with a persistent memory configuration, I'm seeing a
> kernel stack trace on cxl_pmem_init
> 
> Config:
> sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 3G,slots=4,maxmem=4G \
> -smp 4 \
> -machine type=q35,accel=kvm,cxl=on \
> -enable-kvm \
> -nographic \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \
> -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \
> -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> 
> 
> [   62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0
> [   62.185069] #PF: supervisor read access in kernel mode
> [   62.198502] #PF: error_code(0x0000) - not-present page
> [   62.211019] PGD 0 P4D 0
> [   62.220521] Oops: 0000 [#1] PREEMPT SMP PTI
> [   62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1
> [   62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> [   62.258432] Adding 2939900k swap on /dev/zram0.  Priority:100 extents:1 across:2939900k SSDscFS
> [   62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> [   62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> [   62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> [   62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> [   62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> [   62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> [   62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> [   62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> [   62.285541] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> [   62.285542] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0
> [   62.285653] Call Trace:
> [   62.285656]  <TASK>
> [   62.285660]  cxl_bus_probe+0x17/0x50
> [   62.285691]  really_probe+0xde/0x380
> [   62.285695]  ? pm_runtime_barrier+0x50/0x90
> [   62.285700]  __driver_probe_device+0x78/0x170
> [   62.285846]  driver_probe_device+0x1f/0x90
> [   62.285850]  __driver_attach+0xd2/0x1c0
> [   62.285853]  ? __pfx___driver_attach+0x10/0x10
> [   62.285856]  bus_for_each_dev+0x76/0xa0
> [   62.285860]  bus_add_driver+0x1b1/0x200
> [   62.285863]  driver_register+0x89/0xe0
> [   62.285868]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
> [   62.285874]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
> [   62.285880]  do_one_initcall+0x6e/0x330
> [   62.285888]  do_init_module+0x4a/0x200
> [   62.285892]  __do_sys_finit_module+0x93/0xf0
> [   62.285899]  do_syscall_64+0x5b/0x80
> [   62.285904]  ? do_syscall_64+0x67/0x80
> [   62.285906]  ? asm_exc_page_fault+0x22/0x30
> [   62.285910]  ? lockdep_hardirqs_on+0x7d/0x100
> [   62.285914]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [   62.285917] RIP: 0033:0x7f2619b0afbd
> [   62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8
> [   62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> [   62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd
> [   62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006
> [   62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52
> [   62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000
> [   62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0
> [   62.285932]  </TASK>
> [   62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g
> [   62.285954] CR2: 00000000000004c0
> [   62.288385] ---[ end trace 0000000000000000 ]---
> [   63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> [   63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> [   63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> [   63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> [   63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> [   63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> [   63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> [   63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> [   63.203580] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> [   63.203583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0
> 

I clearly need to run some more rests as not seen this issue (and I've
had a couple of reports of it now).  I guess I never tend to be running
a completely clean tree on either side + testing is mostly on arm64
(though I doubt that matters). It's not the pass through decoder issues as
that is visible only when bringing up a region.

> 
> 
> Next i reverted the QEMU branch to the commit just before the type-3
> volatile commit and used the old method of launching with a type-3 pmem
> device
> 
> Config:
> sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 2G,slots=4,maxmem=4G \
> -smp 4 \
> -machine type=q35,accel=kvm,cxl=on \
> -enable-kvm \
> -nographic \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> -object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \
> -object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \
> -device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> 
> 
> Result: Similar stack trace
> [   29.850023] BUG: kernel NULL pointer dereference, address: 00000000000004c0
> [   29.882400] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> [   29.957485] Call Trace:
> [   29.959067]  <TASK>
> [   29.962176]  cxl_bus_probe+0x17/0x50
> [   29.964940]  really_probe+0xde/0x380
> [   29.969065]  ? pm_runtime_barrier+0x50/0x90
> [   29.973419]  __driver_probe_device+0x78/0x170
> [   29.977183]  driver_probe_device+0x1f/0x90
> [   29.984212]  __driver_attach+0xd2/0x1c0
> [   29.988463]  ? __pfx___driver_attach+0x10/0x10
> [   29.992379]  bus_for_each_dev+0x76/0xa0
> [   29.997040]  bus_add_driver+0x1b1/0x200
> [   30.000368]  driver_register+0x89/0xe0
> [   30.004579]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
> [   30.012403]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
> [   30.019394]  do_one_initcall+0x6e/0x330
> [   30.024028]  do_init_module+0x4a/0x200
> [   30.029243]  __do_sys_finit_module+0x93/0xf0
> [   30.034943]  do_syscall_64+0x5b/0x80
> [   30.039844]  ? do_syscall_64+0x67/0x80
> [   30.045163]  ? do_syscall_64+0x67/0x80
> [   30.049729]  ? lock_release+0x14b/0x440
> [   30.054055]  ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90
> [   30.061039]  ? lock_is_held_type+0xe8/0x140
> [   30.067625]  ? do_syscall_64+0x67/0x80
> [   30.071909]  ? lockdep_hardirqs_on+0x7d/0x100
> [   30.079037]  ? do_syscall_64+0x67/0x80
> [   30.084537]  ? do_syscall_64+0x67/0x80
> [   30.089091]  ? do_syscall_64+0x67/0x80
> [   30.094174]  ? lockdep_hardirqs_on+0x7d/0x100
> [   30.099224]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [   30.104446] RIP: 0033:0x7f000550afbd

Michael S. Tsirkin Jan. 19, 2023, 11:48 a.m. UTC | #13

On Thu, Jan 19, 2023 at 10:19:46AM +0000, Jonathan Cameron wrote:
> On Wed, 18 Jan 2023 14:22:08 -0500
> Gregory Price <gregory.price@memverge.com> wrote:
> 
> > 1) No stack traces present
> > 2) Device usage appears to work, but cxl-cli fails to create a region, i
> > haven't checked why yet (also tried ndctl-75, same results)
> > 3) There seems to be some other regression with the cxl_pmem_init
> > routine, because I get a stack trace in this setup regardless of whether
> > I apply the type-3 device commit.
> > 
> > 
> > All tests below with the previously posted DOE linux branch.
> > Base QEMU branch was Jonathan's 2023-1-11
> > 
> > 
> > DOE Branch - 2023-1-11 (HEAD) (all commits)
> > 
> > QEMU Config:
> > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > -m 3G,slots=4,maxmem=8G \
> > -smp 4 \
> > -machine type=q35,accel=kvm,cxl=on \
> > -enable-kvm \
> > -nographic \
> > -object memory-backend-ram,id=mem0,size=1G,share=on \
> > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > 
> > Result:  This worked, but cxl-cli could not create a region (will look
> > into this further later).
> > 
> 
> Even if everything else worked, it will currently fail because of the
> issue with pass through decoders.
> (Kernel assumes always pass through for single rp, qemu assumes never
>  pass through - both are valid under spec).
> Add a second rp for now.  I should be able to post some patches
> to work around that shortly. 
> 
> 
> 
> 
>  
> > 
> > 
> > 
> > When running with a persistent memory configuration, I'm seeing a
> > kernel stack trace on cxl_pmem_init
> > 
> > Config:
> > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > -m 3G,slots=4,maxmem=4G \
> > -smp 4 \
> > -machine type=q35,accel=kvm,cxl=on \
> > -enable-kvm \
> > -nographic \
> > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \
> > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \
> > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
> > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > 
> > 
> > [   62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0
> > [   62.185069] #PF: supervisor read access in kernel mode
> > [   62.198502] #PF: error_code(0x0000) - not-present page
> > [   62.211019] PGD 0 P4D 0
> > [   62.220521] Oops: 0000 [#1] PREEMPT SMP PTI
> > [   62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1
> > [   62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> > [   62.258432] Adding 2939900k swap on /dev/zram0.  Priority:100 extents:1 across:2939900k SSDscFS
> > [   62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > [   62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> > [   62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > [   62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> > [   62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> > [   62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> > [   62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> > [   62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> > [   62.285541] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> > [   62.285542] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0
> > [   62.285653] Call Trace:
> > [   62.285656]  <TASK>
> > [   62.285660]  cxl_bus_probe+0x17/0x50
> > [   62.285691]  really_probe+0xde/0x380
> > [   62.285695]  ? pm_runtime_barrier+0x50/0x90
> > [   62.285700]  __driver_probe_device+0x78/0x170
> > [   62.285846]  driver_probe_device+0x1f/0x90
> > [   62.285850]  __driver_attach+0xd2/0x1c0
> > [   62.285853]  ? __pfx___driver_attach+0x10/0x10
> > [   62.285856]  bus_for_each_dev+0x76/0xa0
> > [   62.285860]  bus_add_driver+0x1b1/0x200
> > [   62.285863]  driver_register+0x89/0xe0
> > [   62.285868]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
> > [   62.285874]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
> > [   62.285880]  do_one_initcall+0x6e/0x330
> > [   62.285888]  do_init_module+0x4a/0x200
> > [   62.285892]  __do_sys_finit_module+0x93/0xf0
> > [   62.285899]  do_syscall_64+0x5b/0x80
> > [   62.285904]  ? do_syscall_64+0x67/0x80
> > [   62.285906]  ? asm_exc_page_fault+0x22/0x30
> > [   62.285910]  ? lockdep_hardirqs_on+0x7d/0x100
> > [   62.285914]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > [   62.285917] RIP: 0033:0x7f2619b0afbd
> > [   62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8
> > [   62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > [   62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd
> > [   62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006
> > [   62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52
> > [   62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000
> > [   62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0
> > [   62.285932]  </TASK>
> > [   62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g
> > [   62.285954] CR2: 00000000000004c0
> > [   62.288385] ---[ end trace 0000000000000000 ]---
> > [   63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > [   63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> > [   63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > [   63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> > [   63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> > [   63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> > [   63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> > [   63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> > [   63.203580] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> > [   63.203583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0
> > 
> 
> I clearly need to run some more rests as not seen this issue (and I've
> had a couple of reports of it now).  I guess I never tend to be running
> a completely clean tree on either side + testing is mostly on arm64
> (though I doubt that matters). It's not the pass through decoder issues as
> that is visible only when bringing up a region.

OK so I should drop this from my queue for now?


> > 
> > 
> > Next i reverted the QEMU branch to the commit just before the type-3
> > volatile commit and used the old method of launching with a type-3 pmem
> > device
> > 
> > Config:
> > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > -m 2G,slots=4,maxmem=4G \
> > -smp 4 \
> > -machine type=q35,accel=kvm,cxl=on \
> > -enable-kvm \
> > -nographic \
> > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > -object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \
> > -object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \
> > -device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \
> > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > 
> > 
> > Result: Similar stack trace
> > [   29.850023] BUG: kernel NULL pointer dereference, address: 00000000000004c0
> > [   29.882400] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > [   29.957485] Call Trace:
> > [   29.959067]  <TASK>
> > [   29.962176]  cxl_bus_probe+0x17/0x50
> > [   29.964940]  really_probe+0xde/0x380
> > [   29.969065]  ? pm_runtime_barrier+0x50/0x90
> > [   29.973419]  __driver_probe_device+0x78/0x170
> > [   29.977183]  driver_probe_device+0x1f/0x90
> > [   29.984212]  __driver_attach+0xd2/0x1c0
> > [   29.988463]  ? __pfx___driver_attach+0x10/0x10
> > [   29.992379]  bus_for_each_dev+0x76/0xa0
> > [   29.997040]  bus_add_driver+0x1b1/0x200
> > [   30.000368]  driver_register+0x89/0xe0
> > [   30.004579]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
> > [   30.012403]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
> > [   30.019394]  do_one_initcall+0x6e/0x330
> > [   30.024028]  do_init_module+0x4a/0x200
> > [   30.029243]  __do_sys_finit_module+0x93/0xf0
> > [   30.034943]  do_syscall_64+0x5b/0x80
> > [   30.039844]  ? do_syscall_64+0x67/0x80
> > [   30.045163]  ? do_syscall_64+0x67/0x80
> > [   30.049729]  ? lock_release+0x14b/0x440
> > [   30.054055]  ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90
> > [   30.061039]  ? lock_is_held_type+0xe8/0x140
> > [   30.067625]  ? do_syscall_64+0x67/0x80
> > [   30.071909]  ? lockdep_hardirqs_on+0x7d/0x100
> > [   30.079037]  ? do_syscall_64+0x67/0x80
> > [   30.084537]  ? do_syscall_64+0x67/0x80
> > [   30.089091]  ? do_syscall_64+0x67/0x80
> > [   30.094174]  ? lockdep_hardirqs_on+0x7d/0x100
> > [   30.099224]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > [   30.104446] RIP: 0033:0x7f000550afbd

Jonathan Cameron Jan. 19, 2023, 12:16 p.m. UTC | #14

On Thu, 19 Jan 2023 06:48:11 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jan 19, 2023 at 10:19:46AM +0000, Jonathan Cameron wrote:
> > On Wed, 18 Jan 2023 14:22:08 -0500
> > Gregory Price <gregory.price@memverge.com> wrote:
> >   
> > > 1) No stack traces present
> > > 2) Device usage appears to work, but cxl-cli fails to create a region, i
> > > haven't checked why yet (also tried ndctl-75, same results)
> > > 3) There seems to be some other regression with the cxl_pmem_init
> > > routine, because I get a stack trace in this setup regardless of whether
> > > I apply the type-3 device commit.
> > > 
> > > 
> > > All tests below with the previously posted DOE linux branch.
> > > Base QEMU branch was Jonathan's 2023-1-11
> > > 
> > > 
> > > DOE Branch - 2023-1-11 (HEAD) (all commits)
> > > 
> > > QEMU Config:
> > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > > -m 3G,slots=4,maxmem=8G \
> > > -smp 4 \
> > > -machine type=q35,accel=kvm,cxl=on \
> > > -enable-kvm \
> > > -nographic \
> > > -object memory-backend-ram,id=mem0,size=1G,share=on \
> > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > > 
> > > Result:  This worked, but cxl-cli could not create a region (will look
> > > into this further later).
> > >   
> > 
> > Even if everything else worked, it will currently fail because of the
> > issue with pass through decoders.
> > (Kernel assumes always pass through for single rp, qemu assumes never
> >  pass through - both are valid under spec).
> > Add a second rp for now.  I should be able to post some patches
> > to work around that shortly. 
> > 
> > 
> > 
> > 
> >    
> > > 
> > > 
> > > 
> > > When running with a persistent memory configuration, I'm seeing a
> > > kernel stack trace on cxl_pmem_init
> > > 
> > > Config:
> > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > > -m 3G,slots=4,maxmem=4G \
> > > -smp 4 \
> > > -machine type=q35,accel=kvm,cxl=on \
> > > -enable-kvm \
> > > -nographic \
> > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \
> > > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \
> > > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
> > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > > 
> > > 
> > > [   62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0
> > > [   62.185069] #PF: supervisor read access in kernel mode
> > > [   62.198502] #PF: error_code(0x0000) - not-present page
> > > [   62.211019] PGD 0 P4D 0
> > > [   62.220521] Oops: 0000 [#1] PREEMPT SMP PTI
> > > [   62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1
> > > [   62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> > > [   62.258432] Adding 2939900k swap on /dev/zram0.  Priority:100 extents:1 across:2939900k SSDscFS
> > > [   62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > > [   62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> > > [   62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > > [   62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> > > [   62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> > > [   62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> > > [   62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> > > [   62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> > > [   62.285541] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> > > [   62.285542] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [   62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0
> > > [   62.285653] Call Trace:
> > > [   62.285656]  <TASK>
> > > [   62.285660]  cxl_bus_probe+0x17/0x50
> > > [   62.285691]  really_probe+0xde/0x380
> > > [   62.285695]  ? pm_runtime_barrier+0x50/0x90
> > > [   62.285700]  __driver_probe_device+0x78/0x170
> > > [   62.285846]  driver_probe_device+0x1f/0x90
> > > [   62.285850]  __driver_attach+0xd2/0x1c0
> > > [   62.285853]  ? __pfx___driver_attach+0x10/0x10
> > > [   62.285856]  bus_for_each_dev+0x76/0xa0
> > > [   62.285860]  bus_add_driver+0x1b1/0x200
> > > [   62.285863]  driver_register+0x89/0xe0
> > > [   62.285868]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
> > > [   62.285874]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
> > > [   62.285880]  do_one_initcall+0x6e/0x330
> > > [   62.285888]  do_init_module+0x4a/0x200
> > > [   62.285892]  __do_sys_finit_module+0x93/0xf0
> > > [   62.285899]  do_syscall_64+0x5b/0x80
> > > [   62.285904]  ? do_syscall_64+0x67/0x80
> > > [   62.285906]  ? asm_exc_page_fault+0x22/0x30
> > > [   62.285910]  ? lockdep_hardirqs_on+0x7d/0x100
> > > [   62.285914]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > > [   62.285917] RIP: 0033:0x7f2619b0afbd
> > > [   62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8
> > > [   62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > > [   62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd
> > > [   62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006
> > > [   62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52
> > > [   62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000
> > > [   62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0
> > > [   62.285932]  </TASK>
> > > [   62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g
> > > [   62.285954] CR2: 00000000000004c0
> > > [   62.288385] ---[ end trace 0000000000000000 ]---
> > > [   63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > > [   63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> > > [   63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > > [   63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> > > [   63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> > > [   63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> > > [   63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> > > [   63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> > > [   63.203580] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> > > [   63.203583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [   63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0
> > >   
> > 
> > I clearly need to run some more rests as not seen this issue (and I've
> > had a couple of reports of it now).  I guess I never tend to be running
> > a completely clean tree on either side + testing is mostly on arm64
> > (though I doubt that matters). It's not the pass through decoder issues as
> > that is visible only when bringing up a region.  
> 
> OK so I should drop this from my queue for now?

Whilst discussion is in this thread, it is not related to this patch set
(at least I've seen no indication that it is!)

The revert referred to is down to a point with a whole load of other stuff
on top of upstream QEMU. However I think that whatever this is effects
upstream as well - though it seems more likely to be a kernel bug or
possibly something we simply aren't emulating yet.
The trace is probably pointing to a dereference of
(struct device)->driver
as there is very little else in that function, which is very suspicious.

So keeping this set queued up would be great.

Jonathan

Jonathan Cameron Jan. 19, 2023, 12:42 p.m. UTC | #15

On Wed, 18 Jan 2023 14:31:53 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> I apparently forgot an intro lol
> 
> I tested the DOE linux branch with the 2023-1-11 QEMU branch with both
> volatile, non-volatile, and "legacy" (pre-my-patch) non-volatile mode.
> 
> 1) *In volatile mode, there are no stack traces present (during boot*)
> 
> On Wed, Jan 18, 2023 at 02:22:08PM -0500, Gregory Price wrote:
> > 
> > 1) No stack traces present
> > 2) Device usage appears to work, but cxl-cli fails to create a region, i
> > haven't checked why yet (also tried ndctl-75, same results)
> > 3) There seems to be some other regression with the cxl_pmem_init
> > routine, because I get a stack trace in this setup regardless of whether
> > I apply the type-3 device commit.
> > 
> > 
> > All tests below with the previously posted DOE linux branch.
> > Base QEMU branch was Jonathan's 2023-1-11
> > 
> > 
> > DOE Branch - 2023-1-11 (HEAD) (all commits)
> > 
> > QEMU Config:
> > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > -m 3G,slots=4,maxmem=8G \
> > -smp 4 \
> > -machine type=q35,accel=kvm,cxl=on \
> > -enable-kvm \
> > -nographic \
> > -object memory-backend-ram,id=mem0,size=1G,share=on \
> > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > 
> > Result:  This worked, but cxl-cli could not create a region (will look
> > into this further later).
> > 
> > 
> > 
> > 
> > When running with a persistent memory configuration, I'm seeing a
> > kernel stack trace on cxl_pmem_init
> > 
> > Config:
> > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > -m 3G,slots=4,maxmem=4G \
> > -smp 4 \
> > -machine type=q35,accel=kvm,cxl=on \
> > -enable-kvm \
> > -nographic \
> > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \
> > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \
> > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
> > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > 
> > 
> > [   62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0
> > [   62.185069] #PF: supervisor read access in kernel mode
> > [   62.198502] #PF: error_code(0x0000) - not-present page
> > [   62.211019] PGD 0 P4D 0
> > [   62.220521] Oops: 0000 [#1] PREEMPT SMP PTI
> > [   62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1
> > [   62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> > [   62.258432] Adding 2939900k swap on /dev/zram0.  Priority:100 extents:1 across:2939900k SSDscFS
> > [   62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > [   62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> > [   62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > [   62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> > [   62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> > [   62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> > [   62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> > [   62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> > [   62.285541] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> > [   62.285542] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0
> > [   62.285653] Call Trace:
> > [   62.285656]  <TASK>
> > [   62.285660]  cxl_bus_probe+0x17/0x50
> > [   62.285691]  really_probe+0xde/0x380
> > [   62.285695]  ? pm_runtime_barrier+0x50/0x90
> > [   62.285700]  __driver_probe_device+0x78/0x170
> > [   62.285846]  driver_probe_device+0x1f/0x90
> > [   62.285850]  __driver_attach+0xd2/0x1c0
> > [   62.285853]  ? __pfx___driver_attach+0x10/0x10
> > [   62.285856]  bus_for_each_dev+0x76/0xa0
> > [   62.285860]  bus_add_driver+0x1b1/0x200
> > [   62.285863]  driver_register+0x89/0xe0
> > [   62.285868]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
> > [   62.285874]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
> > [   62.285880]  do_one_initcall+0x6e/0x330
> > [   62.285888]  do_init_module+0x4a/0x200
> > [   62.285892]  __do_sys_finit_module+0x93/0xf0
> > [   62.285899]  do_syscall_64+0x5b/0x80
> > [   62.285904]  ? do_syscall_64+0x67/0x80
> > [   62.285906]  ? asm_exc_page_fault+0x22/0x30
> > [   62.285910]  ? lockdep_hardirqs_on+0x7d/0x100
> > [   62.285914]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > [   62.285917] RIP: 0033:0x7f2619b0afbd
> > [   62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8
> > [   62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > [   62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd
> > [   62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006
> > [   62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52
> > [   62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000
> > [   62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0
> > [   62.285932]  </TASK>
> > [   62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g
> > [   62.285954] CR2: 00000000000004c0
> > [   62.288385] ---[ end trace 0000000000000000 ]---
> > [   63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > [   63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> > [   63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > [   63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> > [   63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> > [   63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> > [   63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> > [   63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> > [   63.203580] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> > [   63.203583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0

Possibly replicated.  What I did was stop cxl_pmem.ko being probed automatically and
added it manually later. Trace that results is certainly similar to yours.

Now the MODULE_SOFTDEP() in drivers/cxl/acpi.c should stop that happening
assuming you are letting autoloading run.
I wonder if there is a path in which it doesn't?

Dan, any thoughts?

There is another race that I can trigger by repeatedly injecting errors and
causing resets, but the trace for that is very different and
points at cxl_pmem_ctl() called via nvdimm_probe(). I was going to try
and pin that one down a little more before posting a report but might
as well muddy the waters :)

 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000358
 Mem abort info:
 ESR = 0x0000000096000004
 EC = 0x25: DABT (current EL), IL = 32 bits
 SET = 0, FnV = 0
 EA = 0, S1PTW = 0
 FSC = 0x04: level 0 translation fault
 Data abort info:
 ISV = 0, ISS = 0x00000004
 CM = 0, WnR = 0
 user pgtable: 4k pages, 48-bit VAs, pgdp=0000000102e12000
 [0000000000000358] pgd=0000000000000000, p4d=0000000000000000
 Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
 Modules linked in: cxl_mem cxl_port cxl_acpi cxl_pmem cxl_pci cxl_core
 CPU: 0 PID: 236 Comm: kworker/u8:3 Not tainted 6.2.0-rc3+ #598
 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
 Workqueue: events_unbound async_run_entry_fn
 pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
 pc : cxl_pmem_ctl+0x74/0x244 [cxl_pmem]
 lr : cxl_pmem_ctl+0x60/0x244 [cxl_pmem]
 sp : ffff8000089239d0
 x29: ffff8000089239d0 x28: 0000000000000000 x27: 0000000000000000
 x26: ffffcd4f6b263000 x25: ffffcd4f6a25d9c8 x24: 0000000000000000
 kernel: Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
 x23: 0000000000000000 x22: 000000000000000c x21: ffff0000c5b95400
 x20: ffff0000c0d9ce0c x19: 0000000000000004 x18: 0000000000000000
 x17: 0000000000000000 x16: ffffcd4f698a5fa0 x15: 0000000000000000
 x14: 0000000000000002 x13: 0000000000000000 x12: 0000000000000000
 x11: 0000000000000001 x10: 409e5dd45a38ef72 x9 : ffffcd4f607531f0
 x8 : ffff0000c0d9ce80 x7 : 0000000000000000 x6 : 0000000000000000
 x5 : ffff800008923a84 x4 : 000000000000000c x3 : ffff800008923a10
 x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000070
 Call trace:
  cxl_pmem_ctl+0x74/0x244 [cxl_pmem]
  nvdimm_init_nsarea+0xb8/0xdc
  nvdimm_probe+0xc0/0x1d0
  nvdimm_bus_probe+0x90/0x200
  really_probe+0xc8/0x3e0
  __driver_probe_device+0x84/0x190
  driver_probe_device+0x44/0x120
  __device_attach_driver+0xc4/0x160
  bus_for_each_drv+0x80/0xe0
  __device_attach+0xa4/0x1cc
  device_initial_probe+0x1c/0x2c
  bus_probe_device+0xa4/0xb0
  device_add+0x404/0x920
  nd_async_device_register+0x20/0x70
  async_run_entry_fn+0x3c/0x154
  process_one_work+0x200/0x474
  worker_thread+0x74/0x43c
  kthread+0x110/0x114
  ret_from_fork+0x10/0x20
  Code: 53067e61 f90023e0 f9417aa2 f8617860 (f941ac57)
  ---[ end trace 0000000000000000 ]---

Note this seems to have gotten harder to hit for some reason - took
about 50 resets.

I'll keep digging

Jonathan


> > 
> > 
> > 
> > Next i reverted the QEMU branch to the commit just before the type-3
> > volatile commit and used the old method of launching with a type-3 pmem
> > device
> > 
> > Config:
> > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > -m 2G,slots=4,maxmem=4G \
> > -smp 4 \
> > -machine type=q35,accel=kvm,cxl=on \
> > -enable-kvm \
> > -nographic \
> > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > -object memory-backend-file,pmem=true,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=1G \
> > -object memory-backend-file,pmem=true,id=lsa0,mem-path=/tmp/cxl-lsa0,size=1G \
> > -device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \
> > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > 
> > 
> > Result: Similar stack trace
> > [   29.850023] BUG: kernel NULL pointer dereference, address: 00000000000004c0
> > [   29.882400] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > [   29.957485] Call Trace:
> > [   29.959067]  <TASK>
> > [   29.962176]  cxl_bus_probe+0x17/0x50
> > [   29.964940]  really_probe+0xde/0x380
> > [   29.969065]  ? pm_runtime_barrier+0x50/0x90
> > [   29.973419]  __driver_probe_device+0x78/0x170
> > [   29.977183]  driver_probe_device+0x1f/0x90
> > [   29.984212]  __driver_attach+0xd2/0x1c0
> > [   29.988463]  ? __pfx___driver_attach+0x10/0x10
> > [   29.992379]  bus_for_each_dev+0x76/0xa0
> > [   29.997040]  bus_add_driver+0x1b1/0x200
> > [   30.000368]  driver_register+0x89/0xe0
> > [   30.004579]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
> > [   30.012403]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
> > [   30.019394]  do_one_initcall+0x6e/0x330
> > [   30.024028]  do_init_module+0x4a/0x200
> > [   30.029243]  __do_sys_finit_module+0x93/0xf0
> > [   30.034943]  do_syscall_64+0x5b/0x80
> > [   30.039844]  ? do_syscall_64+0x67/0x80
> > [   30.045163]  ? do_syscall_64+0x67/0x80
> > [   30.049729]  ? lock_release+0x14b/0x440
> > [   30.054055]  ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90
> > [   30.061039]  ? lock_is_held_type+0xe8/0x140
> > [   30.067625]  ? do_syscall_64+0x67/0x80
> > [   30.071909]  ? lockdep_hardirqs_on+0x7d/0x100
> > [   30.079037]  ? do_syscall_64+0x67/0x80
> > [   30.084537]  ? do_syscall_64+0x67/0x80
> > [   30.089091]  ? do_syscall_64+0x67/0x80
> > [   30.094174]  ? lockdep_hardirqs_on+0x7d/0x100
> > [   30.099224]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > [   30.104446] RIP: 0033:0x7f000550afbd

Gregory Price Jan. 19, 2023, 2:20 p.m. UTC | #16

On Thu, Jan 19, 2023 at 10:19:46AM +0000, Jonathan Cameron wrote:
> Even if everything else worked, it will currently fail because of the
> issue with pass through decoders.
> (Kernel assumes always pass through for single rp, qemu assumes never
>  pass through - both are valid under spec).
> Add a second rp for now.  I should be able to post some patches
> to work around that shortly. 
> 

*facepalm* i knew this and yet i still kept old configs around.

I'll do some more testing today with 2 root ports

Gregory Price Jan. 19, 2023, 2:23 p.m. UTC | #17

On Thu, Jan 19, 2023 at 06:48:11AM -0500, Michael S. Tsirkin wrote:
> > 
> > I clearly need to run some more rests as not seen this issue (and I've
> > had a couple of reports of it now).  I guess I never tend to be running
> > a completely clean tree on either side + testing is mostly on arm64
> > (though I doubt that matters). It's not the pass through decoder issues as
> > that is visible only when bringing up a region.
> 
> OK so I should drop this from my queue for now?
> 

Even with this trace, I'm fairly confident that the DOE line + this have
fixed other issues, i'm seeing less general instability on the x86 than
i did before, and when loading in non-volatile mode i don't see any
stability issues.

Jonathan Cameron Jan. 19, 2023, 3:04 p.m. UTC | #18

On Thu, 19 Jan 2023 12:42:44 +0000
Jonathan Cameron via <qemu-devel@nongnu.org> wrote:

> On Wed, 18 Jan 2023 14:31:53 -0500
> Gregory Price <gregory.price@memverge.com> wrote:
> 
> > I apparently forgot an intro lol
> > 
> > I tested the DOE linux branch with the 2023-1-11 QEMU branch with both
> > volatile, non-volatile, and "legacy" (pre-my-patch) non-volatile mode.
> > 
> > 1) *In volatile mode, there are no stack traces present (during boot*)
> > 
> > On Wed, Jan 18, 2023 at 02:22:08PM -0500, Gregory Price wrote:  
> > > 
> > > 1) No stack traces present
> > > 2) Device usage appears to work, but cxl-cli fails to create a region, i
> > > haven't checked why yet (also tried ndctl-75, same results)
> > > 3) There seems to be some other regression with the cxl_pmem_init
> > > routine, because I get a stack trace in this setup regardless of whether
> > > I apply the type-3 device commit.
> > > 
> > > 
> > > All tests below with the previously posted DOE linux branch.
> > > Base QEMU branch was Jonathan's 2023-1-11
> > > 
> > > 
> > > DOE Branch - 2023-1-11 (HEAD) (all commits)
> > > 
> > > QEMU Config:
> > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > > -m 3G,slots=4,maxmem=8G \
> > > -smp 4 \
> > > -machine type=q35,accel=kvm,cxl=on \
> > > -enable-kvm \
> > > -nographic \
> > > -object memory-backend-ram,id=mem0,size=1G,share=on \
> > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > > 
> > > Result:  This worked, but cxl-cli could not create a region (will look
> > > into this further later).
> > > 
> > > 
> > > 
> > > 
> > > When running with a persistent memory configuration, I'm seeing a
> > > kernel stack trace on cxl_pmem_init
> > > 
> > > Config:
> > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > > -m 3G,slots=4,maxmem=4G \
> > > -smp 4 \
> > > -machine type=q35,accel=kvm,cxl=on \
> > > -enable-kvm \
> > > -nographic \
> > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \
> > > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \
> > > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
> > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > > 
> > > 
> > > [   62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0
> > > [   62.185069] #PF: supervisor read access in kernel mode
> > > [   62.198502] #PF: error_code(0x0000) - not-present page
> > > [   62.211019] PGD 0 P4D 0
> > > [   62.220521] Oops: 0000 [#1] PREEMPT SMP PTI
> > > [   62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1
> > > [   62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> > > [   62.258432] Adding 2939900k swap on /dev/zram0.  Priority:100 extents:1 across:2939900k SSDscFS
> > > [   62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > > [   62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> > > [   62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > > [   62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> > > [   62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> > > [   62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> > > [   62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> > > [   62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> > > [   62.285541] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> > > [   62.285542] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [   62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0
> > > [   62.285653] Call Trace:
> > > [   62.285656]  <TASK>
> > > [   62.285660]  cxl_bus_probe+0x17/0x50
> > > [   62.285691]  really_probe+0xde/0x380
> > > [   62.285695]  ? pm_runtime_barrier+0x50/0x90
> > > [   62.285700]  __driver_probe_device+0x78/0x170
> > > [   62.285846]  driver_probe_device+0x1f/0x90
> > > [   62.285850]  __driver_attach+0xd2/0x1c0
> > > [   62.285853]  ? __pfx___driver_attach+0x10/0x10
> > > [   62.285856]  bus_for_each_dev+0x76/0xa0
> > > [   62.285860]  bus_add_driver+0x1b1/0x200
> > > [   62.285863]  driver_register+0x89/0xe0
> > > [   62.285868]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
> > > [   62.285874]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
> > > [   62.285880]  do_one_initcall+0x6e/0x330
> > > [   62.285888]  do_init_module+0x4a/0x200
> > > [   62.285892]  __do_sys_finit_module+0x93/0xf0
> > > [   62.285899]  do_syscall_64+0x5b/0x80
> > > [   62.285904]  ? do_syscall_64+0x67/0x80
> > > [   62.285906]  ? asm_exc_page_fault+0x22/0x30
> > > [   62.285910]  ? lockdep_hardirqs_on+0x7d/0x100
> > > [   62.285914]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > > [   62.285917] RIP: 0033:0x7f2619b0afbd
> > > [   62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8
> > > [   62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > > [   62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd
> > > [   62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006
> > > [   62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52
> > > [   62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000
> > > [   62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0
> > > [   62.285932]  </TASK>
> > > [   62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g
> > > [   62.285954] CR2: 00000000000004c0
> > > [   62.288385] ---[ end trace 0000000000000000 ]---
> > > [   63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > > [   63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> > > [   63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > > [   63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> > > [   63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> > > [   63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> > > [   63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> > > [   63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> > > [   63.203580] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> > > [   63.203583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [   63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0  
> 
> Possibly replicated.  What I did was stop cxl_pmem.ko being probed automatically and
> added it manually later. Trace that results is certainly similar to yours.
> 
> Now the MODULE_SOFTDEP() in drivers/cxl/acpi.c should stop that happening
> assuming you are letting autoloading run.
> I wonder if there is a path in which it doesn't?

Gregory, would you mind checking if
cxl_nvb is NULL here...
https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67
(printk before it is used should work).

Might also be worth checking cxl_nvd and cxl_ds
but my guess is cxl_nvb is our problem (it is when I deliberate change
the load order).

Jonathan


> 
> Dan, any thoughts?
> 
> There is another race that I can trigger by repeatedly injecting errors and
> causing resets, but the trace for that is very different and
> points at cxl_pmem_ctl() called via nvdimm_probe(). I was going to try
> and pin that one down a little more before posting a report but might
> as well muddy the waters :)
> 
>  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000358
>  Mem abort info:
>  ESR = 0x0000000096000004
>  EC = 0x25: DABT (current EL), IL = 32 bits
>  SET = 0, FnV = 0
>  EA = 0, S1PTW = 0
>  FSC = 0x04: level 0 translation fault
>  Data abort info:
>  ISV = 0, ISS = 0x00000004
>  CM = 0, WnR = 0
>  user pgtable: 4k pages, 48-bit VAs, pgdp=0000000102e12000
>  [0000000000000358] pgd=0000000000000000, p4d=0000000000000000
>  Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>  Modules linked in: cxl_mem cxl_port cxl_acpi cxl_pmem cxl_pci cxl_core
>  CPU: 0 PID: 236 Comm: kworker/u8:3 Not tainted 6.2.0-rc3+ #598
>  Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>  Workqueue: events_unbound async_run_entry_fn
>  pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
>  pc : cxl_pmem_ctl+0x74/0x244 [cxl_pmem]
>  lr : cxl_pmem_ctl+0x60/0x244 [cxl_pmem]
>  sp : ffff8000089239d0
>  x29: ffff8000089239d0 x28: 0000000000000000 x27: 0000000000000000
>  x26: ffffcd4f6b263000 x25: ffffcd4f6a25d9c8 x24: 0000000000000000
>  kernel: Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>  x23: 0000000000000000 x22: 000000000000000c x21: ffff0000c5b95400
>  x20: ffff0000c0d9ce0c x19: 0000000000000004 x18: 0000000000000000
>  x17: 0000000000000000 x16: ffffcd4f698a5fa0 x15: 0000000000000000
>  x14: 0000000000000002 x13: 0000000000000000 x12: 0000000000000000
>  x11: 0000000000000001 x10: 409e5dd45a38ef72 x9 : ffffcd4f607531f0
>  x8 : ffff0000c0d9ce80 x7 : 0000000000000000 x6 : 0000000000000000
>  x5 : ffff800008923a84 x4 : 000000000000000c x3 : ffff800008923a10
>  x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000070
>  Call trace:
>   cxl_pmem_ctl+0x74/0x244 [cxl_pmem]
>   nvdimm_init_nsarea+0xb8/0xdc
>   nvdimm_probe+0xc0/0x1d0
>   nvdimm_bus_probe+0x90/0x200
>   really_probe+0xc8/0x3e0
>   __driver_probe_device+0x84/0x190
>   driver_probe_device+0x44/0x120
>   __device_attach_driver+0xc4/0x160
>   bus_for_each_drv+0x80/0xe0
>   __device_attach+0xa4/0x1cc
>   device_initial_probe+0x1c/0x2c
>   bus_probe_device+0xa4/0xb0
>   device_add+0x404/0x920
>   nd_async_device_register+0x20/0x70
>   async_run_entry_fn+0x3c/0x154
>   process_one_work+0x200/0x474
>   worker_thread+0x74/0x43c
>   kthread+0x110/0x114
>   ret_from_fork+0x10/0x20
>   Code: 53067e61 f90023e0 f9417aa2 f8617860 (f941ac57)
>   ---[ end trace 0000000000000000 ]---
> 
> Note this seems to have gotten harder to hit for some reason - took
> about 50 resets.
> 
> I'll keep digging
> 
> Jonathan
>

Jonathan Cameron Jan. 19, 2023, 4:17 p.m. UTC | #19

On Thu, 19 Jan 2023 15:04:49 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Thu, 19 Jan 2023 12:42:44 +0000
> Jonathan Cameron via <qemu-devel@nongnu.org> wrote:
> 
> > On Wed, 18 Jan 2023 14:31:53 -0500
> > Gregory Price <gregory.price@memverge.com> wrote:
> >   
> > > I apparently forgot an intro lol
> > > 
> > > I tested the DOE linux branch with the 2023-1-11 QEMU branch with both
> > > volatile, non-volatile, and "legacy" (pre-my-patch) non-volatile mode.
> > > 
> > > 1) *In volatile mode, there are no stack traces present (during boot*)
> > > 
> > > On Wed, Jan 18, 2023 at 02:22:08PM -0500, Gregory Price wrote:    
> > > > 
> > > > 1) No stack traces present
> > > > 2) Device usage appears to work, but cxl-cli fails to create a region, i
> > > > haven't checked why yet (also tried ndctl-75, same results)
> > > > 3) There seems to be some other regression with the cxl_pmem_init
> > > > routine, because I get a stack trace in this setup regardless of whether
> > > > I apply the type-3 device commit.
> > > > 
> > > > 
> > > > All tests below with the previously posted DOE linux branch.
> > > > Base QEMU branch was Jonathan's 2023-1-11
> > > > 
> > > > 
> > > > DOE Branch - 2023-1-11 (HEAD) (all commits)
> > > > 
> > > > QEMU Config:
> > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > > > -m 3G,slots=4,maxmem=8G \
> > > > -smp 4 \
> > > > -machine type=q35,accel=kvm,cxl=on \
> > > > -enable-kvm \
> > > > -nographic \
> > > > -object memory-backend-ram,id=mem0,size=1G,share=on \
> > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > > > 
> > > > Result:  This worked, but cxl-cli could not create a region (will look
> > > > into this further later).
> > > > 
> > > > 
> > > > 
> > > > 
> > > > When running with a persistent memory configuration, I'm seeing a
> > > > kernel stack trace on cxl_pmem_init
> > > > 
> > > > Config:
> > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > > > -m 3G,slots=4,maxmem=4G \
> > > > -smp 4 \
> > > > -machine type=q35,accel=kvm,cxl=on \
> > > > -enable-kvm \
> > > > -nographic \
> > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > > > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > > > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \
> > > > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \
> > > > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
> > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > > > 
> > > > 
> > > > [   62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0
> > > > [   62.185069] #PF: supervisor read access in kernel mode
> > > > [   62.198502] #PF: error_code(0x0000) - not-present page
> > > > [   62.211019] PGD 0 P4D 0
> > > > [   62.220521] Oops: 0000 [#1] PREEMPT SMP PTI
> > > > [   62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1
> > > > [   62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> > > > [   62.258432] Adding 2939900k swap on /dev/zram0.  Priority:100 extents:1 across:2939900k SSDscFS
> > > > [   62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > > > [   62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> > > > [   62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > > > [   62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> > > > [   62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> > > > [   62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> > > > [   62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> > > > [   62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> > > > [   62.285541] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> > > > [   62.285542] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [   62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0
> > > > [   62.285653] Call Trace:
> > > > [   62.285656]  <TASK>
> > > > [   62.285660]  cxl_bus_probe+0x17/0x50
> > > > [   62.285691]  really_probe+0xde/0x380
> > > > [   62.285695]  ? pm_runtime_barrier+0x50/0x90
> > > > [   62.285700]  __driver_probe_device+0x78/0x170
> > > > [   62.285846]  driver_probe_device+0x1f/0x90
> > > > [   62.285850]  __driver_attach+0xd2/0x1c0
> > > > [   62.285853]  ? __pfx___driver_attach+0x10/0x10
> > > > [   62.285856]  bus_for_each_dev+0x76/0xa0
> > > > [   62.285860]  bus_add_driver+0x1b1/0x200
> > > > [   62.285863]  driver_register+0x89/0xe0
> > > > [   62.285868]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
> > > > [   62.285874]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
> > > > [   62.285880]  do_one_initcall+0x6e/0x330
> > > > [   62.285888]  do_init_module+0x4a/0x200
> > > > [   62.285892]  __do_sys_finit_module+0x93/0xf0
> > > > [   62.285899]  do_syscall_64+0x5b/0x80
> > > > [   62.285904]  ? do_syscall_64+0x67/0x80
> > > > [   62.285906]  ? asm_exc_page_fault+0x22/0x30
> > > > [   62.285910]  ? lockdep_hardirqs_on+0x7d/0x100
> > > > [   62.285914]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > > > [   62.285917] RIP: 0033:0x7f2619b0afbd
> > > > [   62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8
> > > > [   62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > > > [   62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd
> > > > [   62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006
> > > > [   62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52
> > > > [   62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000
> > > > [   62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0
> > > > [   62.285932]  </TASK>
> > > > [   62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g
> > > > [   62.285954] CR2: 00000000000004c0
> > > > [   62.288385] ---[ end trace 0000000000000000 ]---
> > > > [   63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > > > [   63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> > > > [   63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > > > [   63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> > > > [   63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> > > > [   63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> > > > [   63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> > > > [   63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> > > > [   63.203580] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> > > > [   63.203583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [   63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0    
> > 
> > Possibly replicated.  What I did was stop cxl_pmem.ko being probed automatically and
> > added it manually later. Trace that results is certainly similar to yours.
> > 
> > Now the MODULE_SOFTDEP() in drivers/cxl/acpi.c should stop that happening
> > assuming you are letting autoloading run.
> > I wonder if there is a path in which it doesn't?  
> 
> Gregory, would you mind checking if
> cxl_nvb is NULL here...
> https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67
> (printk before it is used should work).
> 
> Might also be worth checking cxl_nvd and cxl_ds
> but my guess is cxl_nvb is our problem (it is when I deliberate change
> the load order).

Whilst I still have no idea if this is the same problem, I have identified
what goes wrong if there is a module probe ordering issue.
https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/core/pmem.c#L306

	/*
	 * The two actions below arrange for @cxl_nvd to be deleted when either
	 * the top-level PMEM bridge goes down, or the endpoint device goes
	 * through ->remove().
	 */
	device_lock(&cxl_nvb->dev);
	if (cxl_nvb->dev.driver)
		rc = devm_add_action_or_reset(&cxl_nvb->dev, cxl_nvd_unregister,
					      cxl_nvd);
	else
// bridge driver not loaded, so we hit this path.
		rc = -ENXIO;
	device_unlock(&cxl_nvb->dev);

	if (rc)
/// and this one
		goto err_alloc;

	/* @cxlmd carries a reference on @cxl_nvb until cxlmd_release_nvdimm */
	return devm_add_action_or_reset(&cxlmd->dev, cxlmd_release_nvdimm, cxlmd);

err:
	put_device(dev);
err_alloc:
	cxlmd->cxl_nvb = NULL;
	cxlmd->cxl_nvd = NULL;
	put_device(&cxl_nvb->dev);
// whilst we scrub the pointers we don't actually get rid of the
// cxl_nvd that we registered.  Hence later load of the driver tries to
// attach to that and boom because we've scrubbed these pointers here.
// A quick hack is to just call device_del(&cxl_nvd->dev) if rc = -ENXIO here.
// There may well be a races though....
	return rc;
}
EXPORT_SYMBOL_NS_GPL(devm_cxl_add_nvdimm, CXL);


Of course this "fix" just stops things blowing up, it doesn't leave things
in a remotely useful state.  If it's triggered because someone
is messing with the load order that's fine.  If the same issue
is occurring for Gregory, not so much. 

Jonathan

> 
> Jonathan
> 
> 
> > 
> > Dan, any thoughts?
> > 
> > There is another race that I can trigger by repeatedly injecting errors and
> > causing resets, but the trace for that is very different and
> > points at cxl_pmem_ctl() called via nvdimm_probe(). I was going to try
> > and pin that one down a little more before posting a report but might
> > as well muddy the waters :)
> > 
> >  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000358
> >  Mem abort info:
> >  ESR = 0x0000000096000004
> >  EC = 0x25: DABT (current EL), IL = 32 bits
> >  SET = 0, FnV = 0
> >  EA = 0, S1PTW = 0
> >  FSC = 0x04: level 0 translation fault
> >  Data abort info:
> >  ISV = 0, ISS = 0x00000004
> >  CM = 0, WnR = 0
> >  user pgtable: 4k pages, 48-bit VAs, pgdp=0000000102e12000
> >  [0000000000000358] pgd=0000000000000000, p4d=0000000000000000
> >  Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
> >  Modules linked in: cxl_mem cxl_port cxl_acpi cxl_pmem cxl_pci cxl_core
> >  CPU: 0 PID: 236 Comm: kworker/u8:3 Not tainted 6.2.0-rc3+ #598
> >  Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> >  Workqueue: events_unbound async_run_entry_fn
> >  pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> >  pc : cxl_pmem_ctl+0x74/0x244 [cxl_pmem]
> >  lr : cxl_pmem_ctl+0x60/0x244 [cxl_pmem]
> >  sp : ffff8000089239d0
> >  x29: ffff8000089239d0 x28: 0000000000000000 x27: 0000000000000000
> >  x26: ffffcd4f6b263000 x25: ffffcd4f6a25d9c8 x24: 0000000000000000
> >  kernel: Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
> >  x23: 0000000000000000 x22: 000000000000000c x21: ffff0000c5b95400
> >  x20: ffff0000c0d9ce0c x19: 0000000000000004 x18: 0000000000000000
> >  x17: 0000000000000000 x16: ffffcd4f698a5fa0 x15: 0000000000000000
> >  x14: 0000000000000002 x13: 0000000000000000 x12: 0000000000000000
> >  x11: 0000000000000001 x10: 409e5dd45a38ef72 x9 : ffffcd4f607531f0
> >  x8 : ffff0000c0d9ce80 x7 : 0000000000000000 x6 : 0000000000000000
> >  x5 : ffff800008923a84 x4 : 000000000000000c x3 : ffff800008923a10
> >  x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000070
> >  Call trace:
> >   cxl_pmem_ctl+0x74/0x244 [cxl_pmem]
> >   nvdimm_init_nsarea+0xb8/0xdc
> >   nvdimm_probe+0xc0/0x1d0
> >   nvdimm_bus_probe+0x90/0x200
> >   really_probe+0xc8/0x3e0
> >   __driver_probe_device+0x84/0x190
> >   driver_probe_device+0x44/0x120
> >   __device_attach_driver+0xc4/0x160
> >   bus_for_each_drv+0x80/0xe0
> >   __device_attach+0xa4/0x1cc
> >   device_initial_probe+0x1c/0x2c
> >   bus_probe_device+0xa4/0xb0
> >   device_add+0x404/0x920
> >   nd_async_device_register+0x20/0x70
> >   async_run_entry_fn+0x3c/0x154
> >   process_one_work+0x200/0x474
> >   worker_thread+0x74/0x43c
> >   kthread+0x110/0x114
> >   ret_from_fork+0x10/0x20
> >   Code: 53067e61 f90023e0 f9417aa2 f8617860 (f941ac57)
> >   ---[ end trace 0000000000000000 ]---
> > 
> > Note this seems to have gotten harder to hit for some reason - took
> > about 50 resets.
> > 
> > I'll keep digging
> > 
> > Jonathan
> >

Gregory Price Jan. 20, 2023, 4:53 a.m. UTC | #20

On Thu, Jan 19, 2023 at 03:04:49PM +0000, Jonathan Cameron wrote:
> Gregory, would you mind checking if
> cxl_nvb is NULL here...
> https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67
> (printk before it is used should work).
> 
> Might also be worth checking cxl_nvd and cxl_ds
> but my guess is cxl_nvb is our problem (it is when I deliberate change
> the load order).
> 
> Jonathan
> 

This is exactly the issue.  cxl_nvb is null, the rest appear fine.

Also, note, that weirdly the non-volatile bridge shows up when launching
this in volatile mode, but no stack trace appears.

¯\_(ツ)_/¯

After spending way too much time tracing through the current cxl driver
code, i have only really determined that

1) The code is very pmem oriented, and it's unclear to me how the driver
   as-is differentiates a persistent device from a volatile device. That
	 code path still completely escapes me.  The only differentiating code
	 i see is in the memdev probe path that creates mem#/pmem and mem#/ram

2) The code successfully manages probe, enable, and mount a REAL device
   - cxl memdev appears (/sys/bus/cxl/devices/mem0)
	 - a dax device appears (/sys/bus/dax/devices/)
	   This happens at boot, which I assume must be bios related
	 - The memory *does not* auto-online, instead the dax device can be
	   onlined as system-ram *manually* via ndctl and friends

3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless
   of the type-3 device configuration (pmem-only or vmem-only)

   # CFMW defined
   [root@fedora ~]# ls /sys/bus/cxl/devices/
   decoder0.0  decoder2.0  mem0            port1
   decoder1.0  endpoint2   nvdimm-bridge0  root0

   # CFMW not defined
	 [root@fedora ~]# ls /sys/bus/cxl/devices/
   decoder1.0  decoder2.0  endpoint2  mem0  port1  root0

4) As you can see above, multiple decoders are registered.  I'm not sure
   if that's correct or not, but it does seem odd given there's only one
	 cxl type-3 device.  Odd that decoder0.0 shows up when CFMW is there,
	 but not when it isn't.

	 Note: All these tests have two root ports:
	 -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
   -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
   -device cxl-rp,id=rp1,bus=cxl.0,chassis=0,port=1,slot=1 \


Don't know why I haven't thought of this until now, but is the CFMW code
reporting something odd about what's behind it?  Is it assuming the
devices are pmem?

Gregory Price Jan. 20, 2023, 5:51 a.m. UTC | #21

On Thu, Jan 19, 2023 at 04:17:11PM +0000, Jonathan Cameron wrote:
> 
> Whilst I still have no idea if this is the same problem, I have identified
> what goes wrong if there is a module probe ordering issue.
> https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/core/pmem.c#L306
> 
> 	/*
> 	 * The two actions below arrange for @cxl_nvd to be deleted when either
> 	 * the top-level PMEM bridge goes down, or the endpoint device goes
> 	 * through ->remove().
> 	 */
> 	device_lock(&cxl_nvb->dev);
> 	if (cxl_nvb->dev.driver)
> 		rc = devm_add_action_or_reset(&cxl_nvb->dev, cxl_nvd_unregister,
> 					      cxl_nvd);
> 	else
> // bridge driver not loaded, so we hit this path.
> 		rc = -ENXIO;
> 	device_unlock(&cxl_nvb->dev);
> 
> 	if (rc)
> /// and this one
> 		goto err_alloc;
> 
> 	/* @cxlmd carries a reference on @cxl_nvb until cxlmd_release_nvdimm */
> 	return devm_add_action_or_reset(&cxlmd->dev, cxlmd_release_nvdimm, cxlmd);
> 
> err:
> 	put_device(dev);
> err_alloc:
> 	cxlmd->cxl_nvb = NULL;
> 	cxlmd->cxl_nvd = NULL;
> 	put_device(&cxl_nvb->dev);
> // whilst we scrub the pointers we don't actually get rid of the
> // cxl_nvd that we registered.  Hence later load of the driver tries to
> // attach to that and boom because we've scrubbed these pointers here.
> // A quick hack is to just call device_del(&cxl_nvd->dev) if rc = -ENXIO here.
> // There may well be a races though....
> 	return rc;
> }
> EXPORT_SYMBOL_NS_GPL(devm_cxl_add_nvdimm, CXL);
> 
> 
> Of course this "fix" just stops things blowing up, it doesn't leave things
> in a remotely useful state.  If it's triggered because someone
> is messing with the load order that's fine.  If the same issue
> is occurring for Gregory, not so much. 
> 
> Jonathan
> 

mild hint in the dev_cxl_add_nvdimm_bridge path

driver/cxl/acpi.c

static int cxl_acpi_probe(struct platform_device *pdev)
{
... snip ...
  if (IS_ENABLED(CONFIG_CXL_PMEM))
    rc = device_for_each_child(&root_port->dev, root_port,
             add_root_nvdimm_bridge);
  if (rc < 0)
    return rc;

  /* In case PCI is scanned before ACPI re-trigger memdev attach */
  cxl_bus_rescan();
  return 0;
}


if PCI is presently written in a way that it's expecting nvdimm_bridge
to be present (via acpi_probe), then clearly this would break.

From the other discussion here... that seems to be the issue?  If that's
an issue, I also imagine there are other parts that may be subject to
the same problem.


static int cxl_pmem_region_probe(struct device *dev)
{
  struct nd_mapping_desc mappings[CXL_DECODER_MAX_INTERLEAVE];
  struct cxl_pmem_region *cxlr_pmem = to_cxl_pmem_region(dev);
  struct cxl_region *cxlr = cxlr_pmem->cxlr;
  struct cxl_nvdimm_bridge *cxl_nvb = cxlr->cxl_nvb;


this may be unreachable due to prior stack traces, but you get the
point.

Reiterating my confusion a bit: I don't have an nvdimm, why am i getting
an nvdimm_bridge?  The reason it no longer appears to trigger on my
memexp example is because it doesnt go down this path:

static int cxl_mem_probe(struct device *dev)
{
... snip ...

  // resource size is 0 here due to type3dev->persistent_capacity=0
  if (resource_size(&cxlds->pmem_res) && IS_ENABLED(CONFIG_CXL_PMEM)) {
    rc = devm_cxl_add_nvdimm(cxlmd);
    if (rc == -ENODEV)
      dev_info(dev, "PMEM disabled by platform\n");
    else
      return rc;
  }
... snip ...
}

This seems like more than an ordering issue.

Jonathan Cameron Jan. 20, 2023, 10:47 a.m. UTC | #22

On Thu, 19 Jan 2023 23:53:53 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Thu, Jan 19, 2023 at 03:04:49PM +0000, Jonathan Cameron wrote:
> > Gregory, would you mind checking if
> > cxl_nvb is NULL here...
> > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67
> > (printk before it is used should work).
> > 
> > Might also be worth checking cxl_nvd and cxl_ds
> > but my guess is cxl_nvb is our problem (it is when I deliberate change
> > the load order).
> > 
> > Jonathan
> >   
> 
> This is exactly the issue.  cxl_nvb is null, the rest appear fine.
> 
> Also, note, that weirdly the non-volatile bridge shows up when launching
> this in volatile mode, but no stack trace appears.
> 
> ¯\_(ツ)_/¯
> 
> After spending way too much time tracing through the current cxl driver
> code, i have only really determined that
> 
> 1) The code is very pmem oriented, and it's unclear to me how the driver
>    as-is differentiates a persistent device from a volatile device. That
> 	 code path still completely escapes me.  The only differentiating code
> 	 i see is in the memdev probe path that creates mem#/pmem and mem#/ram

Absolutely on pmem.  Target for kernel side of things was always pmem
first. Volatile has been on roadmap / todo list for a few kernel cycles
but I haven't seen any code yet.

> 
> 2) The code successfully manages probe, enable, and mount a REAL device
>    - cxl memdev appears (/sys/bus/cxl/devices/mem0)
> 	 - a dax device appears (/sys/bus/dax/devices/)
> 	   This happens at boot, which I assume must be bios related
> 	 - The memory *does not* auto-online, instead the dax device can be
> 	   onlined as system-ram *manually* via ndctl and friends

Interesting.  Just curious, is the host a CXL 1.1 host or a CXL 2.0 host?

> 
> 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless
>    of the type-3 device configuration (pmem-only or vmem-only)
> 
>    # CFMW defined
>    [root@fedora ~]# ls /sys/bus/cxl/devices/
>    decoder0.0  decoder2.0  mem0            port1
>    decoder1.0  endpoint2   nvdimm-bridge0  root0
> 
>    # CFMW not defined
> 	 [root@fedora ~]# ls /sys/bus/cxl/devices/
>    decoder1.0  decoder2.0  endpoint2  mem0  port1  root0

That should be harmless and may be needed to tie everything
through to DAX.

> 
> 4) As you can see above, multiple decoders are registered.  I'm not sure
>    if that's correct or not, but it does seem odd given there's only one
> 	 cxl type-3 device.  Odd that decoder0.0 shows up when CFMW is there,
> 	 but not when it isn't.
> 
> 	 Note: All these tests have two root ports:
> 	 -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
>    -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
>    -device cxl-rp,id=rp1,bus=cxl.0,chassis=0,port=1,slot=1 \

IIRC

decoder0.0 represents the fixed routing in the host as defined
by the CFMWS - not an actual programmable decoder.

decoder1.0 is the routing in the host bridge - may be pass through
decoder if there is only one root port.

decoder2.0 is the one is the endpoint itself.

> 
> 
> Don't know why I haven't thought of this until now, but is the CFMW code
> reporting something odd about what's behind it?  Is it assuming the
> devices are pmem?

It reports the ability to support pmem or support volatile or support both.
Currently
https://elixir.bootlin.com/qemu/latest/source/hw/acpi/cxl.c#L107
qemu reports that all CFMWS windows support everything except
"Fixed Device Configuration (Bit[4])" which would tell the OS not
to move devices that are already programmed out of this window
and doesn't really make sense for QEMU to ever set.
That is we support all of:
Device Coherent (type 2 and back invalidate flows on type 3, though we
aren't emulating the back invalidate stuff yet on the EP)
Host only coherent. [thinking about it we should probably not
support both this and device coherent as they would be mutually
incompatible on a real host]
Volatile
Persistent

Jonathan

Dan Williams Jan. 20, 2023, 5:26 p.m. UTC | #23

Jonathan Cameron wrote:
> On Thu, 19 Jan 2023 15:04:49 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> > On Thu, 19 Jan 2023 12:42:44 +0000
> > Jonathan Cameron via <qemu-devel@nongnu.org> wrote:
> > 
> > > On Wed, 18 Jan 2023 14:31:53 -0500
> > > Gregory Price <gregory.price@memverge.com> wrote:
> > >   
> > > > I apparently forgot an intro lol
> > > > 
> > > > I tested the DOE linux branch with the 2023-1-11 QEMU branch with both
> > > > volatile, non-volatile, and "legacy" (pre-my-patch) non-volatile mode.
> > > > 
> > > > 1) *In volatile mode, there are no stack traces present (during boot*)
> > > > 
> > > > On Wed, Jan 18, 2023 at 02:22:08PM -0500, Gregory Price wrote:    
> > > > > 
> > > > > 1) No stack traces present
> > > > > 2) Device usage appears to work, but cxl-cli fails to create a region, i
> > > > > haven't checked why yet (also tried ndctl-75, same results)
> > > > > 3) There seems to be some other regression with the cxl_pmem_init
> > > > > routine, because I get a stack trace in this setup regardless of whether
> > > > > I apply the type-3 device commit.
> > > > > 
> > > > > 
> > > > > All tests below with the previously posted DOE linux branch.
> > > > > Base QEMU branch was Jonathan's 2023-1-11
> > > > > 
> > > > > 
> > > > > DOE Branch - 2023-1-11 (HEAD) (all commits)
> > > > > 
> > > > > QEMU Config:
> > > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > > > > -m 3G,slots=4,maxmem=8G \
> > > > > -smp 4 \
> > > > > -machine type=q35,accel=kvm,cxl=on \
> > > > > -enable-kvm \
> > > > > -nographic \
> > > > > -object memory-backend-ram,id=mem0,size=1G,share=on \
> > > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > > > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > > > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> > > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > > > > 
> > > > > Result:  This worked, but cxl-cli could not create a region (will look
> > > > > into this further later).
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > When running with a persistent memory configuration, I'm seeing a
> > > > > kernel stack trace on cxl_pmem_init
> > > > > 
> > > > > Config:
> > > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > > > > -m 3G,slots=4,maxmem=4G \
> > > > > -smp 4 \
> > > > > -machine type=q35,accel=kvm,cxl=on \
> > > > > -enable-kvm \
> > > > > -nographic \
> > > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > > > > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > > > > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \
> > > > > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \
> > > > > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
> > > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > > > > 
> > > > > 
> > > > > [   62.167518] BUG: kernel NULL pointer dereference, address: 00000000000004c0
> > > > > [   62.185069] #PF: supervisor read access in kernel mode
> > > > > [   62.198502] #PF: error_code(0x0000) - not-present page
> > > > > [   62.211019] PGD 0 P4D 0
> > > > > [   62.220521] Oops: 0000 [#1] PREEMPT SMP PTI
> > > > > [   62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 6.2.0-rc1+ #1
> > > > > [   62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> > > > > [   62.258432] Adding 2939900k swap on /dev/zram0.  Priority:100 extents:1 across:2939900k SSDscFS
> > > > > [   62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > > > > [   62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> > > > > [   62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > > > > [   62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> > > > > [   62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> > > > > [   62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> > > > > [   62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> > > > > [   62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> > > > > [   62.285541] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> > > > > [   62.285542] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [   62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0
> > > > > [   62.285653] Call Trace:
> > > > > [   62.285656]  <TASK>
> > > > > [   62.285660]  cxl_bus_probe+0x17/0x50
> > > > > [   62.285691]  really_probe+0xde/0x380
> > > > > [   62.285695]  ? pm_runtime_barrier+0x50/0x90
> > > > > [   62.285700]  __driver_probe_device+0x78/0x170
> > > > > [   62.285846]  driver_probe_device+0x1f/0x90
> > > > > [   62.285850]  __driver_attach+0xd2/0x1c0
> > > > > [   62.285853]  ? __pfx___driver_attach+0x10/0x10
> > > > > [   62.285856]  bus_for_each_dev+0x76/0xa0
> > > > > [   62.285860]  bus_add_driver+0x1b1/0x200
> > > > > [   62.285863]  driver_register+0x89/0xe0
> > > > > [   62.285868]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
> > > > > [   62.285874]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
> > > > > [   62.285880]  do_one_initcall+0x6e/0x330
> > > > > [   62.285888]  do_init_module+0x4a/0x200
> > > > > [   62.285892]  __do_sys_finit_module+0x93/0xf0
> > > > > [   62.285899]  do_syscall_64+0x5b/0x80
> > > > > [   62.285904]  ? do_syscall_64+0x67/0x80
> > > > > [   62.285906]  ? asm_exc_page_fault+0x22/0x30
> > > > > [   62.285910]  ? lockdep_hardirqs_on+0x7d/0x100
> > > > > [   62.285914]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > > > > [   62.285917] RIP: 0033:0x7f2619b0afbd
> > > > > [   62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 8
> > > > > [   62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > > > > [   62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 00007f2619b0afbd
> > > > > [   62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 0000000000000006
> > > > > [   62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 00007f261a17bb52
> > > > > [   62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000020000
> > > > > [   62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 00005557c0dc18a0
> > > > > [   62.285932]  </TASK>
> > > > > [   62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc soundcore drm_ttm_g
> > > > > [   62.285954] CR2: 00000000000004c0
> > > > > [   62.288385] ---[ end trace 0000000000000000 ]---
> > > > > [   63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > > > > [   63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 4c 89 0
> > > > > [   63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > > > > [   63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 0000000000000000
> > > > > [   63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 00000000ffffffff
> > > > > [   63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 0000000000000001
> > > > > [   63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: ffff97a8a37b8000
> > > > > [   63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 0000000000000000
> > > > > [   63.203580] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) knlGS:0000000000000000
> > > > > [   63.203583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [   63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 00000000000006e0    
> > > 
> > > Possibly replicated.  What I did was stop cxl_pmem.ko being probed automatically and
> > > added it manually later. Trace that results is certainly similar to yours.
> > > 
> > > Now the MODULE_SOFTDEP() in drivers/cxl/acpi.c should stop that happening
> > > assuming you are letting autoloading run.
> > > I wonder if there is a path in which it doesn't?  
> > 
> > Gregory, would you mind checking if
> > cxl_nvb is NULL here...
> > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67
> > (printk before it is used should work).
> > 
> > Might also be worth checking cxl_nvd and cxl_ds
> > but my guess is cxl_nvb is our problem (it is when I deliberate change
> > the load order).
> 
> Whilst I still have no idea if this is the same problem, I have identified
> what goes wrong if there is a module probe ordering issue.
> https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/core/pmem.c#L306
> 
> 	/*
> 	 * The two actions below arrange for @cxl_nvd to be deleted when either
> 	 * the top-level PMEM bridge goes down, or the endpoint device goes
> 	 * through ->remove().
> 	 */
> 	device_lock(&cxl_nvb->dev);
> 	if (cxl_nvb->dev.driver)
> 		rc = devm_add_action_or_reset(&cxl_nvb->dev, cxl_nvd_unregister,
> 					      cxl_nvd);
> 	else
> // bridge driver not loaded, so we hit this path.
> 		rc = -ENXIO;
> 	device_unlock(&cxl_nvb->dev);
> 
> 	if (rc)
> /// and this one
> 		goto err_alloc;
> 
> 	/* @cxlmd carries a reference on @cxl_nvb until cxlmd_release_nvdimm */
> 	return devm_add_action_or_reset(&cxlmd->dev, cxlmd_release_nvdimm, cxlmd);
> 
> err:
> 	put_device(dev);
> err_alloc:
> 	cxlmd->cxl_nvb = NULL;
> 	cxlmd->cxl_nvd = NULL;
> 	put_device(&cxl_nvb->dev);
> // whilst we scrub the pointers we don't actually get rid of the
> // cxl_nvd that we registered.  Hence later load of the driver tries to
> // attach to that and boom because we've scrubbed these pointers here.
> // A quick hack is to just call device_del(&cxl_nvd->dev) if rc = -ENXIO here.
> // There may well be a races though....
> 	return rc;
> }
> EXPORT_SYMBOL_NS_GPL(devm_cxl_add_nvdimm, CXL);
> 
> 
> Of course this "fix" just stops things blowing up, it doesn't leave things
> in a remotely useful state.  If it's triggered because someone
> is messing with the load order that's fine.  If the same issue
> is occurring for Gregory, not so much. 

Apologies for not engaging on this sooner, I have been heads down on
trying to get pre-existing region enumeration on its feet.

One workaround for this is to just preclude the load order scenario from
happening by making the pmem enabling part of the core module. There is
not much reason for it to exist indepdendently.

...but I think it is useful to be able to have some policy for disabling
all pmem enabling in an emergency. So I think fixing the cxl_nvd
unregistration is the right way to go.

Dan Williams Jan. 20, 2023, 5:38 p.m. UTC | #24

Gregory Price wrote:
> On Thu, Jan 19, 2023 at 03:04:49PM +0000, Jonathan Cameron wrote:
> > Gregory, would you mind checking if
> > cxl_nvb is NULL here...
> > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67
> > (printk before it is used should work).
> > 
> > Might also be worth checking cxl_nvd and cxl_ds
> > but my guess is cxl_nvb is our problem (it is when I deliberate change
> > the load order).
> > 
> > Jonathan
> > 
> 
> This is exactly the issue.  cxl_nvb is null, the rest appear fine.
> 
> Also, note, that weirdly the non-volatile bridge shows up when launching
> this in volatile mode, but no stack trace appears.
> 
> ¯\_(ツ)_/¯
> 
> After spending way too much time tracing through the current cxl driver
> code, i have only really determined that
> 
> 1) The code is very pmem oriented, and it's unclear to me how the driver
>    as-is differentiates a persistent device from a volatile device. That
> 	 code path still completely escapes me.  The only differentiating code
> 	 i see is in the memdev probe path that creates mem#/pmem and mem#/ram

Yes, pmem was the initial focus because it had the most dependency on
the OS to setup vs BIOS, but the ram enabling is at the top of the queue
now.

> 
> 2) The code successfully manages probe, enable, and mount a REAL device
>    - cxl memdev appears (/sys/bus/cxl/devices/mem0)
> 	 - a dax device appears (/sys/bus/dax/devices/)
> 	   This happens at boot, which I assume must be bios related

As it stands currently that dax device and the cxl device are not
related since a default dax-device is loaded just based on the presence
of an EFI_MEMORY_SP address range in the address map. With the new ram
enabling that default device will be elided and CXL will register a
dax-device parented by a cxl region.

> 	 - The memory *does not* auto-online, instead the dax device can be
> 	   onlined as system-ram *manually* via ndctl and friends

That *manually* part is the problem that needs distro help to solve. It
should be the case that by default all Linux distributions auto-online
all dax-devices. If that happens to online memory that is too slow for
general use, or too high-performance / precious for general purpose use
then the administrator can set policy after the fact. Unfortunately user
policy can not be applied if these memory ranges were onlined by the
kernel at boot , so that's why the kernel policy defaults to not-online.

In other words, there is no guarantee that memory that was assigned to
the general purpose pool at boot can be removed. The only guaranteed
behavior is to never give the memory to the core kernel in the first
instance and always let user policy route the memory.

> 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless
>    of the type-3 device configuration (pmem-only or vmem-only)

Correct, the top-level bus code (cxl_acpi) and the endpoint code
(cxl_mem, cxl_port) need to handshake before establishing regions. For
pmem regions the platform needs to claim the availability of a pmem
capable CXL window.

> 
>    # CFMW defined
>    [root@fedora ~]# ls /sys/bus/cxl/devices/
>    decoder0.0  decoder2.0  mem0            port1
>    decoder1.0  endpoint2   nvdimm-bridge0  root0
> 
>    # CFMW not defined
> 	 [root@fedora ~]# ls /sys/bus/cxl/devices/
>    decoder1.0  decoder2.0  endpoint2  mem0  port1  root0
> 
> 4) As you can see above, multiple decoders are registered.  I'm not sure
>    if that's correct or not, but it does seem odd given there's only one
> 	 cxl type-3 device.  Odd that decoder0.0 shows up when CFMW is there,
> 	 but not when it isn't.

CXL windows are modeled as decoders hanging off the the CXL root device
(ACPI0017 on ACPI based platforms). An endpoint decoder can then map a
selection of that window.

> 
> 	 Note: All these tests have two root ports:
> 	 -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
>    -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
>    -device cxl-rp,id=rp1,bus=cxl.0,chassis=0,port=1,slot=1 \
> 
> 
> Don't know why I haven't thought of this until now, but is the CFMW code
> reporting something odd about what's behind it?  Is it assuming the
> devices are pmem?

No, the cxl_acpi code is just advertising platform decode possibilities
independent of what devices show up. Think of this like the PCI MMIO
space that gets allocated to a root bridge at the beginning of time.
That space may or may not get consumed based on what devices show up
downstream.

Gregory Price Jan. 20, 2023, 9:54 p.m. UTC | #25

On Fri, Jan 20, 2023 at 09:38:13AM -0800, Dan Williams wrote:
> As it stands currently that dax device and the cxl device are not
> related since a default dax-device is loaded just based on the presence
> of an EFI_MEMORY_SP address range in the address map. With the new ram
> enabling that default device will be elided and CXL will register a
> dax-device parented by a cxl region.
> 
> > 	 - The memory *does not* auto-online, instead the dax device can be
> > 	   onlined as system-ram *manually* via ndctl and friends
> 
> That *manually* part is the problem that needs distro help to solve. It
> should be the case that by default all Linux distributions auto-online
> all dax-devices. If that happens to online memory that is too slow for
> general use, or too high-performance / precious for general purpose use
> then the administrator can set policy after the fact. Unfortunately user
> policy can not be applied if these memory ranges were onlined by the
> kernel at boot , so that's why the kernel policy defaults to not-online.
> 
> In other words, there is no guarantee that memory that was assigned to
> the general purpose pool at boot can be removed. The only guaranteed
> behavior is to never give the memory to the core kernel in the first
> instance and always let user policy route the memory.
> 
> > 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless
> >    of the type-3 device configuration (pmem-only or vmem-only)
> 
> Correct, the top-level bus code (cxl_acpi) and the endpoint code
> (cxl_mem, cxl_port) need to handshake before establishing regions. For
> pmem regions the platform needs to claim the availability of a pmem
> capable CXL window.
> 
> > 4) As you can see above, multiple decoders are registered.  I'm not sure
> >    if that's correct or not, but it does seem odd given there's only one
> > 	 cxl type-3 device.  Odd that decoder0.0 shows up when CFMW is there,
> > 	 but not when it isn't.
> 
> CXL windows are modeled as decoders hanging off the the CXL root device
> (ACPI0017 on ACPI based platforms). An endpoint decoder can then map a
> selection of that window.
> 
> > Don't know why I haven't thought of this until now, but is the CFMW code
> > reporting something odd about what's behind it?  Is it assuming the
> > devices are pmem?
> 
> No, the cxl_acpi code is just advertising platform decode possibilities
> independent of what devices show up. Think of this like the PCI MMIO
> space that gets allocated to a root bridge at the beginning of time.
> That space may or may not get consumed based on what devices show up
> downstream.

Thank you for the explanation Dan, and thank you for you patience
@JCameron.  I'm fairly sure I grok it now.

Summarizing to make sure: the cxl driver is providing what would be the
CXL.io (control) path, and the CXL.mem path is basically being simulated
by what otherwise would be a traditional PCI memory region. This explains
why turning off Legacy mode drops the dax devices, and why the topology
looks strange - the devices are basically attached in 2 different ways.

Might there be interest from the QEMU community to implement this
legacy-style setup in the short term, in an effort to test the the
control path of type-3 devices while we wait for the kernel to catch up?

Or should we forget this mode ever existed and just barrel forward
with HDM decoders and writing the kernel code to hook up the underlying
devices in drivers/cxl?

Dan Williams Jan. 20, 2023, 10:41 p.m. UTC | #26

Gregory Price wrote:
> On Fri, Jan 20, 2023 at 09:38:13AM -0800, Dan Williams wrote:
> > As it stands currently that dax device and the cxl device are not
> > related since a default dax-device is loaded just based on the presence
> > of an EFI_MEMORY_SP address range in the address map. With the new ram
> > enabling that default device will be elided and CXL will register a
> > dax-device parented by a cxl region.
> > 
> > > 	 - The memory *does not* auto-online, instead the dax device can be
> > > 	   onlined as system-ram *manually* via ndctl and friends
> > 
> > That *manually* part is the problem that needs distro help to solve. It
> > should be the case that by default all Linux distributions auto-online
> > all dax-devices. If that happens to online memory that is too slow for
> > general use, or too high-performance / precious for general purpose use
> > then the administrator can set policy after the fact. Unfortunately user
> > policy can not be applied if these memory ranges were onlined by the
> > kernel at boot , so that's why the kernel policy defaults to not-online.
> > 
> > In other words, there is no guarantee that memory that was assigned to
> > the general purpose pool at boot can be removed. The only guaranteed
> > behavior is to never give the memory to the core kernel in the first
> > instance and always let user policy route the memory.
> > 
> > > 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless
> > >    of the type-3 device configuration (pmem-only or vmem-only)
> > 
> > Correct, the top-level bus code (cxl_acpi) and the endpoint code
> > (cxl_mem, cxl_port) need to handshake before establishing regions. For
> > pmem regions the platform needs to claim the availability of a pmem
> > capable CXL window.
> > 
> > > 4) As you can see above, multiple decoders are registered.  I'm not sure
> > >    if that's correct or not, but it does seem odd given there's only one
> > > 	 cxl type-3 device.  Odd that decoder0.0 shows up when CFMW is there,
> > > 	 but not when it isn't.
> > 
> > CXL windows are modeled as decoders hanging off the the CXL root device
> > (ACPI0017 on ACPI based platforms). An endpoint decoder can then map a
> > selection of that window.
> > 
> > > Don't know why I haven't thought of this until now, but is the CFMW code
> > > reporting something odd about what's behind it?  Is it assuming the
> > > devices are pmem?
> > 
> > No, the cxl_acpi code is just advertising platform decode possibilities
> > independent of what devices show up. Think of this like the PCI MMIO
> > space that gets allocated to a root bridge at the beginning of time.
> > That space may or may not get consumed based on what devices show up
> > downstream.
> 
> Thank you for the explanation Dan, and thank you for you patience
> @JCameron.  I'm fairly sure I grok it now.
> 
> Summarizing to make sure: the cxl driver is providing what would be the
> CXL.io (control) path, and the CXL.mem path is basically being simulated
> by what otherwise would be a traditional PCI memory region. This explains
> why turning off Legacy mode drops the dax devices, and why the topology
> looks strange - the devices are basically attached in 2 different ways.
> 
> Might there be interest from the QEMU community to implement this
> legacy-style setup in the short term, in an effort to test the the
> control path of type-3 devices while we wait for the kernel to catch up?
> 
> Or should we forget this mode ever existed and just barrel forward
> with HDM decoders and writing the kernel code to hook up the underlying
> devices in drivers/cxl?

Which mode are you referring?

The next steps for the kernel enabling relevant to this thread are:

* ram region discovery (platform firmware or kexec established)
* ram region creation
* pmem region discovery (from labels)

Jonathan Cameron Jan. 23, 2023, 9:44 a.m. UTC | #27

On Fri, 20 Jan 2023 14:41:05 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Gregory Price wrote:
> > On Fri, Jan 20, 2023 at 09:38:13AM -0800, Dan Williams wrote:  
> > > As it stands currently that dax device and the cxl device are not
> > > related since a default dax-device is loaded just based on the presence
> > > of an EFI_MEMORY_SP address range in the address map. With the new ram
> > > enabling that default device will be elided and CXL will register a
> > > dax-device parented by a cxl region.
> > >   
> > > > 	 - The memory *does not* auto-online, instead the dax device can be
> > > > 	   onlined as system-ram *manually* via ndctl and friends  
> > > 
> > > That *manually* part is the problem that needs distro help to solve. It
> > > should be the case that by default all Linux distributions auto-online
> > > all dax-devices. If that happens to online memory that is too slow for
> > > general use, or too high-performance / precious for general purpose use
> > > then the administrator can set policy after the fact. Unfortunately user
> > > policy can not be applied if these memory ranges were onlined by the
> > > kernel at boot , so that's why the kernel policy defaults to not-online.
> > > 
> > > In other words, there is no guarantee that memory that was assigned to
> > > the general purpose pool at boot can be removed. The only guaranteed
> > > behavior is to never give the memory to the core kernel in the first
> > > instance and always let user policy route the memory.
> > >   
> > > > 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless
> > > >    of the type-3 device configuration (pmem-only or vmem-only)  
> > > 
> > > Correct, the top-level bus code (cxl_acpi) and the endpoint code
> > > (cxl_mem, cxl_port) need to handshake before establishing regions. For
> > > pmem regions the platform needs to claim the availability of a pmem
> > > capable CXL window.
> > >   
> > > > 4) As you can see above, multiple decoders are registered.  I'm not sure
> > > >    if that's correct or not, but it does seem odd given there's only one
> > > > 	 cxl type-3 device.  Odd that decoder0.0 shows up when CFMW is there,
> > > > 	 but not when it isn't.  
> > > 
> > > CXL windows are modeled as decoders hanging off the the CXL root device
> > > (ACPI0017 on ACPI based platforms). An endpoint decoder can then map a
> > > selection of that window.
> > >   
> > > > Don't know why I haven't thought of this until now, but is the CFMW code
> > > > reporting something odd about what's behind it?  Is it assuming the
> > > > devices are pmem?  
> > > 
> > > No, the cxl_acpi code is just advertising platform decode possibilities
> > > independent of what devices show up. Think of this like the PCI MMIO
> > > space that gets allocated to a root bridge at the beginning of time.
> > > That space may or may not get consumed based on what devices show up
> > > downstream.  
> > 
> > Thank you for the explanation Dan, and thank you for you patience
> > @JCameron.  I'm fairly sure I grok it now.
> > 
> > Summarizing to make sure: the cxl driver is providing what would be the
> > CXL.io (control) path, and the CXL.mem path is basically being simulated
> > by what otherwise would be a traditional PCI memory region. This explains
> > why turning off Legacy mode drops the dax devices, and why the topology
> > looks strange - the devices are basically attached in 2 different ways.
> > 
> > Might there be interest from the QEMU community to implement this
> > legacy-style setup in the short term, in an effort to test the the
> > control path of type-3 devices while we wait for the kernel to catch up?

I'd happily review such code, but it's not on my list of things to work on
otherwise. Too many other things to support!

Jonathan

> > 
> > Or should we forget this mode ever existed and just barrel forward
> > with HDM decoders and writing the kernel code to hook up the underlying
> > devices in drivers/cxl?  
> 
> Which mode are you referring?
> 
> The next steps for the kernel enabling relevant to this thread are:
> 
> * ram region discovery (platform firmware or kexec established)
> * ram region creation
> * pmem region discovery (from labels)

Gregory Price Jan. 23, 2023, 6:16 p.m. UTC | #28

On Fri, Jan 20, 2023 at 02:41:05PM -0800, Dan Williams wrote:
> 
> Which mode are you referring?
> 
> The next steps for the kernel enabling relevant to this thread are:
> 
> * ram region discovery (platform firmware or kexec established)
> * ram region creation
> * pmem region discovery (from labels)

On some bios there is an option for "legacy cxl" which produces the
nvdimm-esque behavior described above.  After further investigation,
supporting this would require a bigger lift than i expected since the
existing emulated CXL devices don't carry any kind of node assignment
and we'd have to plumb that through

Not worth it, better off working towards the new stuff.

[0/8] hw/cxl: CXL emulation cleanups and minor fixes for upstream

Message

Comments