Message ID: 20240516081202.27023-1-alucerop@amd.com
Series: RFC: add Type2 device support
alucerop@ wrote:
> From: Alejandro Lucero <alejandro.lucero-palau@amd.com>
>
> I need to start this RFC explaining not what the patchset does, but what we expected to obtain and is currently in doubt: to configure a CXL memory range advertised by our CXL network device and to use it "privately".
>
> The main reason behind that privacy, but not the only one, is to avoid arbitrary use by any user-space "client" with enough privileges to use a public interface for mapping the device memory and using it as regular memory. The reason is obvious: the device expects writes to such memory in a specific format.
>
> The doubt comes from the fact that, after implementing the functionality exposed in this patchset, we realized the current expectation seems to be a BIOS/UEFI configuring HDM decoders or HPA ranges and passing memory ranges to the kernel, with the kernel taking a default action on that memory range based on whether the flag EFI_MEMORY_SP is set.

Let's clear up this confusion first.

My expectation for CXL accelerator-provided memory is that it is marked "EFI Reserved" in the memory map (or CDAT), not "EFI Conventional + Special Purpose Attribute". Recall that the EFI_MEMORY_SP attribute is just a *hint* that the OS may *not* want to consider a memory range as part of the general purpose memory pool. CXL accelerator memory is far from that definition: if a driver needs to be loaded to effectively use a memory range, that is not a range that can be transparently converted to "EFI Conventional Memory".

> If it is not set, the memory range will be part of the kernel memory management, which we certainly do not want. If the flag is set, a DAX device will be created which allows a user-space client to map such memory through the DAX device API, which we also prefer to avoid. I know this is likely going to face opposition, but we see this RFC as the opportunity for discussing the matter and, if it turns out to be the case, for being guided towards the proper solution accepted by the maintainers/community. This patchset does not tackle this default kernel behaviour, although we already have some ideas and workarounds for the short term. We'll be happy to discuss this openly.

It may have been subtle, but even in my initial RFC [1] I was explicitly skipping exposing accelerator memory via device-dax:

@@ -3085,6 +3183,15 @@ static int cxl_region_probe(struct device *dev)
 				p->res->start, p->res->end, cxlr,
 				is_system_ram) > 0)
 			return 0;
+
+		/*
+		 * HDM-D[B] (device-memory) regions have accelerator
+		 * specific usage, skip device-dax registration.
+		 */
+		if (cxlr->type == CXL_DECODER_DEVMEM)
+			return 0;
+
+		/* HDM-H routes to device-dax */
 		return devm_cxl_add_dax_region(cxlr);
 	default:
 		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",

[1] https://lore.kernel.org/r/168592159835.1948938.1647215579839222774.stgit@dwillia2-xfh.jf.intel.com

> So, this patchset assumes a BIOS/UEFI not programming the HDM decoder or HPA ranges for a CXL Type2 device.

At least for Linux it should not care whether the BIOS maps it or not. The expectation is that whichever agent maps it, BIOS or Linux CXL core, that agent honors the memory type specified in the device CDAT. I.e. the "Device Scoped EFI Memory Type Structure (DSEMTS)" in the accelerator CDAT should pass "2" in the "EFI Memory Type and Attribute" field, where "2" indicates "EFI Reserved" memory type for any HPA that maps this device's DPA.
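For illustration, this is roughly the gate a kernel agent could apply before letting such a range anywhere near the page allocator or device-dax. A minimal sketch only: cdat_memtype_for_dpa() is a hypothetical helper that would return the DSEMTS "EFI Memory Type and Attribute" value for a DPA range; the "2 == EFI Reserved" encoding comes from the paragraph above.

/* Sketch: cdat_memtype_for_dpa() is hypothetical, not an existing API */
#define CDAT_DSEMTS_EFI_RESERVED	2

static bool cxl_range_is_driver_private(struct cxl_memdev *cxlmd,
					u64 base, u64 len)
{
	/* Reserved DPA must never reach mm or device-dax */
	return cdat_memtype_for_dpa(cxlmd, base, len) ==
		CDAT_DSEMTS_EFI_RESERVED;
}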
> Although maybe a weird assumption as explained above, a Type2 device added after boot, that is hotplugged, will likely need the changes added here. Exporting some of the CXL core for vendor drivers is also required for avoiding code duplication when reading DVSEC or decoder registers. Finally, if there is no such HPA range assigned, whatever the reason, this patchset offers some way of obtaining one and/or finding out what is the problem behind the lack of such HPA range.

Yes, that was the whole point of the "Device Memory Setup" thread [2]: to show a plausible way to reuse the common core infrastructure for accelerators with CXL.mem capability.

[2] https://lore.kernel.org/all/168592149709.1948938.8663425987110396027.stgit@dwillia2-xfh.jf.intel.com/

It would help me if you can summarize what you did and did not adopt from that proposal, i.e. where it helped and where it missed the mark for your use case.

[..]

> Keeping with kernel rules of having a client use any new functionality added, a CXL type2 driver is added incrementally. This is not a driver supporting a real device but an emulated one in QEMU, with the QEMU patch following this patchset.

Let's start with the deltas compared to [2]. The observation there is that much of the CXL core is reusable for this type-2 use case, and reusing it even improves the organization of the core. The consumer of the refactoring just ends up being a few new helpers around the core.

> The reason for adding such a Type2 driver instead of changes to a current kernel driver or a new driver is threefold:
>
> 1) the expected kernel driver to use the functionality added is a netdev one. Current internal CXL support is a codesign effort, therefore software and hardware evolving in lockstep. Adding changes to a netdev driver requires the full functionality and doing things following the netdev standard, which is not the best option for this development stage.

Not sure what "netdev standard" means, and not sure it matters for a discussion that is constrained to how the CXL core should evolve to accommodate accelerator-local memory.

> 2) Waiting for completing the development will delay the required Type2 support, and most of the required changes are unrelated to specific CXL usage by any vendor driver.

Again, [2] already cleared the path for the idea that early plumbing of the CXL core for type-2 is in scope. Just need to make sure this effort is focused on general CXL specification use cases.

> 3) Type2 support will need some testing infrastructure, unit tests for ensuring Type2 devices are working, and module tests for ensuring CXL core changes do not affect Type2 support.
>
> I hope these reasons are convincing enough.

I am happy to have general specification discussions about core functionality that all CXL accelerators need and plumbing those ABIs to be exercised via something like cxl_test.

I expect that cxl_test only covers coarse-grained inter-module ABIs and that the fine details will need to wait until the accelerator specifics are closer to being disclosed and the real driver patches can start flowing. In other words, I am not keen to see QEMU used as a way to proxy features into the kernel without disclosing what the real world consumer actually needs.

Now, that said, there may also be opportunity for targeted extensions of the QEMU CXL memory-expander device, like DPA capacity that is mapped as "EFI Reserved" device-memory, but let's talk about the specifics here.

Also recall that the current CXL driver does not yet support parsing CXL PMEM region labels. Support for that ends up looking quite similar to an accelerator asking for a specific DPA range to be dynamically mapped by the CXL core. So I think we can get quite far along the path to enabling type-2-CXL.mem just by flushing out some of the generic type-3-CXL.mem use cases.

> I have decided to follow a gradual approach for adding such a driver using the exported CXL functions and structs. I think it is easier to review the driver when the new functionality is added than to add the driver at the end, but not a big deal if my approach is not liked.
>
> The patches are based on a patchset sent by Dan Williams [1] which was just partially integrated, mostly related to making things ready for Type2 but none related to specific Type2 support. Those patches based on Dan's work have Dan's signing, so Dan, tell me if you do not want me to add you.

It is fine to reuse those patches, but please do include:

Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Link: <lore-link for copied code>

...so I can try to remember if I still think the source patch was a good idea or not. Again, a summary analysis of what did and did not work for you from that posting would help me get a running start at reviewing this.

> Type2 implies, I think, only the related driver to manage the CXL specifics. This means no user space intervention and therefore no sysfs files. This makes it easy to avoid the current problem of most of the sysfs related code expecting Type3 devices. If I'm wrong in this regard, such code will need further changes.

Not sure what you mean here. I am hoping that the sysfs ABI for enumerating objects like memdevs, decoders, and regions "just works" and is hidden from the accelerator vendor driver.

> A final note about CXL.cache is needed. This patchset does not cover it at all, although the emulated Type2 device advertises it. From the kernel point of view, supporting CXL.cache will imply being sure the CXL path supports what the Type2 device needs. A device accelerator will likely be connected to a Root Switch, but other configurations can not be discarded. Therefore the kernel will need to check not just HPA, DPA, interleave and granularity, but also the available CXL.cache support and resources in each switch in the CXL path to the Type2 device. I expect to contribute to this support in the following months, and it would be good to discuss it when possible.

Yup, especially CXL 3.x Cache ID management. Let's talk about some ways to plumb that, because it is also a general CXL specification capability. Another topic is what is needed for CXL Protocol and Component error handling; I expect some of that to be CXL core common and some of that to be accelerator driver augmented.

If you think we will still be talking about this driver enabling in September, I would invite you to submit this as a topic for the CXL uConf.

https://lpc.events/event/18/contributions/1673/
On 5/17/24 01:08, Dan Williams wrote:
> alucerop@ wrote:
>> From: Alejandro Lucero <alejandro.lucero-palau@amd.com>
>>
>> I need to start this RFC explaining not what the patchset does, but what we expected to obtain and is currently in doubt: to configure a CXL memory range advertised by our CXL network device and to use it "privately".
>>
>> The main reason behind that privacy, but not the only one, is to avoid arbitrary use by any user-space "client" with enough privileges to use a public interface for mapping the device memory and using it as regular memory. The reason is obvious: the device expects writes to such memory in a specific format.
>>
>> The doubt comes from the fact that, after implementing the functionality exposed in this patchset, we realized the current expectation seems to be a BIOS/UEFI configuring HDM decoders or HPA ranges and passing memory ranges to the kernel, with the kernel taking a default action on that memory range based on whether the flag EFI_MEMORY_SP is set.
>
> Let's clear up this confusion first.
>
> My expectation for CXL accelerator-provided memory is that it is marked "EFI Reserved" in the memory map (or CDAT), not "EFI Conventional + Special Purpose Attribute". Recall that the EFI_MEMORY_SP attribute is just a *hint* that the OS may *not* want to consider a memory range as part of the general purpose memory pool. CXL accelerator memory is far from that definition: if a driver needs to be loaded to effectively use a memory range, that is not a range that can be transparently converted to "EFI Conventional Memory".

Well, that "EFI Reserved" is what we want. However, AFAIK, EFI_MEMORY_SP is the only one being discussed with our BIOS/UEFI contacts. Because how to assign those flags seems to depend on the BIOS implementation, we expect some BIOS versions to do the wrong thing for our purposes, so I have been working on a workaround involving udev events and executing a modified daxctl for switching a dax device mode or even removing/destroying it. Anyway, that "EFI Reserved" flag gives us hope.

>> If it is not set, the memory range will be part of the kernel memory management, which we certainly do not want. If the flag is set, a DAX device will be created which allows a user-space client to map such memory through the DAX device API, which we also prefer to avoid. I know this is likely going to face opposition, but we see this RFC as the opportunity for discussing the matter and, if it turns out to be the case, for being guided towards the proper solution accepted by the maintainers/community. This patchset does not tackle this default kernel behaviour, although we already have some ideas and workarounds for the short term. We'll be happy to discuss this openly.
>
> It may have been subtle, but even in my initial RFC [1] I was explicitly skipping exposing accelerator memory via device-dax:

I did not see that. So, happy to know we are on the same page here.

> @@ -3085,6 +3183,15 @@ static int cxl_region_probe(struct device *dev)
>  				p->res->start, p->res->end, cxlr,
>  				is_system_ram) > 0)
>  			return 0;
> +
> +		/*
> +		 * HDM-D[B] (device-memory) regions have accelerator
> +		 * specific usage, skip device-dax registration.
> +		 */
> +		if (cxlr->type == CXL_DECODER_DEVMEM)
> +			return 0;
> +
> +		/* HDM-H routes to device-dax */
>  		return devm_cxl_add_dax_region(cxlr);
>  	default:
>  		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
>
> [1] https://lore.kernel.org/r/168592159835.1948938.1647215579839222774.stgit@dwillia2-xfh.jf.intel.com
>
>> So, this patchset assumes a BIOS/UEFI not programming the HDM decoder or HPA ranges for a CXL Type2 device.
>
> At least for Linux it should not care whether the BIOS maps it or not. The expectation is that whichever agent maps it, BIOS or Linux CXL core, that agent honors the memory type specified in the device CDAT. I.e. the "Device Scoped EFI Memory Type Structure (DSEMTS)" in the accelerator CDAT should pass "2" in the "EFI Memory Type and Attribute" field, where "2" indicates "EFI Reserved" memory type for any HPA that maps this device's DPA.
>
>> Although maybe a weird assumption as explained above, a Type2 device added after boot, that is hotplugged, will likely need the changes added here. Exporting some of the CXL core for vendor drivers is also required for avoiding code duplication when reading DVSEC or decoder registers. Finally, if there is no such HPA range assigned, whatever the reason, this patchset offers some way of obtaining one and/or finding out what is the problem behind the lack of such HPA range.
>
> Yes, that was the whole point of the "Device Memory Setup" thread [2]: to show a plausible way to reuse the common core infrastructure for accelerators with CXL.mem capability.
>
> [2] https://lore.kernel.org/all/168592149709.1948938.8663425987110396027.stgit@dwillia2-xfh.jf.intel.com/
>
> It would help me if you can summarize what you did and did not adopt from that proposal, i.e. where it helped and where it missed the mark for your use case.

I basically took your patchset referenced in the RFC, specifically the 19th one, and tried to work with it from a pcie/CXL device driver. I debugged all the problems preventing the invoked CXL function from succeeding, applying the required changes from previous patches not committed yet. Patches 7, 8, 9 and 11 are based on your patchset, with just some minor changes, since I did not see a reason for changing them after studying them. The other patches are those fixes for having a Type2 finally able to create a dax region and map it.

> [..]
>> Keeping with kernel rules of having a client use any new functionality added, a CXL type2 driver is added incrementally. This is not a driver supporting a real device but an emulated one in QEMU, with the QEMU patch following this patchset.
>
> Let's start with the deltas compared to [2]. The observation there is that much of the CXL core is reusable for this type-2 use case, and reusing it even improves the organization of the core. The consumer of the refactoring just ends up being a few new helpers around the core.
>
>> The reason for adding such a Type2 driver instead of changes to a current kernel driver or a new driver is threefold:
>>
>> 1) the expected kernel driver to use the functionality added is a netdev one. Current internal CXL support is a codesign effort, therefore software and hardware evolving in lockstep. Adding changes to a netdev driver requires the full functionality and doing things following the netdev standard, which is not the best option for this development stage.
>
> Not sure what "netdev standard" means, and not sure it matters for a discussion that is constrained to how the CXL core should evolve to accommodate accelerator-local memory.

You know I could not add to a netdev driver just those changes for CXL initialization, then mapping a CXL region and doing nothing else. Such a mapping needs to be linked to something inside the driver which is still under development and testing.

>> 2) Waiting for completing the development will delay the required Type2 support, and most of the required changes are unrelated to specific CXL usage by any vendor driver.
>
> Again, [2] already cleared the path for the idea that early plumbing of the CXL core for type-2 is in scope. Just need to make sure this effort is focused on general CXL specification use cases.
>
>> 3) Type2 support will need some testing infrastructure, unit tests for ensuring Type2 devices are working, and module tests for ensuring CXL core changes do not affect Type2 support.
>>
>> I hope these reasons are convincing enough.
>
> I am happy to have general specification discussions about core functionality that all CXL accelerators need and plumbing those ABIs to be exercised via something like cxl_test.
>
> I expect that cxl_test only covers coarse-grained inter-module ABIs and that the fine details will need to wait until the accelerator specifics are closer to being disclosed and the real driver patches can start flowing. In other words, I am not keen to see QEMU used as a way to proxy features into the kernel without disclosing what the real world consumer actually needs.

Sure. FWIW, this will end up changing an internal driver mapping of HW buffers, which is now based on PCI BAR offsets, to the same mapping based on a CXL region. This involves PFs and VFs, and because CXL is only advertised by PF0, this all brings more complexity to those changes than could be expected.

I think taking the CXL core to the support needed, hopefully based on this RFC, should be started as soon as possible instead of waiting for our driver/device to be ready.

> Now, that said, there may also be opportunity for targeted extensions of the QEMU CXL memory-expander device, like DPA capacity that is mapped as "EFI Reserved" device-memory, but let's talk about the specifics here.

Yes. I'm actually using qemu hmem and kernel efi_fake_mem for some testing of those workarounds, so happy to help with new qemu support for the "EFI Reserved" case as well.

> Also recall that the current CXL driver does not yet support parsing CXL PMEM region labels. Support for that ends up looking quite similar to an accelerator asking for a specific DPA range to be dynamically mapped by the CXL core. So I think we can get quite far along the path to enabling type-2-CXL.mem just by flushing out some of the generic type-3-CXL.mem use cases.

OK. I'll take a look at those pmem needs.

>> I have decided to follow a gradual approach for adding such a driver using the exported CXL functions and structs. I think it is easier to review the driver when the new functionality is added than to add the driver at the end, but not a big deal if my approach is not liked.
>>
>> The patches are based on a patchset sent by Dan Williams [1] which was just partially integrated, mostly related to making things ready for Type2 but none related to specific Type2 support. Those patches based on Dan's work have Dan's signing, so Dan, tell me if you do not want me to add you.
>
> It is fine to reuse those patches, but please do include:
>
> Co-developed-by: Dan Williams <dan.j.williams@intel.com>
> Link: <lore-link for copied code>

Ok. I'll do the changes for next versions.

> ...so I can try to remember if I still think the source patch was a good idea or not. Again, a summary analysis of what did and did not work for you from that posting would help me get a running start at reviewing this.
>
>> Type2 implies, I think, only the related driver to manage the CXL specifics. This means no user space intervention and therefore no sysfs files. This makes it easy to avoid the current problem of most of the sysfs related code expecting Type3 devices. If I'm wrong in this regard, such code will need further changes.
>
> Not sure what you mean here. I am hoping that the sysfs ABI for enumerating objects like memdevs, decoders, and regions "just works" and is hidden from the accelerator vendor driver.

If a Type2 is fully private, should there be any sysfs files about it?

AFAIK, those files are mainly for user space being able to create memories by joining them, or for changing the dax mode. What do you think a Type2 needs through sysfs apart from info? While writing this I realize I have not thought about power management ...

>> A final note about CXL.cache is needed. This patchset does not cover it at all, although the emulated Type2 device advertises it. From the kernel point of view, supporting CXL.cache will imply being sure the CXL path supports what the Type2 device needs. A device accelerator will likely be connected to a Root Switch, but other configurations can not be discarded. Therefore the kernel will need to check not just HPA, DPA, interleave and granularity, but also the available CXL.cache support and resources in each switch in the CXL path to the Type2 device. I expect to contribute to this support in the following months, and it would be good to discuss it when possible.
>
> Yup, especially CXL 3.x Cache ID management. Let's talk about some ways to plumb that, because it is also a general CXL specification capability. Another topic is what is needed for CXL Protocol and Component error handling; I expect some of that to be CXL core common and some of that to be accelerator driver augmented.
>
> If you think we will still be talking about this driver enabling in September, I would invite you to submit this as a topic for the CXL uConf.

That would be great! If this RFC work is integrated by then with the CXL core, that would be fantastic, but the CXL.cache support will still be pending.

> https://lpc.events/event/18/contributions/1673/
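As a side note on the qemu hmem plus efi_fake_mem testing mentioned above: that setup boils down to a boot parameter of the form below, where 0x40000 is the EFI_MEMORY_SP attribute bit and the size/base values are only placeholders for illustration:

efi_fake_mem=4G@0x100000000:0x40000

i.e. the 4G range at 0x100000000 gets the Special Purpose hint and is routed to device-dax instead of the page allocator, which is exactly the default behaviour being debated here.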
Alejandro Lucero Palau wrote:
> On 5/17/24 01:08, Dan Williams wrote:
[..]
>> My expectation for CXL accelerator-provided memory is that it is marked "EFI Reserved" in the memory map (or CDAT), not "EFI Conventional + Special Purpose Attribute". Recall that the EFI_MEMORY_SP attribute is just a *hint* that the OS may *not* want to consider a memory range as part of the general purpose memory pool. CXL accelerator memory is far from that definition: if a driver needs to be loaded to effectively use a memory range, that is not a range that can be transparently converted to "EFI Conventional Memory".
>
> Well, that "EFI Reserved" is what we want. However, AFAIK, EFI_MEMORY_SP is the only one being discussed with our BIOS/UEFI contacts.

Well, tell your BIOS contacts that they need to honor what the device puts in the CDAT for the memory-type, not just make it up on their own.

From the Linux side we can scream loudly with WARN_TAINT_FIRMWARE_WORKAROUND every time we see a BIOS that has failed to match the EFI memory-type with whatever the device puts in the CDAT, because that's clearly a broken BIOS.

> Because how to assign those flags seems to depend on the BIOS implementation, we expect some BIOS versions to do the wrong thing for our purposes, so I have been working on a workaround involving udev events and executing a modified daxctl for switching a dax device mode or even removing/destroying it.

Yikes, no, let's not do that. Another thing we can do from the kernel side, in case the CDAT is missing, is to never attach device-dax to address ranges within CXL windows with the "Device Coherent Window Restriction".

[..]
>> It would help me if you can summarize what you did and did not adopt from that proposal, i.e. where it helped and where it missed the mark for your use case.
>
> I basically took your patchset referenced in the RFC, specifically the 19th one, and tried to work with it from a pcie/CXL device driver. I debugged all the problems preventing the invoked CXL function from succeeding, applying the required changes from previous patches not committed yet. Patches 7, 8, 9 and 11 are based on your patchset, with just some minor changes, since I did not see a reason for changing them after studying them. The other patches are those fixes for having a Type2 finally able to create a dax region and map it.

Ok, let's not add dax regions for this case, but the rest sounds straightforward, I'll take a look.

[..]
>>> The reason for adding such a Type2 driver instead of changes to a current kernel driver or a new driver is threefold:
>>>
>>> 1) the expected kernel driver to use the functionality added is a netdev one. Current internal CXL support is a codesign effort, therefore software and hardware evolving in lockstep. Adding changes to a netdev driver requires the full functionality and doing things following the netdev standard, which is not the best option for this development stage.
>>
>> Not sure what "netdev standard" means, and not sure it matters for a discussion that is constrained to how the CXL core should evolve to accommodate accelerator-local memory.
>
> You know I could not add to a netdev driver just those changes for CXL initialization, then mapping a CXL region and doing nothing else. Such a mapping needs to be linked to something inside the driver which is still under development and testing.

Ok, you're just talking about pre-plumbing the CXL bits before getting started on the netdev side. Makes sense to me.

>>> 2) Waiting for completing the development will delay the required Type2 support, and most of the required changes are unrelated to specific CXL usage by any vendor driver.
>>
>> Again, [2] already cleared the path for the idea that early plumbing of the CXL core for type-2 is in scope. Just need to make sure this effort is focused on general CXL specification use cases.
>>
>>> 3) Type2 support will need some testing infrastructure, unit tests for ensuring Type2 devices are working, and module tests for ensuring CXL core changes do not affect Type2 support.
>>>
>>> I hope these reasons are convincing enough.
>>
>> I am happy to have general specification discussions about core functionality that all CXL accelerators need and plumbing those ABIs to be exercised via something like cxl_test.
>>
>> I expect that cxl_test only covers coarse-grained inter-module ABIs and that the fine details will need to wait until the accelerator specifics are closer to being disclosed and the real driver patches can start flowing. In other words, I am not keen to see QEMU used as a way to proxy features into the kernel without disclosing what the real world consumer actually needs.
>
> Sure. FWIW, this will end up changing an internal driver mapping of HW buffers, which is now based on PCI BAR offsets, to the same mapping based on a CXL region.
>
> This involves PFs and VFs, and because CXL is only advertised by PF0, this all brings more complexity to those changes than could be expected.
>
> I think taking the CXL core to the support needed, hopefully based on this RFC, should be started as soon as possible instead of waiting for our driver/device to be ready.

I agree, but continued transparency on what the CXL use case looks like in the final implementation helps make sure the right upstream CXL mechanism is being built.

For example, just from that insight above I am wondering if we need some centralized interface for VFs to lookup / allocate CXL resources from their PF0.

[..]
>>> Type2 implies, I think, only the related driver to manage the CXL specifics. This means no user space intervention and therefore no sysfs files. This makes it easy to avoid the current problem of most of the sysfs related code expecting Type3 devices. If I'm wrong in this regard, such code will need further changes.
>>
>> Not sure what you mean here. I am hoping that the sysfs ABI for enumerating objects like memdevs, decoders, and regions "just works" and is hidden from the accelerator vendor driver.
>
> If a Type2 is fully private, should there be any sysfs files about it?
>
> AFAIK, those files are mainly for user space being able to create memories by joining them, or for changing the dax mode. What do you think a Type2 needs through sysfs apart from info? While writing this I realize I have not thought about power management ...

They are primarily for userspace to enumerate resources, and they provide kernel object names for CXL error events. For example userspace could generically map a CXL switch error to impacted endpoints without needing to know or care whether it was a type-2 or type-3 endpoint.

[..]
>> If you think we will still be talking about this driver enabling in September, I would invite you to submit this as a topic for the CXL uConf.
>
> That would be great!
>
> If this RFC work is integrated by then with the CXL core, that would be fantastic, but the CXL.cache support will still be pending.

Sounds good.
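To make the enumeration point concrete: even a fully driver-managed type-2 endpoint would still appear in the generic CXL object hierarchy, along the lines of the sketch below (object names are illustrative, not taken from this patchset):

/sys/bus/cxl/devices/mem0         # memdev for the accelerator endpoint
/sys/bus/cxl/devices/endpoint2    # endpoint port in the topology
/sys/bus/cxl/devices/decoder2.0   # endpoint decoder with the DPA mapping
/sys/bus/cxl/devices/region0      # region backing the driver-private range

A switch error report can then name region0/mem0 without caring whether the endpoint is type-2 or type-3.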
On 5/21/24 05:56, Dan Williams wrote:
> Alejandro Lucero Palau wrote:
>> On 5/17/24 01:08, Dan Williams wrote:
> [..]
>>> My expectation for CXL accelerator-provided memory is that it is marked "EFI Reserved" in the memory map (or CDAT), not "EFI Conventional + Special Purpose Attribute". Recall that the EFI_MEMORY_SP attribute is just a *hint* that the OS may *not* want to consider a memory range as part of the general purpose memory pool. CXL accelerator memory is far from that definition: if a driver needs to be loaded to effectively use a memory range, that is not a range that can be transparently converted to "EFI Conventional Memory".
>>
>> Well, that "EFI Reserved" is what we want. However, AFAIK, EFI_MEMORY_SP is the only one being discussed with our BIOS/UEFI contacts.
>
> Well, tell your BIOS contacts that they need to honor what the device puts in the CDAT for the memory-type, not just make it up on their own.
>
> From the Linux side we can scream loudly with WARN_TAINT_FIRMWARE_WORKAROUND every time we see a BIOS that has failed to match the EFI memory-type with whatever the device puts in the CDAT, because that's clearly a broken BIOS.

Yes, we need to sync with these guys :-)

EFI_MEMORY_SP was our/my only focus and that is probably the problem and the confusion.

BTW, how can the kernel warn about this? I guess this requires the driver raising it, doesn't it?

>> Because how to assign those flags seems to depend on the BIOS implementation, we expect some BIOS versions to do the wrong thing for our purposes, so I have been working on a workaround involving udev events and executing a modified daxctl for switching a dax device mode or even removing/destroying it.
>
> Yikes, no, let's not do that. Another thing we can do from the kernel side, in case the CDAT is missing, is to never attach device-dax to address ranges within CXL windows with the "Device Coherent Window Restriction".

I do not want to stake my reputation on this workaround, but I see it as a first and quick fallback not requiring kernel changes to CXL/DAX. My idea is to add some interface between drivers and DAX for doing this if necessary, once the driver tries to get the resource and it fails. Maybe with some kernel boot param like dax_allow_change=drivername or something similar. What else could we do for the case of the BIOS/kernel doing the wrong thing?

> [..]
>>> It would help me if you can summarize what you did and did not adopt from that proposal, i.e. where it helped and where it missed the mark for your use case.
>>
>> I basically took your patchset referenced in the RFC, specifically the 19th one, and tried to work with it from a pcie/CXL device driver. I debugged all the problems preventing the invoked CXL function from succeeding, applying the required changes from previous patches not committed yet. Patches 7, 8, 9 and 11 are based on your patchset, with just some minor changes, since I did not see a reason for changing them after studying them. The other patches are those fixes for having a Type2 finally able to create a dax region and map it.
>
> Ok, let's not add dax regions for this case, but the rest sounds straightforward, I'll take a look.
>
> [..]
>>>> The reason for adding such a Type2 driver instead of changes to a current kernel driver or a new driver is threefold:
>>>>
>>>> 1) the expected kernel driver to use the functionality added is a netdev one. Current internal CXL support is a codesign effort, therefore software and hardware evolving in lockstep. Adding changes to a netdev driver requires the full functionality and doing things following the netdev standard, which is not the best option for this development stage.
>>>
>>> Not sure what "netdev standard" means, and not sure it matters for a discussion that is constrained to how the CXL core should evolve to accommodate accelerator-local memory.
>>
>> You know I could not add to a netdev driver just those changes for CXL initialization, then mapping a CXL region and doing nothing else. Such a mapping needs to be linked to something inside the driver which is still under development and testing.
>
> Ok, you're just talking about pre-plumbing the CXL bits before getting started on the netdev side. Makes sense to me.
>
>>>> 2) Waiting for completing the development will delay the required Type2 support, and most of the required changes are unrelated to specific CXL usage by any vendor driver.
>>>
>>> Again, [2] already cleared the path for the idea that early plumbing of the CXL core for type-2 is in scope. Just need to make sure this effort is focused on general CXL specification use cases.
>>>
>>>> 3) Type2 support will need some testing infrastructure, unit tests for ensuring Type2 devices are working, and module tests for ensuring CXL core changes do not affect Type2 support.
>>>>
>>>> I hope these reasons are convincing enough.
>>>
>>> I am happy to have general specification discussions about core functionality that all CXL accelerators need and plumbing those ABIs to be exercised via something like cxl_test.
>>>
>>> I expect that cxl_test only covers coarse-grained inter-module ABIs and that the fine details will need to wait until the accelerator specifics are closer to being disclosed and the real driver patches can start flowing. In other words, I am not keen to see QEMU used as a way to proxy features into the kernel without disclosing what the real world consumer actually needs.
>>
>> Sure. FWIW, this will end up changing an internal driver mapping of HW buffers, which is now based on PCI BAR offsets, to the same mapping based on a CXL region.
>>
>> This involves PFs and VFs, and because CXL is only advertised by PF0, this all brings more complexity to those changes than could be expected.
>>
>> I think taking the CXL core to the support needed, hopefully based on this RFC, should be started as soon as possible instead of waiting for our driver/device to be ready.
>
> I agree, but continued transparency on what the CXL use case looks like in the final implementation helps make sure the right upstream CXL mechanism is being built.
>
> For example, just from that insight above I am wondering if we need some centralized interface for VFs to lookup / allocate CXL resources from their PF0.

Not in our case, but it could be needed (or convenient) for more complex devices regarding CXL ranges or decoders.

> [..]
>>>> Type2 implies, I think, only the related driver to manage the CXL specifics. This means no user space intervention and therefore no sysfs files. This makes it easy to avoid the current problem of most of the sysfs related code expecting Type3 devices. If I'm wrong in this regard, such code will need further changes.
>>>
>>> Not sure what you mean here. I am hoping that the sysfs ABI for enumerating objects like memdevs, decoders, and regions "just works" and is hidden from the accelerator vendor driver.
>>
>> If a Type2 is fully private, should there be any sysfs files about it?
>>
>> AFAIK, those files are mainly for user space being able to create memories by joining them, or for changing the dax mode. What do you think a Type2 needs through sysfs apart from info? While writing this I realize I have not thought about power management ...
>
> They are primarily for userspace to enumerate resources, and they provide kernel object names for CXL error events. For example userspace could generically map a CXL switch error to impacted endpoints without needing to know or care whether it was a type-2 or type-3 endpoint.

That makes a lot of sense. Then I'm afraid I should work on this. I'll try to find out if it could be just modifying the current patchset with minor changes.

Thanks

> [..]
>>> If you think we will still be talking about this driver enabling in September, I would invite you to submit this as a topic for the CXL uConf.
>>
>> That would be great!
>>
>> If this RFC work is integrated by then with the CXL core, that would be fantastic, but the CXL.cache support will still be pending.
>
> Sounds good.
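On the "how can the kernel warn about this?" question above, a minimal sketch of what the CXL core (rather than each vendor driver) could do when the EFI memory type contradicts the CDAT. The helper name and both type parameters are hypothetical; FW_BUG and TAINT_FIRMWARE_WORKAROUND are the existing kernel facilities:

#include <linux/device.h>
#include <linux/kernel.h>

/* Sketch: caller and parameters are hypothetical */
static void cxl_check_fw_memtype(struct device *dev, u8 cdat_type,
				 u8 efi_type)
{
	if (cdat_type == efi_type)
		return;

	/* Broken BIOS: EFI memory map disagrees with the device CDAT */
	dev_warn(dev, FW_BUG "EFI memory type %u contradicts CDAT type %u\n",
		 efi_type, cdat_type);
	add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);
}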
ping ... and some new comments below.

On 5/22/24 17:38, Alejandro Lucero Palau wrote:
> On 5/21/24 05:56, Dan Williams wrote:
>> Alejandro Lucero Palau wrote:
>>> On 5/17/24 01:08, Dan Williams wrote:
>> [..]
>>>> My expectation for CXL accelerator-provided memory is that it is marked "EFI Reserved" in the memory map (or CDAT), not "EFI Conventional + Special Purpose Attribute". Recall that the EFI_MEMORY_SP attribute is just a *hint* that the OS may *not* want to consider a memory range as part of the general purpose memory pool. CXL accelerator memory is far from that definition: if a driver needs to be loaded to effectively use a memory range, that is not a range that can be transparently converted to "EFI Conventional Memory".
>>> Well, that "EFI Reserved" is what we want. However, AFAIK, EFI_MEMORY_SP is the only one being discussed with our BIOS/UEFI contacts.
>> Well, tell your BIOS contacts that they need to honor what the device puts in the CDAT for the memory-type, not just make it up on their own.
>>
>> From the Linux side we can scream loudly with WARN_TAINT_FIRMWARE_WORKAROUND every time we see a BIOS that has failed to match the EFI memory-type with whatever the device puts in the CDAT, because that's clearly a broken BIOS.
>
> Yes, we need to sync with these guys :-)
>
> EFI_MEMORY_SP was our/my only focus and that is probably the problem and the confusion.
>
> BTW, how can the kernel warn about this? I guess this requires the driver raising it, doesn't it?
>
>>> Because how to assign those flags seems to depend on the BIOS implementation, we expect some BIOS versions to do the wrong thing for our purposes, so I have been working on a workaround involving udev events and executing a modified daxctl for switching a dax device mode or even removing/destroying it.
>> Yikes, no, let's not do that. Another thing we can do from the kernel side, in case the CDAT is missing, is to never attach device-dax to address ranges within CXL windows with the "Device Coherent Window Restriction".
>
> I do not want to stake my reputation on this workaround, but I see it as a first and quick fallback not requiring kernel changes to CXL/DAX. My idea is to add some interface between drivers and DAX for doing this if necessary, once the driver tries to get the resource and it fails. Maybe with some kernel boot param like dax_allow_change=drivername or something similar. What else could we do for the case of the BIOS/kernel doing the wrong thing?
>
>> [..]
>>>> It would help me if you can summarize what you did and did not adopt from that proposal, i.e. where it helped and where it missed the mark for your use case.
>>> I basically took your patchset referenced in the RFC, specifically the 19th one, and tried to work with it from a pcie/CXL device driver. I debugged all the problems preventing the invoked CXL function from succeeding, applying the required changes from previous patches not committed yet. Patches 7, 8, 9 and 11 are based on your patchset, with just some minor changes, since I did not see a reason for changing them after studying them. The other patches are those fixes for having a Type2 finally able to create a dax region and map it.
>> Ok, let's not add dax regions for this case, but the rest sounds straightforward, I'll take a look.
>>
>> [..]
>>>>> The reason for adding such a Type2 driver instead of changes to a current kernel driver or a new driver is threefold:
>>>>>
>>>>> 1) the expected kernel driver to use the functionality added is a netdev one. Current internal CXL support is a codesign effort, therefore software and hardware evolving in lockstep. Adding changes to a netdev driver requires the full functionality and doing things following the netdev standard, which is not the best option for this development stage.
>>>> Not sure what "netdev standard" means, and not sure it matters for a discussion that is constrained to how the CXL core should evolve to accommodate accelerator-local memory.
>>> You know I could not add to a netdev driver just those changes for CXL initialization, then mapping a CXL region and doing nothing else. Such a mapping needs to be linked to something inside the driver which is still under development and testing.
>> Ok, you're just talking about pre-plumbing the CXL bits before getting started on the netdev side. Makes sense to me.
>>
>>>>> 2) Waiting for completing the development will delay the required Type2 support, and most of the required changes are unrelated to specific CXL usage by any vendor driver.
>>>> Again, [2] already cleared the path for the idea that early plumbing of the CXL core for type-2 is in scope. Just need to make sure this effort is focused on general CXL specification use cases.
>>>>
>>>>> 3) Type2 support will need some testing infrastructure, unit tests for ensuring Type2 devices are working, and module tests for ensuring CXL core changes do not affect Type2 support.
>>>>>
>>>>> I hope these reasons are convincing enough.
>>>> I am happy to have general specification discussions about core functionality that all CXL accelerators need and plumbing those ABIs to be exercised via something like cxl_test.
>>>>
>>>> I expect that cxl_test only covers coarse-grained inter-module ABIs and that the fine details will need to wait until the accelerator specifics are closer to being disclosed and the real driver patches can start flowing. In other words, I am not keen to see QEMU used as a way to proxy features into the kernel without disclosing what the real world consumer actually needs.
>>> Sure. FWIW, this will end up changing an internal driver mapping of HW buffers, which is now based on PCI BAR offsets, to the same mapping based on a CXL region.
>>>
>>> This involves PFs and VFs, and because CXL is only advertised by PF0, this all brings more complexity to those changes than could be expected.
>>>
>>> I think taking the CXL core to the support needed, hopefully based on this RFC, should be started as soon as possible instead of waiting for our driver/device to be ready.
>> I agree, but continued transparency on what the CXL use case looks like in the final implementation helps make sure the right upstream CXL mechanism is being built.
>>
>> For example, just from that insight above I am wondering if we need some centralized interface for VFs to lookup / allocate CXL resources from their PF0.
>
> Not in our case, but it could be needed (or convenient) for more complex devices regarding CXL ranges or decoders.
>
>> [..]
>>>>> Type2 implies, I think, only the related driver to manage the CXL specifics. This means no user space intervention and therefore no sysfs files. This makes it easy to avoid the current problem of most of the sysfs related code expecting Type3 devices. If I'm wrong in this regard, such code will need further changes.
>>>> Not sure what you mean here. I am hoping that the sysfs ABI for enumerating objects like memdevs, decoders, and regions "just works" and is hidden from the accelerator vendor driver.
>>> If a Type2 is fully private, should there be any sysfs files about it?
>>>
>>> AFAIK, those files are mainly for user space being able to create memories by joining them, or for changing the dax mode. What do you think a Type2 needs through sysfs apart from info? While writing this I realize I have not thought about power management ...
>> They are primarily for userspace to enumerate resources, and they provide kernel object names for CXL error events. For example userspace could generically map a CXL switch error to impacted endpoints without needing to know or care whether it was a type-2 or type-3 endpoint.
>
> That makes a lot of sense. Then I'm afraid I should work on this. I'll try to find out if it could be just modifying the current patchset with minor changes.
>
> Thanks

I've been looking at this, mainly alarmed by the error handling problem mentioned.

Firstly, my original comment in the cover letter is misleading. The patchset does not avoid kobjects being created, just some sysfs files related to kobject attributes. I think it is the right thing to do, although some of those attributes could in fact be shown, so I will fix that. However, I think some fields now in cxl_memdev_state should be inside cxl_memdev, like ram_perf/pmem_perf, and moreover, I think we need a more flexible way of linking Type2 to those structs where the optional Type2 functionalities can be set or not. Not something to be faced in this patchset, but by some refactoring in the near future if my concern turns out to be a real one.

Secondly, there is another concern, maybe a more important one, related to the error handling. I have to admit I did not think about this at all before Dan's comment, and I think it is something worth discussing. There are, at least, two error situations I'm concerned about, although they (can) end up with the same action: a pcie/CXL slot reset. Assuming there could be cases where the system can recover from this situation, and noting that CXL.mem or CXL.cache could stall CPUs and likely crash the system, the concern is how CXL device configuration like the HDM decoders should be restored after the slot reset. While an FLR keeps that config, a slot reset does not. The current Type3 support inside the CXL pci driver does nothing when the device is detached (no release callback), which is what happens when the system performs the slot reset, and cxl_error_resume does not try anything. If I'm not wrong, it is in the resume callback where the device is probed again. So here is my question: shouldn't an endpoint slot reset be supported?

>> [..]
>>>> If you think we will still be talking about this driver enabling in September, I would invite you to submit this as a topic for the CXL uConf.
>>> That would be great!
>>>
>>> If this RFC work is integrated by then with the CXL core, that would be fantastic, but the CXL.cache support will still be pending.
>> Sounds good.
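To anchor the slot reset question: the hook points would be the standard struct pci_error_handlers callbacks, and the open issue is what the CXL core should offer there, since (unlike an FLR) a slot reset loses the HDM decoder programming. A minimal sketch, with t2_reprogram_hdm() standing in for hypothetical re-setup logic:

#include <linux/pci.h>

/* Hypothetical: redo DVSEC/decoder setup and re-create the region */
static void t2_reprogram_hdm(struct pci_dev *pdev)
{
}

static pci_ers_result_t t2_slot_reset(struct pci_dev *pdev)
{
	/* Link is back up; decide whether recovery is possible at all */
	return PCI_ERS_RESULT_RECOVERED;
}

static void t2_error_resume(struct pci_dev *pdev)
{
	/* Device config was lost by the reset, so rebuild it here */
	t2_reprogram_hdm(pdev);
}

static const struct pci_error_handlers t2_err_handlers = {
	.slot_reset	= t2_slot_reset,
	.resume		= t2_error_resume,
};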
From: Alejandro Lucero <alejandro.lucero-palau@amd.com>

I need to start this RFC explaining not what the patchset does, but what we expected to obtain and is currently in doubt: to configure a CXL memory range advertised by our CXL network device and to use it "privately".

The main reason behind that privacy, but not the only one, is to avoid arbitrary use by any user-space "client" with enough privileges to use a public interface for mapping the device memory and using it as regular memory. The reason is obvious: the device expects writes to such memory in a specific format.

The doubt comes from the fact that, after implementing the functionality exposed in this patchset, we realized the current expectation seems to be a BIOS/UEFI configuring HDM decoders or HPA ranges and passing memory ranges to the kernel, with the kernel taking a default action on that memory range based on whether the flag EFI_MEMORY_SP is set. If it is not set, the memory range will be part of the kernel memory management, which we certainly do not want. If the flag is set, a DAX device will be created which allows a user-space client to map such memory through the DAX device API, which we also prefer to avoid. I know this is likely going to face opposition, but we see this RFC as the opportunity for discussing the matter and, if it turns out to be the case, for being guided towards the proper solution accepted by the maintainers/community. This patchset does not tackle this default kernel behaviour, although we already have some ideas and workarounds for the short term. We'll be happy to discuss this openly.

So, this patchset assumes a BIOS/UEFI not programming the HDM decoder or HPA ranges for a CXL Type2 device. Although maybe a weird assumption as explained above, a Type2 device added after boot, that is hotplugged, will likely need the changes added here. Exporting some of the CXL core for vendor drivers is also required for avoiding code duplication when reading DVSEC or decoder registers. Finally, if there is no such HPA range assigned, whatever the reason, this patchset offers some way of obtaining one and/or finding out what is the problem behind the lack of such HPA range.

Current CXL kernel code is focused on supporting Type3 CXL devices, aka memory expanders. Type2 CXL devices, aka device accelerators, share some functionality but require some special handling. First of all, Type2 devices are by definition specific to vendor designs, and a specific vendor driver is expected to handle the CXL specifics. This implies the CXL setup needs to be done by such a driver instead of by a generic CXL PCI driver as for memory expanders. Most of that setup needs to use current CXL core code and therefore needs to be accessible to those vendor drivers. This is accomplished with the first patch and with extensions to the exported functions in subsequent patches.

Keeping with kernel rules of having a client use any new functionality added, a CXL type2 driver is added incrementally. This is not a driver supporting a real device but an emulated one in QEMU, with the QEMU patch following this patchset.

The reason for adding such a Type2 driver instead of changes to a current kernel driver or a new driver is threefold:

1) the expected kernel driver to use the functionality added is a netdev one. Current internal CXL support is a codesign effort, therefore software and hardware evolving in lockstep. Adding changes to a netdev driver requires the full functionality and doing things following the netdev standard, which is not the best option for this development stage.

2) Waiting for completing the development will delay the required Type2 support, and most of the required changes are unrelated to specific CXL usage by any vendor driver.

3) Type2 support will need some testing infrastructure, unit tests for ensuring Type2 devices are working, and module tests for ensuring CXL core changes do not affect Type2 support.

I hope these reasons are convincing enough.

I have decided to follow a gradual approach for adding such a driver using the exported CXL functions and structs. I think it is easier to review the driver when the new functionality is added than to add the driver at the end, but not a big deal if my approach is not liked.

The patches are based on a patchset sent by Dan Williams [1] which was just partially integrated, mostly related to making things ready for Type2 but none related to specific Type2 support. Those patches based on Dan's work have Dan's signing, so Dan, tell me if you do not want me to add you.

Type2 implies, I think, only the related driver to manage the CXL specifics. This means no user space intervention and therefore no sysfs files. This makes it easy to avoid the current problem of most of the sysfs related code expecting Type3 devices. If I'm wrong in this regard, such code will need further changes.

A final note about CXL.cache is needed. This patchset does not cover it at all, although the emulated Type2 device advertises it. From the kernel point of view, supporting CXL.cache will imply being sure the CXL path supports what the Type2 device needs. A device accelerator will likely be connected to a Root Switch, but other configurations can not be discarded. Therefore the kernel will need to check not just HPA, DPA, interleave and granularity, but also the available CXL.cache support and resources in each switch in the CXL path to the Type2 device. I expect to contribute to this support in the following months, and it would be good to discuss it when possible.

Alejandro.
[1] https://lore.kernel.org/linux-cxl/98b1f61a-e6c2-71d4-c368-50d958501b0c@intel.com/T/

Alejandro Lucero (13):
  cxl: move header files for absolute references
  cxl: add type2 device basic support
  cxl: export core function for type2 devices
  cxl: allow devices without mailbox capability
  cxl: fix check about pmem resource
  cxl: support type2 memdev creation
  cxl: add functions for exclusive access to endpoint port topology
  cxl: add cxl_get_hpa_freespace
  cxl: add cxl_request_dpa
  cxl: make region type based on endpoint type
  cxl: allow automatic region creation by type2 drivers
  cxl: preclude device memory to be used for dax
  cxl: test type2 private mapping

 drivers/cxl/acpi.c                      |   4 +-
 drivers/cxl/core/cdat.c                 |   9 +-
 drivers/cxl/core/core.h                 |   1 -
 drivers/cxl/core/hdm.c                  | 159 +++++++--
 drivers/cxl/core/mbox.c                 |   6 +-
 drivers/cxl/core/memdev.c               |  75 +++-
 drivers/cxl/core/pci.c                  |   6 +-
 drivers/cxl/core/pmem.c                 |   4 +-
 drivers/cxl/core/pmu.c                  |   4 +-
 drivers/cxl/core/port.c                 |   6 +-
 drivers/cxl/core/region.c               | 446 ++++++++++++++++++++----
 drivers/cxl/core/regs.c                 |   9 +-
 drivers/cxl/core/suspend.c              |   2 +-
 drivers/cxl/core/trace.c                |   2 +-
 drivers/cxl/core/trace.h                |   4 +-
 drivers/cxl/mem.c                       |  23 +-
 drivers/cxl/pci.c                       |   9 +-
 drivers/cxl/pmem.c                      |   4 +-
 drivers/cxl/port.c                      |   4 +-
 drivers/cxl/security.c                  |   4 +-
 drivers/dax/cxl.c                       |   2 +-
 drivers/perf/cxl_pmu.c                  |   4 +-
 {drivers/cxl => include/linux}/cxl.h    |   5 +
 {drivers/cxl => include/linux}/cxlmem.h |  22 +-
 {drivers/cxl => include/linux}/cxlpci.h |   2 +
 tools/testing/cxl/Kbuild                |   1 +
 tools/testing/cxl/cxl_core_exports.c    |   2 +-
 tools/testing/cxl/mock_acpi.c           |   2 +-
 tools/testing/cxl/test/cxl.c            |   2 +-
 tools/testing/cxl/test/mem.c            |   2 +-
 tools/testing/cxl/test/mock.c           |   4 +-
 tools/testing/cxl/test/mock.h           |   2 +-
 tools/testing/cxl/type2/Kbuild          |   7 +
 tools/testing/cxl/type2/pci_type2.c     | 201 +++++++++++
 34 files changed, 886 insertions(+), 153 deletions(-)
 rename {drivers/cxl => include/linux}/cxl.h (99%)
 rename {drivers/cxl => include/linux}/cxlmem.h (97%)
 rename {drivers/cxl => include/linux}/cxlpci.h (97%)
 create mode 100644 tools/testing/cxl/type2/Kbuild
 create mode 100644 tools/testing/cxl/type2/pci_type2.c
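To give reviewers a feel for how a vendor driver would consume the exports above, a rough probe-time sketch follows. Only the helper names referenced in the comments come from the shortlog; everything else, including the function name, is an assumption for illustration:

#include <linux/pci.h>

/* Sketch: the flow a type2 vendor driver would follow, mapped to the
 * shortlog; no real signatures are implied.
 */
static int t2_cxl_probe(struct pci_dev *pdev)
{
	/*
	 * 1. "add type2 device basic support" / "allow devices without
	 *    mailbox capability": build the dev state from DVSEC regs.
	 * 2. "support type2 memdev creation": register a memdev so the
	 *    endpoint shows up in the common topology.
	 * 3. "cxl_get_hpa_freespace": pick a free HPA window below a
	 *    root decoder (covers the BIOS-did-not-map case).
	 * 4. "cxl_request_dpa": claim the accelerator's DPA range.
	 * 5. "allow automatic region creation by type2 drivers": commit
	 *    the HPA<->DPA mapping; "preclude device memory to be used
	 *    for dax" then keeps the result driver-private.
	 */
	return 0;
}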