Message ID: 20240516081202.27023-1-alucerop@amd.com
Series: RFC: add Type2 device support
alucerop@ wrote:
> From: Alejandro Lucero <alejandro.lucero-palau@amd.com>
>
> I need to start this RFC explaining not what the patchset does, but what we expected to obtain and is currently in doubt: to configure a CXL memory range advertised by our CXL network device and to use it "privately".
>
> The main reason behind that privacy, but not the only one, is to avoid arbitrary use by any user-space "client" with enough privileges to use a public interface for mapping the device memory and using it as regular memory. The reason is obvious: the device expects writes to such memory in a specific format.
>
> The doubt comes from the fact that, after implementing the functionality exposed in this patchset, we realized the current expectation seems to be a BIOS/UEFI configuring HDM decoders or HPA ranges and passing memory ranges to the kernel, with the kernel taking a default action on that memory range based on whether the flag EFI_MEMORY_SP is set.

Let's clear up this confusion first.

My expectation for CXL accelerator-provided memory is that it is marked "EFI Reserved" in the memory map (or CDAT), not "EFI Conventional + Special Purpose Attribute". Recall that the EFI_MEMORY_SP attribute is just a *hint* that the OS may *not* want to consider a memory range as part of the general purpose memory pool. CXL accelerator memory is far from that definition: if a driver needs to be loaded to effectively use a memory range, that is not a range that can be transparently converted to "EFI Conventional Memory".

> If it is not set, the memory range will be part of the kernel memory management, which we certainly do not want. If the flag is set, a DAX device will be created which allows a user-space client to map such memory through the DAX device API, which we also prefer to avoid. I know this is likely going to face opposition, but we see this RFC as the opportunity for discussing the matter and, if it turns out to be the case, for being guided towards the proper solution accepted by the maintainers/community. This patchset does not tackle this default kernel behaviour, although we already have some ideas and workarounds for the short term. We'll be happy to discuss this openly.

It may have been subtle, but even in my initial RFC [1] I was explicitly skipping exposing accelerator memory via device-dax:

@@ -3085,6 +3183,15 @@ static int cxl_region_probe(struct device *dev)
 				p->res->start, p->res->end, cxlr,
 				is_system_ram) > 0)
 			return 0;
+
+		/*
+		 * HDM-D[B] (device-memory) regions have accelerator
+		 * specific usage, skip device-dax registration.
+		 */
+		if (cxlr->type == CXL_DECODER_DEVMEM)
+			return 0;
+
+		/* HDM-H routes to device-dax */
 		return devm_cxl_add_dax_region(cxlr);
 	default:
 		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",

[1] https://lore.kernel.org/r/168592159835.1948938.1647215579839222774.stgit@dwillia2-xfh.jf.intel.com

> So, this patchset assumes a BIOS/UEFI not programming the HDM decoder or HPA ranges for a CXL Type2 device.

At least for Linux it should not care whether the BIOS maps it or not. The expectation is that whichever agent maps it, BIOS or Linux CXL core, that agent honors the memory type specified in the device CDAT. I.e. the "Device Scoped EFI Memory Type Structure (DSEMTS)" in the accelerator CDAT should pass "2" in the "EFI Memory Type and Attribute" field, where "2" indicates "EFI Reserved" memory type for any HPA that maps this device's DPA.
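For illustration, this is roughly the gate a kernel agent could apply before letting such a range anywhere near the page allocator or device-dax. A minimal sketch only: cdat_memtype_for_dpa() is a hypothetical helper that would return the DSEMTS "EFI Memory Type and Attribute" value for a DPA range; the "2 == EFI Reserved" encoding comes from the paragraph above.

/* Sketch: cdat_memtype_for_dpa() is hypothetical, not an existing API */
#define CDAT_DSEMTS_EFI_RESERVED	2

static bool cxl_range_is_driver_private(struct cxl_memdev *cxlmd,
					u64 base, u64 len)
{
	/* Reserved DPA must never reach mm or device-dax */
	return cdat_memtype_for_dpa(cxlmd, base, len) ==
		CDAT_DSEMTS_EFI_RESERVED;
}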
> Although maybe a weird assumption as explained above, a Type2 device added after boot, that is hotplugged, will likely need the changes added here. Exporting some of the CXL core for vendor drivers is also required for avoiding code duplication when reading DVSEC or decoder registers. Finally, if there is no such HPA range assigned, whatever the reason, this patchset offers some way of obtaining one and/or finding out what is the problem behind the lack of such HPA range.

Yes, that was the whole point of the "Device Memory Setup" thread [2]: to show a plausible way to reuse the common core infrastructure for accelerators with CXL.mem capability.

[2] https://lore.kernel.org/all/168592149709.1948938.8663425987110396027.stgit@dwillia2-xfh.jf.intel.com/

It would help me if you can summarize what you did and did not adopt from that proposal, i.e. where it helped and where it missed the mark for your use case.

[..]

> Keeping with kernel rules of having a client use any new functionality added, a CXL type2 driver is added incrementally. This is not a driver supporting a real device but an emulated one in QEMU, with the QEMU patch following this patchset.

Let's start with the deltas compared to [2]. The observation there is that much of the CXL core is reusable for this type-2 use case, and reusing it even improves the organization of the core. The consumer of the refactoring just ends up being a few new helpers around the core.

> The reason for adding such a Type2 driver instead of changes to a current kernel driver or a new driver is threefold:
>
> 1) the expected kernel driver to use the functionality added is a netdev one. Current internal CXL support is a codesign effort, therefore software and hardware evolving in lockstep. Adding changes to a netdev driver requires the full functionality and doing things following the netdev standard, which is not the best option for this development stage.

Not sure what "netdev standard" means, and not sure it matters for a discussion that is constrained to how the CXL core should evolve to accommodate accelerator-local memory.

> 2) Waiting for completing the development will delay the required Type2 support, and most of the required changes are unrelated to specific CXL usage by any vendor driver.

Again, [2] already cleared the path for the idea that early plumbing of the CXL core for type-2 is in scope. Just need to make sure this effort is focused on general CXL specification use cases.

> 3) Type2 support will need some testing infrastructure, unit tests for ensuring Type2 devices are working, and module tests for ensuring CXL core changes do not affect Type2 support.
>
> I hope these reasons are convincing enough.

I am happy to have general specification discussions about core functionality that all CXL accelerators need and plumbing those ABIs to be exercised via something like cxl_test.

I expect that cxl_test only covers coarse-grained inter-module ABIs and that the fine details will need to wait until the accelerator specifics are closer to being disclosed and the real driver patches can start flowing. In other words, I am not keen to see QEMU used as a way to proxy features into the kernel without disclosing what the real world consumer actually needs.

Now, that said, there may also be opportunity for targeted extensions of the QEMU CXL memory-expander device, like DPA capacity that is mapped as "EFI Reserved" device-memory, but let's talk about the specifics here.

Also recall that the current CXL driver does not yet support parsing CXL PMEM region labels. Support for that ends up looking quite similar to an accelerator asking for a specific DPA range to be dynamically mapped by the CXL core. So I think we can get quite far along the path to enabling type-2-CXL.mem just by flushing out some of the generic type-3-CXL.mem use cases.

> I have decided to follow a gradual approach for adding such a driver using the exported CXL functions and structs. I think it is easier to review the driver when the new functionality is added than to add the driver at the end, but not a big deal if my approach is not liked.
>
> The patches are based on a patchset sent by Dan Williams [1] which was just partially integrated, mostly related to making things ready for Type2 but none related to specific Type2 support. Those patches based on Dan's work have Dan's signing, so Dan, tell me if you do not want me to add you.

It is fine to reuse those patches, but please do include:

Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Link: <lore-link for copied code>

...so I can try to remember if I still think the source patch was a good idea or not. Again, a summary analysis of what did and did not work for you from that posting would help me get a running start at reviewing this.

> Type2 implies, I think, only the related driver to manage the CXL specifics. This means no user space intervention and therefore no sysfs files. This makes it easy to avoid the current problem of most of the sysfs related code expecting Type3 devices. If I'm wrong in this regard, such code will need further changes.

Not sure what you mean here. I am hoping that the sysfs ABI for enumerating objects like memdevs, decoders, and regions "just works" and is hidden from the accelerator vendor driver.

> A final note about CXL.cache is needed. This patchset does not cover it at all, although the emulated Type2 device advertises it. From the kernel point of view, supporting CXL.cache will imply being sure the CXL path supports what the Type2 device needs. A device accelerator will likely be connected to a Root Switch, but other configurations can not be discarded. Therefore the kernel will need to check not just HPA, DPA, interleave and granularity, but also the available CXL.cache support and resources in each switch in the CXL path to the Type2 device. I expect to contribute to this support in the following months, and it would be good to discuss it when possible.

Yup, especially CXL 3.x Cache ID management. Let's talk about some ways to plumb that, because it is also a general CXL specification capability. Another topic is what is needed for CXL Protocol and Component error handling; I expect some of that to be CXL core common and some of that to be accelerator driver augmented.

If you think we will still be talking about this driver enabling in September, I would invite you to submit this as a topic for the CXL uConf.

https://lpc.events/event/18/contributions/1673/
On 5/17/24 01:08, Dan Williams wrote:
> alucerop@ wrote:
>> From: Alejandro Lucero <alejandro.lucero-palau@amd.com>
>>
>> I need to start this RFC explaining not what the patchset does, but what we expected to obtain and is currently in doubt: to configure a CXL memory range advertised by our CXL network device and to use it "privately".
>>
>> The main reason behind that privacy, but not the only one, is to avoid arbitrary use by any user-space "client" with enough privileges to use a public interface for mapping the device memory and using it as regular memory. The reason is obvious: the device expects writes to such memory in a specific format.
>>
>> The doubt comes from the fact that, after implementing the functionality exposed in this patchset, we realized the current expectation seems to be a BIOS/UEFI configuring HDM decoders or HPA ranges and passing memory ranges to the kernel, with the kernel taking a default action on that memory range based on whether the flag EFI_MEMORY_SP is set.
>
> Let's clear up this confusion first.
>
> My expectation for CXL accelerator-provided memory is that it is marked "EFI Reserved" in the memory map (or CDAT), not "EFI Conventional + Special Purpose Attribute". Recall that the EFI_MEMORY_SP attribute is just a *hint* that the OS may *not* want to consider a memory range as part of the general purpose memory pool. CXL accelerator memory is far from that definition: if a driver needs to be loaded to effectively use a memory range, that is not a range that can be transparently converted to "EFI Conventional Memory".

Well, that "EFI Reserved" is what we want. However, AFAIK, EFI_MEMORY_SP is the only one being discussed with our BIOS/UEFI contacts. Because how to assign those flags seems to depend on the BIOS implementation, we expect some BIOS versions to do the wrong thing for our purposes, so I have been working on a workaround involving udev events and executing a modified daxctl for switching a dax device mode or even removing/destroying it. Anyway, that "EFI Reserved" flag gives us hope.

>> If it is not set, the memory range will be part of the kernel memory management, which we certainly do not want. If the flag is set, a DAX device will be created which allows a user-space client to map such memory through the DAX device API, which we also prefer to avoid. I know this is likely going to face opposition, but we see this RFC as the opportunity for discussing the matter and, if it turns out to be the case, for being guided towards the proper solution accepted by the maintainers/community. This patchset does not tackle this default kernel behaviour, although we already have some ideas and workarounds for the short term. We'll be happy to discuss this openly.
>
> It may have been subtle, but even in my initial RFC [1] I was explicitly skipping exposing accelerator memory via device-dax:

I did not see that. So, happy to know we are on the same page here.

> @@ -3085,6 +3183,15 @@ static int cxl_region_probe(struct device *dev)
>  				p->res->start, p->res->end, cxlr,
>  				is_system_ram) > 0)
>  			return 0;
> +
> +		/*
> +		 * HDM-D[B] (device-memory) regions have accelerator
> +		 * specific usage, skip device-dax registration.
> +		 */
> +		if (cxlr->type == CXL_DECODER_DEVMEM)
> +			return 0;
> +
> +		/* HDM-H routes to device-dax */
>  		return devm_cxl_add_dax_region(cxlr);
>  	default:
>  		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
>
> [1] https://lore.kernel.org/r/168592159835.1948938.1647215579839222774.stgit@dwillia2-xfh.jf.intel.com
>
>> So, this patchset assumes a BIOS/UEFI not programming the HDM decoder or HPA ranges for a CXL Type2 device.
>
> At least for Linux it should not care whether the BIOS maps it or not. The expectation is that whichever agent maps it, BIOS or Linux CXL core, that agent honors the memory type specified in the device CDAT. I.e. the "Device Scoped EFI Memory Type Structure (DSEMTS)" in the accelerator CDAT should pass "2" in the "EFI Memory Type and Attribute" field, where "2" indicates "EFI Reserved" memory type for any HPA that maps this device's DPA.
>
>> Although maybe a weird assumption as explained above, a Type2 device added after boot, that is hotplugged, will likely need the changes added here. Exporting some of the CXL core for vendor drivers is also required for avoiding code duplication when reading DVSEC or decoder registers. Finally, if there is no such HPA range assigned, whatever the reason, this patchset offers some way of obtaining one and/or finding out what is the problem behind the lack of such HPA range.
>
> Yes, that was the whole point of the "Device Memory Setup" thread [2]: to show a plausible way to reuse the common core infrastructure for accelerators with CXL.mem capability.
>
> [2] https://lore.kernel.org/all/168592149709.1948938.8663425987110396027.stgit@dwillia2-xfh.jf.intel.com/
>
> It would help me if you can summarize what you did and did not adopt from that proposal, i.e. where it helped and where it missed the mark for your use case.

I basically took your patchset referenced in the RFC, specifically the 19th one, and tried to work with it from a pcie/CXL device driver. I debugged all the problems preventing the invoked CXL function from succeeding, applying the required changes from previous patches not committed yet. Patches 7, 8, 9 and 11 are based on your patchset, with just some minor changes, since I did not see a reason for changing them after studying them. The other patches are those fixes for having a Type2 finally able to create a dax region and map it.

> [..]
>> Keeping with kernel rules of having a client use any new functionality added, a CXL type2 driver is added incrementally. This is not a driver supporting a real device but an emulated one in QEMU, with the QEMU patch following this patchset.
>
> Let's start with the deltas compared to [2]. The observation there is that much of the CXL core is reusable for this type-2 use case, and reusing it even improves the organization of the core. The consumer of the refactoring just ends up being a few new helpers around the core.
>
>> The reason for adding such a Type2 driver instead of changes to a current kernel driver or a new driver is threefold:
>>
>> 1) the expected kernel driver to use the functionality added is a netdev one. Current internal CXL support is a codesign effort, therefore software and hardware evolving in lockstep. Adding changes to a netdev driver requires the full functionality and doing things following the netdev standard, which is not the best option for this development stage.
>
> Not sure what "netdev standard" means, and not sure it matters for a discussion that is constrained to how the CXL core should evolve to accommodate accelerator-local memory.

You know I could not add to a netdev driver just those changes for CXL initialization, then mapping a CXL region and doing nothing else. Such a mapping needs to be linked to something inside the driver which is still under development and testing.

>> 2) Waiting for completing the development will delay the required Type2 support, and most of the required changes are unrelated to specific CXL usage by any vendor driver.
>
> Again, [2] already cleared the path for the idea that early plumbing of the CXL core for type-2 is in scope. Just need to make sure this effort is focused on general CXL specification use cases.
>
>> 3) Type2 support will need some testing infrastructure, unit tests for ensuring Type2 devices are working, and module tests for ensuring CXL core changes do not affect Type2 support.
>>
>> I hope these reasons are convincing enough.
>
> I am happy to have general specification discussions about core functionality that all CXL accelerators need and plumbing those ABIs to be exercised via something like cxl_test.
>
> I expect that cxl_test only covers coarse-grained inter-module ABIs and that the fine details will need to wait until the accelerator specifics are closer to being disclosed and the real driver patches can start flowing. In other words, I am not keen to see QEMU used as a way to proxy features into the kernel without disclosing what the real world consumer actually needs.

Sure. FWIW, this will end up changing an internal driver mapping of HW buffers, which is now based on PCI BAR offsets, to the same mapping based on a CXL region. This involves PFs and VFs, and because CXL is only advertised by PF0, this all brings more complexity to those changes than could be expected.

I think taking the CXL core to the support needed, hopefully based on this RFC, should be started as soon as possible instead of waiting for our driver/device to be ready.

> Now, that said, there may also be opportunity for targeted extensions of the QEMU CXL memory-expander device, like DPA capacity that is mapped as "EFI Reserved" device-memory, but let's talk about the specifics here.

Yes. I'm actually using qemu hmem and kernel efi_fake_mem for some testing of those workarounds, so happy to help with new qemu support for the "EFI Reserved" case as well.

> Also recall that the current CXL driver does not yet support parsing CXL PMEM region labels. Support for that ends up looking quite similar to an accelerator asking for a specific DPA range to be dynamically mapped by the CXL core. So I think we can get quite far along the path to enabling type-2-CXL.mem just by flushing out some of the generic type-3-CXL.mem use cases.

OK. I'll take a look at those pmem needs.

>> I have decided to follow a gradual approach for adding such a driver using the exported CXL functions and structs. I think it is easier to review the driver when the new functionality is added than to add the driver at the end, but not a big deal if my approach is not liked.
>>
>> The patches are based on a patchset sent by Dan Williams [1] which was just partially integrated, mostly related to making things ready for Type2 but none related to specific Type2 support. Those patches based on Dan's work have Dan's signing, so Dan, tell me if you do not want me to add you.
>
> It is fine to reuse those patches, but please do include:
>
> Co-developed-by: Dan Williams <dan.j.williams@intel.com>
> Link: <lore-link for copied code>

Ok. I'll do the changes for next versions.

> ...so I can try to remember if I still think the source patch was a good idea or not. Again, a summary analysis of what did and did not work for you from that posting would help me get a running start at reviewing this.
>
>> Type2 implies, I think, only the related driver to manage the CXL specifics. This means no user space intervention and therefore no sysfs files. This makes it easy to avoid the current problem of most of the sysfs related code expecting Type3 devices. If I'm wrong in this regard, such code will need further changes.
>
> Not sure what you mean here. I am hoping that the sysfs ABI for enumerating objects like memdevs, decoders, and regions "just works" and is hidden from the accelerator vendor driver.

If a Type2 is fully private, should there be any sysfs files about it?

AFAIK, those files are mainly for user space being able to create memories by joining them, or for changing the dax mode. What do you think a Type2 needs through sysfs apart from info? While writing this I realize I have not thought about power management ...

>> A final note about CXL.cache is needed. This patchset does not cover it at all, although the emulated Type2 device advertises it. From the kernel point of view, supporting CXL.cache will imply being sure the CXL path supports what the Type2 device needs. A device accelerator will likely be connected to a Root Switch, but other configurations can not be discarded. Therefore the kernel will need to check not just HPA, DPA, interleave and granularity, but also the available CXL.cache support and resources in each switch in the CXL path to the Type2 device. I expect to contribute to this support in the following months, and it would be good to discuss it when possible.
>
> Yup, especially CXL 3.x Cache ID management. Let's talk about some ways to plumb that, because it is also a general CXL specification capability. Another topic is what is needed for CXL Protocol and Component error handling; I expect some of that to be CXL core common and some of that to be accelerator driver augmented.
>
> If you think we will still be talking about this driver enabling in September, I would invite you to submit this as a topic for the CXL uConf.

That would be great! If this RFC work is integrated by then with the CXL core, that would be fantastic, but the CXL.cache support will still be pending.

> https://lpc.events/event/18/contributions/1673/
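As a side note on the qemu hmem plus efi_fake_mem testing mentioned above: that setup boils down to a boot parameter of the form below, where 0x40000 is the EFI_MEMORY_SP attribute bit and the size/base values are only placeholders for illustration:

efi_fake_mem=4G@0x100000000:0x40000

i.e. the 4G range at 0x100000000 gets the Special Purpose hint and is routed to device-dax instead of the page allocator, which is exactly the default behaviour being debated here.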
Alejandro Lucero Palau wrote:
> On 5/17/24 01:08, Dan Williams wrote:
[..]
>> My expectation for CXL accelerator-provided memory is that it is marked "EFI Reserved" in the memory map (or CDAT), not "EFI Conventional + Special Purpose Attribute". Recall that the EFI_MEMORY_SP attribute is just a *hint* that the OS may *not* want to consider a memory range as part of the general purpose memory pool. CXL accelerator memory is far from that definition: if a driver needs to be loaded to effectively use a memory range, that is not a range that can be transparently converted to "EFI Conventional Memory".
>
> Well, that "EFI Reserved" is what we want. However, AFAIK, EFI_MEMORY_SP is the only one being discussed with our BIOS/UEFI contacts.

Well, tell your BIOS contacts that they need to honor what the device puts in the CDAT for the memory-type, not just make it up on their own.

From the Linux side we can scream loudly with WARN_TAINT_FIRMWARE_WORKAROUND every time we see a BIOS that has failed to match the EFI memory-type with whatever the device puts in the CDAT, because that's clearly a broken BIOS.

> Because how to assign those flags seems to depend on the BIOS implementation, we expect some BIOS versions to do the wrong thing for our purposes, so I have been working on a workaround involving udev events and executing a modified daxctl for switching a dax device mode or even removing/destroying it.

Yikes, no, let's not do that. Another thing we can do from the kernel side, in case the CDAT is missing, is to never attach device-dax to address ranges within CXL windows with the "Device Coherent Window Restriction".

[..]
>> It would help me if you can summarize what you did and did not adopt from that proposal, i.e. where it helped and where it missed the mark for your use case.
>
> I basically took your patchset referenced in the RFC, specifically the 19th one, and tried to work with it from a pcie/CXL device driver. I debugged all the problems preventing the invoked CXL function from succeeding, applying the required changes from previous patches not committed yet. Patches 7, 8, 9 and 11 are based on your patchset, with just some minor changes, since I did not see a reason for changing them after studying them. The other patches are those fixes for having a Type2 finally able to create a dax region and map it.

Ok, let's not add dax regions for this case, but the rest sounds straightforward, I'll take a look.

[..]
>>> The reason for adding such a Type2 driver instead of changes to a current kernel driver or a new driver is threefold:
>>>
>>> 1) the expected kernel driver to use the functionality added is a netdev one. Current internal CXL support is a codesign effort, therefore software and hardware evolving in lockstep. Adding changes to a netdev driver requires the full functionality and doing things following the netdev standard, which is not the best option for this development stage.
>>
>> Not sure what "netdev standard" means, and not sure it matters for a discussion that is constrained to how the CXL core should evolve to accommodate accelerator-local memory.
>
> You know I could not add to a netdev driver just those changes for CXL initialization, then mapping a CXL region and doing nothing else. Such a mapping needs to be linked to something inside the driver which is still under development and testing.

Ok, you're just talking about pre-plumbing the CXL bits before getting started on the netdev side. Makes sense to me.

>>> 2) Waiting for completing the development will delay the required Type2 support, and most of the required changes are unrelated to specific CXL usage by any vendor driver.
>>
>> Again, [2] already cleared the path for the idea that early plumbing of the CXL core for type-2 is in scope. Just need to make sure this effort is focused on general CXL specification use cases.
>>
>>> 3) Type2 support will need some testing infrastructure, unit tests for ensuring Type2 devices are working, and module tests for ensuring CXL core changes do not affect Type2 support.
>>>
>>> I hope these reasons are convincing enough.
>>
>> I am happy to have general specification discussions about core functionality that all CXL accelerators need and plumbing those ABIs to be exercised via something like cxl_test.
>>
>> I expect that cxl_test only covers coarse-grained inter-module ABIs and that the fine details will need to wait until the accelerator specifics are closer to being disclosed and the real driver patches can start flowing. In other words, I am not keen to see QEMU used as a way to proxy features into the kernel without disclosing what the real world consumer actually needs.
>
> Sure. FWIW, this will end up changing an internal driver mapping of HW buffers, which is now based on PCI BAR offsets, to the same mapping based on a CXL region.
>
> This involves PFs and VFs, and because CXL is only advertised by PF0, this all brings more complexity to those changes than could be expected.
>
> I think taking the CXL core to the support needed, hopefully based on this RFC, should be started as soon as possible instead of waiting for our driver/device to be ready.

I agree, but continued transparency on what the CXL use case looks like in the final implementation helps make sure the right upstream CXL mechanism is being built.

For example, just from that insight above I am wondering if we need some centralized interface for VFs to lookup / allocate CXL resources from their PF0.

[..]
>>> Type2 implies, I think, only the related driver to manage the CXL specifics. This means no user space intervention and therefore no sysfs files. This makes it easy to avoid the current problem of most of the sysfs related code expecting Type3 devices. If I'm wrong in this regard, such code will need further changes.
>>
>> Not sure what you mean here. I am hoping that the sysfs ABI for enumerating objects like memdevs, decoders, and regions "just works" and is hidden from the accelerator vendor driver.
>
> If a Type2 is fully private, should there be any sysfs files about it?
>
> AFAIK, those files are mainly for user space being able to create memories by joining them, or for changing the dax mode. What do you think a Type2 needs through sysfs apart from info? While writing this I realize I have not thought about power management ...

They are primarily for userspace to enumerate resources, and they provide kernel object names for CXL error events. For example userspace could generically map a CXL switch error to impacted endpoints without needing to know or care whether it was a type-2 or type-3 endpoint.

[..]
>> If you think we will still be talking about this driver enabling in September, I would invite you to submit this as a topic for the CXL uConf.
>
> That would be great!
>
> If this RFC work is integrated by then with the CXL core, that would be fantastic, but the CXL.cache support will still be pending.

Sounds good.
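To make the enumeration point concrete: even a fully driver-managed type-2 endpoint would still appear in the generic CXL object hierarchy, along the lines of the sketch below (object names are illustrative, not taken from this patchset):

/sys/bus/cxl/devices/mem0         # memdev for the accelerator endpoint
/sys/bus/cxl/devices/endpoint2    # endpoint port in the topology
/sys/bus/cxl/devices/decoder2.0   # endpoint decoder with the DPA mapping
/sys/bus/cxl/devices/region0      # region backing the driver-private range

A switch error report can then name region0/mem0 without caring whether the endpoint is type-2 or type-3.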
On 5/21/24 05:56, Dan Williams wrote:
> Alejandro Lucero Palau wrote:
>> On 5/17/24 01:08, Dan Williams wrote:
> [..]
>>> My expectation for CXL accelerator-provided memory is that it is marked "EFI Reserved" in the memory map (or CDAT), not "EFI Conventional + Special Purpose Attribute". Recall that the EFI_MEMORY_SP attribute is just a *hint* that the OS may *not* want to consider a memory range as part of the general purpose memory pool. CXL accelerator memory is far from that definition: if a driver needs to be loaded to effectively use a memory range, that is not a range that can be transparently converted to "EFI Conventional Memory".
>>
>> Well, that "EFI Reserved" is what we want. However, AFAIK, EFI_MEMORY_SP is the only one being discussed with our BIOS/UEFI contacts.
>
> Well, tell your BIOS contacts that they need to honor what the device puts in the CDAT for the memory-type, not just make it up on their own.
>
> From the Linux side we can scream loudly with WARN_TAINT_FIRMWARE_WORKAROUND every time we see a BIOS that has failed to match the EFI memory-type with whatever the device puts in the CDAT, because that's clearly a broken BIOS.

Yes, we need to sync with these guys :-)

EFI_MEMORY_SP was our/my only focus and that is probably the problem and the confusion.

BTW, how can the kernel warn about this? I guess this requires the driver raising it, doesn't it?

>> Because how to assign those flags seems to depend on the BIOS implementation, we expect some BIOS versions to do the wrong thing for our purposes, so I have been working on a workaround involving udev events and executing a modified daxctl for switching a dax device mode or even removing/destroying it.
>
> Yikes, no, let's not do that. Another thing we can do from the kernel side, in case the CDAT is missing, is to never attach device-dax to address ranges within CXL windows with the "Device Coherent Window Restriction".

I do not want to stake my reputation on this workaround, but I see it as a first and quick fallback not requiring kernel changes to CXL/DAX. My idea is to add some interface between drivers and DAX for doing this if necessary, once the driver tries to get the resource and it fails. Maybe with some kernel boot param like dax_allow_change=drivername or something similar. What else could we do for the case of the BIOS/kernel doing the wrong thing?

> [..]
>>> It would help me if you can summarize what you did and did not adopt from that proposal, i.e. where it helped and where it missed the mark for your use case.
>>
>> I basically took your patchset referenced in the RFC, specifically the 19th one, and tried to work with it from a pcie/CXL device driver. I debugged all the problems preventing the invoked CXL function from succeeding, applying the required changes from previous patches not committed yet. Patches 7, 8, 9 and 11 are based on your patchset, with just some minor changes, since I did not see a reason for changing them after studying them. The other patches are those fixes for having a Type2 finally able to create a dax region and map it.
>
> Ok, let's not add dax regions for this case, but the rest sounds straightforward, I'll take a look.
>
> [..]
>>>> The reason for adding such a Type2 driver instead of changes to a current kernel driver or a new driver is threefold:
>>>>
>>>> 1) the expected kernel driver to use the functionality added is a netdev one. Current internal CXL support is a codesign effort, therefore software and hardware evolving in lockstep. Adding changes to a netdev driver requires the full functionality and doing things following the netdev standard, which is not the best option for this development stage.
>>>
>>> Not sure what "netdev standard" means, and not sure it matters for a discussion that is constrained to how the CXL core should evolve to accommodate accelerator-local memory.
>>
>> You know I could not add to a netdev driver just those changes for CXL initialization, then mapping a CXL region and doing nothing else. Such a mapping needs to be linked to something inside the driver which is still under development and testing.
>
> Ok, you're just talking about pre-plumbing the CXL bits before getting started on the netdev side. Makes sense to me.
>
>>>> 2) Waiting for completing the development will delay the required Type2 support, and most of the required changes are unrelated to specific CXL usage by any vendor driver.
>>>
>>> Again, [2] already cleared the path for the idea that early plumbing of the CXL core for type-2 is in scope. Just need to make sure this effort is focused on general CXL specification use cases.
>>>
>>>> 3) Type2 support will need some testing infrastructure, unit tests for ensuring Type2 devices are working, and module tests for ensuring CXL core changes do not affect Type2 support.
>>>>
>>>> I hope these reasons are convincing enough.
>>>
>>> I am happy to have general specification discussions about core functionality that all CXL accelerators need and plumbing those ABIs to be exercised via something like cxl_test.
>>>
>>> I expect that cxl_test only covers coarse-grained inter-module ABIs and that the fine details will need to wait until the accelerator specifics are closer to being disclosed and the real driver patches can start flowing. In other words, I am not keen to see QEMU used as a way to proxy features into the kernel without disclosing what the real world consumer actually needs.
>>
>> Sure. FWIW, this will end up changing an internal driver mapping of HW buffers, which is now based on PCI BAR offsets, to the same mapping based on a CXL region.
>>
>> This involves PFs and VFs, and because CXL is only advertised by PF0, this all brings more complexity to those changes than could be expected.
>>
>> I think taking the CXL core to the support needed, hopefully based on this RFC, should be started as soon as possible instead of waiting for our driver/device to be ready.
>
> I agree, but continued transparency on what the CXL use case looks like in the final implementation helps make sure the right upstream CXL mechanism is being built.
>
> For example, just from that insight above I am wondering if we need some centralized interface for VFs to lookup / allocate CXL resources from their PF0.

Not in our case, but it could be needed (or convenient) for more complex devices regarding CXL ranges or decoders.

> [..]
>>>> Type2 implies, I think, only the related driver to manage the CXL specifics. This means no user space intervention and therefore no sysfs files. This makes it easy to avoid the current problem of most of the sysfs related code expecting Type3 devices. If I'm wrong in this regard, such code will need further changes.
>>>
>>> Not sure what you mean here. I am hoping that the sysfs ABI for enumerating objects like memdevs, decoders, and regions "just works" and is hidden from the accelerator vendor driver.
>>
>> If a Type2 is fully private, should there be any sysfs files about it?
>>
>> AFAIK, those files are mainly for user space being able to create memories by joining them, or for changing the dax mode. What do you think a Type2 needs through sysfs apart from info? While writing this I realize I have not thought about power management ...
>
> They are primarily for userspace to enumerate resources, and they provide kernel object names for CXL error events. For example userspace could generically map a CXL switch error to impacted endpoints without needing to know or care whether it was a type-2 or type-3 endpoint.

That makes a lot of sense. Then I'm afraid I should work on this. I'll try to find out if it could be just modifying the current patchset with minor changes.

Thanks

> [..]
>>> If you think we will still be talking about this driver enabling in September, I would invite you to submit this as a topic for the CXL uConf.
>>
>> That would be great!
>>
>> If this RFC work is integrated by then with the CXL core, that would be fantastic, but the CXL.cache support will still be pending.
>
> Sounds good.
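On the "how can the kernel warn about this?" question above, a minimal sketch of what the CXL core (rather than each vendor driver) could do when the EFI memory type contradicts the CDAT. The helper name and both type parameters are hypothetical; FW_BUG and TAINT_FIRMWARE_WORKAROUND are the existing kernel facilities:

#include <linux/device.h>
#include <linux/kernel.h>

/* Sketch: caller and parameters are hypothetical */
static void cxl_check_fw_memtype(struct device *dev, u8 cdat_type,
				 u8 efi_type)
{
	if (cdat_type == efi_type)
		return;

	/* Broken BIOS: EFI memory map disagrees with the device CDAT */
	dev_warn(dev, FW_BUG "EFI memory type %u contradicts CDAT type %u\n",
		 efi_type, cdat_type);
	add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);
}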
ping ... and some new comments below.

On 5/22/24 17:38, Alejandro Lucero Palau wrote:
> On 5/21/24 05:56, Dan Williams wrote:
>> Alejandro Lucero Palau wrote:
>>> On 5/17/24 01:08, Dan Williams wrote:
>> [..]
>>>> My expectation for CXL accelerator-provided memory is that it is marked "EFI Reserved" in the memory map (or CDAT), not "EFI Conventional + Special Purpose Attribute". Recall that the EFI_MEMORY_SP attribute is just a *hint* that the OS may *not* want to consider a memory range as part of the general purpose memory pool. CXL accelerator memory is far from that definition: if a driver needs to be loaded to effectively use a memory range, that is not a range that can be transparently converted to "EFI Conventional Memory".
>>> Well, that "EFI Reserved" is what we want. However, AFAIK, EFI_MEMORY_SP is the only one being discussed with our BIOS/UEFI contacts.
>> Well, tell your BIOS contacts that they need to honor what the device puts in the CDAT for the memory-type, not just make it up on their own.
>>
>> From the Linux side we can scream loudly with WARN_TAINT_FIRMWARE_WORKAROUND every time we see a BIOS that has failed to match the EFI memory-type with whatever the device puts in the CDAT, because that's clearly a broken BIOS.
>
> Yes, we need to sync with these guys :-)
>
> EFI_MEMORY_SP was our/my only focus and that is probably the problem and the confusion.
>
> BTW, how can the kernel warn about this? I guess this requires the driver raising it, doesn't it?
>
>>> Because how to assign those flags seems to depend on the BIOS implementation, we expect some BIOS versions to do the wrong thing for our purposes, so I have been working on a workaround involving udev events and executing a modified daxctl for switching a dax device mode or even removing/destroying it.
>> Yikes, no, let's not do that. Another thing we can do from the kernel side, in case the CDAT is missing, is to never attach device-dax to address ranges within CXL windows with the "Device Coherent Window Restriction".
>
> I do not want to stake my reputation on this workaround, but I see it as a first and quick fallback not requiring kernel changes to CXL/DAX. My idea is to add some interface between drivers and DAX for doing this if necessary, once the driver tries to get the resource and it fails. Maybe with some kernel boot param like dax_allow_change=drivername or something similar. What else could we do for the case of the BIOS/kernel doing the wrong thing?
>
>> [..]
>>>> It would help me if you can summarize what you did and did not adopt from that proposal, i.e. where it helped and where it missed the mark for your use case.
>>> I basically took your patchset referenced in the RFC, specifically the 19th one, and tried to work with it from a pcie/CXL device driver. I debugged all the problems preventing the invoked CXL function from succeeding, applying the required changes from previous patches not committed yet. Patches 7, 8, 9 and 11 are based on your patchset, with just some minor changes, since I did not see a reason for changing them after studying them. The other patches are those fixes for having a Type2 finally able to create a dax region and map it.
>> Ok, let's not add dax regions for this case, but the rest sounds straightforward, I'll take a look.
>>
>> [..]
>>>>> The reason for adding such a Type2 driver instead of changes to a current kernel driver or a new driver is threefold:
>>>>>
>>>>> 1) the expected kernel driver to use the functionality added is a netdev one. Current internal CXL support is a codesign effort, therefore software and hardware evolving in lockstep. Adding changes to a netdev driver requires the full functionality and doing things following the netdev standard, which is not the best option for this development stage.
>>>> Not sure what "netdev standard" means, and not sure it matters for a discussion that is constrained to how the CXL core should evolve to accommodate accelerator-local memory.
>>> You know I could not add to a netdev driver just those changes for CXL initialization, then mapping a CXL region and doing nothing else. Such a mapping needs to be linked to something inside the driver which is still under development and testing.
>> Ok, you're just talking about pre-plumbing the CXL bits before getting started on the netdev side. Makes sense to me.
>>
>>>>> 2) Waiting for completing the development will delay the required Type2 support, and most of the required changes are unrelated to specific CXL usage by any vendor driver.
>>>> Again, [2] already cleared the path for the idea that early plumbing of the CXL core for type-2 is in scope. Just need to make sure this effort is focused on general CXL specification use cases.
>>>>
>>>>> 3) Type2 support will need some testing infrastructure, unit tests for ensuring Type2 devices are working, and module tests for ensuring CXL core changes do not affect Type2 support.
>>>>>
>>>>> I hope these reasons are convincing enough.
>>>> I am happy to have general specification discussions about core functionality that all CXL accelerators need and plumbing those ABIs to be exercised via something like cxl_test.
>>>>
>>>> I expect that cxl_test only covers coarse-grained inter-module ABIs and that the fine details will need to wait until the accelerator specifics are closer to being disclosed and the real driver patches can start flowing. In other words, I am not keen to see QEMU used as a way to proxy features into the kernel without disclosing what the real world consumer actually needs.
>>> Sure. FWIW, this will end up changing an internal driver mapping of HW buffers, which is now based on PCI BAR offsets, to the same mapping based on a CXL region.
>>>
>>> This involves PFs and VFs, and because CXL is only advertised by PF0, this all brings more complexity to those changes than could be expected.
>>>
>>> I think taking the CXL core to the support needed, hopefully based on this RFC, should be started as soon as possible instead of waiting for our driver/device to be ready.
>> I agree, but continued transparency on what the CXL use case looks like in the final implementation helps make sure the right upstream CXL mechanism is being built.
>>
>> For example, just from that insight above I am wondering if we need some centralized interface for VFs to lookup / allocate CXL resources from their PF0.
>
> Not in our case, but it could be needed (or convenient) for more complex devices regarding CXL ranges or decoders.
>
>> [..]
>>>>> Type2 implies, I think, only the related driver to manage the CXL specifics. This means no user space intervention and therefore no sysfs files. This makes it easy to avoid the current problem of most of the sysfs related code expecting Type3 devices. If I'm wrong in this regard, such code will need further changes.
>>>> Not sure what you mean here. I am hoping that the sysfs ABI for enumerating objects like memdevs, decoders, and regions "just works" and is hidden from the accelerator vendor driver.
>>> If a Type2 is fully private, should there be any sysfs files about it?
>>>
>>> AFAIK, those files are mainly for user space being able to create memories by joining them, or for changing the dax mode. What do you think a Type2 needs through sysfs apart from info? While writing this I realize I have not thought about power management ...
>> They are primarily for userspace to enumerate resources, and they provide kernel object names for CXL error events. For example userspace could generically map a CXL switch error to impacted endpoints without needing to know or care whether it was a type-2 or type-3 endpoint.
>
> That makes a lot of sense. Then I'm afraid I should work on this. I'll try to find out if it could be just modifying the current patchset with minor changes.
>
> Thanks

I've been looking at this, mainly alarmed by the error handling problem mentioned.

Firstly, my original comment in the cover letter is misleading. The patchset does not avoid kobjects being created, just some sysfs files related to kobject attributes. I think it is the right thing to do, although some of those attributes could in fact be shown, so I will fix that. However, I think some fields now in cxl_memdev_state should be inside cxl_memdev, like ram_perf/pmem_perf, and moreover, I think we need a more flexible way of linking Type2 to those structs where the optional Type2 functionalities can be set or not. Not something to be faced in this patchset, but by some refactoring in the near future if my concern turns out to be a real one.

Secondly, there is another concern, maybe a more important one, related to the error handling. I have to admit I did not think about this at all before Dan's comment, and I think it is something worth discussing. There are, at least, two error situations I'm concerned about, although they (can) end up with the same action: a pcie/CXL slot reset. Assuming there could be cases where the system can recover from this situation, and noting that CXL.mem or CXL.cache could stall CPUs and likely crash the system, the concern is how CXL device configuration like the HDM decoders should be restored after the slot reset. While an FLR keeps that config, a slot reset does not. The current Type3 support inside the CXL pci driver does nothing when the device is detached (no release callback), which is what happens when the system performs the slot reset, and cxl_error_resume does not try anything. If I'm not wrong, it is in the resume callback where the device is probed again. So here is my question: shouldn't an endpoint slot reset be supported?

>> [..]
>>>> If you think we will still be talking about this driver enabling in September, I would invite you to submit this as a topic for the CXL uConf.
>>> That would be great!
>>>
>>> If this RFC work is integrated by then with the CXL core, that would be fantastic, but the CXL.cache support will still be pending.
>> Sounds good.
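To anchor the slot reset question: the hook points would be the standard struct pci_error_handlers callbacks, and the open issue is what the CXL core should offer there, since (unlike an FLR) a slot reset loses the HDM decoder programming. A minimal sketch, with t2_reprogram_hdm() standing in for hypothetical re-setup logic:

#include <linux/pci.h>

/* Hypothetical: redo DVSEC/decoder setup and re-create the region */
static void t2_reprogram_hdm(struct pci_dev *pdev)
{
}

static pci_ers_result_t t2_slot_reset(struct pci_dev *pdev)
{
	/* Link is back up; decide whether recovery is possible at all */
	return PCI_ERS_RESULT_RECOVERED;
}

static void t2_error_resume(struct pci_dev *pdev)
{
	/* Device config was lost by the reset, so rebuild it here */
	t2_reprogram_hdm(pdev);
}

static const struct pci_error_handlers t2_err_handlers = {
	.slot_reset	= t2_slot_reset,
	.resume		= t2_error_resume,
};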
From: Alejandro Lucero <alejandro.lucero-palau@amd.com>

I need to start this RFC explaining not what the patchset does, but what we expected to obtain and is currently in doubt: to configure a CXL memory range advertised by our CXL network device and to use it "privately".

The main reason behind that privacy, but not the only one, is to avoid arbitrary use by any user-space "client" with enough privileges to use a public interface for mapping the device memory and using it as regular memory. The reason is obvious: the device expects writes to such memory in a specific format.

The doubt comes from the fact that, after implementing the functionality exposed in this patchset, we realized the current expectation seems to be a BIOS/UEFI configuring HDM decoders or HPA ranges and passing memory ranges to the kernel, with the kernel taking a default action on that memory range based on whether the flag EFI_MEMORY_SP is set. If it is not set, the memory range will be part of the kernel memory management, which we certainly do not want. If the flag is set, a DAX device will be created which allows a user-space client to map such memory through the DAX device API, which we also prefer to avoid. I know this is likely going to face opposition, but we see this RFC as the opportunity for discussing the matter and, if it turns out to be the case, for being guided towards the proper solution accepted by the maintainers/community. This patchset does not tackle this default kernel behaviour, although we already have some ideas and workarounds for the short term. We'll be happy to discuss this openly.

So, this patchset assumes a BIOS/UEFI not programming the HDM decoder or HPA ranges for a CXL Type2 device. Although maybe a weird assumption as explained above, a Type2 device added after boot, that is hotplugged, will likely need the changes added here. Exporting some of the CXL core for vendor drivers is also required for avoiding code duplication when reading DVSEC or decoder registers. Finally, if there is no such HPA range assigned, whatever the reason, this patchset offers some way of obtaining one and/or finding out what is the problem behind the lack of such HPA range.

Current CXL kernel code is focused on supporting Type3 CXL devices, aka memory expanders. Type2 CXL devices, aka device accelerators, share some functionality but require some special handling. First of all, Type2 devices are by definition specific to vendor designs, and a specific vendor driver is expected to handle the CXL specifics. This implies the CXL setup needs to be done by such a driver instead of by a generic CXL PCI driver as for memory expanders. Most of that setup needs to use current CXL core code and therefore needs to be accessible to those vendor drivers. This is accomplished with the first patch and with extensions to the exported functions in subsequent patches.

Keeping with kernel rules of having a client use any new functionality added, a CXL type2 driver is added incrementally. This is not a driver supporting a real device but an emulated one in QEMU, with the QEMU patch following this patchset.

The reason for adding such a Type2 driver instead of changes to a current kernel driver or a new driver is threefold:

1) the expected kernel driver to use the functionality added is a netdev one. Current internal CXL support is a codesign effort, therefore software and hardware evolving in lockstep. Adding changes to a netdev driver requires the full functionality and doing things following the netdev standard, which is not the best option for this development stage.

2) Waiting for completing the development will delay the required Type2 support, and most of the required changes are unrelated to specific CXL usage by any vendor driver.

3) Type2 support will need some testing infrastructure, unit tests for ensuring Type2 devices are working, and module tests for ensuring CXL core changes do not affect Type2 support.

I hope these reasons are convincing enough.

I have decided to follow a gradual approach for adding such a driver using the exported CXL functions and structs. I think it is easier to review the driver when the new functionality is added than to add the driver at the end, but not a big deal if my approach is not liked.

The patches are based on a patchset sent by Dan Williams [1] which was just partially integrated, mostly related to making things ready for Type2 but none related to specific Type2 support. Those patches based on Dan's work have Dan's signing, so Dan, tell me if you do not want me to add you.

Type2 implies, I think, only the related driver to manage the CXL specifics. This means no user space intervention and therefore no sysfs files. This makes it easy to avoid the current problem of most of the sysfs related code expecting Type3 devices. If I'm wrong in this regard, such code will need further changes.

A final note about CXL.cache is needed. This patchset does not cover it at all, although the emulated Type2 device advertises it. From the kernel point of view, supporting CXL.cache will imply being sure the CXL path supports what the Type2 device needs. A device accelerator will likely be connected to a Root Switch, but other configurations can not be discarded. Therefore the kernel will need to check not just HPA, DPA, interleave and granularity, but also the available CXL.cache support and resources in each switch in the CXL path to the Type2 device. I expect to contribute to this support in the following months, and it would be good to discuss it when possible.

Alejandro.
[1] https://lore.kernel.org/linux-cxl/98b1f61a-e6c2-71d4-c368-50d958501b0c@intel.com/T/

Alejandro Lucero (13):
  cxl: move header files for absolute references
  cxl: add type2 device basic support
  cxl: export core function for type2 devices
  cxl: allow devices without mailbox capability
  cxl: fix check about pmem resource
  cxl: support type2 memdev creation
  cxl: add functions for exclusive access to endpoint port topology
  cxl: add cxl_get_hpa_freespace
  cxl: add cxl_request_dpa
  cxl: make region type based on endpoint type
  cxl: allow automatic region creation by type2 drivers
  cxl: preclude device memory to be used for dax
  cxl: test type2 private mapping

 drivers/cxl/acpi.c                      |   4 +-
 drivers/cxl/core/cdat.c                 |   9 +-
 drivers/cxl/core/core.h                 |   1 -
 drivers/cxl/core/hdm.c                  | 159 +++++++--
 drivers/cxl/core/mbox.c                 |   6 +-
 drivers/cxl/core/memdev.c               |  75 +++-
 drivers/cxl/core/pci.c                  |   6 +-
 drivers/cxl/core/pmem.c                 |   4 +-
 drivers/cxl/core/pmu.c                  |   4 +-
 drivers/cxl/core/port.c                 |   6 +-
 drivers/cxl/core/region.c               | 446 ++++++++++++++++++++----
 drivers/cxl/core/regs.c                 |   9 +-
 drivers/cxl/core/suspend.c              |   2 +-
 drivers/cxl/core/trace.c                |   2 +-
 drivers/cxl/core/trace.h                |   4 +-
 drivers/cxl/mem.c                       |  23 +-
 drivers/cxl/pci.c                       |   9 +-
 drivers/cxl/pmem.c                      |   4 +-
 drivers/cxl/port.c                      |   4 +-
 drivers/cxl/security.c                  |   4 +-
 drivers/dax/cxl.c                       |   2 +-
 drivers/perf/cxl_pmu.c                  |   4 +-
 {drivers/cxl => include/linux}/cxl.h    |   5 +
 {drivers/cxl => include/linux}/cxlmem.h |  22 +-
 {drivers/cxl => include/linux}/cxlpci.h |   2 +
 tools/testing/cxl/Kbuild                |   1 +
 tools/testing/cxl/cxl_core_exports.c    |   2 +-
 tools/testing/cxl/mock_acpi.c           |   2 +-
 tools/testing/cxl/test/cxl.c            |   2 +-
 tools/testing/cxl/test/mem.c            |   2 +-
 tools/testing/cxl/test/mock.c           |   4 +-
 tools/testing/cxl/test/mock.h           |   2 +-
 tools/testing/cxl/type2/Kbuild          |   7 +
 tools/testing/cxl/type2/pci_type2.c     | 201 +++++++++++
 34 files changed, 886 insertions(+), 153 deletions(-)
 rename {drivers/cxl => include/linux}/cxl.h (99%)
 rename {drivers/cxl => include/linux}/cxlmem.h (97%)
 rename {drivers/cxl => include/linux}/cxlpci.h (97%)
 create mode 100644 tools/testing/cxl/type2/Kbuild
 create mode 100644 tools/testing/cxl/type2/pci_type2.c
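To give reviewers a feel for how a vendor driver would consume the exports above, a rough probe-time sketch follows. Only the helper names referenced in the comments come from the shortlog; everything else, including the function name, is an assumption for illustration:

#include <linux/pci.h>

/* Sketch: the flow a type2 vendor driver would follow, mapped to the
 * shortlog; no real signatures are implied.
 */
static int t2_cxl_probe(struct pci_dev *pdev)
{
	/*
	 * 1. "add type2 device basic support" / "allow devices without
	 *    mailbox capability": build the dev state from DVSEC regs.
	 * 2. "support type2 memdev creation": register a memdev so the
	 *    endpoint shows up in the common topology.
	 * 3. "cxl_get_hpa_freespace": pick a free HPA window below a
	 *    root decoder (covers the BIOS-did-not-map case).
	 * 4. "cxl_request_dpa": claim the accelerator's DPA range.
	 * 5. "allow automatic region creation by type2 drivers": commit
	 *    the HPA<->DPA mapping; "preclude device memory to be used
	 *    for dax" then keeps the result driver-private.
	 */
	return 0;
}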