Message ID: 20240422071606.52637-1-dongsheng.yang@easystack.cn
Series: block: Introduce CBD (CXL Block Device)
Dongsheng Yang wrote: > From: Dongsheng Yang <dongsheng.yang.linux@gmail.com> > > Hi all, > This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at: > https://github.com/DataTravelGuide/linux > [..] > (4) dax is not supported yet: > same with famfs, dax device is not supported here, because dax device does not support > dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode. I am glad that famfs is mentioned here, it demonstrates you know about it. However, unfortunately this cover letter does not offer any analysis of *why* the Linux project should consider this additional approach to the inter-host shared-memory enabling problem. To be clear I am neutral at best on some of the initiatives around CXL memory sharing vs pooling, but famfs at least jettisons block-devices and gets closer to a purpose-built memory semantic. So my primary question is why would Linux need both famfs and cbd? I am sure famfs would love feedback and help vs developing competing efforts.
在 2024/4/24 星期三 下午 12:29, Dan Williams 写道: > Dongsheng Yang wrote: >> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com> >> >> Hi all, >> This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at: >> https://github.com/DataTravelGuide/linux >> > [..] >> (4) dax is not supported yet: >> same with famfs, dax device is not supported here, because dax device does not support >> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode. > > I am glad that famfs is mentioned here, it demonstrates you know about > it. However, unfortunately this cover letter does not offer any analysis > of *why* the Linux project should consider this additional approach to > the inter-host shared-memory enabling problem. > > To be clear I am neutral at best on some of the initiatives around CXL > memory sharing vs pooling, but famfs at least jettisons block-devices > and gets closer to a purpose-built memory semantic. > > So my primary question is why would Linux need both famfs and cbd? I am > sure famfs would love feedback and help vs developing competing efforts. Hi, Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in shared memory, and related nodes can share the data inside this file system; whereas cbd does not store data in shared memory, it uses shared memory as a channel for data transmission, and the actual data is stored in the backend block device of remote nodes. In cbd, shared memory works more like network to connect different hosts. That is to say, in my view, FAMfs and cbd do not conflict at all; they meet different scenario requirements. cbd simply uses shared memory to transmit data, shared memory plays the role of a data transmission channel, while in FAMfs, shared memory serves as a data store role. Please correct me if I am wrong. Thanx > . >
On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote: > > > 在 2024/4/24 星期三 下午 12:29, Dan Williams 写道: > > Dongsheng Yang wrote: > > > From: Dongsheng Yang <dongsheng.yang.linux@gmail.com> > > > > > > Hi all, > > > This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at: > > > https://github.com/DataTravelGuide/linux > > > > > [..] > > > (4) dax is not supported yet: > > > same with famfs, dax device is not supported here, because dax device does not support > > > dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode. > > > > I am glad that famfs is mentioned here, it demonstrates you know about > > it. However, unfortunately this cover letter does not offer any analysis > > of *why* the Linux project should consider this additional approach to > > the inter-host shared-memory enabling problem. > > > > To be clear I am neutral at best on some of the initiatives around CXL > > memory sharing vs pooling, but famfs at least jettisons block-devices > > and gets closer to a purpose-built memory semantic. > > > > So my primary question is why would Linux need both famfs and cbd? I am > > sure famfs would love feedback and help vs developing competing efforts. > > Hi, > Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in > shared memory, and related nodes can share the data inside this file system; > whereas cbd does not store data in shared memory, it uses shared memory as a > channel for data transmission, and the actual data is stored in the backend > block device of remote nodes. In cbd, shared memory works more like network > to connect different hosts. > Couldn't you basically just allocate a file for use as a uni-directional buffer on top of FAMFS and achieve the same thing without the need for additional kernel support? Similar in a sense to allocating a file on network storage and pinging the remote host when it's ready (except now it's fast!) 
(The point here is not "FAMFS is better" or "CBD is better", simply trying to identify the function that will ultimately dictate the form). ~Gregory
Dongsheng Yang wrote: > > > 在 2024/4/24 星期三 下午 12:29, Dan Williams 写道: > > Dongsheng Yang wrote: > >> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com> > >> > >> Hi all, > >> This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at: > >> https://github.com/DataTravelGuide/linux > >> > > [..] > >> (4) dax is not supported yet: > >> same with famfs, dax device is not supported here, because dax device does not support > >> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode. > > > > I am glad that famfs is mentioned here, it demonstrates you know about > > it. However, unfortunately this cover letter does not offer any analysis > > of *why* the Linux project should consider this additional approach to > > the inter-host shared-memory enabling problem. > > > > To be clear I am neutral at best on some of the initiatives around CXL > > memory sharing vs pooling, but famfs at least jettisons block-devices > > and gets closer to a purpose-built memory semantic. > > > > So my primary question is why would Linux need both famfs and cbd? I am > > sure famfs would love feedback and help vs developing competing efforts. > > Hi, > Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in > shared memory, and related nodes can share the data inside this file > system; whereas cbd does not store data in shared memory, it uses shared > memory as a channel for data transmission, and the actual data is stored > in the backend block device of remote nodes. In cbd, shared memory works > more like network to connect different hosts. > > That is to say, in my view, FAMfs and cbd do not conflict at all; they > meet different scenario requirements. cbd simply uses shared memory to > transmit data, shared memory plays the role of a data transmission > channel, while in FAMfs, shared memory serves as a data store role. 
If shared memory is just a communication transport then a block-device abstraction does not seem a proper fit. From the above description this sounds similar to what CONFIG_NTB_TRANSPORT offers, which is a way for two hosts to communicate over a shared memory channel. So, I am not really looking for an analysis of famfs vs CBD; I am looking for CBD to clarify why Linux should consider it, and why the architecture is fit for purpose.
在 2024/4/24 星期三 下午 11:14, Gregory Price 写道: > On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote: >> >> >> 在 2024/4/24 星期三 下午 12:29, Dan Williams 写道: >>> Dongsheng Yang wrote: >>>> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com> >>>> >>>> Hi all, >>>> This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at: >>>> https://github.com/DataTravelGuide/linux >>>> >>> [..] >>>> (4) dax is not supported yet: >>>> same with famfs, dax device is not supported here, because dax device does not support >>>> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode. >>> >>> I am glad that famfs is mentioned here, it demonstrates you know about >>> it. However, unfortunately this cover letter does not offer any analysis >>> of *why* the Linux project should consider this additional approach to >>> the inter-host shared-memory enabling problem. >>> >>> To be clear I am neutral at best on some of the initiatives around CXL >>> memory sharing vs pooling, but famfs at least jettisons block-devices >>> and gets closer to a purpose-built memory semantic. >>> >>> So my primary question is why would Linux need both famfs and cbd? I am >>> sure famfs would love feedback and help vs developing competing efforts. >> >> Hi, >> Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in >> shared memory, and related nodes can share the data inside this file system; >> whereas cbd does not store data in shared memory, it uses shared memory as a >> channel for data transmission, and the actual data is stored in the backend >> block device of remote nodes. In cbd, shared memory works more like network >> to connect different hosts. >> > > Couldn't you basically just allocate a file for use as a uni-directional > buffer on top of FAMFS and achieve the same thing without the need for > additional kernel support? 
Similar in a sense to allocating a file on > network storage and pinging the remote host when it's ready (except now > it's fast!) I'm not entirely sure I follow your suggestion. I guess it means that cbd would no longer directly manage the pmem device, but allocate files on famfs to transfer data. I didn't do it this way because I considered at least a few points: one of them is that cbd_transport actually requires a DAX device to access shared memory, and cbd has very simple requirements for space management, so there's no need to rely on a file system layer, which would increase architectural complexity. However, we still need cbd_blkdev to provide a block device, so it doesn't "achieve the same thing without the need for additional kernel support". Could you please provide more specific details about your suggestion? > > (The point here is not "FAMFS is better" or "CBD is better", simply trying to identify the function that will ultimately dictate the form). Thank you for your clarification. I totally agree; discussions always make the issues clearer. Thanx > > ~Gregory >
On Fri, Apr 26, 2024 at 09:25:53AM +0800, Dongsheng Yang wrote: > > > 在 2024/4/24 星期三 下午 11:14, Gregory Price 写道: > > On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote: > > > > > > > > > 在 2024/4/24 星期三 下午 12:29, Dan Williams 写道: > > > > Dongsheng Yang wrote: > > > > > From: Dongsheng Yang <dongsheng.yang.linux@gmail.com> > > > > > > > > > > Hi all, > > > > > This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at: > > > > > https://github.com/DataTravelGuide/linux > > > > > > > > > [..] > > > > > (4) dax is not supported yet: > > > > > same with famfs, dax device is not supported here, because dax device does not support > > > > > dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode. > > > > > > > > I am glad that famfs is mentioned here, it demonstrates you know about > > > > it. However, unfortunately this cover letter does not offer any analysis > > > > of *why* the Linux project should consider this additional approach to > > > > the inter-host shared-memory enabling problem. > > > > > > > > To be clear I am neutral at best on some of the initiatives around CXL > > > > memory sharing vs pooling, but famfs at least jettisons block-devices > > > > and gets closer to a purpose-built memory semantic. > > > > > > > > So my primary question is why would Linux need both famfs and cbd? I am > > > > sure famfs would love feedback and help vs developing competing efforts. > > > > > > Hi, > > > Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in > > > shared memory, and related nodes can share the data inside this file system; > > > whereas cbd does not store data in shared memory, it uses shared memory as a > > > channel for data transmission, and the actual data is stored in the backend > > > block device of remote nodes. In cbd, shared memory works more like network > > > to connect different hosts. 
> > > > > > > Couldn't you basically just allocate a file for use as a uni-directional > > buffer on top of FAMFS and achieve the same thing without the need for > > additional kernel support? Similar in a sense to allocating a file on > > network storage and pinging the remote host when it's ready (except now > > it's fast!) > > I'm not entirely sure I follow your suggestion. I guess it means that cbd > would no longer directly manage the pmem device, but allocate files on famfs > to transfer data. I didn't do it this way because I considered at least a > few points: one of them is, cbd_transport actually requires a DAX device to > access shared memory, and cbd has very simple requirements for space > management, so there's no need to rely on a file system layer, which would > increase architectural complexity. > > However, we still need cbd_blkdev to provide a block device, so it doesn't > achieve "achieve the same without the need for additional kernel support". > > Could you please provide more specific details about your suggestion? Fundamentally you're shuffling bits from one place to another, the ultimate target is storage located on another device as opposed to the memory itself. So you're using CXL as a transport medium. Could you not do the same thing with a file in FAMFS, and put all of the transport logic in userland? Then you'd just have what looks like a kernel bypass transport mechanism built on top of a file backed by shared memory. Basically it's unclear to me why this must be done in the kernel. Performance? Explicit bypass? Some technical reason I'm missing? Also, on a tangential note, you're using pmem/qemu to emulate the behavior of shared CXL memory. You should probably explain the coherence implications of the system more explicitly. The emulated system implements what amounts to hardware-coherent memory (i.e. the two QEMU machines run on the same physical machine, so coherency is managed within the same coherence domain). 
If there is no explicit coherence control in software, then it is important to state that this system relies on hardware that implements snoop back-invalidate (which is not a requirement of a CXL 3.x device, just a feature described by the spec that may be implemented). ~Gregory
在 2024/4/26 星期五 下午 9:48, Gregory Price 写道: > On Fri, Apr 26, 2024 at 09:25:53AM +0800, Dongsheng Yang wrote: >> >> >> 在 2024/4/24 星期三 下午 11:14, Gregory Price 写道: >>> On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote: >>>> >>>> >>>> 在 2024/4/24 星期三 下午 12:29, Dan Williams 写道: >>>>> Dongsheng Yang wrote: >>>>>> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com> >>>>>> >>>>>> Hi all, >>>>>> This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at: >>>>>> https://github.com/DataTravelGuide/linux >>>>>> >>>>> [..] >>>>>> (4) dax is not supported yet: >>>>>> same with famfs, dax device is not supported here, because dax device does not support >>>>>> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode. >>>>> >>>>> I am glad that famfs is mentioned here, it demonstrates you know about >>>>> it. However, unfortunately this cover letter does not offer any analysis >>>>> of *why* the Linux project should consider this additional approach to >>>>> the inter-host shared-memory enabling problem. >>>>> >>>>> To be clear I am neutral at best on some of the initiatives around CXL >>>>> memory sharing vs pooling, but famfs at least jettisons block-devices >>>>> and gets closer to a purpose-built memory semantic. >>>>> >>>>> So my primary question is why would Linux need both famfs and cbd? I am >>>>> sure famfs would love feedback and help vs developing competing efforts. >>>> >>>> Hi, >>>> Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in >>>> shared memory, and related nodes can share the data inside this file system; >>>> whereas cbd does not store data in shared memory, it uses shared memory as a >>>> channel for data transmission, and the actual data is stored in the backend >>>> block device of remote nodes. In cbd, shared memory works more like network >>>> to connect different hosts. 
>>>> >>> >>> Couldn't you basically just allocate a file for use as a uni-directional >>> buffer on top of FAMFS and achieve the same thing without the need for >>> additional kernel support? Similar in a sense to allocating a file on >>> network storage and pinging the remote host when it's ready (except now >>> it's fast!) >> >> I'm not entirely sure I follow your suggestion. I guess it means that cbd >> would no longer directly manage the pmem device, but allocate files on famfs >> to transfer data. I didn't do it this way because I considered at least a >> few points: one of them is, cbd_transport actually requires a DAX device to >> access shared memory, and cbd has very simple requirements for space >> management, so there's no need to rely on a file system layer, which would >> increase architectural complexity. >> >> However, we still need cbd_blkdev to provide a block device, so it doesn't >> achieve "achieve the same without the need for additional kernel support". >> >> Could you please provide more specific details about your suggestion? > > Fundamentally you're shuffling bits from one place to another, the > ultimate target is storage located on another device as opposed to > the memory itself. So you're using CXL as a transport medium. > > Could you not do the same thing with a file in FAMFS, and put all of > the transport logic in userland? Then you'd just have what looks like > a kernel bypass transport mechanism built on top of a file backed by > shared memory. > > Basically it's unclear to me why this must be done in the kernel. > Performance? Explicit bypass? Some technical reason I'm missing? In user space, transferring data via FAMFS files poses no problem, but how do we present this data to users? We cannot expect users to revamp all their business I/O methods. For example, suppose a user needs to run a database on a compute node. 
As the cloud infrastructure department, we need to allocate block storage on the storage node and provide it to the database on the compute node through a certain transmission protocol (such as iSCSI, NVMe over Fabrics, or our current solution, cbd). Users can then create any file system they like on the block device and run the database on it. We aim to enhance the performance of this block device with cbd, rather than requiring the business department to adapt their database to fit our shared memory-facing storage node disks. This is why we need to provide users with a block device. If it were only about data transmission, we wouldn't need a block device. But when it comes to actually running business operations, we need a block storage interface for the upper layer. Additionally, the block device layer offers many other rich features, such as RAID. If accessing shared memory in user space is mandatory, there's another option: using user space block storage technologies like ublk. However, this would lead to performance issues as data would need to traverse back to the kernel space block device from the user space process. In summary, we need a block device sharing mechanism, similar to what is provided by NBD, iSCSI, or NVMe over Fabrics, because user businesses rely on the block device interface and ecosystem. > > > Also, on a tangential note, you're using pmem/qemu to emulate the > behavior of shared CXL memory. You should probably explain the > coherence implications of the system more explicitly. > > The emulated system implements what amounts to hardware-coherent > memory (i.e. the two QEMU machines run on the same physical machine, > so coherency is managed within the same coherence domain).
> > If there is no explicit coherence control in software, then it is > important to state that this system relies on hardware that implements > snoop back-invalidate (which is not a requirement of a CXL 3.x device, > just a feature described by the spec that may be implemented). In (5) of the cover letter, I mentioned that cbd addresses cache coherence at the software level: (5) How do blkdev and backend interact through the channel? a) For the reader side, before reading the data, if the data in this channel may be modified by the other party, then I need to flush the cache before reading to ensure that I get the latest data. For example, the blkdev needs to flush the cache before obtaining compr_head because compr_head will be updated by the backend handler. b) For the writer side, if the written information will be read by others, then after writing, I need to flush the cache to let the other party see it immediately. For example, after blkdev submits cbd_se, it needs to update cmd_head to let the handler have a new cbd_se. Therefore, after updating cmd_head, I need to flush the cache to let the backend see it. This part of the code is indeed implemented; however, as you pointed out, since I am currently using qemu/pmem for emulation, the effects of this code cannot be observed. Thanx > > ~Gregory > . >
On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote: > > > 在 2024/4/26 星期五 下午 9:48, Gregory Price 写道: > > > > Also, on a tangential note, you're using pmem/qemu to emulate the > > behavior of shared CXL memory. You should probably explain the > > coherence implications of the system more explicitly. > > > > The emulated system implements what amounts to hardware-coherent > > memory (i.e. the two QEMU machines run on the same physical machine, > > so coherency is managed within the same coherence domain). > > > > If there is no explicit coherence control in software, then it is > > important to state that this system relies on hardware that implements > > snoop back-invalidate (which is not a requirement of a CXL 3.x device, > > just a feature described by the spec that may be implemented). > > In (5) of the cover letter, I mentioned that cbd addresses cache coherence > at the software level: > > (5) How do blkdev and backend interact through the channel? > a) For reader side, before reading the data, if the data in this channel > may be modified by the other party, then I need to flush the cache before > reading to ensure that I get the latest data. For example, the blkdev needs > to flush the cache before obtaining compr_head because compr_head will be > updated by the backend handler. > b) For writter side, if the written information will be read by others, > then after writing, I need to flush the cache to let the other party see it > immediately. For example, after blkdev submits cbd_se, it needs to update > cmd_head to let the handler have a new cbd_se. Therefore, after updating > cmd_head, I need to flush the cache to let the backend see it. > Flushing the cache is insufficient. All that cache flushing guarantees is that the memory has left the writer's CPU cache. 
There are potentially many write buffers between the CPU and the actual backing media that the CPU has no visibility of and cannot pierce through to force a full guaranteed flush back to the media. For example:

memcpy(some_cacheline, data, 64);
mfence();

will not guarantee that after mfence() completes the remote host will have visibility of the data. mfence() does not guarantee a full flush back down to the device; it only guarantees it has been pushed out of the CPU's cache. Similarly:

memcpy(some_cacheline, data, 64);
mfence();
memcpy(some_other_cacheline, data, 64);
mfence();

will not guarantee that some_cacheline reaches the backing media prior to some_other_cacheline, as there is no guarantee of write-ordering in CXL controllers (with the exception of writes to the same cacheline). So this statement: > I need to flush the cache to let the other party see it immediately. Is misleading. They will not see it "immediately"; they will see it "eventually, at some completely unknowable time in the future". ~Gregory
在 2024/4/27 星期六 上午 12:14, Gregory Price 写道: > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote: >> >> >> 在 2024/4/26 星期五 下午 9:48, Gregory Price 写道: >>> >> >> In (5) of the cover letter, I mentioned that cbd addresses cache coherence >> at the software level: >> >> (5) How do blkdev and backend interact through the channel? >> a) For reader side, before reading the data, if the data in this channel >> may be modified by the other party, then I need to flush the cache before >> reading to ensure that I get the latest data. For example, the blkdev needs >> to flush the cache before obtaining compr_head because compr_head will be >> updated by the backend handler. >> b) For writter side, if the written information will be read by others, >> then after writing, I need to flush the cache to let the other party see it >> immediately. For example, after blkdev submits cbd_se, it needs to update >> cmd_head to let the handler have a new cbd_se. Therefore, after updating >> cmd_head, I need to flush the cache to let the backend see it. >> > > Flushing the cache is insufficient. All that cache flushing guarantees > is that the memory has left the writer's CPU cache. There are potentially > many write buffers between the CPU and the actual backing media that the > CPU has no visibility of and cannot pierce through to force a full > guaranteed flush back to the media. > > for example: > > memcpy(some_cacheline, data, 64); > mfence(); > > Will not guarantee that after mfence() completes that the remote host > will have visibility of the data. mfence() does not guarantee a full > flush back down to the device, it only guarantees it has been pushed out > of the CPU's cache. 
> > similarly: > > memcpy(some_cacheline, data, 64); > > mfence(); > > memcpy(some_other_cacheline, data, 64); > > mfence() > > Will not guarantee that some_cacheline reaches the backing media prior > > to some_other_cacheline, as there is no guarantee of write-ordering in > > CXL controllers (with the exception of writes to the same cacheline). > > > > So this statement: > > > > > I need to flush the cache to let the other party see it immediately. > > > > Is misleading. They will not see is "immediately", they will see it > > "eventually at some completely unknowable time in the future". This is indeed one of the issues I wanted to discuss at the RFC stage. Thank you for pointing it out. In my opinion, using "nvdimm_flush" might be one way to address this issue, but it seems to flush the entire nd_region, which might be too heavy. Moreover, it only applies to non-volatile memory. This should be a general problem for cxl shared memory. In theory, FAMFS should also encounter this issue. Gregory, John, and Dan, any suggestions about it? Thanx a lot > > ~Gregory >
On Sun, Apr 28, 2024 at 01:47:29PM +0800, Dongsheng Yang wrote: > > > 在 2024/4/27 星期六 上午 12:14, Gregory Price 写道: > > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote: > > > > > > > > > 在 2024/4/26 星期五 下午 9:48, Gregory Price 写道: > > > > > > > > > > In (5) of the cover letter, I mentioned that cbd addresses cache coherence > > > at the software level: > > > > > > (5) How do blkdev and backend interact through the channel? > > > a) For reader side, before reading the data, if the data in this channel > > > may be modified by the other party, then I need to flush the cache before > > > reading to ensure that I get the latest data. For example, the blkdev needs > > > to flush the cache before obtaining compr_head because compr_head will be > > > updated by the backend handler. > > > b) For writter side, if the written information will be read by others, > > > then after writing, I need to flush the cache to let the other party see it > > > immediately. For example, after blkdev submits cbd_se, it needs to update > > > cmd_head to let the handler have a new cbd_se. Therefore, after updating > > > cmd_head, I need to flush the cache to let the backend see it. > > > > > > > Flushing the cache is insufficient. All that cache flushing guarantees > > is that the memory has left the writer's CPU cache. There are potentially > > many write buffers between the CPU and the actual backing media that the > > CPU has no visibility of and cannot pierce through to force a full > > guaranteed flush back to the media. > > > > for example: > > > > memcpy(some_cacheline, data, 64); > > mfence(); > > > > Will not guarantee that after mfence() completes that the remote host > > will have visibility of the data. mfence() does not guarantee a full > > flush back down to the device, it only guarantees it has been pushed out > > of the CPU's cache. 
> > > > similarly: > > > > memcpy(some_cacheline, data, 64); > > mfence(); > > memcpy(some_other_cacheline, data, 64); > > mfence() > > just a derp here, meant to add an explicit clflush(some_cacheline) between the copy and the mfence. But the result is the same. > > Will not guarantee that some_cacheline reaches the backing media prior > > to some_other_cacheline, as there is no guarantee of write-ordering in > > CXL controllers (with the exception of writes to the same cacheline). > > > > So this statement: > > > > > I need to flush the cache to let the other party see it immediately. > > > > Is misleading. They will not see is "immediately", they will see it > > "eventually at some completely unknowable time in the future". > > This is indeed one of the issues I wanted to discuss at the RFC stage. Thank > you for pointing it out. > > In my opinion, using "nvdimm_flush" might be one way to address this issue, > but it seems to flush the entire nd_region, which might be too heavy. > Moreover, it only applies to non-volatile memory. > The problem is that the coherence domain really ends at the root complex, and from the perspective of any one host the data is coherent. Flushing only guarantees it gets pushed out from that domain, but does not guarantee anything south of it. Flushing semantics that don't puncture through the root complex won't help > > This should be a general problem for cxl shared memory. In theory, FAMFS > should also encounter this issue. > > Gregory, John, and Dan, Any suggestion about it? > > Thanx a lot > > > > ~Gregory > >
On 24/04/28 01:47PM, Dongsheng Yang wrote: > > > 在 2024/4/27 星期六 上午 12:14, Gregory Price 写道: > > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote: > > > > > > > > > 在 2024/4/26 星期五 下午 9:48, Gregory Price 写道: > > > > > > > > > > In (5) of the cover letter, I mentioned that cbd addresses cache coherence > > > at the software level: > > > > > > (5) How do blkdev and backend interact through the channel? > > > a) For reader side, before reading the data, if the data in this channel > > > may be modified by the other party, then I need to flush the cache before > > > reading to ensure that I get the latest data. For example, the blkdev needs > > > to flush the cache before obtaining compr_head because compr_head will be > > > updated by the backend handler. > > > b) For writter side, if the written information will be read by others, > > > then after writing, I need to flush the cache to let the other party see it > > > immediately. For example, after blkdev submits cbd_se, it needs to update > > > cmd_head to let the handler have a new cbd_se. Therefore, after updating > > > cmd_head, I need to flush the cache to let the backend see it. > > > > > > > Flushing the cache is insufficient. All that cache flushing guarantees > > is that the memory has left the writer's CPU cache. There are potentially > > many write buffers between the CPU and the actual backing media that the > > CPU has no visibility of and cannot pierce through to force a full > > guaranteed flush back to the media. > > > > for example: > > > > memcpy(some_cacheline, data, 64); > > mfence(); > > > > Will not guarantee that after mfence() completes that the remote host > > will have visibility of the data. mfence() does not guarantee a full > > flush back down to the device, it only guarantees it has been pushed out > > of the CPU's cache. 
> > > > similarly: > > > > memcpy(some_cacheline, data, 64); > > mfence(); > > memcpy(some_other_cacheline, data, 64); > > mfence() > > > > Will not guarantee that some_cacheline reaches the backing media prior > > to some_other_cacheline, as there is no guarantee of write-ordering in > > CXL controllers (with the exception of writes to the same cacheline). > > > > So this statement: > > > > > I need to flush the cache to let the other party see it immediately. > > > > Is misleading. They will not see is "immediately", they will see it > > "eventually at some completely unknowable time in the future". > > This is indeed one of the issues I wanted to discuss at the RFC stage. Thank > you for pointing it out. > > In my opinion, using "nvdimm_flush" might be one way to address this issue, > but it seems to flush the entire nd_region, which might be too heavy. > Moreover, it only applies to non-volatile memory. > > This should be a general problem for cxl shared memory. In theory, FAMFS > should also encounter this issue. > > Gregory, John, and Dan, Any suggestion about it? > > Thanx a lot > > > > ~Gregory > > Hi Dongsheng, Gregory is right about the uncertainty around "clflush" operations, but let me drill in a bit further. Say you copy a payload into a "bucket" in a queue and then update an index in a metadata structure; I'm thinking of the standard producer/ consumer queuing model here, with one index mutated by the producer and the other mutated by the consumer. (I have not reviewed your queueing code, but you *must* be using this model - things like linked-lists won't work in shared memory without shared locks/atomics.) Normal logic says that you should clflush the payload before updating the index, then update and clflush the index. But we still observe in non-cache-coherent shared memory that the payload may become valid *after* the clflush of the queue index. 
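To make that model concrete, here is a rough user-space C sketch of one slot of such a producer/consumer queue, where the consumer only trusts a payload after its checksum validates. This is illustrative only - it is not the actual pcq.c code; the struct layout, helper names, and checksum are invented for this sketch, and the points where clflush would go on real non-coherent shared memory are marked as comments:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch only - not the actual famfs pcq.c code. One slot of
 * a single-producer/single-consumer queue that would live in shared memory. */

#define PAYLOAD_SZ 64

struct pcq_slot {
	uint64_t seq;                  /* published last by the producer */
	uint32_t csum;                 /* checksum over payload + seq    */
	uint8_t  payload[PAYLOAD_SZ];
};

/* Simple rolling checksum, seeded with the sequence number so a stale
 * payload paired with a new seq cannot validate by accident. */
static uint32_t pcq_csum(const uint8_t *p, size_t n, uint64_t seq)
{
	uint32_t c = (uint32_t)(seq * 2654435761u);

	for (size_t i = 0; i < n; i++)
		c = (c << 5) + c + p[i];
	return c;
}

static void pcq_produce(struct pcq_slot *s, const uint8_t *data, uint64_t seq)
{
	memcpy(s->payload, data, PAYLOAD_SZ);
	s->csum = pcq_csum(s->payload, PAYLOAD_SZ, seq);
	/* on real non-coherent shared memory: clflush payload + csum here */
	s->seq = seq;
	/* ...then clflush the seq word; even so, ordering at the media is
	 * not guaranteed, which is why the consumer verifies the checksum. */
}

/* Returns 0 and copies out the payload on success; nonzero means the
 * entry is not (yet) valid and the caller should flush and re-read. */
static int pcq_consume(struct pcq_slot *s, uint8_t *out, uint64_t expect_seq)
{
	/* on real hardware: invalidate/clflush the slot before reading */
	if (s->seq != expect_seq)
		return -1;	/* entry not visibly published yet */
	if (pcq_csum(s->payload, PAYLOAD_SZ, s->seq) != s->csum)
		return -1;	/* torn read: payload not valid yet */
	memcpy(out, s->payload, PAYLOAD_SZ);
	return 0;
}
```

On a mismatch the consumer's only recourse is to flush and re-read; the checksum does not fix the unknowable write-ordering, it just turns it into a condition software can detect and retry.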
The famfs user space has a program called pcq.c, which implements a producer/consumer queue in a pair of famfs files. The only way to currently guarantee a valid read of a payload is to use sequence numbers and checksums on payloads. We do observe mismatches with actual shared memory, and the recovery is to clflush and re-read the payload from the client side. (Aside: These file pairs theoretically might work for CBD queues.)

Another side note: it would be super-helpful if the CPU gave us an explicit invalidate rather than just clflush, which will write-back before invalidating *if* the cache line is marked as dirty, even when software knows this should not happen.

Note that CXL 3.1 provides a way to guarantee that stuff that should not be written back can't be written back: read-only mappings. This is one of the features I got into the spec; using this requires CXL 3.1 DCD, and would require two DCD allocations (i.e. two tagged-capacity dax devices - one writable by the server and one by the client).

Just to make things slightly gnarlier, the MESI cache coherency protocol allows a CPU to speculatively convert a line from exclusive to modified, meaning it's not clear as of now whether "occasional" clean write-backs can be avoided. Meaning those read-only mappings may be more important than one might think. (Clean write-backs basically make it impossible for software to manage cache coherency.)

Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and shared memory is not explicitly legal in cxl 2, so there are things a cpu could do (or not do) in a cxl 2 environment that are not illegal because they should not be observable in a no-shared-memory environment.

CBD is interesting work, though for some of the reasons above I'm somewhat skeptical of shared memory as an IPC mechanism.

Regards,
John
Dongsheng Yang wrote: > > > 在 2024/4/25 星期四 上午 2:08, Dan Williams 写道: > > Dongsheng Yang wrote: > >> > >> > >> 在 2024/4/24 星期三 下午 12:29, Dan Williams 写道: > >>> Dongsheng Yang wrote: > >>>> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com> > >>>> > >>>> Hi all, > >>>> This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at: > >>>> https://github.com/DataTravelGuide/linux > >>>> > >>> [..] > >>>> (4) dax is not supported yet: > >>>> same with famfs, dax device is not supported here, because dax device does not support > >>>> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode. > >>> > >>> I am glad that famfs is mentioned here, it demonstrates you know about > >>> it. However, unfortunately this cover letter does not offer any analysis > >>> of *why* the Linux project should consider this additional approach to > >>> the inter-host shared-memory enabling problem. > >>> > >>> To be clear I am neutral at best on some of the initiatives around CXL > >>> memory sharing vs pooling, but famfs at least jettisons block-devices > >>> and gets closer to a purpose-built memory semantic. > >>> > >>> So my primary question is why would Linux need both famfs and cbd? I am > >>> sure famfs would love feedback and help vs developing competing efforts. > >> > >> Hi, > >> Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in > >> shared memory, and related nodes can share the data inside this file > >> system; whereas cbd does not store data in shared memory, it uses shared > >> memory as a channel for data transmission, and the actual data is stored > >> in the backend block device of remote nodes. In cbd, shared memory works > >> more like network to connect different hosts. > >> > >> That is to say, in my view, FAMfs and cbd do not conflict at all; they > >> meet different scenario requirements. 
cbd simply uses shared memory to > >> transmit data, shared memory plays the role of a data transmission > >> channel, while in FAMfs, shared memory serves as a data store role. > > > > If shared memory is just a communication transport then a block-device > > abstraction does not seem a proper fit. From the above description this > > sounds similar to what CONFIG_NTB_TRANSPORT offers which is a way for > > two hosts to communicate over a shared memory channel. > > > > So, I am not really looking for an analysis of famfs vs CBD I am looking > > for CBD to clarify why Linux should consider it, and why the > > architecture is fit for purpose. > > Let me explain why we need cbd: > > In cloud storage scenarios, we often need to expose block devices of > storage nodes to compute nodes. We have options like nbd, iscsi, nvmeof, > etc., but these all communicate over the network. cbd aims to address > the same scenario but using shared memory for data transfer instead of > the network, aiming for better performance and reduced network latency. > > Furthermore, shared memory can not only transfer data but also implement > features like write-ahead logging (WAL) or read/write cache, further > improving performance, especially latency-sensitive business scenarios. > (If I understand correctly, this might not be achievable with the > previously mentioned ntb.) > > To ensure we have a common understanding, I'd like to clarify one point: > the /dev/cbdX block device is not an abstraction of shared memory; it is > a mapping of a block device (such as /dev/sda) on the remote host. > Reading/writing to /dev/cbdX is equivalent to reading/writing to > /dev/sda on the remote host. > > This is the design intention of cbd. I hope this clarifies things.

It does, thanks for the clarification. Let me go back and take another look now that I understand that this is a "remote storage target over CXL memory" solution.
Dongsheng Yang wrote: > > > 在 2024/4/27 星期六 上午 12:14, Gregory Price 写道: > > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote: > >> > >> > >> 在 2024/4/26 星期五 下午 9:48, Gregory Price 写道: > >>> > >> > >> In (5) of the cover letter, I mentioned that cbd addresses cache coherence > >> at the software level: > >> > >> (5) How do blkdev and backend interact through the channel? > >> a) For reader side, before reading the data, if the data in this channel > >> may be modified by the other party, then I need to flush the cache before > >> reading to ensure that I get the latest data. For example, the blkdev needs > >> to flush the cache before obtaining compr_head because compr_head will be > >> updated by the backend handler. > >> b) For writter side, if the written information will be read by others, > >> then after writing, I need to flush the cache to let the other party see it > >> immediately. For example, after blkdev submits cbd_se, it needs to update > >> cmd_head to let the handler have a new cbd_se. Therefore, after updating > >> cmd_head, I need to flush the cache to let the backend see it. > >> > > > > Flushing the cache is insufficient. All that cache flushing guarantees > > is that the memory has left the writer's CPU cache. There are potentially > > many write buffers between the CPU and the actual backing media that the > > CPU has no visibility of and cannot pierce through to force a full > > guaranteed flush back to the media. > > > > for example: > > > > memcpy(some_cacheline, data, 64); > > mfence(); > > > > Will not guarantee that after mfence() completes that the remote host > > will have visibility of the data. mfence() does not guarantee a full > > flush back down to the device, it only guarantees it has been pushed out > > of the CPU's cache. 
> > > > similarly: > > > > memcpy(some_cacheline, data, 64); > > mfence(); > > memcpy(some_other_cacheline, data, 64); > > mfence() > > > > Will not guarantee that some_cacheline reaches the backing media prior > > to some_other_cacheline, as there is no guarantee of write-ordering in > > CXL controllers (with the exception of writes to the same cacheline). > > > > So this statement: > > > >> I need to flush the cache to let the other party see it immediately. > > > > Is misleading. They will not see is "immediately", they will see it > > "eventually at some completely unknowable time in the future". > > This is indeed one of the issues I wanted to discuss at the RFC stage. > Thank you for pointing it out. > > In my opinion, using "nvdimm_flush" might be one way to address this > issue, but it seems to flush the entire nd_region, which might be too > heavy. Moreover, it only applies to non-volatile memory. > > This should be a general problem for cxl shared memory. In theory, FAMFS > should also encounter this issue. > > Gregory, John, and Dan, Any suggestion about it?

The CXL equivalent is GPF (Global Persistence Flush), not to be confused with "General Protection Fault" which is likely what will happen if software needs to manage cache coherency for this solution. CXL GPF was not designed to be triggered by software. It is a hardware response to a power supply indicating loss of input power.

I do not think you want to spend community resources reviewing software cache coherency considerations, and instead "just" mandate that this solution requires inter-host hardware cache coherence. I understand that is a difficult requirement to mandate, but it is likely less difficult than getting Linux to carry a software cache coherence mitigation.

In some ways this reminds me of SMR drives and the problems those posed to software where ultimately the programming difficulties needed to be solved in hardware, not exported to the Linux kernel to solve.
On Sun, 28 Apr 2024 11:55:10 -0500 John Groves <John@groves.net> wrote: > On 24/04/28 01:47PM, Dongsheng Yang wrote: > > > > > > 在 2024/4/27 星期六 上午 12:14, Gregory Price 写道: > > > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote: > > > > > > > > > > > > 在 2024/4/26 星期五 下午 9:48, Gregory Price 写道: > > > > > > > > > > > > > In (5) of the cover letter, I mentioned that cbd addresses cache coherence > > > > at the software level: > > > > > > > > (5) How do blkdev and backend interact through the channel? > > > > a) For reader side, before reading the data, if the data in this channel > > > > may be modified by the other party, then I need to flush the cache before > > > > reading to ensure that I get the latest data. For example, the blkdev needs > > > > to flush the cache before obtaining compr_head because compr_head will be > > > > updated by the backend handler. > > > > b) For writter side, if the written information will be read by others, > > > > then after writing, I need to flush the cache to let the other party see it > > > > immediately. For example, after blkdev submits cbd_se, it needs to update > > > > cmd_head to let the handler have a new cbd_se. Therefore, after updating > > > > cmd_head, I need to flush the cache to let the backend see it. > > > > > > > > > > Flushing the cache is insufficient. All that cache flushing guarantees > > > is that the memory has left the writer's CPU cache. There are potentially > > > many write buffers between the CPU and the actual backing media that the > > > CPU has no visibility of and cannot pierce through to force a full > > > guaranteed flush back to the media. > > > > > > for example: > > > > > > memcpy(some_cacheline, data, 64); > > > mfence(); > > > > > > Will not guarantee that after mfence() completes that the remote host > > > will have visibility of the data. mfence() does not guarantee a full > > > flush back down to the device, it only guarantees it has been pushed out > > > of the CPU's cache. 
> > > > > > similarly: > > > > > > memcpy(some_cacheline, data, 64); > > > mfence(); > > > memcpy(some_other_cacheline, data, 64); > > > mfence() > > > > > > Will not guarantee that some_cacheline reaches the backing media prior > > > to some_other_cacheline, as there is no guarantee of write-ordering in > > > CXL controllers (with the exception of writes to the same cacheline). > > > > > > So this statement: > > > > > > > I need to flush the cache to let the other party see it immediately. > > > > > > Is misleading. They will not see is "immediately", they will see it > > > "eventually at some completely unknowable time in the future". > > > > This is indeed one of the issues I wanted to discuss at the RFC stage. Thank > > you for pointing it out. > > > > In my opinion, using "nvdimm_flush" might be one way to address this issue, > > but it seems to flush the entire nd_region, which might be too heavy. > > Moreover, it only applies to non-volatile memory. > > > > This should be a general problem for cxl shared memory. In theory, FAMFS > > should also encounter this issue. > > > > Gregory, John, and Dan, Any suggestion about it? > > > > Thanx a lot > > > > > > ~Gregory > > > > > Hi Dongsheng, > > Gregory is right about the uncertainty around "clflush" operations, but > let me drill in a bit further. > > Say you copy a payload into a "bucket" in a queue and then update an > index in a metadata structure; I'm thinking of the standard producer/ > consumer queuing model here, with one index mutated by the producer and > the other mutated by the consumer. > > (I have not reviewed your queueing code, but you *must* be using this > model - things like linked-lists won't work in shared memory without > shared locks/atomics.) > > Normal logic says that you should clflush the payload before updating > the index, then update and clflush the index. 
> > But we still observe in non-cache-coherent shared memory that the payload > may become valid *after* the clflush of the queue index. > > The famfs user space has a program called pcq.c, which implements a > producer/consumer queue in a pair of famfs files. The only way to > currently guarantee a valid read of a payload is to use sequence numbers > and checksums on payloads. We do observe mismatches with actual shared > memory, and the recovery is to clflush and re-read the payload from the > client side. (Aside: These file pairs theoretically might work for CBD > queues.) > > Anoter side note: it would be super-helpful if the CPU gave us an explicit > invalidate rather than just clflush, which will write-back before > invalidating *if* the cache line is marked as dirty, even when software > knows this should not happen. > > Note that CXL 3.1 provides a way to guarantee that stuff that should not > be written back can't be written back: read-only mappings. This one of > the features I got into the spec; using this requires CXL 3.1 DCD, and > would require two DCD allocations (i.e. two tagged-capacity dax devices - > one writable by the server and one by the client). > > Just to make things slightly gnarlier, the MESI cache coherency protocol > allows a CPU to speculatively convert a line from exclusive to modified, > meaning it's not clear as of now whether "occasional" clean write-backs > can be avoided. Meaning those read-only mappings may be more important > than one might think. (Clean write-backs basically make it > impossible for software to manage cache coherency.)

My understanding is that clean write backs are an implementation-specific issue that came as a surprise to some CPU arch folk I spoke to; we will need some path for a host to say if they can ever do that.

Given this definitely affects one CPU vendor, maybe solutions that rely on this not happening are not suitable for upstream.
Maybe this market will be important enough for that CPU vendor to stop doing it, but if they do it will take a while...

Flushing in general is a CPU architecture problem where each of the architectures needs to be clear what they do / specify that their licensees do.

I'm with Dan on encouraging all memory vendors to do hardware coherence!

J

> > Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and > shared memory is not explicitly legal in cxl 2, so there are things a cpu > could do (or not do) in a cxl 2 environment that are not illegal because > they should not be observable in a no-shared-memory environment. > > CBD is interesting work, though for some of the reasons above I'm somewhat > skeptical of shared memory as an IPC mechanism. > > Regards, > John > > >
在 2024/5/3 星期五 下午 5:52, Jonathan Cameron 写道: > On Sun, 28 Apr 2024 11:55:10 -0500 > John Groves <John@groves.net> wrote: > >> On 24/04/28 01:47PM, Dongsheng Yang wrote: >>> >>> >>> 在 2024/4/27 星期六 上午 12:14, Gregory Price 写道: >>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote: >>>>> >>>>> >>>>> 在 2024/4/26 星期五 下午 9:48, Gregory Price 写道: >>>>>> >>>>> ... >> >> Just to make things slightly gnarlier, the MESI cache coherency protocol >> allows a CPU to speculatively convert a line from exclusive to modified, >> meaning it's not clear as of now whether "occasional" clean write-backs >> can be avoided. Meaning those read-only mappings may be more important >> than one might think. (Clean write-backs basically make it >> impossible for software to manage cache coherency.) > > My understanding is that clean write backs are an implementation specific > issue that came as a surprise to some CPU arch folk I spoke to, we will > need some path for a host to say if they can ever do that. > > Given this definitely effects one CPU vendor, maybe solutions that > rely on this not happening are not suitable for upstream. > > Maybe this market will be important enough for that CPU vendor to stop > doing it but if they do it will take a while... > > Flushing in general is as CPU architecture problem where each of the > architectures needs to be clear what they do / specify that their > licensees do. > > I'm with Dan on encouraging all memory vendors to do hardware coherence! Hi Gregory, John, Jonathan and Dan: Thanx for your information, they help a lot, and sorry for the late reply. After some internal discussions, I think we can design it as follows: (1) If the hardware implements cache coherence, then the software layer doesn't need to consider this issue, and can perform read and write operations directly. 
(2) If the hardware doesn't implement cache coherence, we can consider a DMA-like approach, where we check architectural features to determine if cache coherence is supported. This could be similar to `dev_is_dma_coherent`.

Additionally, if the architecture supports flushing and invalidating CPU caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`, `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`, `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`), then we can handle cache coherence at the software layer. (For the clean-writeback issue, I think it may also require clarification from the architecture, as well as how DMA handles the clean-writeback problem; I haven't checked this further.)

(3) If the hardware doesn't implement cache coherence and the CPU doesn't support the required CPU cache operations, then we can run in nocache mode.

CBD can initially support (3), and then transition to (1) when hardware supports cache-coherency. If there's sufficient market demand, we can also consider supporting (2).

How does this approach sound?

Thanx

> > J > >> >> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and >> shared memory is not explicitly legal in cxl 2, so there are things a cpu >> could do (or not do) in a cxl 2 environment that are not illegal because >> they should not be observable in a no-shared-memory environment. >> >> CBD is interesting work, though for some of the reasons above I'm somewhat >> skeptical of shared memory as an IPC mechanism. >> >> Regards, >> John >> >> >> > > . >
On Wed, 8 May 2024 19:39:23 +0800 Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > 在 2024/5/3 星期五 下午 5:52, Jonathan Cameron 写道: > > On Sun, 28 Apr 2024 11:55:10 -0500 > > John Groves <John@groves.net> wrote: > > > >> On 24/04/28 01:47PM, Dongsheng Yang wrote: > >>> > >>> > >>> 在 2024/4/27 星期六 上午 12:14, Gregory Price 写道: > >>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote: > >>>>> > >>>>> > >>>>> 在 2024/4/26 星期五 下午 9:48, Gregory Price 写道: > >>>>>> > >>>>> > > ... > >> > >> Just to make things slightly gnarlier, the MESI cache coherency protocol > >> allows a CPU to speculatively convert a line from exclusive to modified, > >> meaning it's not clear as of now whether "occasional" clean write-backs > >> can be avoided. Meaning those read-only mappings may be more important > >> than one might think. (Clean write-backs basically make it > >> impossible for software to manage cache coherency.) > > > > My understanding is that clean write backs are an implementation specific > > issue that came as a surprise to some CPU arch folk I spoke to, we will > > need some path for a host to say if they can ever do that. > > > > Given this definitely effects one CPU vendor, maybe solutions that > > rely on this not happening are not suitable for upstream. > > > > Maybe this market will be important enough for that CPU vendor to stop > > doing it but if they do it will take a while... > > > > Flushing in general is as CPU architecture problem where each of the > > architectures needs to be clear what they do / specify that their > > licensees do. > > > > I'm with Dan on encouraging all memory vendors to do hardware coherence! > > Hi Gregory, John, Jonathan and Dan: > Thanx for your information, they help a lot, and sorry for the late reply. 
> > After some internal discussions, I think we can design it as follows: > > (1) If the hardware implements cache coherence, then the software layer > doesn't need to consider this issue, and can perform read and write > operations directly.

Agreed - this is the easier case.

> > (2) If the hardware doesn't implement cache coherence, we can consider a > DMA-like approach, where we check architectural features to determine if > cache coherence is supported. This could be similar to > `dev_is_dma_coherent`.

Ok. So this would combine host support checks with checking if the shared memory on the device is multi host cache coherent (it will be single host cache coherent, which is what makes this messy).

> > Additionally, if the architecture supports flushing and invalidating CPU > caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`, > `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`, > `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`),

Those particular calls won't tell you much at all. They indicate that a flush can happen as far as a common point for DMA engines in the system. No information on whether there are caches beyond that point.

> > then we can handle cache coherence at the software layer. > (For the clean writeback issue, I think it may also require > clarification from the architecture, and how DMA handles the clean > writeback problem, which I haven't further checked.)

I believe the relevant architecture only does IO coherent DMA so it is never a problem (unlike with multihost cache coherence).

> > (3) If the hardware doesn't implement cache coherence and the cpu > doesn't support the required CPU cache operations, then we can run in > nocache mode.

I suspect that gets you nowhere either. Never believe an architecture that provides a flag that says not to cache something. That just means you should not be able to tell that it is cached - many, many implementations actually cache such accesses.
> > CBD can initially support (3), and then transition to (1) when hardware > supports cache-coherency. If there's sufficient market demand, we can > also consider supporting (2). I'd assume only (3) works. The others rely on assumptions I don't think you can rely on. Fun fun fun, Jonathan > > How does this approach sound? > > Thanx > > > > J > > > >> > >> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and > >> shared memory is not explicitly legal in cxl 2, so there are things a cpu > >> could do (or not do) in a cxl 2 environment that are not illegal because > >> they should not be observable in a no-shared-memory environment. > >> > >> CBD is interesting work, though for some of the reasons above I'm somewhat > >> skeptical of shared memory as an IPC mechanism. > >> > >> Regards, > >> John > >> > >> > >> > > > > . > >
在 2024/5/8 星期三 下午 8:11, Jonathan Cameron 写道: > On Wed, 8 May 2024 19:39:23 +0800 > Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > >> 在 2024/5/3 星期五 下午 5:52, Jonathan Cameron 写道: >>> On Sun, 28 Apr 2024 11:55:10 -0500 >>> John Groves <John@groves.net> wrote: >>> >>>> On 24/04/28 01:47PM, Dongsheng Yang wrote: >>>>> >>>>> >>>>> 在 2024/4/27 星期六 上午 12:14, Gregory Price 写道: >>>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote: >>>>>>> >>>>>>> >>>>>>> 在 2024/4/26 星期五 下午 9:48, Gregory Price 写道: >>>>>>>> >>>>>>> >> >> ... >>>> >>>> Just to make things slightly gnarlier, the MESI cache coherency protocol >>>> allows a CPU to speculatively convert a line from exclusive to modified, >>>> meaning it's not clear as of now whether "occasional" clean write-backs >>>> can be avoided. Meaning those read-only mappings may be more important >>>> than one might think. (Clean write-backs basically make it >>>> impossible for software to manage cache coherency.) >>> >>> My understanding is that clean write backs are an implementation specific >>> issue that came as a surprise to some CPU arch folk I spoke to, we will >>> need some path for a host to say if they can ever do that. >>> >>> Given this definitely effects one CPU vendor, maybe solutions that >>> rely on this not happening are not suitable for upstream. >>> >>> Maybe this market will be important enough for that CPU vendor to stop >>> doing it but if they do it will take a while... >>> >>> Flushing in general is as CPU architecture problem where each of the >>> architectures needs to be clear what they do / specify that their >>> licensees do. >>> >>> I'm with Dan on encouraging all memory vendors to do hardware coherence! >> >> Hi Gregory, John, Jonathan and Dan: >> Thanx for your information, they help a lot, and sorry for the late reply. 
>> >> After some internal discussions, I think we can design it as follows: >> >> (1) If the hardware implements cache coherence, then the software layer >> doesn't need to consider this issue, and can perform read and write >> operations directly. > > Agreed - this is one easier case. > >> >> (2) If the hardware doesn't implement cache coherence, we can consider a >> DMA-like approach, where we check architectural features to determine if >> cache coherence is supported. This could be similar to >> `dev_is_dma_coherent`. > > Ok. So this would combine host support checks with checking if the shared > memory on the device is multi host cache coherent (it will be single host > cache coherent which is what makes this messy) >> >> Additionally, if the architecture supports flushing and invalidating CPU >> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`, >> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`, >> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`), > > Those particular calls won't tell you much at all. They indicate that a flush > can happen as far as a common point for DMA engines in the system. No > information on whether there are caches beyond that point. > >> >> then we can handle cache coherence at the software layer. >> (For the clean writeback issue, I think it may also require >> clarification from the architecture, and how DMA handles the clean >> writeback problem, which I haven't further checked.) > > I believe the relevant architecture only does IO coherent DMA so it is > never a problem (unlike with multihost cache coherence).

Hi Jonathan, let me provide an example. In nvmeof-rdma, the `nvme_rdma_queue_rq` function places a request into `req->sqe.dma`.

(1) First, it calls `ib_dma_sync_single_for_cpu()`, which invalidates the CPU cache:

ib_dma_sync_single_for_cpu(dev, sqe->dma, sizeof(struct nvme_command), DMA_TO_DEVICE);

For example, on ARM64, this would call `arch_sync_dma_for_cpu`, followed by `dcache_inval_poc(start, start + size)`.
(2) Setting up data related to the NVMe request.

(3) Then it calls `ib_dma_sync_single_for_device` to flush the CPU cache to DMA memory:

ib_dma_sync_single_for_device(dev, sqe->dma, sizeof(struct nvme_command), DMA_TO_DEVICE);

Of course, if the hardware ensures cache coherency, the above operations are skipped. However, if the hardware does not guarantee cache coherency, RDMA appears to ensure cache coherency through this method.

In the RDMA scenario, we also face the issue of multi-host cache coherence. So I'm thinking, can we adopt a similar approach in CXL shared memory to achieve data sharing?

>> >> (3) If the hardware doesn't implement cache coherence and the cpu >> doesn't support the required CPU cache operations, then we can run in >> nocache mode. > > I suspect that gets you no where either. Never believe an architecture > that provides a flag that says not to cache something. That just means > you should not be able to tell that it is cached - many many implementations > actually cache such accesses.

Sigh, then that really makes things difficult.

> >> >> CBD can initially support (3), and then transition to (1) when hardware >> supports cache-coherency. If there's sufficient market demand, we can >> also consider supporting (2). > I'd assume only (3) works. The others rely on assumptions I don't think

I guess you mean (1), the hardware cache-coherency way, right? :)

Thanx

> you can rely on. > > Fun fun fun, > > Jonathan > >> >> How does this approach sound? >> >> Thanx >>> >>> J >>> >>>> >>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and >>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu >>>> could do (or not do) in a cxl 2 environment that are not illegal because >>>> they should not be observable in a no-shared-memory environment. >>>> >>>> CBD is interesting work, though for some of the reasons above I'm somewhat >>>> skeptical of shared memory as an IPC mechanism.
>>>> >>>> Regards, >>>> John >>>> >>>> >>>> >>> >>> . >>> > > . >
On Wed, 8 May 2024 21:03:54 +0800 Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > 在 2024/5/8 星期三 下午 8:11, Jonathan Cameron 写道: > > On Wed, 8 May 2024 19:39:23 +0800 > > Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > > > >> 在 2024/5/3 星期五 下午 5:52, Jonathan Cameron 写道: > >>> On Sun, 28 Apr 2024 11:55:10 -0500 > >>> John Groves <John@groves.net> wrote: > >>> > >>>> On 24/04/28 01:47PM, Dongsheng Yang wrote: > >>>>> > >>>>> > >>>>> 在 2024/4/27 星期六 上午 12:14, Gregory Price 写道: > >>>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote: > >>>>>>> > >>>>>>> > >>>>>>> 在 2024/4/26 星期五 下午 9:48, Gregory Price 写道: > >>>>>>>> > >>>>>>> > >> > >> ... > >>>> > >>>> Just to make things slightly gnarlier, the MESI cache coherency protocol > >>>> allows a CPU to speculatively convert a line from exclusive to modified, > >>>> meaning it's not clear as of now whether "occasional" clean write-backs > >>>> can be avoided. Meaning those read-only mappings may be more important > >>>> than one might think. (Clean write-backs basically make it > >>>> impossible for software to manage cache coherency.) > >>> > >>> My understanding is that clean write backs are an implementation specific > >>> issue that came as a surprise to some CPU arch folk I spoke to, we will > >>> need some path for a host to say if they can ever do that. > >>> > >>> Given this definitely effects one CPU vendor, maybe solutions that > >>> rely on this not happening are not suitable for upstream. > >>> > >>> Maybe this market will be important enough for that CPU vendor to stop > >>> doing it but if they do it will take a while... > >>> > >>> Flushing in general is as CPU architecture problem where each of the > >>> architectures needs to be clear what they do / specify that their > >>> licensees do. > >>> > >>> I'm with Dan on encouraging all memory vendors to do hardware coherence! 
> >>
> >> Hi Gregory, John, Jonathan and Dan:
> >> Thanx for your information, they help a lot, and sorry for the late reply.
> >>
> >> After some internal discussions, I think we can design it as follows:
> >>
> >> (1) If the hardware implements cache coherence, then the software layer
> >> doesn't need to consider this issue, and can perform read and write
> >> operations directly.
> >
> > Agreed - this is one easier case.
> >
> >>
> >> (2) If the hardware doesn't implement cache coherence, we can consider a
> >> DMA-like approach, where we check architectural features to determine if
> >> cache coherence is supported. This could be similar to
> >> `dev_is_dma_coherent`.
> >
> > Ok. So this would combine host support checks with checking if the shared
> > memory on the device is multi host cache coherent (it will be single host
> > cache coherent which is what makes this messy)
> >>
> >> Additionally, if the architecture supports flushing and invalidating CPU
> >> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`,
> >> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`,
> >> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`),
> >
> > Those particular calls won't tell you much at all. They indicate that a flush
> > can happen as far as a common point for DMA engines in the system. No
> > information on whether there are caches beyond that point.
> >
> >>
> >> then we can handle cache coherence at the software layer.
> >> (For the clean writeback issue, I think it may also require
> >> clarification from the architecture, and how DMA handles the clean
> >> writeback problem, which I haven't further checked.)
> >
> > I believe the relevant architecture only does IO coherent DMA so it is
> > never a problem (unlike with multihost cache coherence).
> Hi Jonathan,
>
> let me provide an example,
> In nvmeof-rdma, the `nvme_rdma_queue_rq` function places a request into
> `req->sqe.dma`.
>
> (1) First, it calls `ib_dma_sync_single_for_cpu()`, which invalidates
> the CPU cache:
>
>
> ib_dma_sync_single_for_cpu(dev, sqe->dma,
> sizeof(struct nvme_command), DMA_TO_DEVICE);
>
>
> For example, on ARM64, this would call `arch_sync_dma_for_cpu`, followed
> by `dcache_inval_poc(start, start + size)`.

Key here is the PoC. It's a flush to the point of coherence of the local
system. It has no idea about interhost coherency and is not necessarily
the DRAM (in CXL or otherwise).

If you are doing software coherence, those devices will plug into today's
hosts and they have no idea that such a flush means pushing out into
the CXL fabric and to the type 3 device.

>
> (2) Setting up data related to the NVMe request.
>
> (3) then Calls `ib_dma_sync_single_for_device` to flush the CPU cache to
> DMA memory:
>
> ib_dma_sync_single_for_device(dev, sqe->dma,
> sizeof(struct nvme_command),
> DMA_TO_DEVICE);
>
> Of course, if the hardware ensures cache coherency, the above operations
> are skipped. However, if the hardware does not guarantee cache
> coherency, RDMA appears to ensure cache coherency through this method.
>
> In the RDMA scenario, we also face the issue of multi-host cache
> coherence. so I'm thinking, can we adopt a similar approach in CXL
> shared memory to achieve data sharing?

You don't face the same coherence issues, or at least not in the same way.
In that case the coherence guarantees are actually to the RDMA NIC.
It is guaranteed to see the clean data from the host - that may involve
flushes to PoC. A one-time snapshot is then sent to readers on other
hosts. If writes occur they are also guaranteed to replace cached copies
on this host - because there is a well-defined guarantee of IO coherence
or explicit cache maintenance to the PoC.

>
> >>
> >> (3) If the hardware doesn't implement cache coherence and the cpu
> >> doesn't support the required CPU cache operations, then we can run in
> >> nocache mode.
> >
> > I suspect that gets you no where either. Never believe an architecture
> > that provides a flag that says not to cache something. That just means
> > you should not be able to tell that it is cached - many many implementations
> > actually cache such accesses.
>
> Sigh, then that really makes thing difficult.

Yes. I think we are going to have to wait on architecture-specific
clarifications before any software-coherent use case can be guaranteed to
work beyond the 3.1 ones for temporal sharing (only one accessing host at
a time) and read-only sharing, where writes are dropped anyway so a clean
write back is irrelevant beyond some noise in logs possibly (if they do
get logged it is considered so rare we don't care!).

>
>
> >>
> >> CBD can initially support (3), and then transition to (1) when hardware
> >> supports cache-coherency. If there's sufficient market demand, we can
> >> also consider supporting (2).
> > I'd assume only (3) works. The others rely on assumptions I don't think
>
> I guess you mean (1), the hardware cache-coherency way, right?

Indeed - oops!
Hardware coherency is the way to go, or a well-defined and clearly
documented description of how to play with the various host architectures.

Jonathan


>
> :)
> Thanx
>
> > you can rely on.
> >
> > Fun fun fun,
> >
> > Jonathan
> >
> >>
> >> How does this approach sound?
> >>
> >> Thanx
> >>>
> >>> J
> >>>
> >>>>
> >>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
> >>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu
> >>>> could do (or not do) in a cxl 2 environment that are not illegal because
> >>>> they should not be observable in a no-shared-memory environment.
> >>>>
> >>>> CBD is interesting work, though for some of the reasons above I'm somewhat
> >>>> skeptical of shared memory as an IPC mechanism.
> >>>>
> >>>> Regards,
> >>>> John
> >>>>
> >>>>
> >>>>
> >>>
> >>> .
> >>>
> >
> > .
> >
在 2024/5/8 星期三 下午 11:44, Jonathan Cameron 写道: > On Wed, 8 May 2024 21:03:54 +0800 > Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > >> 在 2024/5/8 星期三 下午 8:11, Jonathan Cameron 写道: >>> On Wed, 8 May 2024 19:39:23 +0800 >>> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: >>> >>>> 在 2024/5/3 星期五 下午 5:52, Jonathan Cameron 写道: >>>>> On Sun, 28 Apr 2024 11:55:10 -0500 >>>>> John Groves <John@groves.net> wrote: >>>>> >>>>>> On 24/04/28 01:47PM, Dongsheng Yang wrote: >>>>>>> >>>>>>> >>>>>>> 在 2024/4/27 星期六 上午 12:14, Gregory Price 写道: >>>>>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> 在 2024/4/26 星期五 下午 9:48, Gregory Price 写道: >>>>>>>>>> >>>>>>>>> >>>> >>>> ... >>>>>> >>>>>> Just to make things slightly gnarlier, the MESI cache coherency protocol >>>>>> allows a CPU to speculatively convert a line from exclusive to modified, >>>>>> meaning it's not clear as of now whether "occasional" clean write-backs >>>>>> can be avoided. Meaning those read-only mappings may be more important >>>>>> than one might think. (Clean write-backs basically make it >>>>>> impossible for software to manage cache coherency.) >>>>> >>>>> My understanding is that clean write backs are an implementation specific >>>>> issue that came as a surprise to some CPU arch folk I spoke to, we will >>>>> need some path for a host to say if they can ever do that. >>>>> >>>>> Given this definitely effects one CPU vendor, maybe solutions that >>>>> rely on this not happening are not suitable for upstream. >>>>> >>>>> Maybe this market will be important enough for that CPU vendor to stop >>>>> doing it but if they do it will take a while... >>>>> >>>>> Flushing in general is as CPU architecture problem where each of the >>>>> architectures needs to be clear what they do / specify that their >>>>> licensees do. >>>>> >>>>> I'm with Dan on encouraging all memory vendors to do hardware coherence! 
>>>>
>>>> Hi Gregory, John, Jonathan and Dan:
>>>> Thanx for your information, they help a lot, and sorry for the late reply.
>>>>
>>>> After some internal discussions, I think we can design it as follows:
>>>>
>>>> (1) If the hardware implements cache coherence, then the software layer
>>>> doesn't need to consider this issue, and can perform read and write
>>>> operations directly.
>>>
>>> Agreed - this is one easier case.
>>>
>>>>
>>>> (2) If the hardware doesn't implement cache coherence, we can consider a
>>>> DMA-like approach, where we check architectural features to determine if
>>>> cache coherence is supported. This could be similar to
>>>> `dev_is_dma_coherent`.
>>>
>>> Ok. So this would combine host support checks with checking if the shared
>>> memory on the device is multi host cache coherent (it will be single host
>>> cache coherent which is what makes this messy)
>>>>
>>>> Additionally, if the architecture supports flushing and invalidating CPU
>>>> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`,
>>>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`,
>>>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`),
>>>
>>> Those particular calls won't tell you much at all. They indicate that a flush
>>> can happen as far as a common point for DMA engines in the system. No
>>> information on whether there are caches beyond that point.
>>>
>>>>
>>>> then we can handle cache coherence at the software layer.
>>>> (For the clean writeback issue, I think it may also require
>>>> clarification from the architecture, and how DMA handles the clean
>>>> writeback problem, which I haven't further checked.)
>>>
>>> I believe the relevant architecture only does IO coherent DMA so it is
>>> never a problem (unlike with multihost cache coherence).
>> Hi Jonathan,
>>
>> let me provide an example,
>> In nvmeof-rdma, the `nvme_rdma_queue_rq` function places a request into
>> `req->sqe.dma`.
>>
>> (1) First, it calls `ib_dma_sync_single_for_cpu()`, which invalidates
>> the CPU cache:
>>
>>
>> ib_dma_sync_single_for_cpu(dev, sqe->dma,
>> sizeof(struct nvme_command), DMA_TO_DEVICE);
>>
>>
>> For example, on ARM64, this would call `arch_sync_dma_for_cpu`, followed
>> by `dcache_inval_poc(start, start + size)`.
>
> Key here is the POC. It's a flush to the point of coherence of the local
> system. It has no idea about interhost coherency and is not necessarily
> the DRAM (in CXL or otherwise).
>
> If you are doing software coherence, those devices will plug into today's
> hosts and they have no idea that such a flush means pushing out into
> the CXL fabric and to the type 3 device.
>
>>
>> (2) Setting up data related to the NVMe request.
>>
>> (3) then Calls `ib_dma_sync_single_for_device` to flush the CPU cache to
>> DMA memory:
>>
>> ib_dma_sync_single_for_device(dev, sqe->dma,
>> sizeof(struct nvme_command),
>> DMA_TO_DEVICE);
>>
>> Of course, if the hardware ensures cache coherency, the above operations
>> are skipped. However, if the hardware does not guarantee cache
>> coherency, RDMA appears to ensure cache coherency through this method.
>>
>> In the RDMA scenario, we also face the issue of multi-host cache
>> coherence. so I'm thinking, can we adopt a similar approach in CXL
>> shared memory to achieve data sharing?
>
> You don't face the same coherence issues, or at least not in the same way.
> In that case the coherence guarantees are actually to the RDMA NIC.
> It is guaranteed to see the clean data by the host - that may involve
> flushes to PoC. A one time snapshot is then sent to readers on other
> hosts. If writes occur they are also guarantee to replace cached copies
> on this host - because there is well define guarantee of IO coherence
> or explicit cache maintenance to the PoC

Right, the PoC is not the point of coherence with other hosts. That
sounds correct. Thanx.
> > >> >>>> >>>> (3) If the hardware doesn't implement cache coherence and the cpu >>>> doesn't support the required CPU cache operations, then we can run in >>>> nocache mode. >>> >>> I suspect that gets you no where either. Never believe an architecture >>> that provides a flag that says not to cache something. That just means >>> you should not be able to tell that it is cached - many many implementations >>> actually cache such accesses. >> >> Sigh, then that really makes thing difficult. > > Yes. I think we are going to have to wait on architecture specific clarifications > before any software coherent use case can be guaranteed to work beyond the 3.1 ones > for temporal sharing (only one accessing host at a time) and read only sharing where > writes are dropped anyway so clean write back is irrelevant beyond some noise in > logs possibly (if they do get logged it is considered so rare we don't care!). Hi Jonathan, Allow me to discuss further. As described in CXL 3.1: ``` Software-managed coherency schemes are complicated by any host or device whose caching agents generate clean writebacks. A “No Clean Writebacks” capability bit is available for a host in the CXL System Description Structure (CSDS; see Section 9.18.1.6) or for a device in the DVSEC CXL Capability2 register (see Section 8.1.3.7). ``` If we check and find that the "No clean writeback" bit in both CSDS and DVSEC is set, can we then assume that software cache-coherency is feasible, as outlined below: (1) Both the writer and reader ensure cache flushes. Since there are no clean writebacks, there will be no background data writes. (2) The writer writes data to shared memory and then executes a cache flush. If we trust the "No clean writeback" bit, we can assume that the data in shared memory is coherent. (3) Before reading the data, the reader performs cache invalidation. Since there are no clean writebacks, this invalidation operation will not destroy the data written by the writer. 
Therefore, the data read by the reader should be the data written by the writer, and since the writer's cache is clean, it will not write data to shared memory during the reader's reading process. Additionally, data integrity can be ensured. The first step for CBD should depend on hardware cache coherence, which is clearer and more feasible. Here, I am just exploring the possibility of software cache coherence, not insisting on implementing software cache-coherency right away. :) Thanx > >>> >>>> >>>> CBD can initially support (3), and then transition to (1) when hardware >>>> supports cache-coherency. If there's sufficient market demand, we can >>>> also consider supporting (2). >>> I'd assume only (3) works. The others rely on assumptions I don't think >> >> I guess you mean (1), the hardware cache-coherency way, right? > > Indeed - oops! > Hardware coherency is the way to go, or a well defined and clearly document > description of how to play with the various host architectures. > > Jonathan > > >> >> :) >> Thanx >> >>> you can rely on. >>> >>> Fun fun fun, >>> >>> Jonathan >>> >>>> >>>> How does this approach sound? >>>> >>>> Thanx >>>>> >>>>> J >>>>> >>>>>> >>>>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and >>>>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu >>>>>> could do (or not do) in a cxl 2 environment that are not illegal because >>>>>> they should not be observable in a no-shared-memory environment. >>>>>> >>>>>> CBD is interesting work, though for some of the reasons above I'm somewhat >>>>>> skeptical of shared memory as an IPC mechanism. >>>>>> >>>>>> Regards, >>>>>> John >>>>>> >>>>>> >>>>>> >>>>> >>>>> . >>>>> >>> >>> . >>> > >
On Thu, 9 May 2024 19:24:28 +0800 Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > 在 2024/5/8 星期三 下午 11:44, Jonathan Cameron 写道: > > On Wed, 8 May 2024 21:03:54 +0800 > > Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > > > >> 在 2024/5/8 星期三 下午 8:11, Jonathan Cameron 写道: > >>> On Wed, 8 May 2024 19:39:23 +0800 > >>> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > >>> > >>>> 在 2024/5/3 星期五 下午 5:52, Jonathan Cameron 写道: > >>>>> On Sun, 28 Apr 2024 11:55:10 -0500 > >>>>> John Groves <John@groves.net> wrote: > >>>>> > >>>>>> On 24/04/28 01:47PM, Dongsheng Yang wrote: > >>>>>>> > >>>>>>> > >>>>>>> 在 2024/4/27 星期六 上午 12:14, Gregory Price 写道: > >>>>>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote: > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> 在 2024/4/26 星期五 下午 9:48, Gregory Price 写道: > >>>>>>>>>> > >>>>>>>>> > >>>> > >>>> ... > >>>>>> > >>>>>> Just to make things slightly gnarlier, the MESI cache coherency protocol > >>>>>> allows a CPU to speculatively convert a line from exclusive to modified, > >>>>>> meaning it's not clear as of now whether "occasional" clean write-backs > >>>>>> can be avoided. Meaning those read-only mappings may be more important > >>>>>> than one might think. (Clean write-backs basically make it > >>>>>> impossible for software to manage cache coherency.) > >>>>> > >>>>> My understanding is that clean write backs are an implementation specific > >>>>> issue that came as a surprise to some CPU arch folk I spoke to, we will > >>>>> need some path for a host to say if they can ever do that. > >>>>> > >>>>> Given this definitely effects one CPU vendor, maybe solutions that > >>>>> rely on this not happening are not suitable for upstream. > >>>>> > >>>>> Maybe this market will be important enough for that CPU vendor to stop > >>>>> doing it but if they do it will take a while... 
> >>>>> > >>>>> Flushing in general is as CPU architecture problem where each of the > >>>>> architectures needs to be clear what they do / specify that their > >>>>> licensees do. > >>>>> > >>>>> I'm with Dan on encouraging all memory vendors to do hardware coherence! > >>>> > >>>> Hi Gregory, John, Jonathan and Dan: > >>>> Thanx for your information, they help a lot, and sorry for the late reply. > >>>> > >>>> After some internal discussions, I think we can design it as follows: > >>>> > >>>> (1) If the hardware implements cache coherence, then the software layer > >>>> doesn't need to consider this issue, and can perform read and write > >>>> operations directly. > >>> > >>> Agreed - this is one easier case. > >>> > >>>> > >>>> (2) If the hardware doesn't implement cache coherence, we can consider a > >>>> DMA-like approach, where we check architectural features to determine if > >>>> cache coherence is supported. This could be similar to > >>>> `dev_is_dma_coherent`. > >>> > >>> Ok. So this would combine host support checks with checking if the shared > >>> memory on the device is multi host cache coherent (it will be single host > >>> cache coherent which is what makes this messy) > >>>> > >>>> Additionally, if the architecture supports flushing and invalidating CPU > >>>> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`, > >>>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`, > >>>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`), > >>> > >>> Those particular calls won't tell you much at all. They indicate that a flush > >>> can happen as far as a common point for DMA engines in the system. No > >>> information on whether there are caches beyond that point. > >>> > >>>> > >>>> then we can handle cache coherence at the software layer. > >>>> (For the clean writeback issue, I think it may also require > >>>> clarification from the architecture, and how DMA handles the clean > >>>> writeback problem, which I haven't further checked.) 
> >>> > >>> I believe the relevant architecture only does IO coherent DMA so it is > >>> never a problem (unlike with multihost cache coherence).Hi Jonathan, > >> > >> let me provide an example, > >> In nvmeof-rdma, the `nvme_rdma_queue_rq` function places a request into > >> `req->sqe.dma`. > >> > >> (1) First, it calls `ib_dma_sync_single_for_cpu()`, which invalidates > >> the CPU cache: > >> > >> > >> ib_dma_sync_single_for_cpu(dev, sqe->dma, > >> sizeof(struct nvme_command), DMA_TO_DEVICE); > >> > >> > >> For example, on ARM64, this would call `arch_sync_dma_for_cpu`, followed > >> by `dcache_inval_poc(start, start + size)`. > > > > Key here is the POC. It's a flush to the point of coherence of the local > > system. It has no idea about interhost coherency and is not necessarily > > the DRAM (in CXL or otherwise). > > > > If you are doing software coherence, those devices will plug into today's > > hosts and they have no idea that such a flush means pushing out into > > the CXL fabric and to the type 3 device. > > > >> > >> (2) Setting up data related to the NVMe request. > >> > >> (3) then Calls `ib_dma_sync_single_for_device` to flush the CPU cache to > >> DMA memory: > >> > >> ib_dma_sync_single_for_device(dev, sqe->dma, > >> sizeof(struct nvme_command), > >> DMA_TO_DEVICE); > >> > >> Of course, if the hardware ensures cache coherency, the above operations > >> are skipped. However, if the hardware does not guarantee cache > >> coherency, RDMA appears to ensure cache coherency through this method. > >> > >> In the RDMA scenario, we also face the issue of multi-host cache > >> coherence. so I'm thinking, can we adopt a similar approach in CXL > >> shared memory to achieve data sharing? > > > > You don't face the same coherence issues, or at least not in the same way. > > In that case the coherence guarantees are actually to the RDMA NIC. > > It is guaranteed to see the clean data by the host - that may involve > > flushes to PoC. 
A one time snapshot is then sent to readers on other > > hosts. If writes occur they are also guarantee to replace cached copies > > on this host - because there is well define guarantee of IO coherence > > or explicit cache maintenance to the PoC > right, the PoC is not point of cohenrence with other host. it sounds > correct. thanx. > > > > > >> > >>>> > >>>> (3) If the hardware doesn't implement cache coherence and the cpu > >>>> doesn't support the required CPU cache operations, then we can run in > >>>> nocache mode. > >>> > >>> I suspect that gets you no where either. Never believe an architecture > >>> that provides a flag that says not to cache something. That just means > >>> you should not be able to tell that it is cached - many many implementations > >>> actually cache such accesses. > >> > >> Sigh, then that really makes thing difficult. > > > > Yes. I think we are going to have to wait on architecture specific clarifications > > before any software coherent use case can be guaranteed to work beyond the 3.1 ones > > for temporal sharing (only one accessing host at a time) and read only sharing where > > writes are dropped anyway so clean write back is irrelevant beyond some noise in > > logs possibly (if they do get logged it is considered so rare we don't care!). > > Hi Jonathan, > Allow me to discuss further. As described in CXL 3.1: > ``` > Software-managed coherency schemes are complicated by any host or device > whose caching agents generate clean writebacks. A “No Clean Writebacks” > capability bit is available for a host in the CXL System Description > Structure (CSDS; see Section 9.18.1.6) or for a device in the DVSEC CXL > Capability2 register (see Section 8.1.3.7). > ``` > > If we check and find that the "No clean writeback" bit in both CSDS and > DVSEC is set, can we then assume that software cache-coherency is > feasible, as outlined below: > > (1) Both the writer and reader ensure cache flushes. 
Since there are no
> clean writebacks, there will be no background data writes.
>
> (2) The writer writes data to shared memory and then executes a cache
> flush. If we trust the "No clean writeback" bit, we can assume that the
> data in shared memory is coherent.
>
> (3) Before reading the data, the reader performs cache invalidation.
> Since there are no clean writebacks, this invalidation operation will
> not destroy the data written by the writer. Therefore, the data read by
> the reader should be the data written by the writer, and since the
> writer's cache is clean, it will not write data to shared memory during
> the reader's reading process. Additionally, data integrity can be ensured.
>
> The first step for CBD should depend on hardware cache coherence, which
> is clearer and more feasible. Here, I am just exploring the possibility
> of software cache coherence, not insisting on implementing software
> cache-coherency right away. :)

Yes, if a platform sets that bit, you 'should' be fine. What exact flush
is needed is architecture specific, however, and the DMA-related ones
may not be sufficient. I'd keep an eye open for arch doc updates from the
various vendors.

Also, the architecture that motivated that bit existing is a 'moderately
large' chip vendor so I'd go so far as to say adoption will be limited
unless they resolve that in a future implementation :)

Jonathan

>
> Thanx
>
>
> >>>
> >>>>
> >>>> CBD can initially support (3), and then transition to (1) when hardware
> >>>> supports cache-coherency. If there's sufficient market demand, we can
> >>>> also consider supporting (2).
> >>> I'd assume only (3) works. The others rely on assumptions I don't think
> >>
> >> I guess you mean (1), the hardware cache-coherency way, right?
> >
> > Indeed - oops!
> > Hardware coherency is the way to go, or a well defined and clearly document
> > description of how to play with the various host architectures.
> > > > Jonathan > > > > > >> > >> :) > >> Thanx > >> > >>> you can rely on. > >>> > >>> Fun fun fun, > >>> > >>> Jonathan > >>> > >>>> > >>>> How does this approach sound? > >>>> > >>>> Thanx > >>>>> > >>>>> J > >>>>> > >>>>>> > >>>>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and > >>>>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu > >>>>>> could do (or not do) in a cxl 2 environment that are not illegal because > >>>>>> they should not be observable in a no-shared-memory environment. > >>>>>> > >>>>>> CBD is interesting work, though for some of the reasons above I'm somewhat > >>>>>> skeptical of shared memory as an IPC mechanism. > >>>>>> > >>>>>> Regards, > >>>>>> John > >>>>>> > >>>>>> > >>>>>> > >>>>> > >>>>> . > >>>>> > >>> > >>> . > >>> > > > >
在 2024/5/9 星期四 下午 8:21, Jonathan Cameron 写道: > On Thu, 9 May 2024 19:24:28 +0800 > Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > ... >>> Yes. I think we are going to have to wait on architecture specific clarifications >>> before any software coherent use case can be guaranteed to work beyond the 3.1 ones >>> for temporal sharing (only one accessing host at a time) and read only sharing where >>> writes are dropped anyway so clean write back is irrelevant beyond some noise in >>> logs possibly (if they do get logged it is considered so rare we don't care!). >> >> Hi Jonathan, >> Allow me to discuss further. As described in CXL 3.1: >> ``` >> Software-managed coherency schemes are complicated by any host or device >> whose caching agents generate clean writebacks. A “No Clean Writebacks” >> capability bit is available for a host in the CXL System Description >> Structure (CSDS; see Section 9.18.1.6) or for a device in the DVSEC CXL >> Capability2 register (see Section 8.1.3.7). >> ``` >> >> If we check and find that the "No clean writeback" bit in both CSDS and >> DVSEC is set, can we then assume that software cache-coherency is >> feasible, as outlined below: >> >> (1) Both the writer and reader ensure cache flushes. Since there are no >> clean writebacks, there will be no background data writes. >> >> (2) The writer writes data to shared memory and then executes a cache >> flush. If we trust the "No clean writeback" bit, we can assume that the >> data in shared memory is coherent. >> >> (3) Before reading the data, the reader performs cache invalidation. >> Since there are no clean writebacks, this invalidation operation will >> not destroy the data written by the writer. Therefore, the data read by >> the reader should be the data written by the writer, and since the >> writer's cache is clean, it will not write data to shared memory during >> the reader's reading process. Additionally, data integrity can be ensured. 
>> >> The first step for CBD should depend on hardware cache coherence, which >> is clearer and more feasible. Here, I am just exploring the possibility >> of software cache coherence, not insisting on implementing software >> cache-coherency right away. :) > > Yes, if a platform sets that bit, you 'should' be fine. What exact flush > is needed is architecture specific however and the DMA related ones > may not be sufficient. I'd keep an eye open for arch doc update from the > various vendors. > > Also, the architecture that motivated that bit existing is a 'moderately > large' chip vendor so I'd go so far as to say adoption will be limited > unless they resolve that in a future implementation :) Great, I think we've had a good discussion and reached a consensus on this issue. The remaining aspect will depend on hardware updates. Thank you for the information, that helps a lot. Thanx > > Jonathan > >> >> Thanx >>> >>>>> >>>>>> >>>>>> CBD can initially support (3), and then transition to (1) when hardware >>>>>> supports cache-coherency. If there's sufficient market demand, we can >>>>>> also consider supporting (2). >>>>> I'd assume only (3) works. The others rely on assumptions I don't think >>>> >>>> I guess you mean (1), the hardware cache-coherency way, right? >>> >>> Indeed - oops! >>> Hardware coherency is the way to go, or a well defined and clearly document >>> description of how to play with the various host architectures. >>> >>> Jonathan >>> >>> >>>> >>>> :) >>>> Thanx >>>> >>>>> you can rely on. >>>>> >>>>> Fun fun fun, >>>>> >>>>> Jonathan >>>>> >>>>>> >>>>>> How does this approach sound? 
>>>>>> >>>>>> Thanx >>>>>>> >>>>>>> J >>>>>>> >>>>>>>> >>>>>>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and >>>>>>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu >>>>>>>> could do (or not do) in a cxl 2 environment that are not illegal because >>>>>>>> they should not be observable in a no-shared-memory environment. >>>>>>>> >>>>>>>> CBD is interesting work, though for some of the reasons above I'm somewhat >>>>>>>> skeptical of shared memory as an IPC mechanism. >>>>>>>> >>>>>>>> Regards, >>>>>>>> John >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> . >>>>>>> >>>>> >>>>> . >>>>> >>> >>> >
Dongsheng Yang wrote: > 在 2024/5/9 星期四 下午 8:21, Jonathan Cameron 写道: [..] > >> If we check and find that the "No clean writeback" bit in both CSDS and > >> DVSEC is set, can we then assume that software cache-coherency is > >> feasible, as outlined below: > >> > >> (1) Both the writer and reader ensure cache flushes. Since there are no > >> clean writebacks, there will be no background data writes. > >> > >> (2) The writer writes data to shared memory and then executes a cache > >> flush. If we trust the "No clean writeback" bit, we can assume that the > >> data in shared memory is coherent. > >> > >> (3) Before reading the data, the reader performs cache invalidation. > >> Since there are no clean writebacks, this invalidation operation will > >> not destroy the data written by the writer. Therefore, the data read by > >> the reader should be the data written by the writer, and since the > >> writer's cache is clean, it will not write data to shared memory during > >> the reader's reading process. Additionally, data integrity can be ensured. What guarantees this property? How does the reader know that its local cache invalidation is sufficient for reading data that has only reached global visibility on the remote peer? As far as I can see, there is nothing that guarantees that local global visibility translates to remote visibility. In fact, the GPF feature is counter-evidence of the fact that writes can be pending in buffers that are only flushed on a GPF event. I remain skeptical that a software managed inter-host cache-coherency scheme can be made reliable with current CXL defined mechanisms.
在 2024/5/22 星期三 上午 2:41, Dan Williams 写道:
> Dongsheng Yang wrote:
>> 在 2024/5/9 星期四 下午 8:21, Jonathan Cameron 写道:
> [..]
>>>> If we check and find that the "No clean writeback" bit in both CSDS and
>>>> DVSEC is set, can we then assume that software cache-coherency is
>>>> feasible, as outlined below:
>>>>
>>>> (1) Both the writer and reader ensure cache flushes. Since there are no
>>>> clean writebacks, there will be no background data writes.
>>>>
>>>> (2) The writer writes data to shared memory and then executes a cache
>>>> flush. If we trust the "No clean writeback" bit, we can assume that the
>>>> data in shared memory is coherent.
>>>>
>>>> (3) Before reading the data, the reader performs cache invalidation.
>>>> Since there are no clean writebacks, this invalidation operation will
>>>> not destroy the data written by the writer. Therefore, the data read by
>>>> the reader should be the data written by the writer, and since the
>>>> writer's cache is clean, it will not write data to shared memory during
>>>> the reader's reading process. Additionally, data integrity can be ensured.
>
> What guarantees this property? How does the reader know that its local
> cache invalidation is sufficient for reading data that has only reached
> global visibility on the remote peer? As far as I can see, there is
> nothing that guarantees that local global visibility translates to
> remote visibility. In fact, the GPF feature is counter-evidence of the
> fact that writes can be pending in buffers that are only flushed on a
> GPF event.

Sounds correct. From what I learned from GPF, ADR, and eADR, there would
still be data in WPQ even though we perform a CPU cache line flush in
the OS.

This means we don't have an explicit method to make data puncture all
caches and land in the media after writing. Also, it seems there isn't
an explicit method to invalidate all caches along the entire path.
> > I remain skeptical that a software managed inter-host cache-coherency > scheme can be made reliable with current CXL defined mechanisms. I got your point now: according to the current CXL spec, it seems software-managed cache-coherency for inter-host shared memory is not workable. Will the next version of the CXL spec consider it? >
On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote: > > > 在 2024/5/22 星期三 上午 2:41, Dan Williams 写道: > > Dongsheng Yang wrote: > > > > What guarantees this property? How does the reader know that its local > > cache invalidation is sufficient for reading data that has only reached > > global visibility on the remote peer? As far as I can see, there is > > nothing that guarantees that local global visibility translates to > > remote visibility. In fact, the GPF feature is counter-evidence of the > > fact that writes can be pending in buffers that are only flushed on a > > GPF event. > > Sounds correct. From what I learned from GPF, ADR, and eADR, there would > still be data in WPQ even though we perform a CPU cache line flush in the > OS. > > This means we don't have a explicit method to make data puncture all caches > and land in the media after writing. also it seems there isn't a explicit > method to invalidate all caches along the entire path. > > > > > I remain skeptical that a software managed inter-host cache-coherency > > scheme can be made reliable with current CXL defined mechanisms. > > > I got your point now, acorrding current CXL Spec, it seems software managed > cache-coherency for inter-host shared memory is not working. Will the next > version of CXL spec consider it? > > Sorry for missing the conversation, have been out of office for a bit. It's not just a CXL spec issue, though that is part of it. I think the CXL spec would have to expose some form of puncturing flush, and this makes the assumption that such a flush doesn't cause some kind of race/deadlock issue. Certainly this needs to be discussed. However, consider that the upstream processor actually has to generate this flush. This means adding the flush to existing coherence protocols, or at the very least a new instruction to generate the flush explicitly. The latter seems more likely than the former. 
This flush would need to ensure the data is forced out of the local WPQ AND all WPQs south of the PCIE complex - because what you really want to know is that the data has actually made it back to a place where remote viewers are capable of perceiving the change. So this means: 1) Spec revision with puncturing flush 2) Buy-in from CPU vendors to generate such a flush 3) A new instruction added to the architecture. Call me in a decade or so. But really, I think it likely we see hardware-coherence well before this. For this reason, I have become skeptical of all but a few memory sharing use cases that depend on software-controlled cache-coherency. There are some (FAMFS, for example). The coherence state of these systems tends to be less volatile (e.g. mappings are read-only), or they have inherent design limitations (cacheline-sized message passing via write-ahead logging only). ~Gregory
On Wed, May 29, 2024 at 11:25 PM, Gregory Price wrote: > On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote: >> >> >> On Wed, May 22, 2024 at 2:41 AM, Dan Williams wrote: >>> Dongsheng Yang wrote: >>> >>> What guarantees this property? How does the reader know that its local >>> cache invalidation is sufficient for reading data that has only reached >>> global visibility on the remote peer? As far as I can see, there is >>> nothing that guarantees that local global visibility translates to >>> remote visibility. In fact, the GPF feature is counter-evidence of the >>> fact that writes can be pending in buffers that are only flushed on a >>> GPF event. >> >> Sounds correct. From what I learned from GPF, ADR, and eADR, there would >> still be data in WPQ even though we perform a CPU cache line flush in the >> OS. >> >> This means we don't have a explicit method to make data puncture all caches >> and land in the media after writing. also it seems there isn't a explicit >> method to invalidate all caches along the entire path. >> >>> >>> I remain skeptical that a software managed inter-host cache-coherency >>> scheme can be made reliable with current CXL defined mechanisms. >> >> >> I got your point now, acorrding current CXL Spec, it seems software managed >> cache-coherency for inter-host shared memory is not working. Will the next >> version of CXL spec consider it? >>> > > Sorry for missing the conversation, have been out of office for a bit. > > It's not just a CXL spec issue, though that is part of it. I think the > CXL spec would have to expose some form of puncturing flush, and this > makes the assumption that such a flush doesn't cause some kind of > race/deadlock issue. Certainly this needs to be discussed. > > However, consider that the upstream processor actually has to generate > this flush. This means adding the flush to existing coherence protocols, > or at the very least a new instruction to generate the flush explicitly.
> The latter seems more likely than the former. > > This flush would need to ensure the data is forced out of the local WPQ > AND all WPQs south of the PCIE complex - because what you really want to > know is that the data has actually made it back to a place where remote > viewers are capable of percieving the change. > > So this means: > 1) Spec revision with puncturing flush > 2) Buy-in from CPU vendors to generate such a flush > 3) A new instruction added to the architecture. > > Call me in a decade or so. > > > But really, I think it likely we see hardware-coherence well before this. > For this reason, I have become skeptical of all but a few memory sharing > use cases that depend on software-controlled cache-coherency. Hi Gregory, From my understanding, we actually have the same idea here. What I am saying is that we need the spec to consider this issue, meaning we need to describe how the entire software-coherency mechanism operates, which includes the necessary hardware support. Additionally, I agree that if software-coherency also requires hardware support, it seems that hardware-coherency is the better path. > > There are some (FAMFS, for example). The coherence state of these > systems tend to be less volatile (e.g. mappings are read-only), or > they have inherent design limitations (cacheline-sized message passing > via write-ahead logging only). Can you explain more about this? I understand that if the reader in the writer-reader model is using a readonly mapping, the interaction will be much simpler. However, after the writer writes data, if we don't have a mechanism to flush and invalidate puncturing all caches, how can the readonly reader access the new data? > > ~Gregory >
On Thu, 30 May 2024 14:59:38 +0800 Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > 在 2024/5/29 星期三 下午 11:25, Gregory Price 写道: > > On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote: > >> > >> > >> 在 2024/5/22 星期三 上午 2:41, Dan Williams 写道: > >>> Dongsheng Yang wrote: > >>> > >>> What guarantees this property? How does the reader know that its local > >>> cache invalidation is sufficient for reading data that has only reached > >>> global visibility on the remote peer? As far as I can see, there is > >>> nothing that guarantees that local global visibility translates to > >>> remote visibility. In fact, the GPF feature is counter-evidence of the > >>> fact that writes can be pending in buffers that are only flushed on a > >>> GPF event. > >> > >> Sounds correct. From what I learned from GPF, ADR, and eADR, there would > >> still be data in WPQ even though we perform a CPU cache line flush in the > >> OS. > >> > >> This means we don't have a explicit method to make data puncture all caches > >> and land in the media after writing. also it seems there isn't a explicit > >> method to invalidate all caches along the entire path. > >> > >>> > >>> I remain skeptical that a software managed inter-host cache-coherency > >>> scheme can be made reliable with current CXL defined mechanisms. > >> > >> > >> I got your point now, acorrding current CXL Spec, it seems software managed > >> cache-coherency for inter-host shared memory is not working. Will the next > >> version of CXL spec consider it? > >>> > > > > Sorry for missing the conversation, have been out of office for a bit. > > > > It's not just a CXL spec issue, though that is part of it. I think the > > CXL spec would have to expose some form of puncturing flush, and this > > makes the assumption that such a flush doesn't cause some kind of > > race/deadlock issue. Certainly this needs to be discussed. > > > > However, consider that the upstream processor actually has to generate > > this flush. 
This means adding the flush to existing coherence protocols, > > or at the very least a new instruction to generate the flush explicitly. > > The latter seems more likely than the former. > > > > This flush would need to ensure the data is forced out of the local WPQ > > AND all WPQs south of the PCIE complex - because what you really want to > > know is that the data has actually made it back to a place where remote > > viewers are capable of percieving the change. > > > > So this means: > > 1) Spec revision with puncturing flush > > 2) Buy-in from CPU vendors to generate such a flush > > 3) A new instruction added to the architecture. > > > > Call me in a decade or so. > > > > > > But really, I think it likely we see hardware-coherence well before this. > > For this reason, I have become skeptical of all but a few memory sharing > > use cases that depend on software-controlled cache-coherency. > > Hi Gregory, > > From my understanding, we actually has the same idea here. What I am > saying is that we need SPEC to consider this issue, meaning we need to > describe how the entire software-coherency mechanism operates, which > includes the necessary hardware support. Additionally, I agree that if > software-coherency also requires hardware support, it seems that > hardware-coherency is the better path. > > > > There are some (FAMFS, for example). The coherence state of these > > systems tend to be less volatile (e.g. mappings are read-only), or > > they have inherent design limitations (cacheline-sized message passing > > via write-ahead logging only). > > Can you explain more about this? I understand that if the reader in the > writer-reader model is using a readonly mapping, the interaction will be > much simpler. However, after the writer writes data, if we don't have a > mechanism to flush and invalidate puncturing all caches, how can the > readonly reader access the new data? 
There is a mechanism for doing coarse grained flushing that is known to work on some architectures. Look at cpu_cache_invalidate_memregion(). On intel/x86 it's wbinvd_on_all_cpus(); on arm64 it's a PSCI firmware call, CLEAN_INV_MEMREGION (there is a public alpha specification for PSCI 1.3 with that defined, but we don't yet have kernel code.) These are very big hammers and so unsuited for anything fine grained. In the extreme end of possible implementations they briefly stop all CPUs and clean and invalidate all caches of all types. So not suited to anything fine grained, but may be acceptable for a rare setup event, particularly if the main job of the writing host is to fill that memory for lots of other hosts to use. At least the ARM one takes a range so allows for a less painful implementation. I'm assuming we'll see new architecture over time but this is a different (and potentially easier) problem space to what you need. Jonathan > > ~Gregory > >
On Thu, May 30, 2024 at 02:59:38PM +0800, Dongsheng Yang wrote: > > > 在 2024/5/29 星期三 下午 11:25, Gregory Price 写道: > > > > There are some (FAMFS, for example). The coherence state of these > > systems tend to be less volatile (e.g. mappings are read-only), or > > they have inherent design limitations (cacheline-sized message passing > > via write-ahead logging only). > > Can you explain more about this? I understand that if the reader in the > writer-reader model is using a readonly mapping, the interaction will be > much simpler. However, after the writer writes data, if we don't have a > mechanism to flush and invalidate puncturing all caches, how can the > readonly reader access the new data? This is exactly right, so the coherence/correctness of the data needs to be enforced in some other way. Generally speaking, the WPQs will *eventually* get flushed. As such, the memory will *eventually* become coherent. So if you set up the following pattern, you will end up with an "eventually coherent" system 1) Writer instantiates the memory to be used 2) Writer calculates and records a checksum of that data into memory 3) Writer invalidates everything 4) Reader maps the memory 5) Reader reads the checksum and calculates the checksum of the data a) if the checksums match, the data is coherent b) if they don't, we must wait longer for the queues to flush This is just one example of a system design which enforces coherence by placing the limitation on the system that the data will never change once it becomes coherent. Whatever the case, regardless of the scheme you come up with, you will end up with a system where the data must be inspected and validated before it can be used. This has the limiting factor of performance: throughput will be limited by how fast you can validate the data. ~Gregory
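[Editorial note: the numbered "eventually coherent" pattern above can be sketched in a few lines of Python. This is a hedged illustration only: the function names, the 64-byte region, and the polling loop are invented for this sketch and are not taken from cbd or famfs; an in-process bytearray stands in for a mapped shared-memory region, so no real incoherence occurs here.]

```python
# Sketch of the "eventually coherent" pattern: the writer stores data plus
# a checksum; the reader only trusts the data once the recomputed checksum
# matches, retrying while the write-pending queues flush.
import struct
import time
import zlib

def writer_publish(region: bytearray, payload: bytes):
    # Steps 1-2: instantiate the data and record its checksum alongside it.
    crc = zlib.crc32(payload)
    region[0:4] = struct.pack("<I", crc)
    region[4:4 + len(payload)] = payload
    # Step 3: a real writer would flush/invalidate here; the WPQs then
    # drain "eventually", outside software control.

def reader_poll(region: bytearray, length: int, retries=10, delay=0.01):
    # Steps 4-5: recompute the checksum; a mismatch means the queues have
    # not flushed yet, so wait and retry rather than consuming torn data.
    for _ in range(retries):
        (crc,) = struct.unpack("<I", region[0:4])
        payload = bytes(region[4:4 + length])
        if zlib.crc32(payload) == crc:
            return payload      # coherent: safe to use
        time.sleep(delay)
    return None                 # still incoherent after all retries

shared = bytearray(64)          # stand-in for a mapped shared-memory region
writer_publish(shared, b"hello remote host")
print(reader_poll(shared, len(b"hello remote host")))
```

As Gregory notes below, the limiting factor of any such scheme is validation throughput: every consumer pays the checksum cost before it can use the data.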
Jonathan Cameron wrote: > On Thu, 30 May 2024 14:59:38 +0800 > Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > > > 在 2024/5/29 星期三 下午 11:25, Gregory Price 写道: > > > On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote: > > >> > > >> > > >> 在 2024/5/22 星期三 上午 2:41, Dan Williams 写道: > > >>> Dongsheng Yang wrote: > > >>> > > >>> What guarantees this property? How does the reader know that its local > > >>> cache invalidation is sufficient for reading data that has only reached > > >>> global visibility on the remote peer? As far as I can see, there is > > >>> nothing that guarantees that local global visibility translates to > > >>> remote visibility. In fact, the GPF feature is counter-evidence of the > > >>> fact that writes can be pending in buffers that are only flushed on a > > >>> GPF event. > > >> > > >> Sounds correct. From what I learned from GPF, ADR, and eADR, there would > > >> still be data in WPQ even though we perform a CPU cache line flush in the > > >> OS. > > >> > > >> This means we don't have a explicit method to make data puncture all caches > > >> and land in the media after writing. also it seems there isn't a explicit > > >> method to invalidate all caches along the entire path. > > >> > > >>> > > >>> I remain skeptical that a software managed inter-host cache-coherency > > >>> scheme can be made reliable with current CXL defined mechanisms. > > >> > > >> > > >> I got your point now, acorrding current CXL Spec, it seems software managed > > >> cache-coherency for inter-host shared memory is not working. Will the next > > >> version of CXL spec consider it? > > >>> > > > > > > Sorry for missing the conversation, have been out of office for a bit. > > > > > > It's not just a CXL spec issue, though that is part of it. I think the > > > CXL spec would have to expose some form of puncturing flush, and this > > > makes the assumption that such a flush doesn't cause some kind of > > > race/deadlock issue. 
Certainly this needs to be discussed. > > > > > > However, consider that the upstream processor actually has to generate > > > this flush. This means adding the flush to existing coherence protocols, > > > or at the very least a new instruction to generate the flush explicitly. > > > The latter seems more likely than the former. > > > > > > This flush would need to ensure the data is forced out of the local WPQ > > > AND all WPQs south of the PCIE complex - because what you really want to > > > know is that the data has actually made it back to a place where remote > > > viewers are capable of percieving the change. > > > > > > So this means: > > > 1) Spec revision with puncturing flush > > > 2) Buy-in from CPU vendors to generate such a flush > > > 3) A new instruction added to the architecture. > > > > > > Call me in a decade or so. > > > > > > > > > But really, I think it likely we see hardware-coherence well before this. > > > For this reason, I have become skeptical of all but a few memory sharing > > > use cases that depend on software-controlled cache-coherency. > > > > Hi Gregory, > > > > From my understanding, we actually has the same idea here. What I am > > saying is that we need SPEC to consider this issue, meaning we need to > > describe how the entire software-coherency mechanism operates, which > > includes the necessary hardware support. Additionally, I agree that if > > software-coherency also requires hardware support, it seems that > > hardware-coherency is the better path. > > > > > > There are some (FAMFS, for example). The coherence state of these > > > systems tend to be less volatile (e.g. mappings are read-only), or > > > they have inherent design limitations (cacheline-sized message passing > > > via write-ahead logging only). > > > > Can you explain more about this? I understand that if the reader in the > > writer-reader model is using a readonly mapping, the interaction will be > > much simpler. 
However, after the writer writes data, if we don't have a > > mechanism to flush and invalidate puncturing all caches, how can the > > readonly reader access the new data? > > There is a mechanism for doing coarse grained flushing that is known to > work on some architectures. Look at cpu_cache_invalidate_memregion(). > On intel/x86 it's wbinvd_on_all_cpus() There is no guarantee on x86 that, after cpu_cache_invalidate_memregion(), a remote shared memory consumer can be assured to see the writes from that event. > on arm64 it's a PSCI firmware call CLEAN_INV_MEMREGION (there is a > public alpha specification for PSCI 1.3 with that defined but we > don't yet have kernel code.) That punches visibility through CXL shared memory devices? > These are very big hammers and so unsuited for anything fine grained. > In the extreme end of possible implementations they briefly stop all > CPUs and clean and invalidate all caches of all types. So not suited > to anything fine grained, but may be acceptable for a rare setup event, > particularly if the main job of the writing host is to fill that memory > for lots of other hosts to use. > > At least the ARM one takes a range so allows for a less painful > implementation. I'm assuming we'll see new architecture over time > but this is a different (and potentially easier) problem space > to what you need. cpu_cache_invalidate_memregion() is only about making sure the local CPU sees new contents after a DPA:HPA remap event. I hope CPUs are able to get away from that responsibility long term when / if future memory expanders just issue back-invalidate automatically when the HDM decoder configuration changes.
On Fri, May 31, 2024 at 10:23 PM, Gregory Price wrote: > On Thu, May 30, 2024 at 02:59:38PM +0800, Dongsheng Yang wrote: >> >> >> On Wed, May 29, 2024 at 11:25 PM, Gregory Price wrote: >>> >>> There are some (FAMFS, for example). The coherence state of these >>> systems tend to be less volatile (e.g. mappings are read-only), or >>> they have inherent design limitations (cacheline-sized message passing >>> via write-ahead logging only). >> >> Can you explain more about this? I understand that if the reader in the >> writer-reader model is using a readonly mapping, the interaction will be >> much simpler. However, after the writer writes data, if we don't have a >> mechanism to flush and invalidate puncturing all caches, how can the >> readonly reader access the new data? > > This is exactly right, so the coherence/correctness of the data needs to > be enforced in some other way. > > Generally speaking, the WPQs will *eventually* get flushed. As such, > the memory will *eventually* become coherent. So if you set up the > following pattern, you will end up with an "eventually coherent" system Yes, it is "eventually coherent" if the "No clean writeback" bit in both CSDS and DVSEC is set. > > 1) Writer instantiates the memory to be used > 2) Writer calculates and records a checksum of that data into memory > 3) Writer invalidates everything > 4) Reader maps the memory > 5) Reader reads the checksum and calculates the checksum of the data > a) if the checksums match, the data is coherent > b) if they don't, we must wait longer for the queues to flush Yes, the checksum approach was mentioned by John; it is used in FAMFS/pcq_lib.c, where pcq uses a sequence number and checksum on the consumer side to ensure data consistency. I think it's a good idea and I was planning to introduce it into cbd. Of course it should be optional for cbd, as cbd currently only supports the hardware-coherency usage; it can be an option to do data verification.
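[Editorial note: a rough sketch of how such a sequence + checksum entry might look, modeled loosely on the pcq idea referenced above. The field layout, names, and 64-byte entry size are illustrative assumptions for this sketch, not famfs's actual on-media format: each cacheline-sized entry carries a sequence number plus a CRC over (seq, payload), and the consumer accepts an entry only when the CRC checks out *and* the sequence has advanced.]

```python
# Illustrative sequence + checksum scheme for cacheline-sized message
# passing: a torn or stale entry fails either the CRC or the sequence
# check and is simply retried later.
import struct
import zlib

ENTRY = 64                       # one cacheline
HDR = struct.Struct("<QI")       # seq (u64), crc32 (u32)
PAYLOAD = ENTRY - HDR.size

def produce(entry: bytearray, seq: int, payload: bytes):
    assert len(payload) <= PAYLOAD
    body = payload.ljust(PAYLOAD, b"\0")
    crc = zlib.crc32(struct.pack("<Q", seq) + body)
    entry[:] = HDR.pack(seq, crc) + body

def consume(entry: bytes, last_seq: int):
    seq, crc = HDR.unpack_from(entry)
    body = entry[HDR.size:]
    if seq <= last_seq:                          # nothing new yet
        return last_seq, None
    if zlib.crc32(struct.pack("<Q", seq) + body) != crc:
        return last_seq, None                    # torn/incoherent: retry later
    return seq, body.rstrip(b"\0")

entry = bytearray(ENTRY)
produce(entry, 1, b"msg-1")
seen, msg = consume(bytes(entry), 0)
print(seen, msg)                                 # 1 b'msg-1'
seen, msg = consume(bytes(entry), seen)          # no new sequence yet
print(msg)                                       # None
```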
Thanx > > This is just one example of a system design which enforces coherence by > placing the limitation on the system that the data will never change > once it becomes coherent. > > Whatever the case, regardless of the scheme you come up with, you will > end up with a system where the data must be inspected and validated > before it can be used. This has the limiting factor of performance: > throughput will be limited by how fast you can validate the data. > > ~Gregory > . >
On Fri, 31 May 2024 20:22:42 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > Jonathan Cameron wrote: > > On Thu, 30 May 2024 14:59:38 +0800 > > Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > > > > > 在 2024/5/29 星期三 下午 11:25, Gregory Price 写道: > > > > On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote: > > > >> > > > >> > > > >> 在 2024/5/22 星期三 上午 2:41, Dan Williams 写道: > > > >>> Dongsheng Yang wrote: > > > >>> > > > >>> What guarantees this property? How does the reader know that its local > > > >>> cache invalidation is sufficient for reading data that has only reached > > > >>> global visibility on the remote peer? As far as I can see, there is > > > >>> nothing that guarantees that local global visibility translates to > > > >>> remote visibility. In fact, the GPF feature is counter-evidence of the > > > >>> fact that writes can be pending in buffers that are only flushed on a > > > >>> GPF event. > > > >> > > > >> Sounds correct. From what I learned from GPF, ADR, and eADR, there would > > > >> still be data in WPQ even though we perform a CPU cache line flush in the > > > >> OS. > > > >> > > > >> This means we don't have a explicit method to make data puncture all caches > > > >> and land in the media after writing. also it seems there isn't a explicit > > > >> method to invalidate all caches along the entire path. > > > >> > > > >>> > > > >>> I remain skeptical that a software managed inter-host cache-coherency > > > >>> scheme can be made reliable with current CXL defined mechanisms. > > > >> > > > >> > > > >> I got your point now, acorrding current CXL Spec, it seems software managed > > > >> cache-coherency for inter-host shared memory is not working. Will the next > > > >> version of CXL spec consider it? > > > >>> > > > > > > > > Sorry for missing the conversation, have been out of office for a bit. > > > > > > > > It's not just a CXL spec issue, though that is part of it. 
I think the > > > > CXL spec would have to expose some form of puncturing flush, and this > > > > makes the assumption that such a flush doesn't cause some kind of > > > > race/deadlock issue. Certainly this needs to be discussed. > > > > > > > > However, consider that the upstream processor actually has to generate > > > > this flush. This means adding the flush to existing coherence protocols, > > > > or at the very least a new instruction to generate the flush explicitly. > > > > The latter seems more likely than the former. > > > > > > > > This flush would need to ensure the data is forced out of the local WPQ > > > > AND all WPQs south of the PCIE complex - because what you really want to > > > > know is that the data has actually made it back to a place where remote > > > > viewers are capable of percieving the change. > > > > > > > > So this means: > > > > 1) Spec revision with puncturing flush > > > > 2) Buy-in from CPU vendors to generate such a flush > > > > 3) A new instruction added to the architecture. > > > > > > > > Call me in a decade or so. > > > > > > > > > > > > But really, I think it likely we see hardware-coherence well before this. > > > > For this reason, I have become skeptical of all but a few memory sharing > > > > use cases that depend on software-controlled cache-coherency. > > > > > > Hi Gregory, > > > > > > From my understanding, we actually has the same idea here. What I am > > > saying is that we need SPEC to consider this issue, meaning we need to > > > describe how the entire software-coherency mechanism operates, which > > > includes the necessary hardware support. Additionally, I agree that if > > > software-coherency also requires hardware support, it seems that > > > hardware-coherency is the better path. > > > > > > > > There are some (FAMFS, for example). The coherence state of these > > > > systems tend to be less volatile (e.g. 
mappings are read-only), or > > > > they have inherent design limitations (cacheline-sized message passing > > > > via write-ahead logging only). > > > > > > Can you explain more about this? I understand that if the reader in the > > > writer-reader model is using a readonly mapping, the interaction will be > > > much simpler. However, after the writer writes data, if we don't have a > > > mechanism to flush and invalidate puncturing all caches, how can the > > > readonly reader access the new data? > > > > There is a mechanism for doing coarse grained flushing that is known to > > work on some architectures. Look at cpu_cache_invalidate_memregion(). > > On intel/x86 it's wbinvd_on_all_cpus() > > There is no guarantee on x86 that after cpu_cache_invalidate_memregion() > that a remote shared memory consumer can be assured to see the writes > from that event. I was wondering about that after I wrote this... I guess it guarantees we won't get a late landing write, or is that not even true? So if we remove memory, then add fresh memory again quickly enough, can we get a left-over write showing up? I guess that doesn't matter as the kernel will chase it with a memset(0) anyway and that will be ordered with respect to the same address. However we won't be able to elide that zeroing even if we know the device did it, which makes some operations the device might support rather pointless :( > > > on arm64 it's a PSCI firmware call CLEAN_INV_MEMREGION (there is a > > public alpha specification for PSCI 1.3 with that defined but we > > don't yet have kernel code.) > > That punches visibility through CXL shared memory devices? It's a draft spec and Mark + James in +CC can hopefully confirm. It does say "Cleans and invalidates all caches, including system caches", which I'd read as meaning it should, but good to confirm. > > > These are very big hammers and so unsuited for anything fine grained.
> > In the extreme end of possible implementations they briefly stop all > > CPUs and clean and invalidate all caches of all types. So not suited > > to anything fine grained, but may be acceptable for a rare setup event, > > particularly if the main job of the writing host is to fill that memory > > for lots of other hosts to use. > > > > At least the ARM one takes a range so allows for a less painful > > implementation. I'm assuming we'll see new architecture over time > > but this is a different (and potentially easier) problem space > > to what you need. > > cpu_cache_invalidate_memregion() is only about making sure local CPU > sees new contents after an DPA:HPA remap event. I hope CPUs are able to > get away from that responsibility long term when / if future memory > expanders just issue back-invalidate automatically when the HDM decoder > configuration changes. I would love that to be the way things go, but I fear the overheads of doing that on the protocol mean people will want the option of the painful approach. Jonathan
Hi guys,

On 03/06/2024 13:48, Jonathan Cameron wrote:
> On Fri, 31 May 2024 20:22:42 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
>> Jonathan Cameron wrote:
>>> On Thu, 30 May 2024 14:59:38 +0800
>>> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>>>> On Wed, May 29, 2024 at 11:25 PM, Gregory Price wrote:
>>>>> It's not just a CXL spec issue, though that is part of it. I think the
>>>>> CXL spec would have to expose some form of puncturing flush, and this
>>>>> makes the assumption that such a flush doesn't cause some kind of
>>>>> race/deadlock issue. Certainly this needs to be discussed.
>>>>>
>>>>> However, consider that the upstream processor actually has to generate
>>>>> this flush. This means adding the flush to existing coherence protocols,
>>>>> or at the very least a new instruction to generate the flush explicitly.
>>>>> The latter seems more likely than the former.
>>>>>
>>>>> This flush would need to ensure the data is forced out of the local WPQ
>>>>> AND all WPQs south of the PCIE complex - because what you really want to
>>>>> know is that the data has actually made it back to a place where remote
>>>>> viewers are capable of perceiving the change.
>>>>>
>>>>> So this means:
>>>>> 1) Spec revision with puncturing flush
>>>>> 2) Buy-in from CPU vendors to generate such a flush
>>>>> 3) A new instruction added to the architecture.
>>>>>
>>>>> Call me in a decade or so.
>>>>>
>>>>> But really, I think it likely we see hardware-coherence well before this.
>>>>> For this reason, I have become skeptical of all but a few memory sharing
>>>>> use cases that depend on software-controlled cache-coherency.
>>>>
>>>> Hi Gregory,
>>>>
>>>> From my understanding, we actually have the same idea here. What I am
>>>> saying is that we need the SPEC to consider this issue, meaning we need to
>>>> describe how the entire software-coherency mechanism operates, which
>>>> includes the necessary hardware support. Additionally, I agree that if
>>>> software-coherency also requires hardware support, it seems that
>>>> hardware-coherency is the better path.
>>>>>
>>>>> There are some (FAMFS, for example). The coherence state of these
>>>>> systems tend to be less volatile (e.g. mappings are read-only), or
>>>>> they have inherent design limitations (cacheline-sized message passing
>>>>> via write-ahead logging only).
>>>>
>>>> Can you explain more about this? I understand that if the reader in the
>>>> writer-reader model is using a readonly mapping, the interaction will be
>>>> much simpler. However, after the writer writes data, if we don't have a
>>>> mechanism to flush and invalidate puncturing all caches, how can the
>>>> readonly reader access the new data?
>>>
>>> There is a mechanism for doing coarse grained flushing that is known to
>>> work on some architectures. Look at cpu_cache_invalidate_memregion().
>>> On intel/x86 it's wbinvd_on_all_cpus()
>>
>> There is no guarantee on x86 that after cpu_cache_invalidate_memregion()
>> that a remote shared memory consumer can be assured to see the writes
>> from that event.
>
> I was wondering about that after I wrote this... I guess it guarantees
> we won't get a late landing write or is that not even true?
>
> So if we remove memory, then added fresh memory again quickly enough
> can we get a left over write showing up? I guess that doesn't matter as
> the kernel will chase it with a memset(0) anyway and that will be ordered
> as to the same address.
>
> However we won't be able to elide that zeroing even if we know the device
> did it which makes some operations the device might support rather
> pointless :(

>>> on arm64 it's a PSCI firmware call CLEAN_INV_MEMREGION (there is a
>>> public alpha specification for PSCI 1.3 with that defined but we
>>> don't yet have kernel code.)

I have an RFC for that - but I haven't had time to update and re-test it.
If you need this, and have a platform where it can be implemented, please
get in touch with the people that look after the specs to move it along
from alpha.

>> That punches visibility through CXL shared memory devices?
>
> It's a draft spec and Mark + James in +CC can hopefully confirm.
> It does say
> "Cleans and invalidates all caches, including system caches".
> which I'd read as meaning it should but good to confirm.

It's intended to remove any cached entries - including lines in what the
arm-arm calls "invisible" system caches, which typically only platform
firmware can touch. The next access should have to go all the way to the
media. (I don't know enough about CXL to say what a remote shared memory
consumer observes)

Without it, all we have are the by-VA operations which are painfully slow
for large regions, and insufficient for system caches.

As with all those firmware interfaces - it's for the platform implementer
to wire up whatever is necessary to remove cached content for the specified
range. Just because there is an (alpha!) spec doesn't mean it can be
supported efficiently by a particular platform.

>>> These are very big hammers and so unsuited for anything fine grained.

You forgot really ugly too!

>>> In the extreme end of possible implementations they briefly stop all
>>> CPUs and clean and invalidate all caches of all types. So not suited
>>> to anything fine grained, but may be acceptable for a rare setup event,
>>> particularly if the main job of the writing host is to fill that memory
>>> for lots of other hosts to use.
>>>
>>> At least the ARM one takes a range so allows for a less painful
>>> implementation.

That is to allow some ranges to fail. (e.g. you can do this to the CXL
windows, but not the regular DRAM).

On the less painful implementation, arm's interconnect has a gadget that
does "Address based flush" which could be used here. I'd hope platforms
with that don't need to interrupt all CPUs - but it depends on what else
needs to be done.

>>> I'm assuming we'll see new architecture over time
>>> but this is a different (and potentially easier) problem space
>>> to what you need.
>>
>> cpu_cache_invalidate_memregion() is only about making sure the local CPU
>> sees new contents after a DPA:HPA remap event. I hope CPUs are able to
>> get away from that responsibility long term when / if future memory
>> expanders just issue back-invalidate automatically when the HDM decoder
>> configuration changes.
>
> I would love that to be the way things go, but I fear the overheads of
> doing that on the protocol means people will want the option of the painful
> approach.


Thanks,

James
On Mon, 3 Jun 2024 18:28:51 +0100
James Morse <james.morse@arm.com> wrote:

> Hi guys,
>
> On 03/06/2024 13:48, Jonathan Cameron wrote:
> > On Fri, 31 May 2024 20:22:42 -0700
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >> Jonathan Cameron wrote:
> >>> On Thu, 30 May 2024 14:59:38 +0800
> >>> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> >>>> On Wed, May 29, 2024 at 11:25 PM, Gregory Price wrote:
> >>>>> It's not just a CXL spec issue, though that is part of it. I think the
> >>>>> CXL spec would have to expose some form of puncturing flush, and this
> >>>>> makes the assumption that such a flush doesn't cause some kind of
> >>>>> race/deadlock issue. Certainly this needs to be discussed.
> >>>>>
> >>>>> However, consider that the upstream processor actually has to generate
> >>>>> this flush. This means adding the flush to existing coherence protocols,
> >>>>> or at the very least a new instruction to generate the flush explicitly.
> >>>>> The latter seems more likely than the former.
> >>>>>
> >>>>> This flush would need to ensure the data is forced out of the local WPQ
> >>>>> AND all WPQs south of the PCIE complex - because what you really want to
> >>>>> know is that the data has actually made it back to a place where remote
> >>>>> viewers are capable of perceiving the change.
> >>>>>
> >>>>> So this means:
> >>>>> 1) Spec revision with puncturing flush
> >>>>> 2) Buy-in from CPU vendors to generate such a flush
> >>>>> 3) A new instruction added to the architecture.
> >>>>>
> >>>>> Call me in a decade or so.
> >>>>>
> >>>>> But really, I think it likely we see hardware-coherence well before this.
> >>>>> For this reason, I have become skeptical of all but a few memory sharing
> >>>>> use cases that depend on software-controlled cache-coherency.
> >>>>
> >>>> Hi Gregory,
> >>>>
> >>>> From my understanding, we actually have the same idea here. What I am
> >>>> saying is that we need the SPEC to consider this issue, meaning we need to
> >>>> describe how the entire software-coherency mechanism operates, which
> >>>> includes the necessary hardware support. Additionally, I agree that if
> >>>> software-coherency also requires hardware support, it seems that
> >>>> hardware-coherency is the better path.
> >>>>>
> >>>>> There are some (FAMFS, for example). The coherence state of these
> >>>>> systems tend to be less volatile (e.g. mappings are read-only), or
> >>>>> they have inherent design limitations (cacheline-sized message passing
> >>>>> via write-ahead logging only).
> >>>>
> >>>> Can you explain more about this? I understand that if the reader in the
> >>>> writer-reader model is using a readonly mapping, the interaction will be
> >>>> much simpler. However, after the writer writes data, if we don't have a
> >>>> mechanism to flush and invalidate puncturing all caches, how can the
> >>>> readonly reader access the new data?
> >>>
> >>> There is a mechanism for doing coarse grained flushing that is known to
> >>> work on some architectures. Look at cpu_cache_invalidate_memregion().
> >>> On intel/x86 it's wbinvd_on_all_cpus()
> >>
> >> There is no guarantee on x86 that after cpu_cache_invalidate_memregion()
> >> that a remote shared memory consumer can be assured to see the writes
> >> from that event.
> >
> > I was wondering about that after I wrote this... I guess it guarantees
> > we won't get a late landing write or is that not even true?
> >
> > So if we remove memory, then added fresh memory again quickly enough
> > can we get a left over write showing up? I guess that doesn't matter as
> > the kernel will chase it with a memset(0) anyway and that will be ordered
> > as to the same address.
> >
> > However we won't be able to elide that zeroing even if we know the device
> > did it which makes some operations the device might support rather
> > pointless :(
>
> >>> on arm64 it's a PSCI firmware call CLEAN_INV_MEMREGION (there is a
> >>> public alpha specification for PSCI 1.3 with that defined but we
> >>> don't yet have kernel code.)
>
> I have an RFC for that - but I haven't had time to update and re-test it.

If it's useful, I might either be able to find time to take that forwards
(or get someone else to do it). Let me know if that would be helpful; I'd
love to add this to the list of things I can forget about because it just
works for the kernel (and hence is a problem for the firmware and uarch
folk).

> If you need this, and have a platform where it can be implemented, please
> get in touch with the people that look after the specs to move it along
> from alpha.
>
> >> That punches visibility through CXL shared memory devices?
> >
> > It's a draft spec and Mark + James in +CC can hopefully confirm.
> > It does say
> > "Cleans and invalidates all caches, including system caches".
> > which I'd read as meaning it should but good to confirm.
>
> It's intended to remove any cached entries - including lines in what the
> arm-arm calls "invisible" system caches, which typically only platform
> firmware can touch. The next access should have to go all the way to the
> media. (I don't know enough about CXL to say what a remote shared memory
> consumer observes)

If it's out of the host bridge buffers (and known to have succeeded in
write back) which I think the host should know, I believe what happens next
is a device implementer problem. Hopefully anyone designing a device that
does memory sharing has built that part right.

> Without it, all we have are the by-VA operations which are painfully slow
> for large regions, and insufficient for system caches.
>
> As with all those firmware interfaces - it's for the platform implementer
> to wire up whatever is necessary to remove cached content for the
> specified range. Just because there is an (alpha!) spec doesn't mean it
> can be supported efficiently by a particular platform.
>
> >>> These are very big hammers and so unsuited for anything fine grained.
>
> You forgot really ugly too!

I was being polite :)

> >>> In the extreme end of possible implementations they briefly stop all
> >>> CPUs and clean and invalidate all caches of all types. So not suited
> >>> to anything fine grained, but may be acceptable for a rare setup event,
> >>> particularly if the main job of the writing host is to fill that memory
> >>> for lots of other hosts to use.
> >>>
> >>> At least the ARM one takes a range so allows for a less painful
> >>> implementation.
>
> That is to allow some ranges to fail. (e.g. you can do this to the CXL
> windows, but not the regular DRAM).
>
> On the less painful implementation, arm's interconnect has a gadget that
> does "Address based flush" which could be used here. I'd hope platforms
> with that don't need to interrupt all CPUs - but it depends on what else
> needs to be done.
>
> >>> I'm assuming we'll see new architecture over time
> >>> but this is a different (and potentially easier) problem space
> >>> to what you need.
> >>
> >> cpu_cache_invalidate_memregion() is only about making sure the local CPU
> >> sees new contents after a DPA:HPA remap event. I hope CPUs are able to
> >> get away from that responsibility long term when / if future memory
> >> expanders just issue back-invalidate automatically when the HDM decoder
> >> configuration changes.
> >
> > I would love that to be the way things go, but I fear the overheads of
> > doing that on the protocol means people will want the option of the painful
> > approach.
>
> Thanks,
>
> James

Thanks for the info,

Jonathan
From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>

Hi all,
	This patchset introduces cbd (CXL block device). It is based on linux 6.8, and
available at:
	https://github.com/DataTravelGuide/linux

(1) What is cbd:
	As shared memory is supported in the CXL 3.0 spec, we can transfer data via CXL
shared memory. CBD means CXL block device; it uses CXL shared memory to transfer
commands and data to access a block device on a different host, as shown below:

┌───────────────────────────────┐       ┌────────────────────────────────────┐
│            node-1             │       │               node-2               │
├───────────────────────────────┤       ├────────────────────────────────────┤
│                               │       │                                    │
│                       ┌───────┤       ├─────────┐                          │
│                       │ cbd0  │       │ backend0├──────────────────┐       │
│                       ├───────┤       ├─────────┤                  │       │
│                       │ pmem0 │       │  pmem0  │                  ▼       │
│               ┌───────┴───────┤       ├─────────┴────┐     ┌───────────────┤
│               │  cxl driver   │       │ cxl driver   │     │   /dev/sda    │
└───────────────┴────────┬──────┘       └─────┬────────┴─────┴───────────────┘
                         │                    │
                         │                    │
                         │    CXL        CXL  │
                         └────────────┐  ┌────┘
                                      │  │
                                      │  │
                    ┌─────────────────┴──┴─────────────────────┐
                    │   shared memory device(cbd transport)    │
                    └──────────────────────────────────────────┘

	Any read/write to cbd0 on node-1 will be transferred to /dev/sda on node-2. It
works similarly to nbd (network block device), but it transfers data via CXL shared
memory rather than the network.

(2) Layout of transport:

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                         cbd transport                                                         │
├────────────────────┬───────────────────────┬───────────────────────┬──────────────────────┬───────────────────────────────────┤
│                    │         hosts         │       backends        │       blkdevs        │             channels              │
│ cbd transport info ├────┬────┬────┬────────┼────┬────┬────┬────────┼────┬────┬────┬───────┼───────┬───────┬───────┬───────────┤
│                    │    │    │    │  ...   │    │    │    │  ...   │    │    │    │  ...  │       │       │       │    ...    │
└────────────────────┴────┴────┴────┴────────┴────┴────┴────┴────────┴────┴────┴────┴───────┴───┬───┴───────┴───────┴───────────┘
                                                                                                │
                                                                                                │
                                                                                                │
          ┌─────────────────────────────────────────────────────────────────────────────────────┘
          │
          ▼
┌───────────────────────────────────────────────────────────┐
│                          channel                          │
├────────────────────┬──────────────────────────────────────┤
│    channel meta    │             channel data             │
└─────────┬──────────┴──────────────────────────────────────┘
          │
          │
          ▼
┌──────────────────────────────────────────────────────────┐
│                       channel meta                       │
├───────────┬──────────────┬───────────────────────────────┤
│ meta ctrl │  comp ring   │           cmd ring            │
└───────────┴──────────────┴───────────────────────────────┘

	The shared memory is divided into five regions:

	a) Transport_info:
	Information about the overall transport, including the layout of the transport.

	b) Hosts:
	Each host wishing to utilize this transport needs to register its own information
within a host entry in this region.

	c) Backends:
	Starting a backend on a host requires filling in information in a backend entry
within this region.

	d) Blkdevs:
	Once a backend is established, it can be mapped to any associated host. The
information about the blkdevs is then filled into the blkdevs region.

	e) Channels:
	This is the actual data communication area, where communication between blkdev and
backend occurs. Each queue of a block device uses a channel, and each backend has a
corresponding handler interacting with this queue.

	f) Channel:
	A channel is further divided into meta and data regions. The meta region includes
the cmd ring and comp ring. The blkdev converts upper-layer requests into cbd_se and
fills them into the cmd ring. The handler accepts the cbd_se from the cmd ring and
sends them to the local actual block device of the backend (e.g., sda). After
completion, the results are formed into cbd_ce and filled into the comp ring. The
blkdev then receives the cbd_ce and returns the results to the upper-layer IO sender.
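The cmd/comp ring handshake described in f) can be sketched roughly as below. This is an illustrative user-space model only: the structure and field names (cbd_se, cbd_ce, the head/tail indices) are simplified assumptions, not the actual definitions in cbd_internal.h, and the flush comments merely mark where the cache writeback/invalidate steps discussed in section (5) would go on real CXL shared memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CBD_RING_ENTRIES 8   /* illustrative ring size, not the real default */

/* Hypothetical submission entry (cbd_se), filled by the blkdev side. */
struct cbd_se {
	uint64_t offset;     /* byte offset into the backend block device */
	uint32_t len;        /* transfer length in bytes */
	uint32_t op;         /* 0 = read, 1 = write (illustrative encoding) */
	uint64_t priv_data;  /* token echoed back in the completion */
};

/* Hypothetical completion entry (cbd_ce), filled by the backend handler. */
struct cbd_ce {
	uint64_t priv_data;  /* matches the originating cbd_se */
	int32_t  result;     /* 0 on success, negative errno otherwise */
};

/* Hypothetical channel metadata: one cmd ring and one comp ring. */
struct cbd_channel_meta {
	uint32_t cmd_head;    /* advanced by blkdev after queueing a cbd_se */
	uint32_t cmd_tail;    /* advanced by handler after consuming one */
	uint32_t compr_head;  /* advanced by handler after queueing a cbd_ce */
	uint32_t compr_tail;  /* advanced by blkdev after consuming one */
	struct cbd_se cmd_ring[CBD_RING_ENTRIES];
	struct cbd_ce comp_ring[CBD_RING_ENTRIES];
};

/* blkdev side: queue one request.  On real CXL shared memory, the entry and
 * then cmd_head would each be written back (e.g. clflushopt + sfence on x86)
 * so the remote handler observes the entry before it observes the new head. */
static void blkdev_submit(struct cbd_channel_meta *m, const struct cbd_se *se)
{
	m->cmd_ring[m->cmd_head % CBD_RING_ENTRIES] = *se;
	/* flush the cbd_se slot here */
	m->cmd_head++;
	/* flush cmd_head here, making it visible to the remote handler */
}

/* handler side: consume pending requests and complete them.  On real shared
 * memory the handler would first flush/invalidate its cached view of cmd_head
 * and each entry, and flush the cbd_ce and compr_head after writing them. */
static void handler_service(struct cbd_channel_meta *m)
{
	while (m->cmd_tail != m->cmd_head) {
		struct cbd_se *se = &m->cmd_ring[m->cmd_tail % CBD_RING_ENTRIES];
		struct cbd_ce ce = { .priv_data = se->priv_data, .result = 0 };

		/* ...here the real handler would submit se to the local block
		 * device of the backend (e.g. /dev/sda) and wait for it... */
		m->comp_ring[m->compr_head % CBD_RING_ENTRIES] = ce;
		m->compr_head++;  /* then flush the cbd_ce slot and compr_head */
		m->cmd_tail++;
	}
}
```

Since each queue of a block device uses its own channel, a single producer and a single consumer advance these indices, which is what makes the simple head/tail scheme above sufficient.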
	Currently, the number of entries in each region and the channel size are both set
to default values. In the future, they will be made configurable.

(3) Naming of CBD:
	Actually it does not strictly depend on CXL; any shared memory can be used for cbd.
But I did not find a better name; maybe smxbd (shared memory transport block device)?
I chose CBD as it sounds more concise and elegant. Any suggestions?

(4) dax is not supported yet:
	Same as famfs, dax device is not supported here, because dax device does not
support dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support
DAX mode.

(5) How do blkdev and backend interact through the channel?
	a) On the reader side, before reading the data, if the data in this channel may be
modified by the other party, then I need to flush the cache before reading to ensure
that I get the latest data. For example, the blkdev needs to flush the cache before
obtaining compr_head, because compr_head will be updated by the backend handler.
	b) On the writer side, if the written information will be read by the other party,
then after writing, I need to flush the cache to let the other party see it
immediately. For example, after blkdev submits a cbd_se, it needs to update cmd_head
to let the handler have a new cbd_se. Therefore, after updating cmd_head, I need to
flush the cache to let the backend see it.

(6) Race between management operations:
	There may be a race condition; for example, if we use backend-start on different
nodes at the same time, it is possible to allocate the same backend ID. This issue
should be handled by an upper-layer manager, ensuring that all management operations
are serialized, such as by acquiring a distributed lock.

(7) What's next?
	This is a first version of CBD, and there is still much work to be done, such as:
how to recover a backend service when a backend node fails? How to gracefully stop the
associated blkdev when a backend service cannot be recovered?
How to clear dead information within the transport layer? For non-volatile memory
transports, allocating a new area as a Write-Ahead Log (WAL) may be considered.

(8) Testing with qemu:
	We can use two QEMU virtual machines to test CBD by sharing a CXLMemDev:

	a) Start two QEMU virtual machines, sharing a CXLMemDev.

root@qemu-2:~# cxl list
[
  {
    "memdev":"mem0",
    "pmem_size":536870912,
    "serial":0,
    "host":"0000:0d:00.0"
  }
]

root@qemu-1:~# cxl list
[
  {
    "memdev":"mem0",
    "pmem_size":536870912,
    "serial":0,
    "host":"0000:0d:00.0"
  }
]

	b) Register a CBD transport on node-1 and add a backend, specifying the path as
/dev/ram0p1.

root@qemu-1:~# cxl create-region -m mem0 -d decoder0.0 -t pmem
{
  "region":"region0",
  "resource":"0x1890000000",
  "size":"512.00 MiB (536.87 MB)",
  "type":"pmem",
  "interleave_ways":1,
  "interleave_granularity":256,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder2.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region

root@qemu-1:~# ndctl create-namespace -r region0 -m fsdax --map dev -t pmem
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"502.00 MiB (526.39 MB)",
  "uuid":"618e9627-4345-4046-ba46-becf430a1464",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}

root@qemu-1:~# echo "path=/dev/pmem0,hostname=node-1,force=1,format=1" > /sys/bus/cbd/transport_register
root@qemu-1:~# echo "op=backend-start,path=/dev/ram0p1" > /sys/bus/cbd/devices/transport0/adm

	c) Register a CBD transport on node-2 and add a blkdev, specifying the backend ID
as the backend on node-1.
root@qemu-2:~# cxl create-region -m mem0 -d decoder0.0 -t pmem
{
  "region":"region0",
  "resource":"0x390000000",
  "size":"512.00 MiB (536.87 MB)",
  "type":"pmem",
  "interleave_ways":1,
  "interleave_granularity":256,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder2.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region

root@qemu-2:~# ndctl create-namespace -r region0 -m fsdax --map dev -t pmem -b 0
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"502.00 MiB (526.39 MB)",
  "uuid":"a7fae1a5-2cba-46d7-83a2-20a76d736848",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}

root@qemu-2:~# echo "path=/dev/pmem0,hostname=node-2" > /sys/bus/cbd/transport_register
root@qemu-2:~# echo "op=dev-start,backend_id=0,queues=1" > /sys/bus/cbd/devices/transport0/adm

	d) On node-2, you will get a /dev/cbd0, and all reads and writes to cbd0 will
actually read from and write to /dev/ram0p1 on node-1.

root@qemu-2:~# mkfs.xfs -f /dev/cbd0
meta-data=/dev/cbd0              isize=512    agcount=4, agsize=655360 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=2621440, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Thanx

Dongsheng Yang (7):
  block: Init for CBD(CXL Block Device)
  cbd: introduce cbd_transport
  cbd: introduce cbd_channel
  cbd: introduce cbd_host
  cbd: introuce cbd_backend
  cbd: introduce cbd_blkdev
  cbd: add related sysfs files in transport register

 drivers/block/Kconfig             |   2 +
 drivers/block/Makefile            |   2 +
 drivers/block/cbd/Kconfig         |   4 +
 drivers/block/cbd/Makefile        |   3 +
 drivers/block/cbd/cbd_backend.c   | 254 +++++++++
 drivers/block/cbd/cbd_blkdev.c    | 375 +++++++++++++
 drivers/block/cbd/cbd_channel.c   | 179 +++++++
 drivers/block/cbd/cbd_handler.c   | 261 +++++++++
 drivers/block/cbd/cbd_host.c      | 123 +++++
 drivers/block/cbd/cbd_internal.h  | 830 +++++++++++++++++++++++++++++
 drivers/block/cbd/cbd_main.c      | 230 ++++++++
 drivers/block/cbd/cbd_queue.c     | 621 ++++++++++++++++++++++
 drivers/block/cbd/cbd_transport.c | 845 ++++++++++++++++++++++++++++++
 13 files changed, 3729 insertions(+)
 create mode 100644 drivers/block/cbd/Kconfig
 create mode 100644 drivers/block/cbd/Makefile
 create mode 100644 drivers/block/cbd/cbd_backend.c
 create mode 100644 drivers/block/cbd/cbd_blkdev.c
 create mode 100644 drivers/block/cbd/cbd_channel.c
 create mode 100644 drivers/block/cbd/cbd_handler.c
 create mode 100644 drivers/block/cbd/cbd_host.c
 create mode 100644 drivers/block/cbd/cbd_internal.h
 create mode 100644 drivers/block/cbd/cbd_main.c
 create mode 100644 drivers/block/cbd/cbd_queue.c
 create mode 100644 drivers/block/cbd/cbd_transport.c