Message ID: 20231213175329.594-1-sthanneeru.opensrc@micron.com
Series: Node migration between memory tiers
<sthanneeru.opensrc@micron.com> writes:

> From: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
>
> The memory tiers feature allows nodes with similar memory types
> or performance characteristics to be grouped together in a
> memory tier. However, there is currently no provision for
> moving a node from one tier to another on demand.
>
> This patch series aims to support node migration between tiers
> on demand by the sysadmin/root user, using the provided sysfs
> interface for node migration.
>
> To migrate a node to a tier, the corresponding node's sysfs
> memtier_override is written with the target tier id.
>
> Example: Move node2 to memory_tier2 from its default tier (i.e. 4)
>
> 1. To check the current memtier of node2:
>    $ cat /sys/devices/system/node/node2/memtier_override
>    memory_tier4
>
> 2. To migrate node2 to memory_tier2:
>    $ echo 2 > /sys/devices/system/node/node2/memtier_override
>    $ cat /sys/devices/system/node/node2/memtier_override
>    memory_tier2
>
> Use cases:
>
> 1. Useful for moving CXL nodes to the right tiers from userspace,
>    when the hardware fails to assign the tiers correctly based on
>    memory types.
>
>    On some platforms we have observed CXL memory being assigned to
>    the same tier as DDR memory. This is arguably a system firmware
>    bug, but it is true that tiers represent *ranges* of performance,
>    and we believe it's important for the system operator to have
>    the ability to override bad firmware or OS decisions about tier
>    assignment as a fail-safe against potentially bad outcomes.
>
> 2. Useful if we want interleave weights to be applied to memory
>    tiers instead of nodes.
>    In a previous thread, Huang Ying <ying.huang@intel.com> thought
>    this feature might be useful for overcoming limitations of
>    systems where nodes with different bandwidth characteristics are
>    grouped in a single tier.
> https://lore.kernel.org/lkml/87a5rw1wu8.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> =============
> Version Notes:
>
> V2 : Changed interface to memtier_override from adistance_offset.
>      memtier_override was recommended by
>      1. John Groves <john@jagalactic.com>
>      2. Ravi Shankar <ravis.opensrc@micron.com>
>      3. Brice Goglin <Brice.Goglin@inria.fr>

It appears that you ignored my comments for V1, as follows ...

https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/

--
Best Regards,
Huang, Ying

> V1 : Introduced adistance_offset sysfs.
>
> =============
>
> Srinivasulu Thanneeru (2):
>   base/node: Add sysfs for memtier_override
>   memory tier: Support node migration between tiers
>
>  Documentation/ABI/stable/sysfs-devices-node |  7 ++
>  drivers/base/node.c                         | 47 ++++++++++++
>  include/linux/memory-tiers.h                | 11 +++
>  include/linux/node.h                        | 11 +++
>  mm/memory-tiers.c                           | 85 ++++++++++++---------
>  5 files changed, 125 insertions(+), 36 deletions(-)
On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
> <sthanneeru.opensrc@micron.com> writes:
>
> > =============
> > Version Notes:
> >
> > V2 : Changed interface to memtier_override from adistance_offset.
> >      memtier_override was recommended by
> >      1. John Groves <john@jagalactic.com>
> >      2. Ravi Shankar <ravis.opensrc@micron.com>
> >      3. Brice Goglin <Brice.Goglin@inria.fr>
>
> It appears that you ignored my comments for V1 as follows ...
>
> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
> https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/

Not speaking for the group, just chiming in because I'd discussed it
with them.

"Memory Type" is a bit nebulous. Is a Micron Type-3 with performance X
and an SK Hynix Type-3 with performance Y a "different type", or are
they the "same type" given that they're both Type 3 backed by some form
of DDR? Is socket placement of those devices relevant for determining
"type"? Is whether they are behind a switch relevant for determining
"type"? "Type" is frustrating when everything we're talking about
managing is Type-3 with different performance.

A concrete example: to the system, a Multi-Headed Single Logical Device
(MH-SLD) looks exactly the same as a standard SLD. I may want to have
some combination of local memory expansion devices on the majority of
my expansion slots, but reserve one slot on each socket for a
connection to the MH-SLD. As of right now there is no good way to
differentiate the devices in terms of "type" - and even if you had
that, the tiering system would still lump them together.

Similarly, an initial run of switches may or may not allow enumeration
of the devices behind it (it depends on the configuration), so you may
end up with a static NUMA node that "looks like" another SLD, despite
it being some definition of GFAM. Does the number of hops matter in
determining "type"?
So I really don't think "type" is useful for determining tier
placement.

As of right now, the system lumps DRAM nodes into one tier and pretty
much everything else into "the other tier". To me, this patch set is an
initial pass meant to allow user control over tier composition while
the internal mechanism is sussed out and the environment develops.

In general, a release valve that lets you redefine tiers is very
welcome for testing and validation of different setups while the
industry evolves.

Just my two cents.
~Gregory

> --
> Best Regards,
> Huang, Ying
>
Gregory Price <gregory.price@memverge.com> writes:

> On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
>> <sthanneeru.opensrc@micron.com> writes:
>>
>> > =============
>> > Version Notes:
>> >
>> > V2 : Changed interface to memtier_override from adistance_offset.
>> >      memtier_override was recommended by
>> >      1. John Groves <john@jagalactic.com>
>> >      2. Ravi Shankar <ravis.opensrc@micron.com>
>> >      3. Brice Goglin <Brice.Goglin@inria.fr>
>>
>> It appears that you ignored my comments for V1 as follows ...
>>
>> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> Not speaking for the group, just chiming in because I'd discussed it
> with them.
>
> "Memory Type" is a bit nebulous. Is a Micron Type-3 with performance X
> and an SK Hynix Type-3 with performance Y a "different type", or are
> they the "same type" given that they're both Type 3 backed by some
> form of DDR? Is socket placement of those devices relevant for
> determining "type"? Is whether they are behind a switch relevant?
> "Type" is frustrating when everything we're talking about managing is
> Type-3 with different performance.
>
> A concrete example: to the system, a Multi-Headed Single Logical
> Device (MH-SLD) looks exactly the same as a standard SLD. I may want
> to have some combination of local memory expansion devices on the
> majority of my expansion slots, but reserve one slot on each socket
> for a connection to the MH-SLD. As of right now there is no good way
> to differentiate the devices in terms of "type" - and even if you had
> that, the tiering system would still lump them together.
> Similarly, an initial run of switches may or may not allow
> enumeration of the devices behind it (depends on the configuration),
> so you may end up with a static NUMA node that "looks like" another
> SLD - despite it being some definition of GFAM. Does the number of
> hops matter in determining "type"?

In the original design, the memory devices of the same memory type are
managed by the same device driver, linked with the system in the same
way (including switches), and built with the same media. So their
performance is the same too. And, like memory tiers, memory types are
orthogonal to sockets. Do you think the definition itself is clear
enough?

I admit "memory type" is a confusing name. Do you have a better
suggestion?

> So I really don't think "type" is useful for determining tier
> placement.
>
> As of right now, the system lumps DRAM nodes into one tier, and
> pretty much everything else into "the other tier". To me, this patch
> set is an initial pass meant to allow user control over tier
> composition while the internal mechanism is sussed out and the
> environment develops.

The patchset to identify the performance of memory devices and put
them into proper "memory types" and memory tiers via HMAT was merged
in v6.7-rc1:

07a8bdd4120c (memory tiering: add abstract distance calculation
              algorithms management, 2023-09-26)
d0376aac59a1 (acpi, hmat: refactor hmat_register_target_initiators(),
              2023-09-26)
3718c02dbd4c (acpi, hmat: calculate abstract distance with HMAT,
              2023-09-26)
6bc2cfdf82d5 (dax, kmem: calculate abstract distance with general
              interface, 2023-09-26)

> In general, a release valve that lets you redefine tiers is very
> welcome for testing and validation of different setups while the
> industry evolves.
>
> Just my two cents.

--
Best Regards,
Huang, Ying
Hi, Srinivasulu,

Please use an email client that works for kernel patch review. Your
email is hard to read: it's hard to identify which part is your text
and which part is mine. Please refer to
https://www.kernel.org/doc/html/latest/process/email-clients.html
or something similar, for example
https://elinux.org/Mail_client_tips

Srinivasulu Thanneeru <sthanneeru@micron.com> writes:

> ________________________________________
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Friday, December 15, 2023 10:32 AM
> Subject: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory
> tiers
>
> <sthanneeru.opensrc@micron.com> writes:
>
>> From: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
>>
>> The memory tiers feature allows nodes with similar memory types
>> or performance characteristics to be grouped together in a
>> memory tier. However, there is currently no provision for
>> moving a node from one tier to another on demand.
>>
>> This patch series aims to support node migration between tiers
>> on demand by the sysadmin/root user, using the provided sysfs
>> interface for node migration.
>>
>> To migrate a node to a tier, the corresponding node's sysfs
>> memtier_override is written with the target tier id.
>>
>> Example: Move node2 to memory_tier2 from its default tier (i.e. 4)
>>
>> 1. To check the current memtier of node2:
>>    $ cat /sys/devices/system/node/node2/memtier_override
>>    memory_tier4
>>
>> 2.
>> To migrate node2 to memory_tier2:
>>    $ echo 2 > /sys/devices/system/node/node2/memtier_override
>>    $ cat /sys/devices/system/node/node2/memtier_override
>>    memory_tier2
>>
>> [snip: use cases and version notes, quoted in full earlier in the
>> thread]

> It appears that you ignored my comments for V1 as follows ...
>
> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> Thank you, Huang, Ying, for pointing to this.
>
> https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>
> In the presentation above, the adistance_offsets are per memtype.
> We believe that an adistance_offset per node is more suitable and
> flexible, since we can change it per node.
> If we keep adistance_offset per memtype, then we cannot change it for
> a specific node of a given memtype.

Why do you need to change it for a specific node? Why don't you need
to change it for all nodes of a given memtype?

> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> I guess that you need to move all NUMA nodes with the same
> performance metrics together? If so, that is why we previously
> proposed to place the knob in "memory_type"? (From: Huang, Ying)
>
> Yes, memory_type would group the related memories together as a
> single tier. We should also have the flexibility to move nodes
> between tiers, to address the issues described in the use cases
> above.
>
> https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> This patch provides a way to move a node to the correct tier.
> We observed test setups where DRAM and CXL are put under the same
> tier (memory_tier4). By using this patch, we can move the CXL node
> away from the DRAM-linked tier4 and put it in the desired tier.

Good! Can you give more details? So I can resend the patch with your
supporting data.

--
Best Regards,
Huang, Ying

> Regards,
> Srini
>
>> V1 : Introduced adistance_offset sysfs.
>>
>> [snip: patch list and diffstat, quoted in full earlier in the
>> thread]
Hi Huang, Ying,

My apologies for the wrong mail reply format; my mail client settings
got changed on my PC. Please find my comments below, inline.

Regards,
Srini

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Monday, December 18, 2023 11:26 AM
> Subject: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory
> tiers
>
> Gregory Price <gregory.price@memverge.com> writes:
>
> > On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
> >> <sthanneeru.opensrc@micron.com> writes:
> >>
> >> > =============
> >> > Version Notes:
> >> >
> >> > V2 : Changed interface to memtier_override from
> >> >      adistance_offset. memtier_override was recommended by
> >> >      1. John Groves <john@jagalactic.com>
> >> >      2. Ravi Shankar <ravis.opensrc@micron.com>
> >> >      3. Brice Goglin <Brice.Goglin@inria.fr>
> >>
> >> It appears that you ignored my comments for V1 as follows ...
> >> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/

Thank you, Huang, Ying, for pointing to this.
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

In the presentation above, the adistance_offsets are per memtype.
We believe that an adistance_offset per node is more suitable and
flexible, since we can change it per node. If we keep adistance_offset
per memtype, then we cannot change it for a specific node of a given
memtype.

> >> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/

Yes, memory_type would group the related memories together as a single
tier. We should also have the flexibility to move nodes between tiers,
to address the issues described in the use cases above.
> >> https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/

This patch provides a way to move a node to the correct tier.
We observed test setups where DRAM and CXL are put under the same tier
(memory_tier4). By using this patch, we can move the CXL node away
from the DRAM-linked tier (memory_tier4) and put it in the desired
tier.

> > Not speaking for the group, just chiming in because I'd discussed
> > it with them.
> >
> > "Memory Type" is a bit nebulous. Is a Micron Type-3 with
> > performance X and an SK Hynix Type-3 with performance Y a
> > "different type", or are they the "same type" given that they're
> > both Type 3 backed by some form of DDR? Is socket placement of
> > those devices relevant for determining "type"? Is whether they are
> > behind a switch relevant? "Type" is frustrating when everything
> > we're talking about managing is Type-3 with different performance.
> >
> > A concrete example: to the system, a Multi-Headed Single Logical
> > Device (MH-SLD) looks exactly the same as a standard SLD. I may
> > want to have some combination of local memory expansion devices on
> > the majority of my expansion slots, but reserve one slot on each
> > socket for a connection to the MH-SLD. As of right now there is no
> > good way to differentiate the devices in terms of "type" - and
> > even if you had that, the tiering system would still lump them
> > together.
> > Similarly, an initial run of switches may or may not allow
> > enumeration of the devices behind it (depends on the
> > configuration), so you may end up with a static NUMA node that
> > "looks like" another SLD - despite it being some definition of
> > GFAM. Does the number of hops matter in determining "type"?
>
> In the original design, the memory devices of the same memory type
> are managed by the same device driver, linked with the system in the
> same way (including switches), and built with the same media. So the
> performance is the same too. And, like memory tiers, memory types
> are orthogonal to sockets. Do you think the definition itself is
> clear enough?
>
> I admit "memory type" is a confusing name. Do you have a better
> suggestion?
>
> > So I really don't think "type" is useful for determining tier
> > placement.
> >
> > As of right now, the system lumps DRAM nodes into one tier, and
> > pretty much everything else into "the other tier". To me, this
> > patch set is an initial pass meant to allow user control over tier
> > composition while the internal mechanism is sussed out and the
> > environment develops.
>
> The patchset to identify the performance of memory devices and put
> them into proper "memory types" and memory tiers via HMAT was merged
> in v6.7-rc1:
>
> 07a8bdd4120c (memory tiering: add abstract distance calculation
>               algorithms management, 2023-09-26)
> d0376aac59a1 (acpi, hmat: refactor
>               hmat_register_target_initiators(), 2023-09-26)
> 3718c02dbd4c (acpi, hmat: calculate abstract distance with HMAT,
>               2023-09-26)
> 6bc2cfdf82d5 (dax, kmem: calculate abstract distance with general
>               interface, 2023-09-26)
>
> > In general, a release valve that lets you redefine tiers is very
> > welcome for testing and validation of different setups while the
> > industry evolves.
> >
> > Just my two cents.
>
> --
> Best Regards,
> Huang, Ying
Srinivasulu Thanneeru <sthanneeru@micron.com> writes:

> Hi Huang, Ying,
>
> My apologies for the wrong mail reply format; my mail client
> settings got changed on my PC. Please find my comments below,
> inline.
>
> Regards,
> Srini
>
>> -----Original Message-----
>> From: Huang, Ying <ying.huang@intel.com>
>> Sent: Monday, December 18, 2023 11:26 AM
>> Subject: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory
>> tiers
>>
>> Gregory Price <gregory.price@memverge.com> writes:
>>
>> > On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
>> >> <sthanneeru.opensrc@micron.com> writes:
>> >>
>> >> > =============
>> >> > Version Notes:
>> >> >
>> >> > V2 : Changed interface to memtier_override from
>> >> >      adistance_offset. memtier_override was recommended by
>> >> >      1. John Groves <john@jagalactic.com>
>> >> >      2. Ravi Shankar <ravis.opensrc@micron.com>
>> >> >      3. Brice Goglin <Brice.Goglin@inria.fr>
>> >>
>> >> It appears that you ignored my comments for V1 as follows ...
>> >> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> Thank you, Huang, Ying, for pointing to this.
> https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>
> In the presentation above, the adistance_offsets are per memtype.
> We believe that an adistance_offset per node is more suitable and
> flexible, since we can change it per node. If we keep
> adistance_offset per memtype, then we cannot change it for a specific
> node of a given memtype.
>
>> >> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> Yes, memory_type would group the related memories together as a
> single tier. We should also have the flexibility to move nodes
> between tiers, to address the issues described in the use cases
> above.

We don't pursue absolute flexibility; we add only the necessary
flexibility. Why do you need this kind of flexibility? Can you provide
some use cases where a memory_type based "adistance_offset" doesn't
work?

--
Best Regards,
Huang, Ying
> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Wednesday, January 3, 2024 11:38 AM
> Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between
> memory tiers
>
> Srinivasulu Thanneeru <sthanneeru@micron.com> writes:
>
> > Hi Huang, Ying,
> >
> > My apologies for the wrong mail reply format; my mail client
> > settings got changed on my PC. Please find my comments below,
> > inline.
> >
> > Regards,
> > Srini
> >
> >> -----Original Message-----
> >> From: Huang, Ying <ying.huang@intel.com>
> >> Sent: Monday, December 18, 2023 11:26 AM
> >> Subject: [EXT] Re: [RFC PATCH v2 0/2] Node migration between
> >> memory tiers
> >>
> >> Gregory Price <gregory.price@memverge.com> writes:
> >>
> >> > On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
> >> >> <sthanneeru.opensrc@micron.com> writes:
> >> >>
> >> >> > =============
> >> >> > Version Notes:
> >> >> >
> >> >> > V2 : Changed interface to memtier_override from
> >> >> >      adistance_offset. memtier_override was recommended by
> >> >> >      1. John Groves <john@jagalactic.com>
> >> >> >      2. Ravi Shankar <ravis.opensrc@micron.com>
> >> >> >      3. Brice Goglin <Brice.Goglin@inria.fr>
> >> >>
> >> >> It appears that you ignored my comments for V1 as follows ...
> >> >> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
> >
> > Thank you, Huang, Ying, for pointing to this.
> > https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
> >
> > In the presentation above, the adistance_offsets are per memtype.
> > We believe that an adistance_offset per node is more suitable and
> > flexible, since we can change it per node. If we keep
> > adistance_offset per memtype, then we cannot change it for a
> > specific node of a given memtype.
> >> >> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
> >
> > Yes, memory_type would group the related memories together as a
> > single tier. We should also have the flexibility to move nodes
> > between tiers, to address the issues described in the use cases
> > above.
>
> We don't pursue absolute flexibility; we add only the necessary
> flexibility. Why do you need this kind of flexibility? Can you
> provide some use cases where a memory_type based "adistance_offset"
> doesn't work?

A memory_type based "adistance_offset" (under
/sys/devices/virtual/memory_type/) would provide a way to move all
nodes of the same memory_type (e.g. all CXL nodes) to a different
tier, whereas /sys/devices/system/node/node2/memtier_override provides
a way to migrate a single node from one tier to another.

Consider a case where we would like to move two CXL nodes into two
different tiers in the future. That is why I thought it would be good
to have the flexibility at the node level instead of at the
memory_type level.

> --
> Best Regards,
> Huang, Ying
Srinivasulu Thanneeru <sthanneeru@micron.com> writes:

>> -----Original Message-----
>> From: Huang, Ying <ying.huang@intel.com>
>> Sent: Wednesday, January 3, 2024 11:38 AM
>> To: Srinivasulu Thanneeru <sthanneeru@micron.com>
>> Cc: gregory.price <gregory.price@memverge.com>; Srinivasulu Opensrc
>> <sthanneeru.opensrc@micron.com>; linux-cxl@vger.kernel.org;
>> linux-mm@kvack.org; aneesh.kumar@linux.ibm.com; dan.j.williams@intel.com;
>> mhocko@suse.com; tj@kernel.org; john@jagalactic.com; Eishan Mirakhur
>> <emirakhur@micron.com>; Vinicius Tavares Petrucci
>> <vtavarespetr@micron.com>; Ravis OpenSrc <Ravis.OpenSrc@micron.com>;
>> Jonathan.Cameron@huawei.com; linux-kernel@vger.kernel.org; Johannes
>> Weiner <hannes@cmpxchg.org>; Wei Xu <weixugc@google.com>
>> Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory
>> tiers
>>
>> Srinivasulu Thanneeru <sthanneeru@micron.com> writes:
>>
>> > Hi Huang, Ying,
>> >
>> > My apologies for the wrong mail reply format; my mail client settings
>> > got changed on my PC.
>> > Please find comments below inline.
>> >
>> > Regards,
>> > Srini
>> >
>> >> -----Original Message-----
>> >> From: Huang, Ying <ying.huang@intel.com>
>> >> Sent: Monday, December 18, 2023 11:26 AM
>> >> Subject: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory
>> >> tiers
>> >>
>> >> Gregory Price <gregory.price@memverge.com> writes:
>> >>
>> >> > On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
>> >> >> <sthanneeru.opensrc@micron.com> writes:
>> >> >>
>> >> >> > =============
>> >> >> > Version Notes:
>> >> >> >
>> >> >> > V2 : Changed interface to memtier_override from adistance_offset.
>> >> >> >      memtier_override was recommended by
>> >> >> >      1. John Groves <john@jagalactic.com>
>> >> >> >      2. Ravi Shankar <ravis.opensrc@micron.com>
>> >> >> >      3. Brice Goglin <Brice.Goglin@inria.fr>
>> >> >>
>> >> >> It appears that you ignored my comments for V1 as follows ...
>> >> >>
>> >> >> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> >
>> > Thank you, Huang, Ying for pointing to this.
>> > https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>> >
>> > In the presentation above, the adistance_offsets are per memtype.
>> > We believe that an adistance_offset per node is more suitable and
>> > flexible, since we can change it per node.  If we keep adistance_offset
>> > per memtype, then we cannot change it for a specific node of a given
>> > memtype.
>> >
>> >> >> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> >
>> > Yes, memory_type would be grouping the related memories together as a
>> > single tier.
>> > We should also have the flexibility to move nodes between tiers, to
>> > address the issues described in the use cases above.
>>
>> We don't pursue absolute flexibility.  We add necessary flexibility
>> only.  Why do you need this kind of flexibility?  Can you provide some
>> use cases where memory_type based "adistance_offset" doesn't work?
>
> - /sys/devices/virtual/memory_type/memory_type/adistance_offset
>   memory_type based "adistance_offset" will provide a way to move all
>   nodes of the same memory_type (e.g. all CXL nodes) to a different tier.

We will not put the CXL nodes with different performance metrics in one
memory_type.  If so, do you still need to move one of them?

> Whereas /sys/devices/system/node/node2/memtier_override provides a way
> to migrate a node from one tier to another.
> Considering a case where we would like to move two CXL nodes into two
> different tiers in future,
> I thought it would be good to have flexibility at the node level instead
> of at the memory_type level.

--
Best Regards,
Huang, Ying
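To make the disagreement above concrete, here is a toy model (not kernel code; all names are invented for illustration) of the two knobs being debated: a memory_type-wide adistance_offset moves every node of that type at once, while a per-node memtier_override pins exactly one node to an explicit tier.

```python
# Toy model contrasting the two proposed knobs.  NOT kernel code; the
# dictionaries and function below are hypothetical illustrations only.
CHUNK = 1024
node_type = {0: "dram", 1: "dram", 2: "cxl", 3: "cxl"}
type_adist = {"dram": 4 * CHUNK + CHUNK // 2, "cxl": 5 * CHUNK}
type_offset = {"dram": 0, "cxl": 0}   # memory_type based adistance_offset
override_tier = {}                    # node id -> explicit tier (memtier_override)

def tier_of(node: int) -> int:
    if node in override_tier:         # per-node override wins
        return override_tier[node]
    t = node_type[node]
    return (type_adist[t] + type_offset[t]) >> 10

type_offset["cxl"] = CHUNK    # moves BOTH CXL nodes (2 and 3) one tier down
override_tier[2] = 2          # moves only node 2, as in the cover letter

print([tier_of(n) for n in sorted(node_type)])   # [4, 4, 2, 6]
```

With only the type-level knob, nodes 2 and 3 must move together; the per-node override is what lets them land in different tiers, which is exactly Srini's use case and exactly the flexibility Huang questions the need for.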
> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Wednesday, January 3, 2024 2:00 PM
> Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory
> tiers
>
> Srinivasulu Thanneeru <sthanneeru@micron.com> writes:
>
> [...]
>
> >> We don't pursue absolute flexibility.  We add necessary flexibility
> >> only.  Why do you need this kind of flexibility?  Can you provide some
> >> use cases where memory_type based "adistance_offset" doesn't work?
> >
> > - /sys/devices/virtual/memory_type/memory_type/adistance_offset
> >   memory_type based "adistance_offset" will provide a way to move all
> >   nodes of the same memory_type (e.g. all CXL nodes) to a different
> >   tier.
>
> We will not put the CXL nodes with different performance metrics in one
> memory_type.  If so, do you still need to move one of them?

From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
abstract_distance_offset: override by users to deal with firmware issues.

Say firmware can configure a CXL node into the wrong tier; similarly, it
may also configure all CXL nodes into a single memtype, hence all these
nodes can fall into a single wrong tier.
In this case, wouldn't a per-node adistance_offset be good to have?

--
Srini

> > Whereas /sys/devices/system/node/node2/memtier_override provides a way
> > to migrate a node from one tier to another.
> > Considering a case where we would like to move two CXL nodes into two
> > different tiers in future.
> > So, I thought it would be good to have flexibility at node level
> > instead of at memory_type.
>
> --
> Best Regards,
> Huang, Ying
Srinivasulu Thanneeru <sthanneeru@micron.com> writes:

>> -----Original Message-----
>> From: Huang, Ying <ying.huang@intel.com>
>> Sent: Wednesday, January 3, 2024 2:00 PM
>> Subject: Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory
>> tiers
>>
>> [...]
>>
>> >> We don't pursue absolute flexibility.  We add necessary flexibility
>> >> only.  Why do you need this kind of flexibility?  Can you provide some
>> >> use cases where memory_type based "adistance_offset" doesn't work?
>> >
>> > - /sys/devices/virtual/memory_type/memory_type/adistance_offset
>> >   memory_type based "adistance_offset" will provide a way to move all
>> >   nodes of the same memory_type (e.g. all CXL nodes) to a different
>> >   tier.
>>
>> We will not put the CXL nodes with different performance metrics in one
>> memory_type.  If so, do you still need to move one of them?
>
> From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
> abstract_distance_offset: override by users to deal with firmware issue.
>
> say firmware can configure the cxl node into wrong tiers, similar to
> that it may also configure all cxl nodes into single memtype, hence
> all these nodes can fall into a single wrong tier.
> In this case, per node adistance_offset would be good to have ?

I think that it's better to fix the erroneous firmware if possible.  And
these are only theoretical, not practical issues.  Do you have some
practical issues?

I understand that users may want to move nodes between memory tiers for
different policy choices.  For that, a memory_type based adistance_offset
should be good.

> --
> Srini
>
>> > Whereas /sys/devices/system/node/node2/memtier_override provides a
>> > way to migrate a node from one tier to another.
>> > Considering a case where we would like to move two cxl nodes into two
>> > different tiers in future.
>> > So, I thought it would be good to have flexibility at node level
>> > instead of at memory_type.

--
Best Regards,
Huang, Ying
On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> >
> > From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
> > abstract_distance_offset: override by users to deal with firmware issue.
> >
> > say firmware can configure the cxl node into wrong tiers, similar to
> > that it may also configure all cxl nodes into single memtype, hence
> > all these nodes can fall into a single wrong tier.
> > In this case, per node adistance_offset would be good to have ?
>
> I think that it's better to fix the error firmware if possible.  And
> these are only theoretical, not practical issues.  Do you have some
> practical issues?
>
> I understand that users may want to move nodes between memory tiers for
> different policy choices.  For that, memory_type based adistance_offset
> should be good.
>

There's actually an affirmative case for changing memory tiering to allow
either movement of nodes between tiers, or at least basing placement on
HMAT information.  Preferably, membership would be changeable to allow
hotplug/DCD to be managed (there's no guarantee that the memory passed
through will always be what HMAT says on initial boot).

https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/

This group wants to enable passing CXL memory through to KVM/QEMU
(i.e. host CXL expander memory passed through to the guest), and
allow the guest to apply memory tiering.

There are multiple issues with this, presently:

1. The QEMU CXL virtual device is not, and probably never will be,
   performant enough for commodity-class virtualization.  The reason is
   that the virtual CXL device is built off the I/O virtualization stack,
   which treats memory accesses as I/O accesses.

   KVM also seems incompatible with the design of the CXL memory device
   in general, but this problem may or may not be a blocker.

   As a result, access to a virtual CXL memory device leads to QEMU
   crawling to a halt - and this is unlikely to change.

   There is presently no good way forward to create a performant virtual
   CXL device in QEMU.  This means the memory tiering component in the
   kernel is functionally useless for virtual CXL memory, because...

2. When passing memory through as an explicit NUMA node, but not as
   part of a CXL memory device, the nodes are lumped together in the
   DRAM tier.

None of this has to do with firmware.

Memory-type is an awful way of denoting membership of a tier, but we
have HMAT information that can be passed through via QEMU:

-object memory-backend-ram,size=4G,id=ram-node0 \
-object memory-backend-ram,size=4G,id=ram-node1 \
-numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
-numa node,initiator=0,nodeid=1,memdev=ram-node1 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880

Not only would it be nice if we could change tier membership based on
this data, it's realistically the only way to allow guests to accomplish
memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.

~Gregory
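As a rough sketch of how HMAT performance numbers like the ones in that QEMU command line could be scaled into abstract distances relative to default DRAM, in the spirit of mt_perf_to_adistance() in mm/memory-tiers.c (formula paraphrased from memory; the kernel version also folds in write latency/bandwidth, so treat the details as assumptions):

```python
# Paraphrased (NOT verbatim) scaling of HMAT performance into an abstract
# distance relative to default DRAM: higher latency and lower bandwidth
# than DRAM both enlarge the distance.
MEMTIER_ADISTANCE_DRAM = 4 * 1024 + 512   # constant value assumed

def perf_to_adistance(latency: int, bandwidth: int,
                      dram_latency: int, dram_bandwidth: int) -> int:
    return MEMTIER_ADISTANCE_DRAM * latency * dram_bandwidth \
        // (dram_latency * bandwidth)

# Plugging in the hmat-lb numbers from the QEMU command line above:
dram = perf_to_adistance(10, 10485760, 10, 10485760)  # node0, DRAM-like
slow = perf_to_adistance(20, 5242880, 10, 10485760)   # node1, 2x latency, half bw
print(dram >> 10, slow >> 10)   # prints: 4 18
```

Under this scaling the two guest nodes separate cleanly into different tiers, which is the behaviour Gregory is asking tier membership to follow.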
Gregory Price <gregory.price@memverge.com> writes:

> On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
>> >
>> > From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>> > abstract_distance_offset: override by users to deal with firmware issue.
>> >
>> > say firmware can configure the cxl node into wrong tiers, similar to
>> > that it may also configure all cxl nodes into single memtype, hence
>> > all these nodes can fall into a single wrong tier.
>> > In this case, per node adistance_offset would be good to have ?
>>
>> I think that it's better to fix the error firmware if possible.  And
>> these are only theoretical, not practical issues.  Do you have some
>> practical issues?
>>
>> I understand that users may want to move nodes between memory tiers for
>> different policy choices.  For that, memory_type based adistance_offset
>> should be good.
>
> There's actually an affirmative case to change memory tiering to allow
> either movement of nodes between tiers, or at least base placement on
> HMAT information.  Preferably, membership would be changeable to allow
> hotplug/DCD to be managed (there's no guarantee that the memory passed
> through will always be what HMAT says on initial boot).

IIUC, from Jonathan Cameron as below, the performance of memory
shouldn't change even for DCD devices.

https://lore.kernel.org/linux-mm/20231103141636.000007e4@Huawei.com/

It's possible for the performance of a NUMA node to change if we
hot-remove a memory device and then hot-add a different memory device.
It's hoped that the CDAT changes too.

So, all in all, HMAT + CDAT can help us to put the memory device in the
appropriate memory tiers.  Now, we have HMAT support upstream.  We will
be working on CDAT support.

--
Best Regards,
Huang, Ying

> https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
>
> This group wants to enable passing CXL memory through to KVM/QEMU
> (i.e. host CXL expander memory passed through to the guest), and
> allow the guest to apply memory tiering.
>
> There are multiple issues with this, presently:
>
> 1. The QEMU CXL virtual device is not and probably never will be
>    performant enough to be a commodity class virtualization.  The
>    reason is that the virtual CXL device is built off the I/O
>    virtualization stack, which treats memory accesses as I/O accesses.
>
>    KVM also seems incompatible with the design of the CXL memory device
>    in general, but this problem may or may not be a blocker.
>
>    As a result, access to virtual CXL memory device leads to QEMU
>    crawling to a halt - and this is unlikely to change.
>
>    There is presently no good way forward to create a performant virtual
>    CXL device in QEMU.  This means the memory tiering component in the
>    kernel is functionally useless for virtual CXL memory, because...
>
> 2. When passing memory through as an explicit NUMA node, but not as
>    part of a CXL memory device, the nodes are lumped together in the
>    DRAM tier.
>
> None of this has to do with firmware.
>
> Memory-type is an awful way of denoting membership of a tier, but we
> have HMAT information that can be passed through via QEMU:
>
> -object memory-backend-ram,size=4G,id=ram-node0 \
> [...]
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>
> Not only would it be nice if we could change tier membership based on
> this data, it's realistically the only way to allow guests to accomplish
> memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>
> ~Gregory
On Tue, 09 Jan 2024 11:41:11 +0800 "Huang, Ying" <ying.huang@intel.com> wrote: > Gregory Price <gregory.price@memverge.com> writes: > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote: > >> > > >> > From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf > >> > abstract_distance_offset: override by users to deal with firmware issue. > >> > > >> > say firmware can configure the cxl node into wrong tiers, similar to > >> > that it may also configure all cxl nodes into single memtype, hence > >> > all these nodes can fall into a single wrong tier. > >> > In this case, per node adistance_offset would be good to have ? > >> > >> I think that it's better to fix the error firmware if possible. And > >> these are only theoretical, not practical issues. Do you have some > >> practical issues? > >> > >> I understand that users may want to move nodes between memory tiers for > >> different policy choices. For that, memory_type based adistance_offset > >> should be good. > >> > > > > There's actually an affirmative case to change memory tiering to allow > > either movement of nodes between tiers, or at least base placement on > > HMAT information. Preferably, membership would be changable to allow > > hotplug/DCD to be managed (there's no guarantee that the memory passed > > through will always be what HMAT says on initial boot). > > IIUC, from Jonathan Cameron as below, the performance of memory > shouldn't change even for DCD devices. > > https://lore.kernel.org/linux-mm/20231103141636.000007e4@Huawei.com/ > > It's possible to change the performance of a NUMA node changed, if we > hot-remove a memory device, then hot-add another different memory > device. It's hoped that the CDAT changes too. Not supported, but ACPI has _HMA methods to in theory allow changing HMAT values based on firmware notifications... So we 'could' make it work for HMAT based description. 
Ultimately my current thinking is we'll end up emulating CXL type3 devices (hiding topology complexity) and you can update CDAT but IIRC that is only meant to be for degraded situations - so if you want multiple performance regions, CDAT should describe them from the start. > > So, all in all, HMAT + CDAT can help us to put the memory device in > appropriate memory tiers. Now, we have HMAT support in upstream. We > will be working on CDAT support. > > -- > Best Regards, > Huang, Ying > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/ > > > > This group wants to enable passing CXL memory through to KVM/QEMU > > (i.e. host CXL expander memory passed through to the guest), and > > allow the guest to apply memory tiering. > > > > There are multiple issues with this, presently: > > > > 1. The QEMU CXL virtual device is not and probably never will be > > performant enough to be a commodity class virtualization. I'd flex that a bit - we will end up with a solution for virtualization but it isn't the emulation that is there today because it's not possible to emulate some of the topology in a performant manner (interleaving with sub page granularity / interleaving at all (to a lesser degree)). There are ways to do better than we are today, but they start to look like software disaggregated memory setups (think lots of page faults in the host). > > The > > reason is that the virtual CXL device is built off the I/O > > virtualization stack, which treats memory accesses as I/O accesses. That will remain true for complex emulation, but it needn't always be the case. I'm not 100% sure we can make it work but my current thinking is: When decoders are set up: Check if there is any interleaving going on. interleaving happening: Current functionally correct path. no interleaving: More conventional memory access path. 
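That decoder-time branch could be sketched roughly as follows. This is plain C with hypothetical names, not actual QEMU code; it only illustrates the decision described above:

```c
#include <assert.h>

/* Hypothetical sketch of the path selection described above: once the
 * HDM decoders are committed, an interleaved configuration must keep
 * the functionally correct (slow) MMIO emulation path, while a single
 * non-interleaved decoder could instead be backed by a conventional,
 * directly mapped memory region. */
enum access_path {
	PATH_MMIO_EMULATION,	/* current functionally correct path */
	PATH_DIRECT_MAP		/* more conventional memory access path */
};

static enum access_path select_access_path(int interleave_ways)
{
	return interleave_ways > 1 ? PATH_MMIO_EMULATION : PATH_DIRECT_MAP;
}
```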
> > > > KVM also seems incompatible with the design of the CXL memory device > > in general, but this problem may or may not be a blocker. That's true if we are doing fine grained routing but as above we can probably avoid that. > > > > As a result, access to virtual CXL memory device leads to QEMU > > crawling to a halt - and this is unlikely to change. In general yes, but hopefully not for carefully configured cases (the simple one of direct connect single device, no host interleaving for example). > > > > There is presently no good way forward to create a performant virtual > > CXL device in QEMU. This means the memory tiering component in the > > kernel is functionally useless for virtual CXL memory, because... Agreed - nothing there yet and I don't think the question of CXL virtualization in general is anywhere near solved... Maybe emulating a CXL device doesn't make sense, maybe we end up extending virtio-mem instead. Needs some PoC work to flesh this out. (it's about number 3 on my list of stuff to look at this year) > > > > 2. When passing memory through as an explicit NUMA node, but not as > > part of a CXL memory device, the nodes are lumped together in the > > DRAM tier. > > > > None of this has to do with firmware. 
> > > > Memory-type is an awful way of denoting membership of a tier, but we > > have HMAT information that can be passed through via QEMU: > > > > -object memory-backend-ram,size=4G,id=ram-node0 \ > > -object memory-backend-ram,size=4G,id=ram-node1 \ > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \ > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \ > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \ > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \ > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \ > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880 > > > > Not only would it be nice if we could change tier membership based on > > this data, it's realistically the only way to allow guests to accomplish > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest. This I fully agree with. There will be systems with a bunch of normal DDR with different access characteristics irrespective of CXL. Plus, likely HMAT solutions will be used before we get anything more complex in place for CXL. Jonathan p.s. I'd love to see _HMA handling implemented in the kernel... It would blaze a trail for what we will probably need to do for fiddly CXL cases where performance degrades on old devices etc. > > > > ~Gregory >
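To make the connection to tiering concrete: the kernel derives an abstract distance per node, and slower nodes get larger distances. The sketch below is a simplified model, not the kernel's exact formula; it scales an assumed default DRAM distance by the latency ratio and inverse bandwidth ratio, using the numbers from the QEMU command line above:

```c
#include <assert.h>

/* Simplified, illustrative model of turning HMAT access latency and
 * bandwidth into an abstract distance: relative to default DRAM, a
 * node with 2x the latency and half the bandwidth ends up with 4x
 * the distance, and therefore in a lower tier. The constant and the
 * exact scaling are assumptions, not upstream's precise formula. */
#define ADISTANCE_DRAM 576LL	/* assumed default DRAM abstract distance */

static long long hmat_to_adistance(long long latency, long long bandwidth,
				   long long dram_latency,
				   long long dram_bandwidth)
{
	/* scale by latency ratio and inverse bandwidth ratio */
	return ADISTANCE_DRAM * latency / dram_latency
			      * dram_bandwidth / bandwidth;
}
```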
On Tue, Jan 09, 2024 at 11:41:11AM +0800, Huang, Ying wrote: > Gregory Price <gregory.price@memverge.com> writes: > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote: > >> > > >> > From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf > >> > abstract_distance_offset: override by users to deal with firmware issue. > >> > > >> > say firmware can configure the cxl node into wrong tiers, similar to > >> > that it may also configure all cxl nodes into single memtype, hence > >> > all these nodes can fall into a single wrong tier. > >> > In this case, per node adistance_offset would be good to have ? > >> > >> I think that it's better to fix the error firmware if possible. And > >> these are only theoretical, not practical issues. Do you have some > >> practical issues? > >> > >> I understand that users may want to move nodes between memory tiers for > >> different policy choices. For that, memory_type based adistance_offset > >> should be good. > >> > > > > There's actually an affirmative case to change memory tiering to allow > > either movement of nodes between tiers, or at least base placement on > > HMAT information. Preferably, membership would be changable to allow > > hotplug/DCD to be managed (there's no guarantee that the memory passed > > through will always be what HMAT says on initial boot). > > IIUC, from Jonathan Cameron as below, the performance of memory > shouldn't change even for DCD devices. > > https://lore.kernel.org/linux-mm/20231103141636.000007e4@Huawei.com/ > > It's possible to change the performance of a NUMA node changed, if we > hot-remove a memory device, then hot-add another different memory > device. It's hoped that the CDAT changes too. > > So, all in all, HMAT + CDAT can help us to put the memory device in > appropriate memory tiers. Now, we have HMAT support in upstream. We > will working on CDAT support. 
That should be sufficient assuming the `-numa hmat-lb` setting in QEMU does the right thing. I suppose we also need to figure out a way to set CDAT information for a memory device that isn't related to CXL (from the perspective of the guest). I'll take a look if I get cycles. ~Gregory
On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote: > On Tue, 09 Jan 2024 11:41:11 +0800 > "Huang, Ying" <ying.huang@intel.com> wrote: > > Gregory Price <gregory.price@memverge.com> writes: > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote: > > It's possible to change the performance of a NUMA node changed, if we > > hot-remove a memory device, then hot-add another different memory > > device. It's hoped that the CDAT changes too. > > Not supported, but ACPI has _HMA methods to in theory allow changing > HMAT values based on firmware notifications... So we 'could' make > it work for HMAT based description. > > Ultimately my current thinking is we'll end up emulating CXL type3 > devices (hiding topology complexity) and you can update CDAT but > IIRC that is only meant to be for degraded situations - so if you > want multiple performance regions, CDAT should describe them form the start. > That was my thought. I don't think it's particularly *realistic* for HMAT/CDAT values to change at runtime, but I can imagine a case where it could be valuable. > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/ > > > > > > This group wants to enable passing CXL memory through to KVM/QEMU > > > (i.e. host CXL expander memory passed through to the guest), and > > > allow the guest to apply memory tiering. > > > > > > There are multiple issues with this, presently: > > > > > > 1. The QEMU CXL virtual device is not and probably never will be > > > performant enough to be a commodity class virtualization. > > I'd flex that a bit - we will end up with a solution for virtualization but > it isn't the emulation that is there today because it's not possible to > emulate some of the topology in a peformant manner (interleaving with sub > page granularity / interleaving at all (to a lesser degree)). 
There are > ways to do better than we are today, but they start to look like > software dissagregated memory setups (think lots of page faults in the host). > Agreed, the emulated device as-is can't be the virtualization device, but it doesn't mean it can't be the basis for it. My thought is, if you want to pass host CXL *memory* through to the guest, you don't actually care to pass CXL *control* through to the guest. That control lies pretty squarely with the host/hypervisor. So, at least in theory, you can just cut the type3 device out of the QEMU configuration entirely and just pass it through as a distinct numa node with specific hmat qualities. Barring that, if we must go through the type3 device, the question is how difficult would it be to just make a stripped down type3 device to provide the informational components, but hack off anything topology/interleave related? Then you just do direct passthrough as you described below. qemu/kvm would report errors if you tried to touch the naughty bits. The second question is... is that device "compliant" or does it need super special handling from the kernel driver :D? If what i described is not "compliant", then it's probably a bad idea, and KVM/QEMU should just hide the CXL device entirely from the guest (for this use case) and just pass the memory through as a numa node. Which gets us back to: The memory-tiering component needs a way to place nodes in different tiers based on HMAT/CDAT/User Whim. All three of those seem like totally valid ways to go about it. > > > > > > 2. When passing memory through as an explicit NUMA node, but not as > > > part of a CXL memory device, the nodes are lumped together in the > > > DRAM tier. > > > > > > None of this has to do with firmware. 
> > > > > > Memory-type is an awful way of denoting membership of a tier, but we > > > have HMAT information that can be passed through via QEMU: > > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \ > > > -object memory-backend-ram,size=4G,id=ram-node1 \ > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \ > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \ > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \ > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \ > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \ > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880 > > > > > > Not only would it be nice if we could change tier membership based on > > > this data, it's realistically the only way to allow guests to accomplish > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest. > > This I fully agree with. There will be systems with a bunch of normal DDR with different > access characteristics irrespective of CXL. + likely HMAT solutions will be used > before we get anything more complex in place for CXL. > Had not even considered this, but that's completely accurate as well. And more discretely: What of devices that don't provide HMAT/CDAT? That isn't necessarily a violation of any standard. There probably could be a release valve for us to still make those devices useful. The concern I have with not implementing a movement mechanism *at all* is that a one-size-fits-all initial-placement heuristic feels gross when we're, at least ideologically, moving toward "software defined memory". Personally I think the movement mechanism is a good idea that gets folks where they're going sooner, and it doesn't hurt anything by existing. We can change the initial placement mechanism too. </2cents> ~Gregory
On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <gregory.price@memverge.com> wrote: > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote: > > On Tue, 09 Jan 2024 11:41:11 +0800 > > "Huang, Ying" <ying.huang@intel.com> wrote: > > > Gregory Price <gregory.price@memverge.com> writes: > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote: > > > It's possible to change the performance of a NUMA node changed, if we > > > hot-remove a memory device, then hot-add another different memory > > > device. It's hoped that the CDAT changes too. > > > > Not supported, but ACPI has _HMA methods to in theory allow changing > > HMAT values based on firmware notifications... So we 'could' make > > it work for HMAT based description. > > > > Ultimately my current thinking is we'll end up emulating CXL type3 > > devices (hiding topology complexity) and you can update CDAT but > > IIRC that is only meant to be for degraded situations - so if you > > want multiple performance regions, CDAT should describe them form the start. > > > > That was my thought. I don't think it's particularly *realistic* for > HMAT/CDAT values to change at runtime, but I can imagine a case where > it could be valuable. > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/ > > > > > > > > This group wants to enable passing CXL memory through to KVM/QEMU > > > > (i.e. host CXL expander memory passed through to the guest), and > > > > allow the guest to apply memory tiering. > > > > > > > > There are multiple issues with this, presently: > > > > > > > > 1. The QEMU CXL virtual device is not and probably never will be > > > > performant enough to be a commodity class virtualization. 
> > > > I'd flex that a bit - we will end up with a solution for virtualization but > > it isn't the emulation that is there today because it's not possible to > > emulate some of the topology in a peformant manner (interleaving with sub > > page granularity / interleaving at all (to a lesser degree)). There are > > ways to do better than we are today, but they start to look like > > software dissagregated memory setups (think lots of page faults in the host). > > > > Agreed, the emulated device as-is can't be the virtualization device, > but it doesn't mean it can't be the basis for it. > > My thought is, if you want to pass host CXL *memory* through to the > guest, you don't actually care to pass CXL *control* through to the > guest. That control lies pretty squarely with the host/hypervisor. > > So, at least in theory, you can just cut the type3 device out of the > QEMU configuration entirely and just pass it through as a distinct numa > node with specific hmat qualities. > > Barring that, if we must go through the type3 device, the question is > how difficult would it be to just make a stripped down type3 device > to provide the informational components, but hack off anything > topology/interleave related? Then you just do direct passthrough as you > described below. > > qemu/kvm would report errors if you tried to touch the naughty bits. > > The second question is... is that device "compliant" or does it need > super special handling from the kernel driver :D? If what i described > is not "compliant", then it's probably a bad idea, and KVM/QEMU should > just hide the CXL device entirely from the guest (for this use case) > and just pass the memory through as a numa node. > > Which gets us back to: The memory-tiering component needs a way to > place nodes in different tiers based on HMAT/CDAT/User Whim. All three > of those seem like totally valid ways to go about it. > > > > > > > > > 2. 
When passing memory through as an explicit NUMA node, but not as > > > > part of a CXL memory device, the nodes are lumped together in the > > > > DRAM tier. > > > > > > > > None of this has to do with firmware. > > > > > > > > Memory-type is an awful way of denoting membership of a tier, but we > > > > have HMAT information that can be passed through via QEMU: > > > > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \ > > > > -object memory-backend-ram,size=4G,id=ram-node1 \ > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \ > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \ > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \ > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \ > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \ > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880 > > > > > > > > Not only would it be nice if we could change tier membership based on > > > > this data, it's realistically the only way to allow guests to accomplish > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest. > > > > This I fully agree with. There will be systems with a bunch of normal DDR with different > > access characteristics irrespective of CXL. + likely HMAT solutions will be used > > before we get anything more complex in place for CXL. > > > > Had not even considered this, but that's completely accurate as well. > > And more discretely: What of devices that don't provide HMAT/CDAT? That > isn't necessarily a violation of any standard. There probably could be > a release valve for us to still make those devices useful. 
> > The concern I have with not implementing a movement mechanism *at all* > is that a one-size-fits-all initial-placement heuristic feels gross > when we're, at least ideologically, moving toward "software defined memory". > > Personally I think the movement mechanism is a good idea that gets folks > where they're going sooner, and it doesn't hurt anything by existing. We > can change the initial placement mechanism too. I think providing users a way to "FIX" the memory tiering is a backup option. Given that DDRs with different access characteristics provide the relevant CDAT/HMAT information, the kernel should be able to establish memory tiering correctly at boot. The current memory tiering code has two paths: 1) memory_tier_init() iterates through all boot-onlined memory nodes. All of them are assumed to be fast tier (adistance MEMTIER_ADISTANCE_DRAM is used). 2) dev_dax_kmem_probe() iterates through all devdax-controlled memory nodes. This is where the kernel reads the memory attributes from HMAT and places the memory nodes in the correct tier (devdax-controlled CXL, pmem, etc.). If we want DDRs with different memory characteristics to be put into the correct tier (as in the guest VM memory tiering case), we probably need a third path that iterates the boot-onlined memory nodes and can also read their memory attributes. I don't think we can do that in 1) because the ACPI subsystem is not yet initialized at that point. > > </2cents> > > ~Gregory
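For reference, the grouping step that any of these paths feeds into can be modeled very simply: abstract distances fall into fixed-width chunks, and each chunk is one memory tier, which is why the default DRAM distance lands in memory_tier4. A minimal sketch, assuming the upstream defaults of 128-wide chunks and a DRAM distance of 576:

```c
#include <assert.h>

/* Minimal sketch of tier grouping, assuming upstream defaults:
 * abstract distances are grouped into 128-wide chunks and each chunk
 * maps to one memory tier. A hypothetical "third path" would only
 * need to hand boot-onlined DDR nodes a perf-derived adistance here
 * instead of the blanket DRAM default. */
#define CHUNK_BITS 7		/* 128-wide adistance chunks (assumed) */
#define ADISTANCE_DRAM 576	/* assumed default DRAM abstract distance */

static int tier_id(int adistance)
{
	return adistance >> CHUNK_BITS;
}
```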
Gregory Price <gregory.price@memverge.com> writes: > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote: >> On Tue, 09 Jan 2024 11:41:11 +0800 >> "Huang, Ying" <ying.huang@intel.com> wrote: >> > Gregory Price <gregory.price@memverge.com> writes: >> > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote: >> > It's possible to change the performance of a NUMA node changed, if we >> > hot-remove a memory device, then hot-add another different memory >> > device. It's hoped that the CDAT changes too. >> >> Not supported, but ACPI has _HMA methods to in theory allow changing >> HMAT values based on firmware notifications... So we 'could' make >> it work for HMAT based description. >> >> Ultimately my current thinking is we'll end up emulating CXL type3 >> devices (hiding topology complexity) and you can update CDAT but >> IIRC that is only meant to be for degraded situations - so if you >> want multiple performance regions, CDAT should describe them form the start. >> > > That was my thought. I don't think it's particularly *realistic* for > HMAT/CDAT values to change at runtime, but I can imagine a case where > it could be valuable. > >> > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/ >> > > >> > > This group wants to enable passing CXL memory through to KVM/QEMU >> > > (i.e. host CXL expander memory passed through to the guest), and >> > > allow the guest to apply memory tiering. >> > > >> > > There are multiple issues with this, presently: >> > > >> > > 1. The QEMU CXL virtual device is not and probably never will be >> > > performant enough to be a commodity class virtualization. >> >> I'd flex that a bit - we will end up with a solution for virtualization but >> it isn't the emulation that is there today because it's not possible to >> emulate some of the topology in a peformant manner (interleaving with sub >> page granularity / interleaving at all (to a lesser degree)). 
There are >> ways to do better than we are today, but they start to look like >> software dissagregated memory setups (think lots of page faults in the host). >> > > Agreed, the emulated device as-is can't be the virtualization device, > but it doesn't mean it can't be the basis for it. > > My thought is, if you want to pass host CXL *memory* through to the > guest, you don't actually care to pass CXL *control* through to the > guest. That control lies pretty squarely with the host/hypervisor. > > So, at least in theory, you can just cut the type3 device out of the > QEMU configuration entirely and just pass it through as a distinct numa > node with specific hmat qualities. > > Barring that, if we must go through the type3 device, the question is > how difficult would it be to just make a stripped down type3 device > to provide the informational components, but hack off anything > topology/interleave related? Then you just do direct passthrough as you > described below. > > qemu/kvm would report errors if you tried to touch the naughty bits. > > The second question is... is that device "compliant" or does it need > super special handling from the kernel driver :D? If what i described > is not "compliant", then it's probably a bad idea, and KVM/QEMU should > just hide the CXL device entirely from the guest (for this use case) > and just pass the memory through as a numa node. > > Which gets us back to: The memory-tiering component needs a way to > place nodes in different tiers based on HMAT/CDAT/User Whim. All three > of those seem like totally valid ways to go about it. > >> > > >> > > 2. When passing memory through as an explicit NUMA node, but not as >> > > part of a CXL memory device, the nodes are lumped together in the >> > > DRAM tier. >> > > >> > > None of this has to do with firmware. 
>> > > >> > > Memory-type is an awful way of denoting membership of a tier, but we >> > > have HMAT information that can be passed through via QEMU: >> > > >> > > -object memory-backend-ram,size=4G,id=ram-node0 \ >> > > -object memory-backend-ram,size=4G,id=ram-node1 \ >> > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \ >> > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \ >> > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \ >> > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \ >> > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \ >> > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880 >> > > >> > > Not only would it be nice if we could change tier membership based on >> > > this data, it's realistically the only way to allow guests to accomplish >> > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest. >> >> This I fully agree with. There will be systems with a bunch of normal DDR with different >> access characteristics irrespective of CXL. + likely HMAT solutions will be used >> before we get anything more complex in place for CXL. >> > > Had not even considered this, but that's completely accurate as well. > > And more discretely: What of devices that don't provide HMAT/CDAT? That > isn't necessarily a violation of any standard. There probably could be > a release valve for us to still make those devices useful. > > The concern I have with not implementing a movement mechanism *at all* > is that a one-size-fits-all initial-placement heuristic feels gross > when we're, at least ideologically, moving toward "software defined memory". > > Personally I think the movement mechanism is a good idea that gets folks > where they're going sooner, and it doesn't hurt anything by existing. We > can change the initial placement mechanism too. 
> > </2cents> Providing hardware information from user space is a last resort; we should try to avoid it if possible. Per my understanding, per-memory-type abstract distance overriding is for applying a specific policy, while per-memory-node abstract distance overriding is for supplying missing hardware information. -- Best Regards, Huang, Ying
Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes: > On Tue, 09 Jan 2024 11:41:11 +0800 > "Huang, Ying" <ying.huang@intel.com> wrote: > >> Gregory Price <gregory.price@memverge.com> writes: >> >> > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote: >> >> > >> >> > From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf >> >> > abstract_distance_offset: override by users to deal with firmware issue. >> >> > >> >> > say firmware can configure the cxl node into wrong tiers, similar to >> >> > that it may also configure all cxl nodes into single memtype, hence >> >> > all these nodes can fall into a single wrong tier. >> >> > In this case, per node adistance_offset would be good to have ? >> >> >> >> I think that it's better to fix the error firmware if possible. And >> >> these are only theoretical, not practical issues. Do you have some >> >> practical issues? >> >> >> >> I understand that users may want to move nodes between memory tiers for >> >> different policy choices. For that, memory_type based adistance_offset >> >> should be good. >> >> >> > >> > There's actually an affirmative case to change memory tiering to allow >> > either movement of nodes between tiers, or at least base placement on >> > HMAT information. Preferably, membership would be changable to allow >> > hotplug/DCD to be managed (there's no guarantee that the memory passed >> > through will always be what HMAT says on initial boot). >> >> IIUC, from Jonathan Cameron as below, the performance of memory >> shouldn't change even for DCD devices. >> >> https://lore.kernel.org/linux-mm/20231103141636.000007e4@Huawei.com/ >> >> It's possible to change the performance of a NUMA node changed, if we >> hot-remove a memory device, then hot-add another different memory >> device. It's hoped that the CDAT changes too. 
> > Not supported, but ACPI has _HMA methods to in theory allow changing > HMAT values based on firmware notifications... So we 'could' make > it work for HMAT based description. > > Ultimately my current thinking is we'll end up emulating CXL type3 > devices (hiding topology complexity) and you can update CDAT but > IIRC that is only meant to be for degraded situations - so if you > want multiple performance regions, CDAT should describe them from the start. Thank you very much for the input! So, to support degraded performance, we will need to move a NUMA node between memory tiers. And, per my understanding, we should do that in the kernel. >> >> So, all in all, HMAT + CDAT can help us to put the memory device in >> appropriate memory tiers. Now, we have HMAT support in upstream. We >> will be working on CDAT support. >> -- Best Regards, Huang, Ying
On Tue, 9 Jan 2024 12:59:19 -0500 Gregory Price <gregory.price@memverge.com> wrote: > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote: > > On Tue, 09 Jan 2024 11:41:11 +0800 > > "Huang, Ying" <ying.huang@intel.com> wrote: > > > Gregory Price <gregory.price@memverge.com> writes: > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote: > > > It's possible to change the performance of a NUMA node changed, if we > > > hot-remove a memory device, then hot-add another different memory > > > device. It's hoped that the CDAT changes too. > > > > Not supported, but ACPI has _HMA methods to in theory allow changing > > HMAT values based on firmware notifications... So we 'could' make > > it work for HMAT based description. > > > > Ultimately my current thinking is we'll end up emulating CXL type3 > > devices (hiding topology complexity) and you can update CDAT but > > IIRC that is only meant to be for degraded situations - so if you > > want multiple performance regions, CDAT should describe them from the start. > > > > That was my thought. I don't think it's particularly *realistic* for > HMAT/CDAT values to change at runtime, but I can imagine a case where > it could be valuable. For now I'm thinking we might spit that CDAT info out via a tracepoint if it happens, but given it's degraded-perf only, maybe we don't care. HMAT is more interesting because it may be used by a firmware-first model to paper over some weird hardware being hotplugged, or, for giggles, a hypervisor moving memory around under the hood (think powering down whole DRAM controllers etc). Anyhow, that's highly speculative and whoever cares about it can make it work! :) > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/ > > > > > > > > This group wants to enable passing CXL memory through to KVM/QEMU > > > > (i.e. host CXL expander memory passed through to the guest), and > > > > allow the guest to apply memory tiering. 
> > > > > There are multiple issues with this, presently:
> > > > >
> > > > > 1. The QEMU CXL virtual device is not and probably never will be
> > > > > performant enough to be a commodity class virtualization.
> >
> > I'd flex that a bit - we will end up with a solution for virtualization but
> > it isn't the emulation that is there today because it's not possible to
> > emulate some of the topology in a performant manner (interleaving with sub
> > page granularity / interleaving at all (to a lesser degree)). There are
> > ways to do better than we are today, but they start to look like
> > software disaggregated memory setups (think lots of page faults in the host).
>
> Agreed, the emulated device as-is can't be the virtualization device,
> but it doesn't mean it can't be the basis for it.
>
> My thought is, if you want to pass host CXL *memory* through to the
> guest, you don't actually care to pass CXL *control* through to the
> guest. That control lies pretty squarely with the host/hypervisor.
>
> So, at least in theory, you can just cut the type3 device out of the
> QEMU configuration entirely and just pass it through as a distinct numa
> node with specific hmat qualities.
>
> Barring that, if we must go through the type3 device, the question is
> how difficult would it be to just make a stripped down type3 device
> to provide the informational components, but hack off anything
> topology/interleave related? Then you just do direct passthrough as you
> described below.

Not stripped down as such, just lock the decoders as if a firmware had
configured it (in reality the config will be really really simple). The
kernel stack handles that fine today. The only dynamic bit would be the
DC related part. Not sure our lockdown support in the emulated device is
complete (some of it is there but might have missed some registers).

> qemu/kvm would report errors if you tried to touch the naughty bits.
Might do that as a temporary step along the way to enabling things, but
given CXL assumes that the host firmware 'might' have configured
everything and locked it (the kernel may be booting out of CXL memory,
for instance) it should 'just work' without needing this.

> The second question is... is that device "compliant" or does it need
> super special handling from the kernel driver :D? If what I described
> is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> just hide the CXL device entirely from the guest (for this use case)
> and just pass the memory through as a numa node.

Would need to be compliant or very nearly so - I can see we might
advertise no interleave support even though not setting any of the
interleave address bits is technically a spec violation. However, I
don't think we need to do that because of decoder locking. We advertise
interleave options but don't allow the current setting to be changed.
If someone manually resets the bus they are on their own though :(
(that will clear the lock registers as it's the same as removing power).

> Which gets us back to: The memory-tiering component needs a way to
> place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> of those seem like totally valid ways to go about it.
>
> > > > > 2. When passing memory through as an explicit NUMA node, but not as
> > > > > part of a CXL memory device, the nodes are lumped together in the
> > > > > DRAM tier.
> > > > >
> > > > > None of this has to do with firmware.
> > > > > Memory-type is an awful way of denoting membership of a tier, but we
> > > > > have HMAT information that can be passed through via QEMU:
> > > > >
> > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> > > > >
> > > > > Not only would it be nice if we could change tier membership based on
> > > > > this data, it's realistically the only way to allow guests to accomplish
> > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
> >
> > This I fully agree with. There will be systems with a bunch of normal DDR with different
> > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> > before we get anything more complex in place for CXL.
>
> Had not even considered this, but that's completely accurate as well.
>
> And more discretely: What of devices that don't provide HMAT/CDAT? That
> isn't necessarily a violation of any standard. There probably could be
> a release valve for us to still make those devices useful.

I'd argue any such device needs some driver support. The release valve
is that they provide the info from that driver, just like the CDAT
solution is doing. If they don't then meh, their system is borked, so
they'll add it fairly quickly!
> The concern I have with not implementing a movement mechanism *at all*
> is that a one-size-fits-all initial-placement heuristic feels gross
> when we're, at least ideologically, moving toward "software defined memory".
>
> Personally I think the movement mechanism is a good idea that gets folks
> where they're going sooner, and it doesn't hurt anything by existing. We
> can change the initial placement mechanism too.

I've no problem with a movement mechanism. Hopefully in the long run it
never gets used though! Maybe in the short term it's out-of-tree code.

Jonathan

> </2cents>
>
> ~Gregory
On Tue, 9 Jan 2024 16:28:15 -0800
Hao Xiang <hao.xiang@bytedance.com> wrote:

> On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <gregory.price@memverge.com> wrote:
[...]
> > The concern I have with not implementing a movement mechanism *at all*
> > is that a one-size-fits-all initial-placement heuristic feels gross
> > when we're, at least ideologically, moving toward "software defined memory".
> >
> > Personally I think the movement mechanism is a good idea that gets folks
> > where they're going sooner, and it doesn't hurt anything by existing. We
> > can change the initial placement mechanism too.
>
> I think providing users a way to "FIX" the memory tiering is a backup
> option. Given that DDRs with different access characteristics provide
> the relevant CDAT/HMAT information, the kernel should be able to
> correctly establish memory tiering on boot.

Include hotplug and I'll be happier! I know that's messy though.

> Current memory tiering code has
> 1) memory_tier_init() to iterate through all boot onlined memory
> nodes. All nodes are assumed to be fast tier (adistance
> MEMTIER_ADISTANCE_DRAM is used).
> 2) dev_dax_kmem_probe() to iterate through all devdax controlled memory
> nodes. This is the place the kernel reads the memory attributes from
> HMAT and places the memory nodes into the correct tier (devdax
> controlled CXL, pmem, etc).
> If we want DDRs with different memory characteristics to be put into
> the correct tier (as in the guest VM memory tiering case), we probably
> need a third path to iterate the boot onlined memory nodes and also be
> able to read their memory attributes. I don't think we can do that in
> 1) because the ACPI subsystem is not yet initialized.

Can we move it later in general? Or drag HMAT parsing earlier?
ACPI table availability is pretty early, it's just that we don't bother
with HMAT because nothing early uses it.
IIRC SRAT parsing occurs way before memory_tier_init() will be called.

Jonathan
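The default placement described above can be illustrated with a small sketch (the constants mirror the upstream chunking scheme as an assumption; treat the exact values as illustrative): a memory tier's id is just the chunk index of its abstract distance, which is why the DRAM default lands in "memory_tier4", the tier named in the cover letter.

```python
# Sketch: tiers group abstract distances into fixed-size chunks; nodes whose
# adistance falls in the same chunk share a memory tier.
MEMTIER_CHUNK_BITS = 8
MEMTIER_CHUNK_SIZE = 1 << MEMTIER_CHUNK_BITS
# Assumed default DRAM abstract distance: 4.5 chunks -> 1152.
MEMTIER_ADISTANCE_DRAM = 4 * MEMTIER_CHUNK_SIZE + MEMTIER_CHUNK_SIZE // 2


def adist_to_tier_id(adist: int) -> int:
    """Tier id is the abstract-distance chunk index."""
    return adist >> MEMTIER_CHUNK_BITS


print(adist_to_tier_id(MEMTIER_ADISTANCE_DRAM))  # -> 4, i.e. "memory_tier4"
```

Since memory_tier_init() seeds every boot-onlined node with MEMTIER_ADISTANCE_DRAM, all such nodes share chunk 4 until something (HMAT via dev_dax_kmem, or the proposed memtier_override) supplies a different distance.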
On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
[...]
> > If we want DDRs with different memory characteristics to be put into
> > the correct tier (as in the guest VM memory tiering case), we probably
> > need a third path to iterate the boot onlined memory nodes and also be
> > able to read their memory attributes. I don't think we can do that in
> > 1) because the ACPI subsystem is not yet initialized.
>
> Can we move it later in general? Or drag HMAT parsing earlier?
> ACPI table availability is pretty early, it's just that we don't bother
> with HMAT because nothing early uses it.
> IIRC SRAT parsing occurs way before memory_tier_init() will be called.

I tested the call sequence under a debugger earlier. hmat_init() is
called after memory_tier_init(). Let me poke around and see what our
options are.

> Jonathan
Hao Xiang <hao.xiang@bytedance.com> writes:

> On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
[...]
>> > If we want DDRs with different memory characteristics to be put into
>> > the correct tier (as in the guest VM memory tiering case), we probably
>> > need a third path to iterate the boot onlined memory nodes and also be
>> > able to read their memory attributes. I don't think we can do that in
>> > 1) because the ACPI subsystem is not yet initialized.
>>
>> Can we move it later in general? Or drag HMAT parsing earlier?
>> ACPI table availability is pretty early, it's just that we don't bother
>> with HMAT because nothing early uses it.
>> IIRC SRAT parsing occurs way before memory_tier_init() will be called.
>
> I tested the call sequence under a debugger earlier. hmat_init() is
> called after memory_tier_init(). Let me poke around and see what our
> options are.

This sounds reasonable. Please keep in mind that we need a way to
identify the baseline memory type (default_dram_type). A simple method
is to use NUMA nodes with CPUs attached. But I remember that Aneesh
said that some NUMA nodes without CPUs will need to be put in
default_dram_type too on their systems. We need a way to identify that.

--
Best Regards,
Huang, Ying
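The default_dram_type concern above could be sketched as follows (a purely hypothetical helper, not kernel code, with invented names): the baseline set is the CPU-attached nodes, plus an explicit override set for CPU-less nodes that are nonetheless plain DRAM, like the systems Aneesh described.

```python
# Hypothetical policy sketch for seeding default_dram_type:
# CPU-attached nodes are assumed to be baseline DRAM; dram_override names
# CPU-less nodes that should still count as baseline DRAM.
def default_dram_nodes(nodes_with_cpu, all_nodes, dram_override=frozenset()):
    """Return the set of nodes to seed into default_dram_type."""
    return set(nodes_with_cpu) | (set(all_nodes) & set(dram_override))


# Node 2 is CPU-less DRAM that should still land in the default DRAM type.
print(sorted(default_dram_nodes({0, 1}, {0, 1, 2, 3}, dram_override={2})))
# -> [0, 1, 2]
```

The open question in the thread is where that override comes from: firmware description (HMAT/CDAT) when it is trustworthy, or an administrator knob (such as the proposed memtier_override sysfs) when it is not.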
On Thu, Jan 11, 2024 at 11:02 PM Huang, Ying <ying.huang@intel.com> wrote: > > Hao Xiang <hao.xiang@bytedance.com> writes: > > > On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron > > <Jonathan.Cameron@huawei.com> wrote: > >> > >> On Tue, 9 Jan 2024 16:28:15 -0800 > >> Hao Xiang <hao.xiang@bytedance.com> wrote: > >> > >> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <gregory.price@memverge.com> wrote: > >> > > > >> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote: > >> > > > On Tue, 09 Jan 2024 11:41:11 +0800 > >> > > > "Huang, Ying" <ying.huang@intel.com> wrote: > >> > > > > Gregory Price <gregory.price@memverge.com> writes: > >> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote: > >> > > > > It's possible to change the performance of a NUMA node changed, if we > >> > > > > hot-remove a memory device, then hot-add another different memory > >> > > > > device. It's hoped that the CDAT changes too. > >> > > > > >> > > > Not supported, but ACPI has _HMA methods to in theory allow changing > >> > > > HMAT values based on firmware notifications... So we 'could' make > >> > > > it work for HMAT based description. > >> > > > > >> > > > Ultimately my current thinking is we'll end up emulating CXL type3 > >> > > > devices (hiding topology complexity) and you can update CDAT but > >> > > > IIRC that is only meant to be for degraded situations - so if you > >> > > > want multiple performance regions, CDAT should describe them form the start. > >> > > > > >> > > > >> > > That was my thought. I don't think it's particularly *realistic* for > >> > > HMAT/CDAT values to change at runtime, but I can imagine a case where > >> > > it could be valuable. > >> > > > >> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/ > >> > > > > > > >> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU > >> > > > > > (i.e. 
host CXL expander memory passed through to the guest), and > >> > > > > > allow the guest to apply memory tiering. > >> > > > > > > >> > > > > > There are multiple issues with this, presently: > >> > > > > > > >> > > > > > 1. The QEMU CXL virtual device is not and probably never will be > >> > > > > > performant enough to be a commodity class virtualization. > >> > > > > >> > > > I'd flex that a bit - we will end up with a solution for virtualization but > >> > > > it isn't the emulation that is there today because it's not possible to > >> > > > emulate some of the topology in a peformant manner (interleaving with sub > >> > > > page granularity / interleaving at all (to a lesser degree)). There are > >> > > > ways to do better than we are today, but they start to look like > >> > > > software dissagregated memory setups (think lots of page faults in the host). > >> > > > > >> > > > >> > > Agreed, the emulated device as-is can't be the virtualization device, > >> > > but it doesn't mean it can't be the basis for it. > >> > > > >> > > My thought is, if you want to pass host CXL *memory* through to the > >> > > guest, you don't actually care to pass CXL *control* through to the > >> > > guest. That control lies pretty squarely with the host/hypervisor. > >> > > > >> > > So, at least in theory, you can just cut the type3 device out of the > >> > > QEMU configuration entirely and just pass it through as a distinct numa > >> > > node with specific hmat qualities. > >> > > > >> > > Barring that, if we must go through the type3 device, the question is > >> > > how difficult would it be to just make a stripped down type3 device > >> > > to provide the informational components, but hack off anything > >> > > topology/interleave related? Then you just do direct passthrough as you > >> > > described below. > >> > > > >> > > qemu/kvm would report errors if you tried to touch the naughty bits. > >> > > > >> > > The second question is... 
is that device "compliant" or does it need > >> > > super special handling from the kernel driver :D? If what i described > >> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should > >> > > just hide the CXL device entirely from the guest (for this use case) > >> > > and just pass the memory through as a numa node. > >> > > > >> > > Which gets us back to: The memory-tiering component needs a way to > >> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three > >> > > of those seem like totally valid ways to go about it. > >> > > > >> > > > > > > >> > > > > > 2. When passing memory through as an explicit NUMA node, but not as > >> > > > > > part of a CXL memory device, the nodes are lumped together in the > >> > > > > > DRAM tier. > >> > > > > > > >> > > > > > None of this has to do with firmware. > >> > > > > > > >> > > > > > Memory-type is an awful way of denoting membership of a tier, but we > >> > > > > > have HMAT information that can be passed through via QEMU: > >> > > > > > > >> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \ > >> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \ > >> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \ > >> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \ > >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \ > >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \ > >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \ > >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880 > >> > > > > > > >> > > > > > Not only would it be nice if we could change tier membership based on > >> > > > > > this data, it's realistically the only way to allow guests to accomplish > >> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the 
guest. > >> > > > > >> > > > This I fully agree with. There will be systems with a bunch of normal DDR with different > >> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used > >> > > > before we get anything more complex in place for CXL. > >> > > > > >> > > > >> > > Had not even considered this, but that's completely accurate as well. > >> > > > >> > > And more discretely: What of devices that don't provide HMAT/CDAT? That > >> > > isn't necessarily a violation of any standard. There probably could be > >> > > a release valve for us to still make those devices useful. > >> > > > >> > > The concern I have with not implementing a movement mechanism *at all* > >> > > is that a one-size-fits-all initial-placement heuristic feels gross > >> > > when we're, at least ideologically, moving toward "software defined memory". > >> > > > >> > > Personally I think the movement mechanism is a good idea that gets folks > >> > > where they're going sooner, and it doesn't hurt anything by existing. We > >> > > can change the initial placement mechanism too. > >> > > >> > I think providing users a way to "FIX" the memory tiering is a backup > >> > option. Given that DDRs with different access characteristics provide > >> > the relevant CDAT/HMAT information, the kernel should be able to > >> > correctly establish memory tiering on boot. > >> > >> Include hotplug and I'll be happier! I know that's messy though. > >> > >> > Current memory tiering code has > >> > 1) memory_tier_init() to iterate through all boot onlined memory > >> > nodes. All nodes are assumed to be fast tier (adistance > >> > MEMTIER_ADISTANCE_DRAM is used). > >> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory > >> > nodes. This is the place the kernel reads the memory attributes from > >> > HMAT and recognizes the memory nodes into the correct tier (devdax > >> > controlled CXL, pmem, etc). 
> >> > If we want DDRs with different memory characteristics to be put into
> >> > the correct tier (as in the guest VM memory tiering case), we probably
> >> > need a third path to iterate the boot onlined memory nodes and also be
> >> > able to read their memory attributes. I don't think we can do that in
> >> > 1) because the ACPI subsystem is not yet initialized.
> >>
> >> Can we move it later in general? Or drag HMAT parsing earlier?
> >> ACPI table availability is pretty early, it's just that we don't bother
> >> with HMAT because nothing early uses it.
> >> IIRC SRAT parsing occurs way before memory_tier_init() will be called.
> >
> > I tested the call sequence under a debugger earlier. hmat_init() is
> > called after memory_tier_init(). Let me poke around and see what our
> > options are.
>
> This sounds reasonable.
>
> Please keep in mind that we need a way to identify the base line memory
> type(default_dram_type). A simple method is to use NUMA nodes with CPU
> attached. But I remember that Aneesh said that some NUMA nodes without
> CPU will need to be put in default_dram_type too on their systems. We
> need a way to identify that.

Yes, I am doing some prototyping the way you described. In
memory_tier_init(), we will just set the memory tier for the NUMA
nodes with CPU. In hmat_init(), I am trying to call back to mm to
finish the memory tier initialization for the CPUless NUMA nodes. If a
CPUless NUMA node can't get an effective adistance from
mt_calc_adistance(), we will fall back to adding that node to
default_dram_type.

The other thing I want to experiment with is calling mt_calc_adistance()
on a memory node with CPU and seeing what kind of adistance will be
returned.

>
> --
> Best Regards,
> Huang, Ying
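[Editor's note] The two-phase flow Hao sketches above (CPU nodes tiered early in memory_tier_init(), CPUless nodes placed later from hmat_init(), with a fallback to default_dram_type) can be modeled roughly as follows. This is an illustrative sketch only: the names mirror the discussion (mt_calc_adistance, default_dram_type baseline), but the logic is a simplified assumption, not the actual mm/memory-tiers.c code.

```python
# Baseline abstract distance for the default DRAM type (576 in the kernel:
# 4 * MEMTIER_CHUNK_SIZE + MEMTIER_CHUNK_SIZE / 2, with a 128 chunk size).
MEMTIER_ADISTANCE_DRAM = 576

def memory_tier_init(nodes):
    """Phase 1 (early boot): only nodes with CPUs attached get a tier."""
    tiers = {}
    for nid, info in nodes.items():
        if info["has_cpu"]:
            tiers[nid] = MEMTIER_ADISTANCE_DRAM
    return tiers

def hmat_init_callback(nodes, tiers, mt_calc_adistance):
    """Phase 2 (after HMAT parsing): place the remaining CPUless nodes."""
    for nid in nodes:
        if nid in tiers:
            continue
        adist = mt_calc_adistance(nid)
        # No effective adistance available -> fall back to default_dram_type.
        tiers[nid] = adist if adist is not None else MEMTIER_ADISTANCE_DRAM
    return tiers

# Node 1 has HMAT data (slower than DRAM); node 2 has none and falls back.
nodes = {0: {"has_cpu": True}, 1: {"has_cpu": False}, 2: {"has_cpu": False}}
tiers = memory_tier_init(nodes)
tiers = hmat_init_callback(nodes, tiers, lambda nid: 1152 if nid == 1 else None)
print(tiers)  # {0: 576, 1: 1152, 2: 576}
```

The fallback in phase 2 is exactly the open question in the thread: without it, a CPUless DRAM node with no usable HMAT data would have no tier at all.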
Hao Xiang <hao.xiang@bytedance.com> writes: > On Thu, Jan 11, 2024 at 11:02 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Hao Xiang <hao.xiang@bytedance.com> writes: >> >> > On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron >> > <Jonathan.Cameron@huawei.com> wrote: >> >> >> >> On Tue, 9 Jan 2024 16:28:15 -0800 >> >> Hao Xiang <hao.xiang@bytedance.com> wrote: >> >> >> >> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <gregory.price@memverge.com> wrote: >> >> > > >> >> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote: >> >> > > > On Tue, 09 Jan 2024 11:41:11 +0800 >> >> > > > "Huang, Ying" <ying.huang@intel.com> wrote: >> >> > > > > Gregory Price <gregory.price@memverge.com> writes: >> >> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote: >> >> > > > > It's possible to change the performance of a NUMA node changed, if we >> >> > > > > hot-remove a memory device, then hot-add another different memory >> >> > > > > device. It's hoped that the CDAT changes too. >> >> > > > >> >> > > > Not supported, but ACPI has _HMA methods to in theory allow changing >> >> > > > HMAT values based on firmware notifications... So we 'could' make >> >> > > > it work for HMAT based description. >> >> > > > >> >> > > > Ultimately my current thinking is we'll end up emulating CXL type3 >> >> > > > devices (hiding topology complexity) and you can update CDAT but >> >> > > > IIRC that is only meant to be for degraded situations - so if you >> >> > > > want multiple performance regions, CDAT should describe them form the start. >> >> > > > >> >> > > >> >> > > That was my thought. I don't think it's particularly *realistic* for >> >> > > HMAT/CDAT values to change at runtime, but I can imagine a case where >> >> > > it could be valuable. 
>> >> > > >> >> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/ >> >> > > > > > >> >> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU >> >> > > > > > (i.e. host CXL expander memory passed through to the guest), and >> >> > > > > > allow the guest to apply memory tiering. >> >> > > > > > >> >> > > > > > There are multiple issues with this, presently: >> >> > > > > > >> >> > > > > > 1. The QEMU CXL virtual device is not and probably never will be >> >> > > > > > performant enough to be a commodity class virtualization. >> >> > > > >> >> > > > I'd flex that a bit - we will end up with a solution for virtualization but >> >> > > > it isn't the emulation that is there today because it's not possible to >> >> > > > emulate some of the topology in a peformant manner (interleaving with sub >> >> > > > page granularity / interleaving at all (to a lesser degree)). There are >> >> > > > ways to do better than we are today, but they start to look like >> >> > > > software dissagregated memory setups (think lots of page faults in the host). >> >> > > > >> >> > > >> >> > > Agreed, the emulated device as-is can't be the virtualization device, >> >> > > but it doesn't mean it can't be the basis for it. >> >> > > >> >> > > My thought is, if you want to pass host CXL *memory* through to the >> >> > > guest, you don't actually care to pass CXL *control* through to the >> >> > > guest. That control lies pretty squarely with the host/hypervisor. >> >> > > >> >> > > So, at least in theory, you can just cut the type3 device out of the >> >> > > QEMU configuration entirely and just pass it through as a distinct numa >> >> > > node with specific hmat qualities. 
>> >> > > >> >> > > Barring that, if we must go through the type3 device, the question is >> >> > > how difficult would it be to just make a stripped down type3 device >> >> > > to provide the informational components, but hack off anything >> >> > > topology/interleave related? Then you just do direct passthrough as you >> >> > > described below. >> >> > > >> >> > > qemu/kvm would report errors if you tried to touch the naughty bits. >> >> > > >> >> > > The second question is... is that device "compliant" or does it need >> >> > > super special handling from the kernel driver :D? If what i described >> >> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should >> >> > > just hide the CXL device entirely from the guest (for this use case) >> >> > > and just pass the memory through as a numa node. >> >> > > >> >> > > Which gets us back to: The memory-tiering component needs a way to >> >> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three >> >> > > of those seem like totally valid ways to go about it. >> >> > > >> >> > > > > > >> >> > > > > > 2. When passing memory through as an explicit NUMA node, but not as >> >> > > > > > part of a CXL memory device, the nodes are lumped together in the >> >> > > > > > DRAM tier. >> >> > > > > > >> >> > > > > > None of this has to do with firmware. 
>> >> > > > > > >> >> > > > > > Memory-type is an awful way of denoting membership of a tier, but we >> >> > > > > > have HMAT information that can be passed through via QEMU: >> >> > > > > > >> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \ >> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \ >> >> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \ >> >> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \ >> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \ >> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \ >> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \ >> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880 >> >> > > > > > >> >> > > > > > Not only would it be nice if we could change tier membership based on >> >> > > > > > this data, it's realistically the only way to allow guests to accomplish >> >> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest. >> >> > > > >> >> > > > This I fully agree with. There will be systems with a bunch of normal DDR with different >> >> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used >> >> > > > before we get anything more complex in place for CXL. >> >> > > > >> >> > > >> >> > > Had not even considered this, but that's completely accurate as well. >> >> > > >> >> > > And more discretely: What of devices that don't provide HMAT/CDAT? That >> >> > > isn't necessarily a violation of any standard. There probably could be >> >> > > a release valve for us to still make those devices useful. 
>> >> > > >> >> > > The concern I have with not implementing a movement mechanism *at all* >> >> > > is that a one-size-fits-all initial-placement heuristic feels gross >> >> > > when we're, at least ideologically, moving toward "software defined memory". >> >> > > >> >> > > Personally I think the movement mechanism is a good idea that gets folks >> >> > > where they're going sooner, and it doesn't hurt anything by existing. We >> >> > > can change the initial placement mechanism too. >> >> > >> >> > I think providing users a way to "FIX" the memory tiering is a backup >> >> > option. Given that DDRs with different access characteristics provide >> >> > the relevant CDAT/HMAT information, the kernel should be able to >> >> > correctly establish memory tiering on boot. >> >> >> >> Include hotplug and I'll be happier! I know that's messy though. >> >> >> >> > Current memory tiering code has >> >> > 1) memory_tier_init() to iterate through all boot onlined memory >> >> > nodes. All nodes are assumed to be fast tier (adistance >> >> > MEMTIER_ADISTANCE_DRAM is used). >> >> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory >> >> > nodes. This is the place the kernel reads the memory attributes from >> >> > HMAT and recognizes the memory nodes into the correct tier (devdax >> >> > controlled CXL, pmem, etc). >> >> > If we want DDRs with different memory characteristics to be put into >> >> > the correct tier (as in the guest VM memory tiering case), we probably >> >> > need a third path to iterate the boot onlined memory nodes and also be >> >> > able to read their memory attributes. I don't think we can do that in >> >> > 1) because the ACPI subsystem is not yet initialized. >> >> >> >> Can we move it later in general? Or drag HMAT parsing earlier? >> >> ACPI table availability is pretty early, it's just that we don't bother >> >> with HMAT because nothing early uses it. >> >> IIRC SRAT parsing occurs way before memory_tier_init() will be called. 
>> >
>> > I tested the call sequence under a debugger earlier. hmat_init() is
>> > called after memory_tier_init(). Let me poke around and see what our
>> > options are.
>>
>> This sounds reasonable.
>>
>> Please keep in mind that we need a way to identify the base line memory
>> type(default_dram_type). A simple method is to use NUMA nodes with CPU
>> attached. But I remember that Aneesh said that some NUMA nodes without
>> CPU will need to be put in default_dram_type too on their systems. We
>> need a way to identify that.
>
> Yes, I am doing some prototyping the way you described. In
> memory_tier_init(), we will just set the memory tier for the NUMA
> nodes with CPU. In hmat_init(), I am trying to call back to mm to
> finish the memory tier initialization for the CPUless NUMA nodes. If a
> CPUless NUMA node can't get an effective adistance from
> mt_calc_adistance(), we will fall back to adding that node to
> default_dram_type.

Sounds reasonable to me.

> The other thing I want to experiment with is calling mt_calc_adistance()
> on a memory node with CPU and seeing what kind of adistance will be
> returned.

Anyway, we need a baseline to start from. The abstract distance is
calculated based on the ratio of the performance of a node to that of
the default DRAM node.

--
Best Regards,
Huang, Ying
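[Editor's note] A minimal sketch of the "abstract distance as a performance ratio" idea Ying describes: scale the DRAM baseline by how much slower a node is than the default DRAM node. The 576 baseline matches the kernel's MEMTIER_ADISTANCE_DRAM, but the way latency and bandwidth ratios are combined here is an illustrative assumption, not the kernel's exact mt_perf_to_adistance() formula.

```python
MEMTIER_ADISTANCE_DRAM = 576  # kernel baseline adistance for default DRAM

def calc_adistance(node_lat_ns, node_bw_mbps, dram_lat_ns, dram_bw_mbps):
    """Higher latency and lower bandwidth both push a node further away."""
    lat_ratio = node_lat_ns / dram_lat_ns      # >1 when the node is slower
    bw_ratio = dram_bw_mbps / node_bw_mbps     # >1 when the node is narrower
    return int(MEMTIER_ADISTANCE_DRAM * (lat_ratio + bw_ratio) / 2)

print(calc_adistance(100, 10240, 100, 10240))  # the DRAM node itself -> 576
print(calc_adistance(200, 5120, 100, 10240))   # 2x latency, half bandwidth -> 1152
```

This is why the baseline matters: the same CXL node gets a completely different adistance depending on which node is chosen as the DRAM reference.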
From: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>

The memory tiers feature allows nodes with similar memory types or
performance characteristics to be grouped together in a memory tier.
However, there is currently no provision for moving a node from one
tier to another on demand.

This patch series aims to support node migration between tiers on
demand by the sysadmin/root user, using the sysfs interface provided
for node migration.

To migrate a node to a tier, the target tier id is written to the
corresponding node's memtier_override sysfs file.

Example: Move node2 to memory_tier2 from its default tier (i.e. memory_tier4)

1. Check the current memtier of node2:
   $ cat /sys/devices/system/node/node2/memtier_override
   memory_tier4

2. Migrate node2 to memory_tier2:
   $ echo 2 > /sys/devices/system/node/node2/memtier_override
   $ cat /sys/devices/system/node/node2/memtier_override
   memory_tier2

Use cases:

1. Useful for moving CXL nodes to the right tiers from userspace when
   the hardware fails to assign the tiers correctly based on memory
   types.

   On some platforms we have observed CXL memory being assigned to the
   same tier as DDR memory. This is arguably a system firmware bug, but
   it is true that tiers represent *ranges* of performance, and we
   believe it is important for the system operator to have the ability
   to override bad firmware or OS decisions about tier assignment as a
   fail-safe against potentially bad outcomes.

2. Useful if we want interleave weights to be applied to memory tiers
   instead of nodes.

   In a previous thread, Huang Ying <ying.huang@intel.com> thought this
   feature might be useful to overcome limitations of systems where
   nodes with different bandwidth characteristics are grouped in a
   single tier.
   https://lore.kernel.org/lkml/87a5rw1wu8.fsf@yhuang6-desk2.ccr.corp.intel.com/

=============
Version Notes:

V2 : Changed the interface to memtier_override from adistance_offset.
     memtier_override was recommended by
     1. John Groves <john@jagalactic.com>
     2. Ravi Shankar <ravis.opensrc@micron.com>
     3. Brice Goglin <Brice.Goglin@inria.fr>

V1 : Introduced the adistance_offset sysfs.

=============

Srinivasulu Thanneeru (2):
  base/node: Add sysfs for memtier_override
  memory tier: Support node migration between tiers

 Documentation/ABI/stable/sysfs-devices-node |  7 ++
 drivers/base/node.c                         | 47 ++++++++++++
 include/linux/memory-tiers.h                | 11 +++
 include/linux/node.h                        | 11 +++
 mm/memory-tiers.c                           | 85 ++++++++++++---------
 5 files changed, 125 insertions(+), 36 deletions(-)
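[Editor's note] The cover letter's sysfs commands can also be scripted, e.g. to retier every node that firmware wrongly placed in the DRAM tier. The loop below is a dry-run sketch against a mock directory tree; on a real system with this series applied, NODE_SYSFS would be /sys/devices/system/node (writes need root), and reading the file back would show the tier name (memory_tier2) rather than the raw id the mock retains.

```shell
# Build a mock sysfs tree so the retier logic can be exercised safely.
NODE_SYSFS=$(mktemp -d)
mkdir -p "$NODE_SYSFS/node0" "$NODE_SYSFS/node2"
echo memory_tier4 > "$NODE_SYSFS/node0/memtier_override"  # CXL node stuck in the DRAM tier
echo memory_tier2 > "$NODE_SYSFS/node2/memtier_override"  # already where we want it

# Move every node currently reporting memory_tier4 into tier 2.
for f in "$NODE_SYSFS"/node*/memtier_override; do
    if [ "$(cat "$f")" = "memory_tier4" ]; then
        echo 2 > "$f"  # request migration to memory_tier2
    fi
done

cat "$NODE_SYSFS/node0/memtier_override"  # mock file now holds the written id: 2
```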