Message ID: 1553316275-21985-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
Series: Another Approach to Use PMEM as NUMA Node
Le 23/03/2019 à 05:44, Yang Shi a écrit :
> With Dave Hansen's patches merged into Linus's tree
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>
> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> effectively and efficiently is still a question.
>
> There have been a couple of proposals posted on the mailing list [1] [2].
>
> The patchset is aimed to try a different approach from this proposal [1]
> to use PMEM as NUMA nodes.
>
> The approach is designed to follow the below principles:
>
> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
>
> 2. DRAM first/by default. No surprise to existing applications and default
> running. PMEM will not be allocated unless its node is specified explicitly
> by NUMA policy. Some applications may be not very sensitive to memory latency,
> so they could be placed on PMEM nodes then have hot pages promote to DRAM
> gradually.

I am not against the approach for some workloads. However, many HPC people would rather do this manually. But there's currently no easy way to find out from userspace whether a given NUMA node is DDR or PMEM*. We have to assume HMAT is available (and correct) and look at performance attributes. When talking to humans, it would be better to say "I allocated on the local DDR NUMA node" rather than "I allocated on the fastest node according to HMAT latency".

Also, when we'll have HBM+DDR, some applications may want to use DDR by default, which means they want the *slowest* node according to HMAT (by the way, will your hybrid policy work if we ever have HBM+DDR+PMEM?). Performance attributes could help, but how does user-space know for sure that X>Y will still mean HBM>DDR and not DDR>PMEM in 5 years?

It seems to me that exporting a flag in sysfs saying whether a node is PMEM could be convenient. Patch series [1] exported a "type" in sysfs node directories ("pmem" or "dram"). I don't know if there's an easy way to define what HBM is and expose that type too.

Brice

* As far as I know, the only way is to look at all DAX devices until you find the given NUMA node in the "target_node" attribute. If none, you're likely not PMEM-backed.

> [1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/
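For readers wanting to follow Brice's footnote in practice, here is a minimal userspace sketch of that DAX-scanning probe. It assumes the sysfs layout he refers to (a "target_node" attribute under /sys/bus/dax/devices/<dev>/, present on kernels that carry Dave Hansen's series); the command-line node argument and the minimal error handling are illustrative only.

```c
/* Sketch: guess whether a NUMA node is PMEM-backed by scanning DAX devices,
 * per Brice's footnote.  Assumes /sys/bus/dax/devices/<dev>/target_node. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

static int node_is_pmem_backed(int node)
{
	DIR *dir = opendir("/sys/bus/dax/devices");
	struct dirent *de;
	char path[512];
	int found = 0;

	if (!dir)
		return 0;	/* no DAX bus registered: likely no PMEM at all */

	while (!found && (de = readdir(dir))) {
		FILE *f;
		int target;

		if (de->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path),
			 "/sys/bus/dax/devices/%s/target_node", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fscanf(f, "%d", &target) == 1 && target == node)
			found = 1;	/* some DAX device targets this node */
		fclose(f);
	}
	closedir(dir);
	return found;
}

int main(int argc, char **argv)
{
	int node = argc > 1 ? atoi(argv[1]) : 0;

	printf("node %d: %s\n", node,
	       node_is_pmem_backed(node) ? "PMEM-backed (a DAX device targets it)"
					 : "probably not PMEM-backed");
	return 0;
}
```

Compiled stand-alone, `./a.out 2` would report whether node 2 appears as a DAX target on such a kernel.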
On Mon, Mar 25, 2019 at 9:15 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
> Le 23/03/2019 à 05:44, Yang Shi a écrit :
> [...]
>
> It seems to me that exporting a flag in sysfs saying whether a node is
> PMEM could be convenient. Patch series [1] exported a "type" in sysfs
> node directories ("pmem" or "dram"). I don't know if there's an easy
> way to define what HBM is and expose that type too.

I'm generally against the concept that a "pmem" or "type" flag should indicate anything about the expected performance of the address range. The kernel should explicitly look to the HMAT for performance data and not otherwise make type-based performance assumptions.
Le 25/03/2019 à 17:56, Dan Williams a écrit :
>
> I'm generally against the concept that a "pmem" or "type" flag should
> indicate anything about the expected performance of the address range.
> The kernel should explicitly look to the HMAT for performance data and
> not otherwise make type-based performance assumptions.

Oh sorry, I didn't mean to have the kernel use such a flag to decide on placement, but rather to expose more information to userspace to clarify what all these nodes are about when userspace decides where to allocate things.

I understand that current NVDIMM-F are not slower than DDR and that HMAT would describe this better than a flag. But I have seen so many buggy or dummy SLIT tables in the past that I wonder whether we can expect HMAT to be widely available (and correct). Is there a safe fallback in case of missing or buggy HMAT? For instance, is DDR supposed to be listed before NVDIMM (or HBM) in SRAT?

Brice
On Mon, Mar 25, 2019 at 10:45 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
> Oh sorry, I didn't mean to have the kernel use such a flag to decide on
> placement, but rather to expose more information to userspace to clarify
> what all these nodes are about when userspace decides where to allocate
> things.

I understand, but I'm concerned about the risk of userspace developing vendor-specific, or generation-specific, policies around a coarse type identifier. I think the lack of type specificity is a feature rather than a gap, because it requires userspace to consider deeper information. Perhaps "path" might be a suitable replacement identifier rather than type. I.e. memory that originates from an ACPI.NFIT root device is likely "pmem".

> I understand that current NVDIMM-F are not slower than DDR and that HMAT
> would describe this better than a flag. But I have seen so many buggy or
> dummy SLIT tables in the past that I wonder whether we can expect HMAT
> to be widely available (and correct).

That's always a fear, that the platform BIOS will try to game OS behavior. However, that was the reason HMAT was defined to indicate actual performance values rather than relative ones. It is hopefully harder to game than the relative SLIT values, but I'll grant you it's not impossible.

> Is there a safe fallback in case of missing or buggy HMAT? For instance,
> is DDR supposed to be listed before NVDIMM (or HBM) in SRAT?

One fallback might be to make some of these sysfs attributes writable so userspace can correct the situation, but I'm otherwise unclear on what you mean by "safe". If a platform has hard dependencies on correctly enumerating memory performance capabilities then there's not much the kernel can do if the HMAT is botched. I would expect the general case to be that the performance capabilities are a soft dependency, but things still work if the data is wrong.
On 3/25/19 9:15 AM, Brice Goglin wrote:
> Le 23/03/2019 à 05:44, Yang Shi a écrit :
> [...]
>
> I am not against the approach for some workloads. However, many HPC
> people would rather do this manually. But there's currently no easy way
> to find out from userspace whether a given NUMA node is DDR or PMEM*. We
> have to assume HMAT is available (and correct) and look at performance
> attributes. When talking to humans, it would be better to say "I
> allocated on the local DDR NUMA node" rather than "I allocated on the
> fastest node according to HMAT latency".

Yes, I agree we should have some information exposed to the kernel or userspace to tell which nodes are DRAM nodes and which are not (maybe HBM or PMEM). I assume the default allocation should end up on DRAM nodes for most workloads. If someone would like to control this manually by means other than mempolicy, the default allocation node mask could be exported to user space via sysfs so that it can be changed on demand.

> Also, when we'll have HBM+DDR, some applications may want to use DDR by
> default, which means they want the *slowest* node according to HMAT (by
> the way, will your hybrid policy work if we ever have HBM+DDR+PMEM?).
> Performance attributes could help, but how does user-space know for sure
> that X>Y will still mean HBM>DDR and not DDR>PMEM in 5 years?

This is what I mentioned above: we need the information exported from HMAT or anything similar to tell us which nodes are DRAM nodes, since DRAM may be the lowest tier of memory. Or we may be able to assume the nodes associated with CPUs are DRAM nodes, by assuming both HBM and PMEM are CPU-less nodes.

Thanks,
Yang

> [...]
Le 25/03/2019 à 20:29, Dan Williams a écrit :
> Perhaps "path" might be a suitable replacement identifier rather than
> type. I.e. memory that originates from an ACPI.NFIT root device is
> likely "pmem".

Could work.

What kind of "path" would we get for other types of memory? (DDR, non-ACPI-based PMEM if any, NVMe PMR?)

Thanks
Brice
On Mon, Mar 25, 2019 at 4:09 PM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
> Le 25/03/2019 à 20:29, Dan Williams a écrit :
> > Perhaps "path" might be a suitable replacement identifier rather than
> > type. I.e. memory that originates from an ACPI.NFIT root device is
> > likely "pmem".
>
> Could work.
>
> What kind of "path" would we get for other types of memory? (DDR,
> non-ACPI-based PMEM if any, NVMe PMR?)

I think for memory that is described by the HMAT "Reservation hint", and no other ACPI table, it would need to have "HMAT" in the path. For anything not ACPI it gets easier because the path can be the parent PCI device.
On Mon, 25 Mar 2019 16:37:07 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> [...]
> I think for memory that is described by the HMAT "Reservation hint",
> and no other ACPI table, it would need to have "HMAT" in the path. For
> anything not ACPI it gets easier because the path can be the parent
> PCI device.

There is no HMAT reservation hint in ACPI 6.3 - but there are other ways of doing much the same thing, so this is just a nitpick.

J
On Sat 23-03-19 12:44:25, Yang Shi wrote:
> [...]
>
> The approach is designed to follow the below principles:
>
> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
>
> 2. DRAM first/by default. No surprise to existing applications and default
> running. PMEM will not be allocated unless its node is specified explicitly
> by NUMA policy. Some applications may be not very sensitive to memory latency,
> so they could be placed on PMEM nodes then have hot pages promote to DRAM
> gradually.

Why are you pushing yourself into the corner right at the beginning? If the PMEM is exported as a regular NUMA node then the only difference should be performance characteristics (modulo durability, which shouldn't play any role in this particular case, right?). Applications which are already sensitive to memory access had better use proper binding already. Some NUMA topologies might have quite large interconnect penalties already. So this doesn't sound like an argument to me, TBH.

> 5. Control memory allocation and hot/cold pages promotion/demotion on per VMA
> basis.

What does that mean? Anon vs. file backed memory?

[...]

> 2. Introduce a new mempolicy, called MPOL_HYBRID to keep other mempolicy
> semantics intact. We would like to have memory placement control on per process
> or even per VMA granularity. So, mempolicy sounds more reasonable than madvise.
> The new mempolicy is mainly used for launching processes on PMEM nodes then
> migrate hot pages to DRAM nodes via NUMA balancing. MPOL_BIND could bind to
> PMEM nodes too, but migrating to DRAM nodes would just break the semantic of
> it. MPOL_PREFERRED can't constraint the allocation to PMEM nodes. So, it sounds
> a new mempolicy is needed to fulfill the usecase.

The above restriction pushes you to invent an API which is not really trivial to get right and it seems quite artificial to me already.

> 3. The new mempolicy would promote pages to DRAM via NUMA balancing. IMHO, I
> don't think kernel is a good place to implement sophisticated hot/cold page
> distinguish algorithm due to the complexity and overhead. But, kernel should
> have such capability. NUMA balancing sounds like a good start point.

This is what the kernel does all the time. We call it memory reclaim.

> 4. Promote twice faulted page. Use PG_promote to track if a page is faulted
> twice. This is an optimization to NUMA balancing to reduce the migration
> thrashing and overhead for migrating from PMEM.

I am sorry, but page flags are an extremely scarce resource and a new flag is extremely hard to get. On the other hand we already do have use-twice detection for mapped page cache (see page_check_references). I believe we can generalize that to anon pages as well.

> 5. When DRAM has memory pressure, demote page to PMEM via page reclaim path.
> This is quite similar to other proposals. Then NUMA balancing will promote
> page to DRAM as long as the page is referenced again. But, the
> promotion/demotion still assumes two tier main memory. And, the demotion may
> break mempolicy.

Yes, this sounds like a good idea to me ;)

> 6. Anonymous page only for the time being since NUMA balancing can't promote
> unmapped page cache.

As long as nvdimm access is faster than regular storage then using any node (including a pmem one) should be OK.
On 3/26/19 6:58 AM, Michal Hocko wrote:
> On Sat 23-03-19 12:44:25, Yang Shi wrote:
>> [...]
>> 2. DRAM first/by default. No surprise to existing applications and default
>> running. PMEM will not be allocated unless its node is specified explicitly
>> by NUMA policy. Some applications may be not very sensitive to memory latency,
>> so they could be placed on PMEM nodes then have hot pages promote to DRAM
>> gradually.
> Why are you pushing yourself into the corner right at the beginning? If
> the PMEM is exported as a regular NUMA node then the only difference
> should be performance characteristics (modulo durability, which shouldn't
> play any role in this particular case, right?). Applications which are
> already sensitive to memory access had better use proper binding already.
> Some NUMA topologies might have quite large interconnect penalties
> already. So this doesn't sound like an argument to me, TBH.

The major rationale behind this is that we assume most applications are sensitive to memory access, particularly for meeting the SLA. The applications running on the machine may be unknown to us; they may be sensitive or non-sensitive. But assuming they are sensitive to memory access sounds safer from an SLA point of view. Then the "cold" pages could be demoted to PMEM nodes by the kernel's memory reclaim or other tools without impairing the SLA.

If the applications are not sensitive to memory access, they could be bound to PMEM, or allowed to use PMEM explicitly (with allocation on DRAM as a nice-to-have), and then the "hot" pages could be promoted to DRAM.

>> 5. Control memory allocation and hot/cold pages promotion/demotion on per VMA
>> basis.
> What does that mean? Anon vs. file backed memory?

Yes, kind of. Basically, we would like to control the memory placement and promotion (by NUMA balancing) on a per-VMA basis. For example, anon VMAs may be DRAM by default while file backed VMAs may be PMEM by default. Anyway, basically this is achieved freely by mempolicy.

>> 2. Introduce a new mempolicy, called MPOL_HYBRID to keep other mempolicy
>> semantics intact. [...]
> The above restriction pushes you to invent an API which is not really
> trivial to get right and it seems quite artificial to me already.

First of all, the use case is that some applications may be not that sensitive to memory access, or are willing to achieve a net win by trading some performance to save some cost (keeping some memory on PMEM). So, such applications may be bound to PMEM in the first place and then have hot pages promoted to DRAM via NUMA balancing or whatever mechanism. Neither MPOL_BIND nor MPOL_PREFERRED fits this usecase naturally.

Secondly, it looks like only the default policy gets NUMA balancing. Once the policy is changed to MPOL_BIND, NUMA balancing does not chime in. So, I invented the new mempolicy.

>> 4. Promote twice faulted page. Use PG_promote to track if a page is faulted
>> twice. [...]
> I am sorry, but page flags are an extremely scarce resource and a new
> flag is extremely hard to get. On the other hand we already do have
> use-twice detection for mapped page cache (see page_check_references). I
> believe we can generalize that to anon pages as well.

Yes, I agree. A new page flag is not preferred. I'm going to take a look at page_check_references().

>> 6. Anonymous page only for the time being since NUMA balancing can't promote
>> unmapped page cache.
> As long as nvdimm access is faster than regular storage then using any
> node (including a pmem one) should be OK.

However, it still sounds better to have some frequently accessed page cache on DRAM.

Thanks,
Yang
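For context on the semantics Yang is working around, a hedged sketch of the existing MPOL_BIND API follows; node 2 standing in for a PMEM node is purely an assumption for illustration. Under MPOL_BIND the kernel must keep the pages on the bound nodes, which is exactly why promoting hot pages to DRAM from under such a binding would break the policy, and why MPOL_PREFERRED (which merely prefers a node) cannot constrain allocations to PMEM:

```c
/* Sketch: hard-bind a range to a (hypothetical) PMEM node 2 with MPOL_BIND.
 * Pages in the range must stay on node 2 for the lifetime of the binding,
 * so later promotion to DRAM would violate the policy's semantics. */
#include <numaif.h>      /* mbind(), MPOL_BIND; link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	size_t len = 64UL << 20;               /* 64 MiB */
	unsigned long nodemask = 1UL << 2;     /* bit 2: assumed PMEM node */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Constrain all future faults in [p, p+len) to the PMEM node. */
	if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}
	/* Touch the range so pages are actually allocated on node 2. */
	for (size_t i = 0; i < len; i += 4096)
		((char *)p)[i] = 0;
	return 0;
}
```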
On Tue 26-03-19 11:33:17, Yang Shi wrote:
> On 3/26/19 6:58 AM, Michal Hocko wrote:
> [...]
>
> The major rationale behind this is that we assume most applications are
> sensitive to memory access, particularly for meeting the SLA. The
> applications running on the machine may be unknown to us; they may be
> sensitive or non-sensitive. But assuming they are sensitive to memory
> access sounds safer from an SLA point of view. Then the "cold" pages
> could be demoted to PMEM nodes by the kernel's memory reclaim or other
> tools without impairing the SLA.
>
> If the applications are not sensitive to memory access, they could be
> bound to PMEM, or allowed to use PMEM explicitly (with allocation on
> DRAM as a nice-to-have), and then the "hot" pages could be promoted to
> DRAM.

Again, how is this different from NUMA in general?
On 3/26/19 11:37 AM, Michal Hocko wrote:
> On Tue 26-03-19 11:33:17, Yang Shi wrote:
> [...]
> Again, how is this different from NUMA in general?

It is still NUMA; users still can see all the NUMA nodes. We introduced a default allocation node mask (please refer to patch #1) to control the memory placement. Typically, the node mask just includes DRAM nodes. PMEM nodes are excluded from the node mask for memory allocation. The node mask can be overridden by the user, per the discussion with Dan.

Thanks,
Yang
On Tue 26-03-19 19:58:56, Yang Shi wrote:
> On 3/26/19 11:37 AM, Michal Hocko wrote:
> > Again, how is this different from NUMA in general?
>
> It is still NUMA; users still can see all the NUMA nodes.

No, the Linux NUMA implementation makes all numa nodes available by default and provides an API to opt in for more fine tuning. What you are suggesting goes against that semantic and I am asking why. How is a pmem NUMA node any different from any other distant node in principle?
On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 26-03-19 19:58:56, Yang Shi wrote:
> [...]
> > It is still NUMA; users still can see all the NUMA nodes.
>
> No, the Linux NUMA implementation makes all numa nodes available by
> default and provides an API to opt in for more fine tuning. What you are
> suggesting goes against that semantic and I am asking why. How is a pmem
> NUMA node any different from any other distant node in principle?

Agree. It's just another NUMA node and shouldn't be special cased. Userspace policy can choose to avoid it, but typical node distance preference should otherwise let the kernel fall back to it as additional memory pressure relief for "near" memory.
On 3/27/19 10:34 AM, Dan Williams wrote:
> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > How is a pmem NUMA node any different from any other distant node in
> > principle?
> Agree. It's just another NUMA node and shouldn't be special cased.
> Userspace policy can choose to avoid it, but typical node distance
> preference should otherwise let the kernel fall back to it as
> additional memory pressure relief for "near" memory.

In the ideal case, yes, I agree. However, in the real world performance is a concern. It is well known that PMEM (not considering NVDIMM-F or HBM) has higher latency and lower bandwidth. We observed much higher latency on PMEM than on DRAM with multiple threads.

In a real production environment we don't know what kind of applications would end up on PMEM (DRAM may be full, allocation falls back to PMEM) and then see unexpected performance degradation. I understand mempolicy can be used to avoid it. But there might be hundreds or thousands of applications running on the machine; it does not sound feasible to me to have every single application set a mempolicy to avoid it.

So, I think we still need a default allocation node mask. The default value may include all nodes or just DRAM nodes. But it should be able to be overridden by the user globally, not only on a per-process basis.

Due to the performance disparity, our current usecases treat PMEM as second tier memory, for demoting cold pages or for binding applications that are not sensitive to memory access (this is the reason for inventing a new mempolicy), although it is a NUMA node.

Thanks,
Yang
On Wed 27-03-19 11:59:28, Yang Shi wrote:
> [...]
> In the ideal case, yes, I agree. However, in the real world performance
> is a concern. It is well known that PMEM (not considering NVDIMM-F or
> HBM) has higher latency and lower bandwidth. We observed much higher
> latency on PMEM than on DRAM with multiple threads.

One rule of thumb is: do not design user visible interfaces based on the contemporary technology and its up/down sides. This will almost always fire back.

Btw. you keep arguing about performance without any numbers. Can you present something specific?

> In a real production environment we don't know what kind of applications
> would end up on PMEM (DRAM may be full, allocation falls back to PMEM)
> and then see unexpected performance degradation. I understand mempolicy
> can be used to avoid it. But there might be hundreds or thousands of
> applications running on the machine; it does not sound feasible to me to
> have every single application set a mempolicy to avoid it.

We have the cpuset cgroup controller to help here.

> So, I think we still need a default allocation node mask. The default
> value may include all nodes or just DRAM nodes. But it should be able to
> be overridden by the user globally, not only on a per-process basis.
>
> Due to the performance disparity, our current usecases treat PMEM as
> second tier memory, for demoting cold pages or for binding applications
> that are not sensitive to memory access (this is the reason for
> inventing a new mempolicy), although it is a NUMA node.

If the performance sucks that badly then do not use the pmem as NUMA, really. There are certainly other ways to export the pmem storage. Use it as a fast swap storage. Or try to work on a swap caching mechanism that still allows much faster access than a slow swap storage. But do not abuse the NUMA interface while breaking some of its long established semantics.
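To make the cpuset suggestion concrete, here is a sketch of confining a process tree to the DRAM nodes without touching any application code. The cgroup-v1 mount point, the CPU range, and the assumption that nodes 0-1 are the DRAM nodes are all illustrative:

```c
/* Sketch: confine the current process (and its children) to DRAM nodes
 * via the cpuset cgroup controller.  Assumes a cgroup-v1 cpuset hierarchy
 * mounted at /sys/fs/cgroup/cpuset and that nodes 0-1 are DRAM. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s", val);
	return fclose(f);
}

int main(void)
{
	char pid[32];

	mkdir("/sys/fs/cgroup/cpuset/dram_only", 0755);
	/* A cpuset needs both cpus and mems set before tasks can join;
	 * allow all CPUs but only memory nodes 0-1 (the DRAM nodes here). */
	write_str("/sys/fs/cgroup/cpuset/dram_only/cpuset.cpus", "0-63");
	write_str("/sys/fs/cgroup/cpuset/dram_only/cpuset.mems", "0-1");
	snprintf(pid, sizeof(pid), "%d", getpid());
	if (write_str("/sys/fs/cgroup/cpuset/dram_only/tasks", pid))
		perror("joining cpuset");
	/* exec the workload here; its allocations now stay on nodes 0-1 */
	return 0;
}
```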
On 3/27/19 11:59 AM, Yang Shi wrote:
> In a real production environment we don't know what kind of applications
> would end up on PMEM (DRAM may be full, allocation falls back to PMEM)
> and then see unexpected performance degradation. I understand mempolicy
> can be used to avoid it. But there might be hundreds or thousands of
> applications running on the machine; it does not sound feasible to me to
> have every single application set a mempolicy to avoid it.

Maybe not manually, but it's entirely possible to automate this. It would be trivial to get help from an orchestrator, or even systemd, to get apps launched with a particular policy. Or even a *shell* that launches apps could apply a particular policy.
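A hedged sketch of such a launcher, in the spirit of `numactl --membind`: set a default MPOL_BIND policy and exec the real application, which inherits the task policy across execve. The nodemask (nodes 0-1 as DRAM) is an assumption:

```c
/* Sketch of a tiny "numactl --membind"-style launcher: set a default
 * MPOL_BIND policy to the (assumed) DRAM nodes 0-1, then exec the real
 * application, which inherits the task mempolicy. */
#include <numaif.h>      /* set_mempolicy(); link with -lnuma */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned long dram_nodes = 0x3;   /* bits 0 and 1: assumed DRAM nodes */

	if (argc < 2) {
		fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
		return 1;
	}
	if (set_mempolicy(MPOL_BIND, &dram_nodes, sizeof(dram_nodes) * 8)) {
		perror("set_mempolicy");
		return 1;
	}
	execvp(argv[1], &argv[1]);        /* the policy survives the exec */
	perror("execvp");
	return 1;
}
```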
On Wed, Mar 27, 2019 at 10:34:11AM -0700, Dan Williams wrote:
> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
> > No, the Linux NUMA implementation makes all numa nodes available by
> > default and provides an API to opt in for more fine tuning. What you
> > are suggesting goes against that semantic and I am asking why. How is
> > a pmem NUMA node any different from any other distant node in principle?
>
> Agree. It's just another NUMA node and shouldn't be special cased.
> Userspace policy can choose to avoid it, but typical node distance
> preference should otherwise let the kernel fall back to it as
> additional memory pressure relief for "near" memory.

I think this is sort of true, but sort of different. These are essentially CPU-less nodes; there is no CPU for which they are fast memory. Yes, they're further from some CPUs than from others. I have never paid attention to how Linux treats CPU-less memory nodes, but it would make sense to me if we don't default to allocating from remote nodes. And treating pmem nodes as being remote from all CPUs makes a certain amount of sense to me. E.g. on a four-CPU-socket system, consider this as being

    pmem1 --- node1 --- node2 --- pmem2
      |          \     /            |
      |            X                |
      |          /     \            |
    pmem3 --- node3 --- node4 --- pmem4

which I could actually see someone building with normal DRAM, and which we should probably handle the same way as pmem; for a process running on node3, allocate preferentially from node3, then pmem3, then other nodes, then other pmems.
On 3/27/19 1:35 PM, Matthew Wilcox wrote:
>
>     pmem1 --- node1 --- node2 --- pmem2
>       |          \     /            |
>       |            X                |
>       |          /     \            |
>     pmem3 --- node3 --- node4 --- pmem4
>
> which I could actually see someone building with normal DRAM, and which
> we should probably handle the same way as pmem; for a process running on
> node3, allocate preferentially from node3, then pmem3, then other nodes,
> then other pmems.

That makes sense. But, it might _also_ make sense to fill up all DRAM first before using any pmem. That could happen if the NUMA interconnect is really fast and pmem is really slow. Basically, with the current patches we are depending on the firmware to "nicely" enumerate the topology and we're keeping the behavior that we end up with, for now, whatever it might be. Now, let's sit back and see how nice the firmware is. :)
On 3/27/19 1:09 PM, Michal Hocko wrote:
> On Wed 27-03-19 11:59:28, Yang Shi wrote:
> [...]
> One rule of thumb is: do not design user visible interfaces based on the
> contemporary technology and its up/down sides. This will almost always
> fire back.

Thanks. It does make sense to me.

> Btw. you keep arguing about performance without any numbers. Can you
> present something specific?

Yes, I do have some numbers. We ran a simple sequential memory read/write latency test with an in-house test program on PMEM (bound to PMEM) and DRAM (bound to DRAM). When running with 20 threads the results were as below:

             Threads   w/lat    r/lat
    PMEM     20        537.15   68.06
    DRAM     20        14.19    6.47

And a sysbench test with the command:

    sysbench --time=600 memory --memory-block-size=8G --memory-total-size=1024T \
        --memory-scope=global --memory-oper=read --memory-access-mode=rnd \
        --rand-type=gaussian --rand-pareto-h=0.1 --threads=1 run

The result is:

             lat/ms
    PMEM     103766.09
    DRAM     31946.30

> We have the cpuset cgroup controller to help here.
>
> [...]
> If the performance sucks that badly then do not use the pmem as NUMA,
> really. There are certainly other ways to export the pmem storage. Use
> it as a fast swap storage. Or try to work on a swap caching mechanism
> that still allows much faster access than a slow swap storage. But do
> not abuse the NUMA interface while breaking some of its long established
> semantics.

Yes, we are looking into using it as fast swap storage too, and perhaps other usecases.

Anyway, since nobody thinks it makes sense to restrict the default allocation nodes, and it sounds over-engineered, I'm going to drop it.

One question: when doing demotion and promotion we need to define a path, for example, DRAM <-> PMEM (assuming two tiers of memory). When determining which nodes are "DRAM" nodes, does it make sense to assume the nodes with both CPU and memory are DRAM nodes, since PMEM nodes are typically cpuless nodes?

Thanks,
Yang
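Yang's "nodes with both CPU and memory" heuristic is straightforward to express with libnuma, for example when a management tool wants to compute the DRAM set itself. A minimal sketch; note Dan's caveat in the next message that cpu-less does not necessarily mean PMEM:

```c
/* Sketch: build a "DRAM node" view by treating every node that has at
 * least one CPU as DRAM, per Yang's heuristic.  Caveat from the thread:
 * cpu-less does not necessarily imply PMEM (could be HBM, etc.). */
#include <numa.h>        /* link with -lnuma */
#include <stdio.h>

int main(void)
{
	struct bitmask *cpus = numa_allocate_cpumask();
	int node;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}
	for (node = 0; node <= numa_max_node(); node++) {
		if (numa_node_to_cpus(node, cpus) < 0)
			continue;	/* node not present */
		printf("node %d: %s\n", node,
		       numa_bitmask_weight(cpus) ? "has CPUs -> treat as DRAM"
						 : "cpu-less -> maybe PMEM/HBM");
	}
	numa_free_cpumask(cpus);
	return 0;
}
```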
On Wed 27-03-19 19:09:10, Yang Shi wrote:
> One question: when doing demotion and promotion we need to define a
> path, for example, DRAM <-> PMEM (assuming two tiers of memory). When
> determining which nodes are "DRAM" nodes, does it make sense to assume
> the nodes with both CPU and memory are DRAM nodes, since PMEM nodes are
> typically cpuless nodes?

Do we really have to special case this for PMEM? Why cannot we simply go in the zonelist order? In other words, why cannot we use the same logic for a larger NUMA machine and, instead of swapping, simply fall back to a less contended NUMA node? It can be a regular DRAM, PMEM or whatever other type of memory node.
On Wed, Mar 27, 2019 at 7:09 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
> On 3/27/19 1:09 PM, Michal Hocko wrote:
> [...]
>
> One question: when doing demotion and promotion we need to define a
> path, for example, DRAM <-> PMEM (assuming two tiers of memory). When
> determining which nodes are "DRAM" nodes, does it make sense to assume
> the nodes with both CPU and memory are DRAM nodes, since PMEM nodes are
> typically cpuless nodes?

For ACPI platforms the HMAT is effectively going to enforce "cpu-less" nodes for any memory range that has differentiated performance from the conventional memory pool, or differentiated performance for a specific initiator. So "cpu-less == PMEM" is not a robust assumption. The plan is to use the HMAT to populate the default fallback order, but allow for an override if the HMAT information is missing or incorrect.
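If the HMAT-derived attributes end up exported per node (as in the node "access class" sysfs layout that was being proposed around this time), userspace could rank nodes by measured latency instead of by type. A speculative sketch; the /sys/devices/system/node/nodeN/access0/initiators/read_latency path and its nanosecond unit are assumptions based on that proposal:

```c
/* Sketch: read the HMAT-derived read latency (assumed ns) that a node's
 * nearest initiators see, using the proposed node "access class" sysfs
 * layout.  Requires a kernel that actually exports access0 attributes. */
#include <stdio.h>

static long node_read_latency_ns(int node)
{
	char path[128];
	FILE *f;
	long ns = -1;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/access0/initiators/read_latency",
		 node);
	f = fopen(path, "r");
	if (!f)
		return -1;	/* no HMAT data exported for this node */
	if (fscanf(f, "%ld", &ns) != 1)
		ns = -1;
	fclose(f);
	return ns;
}

int main(void)
{
	for (int node = 0; node < 4; node++)	/* assume up to 4 nodes */
		printf("node %d read latency: %ld ns\n",
		       node, node_read_latency_ns(node));
	return 0;
}
```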
On 3/27/19 11:58 PM, Michal Hocko wrote:
> On Wed 27-03-19 19:09:10, Yang Shi wrote:
> [...]
> Do we really have to special case this for PMEM? Why cannot we simply go
> in the zonelist order? In other words, why cannot we use the same logic
> for a larger NUMA machine and, instead of swapping, simply fall back to
> a less contended NUMA node? It can be a regular DRAM, PMEM or whatever
> other type of memory node.

Thanks for the suggestion. It makes sense. However, if we don't special-case a pmem node, its fallback node may be a DRAM node, and then memory reclaim may move inactive pages to the DRAM node. That does not make much sense, since memory reclaim would prefer to move downwards (DRAM -> PMEM -> Disk).

Yang
On Thu 28-03-19 11:58:57, Yang Shi wrote:
> [...]
> Thanks for the suggestion. It makes sense. However, if we don't
> special-case a pmem node, its fallback node may be a DRAM node, and then
> memory reclaim may move inactive pages to the DRAM node. That does not
> make much sense, since memory reclaim would prefer to move downwards
> (DRAM -> PMEM -> Disk).

There are certainly many details to sort out. One thing is how to handle cpuless nodes (e.g. PMEM). Those shouldn't get any direct allocations without an explicit binding, right? My first naive idea would be to migrate-on-reclaim only from the preferred node. We might need additional heuristics but I wouldn't special case PMEM from other cpuless NUMA nodes.
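Michal's per-preferred-node heuristic implies a demotion target map that can already be derived from exported topology. Here is a userspace illustration (not kernel code) that picks, for each CPU-bearing node, the closest cpu-less node by SLIT distance using libnuma's numa_distance():

```c
/* Sketch: for each node that has CPUs, pick the closest cpu-less node
 * (by SLIT distance) as a hypothetical demotion target.  This merely
 * illustrates the heuristic under discussion in userspace terms. */
#include <numa.h>        /* link with -lnuma */
#include <stdio.h>

static int node_has_cpus(int node, struct bitmask *cpus)
{
	return numa_node_to_cpus(node, cpus) == 0 &&
	       numa_bitmask_weight(cpus) > 0;
}

int main(void)
{
	struct bitmask *cpus = numa_allocate_cpumask();
	int max = numa_max_node();

	for (int from = 0; from <= max; from++) {
		int best = -1, best_dist = 1 << 30;

		if (!node_has_cpus(from, cpus))
			continue;	/* only demote from CPU-bearing nodes */
		for (int to = 0; to <= max; to++) {
			if (to == from || node_has_cpus(to, cpus))
				continue;
			int d = numa_distance(from, to);
			if (d > 0 && d < best_dist) {
				best_dist = d;
				best = to;
			}
		}
		if (best >= 0)
			printf("node %d would demote to cpu-less node %d (distance %d)\n",
			       from, best, best_dist);
	}
	numa_free_cpumask(cpus);
	return 0;
}
```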
On 3/28/19 12:12 PM, Michal Hocko wrote:
> On Thu 28-03-19 11:58:57, Yang Shi wrote:
> [...]
> There are certainly many details to sort out. One thing is how to handle
> cpuless nodes (e.g. PMEM). Those shouldn't get any direct allocations
> without an explicit binding, right?

Wait a minute. I thought we were arguing about the default allocation node mask yesterday. And the conclusion was that PMEM nodes should not be excluded from the node mask. PMEM nodes are cpuless nodes. I think I should replace all "PMEM node" with "cpuless node" in the cover letter and commit logs to make this explicit.

Quoting Dan: "For ACPI platforms the HMAT is effectively going to enforce "cpu-less" nodes for any memory range that has differentiated performance from the conventional memory pool, or differentiated performance for a specific initiator."

I apologize that I didn't make it clear in the first place that PMEM nodes are cpuless nodes. Of course, a cpuless node may not be a PMEM node.

To your question, yes, I do agree. Actually, this is what I mean by "DRAM only by default", or I should rephrase it as "exclude cpuless nodes"; I thought they meant the same thing.

> My first naive idea would be to migrate-on-reclaim only from the
> preferred node. We might need additional heuristics but I wouldn't
> special case PMEM from other cpuless NUMA nodes.

If we exclude cpuless nodes, yes, the preferred node would be a DRAM node only. Actually, the patchset does follow "migrate-on-reclaim only from the preferred node".

Thanks,
Yang
On Thu 28-03-19 12:40:14, Yang Shi wrote:
> [...]
> Wait a minute. I thought we were arguing about the default allocation
> node mask yesterday. And the conclusion was that PMEM nodes should not
> be excluded from the node mask. PMEM nodes are cpuless nodes. I think I
> should replace all "PMEM node" with "cpuless node" in the cover letter
> and commit logs to make this explicit.

No, this is not about the default allocation mask at all. Your allocations start from a local/mempolicy node. CPUless nodes thus cannot be primary nodes, so they will always be only in a fallback zonelist without an explicit binding.