
[RFC,0/10] Another Approach to Use PMEM as NUMA Node

Message ID: 1553316275-21985-1-git-send-email-yang.shi@linux.alibaba.com

Message

Yang Shi March 23, 2019, 4:44 a.m. UTC
With Dave Hansen's patches merged into Linus's tree

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4

PMEM can now be hot-plugged as a NUMA node. But how to use PMEM as a NUMA
node effectively and efficiently is still an open question.

There have been a couple of proposals posted on the mailing list [1] [2].

This patchset tries an approach to using PMEM as NUMA nodes that differs
from proposal [1].

The approach is designed to follow these principles:

1. Use PMEM as a normal NUMA node: no special gfp flag, zone, zonelist, etc.

2. DRAM first/by default. No surprises for existing applications and default
runs. PMEM is not allocated from unless its node is specified explicitly by
NUMA policy. Some applications may not be very sensitive to memory latency,
so they can be placed on PMEM nodes and then have their hot pages promoted
to DRAM gradually.

3. Compatible with current NUMA policy semantics.

4. Don't assume hardware topology. The patchset does, however, still assume
a two-tier heterogeneous memory system. I understand that generalizing to
multi-tier heterogeneous memory has been discussed before, and I agree that
is preferable eventually, but the kernel doesn't have such a capability yet.
When HMAT is fully ready we could definitely extract the NUMA topology from
it.

5. Control memory allocation and hot/cold page promotion/demotion on a
per-VMA basis.

To achieve the above principles, the design can be summarized by the
following points:

1. Keep per-node global fallback zonelists (including both DRAM and PMEM
nodes), but use def_alloc_nodemask to exclude non-DRAM nodes from default
allocation unless they are specified by mempolicy. Currently the kernel can
only distinguish volatile from non-volatile memory, so the nodemask is just
built from the SRAT flag. In the future it may be better to build the
nodemask from more exposed hardware information, e.g. HMAT attributes, so
that it can be extended to multi-tier memory systems easily. (A sketch of
the idea follows this list.)

2. Introduce a new mempolicy, called MPOL_HYBRID, to keep the semantics of
the other mempolicies intact. We would like to have memory placement control
at per-process or even per-VMA granularity, so mempolicy is more suitable
than madvise. The new mempolicy is mainly used for launching processes on
PMEM nodes and then migrating hot pages to DRAM nodes via NUMA balancing.
MPOL_BIND could bind to PMEM nodes too, but migrating to DRAM nodes would
break its semantics, and MPOL_PREFERRED can't constrain the allocation to
PMEM nodes. So a new mempolicy is needed to fulfill this use case. (A usage
sketch follows this list.)

3. The new mempolicy promotes pages to DRAM via NUMA balancing. IMHO, the
kernel is not a good place to implement a sophisticated hot/cold page
detection algorithm, due to the complexity and overhead, but the kernel
should have such a capability. NUMA balancing seems like a good starting
point.

4. Promote pages that have been faulted twice. Use PG_promote to track
whether a page has been faulted twice. This is an optimization to NUMA
balancing that reduces migration thrashing and the overhead of migrating
from PMEM.

5. When DRAM is under memory pressure, demote pages to PMEM via the page
reclaim path. This is quite similar to the other proposals. NUMA balancing
will then promote a page back to DRAM once the page is referenced again.
The promotion/demotion still assumes two-tier main memory, and the demotion
may break mempolicy.

6. Anonymous pages only for the time being, since NUMA balancing can't
promote unmapped page cache.
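
To make design point 1 concrete, here is a minimal sketch (illustrative
only, not the actual patch; def_alloc_nodemask is the name used in patch
#1, the helper around it is made up):

nodemask_t def_alloc_nodemask;	/* DRAM-only nodes, built from SRAT at boot */

static nodemask_t *sketch_policy_nodemask(struct mempolicy *pol)
{
	/* An explicit policy (e.g. MPOL_BIND or the proposed MPOL_HYBRID)
	 * may still name PMEM nodes; the default policy never falls back
	 * to anything outside def_alloc_nodemask. */
	if (pol && pol->mode != MPOL_DEFAULT)
		return &pol->v.nodes;
	return &def_alloc_nodemask;
}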
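
And a hypothetical userspace usage of design point 2 (the MPOL_HYBRID
value and the PMEM node number below are assumptions; the real value would
come from the patched <linux/mempolicy.h>):

#include <numaif.h>
#include <stdio.h>

#define MPOL_HYBRID 6	/* assumed value for illustration only */

int main(void)
{
	unsigned long pmem_nodes = 1UL << 2;	/* assume node 2 is PMEM */

	/* Start on the PMEM node; NUMA balancing would then promote hot
	 * pages to DRAM over time. */
	if (set_mempolicy(MPOL_HYBRID, &pmem_nodes, 8 * sizeof(pmem_nodes)))
		perror("set_mempolicy");
	return 0;
}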

The patchset is still missing some pieces and is premature, but I would like
to post it to LKML to gather more feedback and comments, and to have more
eyes on it to make sure I'm on the right track.

Any comment is welcome.


TODO:

1. Promote page cache. There are a couple of ways to handle this in the
kernel, e.g. promote via the active LRU in the reclaim path on a PMEM node,
or promote in mark_page_accessed().

2. Promote/demote HugeTLB. HugeTLB pages are not on the LRU, so NUMA
balancing just skips them.

3. Possibly place kernel pages (e.g. page tables, slabs) on DRAM only.

4. Support the new mempolicy in userspace tools, e.g. numactl.


[1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/
[2]: https://lore.kernel.org/linux-mm/20190321200157.29678-1-keith.busch@intel.com/


Yang Shi (10):
      mm: control memory placement by nodemask for two tier main memory
      mm: mempolicy: introduce MPOL_HYBRID policy
      mm: mempolicy: promote page to DRAM for MPOL_HYBRID
      mm: numa: promote pages to DRAM when it is accessed twice
      mm: page_alloc: make find_next_best_node could skip DRAM node
      mm: vmscan: demote anon DRAM pages to PMEM node
      mm: vmscan: add page demotion counter
      mm: numa: add page promotion counter
      doc: add description for MPOL_HYBRID mode
      doc: elaborate the PMEM allocation rule

 Documentation/admin-guide/mm/numa_memory_policy.rst |  10 ++++
 Documentation/vm/numa.rst                           |   7 ++-
 arch/x86/mm/numa.c                                  |   1 +
 drivers/acpi/numa.c                                 |   8 +++
 include/linux/migrate.h                             |   1 +
 include/linux/mmzone.h                              |   3 ++
 include/linux/page-flags.h                          |   4 ++
 include/linux/vm_event_item.h                       |   3 ++
 include/linux/vmstat.h                              |   1 +
 include/trace/events/migrate.h                      |   3 +-
 include/trace/events/mmflags.h                      |   3 +-
 include/uapi/linux/mempolicy.h                      |   1 +
 mm/debug.c                                          |   1 +
 mm/huge_memory.c                                    |  14 ++++++
 mm/internal.h                                       |  33 ++++++++++++
 mm/memory.c                                         |  12 +++++
 mm/mempolicy.c                                      |  74 ++++++++++++++++++++++++---
 mm/page_alloc.c                                     |  33 +++++++++---
 mm/vmscan.c                                         | 113 +++++++++++++++++++++++++++++++++++-------
 mm/vmstat.c                                         |   3 ++
 20 files changed, 295 insertions(+), 33 deletions(-)

Comments

Brice Goglin March 25, 2019, 4:15 p.m. UTC | #1
On 23/03/2019 05:44, Yang Shi wrote:
> With Dave Hansen's patches merged into Linus's tree
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>
> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> effectively and efficiently is still a question. 
>
> There have been a couple of proposals posted on the mailing list [1] [2].
>
> The patchset is aimed to try a different approach from this proposal [1]
> to use PMEM as NUMA nodes.
>
> The approach is designed to follow the below principles:
>
> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
>
> 2. DRAM first/by default. No surprise to existing applications and default
> running. PMEM will not be allocated unless its node is specified explicitly
> by NUMA policy. Some applications may be not very sensitive to memory latency,
> so they could be placed on PMEM nodes then have hot pages promote to DRAM
> gradually.


I am not against the approach for some workloads. However, many HPC
people would rather do this manually. But there's currently no easy way
to find out from userspace whether a given NUMA node is DDR or PMEM*. We
have to assume HMAT is available (and correct) and look at performance
attributes. When talking to humans, it would be better to say "I
allocated on the local DDR NUMA node" rather than "I allocated on the
fastest node according to HMAT latency".

Also, when we'll have HBM+DDR, some applications may want to use DDR by
default, which means they want the *slowest* node according to HMAT (by
the way, will your hybrid policy work if we ever have HBM+DDR+PMEM?).
Performance attributes could help, but how does user-space know for sure
that X>Y will still mean HBM>DDR and not DDR>PMEM in 5 years?

It seems to me that exporting a flag in sysfs saying whether a node is
PMEM could be convenient. Patch series [1] exported a "type" in sysfs
node directories ("pmem" or "dram"). I don't know if there's an easy
way to define what HBM is and expose that type too.

Brice

* As far as I know, the only way is to look at all DAX devices until you
find the given NUMA node in the "target_node" attribute. If none, you're
likely not PMEM-backed.
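
For example, a small C sketch of that lookup (the sysfs layout is my
assumption):

#include <glob.h>
#include <stdio.h>

int main(void)
{
	glob_t g;
	char node[32];

	/* Each DAX device publishes, in its "target_node" attribute, the
	 * NUMA node its memory becomes when hot-plugged. */
	if (glob("/sys/bus/dax/devices/*/target_node", 0, NULL, &g) != 0)
		return 0;
	for (size_t i = 0; i < g.gl_pathc; i++) {
		FILE *f = fopen(g.gl_pathv[i], "r");

		if (f && fgets(node, sizeof(node), f))
			printf("%s: %s", g.gl_pathv[i], node);
		if (f)
			fclose(f);
	}
	globfree(&g);
	return 0;
}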


> [1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/
Dan Williams March 25, 2019, 4:56 p.m. UTC | #2
On Mon, Mar 25, 2019 at 9:15 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
>
> On 23/03/2019 05:44, Yang Shi wrote:
> > With Dave Hansen's patches merged into Linus's tree
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
> >
> > PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> > effectively and efficiently is still a question.
> >
> > There have been a couple of proposals posted on the mailing list [1] [2].
> >
> > The patchset is aimed to try a different approach from this proposal [1]
> > to use PMEM as NUMA nodes.
> >
> > The approach is designed to follow the below principles:
> >
> > 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
> >
> > 2. DRAM first/by default. No surprise to existing applications and default
> > running. PMEM will not be allocated unless its node is specified explicitly
> > by NUMA policy. Some applications may be not very sensitive to memory latency,
> > so they could be placed on PMEM nodes then have hot pages promote to DRAM
> > gradually.
>
>
> I am not against the approach for some workloads. However, many HPC
> people would rather do this manually. But there's currently no easy way
> to find out from userspace whether a given NUMA node is DDR or PMEM*. We
> have to assume HMAT is available (and correct) and look at performance
> attributes. When talking to humans, it would be better to say "I
> allocated on the local DDR NUMA node" rather than "I allocated on the
> fastest node according to HMAT latency".
>
> Also, when we'll have HBM+DDR, some applications may want to use DDR by
> default, which means they want the *slowest* node according to HMAT (by
> the way, will your hybrid policy work if we ever have HBM+DDR+PMEM?).
> Performance attributes could help, but how does user-space know for sure
> that X>Y will still mean HBM>DDR and not DDR>PMEM in 5 years?
>
> It seems to me that exporting a flag in sysfs saying whether a node is
> PMEM could be convenient. Patch series [1] exported a "type" in sysfs
> node directories ("pmem" or "dram"). I don't know how if there's an easy
> way to define what HBM is and expose that type too.

I'm generally against the concept that a "pmem" or "type" flag should
indicate anything about the expected performance of the address range.
The kernel should explicitly look to the HMAT for performance data and
not otherwise make type-based performance assumptions.
Brice Goglin March 25, 2019, 5:45 p.m. UTC | #3
On 25/03/2019 17:56, Dan Williams wrote:
>
> I'm generally against the concept that a "pmem" or "type" flag should
> indicate anything about the expected performance of the address range.
> The kernel should explicitly look to the HMAT for performance data and
> not otherwise make type-based performance assumptions.


Oh sorry, I didn't mean to have the kernel use such a flag to decide on
placement, but rather to expose more information to userspace to clarify
what all these nodes are about when userspace decides where to
allocate things.

I understand that current NVDIMM-F are not slower than DDR and HMAT
would better describe this than a flag. But I have seen so many buggy or
dummy SLIT tables in the past that I wonder if we can expect HMAT to be
widely available (and correct).

Is there a safe fallback in case of missing or buggy HMAT? For instance,
is DDR supposed to be listed before NVDIMM (or HBM) in SRAT?

Brice
Dan Williams March 25, 2019, 7:29 p.m. UTC | #4
On Mon, Mar 25, 2019 at 10:45 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
> On 25/03/2019 17:56, Dan Williams wrote:
> >
> > I'm generally against the concept that a "pmem" or "type" flag should
> > indicate anything about the expected performance of the address range.
> > The kernel should explicitly look to the HMAT for performance data and
> > not otherwise make type-based performance assumptions.
>
>
> Oh sorry, I didn't mean to have the kernel use such a flag to decide of
> placement, but rather to expose more information to userspace to clarify
> what all these nodes are about when userspace will decide where to
> allocate things.

I understand, but I'm concerned about the risk of userspace developing
vendor-specific, or generation-specific policies around a coarse type
identifier. I think the lack of type specificity is a feature rather
than a gap, because it requires userspace to consider deeper
information.

Perhaps "path" might be a suitable replacement identifier rather than
type. I.e. memory that originates from an ACPI.NFIT root device is
likely "pmem".

> I understand that current NVDIMM-F are not slower than DDR and HMAT
> would better describe this than a flag. But I have seen so many buggy or
> dummy SLIT tables in the past that I wonder if we can expect HMAT to be
> widely available (and correct).

There's always a fear that the platform BIOS will try to game OS
behavior. However, that was the reason that HMAT was defined to
indicate actual performance values rather than relative ones. It is
hopefully harder to game than the relative SLIT values, but I'll grant
you it's not impossible.

> Is there a safe fallback in case of missing or buggy HMAT? For instance,
> is DDR supposed to be listed before NVDIMM (or HBM) in SRAT?

One fallback might be to make some of these sysfs attributes writable
so userspace can correct the situation, but I'm otherwise unclear on
what you mean by "safe". If a platform has hard dependencies on
correctly enumerating memory performance capabilities then there's not
much the kernel can do if the HMAT is botched. I would expect the
general case to be that the performance capabilities are a soft
dependency, but things still work if the data is wrong.
Yang Shi March 25, 2019, 8:04 p.m. UTC | #5
On 3/25/19 9:15 AM, Brice Goglin wrote:
> On 23/03/2019 05:44, Yang Shi wrote:
>> With Dave Hansen's patches merged into Linus's tree
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>>
>> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
>> effectively and efficiently is still a question.
>>
>> There have been a couple of proposals posted on the mailing list [1] [2].
>>
>> The patchset is aimed to try a different approach from this proposal [1]
>> to use PMEM as NUMA nodes.
>>
>> The approach is designed to follow the below principles:
>>
>> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
>>
>> 2. DRAM first/by default. No surprise to existing applications and default
>> running. PMEM will not be allocated unless its node is specified explicitly
>> by NUMA policy. Some applications may be not very sensitive to memory latency,
>> so they could be placed on PMEM nodes then have hot pages promote to DRAM
>> gradually.
>
> I am not against the approach for some workloads. However, many HPC
> people would rather do this manually. But there's currently no easy way
> to find out from userspace whether a given NUMA node is DDR or PMEM*. We
> have to assume HMAT is available (and correct) and look at performance
> attributes. When talking to humans, it would be better to say "I
> allocated on the local DDR NUMA node" rather than "I allocated on the
> fastest node according to HMAT latency".

Yes, I agree we should have some information exposed to the kernel or 
userspace to tell which nodes are DRAM nodes and which are not (maybe HBM 
or PMEM). I assume the default allocation should end up on DRAM nodes for 
most workloads. If someone would like to control this manually, other than 
via mempolicy, the default allocation nodemask could be exported to 
userspace via sysfs so that it can be changed on demand.

>
> Also, when we'll have HBM+DDR, some applications may want to use DDR by
> default, which means they want the *slowest* node according to HMAT (by
> the way, will your hybrid policy work if we ever have HBM+DDR+PMEM?).
> Performance attributes could help, but how does user-space know for sure
> that X>Y will still mean HBM>DDR and not DDR>PMEM in 5 years?

This is what I mentioned above: we need the information exported from 
HMAT or something similar to tell us which nodes are DRAM nodes, since 
DRAM may be the lowest-tier memory.

Or we may be able to assume the nodes associated with CPUs are DRAM 
nodes, on the assumption that both HBM and PMEM are CPU-less nodes.

Thanks,
Yang

>
> It seems to me that exporting a flag in sysfs saying whether a node is
> PMEM could be convenient. Patch series [1] exported a "type" in sysfs
> node directories ("pmem" or "dram"). I don't know how if there's an easy
> way to define what HBM is and expose that type too.
>
> Brice
>
> * As far as I know, the only way is to look at all DAX devices until you
> find the given NUMA node in the "target_node" attribute. If none, you're
> likely not PMEM-backed.
>
>
>> [1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/
Brice Goglin March 25, 2019, 11:09 p.m. UTC | #6
On 25/03/2019 20:29, Dan Williams wrote:
> Perhaps "path" might be a suitable replacement identifier rather than
> type. I.e. memory that originates from an ACPI.NFIT root device is
> likely "pmem".


Could work.

What kind of "path" would we get for other types of memory? (DDR,
non-ACPI-based based PMEM if any, NVMe PMR?)

Thanks

Brice
Dan Williams March 25, 2019, 11:37 p.m. UTC | #7
On Mon, Mar 25, 2019 at 4:09 PM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
>
> On 25/03/2019 20:29, Dan Williams wrote:
> > Perhaps "path" might be a suitable replacement identifier rather than
> > type. I.e. memory that originates from an ACPI.NFIT root device is
> > likely "pmem".
>
>
> Could work.
>
> What kind of "path" would we get for other types of memory? (DDR,
> non-ACPI-based based PMEM if any, NVMe PMR?)

I think for memory that is described by the HMAT "Reservation hint",
and no other ACPI table, it would need to have "HMAT" in the path. For
anything not ACPI it gets easier because the path can be the parent
PCI device.
Jonathan Cameron March 26, 2019, 12:19 p.m. UTC | #8
On Mon, 25 Mar 2019 16:37:07 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> On Mon, Mar 25, 2019 at 4:09 PM Brice Goglin <Brice.Goglin@inria.fr> wrote:
> >
> >
> > On 25/03/2019 20:29, Dan Williams wrote:
> > > Perhaps "path" might be a suitable replacement identifier rather than
> > > type. I.e. memory that originates from an ACPI.NFIT root device is
> > > likely "pmem".  
> >
> >
> > Could work.
> >
> > What kind of "path" would we get for other types of memory? (DDR,
> > non-ACPI-based based PMEM if any, NVMe PMR?)  
> 
> I think for memory that is described by the HMAT "Reservation hint",
> and no other ACPI table, it would need to have "HMAT" in the path. For
> anything not ACPI it gets easier because the path can be the parent
> PCI device.
> 

There is no HMAT reservation hint in ACPI 6.3 - but there are other ways
of doing much the same thing so this is just a nitpick.

J
Michal Hocko March 26, 2019, 1:58 p.m. UTC | #9
On Sat 23-03-19 12:44:25, Yang Shi wrote:
> 
> With Dave Hansen's patches merged into Linus's tree
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
> 
> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> effectively and efficiently is still a question. 
> 
> There have been a couple of proposals posted on the mailing list [1] [2].
> 
> The patchset is aimed to try a different approach from this proposal [1]
> to use PMEM as NUMA nodes.
> 
> The approach is designed to follow the below principles:
> 
> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
> 
> 2. DRAM first/by default. No surprise to existing applications and default
> running. PMEM will not be allocated unless its node is specified explicitly
> by NUMA policy. Some applications may be not very sensitive to memory latency,
> so they could be placed on PMEM nodes then have hot pages promote to DRAM
> gradually.

Why are you pushing yourself into the corner right at the beginning? If
the PMEM is exported as a regular NUMA node then the only difference
should be performance characteristics (modulo durability, which shouldn't
play any role in this particular case, right?). Applications which are
already sensitive to memory access had better use proper binding already.
Some NUMA topologies might have quite large interconnect penalties
already. So this doesn't sound like an argument to me, TBH.

> 5. Control memory allocation and hot/cold pages promotion/demotion on per VMA
> basis.

What does that mean? Anon vs. file backed memory?

[...]

> 2. Introduce a new mempolicy, called MPOL_HYBRID to keep other mempolicy
> semantics intact. We would like to have memory placement control on per process
> or even per VMA granularity. So, mempolicy sounds more reasonable than madvise.
> The new mempolicy is mainly used for launching processes on PMEM nodes then
> migrate hot pages to DRAM nodes via NUMA balancing. MPOL_BIND could bind to
> PMEM nodes too, but migrating to DRAM nodes would just break the semantic of
> it. MPOL_PREFERRED can't constraint the allocation to PMEM nodes. So, it sounds
> a new mempolicy is needed to fulfill the usecase.

The above restriction pushes you to invent an API which is not really
trivial to get right and it seems quite artificial to me already.

> 3. The new mempolicy would promote pages to DRAM via NUMA balancing. IMHO, I
> don't think kernel is a good place to implement sophisticated hot/cold page
> distinguish algorithm due to the complexity and overhead. But, kernel should
> have such capability. NUMA balancing sounds like a good start point.

This is what the kernel does all the time. We call it memory reclaim.

> 4. Promote twice faulted page. Use PG_promote to track if a page is faulted
> twice. This is an optimization to NUMA balancing to reduce the migration
> thrashing and overhead for migrating from PMEM.

I am sorry, but page flags are an extremely scarce resource and a new
flag is extremely hard to get. On the other hand we already do have
use-twice detection for mapped page cache (see page_check_references). I
believe we can generalize that to anon pages as well.
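
A sketch of what that generalization might look like (should_numa_promote()
is a made-up name; the point is just reusing PG_referenced for use-twice
detection instead of a new flag):

static bool should_numa_promote(struct page *page)
{
	/* First NUMA-balancing fault: only set PG_referenced.
	 * Second fault while the bit is still set: worth migrating. */
	return TestSetPageReferenced(page);
}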

> 5. When DRAM has memory pressure, demote page to PMEM via page reclaim path.
> This is quite similar to other proposals. Then NUMA balancing will promote
> page to DRAM as long as the page is referenced again. But, the
> promotion/demotion still assumes two tier main memory. And, the demotion may
> break mempolicy.

Yes, this sounds like a good idea to me ;)

> 6. Anonymous page only for the time being since NUMA balancing can't promote
> unmapped page cache.

As long as nvdimm access is faster than regular storage, using any node
(including a pmem one) should be OK.
Yang Shi March 26, 2019, 6:33 p.m. UTC | #10
On 3/26/19 6:58 AM, Michal Hocko wrote:
> On Sat 23-03-19 12:44:25, Yang Shi wrote:
>> With Dave Hansen's patches merged into Linus's tree
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>>
>> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
>> effectively and efficiently is still a question.
>>
>> There have been a couple of proposals posted on the mailing list [1] [2].
>>
>> The patchset is aimed to try a different approach from this proposal [1]
>> to use PMEM as NUMA nodes.
>>
>> The approach is designed to follow the below principles:
>>
>> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
>>
>> 2. DRAM first/by default. No surprise to existing applications and default
>> running. PMEM will not be allocated unless its node is specified explicitly
>> by NUMA policy. Some applications may be not very sensitive to memory latency,
>> so they could be placed on PMEM nodes then have hot pages promote to DRAM
>> gradually.
> Why are you pushing yourself into the corner right at the beginning? If
> the PMEM is exported as a regular NUMA node then the only difference
> should be performance characteristics (module durability which shouldn't
> play any role in this particular case, right?). Applications which are
> already sensitive to memory access should better use proper binding already.
> Some NUMA topologies might have quite a large interconnect penalties
> already. So this doesn't sound like an argument to me, TBH.

The major rationale behind this is that we assume most applications 
should be sensitive to memory access, particularly for meeting the SLA. 
The applications running on the machine may be unknown to us; they may be 
sensitive or non-sensitive. But assuming they are sensitive to memory 
access is safer from the SLA point of view. Then the "cold" pages can be 
demoted to PMEM nodes by the kernel's memory reclaim or other tools 
without impairing the SLA.

If the applications are not sensitive to memory access, they can be 
bound to PMEM, or allowed to use PMEM explicitly (with allocation on DRAM 
as nice-to-have); then the "hot" pages can be promoted to DRAM.

>
>> 5. Control memory allocation and hot/cold pages promotion/demotion on per VMA
>> basis.
> What does that mean? Anon vs. file backed memory?

Yes, kind of. Basically, we would like to control memory placement and 
promotion (by NUMA balancing) on a per-VMA basis. For example, anon VMAs 
may be DRAM by default while file-backed VMAs may be PMEM by default. 
Anyway, basically this is achieved freely by mempolicy.

>
> [...]
>
>> 2. Introduce a new mempolicy, called MPOL_HYBRID to keep other mempolicy
>> semantics intact. We would like to have memory placement control on per process
>> or even per VMA granularity. So, mempolicy sounds more reasonable than madvise.
>> The new mempolicy is mainly used for launching processes on PMEM nodes then
>> migrate hot pages to DRAM nodes via NUMA balancing. MPOL_BIND could bind to
>> PMEM nodes too, but migrating to DRAM nodes would just break the semantic of
>> it. MPOL_PREFERRED can't constraint the allocation to PMEM nodes. So, it sounds
>> a new mempolicy is needed to fulfill the usecase.
> The above restriction pushes you to invent an API which is not really
> trivial to get right and it seems quite artificial to me already.

First of all, the use case is that some applications may not be that 
sensitive to memory access, or are willing to achieve a net win by trading 
some performance to save some cost (keeping some memory on PMEM). Such 
applications may be bound to PMEM in the first place, then have hot pages 
promoted to DRAM via NUMA balancing or some other mechanism.

Neither MPOL_BIND nor MPOL_PREFERRED fits this use case naturally.

Secondly, it looks like only the default policy gets NUMA balancing. Once 
the policy is changed to MPOL_BIND, NUMA balancing no longer chimes in.

So, I invented the new mempolicy.

>
>> 3. The new mempolicy would promote pages to DRAM via NUMA balancing. IMHO, I
>> don't think kernel is a good place to implement sophisticated hot/cold page
>> distinguish algorithm due to the complexity and overhead. But, kernel should
>> have such capability. NUMA balancing sounds like a good start point.
> This is what the kernel does all the time. We call it memory reclaim.
>
>> 4. Promote twice faulted page. Use PG_promote to track if a page is faulted
>> twice. This is an optimization to NUMA balancing to reduce the migration
>> thrashing and overhead for migrating from PMEM.
> I am sorry, but page flags are an extremely scarce resource and a new
> flag is extremely hard to get. On the other hand we already do have
> use-twice detection for mapped page cache (see page_check_references). I
> believe we can generalize that to anon pages as well.

Yes, I agree. A new page flag is not preferred. I'm going to take a 
look at page_check_references().

>
>> 5. When DRAM has memory pressure, demote page to PMEM via page reclaim path.
>> This is quite similar to other proposals. Then NUMA balancing will promote
>> page to DRAM as long as the page is referenced again. But, the
>> promotion/demotion still assumes two tier main memory. And, the demotion may
>> break mempolicy.
> Yes, this sounds like a good idea to me ;)
>
>> 6. Anonymous page only for the time being since NUMA balancing can't promote
>> unmapped page cache.
> As long as the nvdimm access is faster than the regular storage then
> using any node (including pmem one) should be OK.

However, it still sounds better to keep some frequently accessed page 
cache on DRAM.

Thanks,
Yang
Michal Hocko March 26, 2019, 6:37 p.m. UTC | #11
On Tue 26-03-19 11:33:17, Yang Shi wrote:
> 
> 
> On 3/26/19 6:58 AM, Michal Hocko wrote:
> > On Sat 23-03-19 12:44:25, Yang Shi wrote:
> > > With Dave Hansen's patches merged into Linus's tree
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
> > > 
> > > PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> > > effectively and efficiently is still a question.
> > > 
> > > There have been a couple of proposals posted on the mailing list [1] [2].
> > > 
> > > The patchset is aimed to try a different approach from this proposal [1]
> > > to use PMEM as NUMA nodes.
> > > 
> > > The approach is designed to follow the below principles:
> > > 
> > > 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
> > > 
> > > 2. DRAM first/by default. No surprise to existing applications and default
> > > running. PMEM will not be allocated unless its node is specified explicitly
> > > by NUMA policy. Some applications may be not very sensitive to memory latency,
> > > so they could be placed on PMEM nodes then have hot pages promote to DRAM
> > > gradually.
> > Why are you pushing yourself into the corner right at the beginning? If
> > the PMEM is exported as a regular NUMA node then the only difference
> > should be performance characteristics (module durability which shouldn't
> > play any role in this particular case, right?). Applications which are
> > already sensitive to memory access should better use proper binding already.
> > Some NUMA topologies might have quite a large interconnect penalties
> > already. So this doesn't sound like an argument to me, TBH.
> 
> The major rationale behind this is we assume the most applications should be
> sensitive to memory access, particularly for meeting the SLA. The
> applications run on the machine may be agnostic to us, they may be sensitive
> or non-sensitive. But, assuming they are sensitive to memory access sounds
> safer from SLA point of view. Then the "cold" pages could be demoted to PMEM
> nodes by kernel's memory reclaim or other tools without impairing the SLA.
> 
> If the applications are not sensitive to memory access, they could be bound
> to PMEM or allowed to use PMEM (nice to have allocation on DRAM) explicitly,
> then the "hot" pages could be promoted to DRAM.

Again, how is this different from NUMA in general?
Yang Shi March 27, 2019, 2:58 a.m. UTC | #12
On 3/26/19 11:37 AM, Michal Hocko wrote:
> On Tue 26-03-19 11:33:17, Yang Shi wrote:
>>
>> On 3/26/19 6:58 AM, Michal Hocko wrote:
>>> On Sat 23-03-19 12:44:25, Yang Shi wrote:
>>>> With Dave Hansen's patches merged into Linus's tree
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>>>>
>>>> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
>>>> effectively and efficiently is still a question.
>>>>
>>>> There have been a couple of proposals posted on the mailing list [1] [2].
>>>>
>>>> The patchset is aimed to try a different approach from this proposal [1]
>>>> to use PMEM as NUMA nodes.
>>>>
>>>> The approach is designed to follow the below principles:
>>>>
>>>> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
>>>>
>>>> 2. DRAM first/by default. No surprise to existing applications and default
>>>> running. PMEM will not be allocated unless its node is specified explicitly
>>>> by NUMA policy. Some applications may be not very sensitive to memory latency,
>>>> so they could be placed on PMEM nodes then have hot pages promote to DRAM
>>>> gradually.
>>> Why are you pushing yourself into the corner right at the beginning? If
>>> the PMEM is exported as a regular NUMA node then the only difference
>>> should be performance characteristics (module durability which shouldn't
>>> play any role in this particular case, right?). Applications which are
>>> already sensitive to memory access should better use proper binding already.
>>> Some NUMA topologies might have quite a large interconnect penalties
>>> already. So this doesn't sound like an argument to me, TBH.
>> The major rationale behind this is we assume the most applications should be
>> sensitive to memory access, particularly for meeting the SLA. The
>> applications run on the machine may be agnostic to us, they may be sensitive
>> or non-sensitive. But, assuming they are sensitive to memory access sounds
>> safer from SLA point of view. Then the "cold" pages could be demoted to PMEM
>> nodes by kernel's memory reclaim or other tools without impairing the SLA.
>>
>> If the applications are not sensitive to memory access, they could be bound
>> to PMEM or allowed to use PMEM (nice to have allocation on DRAM) explicitly,
>> then the "hot" pages could be promoted to DRAM.
> Again, how is this different from NUMA in general?

It is still NUMA; users can still see all the NUMA nodes.

Patch #1 introduces a default allocation nodemask to control memory 
placement. Typically, the nodemask just includes DRAM nodes; PMEM nodes 
are excluded from memory allocation by the nodemask.

The nodemask can be overridden by the user, per the discussion with Dan.

Thanks,
Yang
Michal Hocko March 27, 2019, 9:01 a.m. UTC | #13
On Tue 26-03-19 19:58:56, Yang Shi wrote:
> 
> 
> On 3/26/19 11:37 AM, Michal Hocko wrote:
> > On Tue 26-03-19 11:33:17, Yang Shi wrote:
> > > 
> > > On 3/26/19 6:58 AM, Michal Hocko wrote:
> > > > On Sat 23-03-19 12:44:25, Yang Shi wrote:
> > > > > With Dave Hansen's patches merged into Linus's tree
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
> > > > > 
> > > > > PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> > > > > effectively and efficiently is still a question.
> > > > > 
> > > > > There have been a couple of proposals posted on the mailing list [1] [2].
> > > > > 
> > > > > The patchset is aimed to try a different approach from this proposal [1]
> > > > > to use PMEM as NUMA nodes.
> > > > > 
> > > > > The approach is designed to follow the below principles:
> > > > > 
> > > > > 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
> > > > > 
> > > > > 2. DRAM first/by default. No surprise to existing applications and default
> > > > > running. PMEM will not be allocated unless its node is specified explicitly
> > > > > by NUMA policy. Some applications may be not very sensitive to memory latency,
> > > > > so they could be placed on PMEM nodes then have hot pages promote to DRAM
> > > > > gradually.
> > > > Why are you pushing yourself into the corner right at the beginning? If
> > > > the PMEM is exported as a regular NUMA node then the only difference
> > > > should be performance characteristics (module durability which shouldn't
> > > > play any role in this particular case, right?). Applications which are
> > > > already sensitive to memory access should better use proper binding already.
> > > > Some NUMA topologies might have quite a large interconnect penalties
> > > > already. So this doesn't sound like an argument to me, TBH.
> > > The major rationale behind this is we assume the most applications should be
> > > sensitive to memory access, particularly for meeting the SLA. The
> > > applications run on the machine may be agnostic to us, they may be sensitive
> > > or non-sensitive. But, assuming they are sensitive to memory access sounds
> > > safer from SLA point of view. Then the "cold" pages could be demoted to PMEM
> > > nodes by kernel's memory reclaim or other tools without impairing the SLA.
> > > 
> > > If the applications are not sensitive to memory access, they could be bound
> > > to PMEM or allowed to use PMEM (nice to have allocation on DRAM) explicitly,
> > > then the "hot" pages could be promoted to DRAM.
> > Again, how is this different from NUMA in general?
> 
> It is still NUMA, users still can see all the NUMA nodes.

No, the Linux NUMA implementation makes all NUMA nodes available by default
and provides an API to opt in to finer tuning. What you are
suggesting goes against that semantic and I am asking why. How is a pmem
NUMA node any different from any other distant node in principle?
Dan Williams March 27, 2019, 5:34 p.m. UTC | #14
On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 26-03-19 19:58:56, Yang Shi wrote:
> >
> >
> > On 3/26/19 11:37 AM, Michal Hocko wrote:
> > > On Tue 26-03-19 11:33:17, Yang Shi wrote:
> > > >
> > > > On 3/26/19 6:58 AM, Michal Hocko wrote:
> > > > > On Sat 23-03-19 12:44:25, Yang Shi wrote:
> > > > > > With Dave Hansen's patches merged into Linus's tree
> > > > > >
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
> > > > > >
> > > > > > PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> > > > > > effectively and efficiently is still a question.
> > > > > >
> > > > > > There have been a couple of proposals posted on the mailing list [1] [2].
> > > > > >
> > > > > > The patchset is aimed to try a different approach from this proposal [1]
> > > > > > to use PMEM as NUMA nodes.
> > > > > >
> > > > > > The approach is designed to follow the below principles:
> > > > > >
> > > > > > 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
> > > > > >
> > > > > > 2. DRAM first/by default. No surprise to existing applications and default
> > > > > > running. PMEM will not be allocated unless its node is specified explicitly
> > > > > > by NUMA policy. Some applications may be not very sensitive to memory latency,
> > > > > > so they could be placed on PMEM nodes then have hot pages promote to DRAM
> > > > > > gradually.
> > > > > Why are you pushing yourself into the corner right at the beginning? If
> > > > > the PMEM is exported as a regular NUMA node then the only difference
> > > > > should be performance characteristics (module durability which shouldn't
> > > > > play any role in this particular case, right?). Applications which are
> > > > > already sensitive to memory access should better use proper binding already.
> > > > > Some NUMA topologies might have quite a large interconnect penalties
> > > > > already. So this doesn't sound like an argument to me, TBH.
> > > > The major rationale behind this is we assume the most applications should be
> > > > sensitive to memory access, particularly for meeting the SLA. The
> > > > applications run on the machine may be agnostic to us, they may be sensitive
> > > > or non-sensitive. But, assuming they are sensitive to memory access sounds
> > > > safer from SLA point of view. Then the "cold" pages could be demoted to PMEM
> > > > nodes by kernel's memory reclaim or other tools without impairing the SLA.
> > > >
> > > > If the applications are not sensitive to memory access, they could be bound
> > > > to PMEM or allowed to use PMEM (nice to have allocation on DRAM) explicitly,
> > > > then the "hot" pages could be promoted to DRAM.
> > > Again, how is this different from NUMA in general?
> >
> > It is still NUMA, users still can see all the NUMA nodes.
>
> No, Linux NUMA implementation makes all numa nodes available by default
> and provides an API to opt-in for more fine tuning. What you are
> suggesting goes against that semantic and I am asking why. How is pmem
> NUMA node any different from any any other distant node in principle?

Agree. It's just another NUMA node and shouldn't be special cased.
Userspace policy can choose to avoid it, but typical node distance
preference should otherwise let the kernel fall back to it as
additional memory pressure relief for "near" memory.
Yang Shi March 27, 2019, 6:59 p.m. UTC | #15
On 3/27/19 10:34 AM, Dan Williams wrote:
> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
>> On Tue 26-03-19 19:58:56, Yang Shi wrote:
>>>
>>> On 3/26/19 11:37 AM, Michal Hocko wrote:
>>>> On Tue 26-03-19 11:33:17, Yang Shi wrote:
>>>>> On 3/26/19 6:58 AM, Michal Hocko wrote:
>>>>>> On Sat 23-03-19 12:44:25, Yang Shi wrote:
>>>>>>> With Dave Hansen's patches merged into Linus's tree
>>>>>>>
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>>>>>>>
>>>>>>> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
>>>>>>> effectively and efficiently is still a question.
>>>>>>>
>>>>>>> There have been a couple of proposals posted on the mailing list [1] [2].
>>>>>>>
>>>>>>> The patchset is aimed to try a different approach from this proposal [1]
>>>>>>> to use PMEM as NUMA nodes.
>>>>>>>
>>>>>>> The approach is designed to follow the below principles:
>>>>>>>
>>>>>>> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
>>>>>>>
>>>>>>> 2. DRAM first/by default. No surprise to existing applications and default
>>>>>>> running. PMEM will not be allocated unless its node is specified explicitly
>>>>>>> by NUMA policy. Some applications may be not very sensitive to memory latency,
>>>>>>> so they could be placed on PMEM nodes then have hot pages promote to DRAM
>>>>>>> gradually.
>>>>>> Why are you pushing yourself into the corner right at the beginning? If
>>>>>> the PMEM is exported as a regular NUMA node then the only difference
>>>>>> should be performance characteristics (module durability which shouldn't
>>>>>> play any role in this particular case, right?). Applications which are
>>>>>> already sensitive to memory access should better use proper binding already.
>>>>>> Some NUMA topologies might have quite a large interconnect penalties
>>>>>> already. So this doesn't sound like an argument to me, TBH.
>>>>> The major rationale behind this is we assume the most applications should be
>>>>> sensitive to memory access, particularly for meeting the SLA. The
>>>>> applications run on the machine may be agnostic to us, they may be sensitive
>>>>> or non-sensitive. But, assuming they are sensitive to memory access sounds
>>>>> safer from SLA point of view. Then the "cold" pages could be demoted to PMEM
>>>>> nodes by kernel's memory reclaim or other tools without impairing the SLA.
>>>>>
>>>>> If the applications are not sensitive to memory access, they could be bound
>>>>> to PMEM or allowed to use PMEM (nice to have allocation on DRAM) explicitly,
>>>>> then the "hot" pages could be promoted to DRAM.
>>>> Again, how is this different from NUMA in general?
>>> It is still NUMA, users still can see all the NUMA nodes.
>> No, Linux NUMA implementation makes all numa nodes available by default
>> and provides an API to opt-in for more fine tuning. What you are
>> suggesting goes against that semantic and I am asking why. How is pmem
>> NUMA node any different from any any other distant node in principle?
> Agree. It's just another NUMA node and shouldn't be special cased.
> Userspace policy can choose to avoid it, but typical node distance
> preference should otherwise let the kernel fall back to it as
> additional memory pressure relief for "near" memory.

In the ideal case, yes, I agree. However, in the real world performance 
is a concern. It is well known that PMEM (not considering NVDIMM-F or 
HBM) has higher latency and lower bandwidth. We observed much higher 
latency on PMEM than on DRAM with multiple threads.

In a real production environment we don't know what kind of applications 
will end up on PMEM (DRAM may be full, and allocations fall back to PMEM) 
and then see unexpected performance degradation. I understand mempolicy 
lets an application choose to avoid it. But there might be hundreds or 
thousands of applications running on the machine; it doesn't sound 
feasible to me to have every single application set a mempolicy to avoid it.

So, I think we still need a default allocation nodemask. The default 
value may include all nodes or just DRAM nodes. But it should be possible 
for the user to override it globally, not only on a per-process basis.

Due to the performance disparity, our current use cases treat PMEM as 
second-tier memory for demoting cold pages, or for binding applications 
that are not sensitive to memory access (this is the reason for inventing 
a new mempolicy), although it is a NUMA node.

Thanks,
Yang
Michal Hocko March 27, 2019, 8:09 p.m. UTC | #16
On Wed 27-03-19 11:59:28, Yang Shi wrote:
> 
> 
> On 3/27/19 10:34 AM, Dan Williams wrote:
> > On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > On Tue 26-03-19 19:58:56, Yang Shi wrote:
[...]
> > > > It is still NUMA, users still can see all the NUMA nodes.
> > > No, Linux NUMA implementation makes all numa nodes available by default
> > > and provides an API to opt-in for more fine tuning. What you are
> > > suggesting goes against that semantic and I am asking why. How is pmem
> > > NUMA node any different from any any other distant node in principle?
> > Agree. It's just another NUMA node and shouldn't be special cased.
> > Userspace policy can choose to avoid it, but typical node distance
> > preference should otherwise let the kernel fall back to it as
> > additional memory pressure relief for "near" memory.
> 
> In ideal case, yes, I agree. However, in real life world the performance is
> a concern. It is well-known that PMEM (not considering NVDIMM-F or HBM) has
> higher latency and lower bandwidth. We observed much higher latency on PMEM
> than DRAM with multi threads.

One rule of thumb is: do not design user-visible interfaces based on
contemporary technology and its up/down sides. This will almost always
fire back.

Btw. you keep arguing about performance without presenting any numbers.
Can you present something specific?

> In real production environment we don't know what kind of applications would
> end up on PMEM (DRAM may be full, allocation fall back to PMEM) then have
> unexpected performance degradation. I understand to have mempolicy to choose
> to avoid it. But, there might be hundreds or thousands of applications
> running on the machine, it sounds not that feasible to me to have each
> single application set mempolicy to avoid it.

we have cpuset cgroup controller to help here.

> So, I think we still need a default allocation node mask. The default value
> may include all nodes or just DRAM nodes. But, they should be able to be
> override by user globally, not only per process basis.
> 
> Due to the performance disparity, currently our usecases treat PMEM as
> second tier memory for demoting cold page or binding to not memory access
> sensitive applications (this is the reason for inventing a new mempolicy)
> although it is a NUMA node.

If the performance sucks that badly then do not use the pmem as NUMA,
really. There are certainly other ways to export the pmem storage. Use
it as a fast swap storage. Or try to work on a swap caching mechanism
that still allows much faster access than a slow swap storage. But do
not abuse the NUMA interface while breaking some of its long-established
semantics.
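
A sketch of the cpuset route mentioned above (the cgroup path and node ids
are assumptions):

#include <stdio.h>

int main(void)
{
	/* Confine an application cgroup to the assumed DRAM nodes 0-1;
	 * tasks in that cgroup then never fall back to PMEM nodes. */
	FILE *f = fopen("/sys/fs/cgroup/cpuset/app/cpuset.mems", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "0-1\n");
	return fclose(f) ? 1 : 0;
}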
Dave Hansen March 27, 2019, 8:14 p.m. UTC | #17
On 3/27/19 11:59 AM, Yang Shi wrote:
> In real production environment we don't know what kind of applications
> would end up on PMEM (DRAM may be full, allocation fall back to PMEM)
> then have unexpected performance degradation. I understand to have
> mempolicy to choose to avoid it. But, there might be hundreds or
> thousands of applications running on the machine, it sounds not that
> feasible to me to have each single application set mempolicy to avoid it.

Maybe not manually, but it's entirely possible to automate this.

It would be trivial to get help from an orchestrator, or even systemd, to
get apps launched with a particular policy.  Or even a *shell* that
launches apps could apply a particular policy.
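
A sketch of such a launcher (the DRAM node id is an assumption; link with
-lnuma; the mempolicy is preserved across the execve):

#include <numa.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct bitmask *dram;

	if (argc < 2 || numa_available() < 0)
		return 1;
	dram = numa_allocate_nodemask();
	numa_bitmask_setbit(dram, 0);	/* assume node 0 is the DRAM node */
	numa_set_membind(dram);		/* bind allocations to DRAM */
	execvp(argv[1], argv + 1);
	perror("execvp");
	return 1;
}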
Matthew Wilcox March 27, 2019, 8:35 p.m. UTC | #18
On Wed, Mar 27, 2019 at 10:34:11AM -0700, Dan Williams wrote:
> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
> > No, Linux NUMA implementation makes all numa nodes available by default
> > and provides an API to opt-in for more fine tuning. What you are
> > suggesting goes against that semantic and I am asking why. How is pmem
> > NUMA node any different from any any other distant node in principle?
> 
> Agree. It's just another NUMA node and shouldn't be special cased.
> Userspace policy can choose to avoid it, but typical node distance
> preference should otherwise let the kernel fall back to it as
> additional memory pressure relief for "near" memory.

I think this is sort of true, but sort of different.  These are
essentially CPU-less nodes; there is no CPU for which they are
fast memory.  Yes, they're further from some CPUs than from others.
I have never paid attention to how Linux treats CPU-less memory nodes,
but it would make sense to me if we don't default to allocating from
remote nodes.  And treating pmem nodes as being remote from all CPUs
makes a certain amount of sense to me.

eg on a four CPU-socket system, consider this as being

pmem1 --- node1 --- node2 --- pmem2
            |   \ /   |
            |    X    |
            |   / \   |
pmem3 --- node3 --- node4 --- pmem4

which I could actually see someone building with normal DRAM, and we
should probably handle the same way as pmem; for a process running on
node3, allocate preferentially from node3, then pmem3, then other nodes,
then other pmems.
Dave Hansen March 27, 2019, 8:40 p.m. UTC | #19
On 3/27/19 1:35 PM, Matthew Wilcox wrote:
> 
> pmem1 --- node1 --- node2 --- pmem2
>             |   \ /   |
>             |    X    |
>             |   / \   |
> pmem3 --- node3 --- node4 --- pmem4
> 
> which I could actually see someone building with normal DRAM, and we
> should probably handle the same way as pmem; for a process running on
> node3, allocate preferentially from node3, then pmem3, then other nodes,
> then other pmems.

That makes sense.  But, it might _also_ make sense to fill up all DRAM
first before using any pmem.  That could happen if the NUMA interconnect
is really fast and pmem is really slow.

Basically, with the current patches we are depending on the firmware to
"nicely" enumerate the topology and we're keeping the behavior that we
end up with, for now, whatever it might be.

Now, let's sit back and see how nice the firmware is. :)
Yang Shi March 28, 2019, 2:09 a.m. UTC | #20
On 3/27/19 1:09 PM, Michal Hocko wrote:
> On Wed 27-03-19 11:59:28, Yang Shi wrote:
>>
>> On 3/27/19 10:34 AM, Dan Williams wrote:
>>> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
>>>> On Tue 26-03-19 19:58:56, Yang Shi wrote:
> [...]
>>>>> It is still NUMA, users still can see all the NUMA nodes.
>>>> No, Linux NUMA implementation makes all numa nodes available by default
>>>> and provides an API to opt-in for more fine tuning. What you are
>>>> suggesting goes against that semantic and I am asking why. How is pmem
>>>> NUMA node any different from any any other distant node in principle?
>>> Agree. It's just another NUMA node and shouldn't be special cased.
>>> Userspace policy can choose to avoid it, but typical node distance
>>> preference should otherwise let the kernel fall back to it as
>>> additional memory pressure relief for "near" memory.
>> In ideal case, yes, I agree. However, in real life world the performance is
>> a concern. It is well-known that PMEM (not considering NVDIMM-F or HBM) has
>> higher latency and lower bandwidth. We observed much higher latency on PMEM
>> than DRAM with multi threads.
> One rule of thumb is: Do not design user visible interfaces based on the
> contemporary technology and its up/down sides. This will almost always
> fire back.

Thanks. It does make sense to me.

>
> Btw. if you keep arguing about performance without any numbers. Can you
> present something specific?

Yes, I did have some numbers. We ran a simple sequential memory 
read/write latency test with an in-house test program on PMEM (bound to 
PMEM) and on DRAM (bound to DRAM). When running with 20 threads the 
results were as below:

              Threads    write latency    read latency
    PMEM      20         537.15           68.06
    DRAM      20         14.19            6.47

And, sysbench test with command: sysbench --time=600 memory 
--memory-block-size=8G --memory-total-size=1024T --memory-scope=global 
--memory-oper=read --memory-access-mode=rnd --rand-type=gaussian 
--rand-pareto-h=0.1 --threads=1 run

The result is:
                 lat/ms
    PMEM     103766.09
    DRAM      31946.30

>
>> In real production environment we don't know what kind of applications would
>> end up on PMEM (DRAM may be full, allocation fall back to PMEM) then have
>> unexpected performance degradation. I understand to have mempolicy to choose
>> to avoid it. But, there might be hundreds or thousands of applications
>> running on the machine, it sounds not that feasible to me to have each
>> single application set mempolicy to avoid it.
> we have cpuset cgroup controller to help here.
>
>> So, I think we still need a default allocation node mask. The default value
>> may include all nodes or just DRAM nodes. But, they should be able to be
>> override by user globally, not only per process basis.
>>
>> Due to the performance disparity, currently our usecases treat PMEM as
>> second tier memory for demoting cold page or binding to not memory access
>> sensitive applications (this is the reason for inventing a new mempolicy)
>> although it is a NUMA node.
> If the performance sucks that badly then do not use the pmem as NUMA,
> really. There are certainly other ways to export the pmem storage. Use
> it as a fast swap storage. Or try to work on a swap caching mechanism
> that still allows much faster access than a slow swap storage. But do
> not try to pretend to abuse the NUMA interface while you are breaking
> some of its long term established semantics.

Yes, we are looking into using it as fast swap storage too, and perhaps 
other use cases.

Anyway, since nobody thinks it makes sense to restrict the default 
allocation nodes, and it sounds over-engineered, I'm going to drop it.

One question: when doing demotion and promotion we need to define a path, 
for example DRAM <-> PMEM (assuming two-tier memory). When determining 
which nodes are "DRAM" nodes, does it make sense to assume the nodes with 
both CPU and memory are DRAM nodes, since PMEM nodes are typically 
CPU-less nodes?
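
In kernel terms that heuristic could be a sketch like this (is_dram_node()
is a made-up name):

static bool is_dram_node(int nid)
{
	/* DRAM nodes have local CPUs; PMEM (and possibly HBM) nodes
	 * are typically CPU-less. */
	return !cpumask_empty(cpumask_of_node(nid));
}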

Thanks,
Yang
Michal Hocko March 28, 2019, 6:58 a.m. UTC | #21
On Wed 27-03-19 19:09:10, Yang Shi wrote:
> One question, when doing demote and promote we need define a path, for
> example, DRAM <-> PMEM (assume two tier memory). When determining what nodes
> are "DRAM" nodes, does it make sense to assume the nodes with both cpu and
> memory are DRAM nodes since PMEM nodes are typically cpuless nodes?

Do we really have to special-case this for PMEM? Why can't we simply go
in the zonelist order? In other words, why can't we use the same logic
for a larger NUMA machine and, instead of swapping, simply fall back to a
less contended NUMA node? It could be a regular DRAM, PMEM or whatever
other type of memory node.
Dan Williams March 28, 2019, 8:21 a.m. UTC | #22
On Wed, Mar 27, 2019 at 7:09 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
> On 3/27/19 1:09 PM, Michal Hocko wrote:
> > On Wed 27-03-19 11:59:28, Yang Shi wrote:
> >>
> >> On 3/27/19 10:34 AM, Dan Williams wrote:
> >>> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
> >>>> On Tue 26-03-19 19:58:56, Yang Shi wrote:
> > [...]
> >>>>> It is still NUMA; users can still see all the NUMA nodes.
> >>>> No, the Linux NUMA implementation makes all NUMA nodes available by default
> >>>> and provides an API to opt in for finer tuning. What you are
> >>>> suggesting goes against that semantic and I am asking why. How is a pmem
> >>>> NUMA node any different from any other distant node in principle?
> >>> Agree. It's just another NUMA node and shouldn't be special cased.
> >>> Userspace policy can choose to avoid it, but typical node distance
> >>> preference should otherwise let the kernel fall back to it as
> >>> additional memory pressure relief for "near" memory.
> >> In the ideal case, yes, I agree. However, in the real world the performance
> >> is a concern. It is well known that PMEM (not considering NVDIMM-F or HBM)
> >> has higher latency and lower bandwidth. We observed much higher latency on
> >> PMEM than on DRAM with multiple threads.
> > One rule of thumb is: Do not design user-visible interfaces based on the
> > contemporary technology and its up/down sides. This will almost always
> > backfire.
>
> Thanks. It does make sense to me.
>
> >
> > Btw. if you keep arguing about performance without any numbers. Can you
> > present something specific?
>
> Yes, I do have some numbers. We ran a simple sequential memory read/write
> latency test with an in-house test program, bound to PMEM and to DRAM
> respectively. With 20 threads the results are as below:
>
>           Threads    w/lat     r/lat
> PMEM      20         537.15    68.06
> DRAM      20         14.19     6.47
>
> And a sysbench test with the command: sysbench --time=600 memory
> --memory-block-size=8G --memory-total-size=1024T --memory-scope=global
> --memory-oper=read --memory-access-mode=rnd --rand-type=gaussian
> --rand-pareto-h=0.1 --threads=1 run
>
> The result is:
>           lat/ms
> PMEM      103766.09
> DRAM      31946.30
>
> >
> >> In a real production environment we don't know what kind of applications
> >> would end up on PMEM (DRAM may be full and allocations fall back to PMEM)
> >> and then see unexpected performance degradation. I understand mempolicy
> >> can be used to avoid that. But there might be hundreds or thousands of
> >> applications running on the machine; it doesn't sound feasible to me to
> >> have every single application set a mempolicy to avoid it.
> > we have cpuset cgroup controller to help here.
> >
> >> So, I think we still need a default allocation node mask. The default
> >> value may include all nodes or just DRAM nodes. But it should be possible
> >> for the user to override it globally, not only on a per-process basis.
> >>
> >> Due to the performance disparity, our current use cases treat PMEM as
> >> second-tier memory, for demoting cold pages or for binding applications
> >> that are not sensitive to memory access latency (this is the reason for
> >> inventing a new mempolicy), although it is a NUMA node.
> > If the performance sucks that badly then do not use the pmem as NUMA,
> > really. There are certainly other ways to export the pmem storage. Use
> > it as fast swap storage. Or try to work on a swap caching mechanism
> > that still allows much faster access than slow swap storage. But do
> > not abuse the NUMA interface while breaking some of its long-established
> > semantics.
>
> Yes, we are looking into using it as fast swap storage too, and perhaps
> other use cases.
>
> Anyway, since nobody else thought it makes sense to restrict the default
> allocation nodes, and it does sound over-engineered, I'm going to drop it.
>
> One question: when doing demotion and promotion we need to define a path,
> for example DRAM <-> PMEM (assuming two-tier memory). When determining which
> nodes are "DRAM" nodes, does it make sense to assume that nodes with both
> CPU and memory are DRAM nodes, since PMEM nodes are typically cpuless nodes?

For ACPI platforms the HMAT is effectively going to enforce "cpu-less"
nodes for any memory range that has differentiated performance from
the conventional memory pool, or differentiated performance for a
specific initiator. So "cpu-less == PMEM" is not a robust
assumption.

The plan is to use the HMAT to populate the default fallback order,
but allow for an override if the HMAT information is missing or
incorrect.
Yang Shi March 28, 2019, 6:58 p.m. UTC | #23
On 3/27/19 11:58 PM, Michal Hocko wrote:
> On Wed 27-03-19 19:09:10, Yang Shi wrote:
>> One question: when doing demotion and promotion we need to define a path,
>> for example DRAM <-> PMEM (assuming two-tier memory). When determining which
>> nodes are "DRAM" nodes, does it make sense to assume that nodes with both
>> CPU and memory are DRAM nodes, since PMEM nodes are typically cpuless nodes?
> Do we really have to special-case this for PMEM? Why can't we simply go
> in zonelist order? In other words, why can't we use the same logic as on
> a larger NUMA machine and, instead of swapping, simply fall back to a
> less contended NUMA node? That could be a regular DRAM node, a PMEM node,
> or whatever other type of memory node.

Thanks for the suggestion. It makes sense. However, if we don't special-case
a PMEM node, its fallback node may be a DRAM node; then memory reclaim may
move inactive pages to the DRAM node, which doesn't make much sense, since
memory reclaim would prefer to move downwards (DRAM -> PMEM -> disk).

Yang
Michal Hocko March 28, 2019, 7:12 p.m. UTC | #24
On Thu 28-03-19 11:58:57, Yang Shi wrote:
> 
> 
> On 3/27/19 11:58 PM, Michal Hocko wrote:
> > On Wed 27-03-19 19:09:10, Yang Shi wrote:
> > > One question: when doing demotion and promotion we need to define a path,
> > > for example DRAM <-> PMEM (assuming two-tier memory). When determining which
> > > nodes are "DRAM" nodes, does it make sense to assume that nodes with both
> > > CPU and memory are DRAM nodes, since PMEM nodes are typically cpuless nodes?
> > Do we really have to special-case this for PMEM? Why can't we simply go
> > in zonelist order? In other words, why can't we use the same logic as on
> > a larger NUMA machine and, instead of swapping, simply fall back to a
> > less contended NUMA node? That could be a regular DRAM node, a PMEM node,
> > or whatever other type of memory node.
> 
> Thanks for the suggestion. It makes sense. However, if we don't special-case
> a PMEM node, its fallback node may be a DRAM node; then memory reclaim may
> move inactive pages to the DRAM node, which doesn't make much sense, since
> memory reclaim would prefer to move downwards (DRAM -> PMEM -> disk).

There are certainly many details to sort out. One thing is how to handle
cpuless nodes (e.g. PMEM). Those shouldn't get any direct allocations
without an explicit binding, right? My first naive idea would be to
migrate-on-reclaim only from the preferred node. We might need
additional heuristics but I wouldn't special-case PMEM from other
cpuless NUMA nodes.
Yang Shi March 28, 2019, 7:40 p.m. UTC | #25
On 3/28/19 12:12 PM, Michal Hocko wrote:
> On Thu 28-03-19 11:58:57, Yang Shi wrote:
>>
>> On 3/27/19 11:58 PM, Michal Hocko wrote:
>>> On Wed 27-03-19 19:09:10, Yang Shi wrote:
>>>> One question: when doing demotion and promotion we need to define a path,
>>>> for example DRAM <-> PMEM (assuming two-tier memory). When determining which
>>>> nodes are "DRAM" nodes, does it make sense to assume that nodes with both
>>>> CPU and memory are DRAM nodes, since PMEM nodes are typically cpuless nodes?
>>> Do we really have to special-case this for PMEM? Why can't we simply go
>>> in zonelist order? In other words, why can't we use the same logic as on
>>> a larger NUMA machine and, instead of swapping, simply fall back to a
>>> less contended NUMA node? That could be a regular DRAM node, a PMEM node,
>>> or whatever other type of memory node.
>> Thanks for the suggestion. It makes sense. However, if we don't special-case
>> a PMEM node, its fallback node may be a DRAM node; then memory reclaim may
>> move inactive pages to the DRAM node, which doesn't make much sense, since
>> memory reclaim would prefer to move downwards (DRAM -> PMEM -> disk).
> There are certainly many details to sort out. One thing is how to handle
> cpuless nodes (e.g. PMEM). Those shouldn't get any direct allocations
> without an explicit binding, right? My first naive idea would be to

Wait a minute. I thought we were arguing about the default allocation
node mask yesterday, and the conclusion was that PMEM nodes should not be
excluded from the node mask. PMEM nodes are cpuless nodes. I think I
should replace "PMEM node" with "cpuless node" throughout the cover letter
and commit logs to make this explicit.

Quoted from Dan "For ACPI platforms the HMAT is effectively going to 
enforce "cpu-less" nodes for any memory range that has differentiated 
performance from the conventional memory pool, or differentiated 
performance for a specific initiator."

I apologize that I didn't make it clear in the first place that PMEM
nodes are cpuless nodes. Of course, a cpuless node may not be a PMEM node.

To your question, yes, I do agree. Actually, this is what I meant by
"DRAM only by default", or perhaps I should rephrase it as "exclude
cpuless nodes"; I thought they meant the same thing.

> migrate-on-reclaim only from the preferred node. We might need

If we exclude cpuless nodes, yes. The preferred node would be a DRAM
node only. Actually, the patchset does follow "migrate-on-reclaim only
from the preferred node".

Thanks,
Yang

> additional heuristics but I wouldn't special-case PMEM from other
> cpuless NUMA nodes.
Michal Hocko March 28, 2019, 8:40 p.m. UTC | #26
On Thu 28-03-19 12:40:14, Yang Shi wrote:
> 
> 
> On 3/28/19 12:12 PM, Michal Hocko wrote:
> > On Thu 28-03-19 11:58:57, Yang Shi wrote:
> > > 
> > > On 3/27/19 11:58 PM, Michal Hocko wrote:
> > > > On Wed 27-03-19 19:09:10, Yang Shi wrote:
> > > > > One question: when doing demotion and promotion we need to define a path,
> > > > > for example DRAM <-> PMEM (assuming two-tier memory). When determining which
> > > > > nodes are "DRAM" nodes, does it make sense to assume that nodes with both
> > > > > CPU and memory are DRAM nodes, since PMEM nodes are typically cpuless nodes?
> > > > Do we really have to special-case this for PMEM? Why can't we simply go
> > > > in zonelist order? In other words, why can't we use the same logic as on
> > > > a larger NUMA machine and, instead of swapping, simply fall back to a
> > > > less contended NUMA node? That could be a regular DRAM node, a PMEM node,
> > > > or whatever other type of memory node.
> > > Thanks for the suggestion. It makes sense. However, if we don't special-case
> > > a PMEM node, its fallback node may be a DRAM node; then memory reclaim may
> > > move inactive pages to the DRAM node, which doesn't make much sense, since
> > > memory reclaim would prefer to move downwards (DRAM -> PMEM -> disk).
> > There are certainly many details to sort out. One thing is how to handle
> > cpuless nodes (e.g. PMEM). Those shouldn't get any direct allocations
> > without an explicit binding, right? My first naive idea would be to
> 
> Wait a minute. I thought we were arguing about the default allocation node
> mask yesterday, and the conclusion was that PMEM nodes should not be excluded
> from the node mask. PMEM nodes are cpuless nodes. I think I should replace
> "PMEM node" with "cpuless node" throughout the cover letter and commit logs
> to make this explicit.

No, this is not about the default allocation mask at all. Your
allocations start from the local/mempolicy node. CPUless nodes thus cannot
be a primary node, so they will only ever be reached via the fallback
zonelist without an explicit binding.
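
(For reference, the explicit opt-in would be a binding like the sketch
below; node id 2 is a made-up example of a cpuless node. Without such a
binding the node is only reachable through the fallback zonelist. Build
with -lnuma for the set_mempolicy() wrapper.)

/* Sketch: explicitly opt in to a cpuless (e.g. PMEM) node with
 * MPOL_BIND. Node id 2 is illustrative. Build with: gcc bind.c -lnuma */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	int pmem_node = 2;			/* assumed cpuless node id */
	unsigned long mask = 1UL << pmem_node;
	char *p;

	if (set_mempolicy(MPOL_BIND, &mask, sizeof(mask) * 8)) {
		perror("set_mempolicy");
		return 1;
	}
	/* further anonymous allocations now come from node 2 only */
	p = malloc(1UL << 20);
	memset(p, 0, 1UL << 20);		/* fault the pages in */
	free(p);
	return 0;
}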