mbox series

[RFC,v2,00/21] PMEM NUMA node and hotness accounting/migration

Message ID 20181226131446.330864849@intel.com (mailing list archive)
Headers show
Series PMEM NUMA node and hotness accounting/migration | expand

Message

Fengguang Wu Dec. 26, 2018, 1:14 p.m. UTC
This is an attempt to use NVDIMM/PMEM as volatile NUMA memory that's
transparent to normal applications and virtual machines.

The code is still in active development. It's provided for early design review.

Key functionalities:

1) create and describe PMEM NUMA node for NVDIMM memory
2) dumb /proc/PID/idle_pages interface, for user space driven hot page accounting
3) passive kernel cold page migration in page reclaim path
4) improved move_pages() for active user space hot/cold page migration

(1) is foundation for transparent usage of NVDIMM for normal apps and virtual
machines. (2-4) enable auto placing hot pages in DRAM for better performance.
A user space migration daemon is being built based on this kernel patchset to
make the full vertical solution.

Base kernel is v4.20 . The patches are not suitable for upstreaming in near
future -- some are quick hacks, some others need more works. However they are
complete enough to demo the necessary kernel changes for the proposed app&VM
transparent NVDIMM volatile use model.

The interfaces are far from finalized. They kind of illustrate what would be
necessary for creating a user space driven solution. The exact forms will ask
for more thoughts and inputs. We may adopt HMAT based solution for NUMA node
related interface when they are ready. The /proc/PID/idle_pages interface is
standalone but non-trivial. Before upstreaming some day, it's expected to take
long time to collect various real use cases and feedbacks, so as to refine and
stabilize the format.

Create PMEM numa node

	[PATCH 01/21] e820: cheat PMEM as DRAM

Mark numa node as DRAM/PMEM

	[PATCH 02/21] acpi/numa: memorize NUMA node type from SRAT table
	[PATCH 03/21] x86/numa_emulation: fix fake NUMA in uniform case
	[PATCH 04/21] x86/numa_emulation: pass numa node type to fake nodes
	[PATCH 05/21] mmzone: new pgdat flags for DRAM and PMEM
	[PATCH 06/21] x86,numa: update numa node type
	[PATCH 07/21] mm: export node type {pmem|dram} under /sys/bus/node

Point neighbor DRAM/PMEM to each other

	[PATCH 08/21] mm: introduce and export pgdat peer_node
	[PATCH 09/21] mm: avoid duplicate peer target node

Standalone zonelist for DRAM and PMEM nodes

	[PATCH 10/21] mm: build separate zonelist for PMEM and DRAM node

Keep page table pages in DRAM

	[PATCH 11/21] kvm: allocate page table pages from DRAM
	[PATCH 12/21] x86/pgtable: allocate page table pages from DRAM

/proc/PID/idle_pages interface for virtual machine and normal tasks

	[PATCH 13/21] x86/pgtable: dont check PMD accessed bit
	[PATCH 14/21] kvm: register in mm_struct
	[PATCH 15/21] ept-idle: EPT walk for virtual machine
	[PATCH 16/21] mm-idle: mm_walk for normal task
	[PATCH 17/21] proc: introduce /proc/PID/idle_pages
	[PATCH 18/21] kvm-ept-idle: enable module

Mark hot pages

	[PATCH 19/21] mm/migrate.c: add move_pages(MPOL_MF_SW_YOUNG) flag

Kernel DRAM=>PMEM migration

	[PATCH 20/21] mm/vmscan.c: migrate anon DRAM pages to PMEM node
	[PATCH 21/21] mm/vmscan.c: shrink anon list if can migrate to PMEM

 arch/x86/include/asm/numa.h    |    2 
 arch/x86/include/asm/pgalloc.h |   10 
 arch/x86/include/asm/pgtable.h |    3 
 arch/x86/kernel/e820.c         |    3 
 arch/x86/kvm/Kconfig           |   11 
 arch/x86/kvm/Makefile          |    4 
 arch/x86/kvm/ept_idle.c        |  841 +++++++++++++++++++++++++++++++
 arch/x86/kvm/ept_idle.h        |  116 ++++
 arch/x86/kvm/mmu.c             |   12 
 arch/x86/mm/numa.c             |    3 
 arch/x86/mm/numa_emulation.c   |   30 +
 arch/x86/mm/pgtable.c          |   22 
 drivers/acpi/numa.c            |    5 
 drivers/base/node.c            |   21 
 fs/proc/base.c                 |    2 
 fs/proc/internal.h             |    1 
 fs/proc/task_mmu.c             |   54 +
 include/linux/mm_types.h       |   11 
 include/linux/mmzone.h         |   38 +
 mm/mempolicy.c                 |   14 
 mm/migrate.c                   |   13 
 mm/page_alloc.c                |   77 ++
 mm/pagewalk.c                  |    1 
 mm/vmscan.c                    |   38 +
 virt/kvm/kvm_main.c            |    3 
 25 files changed, 1306 insertions(+), 29 deletions(-)

V1 patches: https://lkml.org/lkml/2018/9/2/13

Regards,
Fengguang

Comments

Michal Hocko Dec. 27, 2018, 8:31 p.m. UTC | #1
On Wed 26-12-18 21:14:46, Wu Fengguang wrote:
> This is an attempt to use NVDIMM/PMEM as volatile NUMA memory that's
> transparent to normal applications and virtual machines.
> 
> The code is still in active development. It's provided for early design review.

So can we get a high level description of the design and expected
usecases please?
Fengguang Wu Dec. 28, 2018, 5:08 a.m. UTC | #2
On Thu, Dec 27, 2018 at 09:31:58PM +0100, Michal Hocko wrote:
>On Wed 26-12-18 21:14:46, Wu Fengguang wrote:
>> This is an attempt to use NVDIMM/PMEM as volatile NUMA memory that's
>> transparent to normal applications and virtual machines.
>>
>> The code is still in active development. It's provided for early design review.
>
>So can we get a high level description of the design and expected
>usecases please?

Good question.

Use cases
=========

The general use case is to use PMEM as slower but cheaper "DRAM".
The suitable ones can be

- workloads care memory size more than bandwidth/latency
- workloads with a set of warm/cold pages that don't change rapidly over time
- low cost VM/containers

Foundation: create PMEM NUMA nodes
==================================

To create PMEM nodes in native kernel, Dave Hansen and Dan Williams
have working patches for kernel and ndctl. According to Ying, it'll
work like this

        ndctl destroy-namespace -f namespace0.0
        ndctl destroy-namespace -f namespace1.0
        ipmctl create -goal MemoryMode=100
        reboot

To create PMEM nodes in QEMU VMs, current Debian/Fedora etc. distros
already support this

	qemu-system-x86_64
	-machine pc,nvdimm
        -enable-kvm
        -smp 64
        -m 256G
        # DRAM node 0
        -object memory-backend-file,size=128G,share=on,mem-path=/dev/shm/qemu_node0,id=tmpfs-node0
	-numa node,cpus=0-31,nodeid=0,memdev=tmpfs-node0
        # PMEM node 1
        -object memory-backend-file,size=128G,share=on,mem-path=/dev/dax1.0,align=128M,id=dax-node1
        -numa node,cpus=32-63,nodeid=1,memdev=dax-node1

Optimization: do hot/cold page tracking and migration
=====================================================

Since PMEM is slower than DRAM, we need to make sure hot pages go to
DRAM and cold pages stay in PMEM, to get the best out of PMEM and DRAM.

- DRAM=>PMEM cold page migration

It can be done in kernel page reclaim path, near the anonymous page
swap out point. Instead of swapping out, we now have the option to
migrate cold pages to PMEM NUMA nodes.

User space may also do it, however cannot act on-demand, when there
are memory pressure in DRAM nodes.

- PMEM=>DRAM hot page migration

While LRU can be good enough for identifying cold pages, frequency
based accounting can be more suitable for identifying hot pages.

Our design choice is to create a flexible user space daemon to drive
the accounting and migration, with necessary kernel supports by this
patchset.

Linux kernel already offers move_pages(2) for user space to migrate
pages to specified NUMA nodes. The major gap lies in hotness accounting.

User space driven hotness accounting
====================================

One way to find out hot/cold pages is to scan page table multiple
times and collect the "accessed" bits.

We created the kvm-ept-idle kernel module to provide the "accessed"
bits via interface /proc/PID/idle_pages. User space can open it and
read the "accessed" bits for a range of virtual address.

Inside kernel module, it implements 2 independent set of page table
scan code, seamlessly providing the same interface:

- for QEMU, scan HVA range of the VM's EPT(Extended Page Table)
- for others, scan VA range of the process page table 

With /proc/PID/idle_pages and move_pages(2), the user space daemon
can work like this

One round of scan+migration:

        loop N=(3-10) times:
                sleep 0.01-10s (typical values)
                scan page tables and read/accumulate accessed bits into arrays
        treat pages with accessed_count == N as hot  pages
        treat pages with accessed_count == 0 as cold pages
        migrate hot  pages to DRAM nodes
        migrate cold pages to PMEM nodes (optional, may do it once on multi scan rounds, to make sure they are really cold)

That just describes the bare minimal working model. A real world
daemon should consider lots more to be useful and robust. The notable
one is to avoid thrashing.

Hotness accounting can be rough and workload can be unstable. We need
to avoid promoting a warm page to DRAM and then demoting it soon.

The basic scheme is to auto control scan interval and count, so that
each round of scan will get hot pages < 1/2 DRAM size.

May also do multiple round of scans before migration, to filter out
unstable/burst accesses.

In long run, most of the accounted hot pages will already be in DRAM.
So only need to migrate the new ones to DRAM. When doing so, should
consider QoS and rate limiting to reduce impacts to user workloads.

When user space drives hot page migration, the DRAM nodes may well be
pressured, which will in turn trigger in-kernel cold page migration.
The above 1/2 DRAM size hot pages target can help kernel easily find
cold pages on LRU scan.

To avoid thrashing, it's also important to maintain persistent kernel
and user-space view of hot/cold pages. Since they will do migrations
in 2 different directions.

- the regular page table scans will clear PMD/PTE young
- user space compensate that by setting PG_referenced on
  move_pages(hot pages, MPOL_MF_SW_YOUNG)

That guarantees the user space collected view of hot pages will be
conveyed to kernel.

Regards,
Fengguang
Michal Hocko Dec. 28, 2018, 8:41 a.m. UTC | #3
On Fri 28-12-18 13:08:06, Wu Fengguang wrote:
[...]
> Optimization: do hot/cold page tracking and migration
> =====================================================
> 
> Since PMEM is slower than DRAM, we need to make sure hot pages go to
> DRAM and cold pages stay in PMEM, to get the best out of PMEM and DRAM.
> 
> - DRAM=>PMEM cold page migration
> 
> It can be done in kernel page reclaim path, near the anonymous page
> swap out point. Instead of swapping out, we now have the option to
> migrate cold pages to PMEM NUMA nodes.

OK, this makes sense to me except I am not sure this is something that
should be pmem specific. Is there any reason why we shouldn't migrate
pages on memory pressure to other nodes in general? In other words
rather than paging out we whould migrate over to the next node that is
not under memory pressure. Swapout would be the next level when the
memory is (almost_) fully utilized. That wouldn't be pmem specific.

> User space may also do it, however cannot act on-demand, when there
> are memory pressure in DRAM nodes.
> 
> - PMEM=>DRAM hot page migration
> 
> While LRU can be good enough for identifying cold pages, frequency
> based accounting can be more suitable for identifying hot pages.
> 
> Our design choice is to create a flexible user space daemon to drive
> the accounting and migration, with necessary kernel supports by this
> patchset.

We do have numa balancing, why cannot we rely on it? This along with the
above would allow to have pmem numa nodes (cpuless nodes in fact)
without any special casing and a natural part of the MM. It would be
only the matter of the configuration to set the appropriate distance to
allow reasonable allocation fallback strategy.

I haven't looked at the implementation yet but if you are proposing a
special cased zone lists then this is something CDM (Coherent Device
Memory) was trying to do two years ago and there was quite some
skepticism in the approach.
Fengguang Wu Dec. 28, 2018, 9:42 a.m. UTC | #4
On Fri, Dec 28, 2018 at 09:41:05AM +0100, Michal Hocko wrote:
>On Fri 28-12-18 13:08:06, Wu Fengguang wrote:
>[...]
>> Optimization: do hot/cold page tracking and migration
>> =====================================================
>>
>> Since PMEM is slower than DRAM, we need to make sure hot pages go to
>> DRAM and cold pages stay in PMEM, to get the best out of PMEM and DRAM.
>>
>> - DRAM=>PMEM cold page migration
>>
>> It can be done in kernel page reclaim path, near the anonymous page
>> swap out point. Instead of swapping out, we now have the option to
>> migrate cold pages to PMEM NUMA nodes.
>
>OK, this makes sense to me except I am not sure this is something that
>should be pmem specific. Is there any reason why we shouldn't migrate
>pages on memory pressure to other nodes in general? In other words
>rather than paging out we whould migrate over to the next node that is
>not under memory pressure. Swapout would be the next level when the
>memory is (almost_) fully utilized. That wouldn't be pmem specific.

In future there could be multi memory levels with different
performance/size/cost metric. There are ongoing HMAT works to describe
that. When ready, we can switch to the HMAT based general infrastructure.
Then the code will no longer be PMEM specific, but do general
promotion/demotion migrations between high/low memory levels.
Swapout could be from the lowest level memory.

Migration between peer nodes is the obvious simple way and a good
choice for the initial implementation. But yeah, it's possible to
migrate to other nodes. For example, it can be combined with NUMA
balancing: if we know the page is mostly accessed by the other socket,
then it'd best to migrate hot/cold pages directly to that socket.

>> User space may also do it, however cannot act on-demand, when there
>> are memory pressure in DRAM nodes.
>>
>> - PMEM=>DRAM hot page migration
>>
>> While LRU can be good enough for identifying cold pages, frequency
>> based accounting can be more suitable for identifying hot pages.
>>
>> Our design choice is to create a flexible user space daemon to drive
>> the accounting and migration, with necessary kernel supports by this
>> patchset.
>
>We do have numa balancing, why cannot we rely on it? This along with the
>above would allow to have pmem numa nodes (cpuless nodes in fact)
>without any special casing and a natural part of the MM. It would be
>only the matter of the configuration to set the appropriate distance to
>allow reasonable allocation fallback strategy.

Good question. We actually tried reusing NUMA balancing mechanism to
do page-fault triggered migration. move_pages() only calls
change_prot_numa(). It turns out the 2 migration types have different
purposes (one for hotness, another for home node) and hence implement
details. We end up modifying some few NUMA balancing logic -- removing
rate limiting, changing target node logics, etc.

Those look unnecessary complexities for this post. This v2 patchset
mainly fulfills our first milestone goal: a minimal viable solution
that's relatively clean to backport. Even when preparing for new
upstreamable versions, it may be good to keep it simple for the
initial upstream inclusion.

>I haven't looked at the implementation yet but if you are proposing a
>special cased zone lists then this is something CDM (Coherent Device
>Memory) was trying to do two years ago and there was quite some
>skepticism in the approach.

It looks we are pretty different than CDM. :)
We creating new NUMA nodes rather than CDM's new ZONE.
The zonelists modification is just to make PMEM nodes more separated.

Thanks,
Fengguang
Michal Hocko Dec. 28, 2018, 12:15 p.m. UTC | #5
On Fri 28-12-18 17:42:08, Wu Fengguang wrote:
[...]
> Those look unnecessary complexities for this post. This v2 patchset
> mainly fulfills our first milestone goal: a minimal viable solution
> that's relatively clean to backport. Even when preparing for new
> upstreamable versions, it may be good to keep it simple for the
> initial upstream inclusion.

On the other hand this is creating a new NUMA semantic and I would like
to have something long term thatn let's throw something in now and care
about long term later. So I would really prefer to talk about long term
plans first and only care about implementation details later.

> > I haven't looked at the implementation yet but if you are proposing a
> > special cased zone lists then this is something CDM (Coherent Device
> > Memory) was trying to do two years ago and there was quite some
> > skepticism in the approach.
> 
> It looks we are pretty different than CDM. :)
> We creating new NUMA nodes rather than CDM's new ZONE.
> The zonelists modification is just to make PMEM nodes more separated.

Yes, this is exactly what CDM was after. Have a zone which is not
reachable without explicit request AFAIR. So no, I do not think you are
too different, you just use a different terminology ;)
Fengguang Wu Dec. 28, 2018, 1:15 p.m. UTC | #6
On Fri, Dec 28, 2018 at 01:15:15PM +0100, Michal Hocko wrote:
>On Fri 28-12-18 17:42:08, Wu Fengguang wrote:
>[...]
>> Those look unnecessary complexities for this post. This v2 patchset
>> mainly fulfills our first milestone goal: a minimal viable solution
>> that's relatively clean to backport. Even when preparing for new
>> upstreamable versions, it may be good to keep it simple for the
>> initial upstream inclusion.
>
>On the other hand this is creating a new NUMA semantic and I would like
>to have something long term thatn let's throw something in now and care
>about long term later. So I would really prefer to talk about long term
>plans first and only care about implementation details later.

That makes good sense. FYI here are the several in-house patches that
try to leverage (but not yet integrate with) NUMA balancing. The last
one is brutal force hacking. They obviously break original NUMA
balancing logic.

Thanks,
Fengguang
From ef41a542568913c8c62251021c3bc38b7a549440 Mon Sep 17 00:00:00 2001
From: Liu Jingqi <jingqi.liu@intel.com>
Date: Sat, 29 Sep 2018 23:29:56 +0800
Subject: [PATCH 074/166] migrate: set PROT_NONE on the PTEs and let NUMA
 balancing

Need to enable CONFIG_NUMA_BALANCING firstly.
Set PROT_NONE on the PTEs that map to the page,
and do the actual migration in the context of process which initiate migration.

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 mm/migrate.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index b27a287081c2..d933f6966601 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1530,6 +1530,21 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 	if (page_mapcount(page) > 1 && !migrate_all)
 		goto out_putpage;
 
+	if (flags & MPOL_MF_SW_YOUNG) {
+		unsigned long start, end;
+		unsigned long nr_pte_updates = 0;
+
+		start = max(addr, vma->vm_start);
+
+		/* TODO: if huge page  */
+		end = ALIGN(addr + (1 << PAGE_SHIFT), PAGE_SIZE);
+		end = min(end, vma->vm_end);
+		nr_pte_updates = change_prot_numa(vma, start, end);
+
+		err = 0;
+		goto out_putpage;
+	}
+
 	if (PageHuge(page)) {
 		if (PageHead(page)) {
 			/* Check if the page is software young. */
Fengguang Wu Dec. 28, 2018, 1:31 p.m. UTC | #7
>> > I haven't looked at the implementation yet but if you are proposing a
>> > special cased zone lists then this is something CDM (Coherent Device
>> > Memory) was trying to do two years ago and there was quite some
>> > skepticism in the approach.
>>
>> It looks we are pretty different than CDM. :)
>> We creating new NUMA nodes rather than CDM's new ZONE.
>> The zonelists modification is just to make PMEM nodes more separated.
>
>Yes, this is exactly what CDM was after. Have a zone which is not
>reachable without explicit request AFAIR. So no, I do not think you are
>too different, you just use a different terminology ;)

Got it. OK.. The fall back zonelists patch does need more thoughts.

In long term POV, Linux should be prepared for multi-level memory.
Then there will arise the need to "allocate from this level memory".
So it looks good to have separated zonelists for each level of memory.  

On the other hand, there will also be page allocations that don't care
about the exact memory level. So it looks reasonable to expect
different kind of fallback zonelists that can be selected by NUMA policy.

Thanks,
Fengguang
Yang Shi Dec. 28, 2018, 6:28 p.m. UTC | #8
On Fri, Dec 28, 2018 at 5:31 AM Fengguang Wu <fengguang.wu@intel.com> wrote:
>
> >> > I haven't looked at the implementation yet but if you are proposing a
> >> > special cased zone lists then this is something CDM (Coherent Device
> >> > Memory) was trying to do two years ago and there was quite some
> >> > skepticism in the approach.
> >>
> >> It looks we are pretty different than CDM. :)
> >> We creating new NUMA nodes rather than CDM's new ZONE.
> >> The zonelists modification is just to make PMEM nodes more separated.
> >
> >Yes, this is exactly what CDM was after. Have a zone which is not
> >reachable without explicit request AFAIR. So no, I do not think you are
> >too different, you just use a different terminology ;)
>
> Got it. OK.. The fall back zonelists patch does need more thoughts.
>
> In long term POV, Linux should be prepared for multi-level memory.
> Then there will arise the need to "allocate from this level memory".
> So it looks good to have separated zonelists for each level of memory.

I tend to agree with Fengguang. We do have needs for finer grained
control to the usage of DRAM and PMEM, for example, controlling the
percentage of DRAM and PMEM for a specific VMA.

NUMA policy sounds not good enough for some usecases since it just can
control what mempolicy is used by what memory range. Our usecase's
memory access pattern is random in a VMA. So, we can't control the
percentage by mempolicy. We have to put PMEM into a separate zonelist
to make sure memory allocation happens on PMEM when certain criteria
is met as what Fengguang does in this patch series.

Thanks,
Yang

>
> On the other hand, there will also be page allocations that don't care
> about the exact memory level. So it looks reasonable to expect
> different kind of fallback zonelists that can be selected by NUMA policy.
>
> Thanks,
> Fengguang
>
Michal Hocko Dec. 28, 2018, 7:46 p.m. UTC | #9
[Cc Mel and Andrea - the thread started
http://lkml.kernel.org/r/20181226131446.330864849@intel.com]

On Fri 28-12-18 21:15:42, Wu Fengguang wrote:
> On Fri, Dec 28, 2018 at 01:15:15PM +0100, Michal Hocko wrote:
> > On Fri 28-12-18 17:42:08, Wu Fengguang wrote:
> > [...]
> > > Those look unnecessary complexities for this post. This v2 patchset
> > > mainly fulfills our first milestone goal: a minimal viable solution
> > > that's relatively clean to backport. Even when preparing for new
> > > upstreamable versions, it may be good to keep it simple for the
> > > initial upstream inclusion.
> > 
> > On the other hand this is creating a new NUMA semantic and I would like
> > to have something long term thatn let's throw something in now and care
> > about long term later. So I would really prefer to talk about long term
> > plans first and only care about implementation details later.
> 
> That makes good sense. FYI here are the several in-house patches that
> try to leverage (but not yet integrate with) NUMA balancing. The last
> one is brutal force hacking. They obviously break original NUMA
> balancing logic.
> 
> Thanks,
> Fengguang

> >From ef41a542568913c8c62251021c3bc38b7a549440 Mon Sep 17 00:00:00 2001
> From: Liu Jingqi <jingqi.liu@intel.com>
> Date: Sat, 29 Sep 2018 23:29:56 +0800
> Subject: [PATCH 074/166] migrate: set PROT_NONE on the PTEs and let NUMA
>  balancing
> 
> Need to enable CONFIG_NUMA_BALANCING firstly.
> Set PROT_NONE on the PTEs that map to the page,
> and do the actual migration in the context of process which initiate migration.
> 
> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  mm/migrate.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b27a287081c2..d933f6966601 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1530,6 +1530,21 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>  	if (page_mapcount(page) > 1 && !migrate_all)
>  		goto out_putpage;
>  
> +	if (flags & MPOL_MF_SW_YOUNG) {
> +		unsigned long start, end;
> +		unsigned long nr_pte_updates = 0;
> +
> +		start = max(addr, vma->vm_start);
> +
> +		/* TODO: if huge page  */
> +		end = ALIGN(addr + (1 << PAGE_SHIFT), PAGE_SIZE);
> +		end = min(end, vma->vm_end);
> +		nr_pte_updates = change_prot_numa(vma, start, end);
> +
> +		err = 0;
> +		goto out_putpage;
> +	}
> +
>  	if (PageHuge(page)) {
>  		if (PageHead(page)) {
>  			/* Check if the page is software young. */
> -- 
> 2.15.0
> 

> >From e617e8c2034387cbed50bafa786cf83528dbe3df Mon Sep 17 00:00:00 2001
> From: Fengguang Wu <fengguang.wu@intel.com>
> Date: Sun, 30 Sep 2018 10:50:58 +0800
> Subject: [PATCH 075/166] migrate: consolidate MPOL_MF_SW_YOUNG behaviors
> 
> - if page already in target node: SetPageReferenced
> - otherwise: change_prot_numa
> 
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  arch/x86/kvm/Kconfig |  1 +
>  mm/migrate.c         | 65 +++++++++++++++++++++++++++++++---------------------
>  2 files changed, 40 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 4c6dec47fac6..c103373536fc 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -100,6 +100,7 @@ config KVM_EPT_IDLE
>  	tristate "KVM EPT idle page tracking"
>  	depends on KVM_INTEL
>  	depends on PROC_PAGE_MONITOR
> +	depends on NUMA_BALANCING
>  	---help---
>  	  Provides support for walking EPT to get the A bits on Intel
>  	  processors equipped with the VT extensions.
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d933f6966601..d944f031c9ea 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1500,6 +1500,8 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>  {
>  	struct vm_area_struct *vma;
>  	struct page *page;
> +	unsigned long end;
> +	unsigned int page_nid;
>  	unsigned int follflags;
>  	int err;
>  	bool migrate_all = flags & MPOL_MF_MOVE_ALL;
> @@ -1522,49 +1524,60 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>  	if (!page)
>  		goto out;
>  
> -	err = 0;
> -	if (page_to_nid(page) == node)
> -		goto out_putpage;
> +	page_nid = page_to_nid(page);
>  
>  	err = -EACCES;
>  	if (page_mapcount(page) > 1 && !migrate_all)
>  		goto out_putpage;
>  
> -	if (flags & MPOL_MF_SW_YOUNG) {
> -		unsigned long start, end;
> -		unsigned long nr_pte_updates = 0;
> -
> -		start = max(addr, vma->vm_start);
> -
> -		/* TODO: if huge page  */
> -		end = ALIGN(addr + (1 << PAGE_SHIFT), PAGE_SIZE);
> -		end = min(end, vma->vm_end);
> -		nr_pte_updates = change_prot_numa(vma, start, end);
> -
> -		err = 0;
> -		goto out_putpage;
> -	}
> -
> +	err = 0;
>  	if (PageHuge(page)) {
> -		if (PageHead(page)) {
> -			/* Check if the page is software young. */
> -			if (flags & MPOL_MF_SW_YOUNG)
> +		if (!PageHead(page)) {
> +			err = -EACCES;
> +			goto out_putpage;
> +		}
> +		if (flags & MPOL_MF_SW_YOUNG) {
> +			if (page_nid == node)
>  				SetPageReferenced(page);
> -			isolate_huge_page(page, pagelist);
> -			err = 0;
> +			else if (PageAnon(page)) {
> +				end = addr + (hpage_nr_pages(page) << PAGE_SHIFT);
> +				if (end <= vma->vm_end)
> +					change_prot_numa(vma, addr, end);
> +			}
> +			goto out_putpage;
>  		}
> +		if (page_nid == node)
> +			goto out_putpage;
> +		isolate_huge_page(page, pagelist);
>  	} else {
>  		struct page *head;
>  
>  		head = compound_head(page);
> +
> +		if (flags & MPOL_MF_SW_YOUNG) {
> +			if (page_nid == node)
> +				SetPageReferenced(head);
> +			else {
> +				unsigned long size;
> +				size = hpage_nr_pages(head) << PAGE_SHIFT;
> +				end = addr + size;
> +				if (unlikely(addr & (size - 1)))
> +					err = -EXDEV;
> +				else if (likely(end <= vma->vm_end))
> +					change_prot_numa(vma, addr, end);
> +				else
> +					err = -ERANGE;
> +			}
> +			goto out_putpage;
> +		}
> +		if (page_nid == node)
> +			goto out_putpage;
> +
>  		err = isolate_lru_page(head);
>  		if (err)
>  			goto out_putpage;
>  
>  		err = 0;
> -		/* Check if the page is software young. */
> -		if (flags & MPOL_MF_SW_YOUNG)
> -			SetPageReferenced(head);
>  		list_add_tail(&head->lru, pagelist);
>  		mod_node_page_state(page_pgdat(head),
>  			NR_ISOLATED_ANON + page_is_file_cache(head),
> -- 
> 2.15.0
> 

> >From a2d9740d1639f807868014c16dc9e2620d356f3c Mon Sep 17 00:00:00 2001
> From: Fengguang Wu <fengguang.wu@intel.com>
> Date: Sun, 30 Sep 2018 19:22:27 +0800
> Subject: [PATCH 076/166] mempolicy: force NUMA balancing
> 
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  mm/memory.c    | 3 ++-
>  mm/mempolicy.c | 5 -----
>  2 files changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index c467102a5cbc..20c7efdff63b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3775,7 +3775,8 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>  		*flags |= TNF_FAULT_LOCAL;
>  	}
>  
> -	return mpol_misplaced(page, vma, addr);
> +	return 0;
> +	/* return mpol_misplaced(page, vma, addr); */
>  }
>  
>  static vm_fault_t do_numa_page(struct vm_fault *vmf)
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..21dc6ba1d062 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2295,8 +2295,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
>  	int ret = -1;
>  
>  	pol = get_vma_policy(vma, addr);
> -	if (!(pol->flags & MPOL_F_MOF))
> -		goto out;
>  
>  	switch (pol->mode) {
>  	case MPOL_INTERLEAVE:
> @@ -2336,9 +2334,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
>  	/* Migrate the page towards the node whose CPU is referencing it */
>  	if (pol->flags & MPOL_F_MORON) {
>  		polnid = thisnid;
> -
> -		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
> -			goto out;
>  	}
>  
>  	if (curnid != polnid)
> -- 
> 2.15.0
>
Michal Hocko Dec. 28, 2018, 7:52 p.m. UTC | #10
[Ccing Mel and Andrea]

On Fri 28-12-18 21:31:11, Wu Fengguang wrote:
> > > > I haven't looked at the implementation yet but if you are proposing a
> > > > special cased zone lists then this is something CDM (Coherent Device
> > > > Memory) was trying to do two years ago and there was quite some
> > > > skepticism in the approach.
> > > 
> > > It looks we are pretty different than CDM. :)
> > > We creating new NUMA nodes rather than CDM's new ZONE.
> > > The zonelists modification is just to make PMEM nodes more separated.
> > 
> > Yes, this is exactly what CDM was after. Have a zone which is not
> > reachable without explicit request AFAIR. So no, I do not think you are
> > too different, you just use a different terminology ;)
> 
> Got it. OK.. The fall back zonelists patch does need more thoughts.
> 
> In long term POV, Linux should be prepared for multi-level memory.
> Then there will arise the need to "allocate from this level memory".
> So it looks good to have separated zonelists for each level of memory.

Well, I do not have a good answer for you here. We do not have good
experiences with those systems, I am afraid. NUMA is with us for more
than a decade yet our APIs are coarse to say the least and broken at so
many times as well. Starting a new API just based on PMEM sounds like a
ticket to another disaster to me.

I would like to see solid arguments why the current model of numa nodes
with fallback in distances order cannot be used for those new
technologies in the beginning and develop something better based on our
experiences that we gain on the way.

I would be especially interested about a possibility of the memory
migration idea during a memory pressure and relying on numa balancing to
resort the locality on demand rather than hiding certain NUMA nodes or
zones from the allocator and expose them only to the userspace.

> On the other hand, there will also be page allocations that don't care
> about the exact memory level. So it looks reasonable to expect
> different kind of fallback zonelists that can be selected by NUMA policy.
> 
> Thanks,
> Fengguang
Dave Hansen Jan. 2, 2019, 6:12 p.m. UTC | #11
On 12/28/18 12:41 AM, Michal Hocko wrote:
>>
>> It can be done in kernel page reclaim path, near the anonymous page
>> swap out point. Instead of swapping out, we now have the option to
>> migrate cold pages to PMEM NUMA nodes.
> OK, this makes sense to me except I am not sure this is something that
> should be pmem specific. Is there any reason why we shouldn't migrate
> pages on memory pressure to other nodes in general? In other words
> rather than paging out we whould migrate over to the next node that is
> not under memory pressure. Swapout would be the next level when the
> memory is (almost_) fully utilized. That wouldn't be pmem specific.

Yeah, we don't want to make this specific to any particular kind of
memory.  For instance, with lots of pressure on expensive, small
high-bandwidth memory (HBM), we might want to migrate some HBM contents
to DRAM.

We need to decide on whether we want to cause pressure on the
destination nodes or not, though.  I think you're suggesting that we try
to look for things under some pressure and totally avoid them.  That
sounds sane, but I also like the idea of this being somewhat ordered.

Think of if we have three nodes, A, B, C.  A is fast, B is medium, C is
slow.  If A and B are "full" and we want to reclaim some of A, do we:

1. Migrate A->B, and put pressure on a later B->C migration, or
2. Migrate A->C directly

?

Doing A->C is less resource intensive because there's only one migration
involved.  But, doing A->B/B->C probably makes the app behave better
because the "A data" is presumably more valuable and is more
appropriately placed in B rather than being demoted all the way to C.
Mel Gorman Jan. 3, 2019, 10:57 a.m. UTC | #12
On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote:
> [Ccing Mel and Andrea]
> 
> On Fri 28-12-18 21:31:11, Wu Fengguang wrote:
> > > > > I haven't looked at the implementation yet but if you are proposing a
> > > > > special cased zone lists then this is something CDM (Coherent Device
> > > > > Memory) was trying to do two years ago and there was quite some
> > > > > skepticism in the approach.
> > > > 
> > > > It looks we are pretty different than CDM. :)
> > > > We creating new NUMA nodes rather than CDM's new ZONE.
> > > > The zonelists modification is just to make PMEM nodes more separated.
> > > 
> > > Yes, this is exactly what CDM was after. Have a zone which is not
> > > reachable without explicit request AFAIR. So no, I do not think you are
> > > too different, you just use a different terminology ;)
> > 
> > Got it. OK.. The fall back zonelists patch does need more thoughts.
> > 
> > In long term POV, Linux should be prepared for multi-level memory.
> > Then there will arise the need to "allocate from this level memory".
> > So it looks good to have separated zonelists for each level of memory.
> 
> Well, I do not have a good answer for you here. We do not have good
> experiences with those systems, I am afraid. NUMA is with us for more
> than a decade yet our APIs are coarse to say the least and broken at so
> many times as well. Starting a new API just based on PMEM sounds like a
> ticket to another disaster to me.
> 
> I would like to see solid arguments why the current model of numa nodes
> with fallback in distances order cannot be used for those new
> technologies in the beginning and develop something better based on our
> experiences that we gain on the way.
> 
> I would be especially interested about a possibility of the memory
> migration idea during a memory pressure and relying on numa balancing to
> resort the locality on demand rather than hiding certain NUMA nodes or
> zones from the allocator and expose them only to the userspace.
> 

I didn't read the thread as I'm backlogged as I imagine a lot of people
are. However, I would agree that zonelists are not a good fit for something
like PMEM-based being available via a zonelist with a fake distance combined
with NUMA balancing moving pages in and out DRAM and PMEM. The same applies
to a much lesser extent for something like a special higher-speed memory
that is faster than RAM.

The fundamental problem encountered will be a hot-page-inversion issue.
In the PMEM case, DRAM fills, then PMEM starts filling except now we
know that the most recently allocated page which is potentially the most
important in terms of hotness is allocated on slower "remote" memory.
Reclaim kicks in for the DRAM node and then there is interleaving of
hotness between DRAM and PMEM with NUMA balancing then getting involved
with non-deterministic performance.

I recognise that the same problem happens for remote NUMA nodes and it
also has an inversion issue once reclaim gets involved, but it also has a
clearly defined API for dealing with that problem if applications encounter
it. It's also relatively well known given the age of the problem and how
to cope with it. It's less clear whether applications could be able to
cope of it's a more distant PMEM instead of a remote DRAM and how that
should be advertised.

This has been brought up repeatedly over the last few years since high
speed memory was first mentioned but I think long-term what we should
be thinking of is "age-based-migration" where cold pages from DRAM
get migrated to PMEM when DRAM fills and use NUMA balancing to promote
hot pages from PMEM to DRAM. It should also be workable for remote DRAM
although that *might* violate the principal of least surprise given that
applications exist that are remote NUMA aware. It might be safer overall
if such age-based-migration is specific to local-but-different-speed
memory with the main DRAM only being in the zonelists. NUMA balancing
could still optionally promote from DRAM->faster memory while aging
moves pages from fast->slow as memory pressure dictates.

There still would need to be thought on exactly how this is advertised
to userspace because while "distance" is reasonably well understood,
it's not as clear to me whether distance is appropriate to describe
"local-but-different-speed" memory given that accessing a remote
NUMA node can saturate a single link where as the same may not
be true of local-but-different-speed memory which probably has
dedicated channels. In an ideal world, application developers
interested in higher-speed-memory-reserved-for-important-use and
cheaper-lower-speed-memory could describe what sort of application
modifications they'd be willing to do but that might be unlikely.
Michal Hocko Jan. 8, 2019, 2:52 p.m. UTC | #13
On Wed 02-01-19 12:21:10, Jonathan Cameron wrote:
[...]
> So ideally I'd love this set to head in a direction that helps me tick off
> at least some of the above usecases and hopefully have some visibility on
> how to address the others moving forwards,

Is it sufficient to have such a memory marked as movable (aka only have
ZONE_MOVABLE)? That should rule out most of the kernel allocations and
it fits the "balance by migration" concept.
Michal Hocko Jan. 8, 2019, 2:53 p.m. UTC | #14
On Wed 02-01-19 10:12:04, Dave Hansen wrote:
> On 12/28/18 12:41 AM, Michal Hocko wrote:
> >>
> >> It can be done in kernel page reclaim path, near the anonymous page
> >> swap out point. Instead of swapping out, we now have the option to
> >> migrate cold pages to PMEM NUMA nodes.
> > OK, this makes sense to me except I am not sure this is something that
> > should be pmem specific. Is there any reason why we shouldn't migrate
> > pages on memory pressure to other nodes in general? In other words
> > rather than paging out we whould migrate over to the next node that is
> > not under memory pressure. Swapout would be the next level when the
> > memory is (almost_) fully utilized. That wouldn't be pmem specific.
> 
> Yeah, we don't want to make this specific to any particular kind of
> memory.  For instance, with lots of pressure on expensive, small
> high-bandwidth memory (HBM), we might want to migrate some HBM contents
> to DRAM.
> 
> We need to decide on whether we want to cause pressure on the
> destination nodes or not, though.  I think you're suggesting that we try
> to look for things under some pressure and totally avoid them.  That
> sounds sane, but I also like the idea of this being somewhat ordered.
> 
> Think of if we have three nodes, A, B, C.  A is fast, B is medium, C is
> slow.  If A and B are "full" and we want to reclaim some of A, do we:
> 
> 1. Migrate A->B, and put pressure on a later B->C migration, or
> 2. Migrate A->C directly
> 
> ?
> 
> Doing A->C is less resource intensive because there's only one migration
> involved.  But, doing A->B/B->C probably makes the app behave better
> because the "A data" is presumably more valuable and is more
> appropriately placed in B rather than being demoted all the way to C.

This is a good question and I do not have a good answer because I lack
experiences with such "many levels" systems. If we followed CPU caches
model ten you are right that the fallback should be gradual. This is
more complex implementation wise of course. Anyway, I believe that there
is a lot of room for experimentations. If this stays an internal
implementation detail without user API then there is also no promise on
future behavior so nothing gets carved into stone since the day 1 when
our experiences are limited.
Jerome Glisse Jan. 10, 2019, 3:53 p.m. UTC | #15
On Tue, Jan 08, 2019 at 03:52:56PM +0100, Michal Hocko wrote:
> On Wed 02-01-19 12:21:10, Jonathan Cameron wrote:
> [...]
> > So ideally I'd love this set to head in a direction that helps me tick off
> > at least some of the above usecases and hopefully have some visibility on
> > how to address the others moving forwards,
> 
> Is it sufficient to have such a memory marked as movable (aka only have
> ZONE_MOVABLE)? That should rule out most of the kernel allocations and
> it fits the "balance by migration" concept.

This would not work for GPU, GPU driver really want to be in total
control of their memory yet sometimes they want to migrate some part
of the process to their memory.

Cheers,
Jérôme
Jerome Glisse Jan. 10, 2019, 4:25 p.m. UTC | #16
On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote:
> [Ccing Mel and Andrea]
> 
> On Fri 28-12-18 21:31:11, Wu Fengguang wrote:
> > > > > I haven't looked at the implementation yet but if you are proposing a
> > > > > special cased zone lists then this is something CDM (Coherent Device
> > > > > Memory) was trying to do two years ago and there was quite some
> > > > > skepticism in the approach.
> > > > 
> > > > It looks we are pretty different than CDM. :)
> > > > We creating new NUMA nodes rather than CDM's new ZONE.
> > > > The zonelists modification is just to make PMEM nodes more separated.
> > > 
> > > Yes, this is exactly what CDM was after. Have a zone which is not
> > > reachable without explicit request AFAIR. So no, I do not think you are
> > > too different, you just use a different terminology ;)
> > 
> > Got it. OK.. The fall back zonelists patch does need more thoughts.
> > 
> > In long term POV, Linux should be prepared for multi-level memory.
> > Then there will arise the need to "allocate from this level memory".
> > So it looks good to have separated zonelists for each level of memory.
> 
> Well, I do not have a good answer for you here. We do not have good
> experiences with those systems, I am afraid. NUMA is with us for more
> than a decade yet our APIs are coarse to say the least and broken at so
> many times as well. Starting a new API just based on PMEM sounds like a
> ticket to another disaster to me.
> 
> I would like to see solid arguments why the current model of numa nodes
> with fallback in distances order cannot be used for those new
> technologies in the beginning and develop something better based on our
> experiences that we gain on the way.

I see several issues with distance. First it does fully abstract the
underlying topology and this might be problematic, for instance if
you memory with different characteristic in same node like persistent
memory connected to some CPU then it might be faster for that CPU to
access that persistent memory has it has dedicated link to it than to
access some other remote memory for which the CPU might have to share
the link with other CPUs or devices.

Second distance is no longer easy to compute when you are not trying
to answer what is the fastest memory for CPU-N but rather asking what
is the fastest memory for CPU-N and device-M ie when you are trying to
find the best memory for a group of CPUs/devices. The answer can
changes drasticly depending on members of the groups.


Some advance programmer already do graph matching ie they match the
graph of their program dataset/computation with the topology graph
of the computer they run on to determine what is best placement both
for threads and memory.


> I would be especially interested about a possibility of the memory
> migration idea during a memory pressure and relying on numa balancing to
> resort the locality on demand rather than hiding certain NUMA nodes or
> zones from the allocator and expose them only to the userspace.

For device memory we have more things to think of like:
    - memory not accessible by CPU
    - non cache coherent memory (yet still useful in some case if
      application explicitly ask for it)
    - device driver want to keep full control over memory as older
      application like graphic for GPU, do need contiguous physical
      memory and other tight control over physical memory placement

So if we are talking about something to replace NUMA i would really
like for that to be inclusive of device memory (which can itself be
a hierarchy of different memory with different characteristics).

Note that i do believe the NUMA proposed solution is something useful
now. But for a new API it would be good to allow thing like device
memory.

This is a good topic to discuss during next LSF/MM

Cheers,
Jérôme
Michal Hocko Jan. 10, 2019, 4:42 p.m. UTC | #17
On Thu 10-01-19 10:53:17, Jerome Glisse wrote:
> On Tue, Jan 08, 2019 at 03:52:56PM +0100, Michal Hocko wrote:
> > On Wed 02-01-19 12:21:10, Jonathan Cameron wrote:
> > [...]
> > > So ideally I'd love this set to head in a direction that helps me tick off
> > > at least some of the above usecases and hopefully have some visibility on
> > > how to address the others moving forwards,
> > 
> > Is it sufficient to have such a memory marked as movable (aka only have
> > ZONE_MOVABLE)? That should rule out most of the kernel allocations and
> > it fits the "balance by migration" concept.
> 
> This would not work for GPU, GPU driver really want to be in total
> control of their memory yet sometimes they want to migrate some part
> of the process to their memory.

But that also means that GPU doesn't really fit the model discussed
here, right? I thought HMM is the way to manage such a memory.
Michal Hocko Jan. 10, 2019, 4:50 p.m. UTC | #18
On Thu 10-01-19 11:25:56, Jerome Glisse wrote:
> On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote:
> > [Ccing Mel and Andrea]
> > 
> > On Fri 28-12-18 21:31:11, Wu Fengguang wrote:
> > > > > > I haven't looked at the implementation yet but if you are proposing a
> > > > > > special cased zone lists then this is something CDM (Coherent Device
> > > > > > Memory) was trying to do two years ago and there was quite some
> > > > > > skepticism in the approach.
> > > > > 
> > > > > It looks we are pretty different than CDM. :)
> > > > > We creating new NUMA nodes rather than CDM's new ZONE.
> > > > > The zonelists modification is just to make PMEM nodes more separated.
> > > > 
> > > > Yes, this is exactly what CDM was after. Have a zone which is not
> > > > reachable without explicit request AFAIR. So no, I do not think you are
> > > > too different, you just use a different terminology ;)
> > > 
> > > Got it. OK.. The fall back zonelists patch does need more thoughts.
> > > 
> > > In long term POV, Linux should be prepared for multi-level memory.
> > > Then there will arise the need to "allocate from this level memory".
> > > So it looks good to have separated zonelists for each level of memory.
> > 
> > Well, I do not have a good answer for you here. We do not have good
> > experiences with those systems, I am afraid. NUMA is with us for more
> > than a decade yet our APIs are coarse to say the least and broken at so
> > many times as well. Starting a new API just based on PMEM sounds like a
> > ticket to another disaster to me.
> > 
> > I would like to see solid arguments why the current model of numa nodes
> > with fallback in distances order cannot be used for those new
> > technologies in the beginning and develop something better based on our
> > experiences that we gain on the way.
> 
> I see several issues with distance. First it does fully abstract the
> underlying topology and this might be problematic, for instance if
> you memory with different characteristic in same node like persistent
> memory connected to some CPU then it might be faster for that CPU to
> access that persistent memory has it has dedicated link to it than to
> access some other remote memory for which the CPU might have to share
> the link with other CPUs or devices.
> 
> Second distance is no longer easy to compute when you are not trying
> to answer what is the fastest memory for CPU-N but rather asking what
> is the fastest memory for CPU-N and device-M ie when you are trying to
> find the best memory for a group of CPUs/devices. The answer can
> changes drasticly depending on members of the groups.

While you might be right, I would _really_ appreciate to start with a
simpler model and go to a more complex one based on realy HW and real
experiences than start with an overly complicated and over engineered
approach from scratch.

> Some advance programmer already do graph matching ie they match the
> graph of their program dataset/computation with the topology graph
> of the computer they run on to determine what is best placement both
> for threads and memory.

And those can still use our mempolicy API to describe their needs. If
existing API is not sufficient then let's talk about which pieces are
missing.

> > I would be especially interested about a possibility of the memory
> > migration idea during a memory pressure and relying on numa balancing to
> > resort the locality on demand rather than hiding certain NUMA nodes or
> > zones from the allocator and expose them only to the userspace.
> 
> For device memory we have more things to think of like:
>     - memory not accessible by CPU
>     - non cache coherent memory (yet still useful in some case if
>       application explicitly ask for it)
>     - device driver want to keep full control over memory as older
>       application like graphic for GPU, do need contiguous physical
>       memory and other tight control over physical memory placement

Again, I believe that HMM is to target those non-coherent or
non-accessible memory and I do not think it is helpful to put them into
the mix here.

> So if we are talking about something to replace NUMA i would really
> like for that to be inclusive of device memory (which can itself be
> a hierarchy of different memory with different characteristics).

I think we should build on the existing NUMA infrastructure we have.
Developing something completely new is not going to happen anytime soon
and I am not convinced the result would be that much better either.
Jerome Glisse Jan. 10, 2019, 5:42 p.m. UTC | #19
On Thu, Jan 10, 2019 at 05:42:48PM +0100, Michal Hocko wrote:
> On Thu 10-01-19 10:53:17, Jerome Glisse wrote:
> > On Tue, Jan 08, 2019 at 03:52:56PM +0100, Michal Hocko wrote:
> > > On Wed 02-01-19 12:21:10, Jonathan Cameron wrote:
> > > [...]
> > > > So ideally I'd love this set to head in a direction that helps me tick off
> > > > at least some of the above usecases and hopefully have some visibility on
> > > > how to address the others moving forwards,
> > > 
> > > Is it sufficient to have such a memory marked as movable (aka only have
> > > ZONE_MOVABLE)? That should rule out most of the kernel allocations and
> > > it fits the "balance by migration" concept.
> > 
> > This would not work for GPU, GPU driver really want to be in total
> > control of their memory yet sometimes they want to migrate some part
> > of the process to their memory.
> 
> But that also means that GPU doesn't really fit the model discussed
> here, right? I thought HMM is the way to manage such a memory.

HMM provides the plumbing and tools to manage but right now the patchset
for nouveau expose API through nouveau device file as nouveau ioctl. This
is not a good long term solution when you want to mix and match multiple
GPUs memory (possibly from different vendors). Then you get each device
driver implementing their own mem policy infrastructure and without any
coordination between devices/drivers. While it is _mostly_ ok for single
GPU case, it is seriously crippling for the multi-GPUs or multi-devices
cases (for instance when you chain network and GPU together or GPU and
storage).

People have been asking for a single common API to manage both regular
memory and device memory. As anyway the common case is you move things
around depending on which devices/CPUs is working on the dataset.

Cheers,
Jérôme
Jerome Glisse Jan. 10, 2019, 6:02 p.m. UTC | #20
On Thu, Jan 10, 2019 at 05:50:01PM +0100, Michal Hocko wrote:
> On Thu 10-01-19 11:25:56, Jerome Glisse wrote:
> > On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote:
> > > [Ccing Mel and Andrea]
> > > 
> > > On Fri 28-12-18 21:31:11, Wu Fengguang wrote:
> > > > > > > I haven't looked at the implementation yet but if you are proposing a
> > > > > > > special cased zone lists then this is something CDM (Coherent Device
> > > > > > > Memory) was trying to do two years ago and there was quite some
> > > > > > > skepticism in the approach.
> > > > > > 
> > > > > > It looks we are pretty different than CDM. :)
> > > > > > We creating new NUMA nodes rather than CDM's new ZONE.
> > > > > > The zonelists modification is just to make PMEM nodes more separated.
> > > > > 
> > > > > Yes, this is exactly what CDM was after. Have a zone which is not
> > > > > reachable without explicit request AFAIR. So no, I do not think you are
> > > > > too different, you just use a different terminology ;)
> > > > 
> > > > Got it. OK.. The fall back zonelists patch does need more thoughts.
> > > > 
> > > > In long term POV, Linux should be prepared for multi-level memory.
> > > > Then there will arise the need to "allocate from this level memory".
> > > > So it looks good to have separated zonelists for each level of memory.
> > > 
> > > Well, I do not have a good answer for you here. We do not have good
> > > experiences with those systems, I am afraid. NUMA is with us for more
> > > than a decade yet our APIs are coarse to say the least and broken at so
> > > many times as well. Starting a new API just based on PMEM sounds like a
> > > ticket to another disaster to me.
> > > 
> > > I would like to see solid arguments why the current model of numa nodes
> > > with fallback in distances order cannot be used for those new
> > > technologies in the beginning and develop something better based on our
> > > experiences that we gain on the way.
> > 
> > I see several issues with distance. First it does fully abstract the
> > underlying topology and this might be problematic, for instance if
> > you memory with different characteristic in same node like persistent
> > memory connected to some CPU then it might be faster for that CPU to
> > access that persistent memory has it has dedicated link to it than to
> > access some other remote memory for which the CPU might have to share
> > the link with other CPUs or devices.
> > 
> > Second distance is no longer easy to compute when you are not trying
> > to answer what is the fastest memory for CPU-N but rather asking what
> > is the fastest memory for CPU-N and device-M ie when you are trying to
> > find the best memory for a group of CPUs/devices. The answer can
> > changes drasticly depending on members of the groups.
> 
> While you might be right, I would _really_ appreciate to start with a
> simpler model and go to a more complex one based on realy HW and real
> experiences than start with an overly complicated and over engineered
> approach from scratch.
> 
> > Some advance programmer already do graph matching ie they match the
> > graph of their program dataset/computation with the topology graph
> > of the computer they run on to determine what is best placement both
> > for threads and memory.
> 
> And those can still use our mempolicy API to describe their needs. If
> existing API is not sufficient then let's talk about which pieces are
> missing.

I understand people don't want the fully topology thing but device memory
can not be expose as a NUMA node hence at very least we need something
that is not NUMA node only and most likely an API that does not use bitmask
as front facing userspace API. So some kind of UID for memory, one for
each type of memory on each node (and also for each device memory). It
can be a 1 to 1 match with NUMA node id for all regular NUMA node memory
with extra id for device memory (for instance by setting the high bit on
the UID for device memory).


> > > I would be especially interested about a possibility of the memory
> > > migration idea during a memory pressure and relying on numa balancing to
> > > resort the locality on demand rather than hiding certain NUMA nodes or
> > > zones from the allocator and expose them only to the userspace.
> > 
> > For device memory we have more things to think of like:
> >     - memory not accessible by CPU
> >     - non cache coherent memory (yet still useful in some case if
> >       application explicitly ask for it)
> >     - device driver want to keep full control over memory as older
> >       application like graphic for GPU, do need contiguous physical
> >       memory and other tight control over physical memory placement
> 
> Again, I believe that HMM is to target those non-coherent or
> non-accessible memory and I do not think it is helpful to put them into
> the mix here.

HMM is the kernel plumbing it does not expose anything to userspace.
While right now for nouveau the plan is to expose API through nouveau
ioctl this does not scale/work for multiple devices or when you mix
and match different devices. A single API that can handle both device
memory and regular memory would be much more useful. Long term at least
that's what i would like to see.


> > So if we are talking about something to replace NUMA i would really
> > like for that to be inclusive of device memory (which can itself be
> > a hierarchy of different memory with different characteristics).
> 
> I think we should build on the existing NUMA infrastructure we have.
> Developing something completely new is not going to happen anytime soon
> and I am not convinced the result would be that much better either.

The issue with NUMA is that i do not see a way to add device memory as
node as the memory need to be fully manage by the device driver. Also
the number of nodes might get out of hands (think 32 devices per CPU
so with 1024 CPU that's 2^15 max nodes ...) this leads to node mask
taking a full page.

Also the whole NUMA access tracking does not work with devices (it can
be added but right now it is non existent). Forcing page fault to track
access is highly disruptive for GPU while the hw can provide much better
informations without fault and CPU counters might also be something we
might want to use rather than faulting.

I am not saying something new will solve all the issues we have today
with NUMA, actualy i don't believe we can solve all of them. But it
could at least be more flexible in terms of what memory program can
bind to.

Cheers,
Jérôme
Jonathan Cameron Jan. 10, 2019, 6:26 p.m. UTC | #21
On Tue, 8 Jan 2019 15:52:56 +0100
Michal Hocko <mhocko@kernel.org> wrote:

> On Wed 02-01-19 12:21:10, Jonathan Cameron wrote:
> [...]
> > So ideally I'd love this set to head in a direction that helps me tick off
> > at least some of the above usecases and hopefully have some visibility on
> > how to address the others moving forwards,  
> 
> Is it sufficient to have such a memory marked as movable (aka only have
> ZONE_MOVABLE)? That should rule out most of the kernel allocations and
> it fits the "balance by migration" concept.

Yes, to some degree. That's exactly what we are doing, though a things currently
stand I think you have to turn it on via a kernel command line and mark it
hotpluggable in ACPI. Given it my or may not actually be hotpluggable
that's less than elegant.

Let's randomly decide not to explore that one further for a few more weeks.
la la la la

If we have general balancing by migration then things are definitely
heading in a useful direction as long as 'hot' takes into account the
main user not being a CPU.  You are right that migration dealing with
the movable kernel allocations is a nice side effect though which I
hadn't thought about.  Long run we might end up with everything where
it should be after some level of burn in period. A generic version of
this proposal is looking nicer and nicer!

Thanks,

Jonathan
Jonathan Cameron Jan. 28, 2019, 5:42 p.m. UTC | #22
On Wed, 2 Jan 2019 12:21:10 +0000
Jonathan Cameron <jonathan.cameron@huawei.com> wrote:

> On Fri, 28 Dec 2018 20:52:24 +0100
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > [Ccing Mel and Andrea]
> > 

Hi,

I just wanted to highlight this section as I didn't feel we really addressed this
in the earlier conversation.

> * Hot pages may not be hot just because the host is using them a lot.  It would be
>   very useful to have a means of adding information available from accelerators
>   beyond simple accessed bits (dreaming ;)  One problem here is translation
>   caches (ATCs) as they won't normally result in any updates to the page accessed
>   bits.  The arm SMMU v3 spec for example makes it clear (though it's kind of
>   obvious) that the ATS request is the only opportunity to update the accessed
>   bit.  The nasty option here would be to periodically flush the ATC to force
>   the access bit updates via repeats of the ATS request (ouch).
>   That option only works if the iommu supports updating the accessed flag
>   (optional on SMMU v3 for example).
> 

If we ignore the IOMMU hardware update issue which will simply need to be addressed
by future hardware if these techniques become common, how do we address the
Address Translation Cache issue without potentially causing big performance
problems by flushing the cache just to force an accessed bit update?

These devices are frequently used with PRI and Shared Virtual Addressing
and can be accessing most of your memory without you having any visibility
of it in the page tables (as they aren't walked if your ATC is well matched
in size to your usecase.

Classic example would be accelerated DB walkers like the the CCIX demo
Xilinx has shown at a few conferences.   The whole point of those is that
most of the time only your large set of database walkers is using your
memory and they have translations cached for for a good part of what
they are accessing.  Flushing that cache could hurt a lot.
Pinning pages hurts for all the normal flexibility reasons.

Last thing we want is to be migrating these pages that can be very hot but
in an invisible fashion.

Thanks,

Jonathan
Fengguang Wu Jan. 29, 2019, 2 a.m. UTC | #23
Hi Jonathan,

Thanks for showing the gap on tracking hot accesses from devices.

On Mon, Jan 28, 2019 at 05:42:39PM +0000, Jonathan Cameron wrote:
>On Wed, 2 Jan 2019 12:21:10 +0000
>Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>
>> On Fri, 28 Dec 2018 20:52:24 +0100
>> Michal Hocko <mhocko@kernel.org> wrote:
>>
>> > [Ccing Mel and Andrea]
>> >
>
>Hi,
>
>I just wanted to highlight this section as I didn't feel we really addressed this
>in the earlier conversation.
>
>> * Hot pages may not be hot just because the host is using them a lot.  It would be
>>   very useful to have a means of adding information available from accelerators
>>   beyond simple accessed bits (dreaming ;)  One problem here is translation
>>   caches (ATCs) as they won't normally result in any updates to the page accessed
>>   bits.  The arm SMMU v3 spec for example makes it clear (though it's kind of
>>   obvious) that the ATS request is the only opportunity to update the accessed
>>   bit.  The nasty option here would be to periodically flush the ATC to force
>>   the access bit updates via repeats of the ATS request (ouch).
>>   That option only works if the iommu supports updating the accessed flag
>>   (optional on SMMU v3 for example).

If ATS based updates are supported, we may trigger it when closing the
/proc/pid/idle_pages file. We already do TLB flushes at that time. For
example,

[PATCH 15/21] ept-idle: EPT walk for virtual machine

        ept_idle_release():
          kvm_flush_remote_tlbs(kvm);

[PATCH 17/21] proc: introduce /proc/PID/idle_pages

        mm_idle_release():
          flush_tlb_mm(mm);

The flush cost is kind of "minimal necessary" in our current use
model, where user space scan+migration daemon will do such loop:

loop:
        walk page table N times:
                open,read,close /proc/PID/idle_pages
                (flushes TLB on file close)
                sleep for a short interval
        sort and migrate hot pages
        sleep for a while

>If we ignore the IOMMU hardware update issue which will simply need to be addressed
>by future hardware if these techniques become common, how do we address the
>Address Translation Cache issue without potentially causing big performance
>problems by flushing the cache just to force an accessed bit update?
>
>These devices are frequently used with PRI and Shared Virtual Addressing
>and can be accessing most of your memory without you having any visibility
>of it in the page tables (as they aren't walked if your ATC is well matched
>in size to your usecase.
>
>Classic example would be accelerated DB walkers like the the CCIX demo
>Xilinx has shown at a few conferences.   The whole point of those is that
>most of the time only your large set of database walkers is using your
>memory and they have translations cached for for a good part of what
>they are accessing.  Flushing that cache could hurt a lot.
>Pinning pages hurts for all the normal flexibility reasons.
>
>Last thing we want is to be migrating these pages that can be very hot but
>in an invisible fashion.

If there are some other way to get hotness for special device memory,
the user space daemon may be extended to cover that. Perhaps by
querying another new kernel interface.

By driving hotness accounting and migration in user space, we harvest
this kind of flexibility. In the daemon POV, /proc/PID/idle_pages
provides one common way to get "accessed" bits hence hotness, though
the daemon does not need to depend solely on it.

Thanks,
Fengguang