Message ID | 20181226131446.330864849@intel.com (mailing list archive) |
---|---|
Series | PMEM NUMA node and hotness accounting/migration |
On Wed 26-12-18 21:14:46, Wu Fengguang wrote:
> This is an attempt to use NVDIMM/PMEM as volatile NUMA memory that's
> transparent to normal applications and virtual machines.
>
> The code is still in active development. It's provided for early design review.

So can we get a high level description of the design and expected
usecases please?
On Thu, Dec 27, 2018 at 09:31:58PM +0100, Michal Hocko wrote:
>On Wed 26-12-18 21:14:46, Wu Fengguang wrote:
>> This is an attempt to use NVDIMM/PMEM as volatile NUMA memory that's
>> transparent to normal applications and virtual machines.
>>
>> The code is still in active development. It's provided for early design review.
>
>So can we get a high level description of the design and expected
>usecases please?

Good question.

Use cases
=========

The general use case is to use PMEM as slower but cheaper "DRAM".
The suitable ones can be

- workloads that care about memory size more than bandwidth/latency
- workloads with a set of warm/cold pages that don't change rapidly over time
- low cost VM/containers

Foundation: create PMEM NUMA nodes
==================================

To create PMEM nodes in native kernel, Dave Hansen and Dan Williams
have working patches for kernel and ndctl. According to Ying, it'll
work like this

        ndctl destroy-namespace -f namespace0.0
        ndctl destroy-namespace -f namespace1.0
        ipmctl create -goal MemoryMode=100
        reboot

To create PMEM nodes in QEMU VMs, current Debian/Fedora etc. distros
already support this

        qemu-system-x86_64
        -machine pc,nvdimm
        -enable-kvm
        -smp 64
        -m 256G
        # DRAM node 0
        -object memory-backend-file,size=128G,share=on,mem-path=/dev/shm/qemu_node0,id=tmpfs-node0
        -numa node,cpus=0-31,nodeid=0,memdev=tmpfs-node0
        # PMEM node 1
        -object memory-backend-file,size=128G,share=on,mem-path=/dev/dax1.0,align=128M,id=dax-node1
        -numa node,cpus=32-63,nodeid=1,memdev=dax-node1

Optimization: do hot/cold page tracking and migration
=====================================================

Since PMEM is slower than DRAM, we need to make sure hot pages go to
DRAM and cold pages stay in PMEM, to get the best out of PMEM and DRAM.

- DRAM=>PMEM cold page migration

  It can be done in kernel page reclaim path, near the anonymous page
  swap out point. Instead of swapping out, we now have the option to
  migrate cold pages to PMEM NUMA nodes.

  User space may also do it, however it cannot act on-demand when there
  is memory pressure in DRAM nodes.

- PMEM=>DRAM hot page migration

  While LRU can be good enough for identifying cold pages, frequency
  based accounting can be more suitable for identifying hot pages.

  Our design choice is to create a flexible user space daemon to drive
  the accounting and migration, with necessary kernel support provided
  by this patchset.

Linux kernel already offers move_pages(2) for user space to migrate
pages to specified NUMA nodes. The major gap lies in hotness accounting.

User space driven hotness accounting
====================================

One way to find out hot/cold pages is to scan the page table multiple
times and collect the "accessed" bits.

We created the kvm-ept-idle kernel module to provide the "accessed" bits
via interface /proc/PID/idle_pages. User space can open it and read the
"accessed" bits for a range of virtual addresses.

Inside the kernel module, it implements 2 independent sets of page table
scan code, seamlessly providing the same interface:

- for QEMU, scan HVA range of the VM's EPT (Extended Page Table)
- for others, scan VA range of the process page table

With /proc/PID/idle_pages and move_pages(2), the user space daemon can
work like this

One round of scan+migration:

        loop N=(3-10) times:
                sleep 0.01-10s (typical values)
                scan page tables and read/accumulate accessed bits into arrays
        treat pages with accessed_count == N as hot pages
        treat pages with accessed_count == 0 as cold pages
        migrate hot pages to DRAM nodes
        migrate cold pages to PMEM nodes (optional, may do it once on multi scan rounds, to make sure they are really cold)

That just describes the bare minimal working model. A real world daemon
should consider lots more to be useful and robust. The notable one is to
avoid thrashing.

Hotness accounting can be rough and the workload can be unstable. We need
to avoid promoting a warm page to DRAM and then demoting it soon.
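
A minimal sketch of the classification step in the scan round above, written as plain Python so it can run anywhere. The `read_accessed_bits` callable is a hypothetical stand-in for reading /proc/PID/idle_pages (its on-disk format is not described in this thread), and `make_fake_reader` exists only to drive the example; neither name comes from the patchset. The sleep between scans and the actual move_pages(2) calls are left out.

```python
from collections import Counter


def scan_round(read_accessed_bits, n_scans):
    """Accumulate accessed bits over n_scans scans and classify pages.

    read_accessed_bits is a hypothetical callable returning
    {page_address: 0 or 1} for one page table scan.
    """
    counts = Counter()
    seen = set()
    for _ in range(n_scans):
        # A real daemon would sleep 0.01-10s here before each scan.
        for page, accessed in read_accessed_bits().items():
            seen.add(page)
            counts[page] += accessed
    # Hot: accessed in every scan. Cold: never accessed.
    hot = sorted(p for p in seen if counts[p] == n_scans)
    cold = sorted(p for p in seen if counts[p] == 0)
    return hot, cold


def make_fake_reader(always, never, sometimes):
    """Test helper (assumption): pages in 'sometimes' are accessed on odd scans only."""
    state = {"scan": 0}

    def read():
        state["scan"] += 1
        bits = {p: 1 for p in always}
        bits.update({p: 0 for p in never})
        bits.update({p: state["scan"] % 2 for p in sometimes})
        return bits

    return read


hot, cold = scan_round(make_fake_reader([0x1000, 0x2000], [0x3000], [0x4000]), 4)
```

Pages that are accessed in only some scans (0x4000 here) land in neither list, which is what keeps warm pages from being migrated in either direction.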

The basic scheme is to auto control scan interval and count, so that
each round of scan will get hot pages < 1/2 DRAM size.

We may also do multiple rounds of scans before migration, to filter out
unstable/burst accesses.

In the long run, most of the accounted hot pages will already be in DRAM.
So we only need to migrate the new ones to DRAM. When doing so, we should
consider QoS and rate limiting to reduce the impact on user workloads.

When user space drives hot page migration, the DRAM nodes may well be
pressured, which will in turn trigger in-kernel cold page migration.
The above 1/2 DRAM size hot pages target can help the kernel easily find
cold pages on LRU scan.

To avoid thrashing, it's also important to maintain a consistent kernel
and user-space view of hot/cold pages, since they do migrations in 2
different directions.

- the regular page table scans will clear PMD/PTE young
- user space compensates for that by setting PG_referenced on
  move_pages(hot pages, MPOL_MF_SW_YOUNG)

That guarantees the user space collected view of hot pages will be
conveyed to the kernel.

Regards,
Fengguang
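
The "hot pages < 1/2 DRAM size" target above can be sketched as a small feedback step run after each scan round. The email does not say which way the interval or scan count should move, so the policy below is purely an assumption: a shorter interval or a larger N makes `accessed_count == N` harder to reach and therefore shrinks the hot set. The function name, parameters, and clamping limits (taken from the "0.01-10s" and "N=(3-10)" ranges quoted earlier) are illustrative only.

```python
def adjust_scan_params(hot_pages, page_size, dram_bytes, interval, n_scans,
                       min_interval=0.01, max_scans=10):
    """Return (interval, n_scans) for the next round.

    Assumed policy, not taken from the patchset:
    - if the hot set already fits in less than half of DRAM, keep going as is;
    - otherwise halve the scan interval, down to min_interval;
    - once the interval is at its floor, add one more scan per round.
    """
    hot_bytes = hot_pages * page_size
    if hot_bytes < dram_bytes / 2:
        return interval, n_scans
    if interval > min_interval:
        return max(min_interval, interval / 2), n_scans
    return interval, min(max_scans, n_scans + 1)


DRAM = 16 * 2**20  # a toy 16 MiB DRAM node
params = adjust_scan_params(3000, 4096, DRAM, interval=1.0, n_scans=3)
```

A real daemon would presumably also relax the parameters again when the hot set is far below the target; that is omitted here.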
On Fri 28-12-18 13:08:06, Wu Fengguang wrote:
[...]
> Optimization: do hot/cold page tracking and migration
> =====================================================
>
> Since PMEM is slower than DRAM, we need to make sure hot pages go to
> DRAM and cold pages stay in PMEM, to get the best out of PMEM and DRAM.
>
> - DRAM=>PMEM cold page migration
>
>   It can be done in kernel page reclaim path, near the anonymous page
>   swap out point. Instead of swapping out, we now have the option to
>   migrate cold pages to PMEM NUMA nodes.

OK, this makes sense to me except I am not sure this is something that
should be pmem specific. Is there any reason why we shouldn't migrate
pages on memory pressure to other nodes in general? In other words
rather than paging out we would migrate over to the next node that is
not under memory pressure. Swapout would be the next level when the
memory is (almost) fully utilized. That wouldn't be pmem specific.

>   User space may also do it, however cannot act on-demand, when there
>   are memory pressure in DRAM nodes.
>
> - PMEM=>DRAM hot page migration
>
>   While LRU can be good enough for identifying cold pages, frequency
>   based accounting can be more suitable for identifying hot pages.
>
>   Our design choice is to create a flexible user space daemon to drive
>   the accounting and migration, with necessary kernel supports by this
>   patchset.

We do have numa balancing, why cannot we rely on it? This along with the
above would allow to have pmem numa nodes (cpuless nodes in fact)
without any special casing and a natural part of the MM. It would be
only the matter of the configuration to set the appropriate distance to
allow reasonable allocation fallback strategy.

I haven't looked at the implementation yet but if you are proposing a
special cased zone lists then this is something CDM (Coherent Device
Memory) was trying to do two years ago and there was quite some
skepticism in the approach.
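
The generic alternative suggested in this reply (on pressure, migrate to the next node that is not itself under pressure, and only swap when none is left) can be sketched as a target-selection function. This is a user-space model, not kernel code: the `distance` table mimics a SLIT-style node distance matrix, and `under_pressure` is a hypothetical per-node flag; how the kernel would actually detect pressure is not specified in the thread.

```python
def pick_demotion_target(src, distance, under_pressure):
    """Return the nearest other node that is not under memory pressure.

    distance: {node: {node: distance}}, e.g. values from a SLIT-like table.
    under_pressure: {node: bool}, a hypothetical per-node pressure flag.
    Returns None when every other node is pressured, meaning the caller
    would fall back to swapout.
    """
    candidates = [n for n in distance[src]
                  if n != src and not under_pressure[n]]
    if not candidates:
        return None
    # Closest node first; break ties by node id for determinism.
    return min(candidates, key=lambda n: (distance[src][n], n))


distance = {0: {0: 10, 1: 21, 2: 17}}
target = pick_demotion_target(0, distance, {0: True, 1: False, 2: False})
```

With node 2 closer than node 1, pressure on node 0 demotes to node 2 first, then to node 1 once node 2 is pressured as well, and finally falls back to swap.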
On Fri, Dec 28, 2018 at 09:41:05AM +0100, Michal Hocko wrote:
>On Fri 28-12-18 13:08:06, Wu Fengguang wrote:
>[...]
>> Optimization: do hot/cold page tracking and migration
>> =====================================================
>>
>> Since PMEM is slower than DRAM, we need to make sure hot pages go to
>> DRAM and cold pages stay in PMEM, to get the best out of PMEM and DRAM.
>>
>> - DRAM=>PMEM cold page migration
>>
>>   It can be done in kernel page reclaim path, near the anonymous page
>>   swap out point. Instead of swapping out, we now have the option to
>>   migrate cold pages to PMEM NUMA nodes.
>
>OK, this makes sense to me except I am not sure this is something that
>should be pmem specific. Is there any reason why we shouldn't migrate
>pages on memory pressure to other nodes in general? In other words
>rather than paging out we would migrate over to the next node that is
>not under memory pressure. Swapout would be the next level when the
>memory is (almost) fully utilized. That wouldn't be pmem specific.

In future there could be multiple memory levels with different
performance/size/cost metrics. There is ongoing HMAT work to describe
that. When ready, we can switch to the HMAT based general
infrastructure. Then the code will no longer be PMEM specific, but do
general promotion/demotion migrations between high/low memory levels.
Swapout could be from the lowest level memory.

Migration between peer nodes is the obvious simple way and a good
choice for the initial implementation. But yeah, it's possible to
migrate to other nodes. For example, it can be combined with NUMA
balancing: if we know the page is mostly accessed by the other socket,
then it'd be best to migrate hot/cold pages directly to that socket.

>>   User space may also do it, however cannot act on-demand, when there
>>   are memory pressure in DRAM nodes.
>>
>> - PMEM=>DRAM hot page migration
>>
>>   While LRU can be good enough for identifying cold pages, frequency
>>   based accounting can be more suitable for identifying hot pages.
>>
>>   Our design choice is to create a flexible user space daemon to drive
>>   the accounting and migration, with necessary kernel supports by this
>>   patchset.
>
>We do have numa balancing, why cannot we rely on it? This along with the
>above would allow to have pmem numa nodes (cpuless nodes in fact)
>without any special casing and a natural part of the MM. It would be
>only the matter of the configuration to set the appropriate distance to
>allow reasonable allocation fallback strategy.

Good question. We actually tried reusing the NUMA balancing mechanism to
do page-fault triggered migration. move_pages() only calls
change_prot_numa(). It turns out the 2 migration types have different
purposes (one for hotness, another for home node) and hence different
implementation details. We ended up modifying some NUMA balancing
logic -- removing rate limiting, changing target node logic, etc.

Those look like unnecessary complexities for this post. This v2 patchset
mainly fulfills our first milestone goal: a minimal viable solution
that's relatively clean to backport. Even when preparing for new
upstreamable versions, it may be good to keep it simple for the
initial upstream inclusion.

>I haven't looked at the implementation yet but if you are proposing a
>special cased zone lists then this is something CDM (Coherent Device
>Memory) was trying to do two years ago and there was quite some
>skepticism in the approach.

It looks like we are pretty different from CDM. :)
We are creating new NUMA nodes rather than CDM's new ZONE.
The zonelists modification is just to make PMEM nodes more separated.

Thanks,
Fengguang
On Fri 28-12-18 17:42:08, Wu Fengguang wrote:
[...]
> Those look unnecessary complexities for this post. This v2 patchset
> mainly fulfills our first milestone goal: a minimal viable solution
> that's relatively clean to backport. Even when preparing for new
> upstreamable versions, it may be good to keep it simple for the
> initial upstream inclusion.

On the other hand this is creating a new NUMA semantic and I would
rather have something long term than "let's throw something in now and
care about long term later". So I would really prefer to talk about long
term plans first and only care about implementation details later.

> > I haven't looked at the implementation yet but if you are proposing a
> > special cased zone lists then this is something CDM (Coherent Device
> > Memory) was trying to do two years ago and there was quite some
> > skepticism in the approach.
>
> It looks we are pretty different than CDM. :)
> We creating new NUMA nodes rather than CDM's new ZONE.
> The zonelists modification is just to make PMEM nodes more separated.

Yes, this is exactly what CDM was after. Have a zone which is not
reachable without explicit request AFAIR. So no, I do not think you are
too different, you just use a different terminology ;)
On Fri, Dec 28, 2018 at 01:15:15PM +0100, Michal Hocko wrote:
>On Fri 28-12-18 17:42:08, Wu Fengguang wrote:
>[...]
>> Those look unnecessary complexities for this post. This v2 patchset
>> mainly fulfills our first milestone goal: a minimal viable solution
>> that's relatively clean to backport. Even when preparing for new
>> upstreamable versions, it may be good to keep it simple for the
>> initial upstream inclusion.
>
>On the other hand this is creating a new NUMA semantic and I would like
>to have something long term than let's throw something in now and care
>about long term later. So I would really prefer to talk about long term
>plans first and only care about implementation details later.

That makes good sense. FYI here are the several in-house patches that
try to leverage (but not yet integrate with) NUMA balancing. The last
one is brute force hacking. They obviously break the original NUMA
balancing logic.

Thanks,
Fengguang

From ef41a542568913c8c62251021c3bc38b7a549440 Mon Sep 17 00:00:00 2001
From: Liu Jingqi <jingqi.liu@intel.com>
Date: Sat, 29 Sep 2018 23:29:56 +0800
Subject: [PATCH 074/166] migrate: set PROT_NONE on the PTEs and let NUMA
 balancing

Need to enable CONFIG_NUMA_BALANCING firstly.
Set PROT_NONE on the PTEs that map to the page,
and do the actual migration in the context of process which initiate migration.

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 mm/migrate.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index b27a287081c2..d933f6966601 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1530,6 +1530,21 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 	if (page_mapcount(page) > 1 && !migrate_all)
 		goto out_putpage;
 
+	if (flags & MPOL_MF_SW_YOUNG) {
+		unsigned long start, end;
+		unsigned long nr_pte_updates = 0;
+
+		start = max(addr, vma->vm_start);
+
+		/* TODO: if huge page */
+		end = ALIGN(addr + (1 << PAGE_SHIFT), PAGE_SIZE);
+		end = min(end, vma->vm_end);
+		nr_pte_updates = change_prot_numa(vma, start, end);
+
+		err = 0;
+		goto out_putpage;
+	}
+
 	if (PageHuge(page)) {
 		if (PageHead(page)) {
 			/* Check if the page is software young. */
>> > I haven't looked at the implementation yet but if you are proposing a >> > special cased zone lists then this is something CDM (Coherent Device >> > Memory) was trying to do two years ago and there was quite some >> > skepticism in the approach. >> >> It looks we are pretty different than CDM. :) >> We creating new NUMA nodes rather than CDM's new ZONE. >> The zonelists modification is just to make PMEM nodes more separated. > >Yes, this is exactly what CDM was after. Have a zone which is not >reachable without explicit request AFAIR. So no, I do not think you are >too different, you just use a different terminology ;) Got it. OK.. The fall back zonelists patch does need more thoughts. In long term POV, Linux should be prepared for multi-level memory. Then there will arise the need to "allocate from this level memory". So it looks good to have separated zonelists for each level of memory. On the other hand, there will also be page allocations that don't care about the exact memory level. So it looks reasonable to expect different kind of fallback zonelists that can be selected by NUMA policy. Thanks, Fengguang
On Fri, Dec 28, 2018 at 5:31 AM Fengguang Wu <fengguang.wu@intel.com> wrote: > > >> > I haven't looked at the implementation yet but if you are proposing a > >> > special cased zone lists then this is something CDM (Coherent Device > >> > Memory) was trying to do two years ago and there was quite some > >> > skepticism in the approach. > >> > >> It looks we are pretty different than CDM. :) > >> We creating new NUMA nodes rather than CDM's new ZONE. > >> The zonelists modification is just to make PMEM nodes more separated. > > > >Yes, this is exactly what CDM was after. Have a zone which is not > >reachable without explicit request AFAIR. So no, I do not think you are > >too different, you just use a different terminology ;) > > Got it. OK.. The fall back zonelists patch does need more thoughts. > > In long term POV, Linux should be prepared for multi-level memory. > Then there will arise the need to "allocate from this level memory". > So it looks good to have separated zonelists for each level of memory. I tend to agree with Fengguang. We do have needs for finer grained control to the usage of DRAM and PMEM, for example, controlling the percentage of DRAM and PMEM for a specific VMA. NUMA policy sounds not good enough for some usecases since it just can control what mempolicy is used by what memory range. Our usecase's memory access pattern is random in a VMA. So, we can't control the percentage by mempolicy. We have to put PMEM into a separate zonelist to make sure memory allocation happens on PMEM when certain criteria is met as what Fengguang does in this patch series. Thanks, Yang > > On the other hand, there will also be page allocations that don't care > about the exact memory level. So it looks reasonable to expect > different kind of fallback zonelists that can be selected by NUMA policy. > > Thanks, > Fengguang >
[Cc Mel and Andrea - the thread started http://lkml.kernel.org/r/20181226131446.330864849@intel.com] On Fri 28-12-18 21:15:42, Wu Fengguang wrote: > On Fri, Dec 28, 2018 at 01:15:15PM +0100, Michal Hocko wrote: > > On Fri 28-12-18 17:42:08, Wu Fengguang wrote: > > [...] > > > Those look unnecessary complexities for this post. This v2 patchset > > > mainly fulfills our first milestone goal: a minimal viable solution > > > that's relatively clean to backport. Even when preparing for new > > > upstreamable versions, it may be good to keep it simple for the > > > initial upstream inclusion. > > > > On the other hand this is creating a new NUMA semantic and I would like > > to have something long term thatn let's throw something in now and care > > about long term later. So I would really prefer to talk about long term > > plans first and only care about implementation details later. > > That makes good sense. FYI here are the several in-house patches that > try to leverage (but not yet integrate with) NUMA balancing. The last > one is brutal force hacking. They obviously break original NUMA > balancing logic. > > Thanks, > Fengguang > >From ef41a542568913c8c62251021c3bc38b7a549440 Mon Sep 17 00:00:00 2001 > From: Liu Jingqi <jingqi.liu@intel.com> > Date: Sat, 29 Sep 2018 23:29:56 +0800 > Subject: [PATCH 074/166] migrate: set PROT_NONE on the PTEs and let NUMA > balancing > > Need to enable CONFIG_NUMA_BALANCING firstly. > Set PROT_NONE on the PTEs that map to the page, > and do the actual migration in the context of process which initiate migration. 
> > Signed-off-by: Liu Jingqi <jingqi.liu@intel.com> > Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> > --- > mm/migrate.c | 15 +++++++++++++++ > 1 file changed, 15 insertions(+) > > diff --git a/mm/migrate.c b/mm/migrate.c > index b27a287081c2..d933f6966601 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1530,6 +1530,21 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr, > if (page_mapcount(page) > 1 && !migrate_all) > goto out_putpage; > > + if (flags & MPOL_MF_SW_YOUNG) { > + unsigned long start, end; > + unsigned long nr_pte_updates = 0; > + > + start = max(addr, vma->vm_start); > + > + /* TODO: if huge page */ > + end = ALIGN(addr + (1 << PAGE_SHIFT), PAGE_SIZE); > + end = min(end, vma->vm_end); > + nr_pte_updates = change_prot_numa(vma, start, end); > + > + err = 0; > + goto out_putpage; > + } > + > if (PageHuge(page)) { > if (PageHead(page)) { > /* Check if the page is software young. */ > -- > 2.15.0 > > >From e617e8c2034387cbed50bafa786cf83528dbe3df Mon Sep 17 00:00:00 2001 > From: Fengguang Wu <fengguang.wu@intel.com> > Date: Sun, 30 Sep 2018 10:50:58 +0800 > Subject: [PATCH 075/166] migrate: consolidate MPOL_MF_SW_YOUNG behaviors > > - if page already in target node: SetPageReferenced > - otherwise: change_prot_numa > > Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> > --- > arch/x86/kvm/Kconfig | 1 + > mm/migrate.c | 65 +++++++++++++++++++++++++++++++--------------------- > 2 files changed, 40 insertions(+), 26 deletions(-) > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index 4c6dec47fac6..c103373536fc 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -100,6 +100,7 @@ config KVM_EPT_IDLE > tristate "KVM EPT idle page tracking" > depends on KVM_INTEL > depends on PROC_PAGE_MONITOR > + depends on NUMA_BALANCING > ---help--- > Provides support for walking EPT to get the A bits on Intel > processors equipped with the VT extensions. 
> diff --git a/mm/migrate.c b/mm/migrate.c > index d933f6966601..d944f031c9ea 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1500,6 +1500,8 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr, > { > struct vm_area_struct *vma; > struct page *page; > + unsigned long end; > + unsigned int page_nid; > unsigned int follflags; > int err; > bool migrate_all = flags & MPOL_MF_MOVE_ALL; > @@ -1522,49 +1524,60 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr, > if (!page) > goto out; > > - err = 0; > - if (page_to_nid(page) == node) > - goto out_putpage; > + page_nid = page_to_nid(page); > > err = -EACCES; > if (page_mapcount(page) > 1 && !migrate_all) > goto out_putpage; > > - if (flags & MPOL_MF_SW_YOUNG) { > - unsigned long start, end; > - unsigned long nr_pte_updates = 0; > - > - start = max(addr, vma->vm_start); > - > - /* TODO: if huge page */ > - end = ALIGN(addr + (1 << PAGE_SHIFT), PAGE_SIZE); > - end = min(end, vma->vm_end); > - nr_pte_updates = change_prot_numa(vma, start, end); > - > - err = 0; > - goto out_putpage; > - } > - > + err = 0; > if (PageHuge(page)) { > - if (PageHead(page)) { > - /* Check if the page is software young. 
*/ > - if (flags & MPOL_MF_SW_YOUNG) > + if (!PageHead(page)) { > + err = -EACCES; > + goto out_putpage; > + } > + if (flags & MPOL_MF_SW_YOUNG) { > + if (page_nid == node) > SetPageReferenced(page); > - isolate_huge_page(page, pagelist); > - err = 0; > + else if (PageAnon(page)) { > + end = addr + (hpage_nr_pages(page) << PAGE_SHIFT); > + if (end <= vma->vm_end) > + change_prot_numa(vma, addr, end); > + } > + goto out_putpage; > } > + if (page_nid == node) > + goto out_putpage; > + isolate_huge_page(page, pagelist); > } else { > struct page *head; > > head = compound_head(page); > + > + if (flags & MPOL_MF_SW_YOUNG) { > + if (page_nid == node) > + SetPageReferenced(head); > + else { > + unsigned long size; > + size = hpage_nr_pages(head) << PAGE_SHIFT; > + end = addr + size; > + if (unlikely(addr & (size - 1))) > + err = -EXDEV; > + else if (likely(end <= vma->vm_end)) > + change_prot_numa(vma, addr, end); > + else > + err = -ERANGE; > + } > + goto out_putpage; > + } > + if (page_nid == node) > + goto out_putpage; > + > err = isolate_lru_page(head); > if (err) > goto out_putpage; > > err = 0; > - /* Check if the page is software young. 
*/ > - if (flags & MPOL_MF_SW_YOUNG) > - SetPageReferenced(head); > list_add_tail(&head->lru, pagelist); > mod_node_page_state(page_pgdat(head), > NR_ISOLATED_ANON + page_is_file_cache(head), > -- > 2.15.0 > > >From a2d9740d1639f807868014c16dc9e2620d356f3c Mon Sep 17 00:00:00 2001 > From: Fengguang Wu <fengguang.wu@intel.com> > Date: Sun, 30 Sep 2018 19:22:27 +0800 > Subject: [PATCH 076/166] mempolicy: force NUMA balancing > > Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> > --- > mm/memory.c | 3 ++- > mm/mempolicy.c | 5 ----- > 2 files changed, 2 insertions(+), 6 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index c467102a5cbc..20c7efdff63b 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3775,7 +3775,8 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, > *flags |= TNF_FAULT_LOCAL; > } > > - return mpol_misplaced(page, vma, addr); > + return 0; > + /* return mpol_misplaced(page, vma, addr); */ > } > > static vm_fault_t do_numa_page(struct vm_fault *vmf) > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index da858f794eb6..21dc6ba1d062 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -2295,8 +2295,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long > int ret = -1; > > pol = get_vma_policy(vma, addr); > - if (!(pol->flags & MPOL_F_MOF)) > - goto out; > > switch (pol->mode) { > case MPOL_INTERLEAVE: > @@ -2336,9 +2334,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long > /* Migrate the page towards the node whose CPU is referencing it */ > if (pol->flags & MPOL_F_MORON) { > polnid = thisnid; > - > - if (!should_numa_migrate_memory(current, page, curnid, thiscpu)) > - goto out; > } > > if (curnid != polnid) > -- > 2.15.0 >
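
The branch structure that the quoted patch 075 gives the MPOL_MF_SW_YOUNG path for non-hugetlb pages can be restated as a small user-space model. This is only a reading aid built from the diff above, not kernel code: `sw_young_action` and its return tuples are made up for illustration, `PAGE_SHIFT = 12` assumes 4 KiB base pages, and `nr_pages` stands in for `hpage_nr_pages(head)`.

```python
import errno

PAGE_SHIFT = 12  # assumption: 4 KiB base pages


def sw_young_action(page_nid, target_nid, addr, nr_pages, vm_end):
    """Model of the MPOL_MF_SW_YOUNG branch for non-hugetlb pages.

    Returns (action, err), where action names the kernel helper the
    patch would call and err is the errno-style result.
    """
    if page_nid == target_nid:
        # Already on the target node: just mark it referenced.
        return ("SetPageReferenced", 0)
    size = nr_pages << PAGE_SHIFT
    end = addr + size
    if addr & (size - 1):
        # Address is not aligned to the (possibly compound) page size.
        return (None, -errno.EXDEV)
    if end <= vm_end:
        # Arm a NUMA hinting fault; migration happens on the next access.
        return ("change_prot_numa", 0)
    return (None, -errno.ERANGE)


result = sw_young_action(page_nid=1, target_nid=0,
                         addr=0x2000, nr_pages=1, vm_end=0x10000)
```

Under this reading, a hot page already in DRAM is only marked referenced, while a hot page elsewhere is not moved by move_pages() itself; it is left for the NUMA balancing fault path to migrate.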
[Ccing Mel and Andrea] On Fri 28-12-18 21:31:11, Wu Fengguang wrote: > > > > I haven't looked at the implementation yet but if you are proposing a > > > > special cased zone lists then this is something CDM (Coherent Device > > > > Memory) was trying to do two years ago and there was quite some > > > > skepticism in the approach. > > > > > > It looks we are pretty different than CDM. :) > > > We creating new NUMA nodes rather than CDM's new ZONE. > > > The zonelists modification is just to make PMEM nodes more separated. > > > > Yes, this is exactly what CDM was after. Have a zone which is not > > reachable without explicit request AFAIR. So no, I do not think you are > > too different, you just use a different terminology ;) > > Got it. OK.. The fall back zonelists patch does need more thoughts. > > In long term POV, Linux should be prepared for multi-level memory. > Then there will arise the need to "allocate from this level memory". > So it looks good to have separated zonelists for each level of memory. Well, I do not have a good answer for you here. We do not have good experiences with those systems, I am afraid. NUMA is with us for more than a decade yet our APIs are coarse to say the least and broken at so many times as well. Starting a new API just based on PMEM sounds like a ticket to another disaster to me. I would like to see solid arguments why the current model of numa nodes with fallback in distances order cannot be used for those new technologies in the beginning and develop something better based on our experiences that we gain on the way. I would be especially interested about a possibility of the memory migration idea during a memory pressure and relying on numa balancing to resort the locality on demand rather than hiding certain NUMA nodes or zones from the allocator and expose them only to the userspace. > On the other hand, there will also be page allocations that don't care > about the exact memory level. 
So it looks reasonable to expect > different kind of fallback zonelists that can be selected by NUMA policy. > > Thanks, > Fengguang
On Fri, 28 Dec 2018 20:52:24 +0100 Michal Hocko <mhocko@kernel.org> wrote: > [Ccing Mel and Andrea] > > On Fri 28-12-18 21:31:11, Wu Fengguang wrote: > > > > > I haven't looked at the implementation yet but if you are proposing a > > > > > special cased zone lists then this is something CDM (Coherent Device > > > > > Memory) was trying to do two years ago and there was quite some > > > > > skepticism in the approach. > > > > > > > > It looks we are pretty different than CDM. :) > > > > We creating new NUMA nodes rather than CDM's new ZONE. > > > > The zonelists modification is just to make PMEM nodes more separated. > > > > > > Yes, this is exactly what CDM was after. Have a zone which is not > > > reachable without explicit request AFAIR. So no, I do not think you are > > > too different, you just use a different terminology ;) > > > > Got it. OK.. The fall back zonelists patch does need more thoughts. > > > > In long term POV, Linux should be prepared for multi-level memory. > > Then there will arise the need to "allocate from this level memory". > > So it looks good to have separated zonelists for each level of memory. > > Well, I do not have a good answer for you here. We do not have good > experiences with those systems, I am afraid. NUMA is with us for more > than a decade yet our APIs are coarse to say the least and broken at so > many times as well. Starting a new API just based on PMEM sounds like a > ticket to another disaster to me. > > I would like to see solid arguments why the current model of numa nodes > with fallback in distances order cannot be used for those new > technologies in the beginning and develop something better based on our > experiences that we gain on the way. 
>
> I would be especially interested about a possibility of the memory
> migration idea during a memory pressure and relying on numa balancing to
> resort the locality on demand rather than hiding certain NUMA nodes or
> zones from the allocator and expose them only to the userspace.

This is indeed a very interesting direction. I'm coming at this from a CCIX
point of view. Ignore the next bit if you are already familiar with CCIX :)

Main thing CCIX brings is that memory can be fully coherent anywhere in the
system including out near accelerators, all via shared physical address space,
leveraging ATS / IOMMUs / MMUs to do translations. Result is a big and possibly
extremely heterogeneous NUMA system. All the setup is done in firmware so by the
time the kernel sees it everything is in SRAT / SLIT / NFIT / HMAT etc.

We have a few usecases that need some more fine grained control combined with
automated balancing. So far we've been messing with nasty tricks like
hotplugging memory after boot a long way away, or the original CDM zone patches
(knowing they weren't likely to go anywhere!) Userspace is all hand tuned
which is not great in the long run...

Use cases (I've probably missed some):

* Storage Class Memory near to the host CPU / DRAM controllers (pretty much
  the same as this series is considering). Note that there isn't necessarily
  any 'pairing' with host DRAM as seen in this RFC. A typical system might have
  a large single pool with similar access characteristics from each host SOC.
  The paired approach is probably going to be common in early systems though.
  Also not necessarily Non Volatile, could just be a big DDR expansion board.

* RAM out near an accelerator. Aim would be to migrate data to that RAM if
  the access patterns from the accelerator justify it being there rather than
  near any of the host CPUs. In a memory pressure on host situation anything
  could be pushed out there as probably still better than swapping.
  Note that this would require some knowledge of 'who' is doing the accessing
  which isn't needed for what this RFC is doing.

* Hot pages may not be hot just because the host is using them a lot. It would
  be very useful to have a means of adding information available from
  accelerators beyond simple accessed bits (dreaming ;) One problem here is
  translation caches (ATCs) as they won't normally result in any updates to the
  page accessed bits. The arm SMMU v3 spec for example makes it clear (though
  it's kind of obvious) that the ATS request is the only opportunity to update
  the accessed bit. The nasty option here would be to periodically flush the
  ATC to force the access bit updates via repeats of the ATS request (ouch).
  That option only works if the iommu supports updating the accessed flag
  (optional on SMMU v3 for example).

We need the explicit placement, but can get that from existing NUMA controls.
More of a concern is persuading the kernel it really doesn't want to put its
data structures in distant memory as it can be very very distant.

So ideally I'd love this set to head in a direction that helps me tick off
at least some of the above usecases and hopefully have some visibility on
how to address the others moving forwards.

Good to see some new thoughts in this area!

Jonathan
>
> > On the other hand, there will also be page allocations that don't care
> > about the exact memory level. So it looks reasonable to expect
> > different kind of fallback zonelists that can be selected by NUMA policy.
> >
> > Thanks,
> > Fengguang
>
On 12/28/18 12:41 AM, Michal Hocko wrote: >> >> It can be done in kernel page reclaim path, near the anonymous page >> swap out point. Instead of swapping out, we now have the option to >> migrate cold pages to PMEM NUMA nodes. > OK, this makes sense to me except I am not sure this is something that > should be pmem specific. Is there any reason why we shouldn't migrate > pages on memory pressure to other nodes in general? In other words > rather than paging out we whould migrate over to the next node that is > not under memory pressure. Swapout would be the next level when the > memory is (almost_) fully utilized. That wouldn't be pmem specific. Yeah, we don't want to make this specific to any particular kind of memory. For instance, with lots of pressure on expensive, small high-bandwidth memory (HBM), we might want to migrate some HBM contents to DRAM. We need to decide on whether we want to cause pressure on the destination nodes or not, though. I think you're suggesting that we try to look for things under some pressure and totally avoid them. That sounds sane, but I also like the idea of this being somewhat ordered. Think of if we have three nodes, A, B, C. A is fast, B is medium, C is slow. If A and B are "full" and we want to reclaim some of A, do we: 1. Migrate A->B, and put pressure on a later B->C migration, or 2. Migrate A->C directly ? Doing A->C is less resource intensive because there's only one migration involved. But, doing A->B/B->C probably makes the app behave better because the "A data" is presumably more valuable and is more appropriately placed in B rather than being demoted all the way to C.
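The two options can be compared with a small toy model. This is only a sketch in Python, not kernel code: the tier names follow the A/B/C example above, while the function names, the per-tier free-page counters and the rule that a full tier first pushes one of its own pages further down are assumptions made for illustration.

```python
# Toy model of the two demotion options for tiers ordered fast -> slow.
# All names and the free-page bookkeeping are hypothetical.
TIERS = ["A", "B", "C"]


def demote_stepwise(free, src):
    """Option 1: move one page from src to the next tier down.

    If that tier is full, one of its pages is first pushed further down,
    which models the extra pressure on a later B->C migration.
    Returns the list of (from, to) migrations performed.
    """
    moves = []
    idx = TIERS.index(src)
    if idx + 1 >= len(TIERS):
        return moves  # slowest tier: nothing below it
    dst = TIERS[idx + 1]
    if free[dst] == 0:
        moves += demote_stepwise(free, dst)
    if free[dst] > 0:
        free[dst] -= 1
        free[src] += 1
        moves.append((src, dst))
    return moves


def demote_direct(free, src):
    """Option 2: move one page from src to the first lower tier with room."""
    idx = TIERS.index(src)
    for dst in TIERS[idx + 1:]:
        if free[dst] > 0:
            free[dst] -= 1
            free[src] += 1
            return [(src, dst)]
    return []


# A and B are "full", C has room.
full_a_b = {"A": 0, "B": 0, "C": 4}
stepwise_moves = demote_stepwise(dict(full_a_b), "A")
direct_moves = demote_direct(dict(full_a_b), "A")
```

In this model the stepwise option performs two migrations (B->C, then A->B) where the direct option performs one (A->C), which is the resource trade-off described above. The model says nothing about which placement makes the application behave better.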
On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote: > [Ccing Mel and Andrea] > > On Fri 28-12-18 21:31:11, Wu Fengguang wrote: > > > > > I haven't looked at the implementation yet but if you are proposing a > > > > > special cased zone lists then this is something CDM (Coherent Device > > > > > Memory) was trying to do two years ago and there was quite some > > > > > skepticism in the approach. > > > > > > > > It looks we are pretty different than CDM. :) > > > > We creating new NUMA nodes rather than CDM's new ZONE. > > > > The zonelists modification is just to make PMEM nodes more separated. > > > > > > Yes, this is exactly what CDM was after. Have a zone which is not > > > reachable without explicit request AFAIR. So no, I do not think you are > > > too different, you just use a different terminology ;) > > > > Got it. OK.. The fall back zonelists patch does need more thoughts. > > > > In long term POV, Linux should be prepared for multi-level memory. > > Then there will arise the need to "allocate from this level memory". > > So it looks good to have separated zonelists for each level of memory. > > Well, I do not have a good answer for you here. We do not have good > experiences with those systems, I am afraid. NUMA is with us for more > than a decade yet our APIs are coarse to say the least and broken at so > many times as well. Starting a new API just based on PMEM sounds like a > ticket to another disaster to me. > > I would like to see solid arguments why the current model of numa nodes > with fallback in distances order cannot be used for those new > technologies in the beginning and develop something better based on our > experiences that we gain on the way. > > I would be especially interested about a possibility of the memory > migration idea during a memory pressure and relying on numa balancing to > resort the locality on demand rather than hiding certain NUMA nodes or > zones from the allocator and expose them only to the userspace. 
> I didn't read the thread as I'm backlogged as I imagine a lot of people are. However, I would agree that zonelists are not a good fit for something like PMEM-based memory being made available via a zonelist with a fake distance combined with NUMA balancing moving pages in and out of DRAM and PMEM. The same applies to a much lesser extent for something like a special higher-speed memory that is faster than RAM. The fundamental problem encountered will be a hot-page-inversion issue. In the PMEM case, DRAM fills, then PMEM starts filling except now we know that the most recently allocated page which is potentially the most important in terms of hotness is allocated on slower "remote" memory. Reclaim kicks in for the DRAM node and then there is interleaving of hotness between DRAM and PMEM with NUMA balancing then getting involved with non-deterministic performance. I recognise that the same problem happens for remote NUMA nodes and it also has an inversion issue once reclaim gets involved, but it also has a clearly defined API for dealing with that problem if applications encounter it. It's also relatively well known given the age of the problem and how to cope with it. It's less clear whether applications would be able to cope if it's a more distant PMEM instead of a remote DRAM and how that should be advertised. This has been brought up repeatedly over the last few years since high speed memory was first mentioned but I think long-term what we should be thinking of is "age-based-migration" where cold pages from DRAM get migrated to PMEM when DRAM fills and use NUMA balancing to promote hot pages from PMEM to DRAM. It should also be workable for remote DRAM although that *might* violate the principle of least surprise given that applications exist that are remote NUMA aware. It might be safer overall if such age-based-migration is specific to local-but-different-speed memory with the main DRAM only being in the zonelists.
NUMA balancing could still optionally promote from DRAM->faster memory while aging moves pages from fast->slow as memory pressure dictates. There still would need to be thought on exactly how this is advertised to userspace because while "distance" is reasonably well understood, it's not as clear to me whether distance is appropriate to describe "local-but-different-speed" memory given that accessing a remote NUMA node can saturate a single link whereas the same may not be true of local-but-different-speed memory which probably has dedicated channels. In an ideal world, application developers interested in higher-speed-memory-reserved-for-important-use and cheaper-lower-speed-memory could describe what sort of application modifications they'd be willing to make but that might be unlikely.
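The hot-page-inversion problem and the age-based alternative can be illustrated with a toy placement model. This is a hedged Python sketch, not a description of any kernel implementation: the function names are hypothetical, page age is approximated by allocation order, and the promotion side (NUMA balancing moving hot pages back to DRAM) is not modelled at all.

```python
from collections import OrderedDict


def allocate_with_fallback(num_pages, dram_capacity):
    """Zonelist-style fallback: once DRAM is full, new pages land in PMEM."""
    placement = {}
    for page in range(num_pages):
        placement[page] = "DRAM" if page < dram_capacity else "PMEM"
    return placement


def allocate_with_aging(num_pages, dram_capacity):
    """Age-based demotion: new pages always go to DRAM.

    When DRAM is full, the oldest (assumed coldest) DRAM page is first
    migrated to PMEM. Assumes dram_capacity >= 1.
    """
    dram = OrderedDict()  # insertion order stands in for page age
    placement = {}
    for page in range(num_pages):
        if len(dram) >= dram_capacity:
            cold, _ = dram.popitem(last=False)  # pop the oldest entry
            placement[cold] = "PMEM"
        dram[page] = True
        placement[page] = "DRAM"
    return placement


# Six pages allocated in order 0..5 with room for four pages in DRAM.
fallback = allocate_with_fallback(6, 4)
aging = allocate_with_aging(6, 4)
```

With plain fallback the newest pages (4 and 5) end up in PMEM, which is the inversion described above; with age-based demotion the oldest pages (0 and 1) are the ones moved to PMEM and the newest stay in DRAM.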
On Wed 02-01-19 12:21:10, Jonathan Cameron wrote: [...] > So ideally I'd love this set to head in a direction that helps me tick off > at least some of the above usecases and hopefully have some visibility on > how to address the others moving forwards, Is it sufficient to have such a memory marked as movable (aka only have ZONE_MOVABLE)? That should rule out most of the kernel allocations and it fits the "balance by migration" concept.
On Wed 02-01-19 10:12:04, Dave Hansen wrote: > On 12/28/18 12:41 AM, Michal Hocko wrote: > >> > >> It can be done in kernel page reclaim path, near the anonymous page > >> swap out point. Instead of swapping out, we now have the option to > >> migrate cold pages to PMEM NUMA nodes. > > OK, this makes sense to me except I am not sure this is something that > > should be pmem specific. Is there any reason why we shouldn't migrate > > pages on memory pressure to other nodes in general? In other words > > rather than paging out we whould migrate over to the next node that is > > not under memory pressure. Swapout would be the next level when the > > memory is (almost_) fully utilized. That wouldn't be pmem specific. > > Yeah, we don't want to make this specific to any particular kind of > memory. For instance, with lots of pressure on expensive, small > high-bandwidth memory (HBM), we might want to migrate some HBM contents > to DRAM. > > We need to decide on whether we want to cause pressure on the > destination nodes or not, though. I think you're suggesting that we try > to look for things under some pressure and totally avoid them. That > sounds sane, but I also like the idea of this being somewhat ordered. > > Think of if we have three nodes, A, B, C. A is fast, B is medium, C is > slow. If A and B are "full" and we want to reclaim some of A, do we: > > 1. Migrate A->B, and put pressure on a later B->C migration, or > 2. Migrate A->C directly > > ? > > Doing A->C is less resource intensive because there's only one migration > involved. But, doing A->B/B->C probably makes the app behave better > because the "A data" is presumably more valuable and is more > appropriately placed in B rather than being demoted all the way to C. This is a good question and I do not have a good answer because I lack experience with such "many levels" systems. If we followed the CPU cache model then you are right that the fallback should be gradual.
This is more complex implementation-wise, of course. Anyway, I believe that there is a lot of room for experimentation. If this stays an internal implementation detail without a user API then there is also no promise on future behavior, so nothing gets carved in stone from day 1 when our experience is limited.
On Tue, Jan 08, 2019 at 03:52:56PM +0100, Michal Hocko wrote: > On Wed 02-01-19 12:21:10, Jonathan Cameron wrote: > [...] > > So ideally I'd love this set to head in a direction that helps me tick off > > at least some of the above usecases and hopefully have some visibility on > > how to address the others moving forwards, > > Is it sufficient to have such a memory marked as movable (aka only have > ZONE_MOVABLE)? That should rule out most of the kernel allocations and > it fits the "balance by migration" concept. This would not work for GPUs: GPU drivers really want to be in total control of their memory, yet sometimes they want to migrate some part of a process's memory to their own memory. Cheers, Jérôme
On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote: > [Ccing Mel and Andrea] > > On Fri 28-12-18 21:31:11, Wu Fengguang wrote: > > > > > I haven't looked at the implementation yet but if you are proposing a > > > > > special cased zone lists then this is something CDM (Coherent Device > > > > > Memory) was trying to do two years ago and there was quite some > > > > > skepticism in the approach. > > > > > > > > It looks we are pretty different than CDM. :) > > > > We creating new NUMA nodes rather than CDM's new ZONE. > > > > The zonelists modification is just to make PMEM nodes more separated. > > > > > > Yes, this is exactly what CDM was after. Have a zone which is not > > > reachable without explicit request AFAIR. So no, I do not think you are > > > too different, you just use a different terminology ;) > > > > Got it. OK.. The fall back zonelists patch does need more thoughts. > > > > In long term POV, Linux should be prepared for multi-level memory. > > Then there will arise the need to "allocate from this level memory". > > So it looks good to have separated zonelists for each level of memory. > > Well, I do not have a good answer for you here. We do not have good > experiences with those systems, I am afraid. NUMA is with us for more > than a decade yet our APIs are coarse to say the least and broken at so > many times as well. Starting a new API just based on PMEM sounds like a > ticket to another disaster to me. > > I would like to see solid arguments why the current model of numa nodes > with fallback in distances order cannot be used for those new > technologies in the beginning and develop something better based on our > experiences that we gain on the way. I see several issues with distance. 
First, it fully abstracts the underlying topology, and this might be problematic. For instance, if you have memory with different characteristics in the same node, like persistent memory connected to some CPU, then it might be faster for that CPU to access that persistent memory, as it has a dedicated link to it, than to access some other remote memory for which the CPU might have to share the link with other CPUs or devices. Second, distance is no longer easy to compute when you are not trying to answer what is the fastest memory for CPU-N but rather asking what is the fastest memory for CPU-N and device-M, i.e. when you are trying to find the best memory for a group of CPUs/devices. The answer can change drastically depending on the members of the group. Some advanced programmers already do graph matching, i.e. they match the graph of their program dataset/computation with the topology graph of the computer they run on to determine the best placement both for threads and memory. > I would be especially interested about a possibility of the memory > migration idea during a memory pressure and relying on numa balancing to > resort the locality on demand rather than hiding certain NUMA nodes or > zones from the allocator and expose them only to the userspace. For device memory we have more things to think of, like: - memory not accessible by the CPU - non cache coherent memory (yet still useful in some cases if the application explicitly asks for it) - device drivers want to keep full control over memory, as older applications like graphics for GPU do need contiguous physical memory and other tight control over physical memory placement So if we are talking about something to replace NUMA I would really like for that to be inclusive of device memory (which can itself be a hierarchy of different memory with different characteristics). Note that I do believe the NUMA proposed solution is something useful now. But for a new API it would be good to allow things like device memory.
This is a good topic to discuss during the next LSF/MM. Cheers, Jérôme
On Thu 10-01-19 10:53:17, Jerome Glisse wrote: > On Tue, Jan 08, 2019 at 03:52:56PM +0100, Michal Hocko wrote: > > On Wed 02-01-19 12:21:10, Jonathan Cameron wrote: > > [...] > > > So ideally I'd love this set to head in a direction that helps me tick off > > > at least some of the above usecases and hopefully have some visibility on > > > how to address the others moving forwards, > > > > Is it sufficient to have such a memory marked as movable (aka only have > > ZONE_MOVABLE)? That should rule out most of the kernel allocations and > > it fits the "balance by migration" concept. > > This would not work for GPU, GPU driver really want to be in total > control of their memory yet sometimes they want to migrate some part > of the process to their memory. But that also means that GPU doesn't really fit the model discussed here, right? I thought HMM is the way to manage such a memory.
On Thu 10-01-19 11:25:56, Jerome Glisse wrote: > On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote: > > [Ccing Mel and Andrea] > > > > On Fri 28-12-18 21:31:11, Wu Fengguang wrote: > > > > > > I haven't looked at the implementation yet but if you are proposing a > > > > > > special cased zone lists then this is something CDM (Coherent Device > > > > > > Memory) was trying to do two years ago and there was quite some > > > > > > skepticism in the approach. > > > > > > > > > > It looks we are pretty different than CDM. :) > > > > > We creating new NUMA nodes rather than CDM's new ZONE. > > > > > The zonelists modification is just to make PMEM nodes more separated. > > > > > > > > Yes, this is exactly what CDM was after. Have a zone which is not > > > > reachable without explicit request AFAIR. So no, I do not think you are > > > > too different, you just use a different terminology ;) > > > > > > Got it. OK.. The fall back zonelists patch does need more thoughts. > > > > > > In long term POV, Linux should be prepared for multi-level memory. > > > Then there will arise the need to "allocate from this level memory". > > > So it looks good to have separated zonelists for each level of memory. > > > > Well, I do not have a good answer for you here. We do not have good > > experiences with those systems, I am afraid. NUMA is with us for more > > than a decade yet our APIs are coarse to say the least and broken at so > > many times as well. Starting a new API just based on PMEM sounds like a > > ticket to another disaster to me. > > > > I would like to see solid arguments why the current model of numa nodes > > with fallback in distances order cannot be used for those new > > technologies in the beginning and develop something better based on our > > experiences that we gain on the way. > > I see several issues with distance. 
First it does fully abstract the > underlying topology and this might be problematic, for instance if > you memory with different characteristic in same node like persistent > memory connected to some CPU then it might be faster for that CPU to > access that persistent memory has it has dedicated link to it than to > access some other remote memory for which the CPU might have to share > the link with other CPUs or devices. > > Second distance is no longer easy to compute when you are not trying > to answer what is the fastest memory for CPU-N but rather asking what > is the fastest memory for CPU-N and device-M ie when you are trying to > find the best memory for a group of CPUs/devices. The answer can > changes drasticly depending on members of the groups. While you might be right, I would _really_ appreciate starting with a simpler model and moving to a more complex one based on real HW and real experience, rather than starting with an overly complicated and over-engineered approach from scratch. > Some advance programmer already do graph matching ie they match the > graph of their program dataset/computation with the topology graph > of the computer they run on to determine what is best placement both > for threads and memory. And those can still use our mempolicy API to describe their needs. If the existing API is not sufficient then let's talk about which pieces are missing. > > I would be especially interested about a possibility of the memory > > migration idea during a memory pressure and relying on numa balancing to > > resort the locality on demand rather than hiding certain NUMA nodes or > > zones from the allocator and expose them only to the userspace.
> > For device memory we have more things to think of like: > - memory not accessible by CPU > - non cache coherent memory (yet still useful in some case if > application explicitly ask for it) > - device driver want to keep full control over memory as older > application like graphic for GPU, do need contiguous physical > memory and other tight control over physical memory placement Again, I believe that HMM is to target those non-coherent or non-accessible memory and I do not think it is helpful to put them into the mix here. > So if we are talking about something to replace NUMA i would really > like for that to be inclusive of device memory (which can itself be > a hierarchy of different memory with different characteristics). I think we should build on the existing NUMA infrastructure we have. Developing something completely new is not going to happen anytime soon and I am not convinced the result would be that much better either.
On Thu, Jan 10, 2019 at 05:42:48PM +0100, Michal Hocko wrote: > On Thu 10-01-19 10:53:17, Jerome Glisse wrote: > > On Tue, Jan 08, 2019 at 03:52:56PM +0100, Michal Hocko wrote: > > > On Wed 02-01-19 12:21:10, Jonathan Cameron wrote: > > > [...] > > > > So ideally I'd love this set to head in a direction that helps me tick off > > > > at least some of the above usecases and hopefully have some visibility on > > > > how to address the others moving forwards, > > > > > > Is it sufficient to have such a memory marked as movable (aka only have > > > ZONE_MOVABLE)? That should rule out most of the kernel allocations and > > > it fits the "balance by migration" concept. > > > > This would not work for GPU, GPU driver really want to be in total > > control of their memory yet sometimes they want to migrate some part > > of the process to their memory. > > But that also means that GPU doesn't really fit the model discussed > here, right? I thought HMM is the way to manage such a memory. HMM provides the plumbing and tools to manage it, but right now the patchset for nouveau exposes the API through the nouveau device file as a nouveau ioctl. This is not a good long term solution when you want to mix and match memory from multiple GPUs (possibly from different vendors). Then you get each device driver implementing its own mem policy infrastructure, without any coordination between devices/drivers. While it is _mostly_ ok for the single GPU case, it is seriously crippling for the multi-GPU or multi-device cases (for instance when you chain network and GPU together, or GPU and storage). People have been asking for a single common API to manage both regular memory and device memory, since the common case anyway is that you move things around depending on which devices/CPUs are working on the dataset. Cheers, Jérôme
On Thu, Jan 10, 2019 at 05:50:01PM +0100, Michal Hocko wrote: > On Thu 10-01-19 11:25:56, Jerome Glisse wrote: > > On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote: > > > [Ccing Mel and Andrea] > > > > > > On Fri 28-12-18 21:31:11, Wu Fengguang wrote: > > > > > > > I haven't looked at the implementation yet but if you are proposing a > > > > > > > special cased zone lists then this is something CDM (Coherent Device > > > > > > > Memory) was trying to do two years ago and there was quite some > > > > > > > skepticism in the approach. > > > > > > > > > > > > It looks we are pretty different than CDM. :) > > > > > > We creating new NUMA nodes rather than CDM's new ZONE. > > > > > > The zonelists modification is just to make PMEM nodes more separated. > > > > > > > > > > Yes, this is exactly what CDM was after. Have a zone which is not > > > > > reachable without explicit request AFAIR. So no, I do not think you are > > > > > too different, you just use a different terminology ;) > > > > > > > > Got it. OK.. The fall back zonelists patch does need more thoughts. > > > > > > > > In long term POV, Linux should be prepared for multi-level memory. > > > > Then there will arise the need to "allocate from this level memory". > > > > So it looks good to have separated zonelists for each level of memory. > > > > > > Well, I do not have a good answer for you here. We do not have good > > > experiences with those systems, I am afraid. NUMA is with us for more > > > than a decade yet our APIs are coarse to say the least and broken at so > > > many times as well. Starting a new API just based on PMEM sounds like a > > > ticket to another disaster to me. > > > > > > I would like to see solid arguments why the current model of numa nodes > > > with fallback in distances order cannot be used for those new > > > technologies in the beginning and develop something better based on our > > > experiences that we gain on the way. > > > > I see several issues with distance. 
First it does fully abstract the > > underlying topology and this might be problematic, for instance if > > you memory with different characteristic in same node like persistent > > memory connected to some CPU then it might be faster for that CPU to > > access that persistent memory has it has dedicated link to it than to > > access some other remote memory for which the CPU might have to share > > the link with other CPUs or devices. > > > > Second distance is no longer easy to compute when you are not trying > > to answer what is the fastest memory for CPU-N but rather asking what > > is the fastest memory for CPU-N and device-M ie when you are trying to > > find the best memory for a group of CPUs/devices. The answer can > > changes drasticly depending on members of the groups. > > While you might be right, I would _really_ appreciate to start with a > simpler model and go to a more complex one based on realy HW and real > experiences than start with an overly complicated and over engineered > approach from scratch. > > > Some advance programmer already do graph matching ie they match the > > graph of their program dataset/computation with the topology graph > > of the computer they run on to determine what is best placement both > > for threads and memory. > > And those can still use our mempolicy API to describe their needs. If > existing API is not sufficient then let's talk about which pieces are > missing. I understand people don't want the full topology thing, but device memory cannot be exposed as a NUMA node, hence at the very least we need something that is not NUMA node only, and most likely an API that does not use a bitmask as the front facing userspace API. So some kind of UID for memory, one for each type of memory on each node (and also for each device memory). It can be a 1 to 1 match with the NUMA node id for all regular NUMA node memory, with extra ids for device memory (for instance by setting the high bit on the UID for device memory).
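One possible reading of the UID idea, sketched in Python for illustration only. Nothing here is an existing or proposed kernel interface: the 32-bit width, the constant names and the helper functions are all assumptions; the only part taken from the text above is that regular memory maps 1 to 1 onto the NUMA node id and device memory is marked by the high bit.

```python
# Hypothetical layout: a 32-bit memory UID whose top bit marks device memory.
UID_BITS = 32  # assumed width, not specified in the thread
DEVICE_FLAG = 1 << (UID_BITS - 1)


def node_uid(node_id):
    """Regular NUMA memory: the UID is simply the NUMA node id."""
    if not 0 <= node_id < DEVICE_FLAG:
        raise ValueError("node id out of range")
    return node_id


def device_uid(device_index):
    """Device memory: the same index space, with the high bit set."""
    if not 0 <= device_index < DEVICE_FLAG:
        raise ValueError("device index out of range")
    return DEVICE_FLAG | device_index


def is_device_memory(uid):
    """True if the UID refers to device memory."""
    return bool(uid & DEVICE_FLAG)


def uid_index(uid):
    """Strip the device flag, leaving the node id or device index."""
    return uid & (DEVICE_FLAG - 1)
```

An interface built on such identifiers would take a list of UIDs rather than a node bitmask, so its size would grow with the number of memories a program actually names rather than with the maximum number of nodes.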
> > > I would be especially interested about a possibility of the memory > > > migration idea during a memory pressure and relying on numa balancing to > > > resort the locality on demand rather than hiding certain NUMA nodes or > > > zones from the allocator and expose them only to the userspace. > > > > For device memory we have more things to think of like: > > - memory not accessible by CPU > > - non cache coherent memory (yet still useful in some case if > > application explicitly ask for it) > > - device driver want to keep full control over memory as older > > application like graphic for GPU, do need contiguous physical > > memory and other tight control over physical memory placement > > Again, I believe that HMM is to target those non-coherent or > non-accessible memory and I do not think it is helpful to put them into > the mix here. HMM is the kernel plumbing; it does not expose anything to userspace. While right now for nouveau the plan is to expose the API through a nouveau ioctl, this does not scale/work for multiple devices or when you mix and match different devices. A single API that can handle both device memory and regular memory would be much more useful. Long term at least that's what I would like to see. > > So if we are talking about something to replace NUMA i would really > > like for that to be inclusive of device memory (which can itself be > > a hierarchy of different memory with different characteristics). > > I think we should build on the existing NUMA infrastructure we have. > Developing something completely new is not going to happen anytime soon > and I am not convinced the result would be that much better either. The issue with NUMA is that I do not see a way to add device memory as a node, as the memory needs to be fully managed by the device driver. Also the number of nodes might get out of hand (think 32 devices per CPU, so with 1024 CPUs that's 2^15 max nodes ...), which leads to a node mask taking a full page.
Also the whole NUMA access tracking does not work with devices (it can be added but right now it is nonexistent). Forcing page faults to track access is highly disruptive for GPUs, while the hw can provide much better information without faults, and CPU counters might also be something we want to use rather than faulting. I am not saying something new will solve all the issues we have today with NUMA; actually I don't believe we can solve all of them. But it could at least be more flexible in terms of what memory a program can bind to. Cheers, Jérôme
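The node mask figure quoted above can be checked with a few lines of arithmetic. The sketch below is in Python for convenience; the 4 KiB page size and the one-bit-per-node mask layout are assumptions, and the variable names are illustrative only.

```python
DEVICES_PER_CPU = 32
CPUS = 1024
PAGE_SIZE = 4096  # bytes; assumes 4 KiB pages


def nodemask_bytes(max_nodes):
    """Size of a bitmask with one bit per possible node, rounded up."""
    return (max_nodes + 7) // 8


max_nodes = DEVICES_PER_CPU * CPUS      # 32 * 1024 = 2**15 device nodes
mask_bytes = nodemask_bytes(max_nodes)  # 2**15 bits / 8 = 4096 bytes
```

So every node mask passed between userspace and the kernel would be 4096 bytes, exactly one page under the assumed page size.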
On Tue, 8 Jan 2019 15:52:56 +0100 Michal Hocko <mhocko@kernel.org> wrote: > On Wed 02-01-19 12:21:10, Jonathan Cameron wrote: > [...] > > So ideally I'd love this set to head in a direction that helps me tick off > > at least some of the above usecases and hopefully have some visibility on > > how to address the others moving forwards, > > Is it sufficient to have such a memory marked as movable (aka only have > ZONE_MOVABLE)? That should rule out most of the kernel allocations and > it fits the "balance by migration" concept. Yes, to some degree. That's exactly what we are doing, though as things currently stand I think you have to turn it on via a kernel command line and mark it hotpluggable in ACPI. Given it may or may not actually be hotpluggable that's less than elegant. Let's randomly decide not to explore that one further for a few more weeks. la la la la If we have general balancing by migration then things are definitely heading in a useful direction as long as 'hot' takes into account the main user not being a CPU. You are right that migration dealing with the movable kernel allocations is a nice side effect though which I hadn't thought about. Long run we might end up with everything where it should be after some level of burn in period. A generic version of this proposal is looking nicer and nicer! Thanks, Jonathan
On Wed, 2 Jan 2019 12:21:10 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote: > On Fri, 28 Dec 2018 20:52:24 +0100 > Michal Hocko <mhocko@kernel.org> wrote: > > > [Ccing Mel and Andrea] > > Hi, I just wanted to highlight this section as I didn't feel we really addressed this in the earlier conversation. > * Hot pages may not be hot just because the host is using them a lot. It would be > very useful to have a means of adding information available from accelerators > beyond simple accessed bits (dreaming ;) One problem here is translation > caches (ATCs) as they won't normally result in any updates to the page accessed > bits. The arm SMMU v3 spec for example makes it clear (though it's kind of > obvious) that the ATS request is the only opportunity to update the accessed > bit. The nasty option here would be to periodically flush the ATC to force > the access bit updates via repeats of the ATS request (ouch). > That option only works if the iommu supports updating the accessed flag > (optional on SMMU v3 for example). > If we ignore the IOMMU hardware update issue, which will simply need to be addressed by future hardware if these techniques become common, how do we address the Address Translation Cache issue without potentially causing big performance problems by flushing the cache just to force an accessed bit update? These devices are frequently used with PRI and Shared Virtual Addressing and can be accessing most of your memory without you having any visibility of it in the page tables (as they aren't walked if your ATC is well matched in size to your usecase). Classic example would be accelerated DB walkers like the CCIX demo Xilinx has shown at a few conferences. The whole point of those is that most of the time only your large set of database walkers is using your memory and they have translations cached for a good part of what they are accessing. Flushing that cache could hurt a lot.
Pinning pages hurts for all the normal flexibility reasons. Last thing we want is to be migrating these pages that can be very hot but in an invisible fashion. Thanks, Jonathan
Hi Jonathan, Thanks for showing the gap on tracking hot accesses from devices. On Mon, Jan 28, 2019 at 05:42:39PM +0000, Jonathan Cameron wrote: >On Wed, 2 Jan 2019 12:21:10 +0000 >Jonathan Cameron <jonathan.cameron@huawei.com> wrote: > >> On Fri, 28 Dec 2018 20:52:24 +0100 >> Michal Hocko <mhocko@kernel.org> wrote: >> >> > [Ccing Mel and Andrea] >> > > >Hi, > >I just wanted to highlight this section as I didn't feel we really addressed this >in the earlier conversation. > >> * Hot pages may not be hot just because the host is using them a lot. It would be >> very useful to have a means of adding information available from accelerators >> beyond simple accessed bits (dreaming ;) One problem here is translation >> caches (ATCs) as they won't normally result in any updates to the page accessed >> bits. The arm SMMU v3 spec for example makes it clear (though it's kind of >> obvious) that the ATS request is the only opportunity to update the accessed >> bit. The nasty option here would be to periodically flush the ATC to force >> the access bit updates via repeats of the ATS request (ouch). >> That option only works if the iommu supports updating the accessed flag >> (optional on SMMU v3 for example). If ATS based updates are supported, we may trigger it when closing the /proc/pid/idle_pages file. We already do TLB flushes at that time. 
For example,

[PATCH 15/21] ept-idle: EPT walk for virtual machine

        ept_idle_release():
                kvm_flush_remote_tlbs(kvm);

[PATCH 17/21] proc: introduce /proc/PID/idle_pages

        mm_idle_release():
                flush_tlb_mm(mm);

The flush cost is kind of "minimal necessary" in our current use
model, where the user space scan+migration daemon runs a loop like this:

loop:
        walk page table N times:
                open,read,close /proc/PID/idle_pages
                (flushes TLB on file close)
                sleep for a short interval
        sort and migrate hot pages
        sleep for a while

>If we ignore the IOMMU hardware update issue which will simply need to be addressed
>by future hardware if these techniques become common, how do we address the
>Address Translation Cache issue without potentially causing big performance
>problems by flushing the cache just to force an accessed bit update?
>
>These devices are frequently used with PRI and Shared Virtual Addressing
>and can be accessing most of your memory without you having any visibility
>of it in the page tables (as they aren't walked if your ATC is well matched
>in size to your usecase).
>
>Classic example would be accelerated DB walkers like the CCIX demo
>Xilinx has shown at a few conferences. The whole point of those is that
>most of the time only your large set of database walkers is using your
>memory and they have translations cached for a good part of what
>they are accessing. Flushing that cache could hurt a lot.
>Pinning pages hurts for all the normal flexibility reasons.
>
>Last thing we want is to be migrating these pages that can be very hot but
>in an invisible fashion.

If there is some other way to get hotness for special device memory,
the user space daemon may be extended to cover that, perhaps by
querying another new kernel interface.

By driving hotness accounting and migration in user space, we harvest
this kind of flexibility.
From the daemon's POV, /proc/PID/idle_pages provides one
common way to get "accessed" bits, hence hotness, though the daemon does
not need to depend solely on it.

Thanks,
Fengguang
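
The scan loop described in the reply above, together with the idea of extending the daemon with further hotness sources, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the daemon from the patch series: the record format of /proc/PID/idle_pages is not given in this thread, so each hotness source is modelled as a hypothetical callable that returns the page IDs seen as accessed since the previous scan. The names `scan_hotness`, `pick_hot_pages` and `HotnessSource` are made up for the example, and the actual migration step is left out.

```python
import time
from collections import Counter
from typing import Callable, Iterable, List

# Hypothetical abstraction: a hotness source returns the page IDs observed as
# accessed since the previous scan. A reader of /proc/PID/idle_pages could be
# one source; a device-specific interface could be another.
HotnessSource = Callable[[], Iterable[int]]


def scan_hotness(sources: List[HotnessSource], rounds: int,
                 interval: float = 0.0) -> Counter:
    """Scan every source `rounds` times and count accesses per page."""
    counts: Counter = Counter()
    for _ in range(rounds):
        for source in sources:
            # One vote per page, per source, per round.
            counts.update(set(source()))
        if interval:
            time.sleep(interval)  # "sleep for a short interval"
    return counts


def pick_hot_pages(counts: Counter, budget: int) -> List[int]:
    """Return up to `budget` page IDs, hottest first (ties broken by page ID)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page for page, _ in ranked[:budget]]


if __name__ == "__main__":
    # Simulated traces standing in for real scans (assumption for the demo).
    cpu_trace = iter([[1, 2, 3], [1, 2], [1]])
    device_trace = iter([[7], [7], [7]])
    counts = scan_hotness([lambda: next(cpu_trace), lambda: next(device_trace)],
                          rounds=3)
    print(pick_hot_pages(counts, budget=2))  # prints [1, 7]
```

In this toy run, page 7 is only ever reported by the simulated device source, yet it ranks as hot as the busiest CPU-visible page. That is the case raised earlier in the thread: without a second source, such a page would look idle and could be picked for demotion.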