Message ID | 20181226133351.703380444@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | PMEM NUMA node and hotness accounting/migration |
Fengguang Wu <fengguang.wu@intel.com> writes:

> From: Yao Yuan <yuan.yao@intel.com>
>
> Signed-off-by: Yao Yuan <yuan.yao@intel.com>
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  arch/x86/kvm/mmu.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> --- linux.orig/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.846720344 +0800
> +++ linux/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.842719614 +0800
> @@ -950,6 +950,16 @@ static void mmu_free_memory_cache(struct
>  	kmem_cache_free(cache, mc->objects[--mc->nobjs]);
>  }
>
> +static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
> +{
> +	struct page *page;
> +
> +	page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
> +	if (!page)
> +		return 0;
> +	return (unsigned long) page_address(page);
> +}
> +

Maybe it is explained in other patches. What is preventing the
allocation from PMEM here? Is it that we are not using the memory
policy preferred node id and hence the zone list we built won't have
the PMEM node?

>  static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
>  				  int min)
>  {
> @@ -958,7 +968,7 @@ static int mmu_topup_memory_cache_page(s
>  	if (cache->nobjs >= min)
>  		return 0;
>  	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
> -		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
> +		page = (void *)__get_dram_free_pages(GFP_KERNEL_ACCOUNT);
>  		if (!page)
>  			return cache->nobjs >= min ? 0 : -ENOMEM;
>  		cache->objects[cache->nobjs++] = page;

-aneesh
On Tue, Jan 01, 2019 at 02:53:07PM +0530, Aneesh Kumar K.V wrote:
> Fengguang Wu <fengguang.wu@intel.com> writes:
>
> > From: Yao Yuan <yuan.yao@intel.com>
> >
> > Signed-off-by: Yao Yuan <yuan.yao@intel.com>
> > Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> > ---
> >  arch/x86/kvm/mmu.c | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > --- linux.orig/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.846720344 +0800
> > +++ linux/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.842719614 +0800
> > @@ -950,6 +950,16 @@ static void mmu_free_memory_cache(struct
> >  	kmem_cache_free(cache, mc->objects[--mc->nobjs]);
> >  }
> >
> > +static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
> > +{
> > +	struct page *page;
> > +
> > +	page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
> > +	if (!page)
> > +		return 0;
> > +	return (unsigned long) page_address(page);
> > +}
> > +
>
> Maybe it is explained in other patches. What is preventing the
> allocation from PMEM here? Is it that we are not using the memory
> policy preferred node id and hence the zone list we built won't have
> the PMEM node?

That's because the PMEM nodes are memory-only nodes in this patchset,
so numa_node_id() will always return the node id of a DRAM node.

About the zone list: yes, in patch 10/21 we build the PMEM nodes into
a separate zonelist, so DRAM nodes will not fall back to PMEM nodes.

> >  static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
> >  				  int min)
> >  {
> > @@ -958,7 +968,7 @@ static int mmu_topup_memory_cache_page(s
> >  	if (cache->nobjs >= min)
> >  		return 0;
> >  	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
> > -		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
> > +		page = (void *)__get_dram_free_pages(GFP_KERNEL_ACCOUNT);
> >  		if (!page)
> >  			return cache->nobjs >= min ? 0 : -ENOMEM;
> >  		cache->objects[cache->nobjs++] = page;
>
> -aneesh
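To make the reasoning in this reply concrete, here is a minimal sketch
(not part of the posted series; alloc_dram_local_page is an invented
name for illustration) of an order-0 allocation pinned to the CPU-local
node. Because memory-only PMEM nodes carry no CPUs, numa_node_id() can
only name a DRAM node, and with PMEM zones kept out of that node's
zonelist the allocation cannot fall back to PMEM:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/topology.h>

/*
 * Illustrative sketch only: allocate one page from the zonelist of the
 * node the current CPU sits on.  Memory-only PMEM nodes have no CPUs,
 * so numa_node_id() always points at a DRAM node here.
 */
static struct page *alloc_dram_local_page(gfp_t gfp_mask)
{
	int nid = numa_node_id();	/* CPU-local node => a DRAM node */

	/* order-0 allocation constrained to nid's (DRAM-only) zonelist */
	return __alloc_pages_node(nid, gfp_mask, 0);
}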
On 12/26/18 5:14 AM, Fengguang Wu wrote:
> +static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
> +{
> +	struct page *page;
> +
> +	page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
> +	if (!page)
> +		return 0;
> +	return (unsigned long) page_address(page);
> +}

There seems to be a ton of *policy* baked into these patches.  For
instance: thou shalt not allocate page table pages from PMEM.  That's
surely not a policy we want to inflict on every Linux user until the
end of time.

I think the more important question is how we can have the specific
policy that this patch implements, but also leave open room for other
policies, such as: "I don't care how slow this VM runs, minimize the
amount of fast memory it eats."
On Wed, Jan 02, 2019 at 08:47:25AM -0800, Dave Hansen wrote:
> On 12/26/18 5:14 AM, Fengguang Wu wrote:
> > +static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
> > +{
> > +	struct page *page;
> > +
> > +	page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
> > +	if (!page)
> > +		return 0;
> > +	return (unsigned long) page_address(page);
> > +}
>
> There seems to be a ton of *policy* baked into these patches.  For
> instance: thou shalt not allocate page table pages from PMEM.  That's
> surely not a policy we want to inflict on every Linux user until the
> end of time.

Right. It's a straightforward policy for users that care about
performance. The project is planned in 3 steps, and at this moment we
are in phase (1):

1) core functionalities, easy to backport
2) upstream-able total solution
3) upstream when API stabilized

The dumb kernel interface /proc/PID/idle_pages enables doing the
majority of policies in user space. However, for the other smaller
parts, it looks easier to just implement an obvious policy first, and
then consider more possibilities.

> I think the more important question is how we can have the specific
> policy that this patch implements, but also leave open room for other
> policies, such as: "I don't care how slow this VM runs, minimize the
> amount of fast memory it eats."

Agreed. I'm open to more ways. We can treat these patches as the
soliciting version. If anyone sends reasonable improvements or even a
totally different way of doing it, I'd be happy to incorporate them.

Thanks,
Fengguang
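As a rough illustration of the kind of knob this exchange is asking
for, the sketch below makes the DRAM-only choice a runtime policy
rather than a hard-coded one. The kvm_pgtable_dram_only parameter and
kvm_mmu_alloc_cache_page helper are invented for this example and do
not exist in the posted series:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/topology.h>

/* Hypothetical knob: true = keep page-table pages in DRAM, false = don't care. */
static bool kvm_pgtable_dram_only = true;
module_param(kvm_pgtable_dram_only, bool, 0644);

static unsigned long kvm_mmu_alloc_cache_page(gfp_t gfp_mask)
{
	struct page *page;

	if (kvm_pgtable_dram_only)
		/* restrict to the CPU-local (DRAM) node's zonelist */
		page = __alloc_pages_node(numa_node_id(), gfp_mask, 0);
	else
		/* default policy: any allowed node, PMEM fallback included */
		page = alloc_page(gfp_mask);

	if (!page)
		return 0;
	return (unsigned long)page_address(page);
}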
--- linux.orig/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.846720344 +0800
+++ linux/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.842719614 +0800
@@ -950,6 +950,16 @@ static void mmu_free_memory_cache(struct
 	kmem_cache_free(cache, mc->objects[--mc->nobjs]);
 }
 
+static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
+{
+	struct page *page;
+
+	page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
+	if (!page)
+		return 0;
+	return (unsigned long) page_address(page);
+}
+
 static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
 				  int min)
 {
@@ -958,7 +968,7 @@ static int mmu_topup_memory_cache_page(s
 	if (cache->nobjs >= min)
 		return 0;
 	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
+		page = (void *)__get_dram_free_pages(GFP_KERNEL_ACCOUNT);
 		if (!page)
 			return cache->nobjs >= min ? 0 : -ENOMEM;
 		cache->objects[cache->nobjs++] = page;