
[RFC,1/1] Move two pinned pages to non-movable node in kvm.

Message ID 1403070600-6083-1-git-send-email-tangchen@cn.fujitsu.com (mailing list archive)
State New, archived

Commit Message

tangchen June 18, 2014, 5:50 a.m. UTC
Hi,

I met a problem when offlining memory with a kvm guest running.


[Problem]
When qemu creates vcpus, it calls the following two functions
to allocate two pages:
1. alloc_apic_access_page(): allocates the APIC access page for FlexPriority on Intel CPUs.
2. alloc_identity_pagetable(): allocates the EPT identity pagetable used for real mode.

Unfortunately, these two pages are pinned in memory and cannot be migrated.
As a result, the memory containing them cannot be offlined, and memory hot-remove fails.



[The way I tried]
I tried to migrate these two pages, but I could not find a proper way
to do it.

Take the EPT identity pagetable as an example:
In my opinion, since it is a page table, the CPU will access this page every
time the guest reads or writes memory. For example, the following guest code
accesses memory:
	int a;
	a = 0;
So the EPT identity pagetable page can be accessed by the CPU at any time, automatically.



[Solution]
I have a basic idea to solve this problem: allocate these two pages on non-movable nodes.
(For now, we can only hot-remove memory on movable nodes.)

alloc_identity_pagetable()
|-> __kvm_set_memory_region()
|   |-> kvm_arch_prepare_memory_region()
|       |-> userspace_addr = vm_mmap();
|       |-> memslot->userspace_addr = userspace_addr;  /* map userspace address (qemu) */
|
|   /*
|    * Here, set a memory policy for the mapped but not yet allocated pages,
|    * so that they can only be allocated on non-movable nodes.
|    * (We can reuse the "numa_kernel_nodes" node mask from the movable_node functionality.)
|    */
|
|-> page = gfn_to_page()  /* allocate and pin the page */

Please refer to the attached patch for details.
I did some basic testing with the patch, and memory offline succeeds.



[Questions]
And by the way, would you guys please answer the following questions for me?

1. What is the ept identity pagetable for? Is only one page enough?

2. Is the ept identity pagetable only used in real mode?
   Can we free it once the guest is up (vcpu in protected mode)?

3. Currently, the ept identity pagetable is allocated in qemu userspace.
   Can we allocate it in kernel space?

4. If I want to migrate these two pages, what do you think is the best way?

Thanks.


Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/include/asm/numa.h | 1 +
 arch/x86/kvm/vmx.c          | 5 +++++
 arch/x86/kvm/x86.c          | 1 +
 arch/x86/mm/numa.c          | 3 ++-
 include/linux/mempolicy.h   | 6 ++++++
 mm/mempolicy.c              | 9 +++++++++
 6 files changed, 24 insertions(+), 1 deletion(-)

Comments

Gleb Natapov June 18, 2014, 6:12 a.m. UTC | #1
On Wed, Jun 18, 2014 at 01:50:00PM +0800, Tang Chen wrote:
> [Questions]
> And by the way, would you guys please answer the following questions for me ?
> 
> 1. What's the ept identity pagetable for ?  Only one page is enough ?
> 
> 2. Is the ept identity pagetable only used in realmode ?
>    Can we free it once the guest is up (vcpu in protect mode)?
> 
> 3. Now, ept identity pagetable is allocated in qemu userspace.
>    Can we allocate it in kernel space ?
What would be the benefit?

> 
> 4. If I want to migrate these two pages, what do you think is the best way ?
> 
I answered most of those here: http://www.mail-archive.com/kvm@vger.kernel.org/msg103718.html

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
tangchen June 18, 2014, 6:50 a.m. UTC | #2
Hi Gleb,

Thanks for the quick reply. Please see below.

On 06/18/2014 02:12 PM, Gleb Natapov wrote:
> On Wed, Jun 18, 2014 at 01:50:00PM +0800, Tang Chen wrote:
>> [Questions]
>> And by the way, would you guys please answer the following questions for me ?
>>
>> 1. What's the ept identity pagetable for ?  Only one page is enough ?
>>
>> 2. Is the ept identity pagetable only used in realmode ?
>>     Can we free it once the guest is up (vcpu in protect mode)?
>>
>> 3. Now, ept identity pagetable is allocated in qemu userspace.
>>     Can we allocate it in kernel space ?
> What would be the benefit?

I think the benefit is that we can hot-remove the host memory a kvm guest
is using.

For now, only memory in ZONE_MOVABLE can be migrated/hot-removed, and the
kernel will never use ZONE_MOVABLE memory for its own allocations. So if we
could allocate these two pages in kernel space, we could pin them without any
trouble: when doing memory hot-remove, the kernel would not try to migrate them.

>
>>
>> 4. If I want to migrate these two pages, what do you think is the best way ?
>>
> I answered most of those here: http://www.mail-archive.com/kvm@vger.kernel.org/msg103718.html

I'm sorry, I must have missed that email.

Following your advice, we could unpin these two pages and repin them on the
next EPT violation.
So for this problem, which solution would you prefer: allocating these two
pages in kernel space, or migrating them before memory hot-remove?

I think the first solution is simpler. But I'm not quite sure whether any
other pages are pinned in memory. If other kvm pages have the same problem,
I think it is better to solve it the second way.

What do you think?

Thanks.
Gleb Natapov June 19, 2014, 9:20 a.m. UTC | #3
CCing Marcelo,

On Wed, Jun 18, 2014 at 02:50:44PM +0800, Tang Chen wrote:
> Hi Gleb,
> 
> Thanks for the quick reply. Please see below.
> 
> On 06/18/2014 02:12 PM, Gleb Natapov wrote:
> >On Wed, Jun 18, 2014 at 01:50:00PM +0800, Tang Chen wrote:
> >>[Questions]
> >>And by the way, would you guys please answer the following questions for me ?
> >>
> >>1. What's the ept identity pagetable for ?  Only one page is enough ?
> >>
> >>2. Is the ept identity pagetable only used in realmode ?
> >>    Can we free it once the guest is up (vcpu in protect mode)?
> >>
> >>3. Now, ept identity pagetable is allocated in qemu userspace.
> >>    Can we allocate it in kernel space ?
> >What would be the benefit?
> 
> I think the benefit is we can hot-remove the host memory a kvm guest
> is using.
> 
> For now, only memory in ZONE_MOVABLE can be migrated/hot-removed. And the
> kernel
> will never use ZONE_MOVABLE memory. So if we can allocate these two pages in
> kernel space, we can pin them without any trouble. When doing memory
> hot-remove,
> the kernel will not try to migrate these two pages.
But we can do that by other means, no? The patch you've sent for instance.

> 
> >
> >>
> >>4. If I want to migrate these two pages, what do you think is the best way ?
> >>
> >I answered most of those here: http://www.mail-archive.com/kvm@vger.kernel.org/msg103718.html
> 
> I'm sorry I must missed this email.
> 
> Seeing your advice, we can unpin these two pages and repin them in the next
> EPT violation.
> So about this problem, which solution would you prefer, allocate these two
> pages in kernel
> space, or migrate them before memory hot-remove ?
> 
> I think the first solution is simpler. But I'm not quite sure if there is
> any other pages
> pinned in memory. If we have the same problem with other kvm pages, I think
> it is better to
> solve it in the second way.
> 
> What do you think ?
Removing the pinning is preferable. In fact, for the identity pagetable it
looks trivial: just don't pin. The APIC access page is a little more
complicated, since its physical address needs to be tracked so it can be
updated in the VMCS.

--
			Gleb.
Marcelo Tosatti June 19, 2014, 7 p.m. UTC | #4
On Thu, Jun 19, 2014 at 12:20:32PM +0300, Gleb Natapov wrote:
> CCing Marcelo,
> 
> ......
> Remove pinning is preferable. In fact looks like for identity pagetable
> it should be trivial, just don't pin. APIC access page is a little bit
> more complicated since its physical address needs to be tracked to be
> updated in VMCS.

Yes, and there are new users of page pinning as well soon (see PEBS
threads on kvm-devel).

I was thinking of a notifier scheme. Perhaps:

->begin_page_unpin(struct page *page)
	- Remove any possible access to page.

->end_page_unpin(struct page *page)
	- Reinstantiate any possible access to page.

For KVM:

->begin_page_unpin()
	- Remove APIC-access page address from VMCS.
	  or
	- Remove spte translation to pinned page.
	
	- Put vcpu in state where no VM-entries are allowed.

->end_page_unpin()
	- Setup APIC-access page, ...
	- Allow vcpu to VM-entry.


Because allocating the APIC access page from a distant NUMA node can
be a performance problem, I believe.

I'd be happy to know why notifiers would be overkill.


tangchen June 20, 2014, 3:20 a.m. UTC | #5
Hi Marcelo,

Thanks for your reply. Please see below.

On 06/20/2014 03:00 AM, Marcelo Tosatti wrote:
......
>> Remove pinning is preferable. In fact looks like for identity pagetable
>> it should be trivial, just don't pin. APIC access page is a little bit
>> more complicated since its physical address needs to be tracked to be
>> updated in VMCS.
>
> Yes, and there are new users of page pinning as well soon (see PEBS
> threads on kvm-devel).
>
> Was thinking of notifiers scheme. Perhaps:
>
> ->begin_page_unpin(struct page *page)
> 	- Remove any possible access to page.
>
> ->end_page_unpin(struct page *page)
> 	- Reinstantiate any possible access to page.
>
> For KVM:
>
> ->begin_page_unpin()
> 	- Remove APIC-access page address from VMCS.
> 	  or
> 	- Remove spte translation to pinned page.
> 	
> 	- Put vcpu in state where no VM-entries are allowed.
>
> ->end_page_unpin()
> 	- Setup APIC-access page, ...
> 	- Allow vcpu to VM-entry.
>
>
> Because allocating APIC access page from distant NUMA node can
> be a performance problem, i believe.

Yes, I understand this.

>
> I'd be happy to know why notifiers are overkill.

The notifiers are not overkill. I have been thinking about a similar idea.

In fact, we have met the same pinned-pages problem in the AIO subsystem.
The aio ring pages are pinned in memory, and cannot be migrated.

And I believe there are some other places in the kernel where pages are
pinned.


So I was thinking about a notifier framework to solve this problem.
But I can see some problems:

1. When getting a page, the migration thread doesn't know who is using the
   page and how. So we need a callback for each page, to be called before
   and after it is migrated.
   (A little over-thinking, maybe. Please see below.)

2. When migrating a shared page, one callback is not enough, because the
   page could be shared by different subsystems, and they may have different
   ways to pin and unpin the page.

3. Where should we put the callback? Only file-backed pages have one and only
   one address_space->address_space_operations->migratepage(). For anonymous
   pages, there is nowhere to put the callback.

   (A basic idea: define a global radix tree or hash table to manage the
    pinned pages and their callbacks. Mel Gorman mentioned this idea when
    handling the aio ring page problem. I'm not sure if this is acceptable.)


The idea above may be a little over-thinking. Actually, we can reuse the
memory hotplug notify chain if pinned-page migration is only needed by the
memory hotplug subsystem.

The basic idea is: each subsystem registers a callback on the memory hotplug
notify chain, and unpins and repins its pages before and after page migration.

But I think we will finally meet this problem: how to remember/manage the
pinned pages in each subsystem.

For example, for kvm, the ept identity pagetable page and the apic access
page are pinned. Since these two pages' struct page pointers and user
addresses are remembered in kvm, they are easy to handle. If we pin a page
and remember it only in a stack variable, it could be difficult to handle.


For now, for kvm, I think notifiers can solve this problem.

Thanks for the advice. If you guys have any other ideas about this problem,
please share them with me.

Thanks.
Gleb Natapov June 20, 2014, 11:15 a.m. UTC | #6
On Thu, Jun 19, 2014 at 04:00:24PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 12:20:32PM +0300, Gleb Natapov wrote:
> > ......
> > Remove pinning is preferable. In fact looks like for identity pagetable
> > it should be trivial, just don't pin. APIC access page is a little bit
> > more complicated since its physical address needs to be tracked to be
> > updated in VMCS.
> 
> Yes, and there are new users of page pinning as well soon (see PEBS
> threads on kvm-devel).
> 
> Was thinking of notifiers scheme. Perhaps:
> 
> ->begin_page_unpin(struct page *page)
> 	- Remove any possible access to page.
> 
> ->end_page_unpin(struct page *page)
> 	- Reinstantiate any possible access to page.
> 
> For KVM:
> 
> ->begin_page_unpin()
> 	- Remove APIC-access page address from VMCS.
> 	  or
> 	- Remove spte translation to pinned page.
> 	
> 	- Put vcpu in state where no VM-entries are allowed.
> 
> ->end_page_unpin()
> 	- Setup APIC-access page, ...
> 	- Allow vcpu to VM-entry.
> 
I believe that to handle the identity page and the APIC access page we do not
need any of those. We can use mmu notifiers to track when a page begins to be
moved, and we can find the new page location on the next EPT violation.

> 
> Because allocating APIC access page from distant NUMA node can
> be a performance problem, i believe.
I do not think this is the case. The APIC access page is never written to,
and in fact the SDM advises sharing it between all vcpus.

--
			Gleb.
Marcelo Tosatti June 20, 2014, 12:53 p.m. UTC | #7
On Fri, Jun 20, 2014 at 02:15:10PM +0300, Gleb Natapov wrote:
> On Thu, Jun 19, 2014 at 04:00:24PM -0300, Marcelo Tosatti wrote:
> > On Thu, Jun 19, 2014 at 12:20:32PM +0300, Gleb Natapov wrote:
> > ......
> I believe that to handle identity page and APIC access page we do not
> need any of those. 
> We can use mmu notifiers to track when page begins
> to be moved and we can find new page location on EPT violation.

Does page migration hook via mmu notifiers? I don't think so. 

It won't even attempt page migration because the page count is
increased (would have to confirm though). Tang?

The problem with identity page is this: its location is written into the
guest CR3. So you cannot allow it (the page which the guest CR3 points
to) to be reused before you remove the reference.

Where is the guarantee there will be an EPT violation? Without one, what
prevents a vcpu from executing with guest CR3 pointing to a page with random
data?

Same with the APIC access page.

> > Because allocating APIC access page from distant NUMA node can
> > be a performance problem, i believe.
> I do not think this is the case. APIC access page is never written to,
> and in fact SDM advice to share it between all vcpus.

Right. 

But the point is not so much relevant as this should be handled for
PEBS pages which would be interesting to force to non-movable zones.

Gleb Natapov June 20, 2014, 2:26 p.m. UTC | #8
On Fri, Jun 20, 2014 at 09:53:26AM -0300, Marcelo Tosatti wrote:
> On Fri, Jun 20, 2014 at 02:15:10PM +0300, Gleb Natapov wrote:
> > ......
> > I believe that to handle identity page and APIC access page we do not
> > need any of those. 
> > We can use mmu notifiers to track when page begins
> > to be moved and we can find new page location on EPT violation.
> 
> Does page migration hook via mmu notifiers? I don't think so. 
> 
Both the identity page and the APIC access page are userspace pages which
will have to be unmapped from the process address space during migration.
At that point the mmu notifiers will be called.

> It won't even attempt page migration because the page count is
> increased (would have to confirm though). Tang?
> 
Of course, we should not pin.
 
> The problem with identity page is this: its location is written into the
> guest CR3. So you cannot allow it (the page which the guest CR3 points
> to) to be reused before you remove the reference.
> 
> Where is the guarantee there will be an EPT violation, allowing a vcpu
> to execute with guest CR3 pointing to page with random data?
> 
A guest physical address is written into CR3 (0xfffbc000 usually), not the
physical address of the identity page directly. When the guest tries to use
that CR3, KVM gets an EPT violation, and the shadow page code finds the page
that backs guest address 0xfffbc000 and maps it into the EPT table. This is
what happens on the first vmentry after vcpu creation.

> Same with the APIC access page.
The APIC page is always mapped at the guest's APIC base address, 0xfee00000.
The way it works is that when a vCPU accesses the page at 0xfee00000, the
access is translated to the APIC access page's physical address. The CPU sees
that the access is to the APIC page and generates an APIC access exit instead
of a memory access. If address 0xfee00000 is not mapped by EPT, an EPT
violation exit is generated instead; the EPT mapping is instantiated, the
access is retried by the guest, and this time it generates an APIC access exit.

> 
> > > Because allocating APIC access page from distant NUMA node can
> > > be a performance problem, i believe.
> > I do not think this is the case. APIC access page is never written to,
> > and in fact SDM advice to share it between all vcpus.
> 
> Right. 
> 
But that point is not so relevant, as this should be handled for the PEBS
pages, which it would be interesting to force into non-movable zones.
>
IIRC your shadow page pinning patch series supports flushing of ptes from the
mmu notifier by forcing an MMU reload and, as a result, faulting the pinned
pages back in on the next entry. Your patch series does not pin pages by
elevating their page count.

--
			Gleb.
Marcelo Tosatti June 20, 2014, 8:31 p.m. UTC | #9
On Fri, Jun 20, 2014 at 05:26:22PM +0300, Gleb Natapov wrote:
> On Fri, Jun 20, 2014 at 09:53:26AM -0300, Marcelo Tosatti wrote:
> > On Fri, Jun 20, 2014 at 02:15:10PM +0300, Gleb Natapov wrote:
> > > On Thu, Jun 19, 2014 at 04:00:24PM -0300, Marcelo Tosatti wrote:
> > > > On Thu, Jun 19, 2014 at 12:20:32PM +0300, Gleb Natapov wrote:
> > > > > CCing Marcelo,
> > > > > 
> > > > > On Wed, Jun 18, 2014 at 02:50:44PM +0800, Tang Chen wrote:
> > > > > > Hi Gleb,
> > > > > > 
> > > > > > Thanks for the quick reply. Please see below.
> > > > > > 
> > > > > > On 06/18/2014 02:12 PM, Gleb Natapov wrote:
> > > > > > >On Wed, Jun 18, 2014 at 01:50:00PM +0800, Tang Chen wrote:
> > > > > > >>[Questions]
> > > > > > >>And by the way, would you guys please answer the following questions for me ?
> > > > > > >>
> > > > > > >>1. What's the ept identity pagetable for ?  Only one page is enough ?
> > > > > > >>
> > > > > > >>2. Is the ept identity pagetable only used in realmode ?
> > > > > > >>    Can we free it once the guest is up (vcpu in protect mode)?
> > > > > > >>
> > > > > > >>3. Now, ept identity pagetable is allocated in qemu userspace.
> > > > > > >>    Can we allocate it in kernel space ?
> > > > > > >What would be the benefit?
> > > > > > 
> > > > > > I think the benefit is we can hot-remove the host memory a kvm guest
> > > > > > is using.
> > > > > > 
> > > > > > For now, only memory in ZONE_MOVABLE can be migrated/hot-removed. And the
> > > > > > kernel
> > > > > > will never use ZONE_MOVABLE memory. So if we can allocate these two pages in
> > > > > > kernel space, we can pin them without any trouble. When doing memory
> > > > > > hot-remove,
> > > > > > the kernel will not try to migrate these two pages.
> > > > > But we can do that by other means, no? The patch you've sent for instance.
> > > > > 
> > > > > > 
> > > > > > >
> > > > > > >>
> > > > > > >>4. If I want to migrate these two pages, what do you think is the best way ?
> > > > > > >>
> > > > > > >I answered most of those here: http://www.mail-archive.com/kvm@vger.kernel.org/msg103718.html
> > > > > > 
> > > > > > I'm sorry I must missed this email.
> > > > > > 
> > > > > > Seeing your advice, we can unpin these two pages and repin them in
> > > > > > the next EPT violation. So about this problem, which solution would
> > > > > > you prefer: allocating these two pages in kernel space, or migrating
> > > > > > them before memory hot-remove?
> > > > > > 
> > > > > > I think the first solution is simpler. But I'm not quite sure whether
> > > > > > there are any other pages pinned in memory. If we have the same
> > > > > > problem with other kvm pages, I think it is better to solve it in
> > > > > > the second way.
> > > > > > 
> > > > > > What do you think ?
> > > > > Removing the pinning is preferable. In fact it looks like for the
> > > > > identity pagetable it should be trivial: just don't pin. The APIC
> > > > > access page is a little bit more complicated since its physical
> > > > > address needs to be tracked to be updated in the VMCS.
> > > > 
> > > > Yes, and there are new users of page pinning as well soon (see PEBS
> > > > threads on kvm-devel).
> > > > 
> > > > I was thinking of a notifier scheme. Perhaps:
> > > > 
> > > > ->begin_page_unpin(struct page *page)
> > > > 	- Remove any possible access to page.
> > > > 
> > > > ->end_page_unpin(struct page *page)
> > > > 	- Reinstantiate any possible access to page.
> > > > 
> > > > For KVM:
> > > > 
> > > > ->begin_page_unpin()
> > > > 	- Remove APIC-access page address from VMCS.
> > > > 	  or
> > > > 	- Remove spte translation to pinned page.
> > > > 	
> > > > 	- Put vcpu in state where no VM-entries are allowed.
> > > > 
> > > > ->end_page_unpin()
> > > > 	- Setup APIC-access page, ...
> > > > 	- Allow vcpu to VM-entry.
> > > > 
> > > I believe that to handle the identity page and the APIC access page we
> > > do not need any of those. We can use mmu notifiers to track when a page
> > > begins to be moved, and we can find the new page location on EPT
> > > violation.
> > 
> > Does page migration hook via mmu notifiers? I don't think so. 
> > 
> Both the identity page and the APIC access page are userspace pages which
> will have to be unmapped from the process address space during migration.
> At that point the mmu notifiers will be called.

Right.

> > It won't even attempt page migration because the page count is
> > increased (would have to confirm though). Tang?
> > 
> Of course, we should not pin.
>  
> > The problem with identity page is this: its location is written into the
> > guest CR3. So you cannot allow it (the page which the guest CR3 points
> > to) to be reused before you remove the reference.
> > 
> > Where is the guarantee there will be an EPT violation, allowing a vcpu
> > to execute with guest CR3 pointing to page with random data?
> > 
> A guest physical address (usually 0xfffbc000) is written into CR3, not the
> physical address of the identity page directly. When the guest tries to
> use CR3, KVM will get an EPT violation, and the shadow page code will find
> the page that backs the guest address 0xfffbc000 and map it into the EPT
> table. This is what happens on the first vmentry after vcpu creation.

Right.

> > Same with the APIC access page.
> The APIC page is always mapped at the guest's APIC base address 0xfee00000.
> The way it works is that when a vCPU accesses the page at 0xfee00000, the
> access is translated to the APIC access page's physical address. The CPU
> sees that the access is for the APIC page and generates an APIC access
> exit instead of a memory access. If address 0xfee00000 is not mapped by
> EPT, then an EPT violation exit will be generated instead, the EPT mapping
> will be instantiated, the access retried by the guest, and this time it
> will generate an APIC access exit.

Right, I confused it with the other APIC page, the one the CPU writes to
(the vAPIC page).

> > > > Because allocating the APIC access page from a distant NUMA node can
> > > > be a performance problem, I believe.
> > > I do not think this is the case. The APIC access page is never written
> > > to, and in fact the SDM advises sharing it between all vcpus.
> > 
> > Right. 
> > 
> > But the point is not so relevant here, as this should be handled for the
> > PEBS pages, which it would be interesting to force into non-movable zones.
> >
> IIRC your shadow page pinning patch series supports flushing of ptes via
> mmu notifiers by forcing an MMU reload and, as a result, faulting the
> pinned pages back in during the next entry.  Your patch series does not
> pin pages by elevating their page count.

No, but the PEBS series does, and it's required to stop swap-out
of the page.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti June 20, 2014, 8:39 p.m. UTC | #10
On Fri, Jun 20, 2014 at 05:31:46PM -0300, Marcelo Tosatti wrote:
> > IIRC your shadow page pinning patch series support flushing of ptes
> > by mmu notifier by forcing MMU reload and, as a result, faulting in of
> > pinned pages during next entry.  Your patch series does not pin pages
> > by elevating their page count.
> 
> No but PEBS series does and its required to stop swap-out
> of the page.

Well actually no because of mmu notifiers.

Tang, can you implement mmu notifiers for the other breaker of 
mem hotplug ?
Gleb Natapov June 22, 2014, 9:19 a.m. UTC | #11
On Fri, Jun 20, 2014 at 05:31:46PM -0300, Marcelo Tosatti wrote:
> > > Same with the APIC access page.
> > APIC page is always mapped into guest's APIC base address 0xfee00000.
> > The way it works is that when vCPU accesses page at 0xfee00000 the access
> > is translated to APIC access page physical address. CPU sees that access
> > is for APIC page and generates APIC access exit instead of memory access.
> > If address 0xfee00000 is not mapped by EPT then EPT violation exit will
> > be generated instead, EPT mapping will be instantiated, access retried
> > by a guest and this time will generate APIC access exit.
> 
> Right, confused with the other APIC page which the CPU writes (the vAPIC page) 
> to.
> 
That one is allocated with kmalloc.

--
			Gleb.
tangchen June 23, 2014, 1:48 a.m. UTC | #12
Hi Marcelo, Gleb,

Sorry for the delayed reply and thanks for the advices.

On 06/21/2014 04:39 AM, Marcelo Tosatti wrote:
> On Fri, Jun 20, 2014 at 05:31:46PM -0300, Marcelo Tosatti wrote:
>>> IIRC your shadow page pinning patch series support flushing of ptes
>>> by mmu notifier by forcing MMU reload and, as a result, faulting in of
>>> pinned pages during next entry.  Your patch series does not pin pages
>>> by elevating their page count.
>>
>> No but PEBS series does and its required to stop swap-out
>> of the page.
>
> Well actually no because of mmu notifiers.
>
> Tang, can you implement mmu notifiers for the other breaker of
> mem hotplug ?

I'll try the mmu notifier idea and send a patch soon.

Thanks.
tangchen June 30, 2014, 1:45 a.m. UTC | #13
On 06/21/2014 04:39 AM, Marcelo Tosatti wrote:
> On Fri, Jun 20, 2014 at 05:31:46PM -0300, Marcelo Tosatti wrote:
>>> IIRC your shadow page pinning patch series support flushing of ptes
>>> by mmu notifier by forcing MMU reload and, as a result, faulting in of
>>> pinned pages during next entry.  Your patch series does not pin pages
>>> by elevating their page count.
>>
>> No but PEBS series does and its required to stop swap-out
>> of the page.
>
> Well actually no because of mmu notifiers.
>
> Tang, can you implement mmu notifiers for the other breaker of
> mem hotplug ?

Hi Marcelo,

I made a patch to update the EPT identity page and the APIC access page
when they are found in the next EPT violation, and I also updated the
APIC_ACCESS_ADDR physical address. The pages can be migrated, but the
guest crashed.

How do I stop the guest from accessing the APIC page in the mmu_notifier
when the page migration starts?  Do I need to stop all the vcpus by
setting the vcpu state to KVM_MP_STATE_HALTED?  If so, the vcpus will not
be able to reach the next EPT violation.

So, may I write some specific value into APIC_ACCESS_ADDR to stop the
guest from accessing the APIC page?

Thanks.
Gleb Natapov June 30, 2014, 6 a.m. UTC | #14
On Mon, Jun 30, 2014 at 09:45:32AM +0800, Tang Chen wrote:
> On 06/21/2014 04:39 AM, Marcelo Tosatti wrote:
> >On Fri, Jun 20, 2014 at 05:31:46PM -0300, Marcelo Tosatti wrote:
> >>>IIRC your shadow page pinning patch series support flushing of ptes
> >>>by mmu notifier by forcing MMU reload and, as a result, faulting in of
> >>>pinned pages during next entry.  Your patch series does not pin pages
> >>>by elevating their page count.
> >>
> >>No but PEBS series does and its required to stop swap-out
> >>of the page.
> >
> >Well actually no because of mmu notifiers.
> >
> >Tang, can you implement mmu notifiers for the other breaker of
> >mem hotplug ?
> 
> Hi Marcelo,
> 
> I made a patch to update ept and apic pages when finding them in the
> next ept violation. And I also updated the APIC_ACCESS_ADDR phys_addr.
> The pages can be migrated, but the guest crashed.
How does it crash?

> 
> How do I stop guest from access apic pages in mmu_notifier when the
> page migration starts ?  Do I need to stop all the vcpus by set vcpu
> state to KVM_MP_STATE_HALTED ?  If so, the vcpu will not be able to go
> to the next ept violation.
When the APIC access page is unmapped from the EPT pages by mmu notifiers,
you need to set its value in the VMCS to a physical address that will
never be mapped into guest memory. Zero, for instance. You can do it by
introducing a new KVM_REQ_ bit and setting the VMCS value during the
vcpu's next vmentry. On the EPT violation you need to update the VMCS
pointer to the newly allocated physical address; you can use the same
KVM_REQ_ mechanism again.

> 
> So, may I write any specific value into APIC_ACCESS_ADDR to stop guest
> from access to apic page ?
> 
Any phys address that will never be mapped into guest's memory should work.

--
			Gleb.
tangchen June 30, 2014, 8:58 a.m. UTC | #15
Hi Gleb,

On 06/30/2014 02:00 PM, Gleb Natapov wrote:
> On Mon, Jun 30, 2014 at 09:45:32AM +0800, Tang Chen wrote:
>> On 06/21/2014 04:39 AM, Marcelo Tosatti wrote:
>>> On Fri, Jun 20, 2014 at 05:31:46PM -0300, Marcelo Tosatti wrote:
>>>>> IIRC your shadow page pinning patch series support flushing of ptes
>>>>> by mmu notifier by forcing MMU reload and, as a result, faulting in of
>>>>> pinned pages during next entry.  Your patch series does not pin pages
>>>>> by elevating their page count.
>>>>
>>>> No but PEBS series does and its required to stop swap-out
>>>> of the page.
>>>
>>> Well actually no because of mmu notifiers.
>>>
>>> Tang, can you implement mmu notifiers for the other breaker of
>>> mem hotplug ?
>>
>> Hi Marcelo,
>>
>> I made a patch to update ept and apic pages when finding them in the
>> next ept violation. And I also updated the APIC_ACCESS_ADDR phys_addr.
>> The pages can be migrated, but the guest crashed.
> How does it crash?

It just stopped running. The guest system is dead.
I'll try to debug it and give some more info.

>
>>
>> How do I stop guest from access apic pages in mmu_notifier when the
>> page migration starts ?  Do I need to stop all the vcpus by set vcpu
>> state to KVM_MP_STATE_HALTED ?  If so, the vcpu will not be able to go
>> to the next ept violation.
> When apic access page is unmapped from ept pages by mmu notifiers you
> need to set its value in VMCS to a physical address that will never be
> mapped into guest memory. Zero for instance. You can do it by introducing
> new KVM_REQ_ bit and set VMCS value during next vcpu's vmentry. On ept
> violation you need to update VMCS pointer to newly allocated physical
> address, you can use the same KVM_REQ_ mechanism again.
>
>>
>> So, may I write any specific value into APIC_ACCESS_ADDR to stop guest
>> from access to apic page ?
>>
> Any phys address that will never be mapped into guest's memory should work.

Thanks for the advice. I'll try it.

Thanks.

Patch

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 4064aca..6312577 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -30,6 +30,7 @@  extern int numa_off;
  */
 extern s16 __apicid_to_node[MAX_LOCAL_APIC];
 extern nodemask_t numa_nodes_parsed __initdata;
+extern nodemask_t numa_kernel_nodes;
 
 extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 extern void __init numa_set_distance(int from, int to, int distance);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 801332e..4a3b5b5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -32,6 +32,7 @@ 
 #include <linux/slab.h>
 #include <linux/tboot.h>
 #include <linux/hrtimer.h>
+#include <linux/mempolicy.h>
 #include "kvm_cache_regs.h"
 #include "x86.h"
 
@@ -3988,6 +3989,8 @@  static int alloc_apic_access_page(struct kvm *kvm)
 	if (r)
 		goto out;
 
+	numa_bind_non_movable(kvm_userspace_mem.userspace_addr, PAGE_SIZE);
+
 	page = gfn_to_page(kvm, 0xfee00);
 	if (is_error_page(page)) {
 		r = -EFAULT;
@@ -4018,6 +4021,8 @@  static int alloc_identity_pagetable(struct kvm *kvm)
 	if (r)
 		goto out;
 
+	numa_bind_non_movable(kvm_userspace_mem.userspace_addr, PAGE_SIZE);
+
 	page = gfn_to_page(kvm, kvm->arch.ept_identity_map_addr >> PAGE_SHIFT);
 	if (is_error_page(page)) {
 		r = -EFAULT;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f32a025..3962a23 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7295,6 +7295,7 @@  int kvm_arch_prepare_memory_region(struct kvm *kvm,
 			return PTR_ERR((void *)userspace_addr);
 
 		memslot->userspace_addr = userspace_addr;
+		mem->userspace_addr = userspace_addr;
 	}
 
 	return 0;
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a32b706..d706148 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -22,6 +22,8 @@ 
 
 int __initdata numa_off;
 nodemask_t numa_nodes_parsed __initdata;
+nodemask_t numa_kernel_nodes;
+EXPORT_SYMBOL(numa_kernel_nodes);
 
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
@@ -557,7 +559,6 @@  static void __init numa_init_array(void)
 static void __init numa_clear_kernel_node_hotplug(void)
 {
 	int i, nid;
-	nodemask_t numa_kernel_nodes = NODE_MASK_NONE;
 	unsigned long start, end;
 	struct memblock_region *r;
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index f230a97..14f3f04 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -140,6 +140,7 @@  bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
+extern long numa_bind_non_movable(unsigned long start, unsigned long len);
 extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
 				enum mpol_rebind_step step);
 extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
@@ -243,6 +244,11 @@  static inline void numa_default_policy(void)
 {
 }
 
+static inline long numa_bind_non_movable(unsigned long start, unsigned long len)
+{
+	return -EINVAL;
+}
+
 static inline void mpol_rebind_task(struct task_struct *tsk,
 				const nodemask_t *new,
 				enum mpol_rebind_step step)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2849742..20065a9 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -97,6 +97,7 @@ 
 
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
+#include <asm/numa.h>
 #include <linux/random.h>
 
 #include "internal.h"
@@ -2663,6 +2664,14 @@  void numa_default_policy(void)
 	do_set_mempolicy(MPOL_DEFAULT, 0, NULL);
 }
 
+/* Bind a memory range to non-movable nodes. */
+long numa_bind_non_movable(unsigned long start, unsigned long len)
+{
+	return do_mbind(start, len, MPOL_BIND, MPOL_MODE_FLAGS,
+			&numa_kernel_nodes, MPOL_MF_STRICT);
+}
+EXPORT_SYMBOL(numa_bind_non_movable);
+
 /*
  * Parse and format mempolicy from/to strings
  */