
[RFC] KVM: x86: conditionally acquire/release slots_lock on entry/exit

Message ID 20090827155450.GA6312@amt.cnet (mailing list archive)
State New, archived

Commit Message

Marcelo Tosatti Aug. 27, 2009, 3:54 p.m. UTC
perf report shows heavy overhead from down/up of slots_lock.

Attempted to remove slots_lock by having vcpus stop on a synchronization
point, but this introduced further complexity (a vcpu can be scheduled
out before reaching the synchronization point, and can sched back in at
points which are slots_lock protected, etc).

This patch changes vcpu_enter_guest to conditionally release/acquire
slots_lock in case a vcpu state bit is set.

vmexit performance improves by 5-10% on UP guest.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


Comments

Avi Kivity Aug. 27, 2009, 4:27 p.m. UTC | #1
On 08/27/2009 06:54 PM, Marcelo Tosatti wrote:
> perf report shows heavy overhead from down/up of slots_lock.
>
> Attempted to remove slots_lock by having vcpus stop on a synchronization
> point, but this introduced further complexity (a vcpu can be scheduled
> out before reaching the synchronization point, and can sched back in at
> points which are slots_lock protected, etc).
>
> This patch changes vcpu_enter_guest to conditionally release/acquire
> slots_lock in case a vcpu state bit is set.
>
> vmexit performance improves by 5-10% on UP guest.
>    

Sorry, it looks pretty complex.  Have you considered using srcu?  It 
seems to me down/up_read() can be replaced by srcu_read_lock/unlock(), 
and after proper conversion of memslots and io_bus to 
rcu_assign_pointer(), we can just add synchronize_srcu() immediately 
after changing stuff (of course mmu_lock still needs to be held when 
updating slots).
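
For reference, a minimal sketch of the read-side part of that suggestion
(the srcu_struct field in struct kvm, the pointer-valued kvm->memslots and
the helper name are assumptions for illustration, not KVM code of the time):

/* Replaces down_read()/up_read() of slots_lock around memslot users. */
int kvm_read_slots_example(struct kvm_vcpu *vcpu, gfn_t gfn)
{
        struct kvm *kvm = vcpu->kvm;
        int idx, r;

        idx = srcu_read_lock(&kvm->srcu);       /* assumed srcu_struct field */
        r = use_memslots(kvm, gfn);             /* placeholder for any code that
                                                   today runs under slots_lock */
        srcu_read_unlock(&kvm->srcu, idx);
        return r;
}

/* On updates, kvm->memslots/io_bus would become RCU-managed pointers:
 *      rcu_assign_pointer(kvm->memslots, new);
 *      synchronize_srcu(&kvm->srcu);
 *      kfree(old);
 */
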
Marcelo Tosatti Aug. 27, 2009, 10:59 p.m. UTC | #2
On Thu, Aug 27, 2009 at 07:27:48PM +0300, Avi Kivity wrote:
> On 08/27/2009 06:54 PM, Marcelo Tosatti wrote:
>> perf report shows heavy overhead from down/up of slots_lock.
>>
>> Attempted to remove slots_lock by having vcpus stop on a synchronization
>> point, but this introduced further complexity (a vcpu can be scheduled
>> out before reaching the synchronization point, and can sched back in at
>> points which are slots_lock protected, etc).
>>
>> This patch changes vcpu_enter_guest to conditionally release/acquire
>> slots_lock in case a vcpu state bit is set.
>>
>> vmexit performance improves by 5-10% on UP guest.
>>    
>
> Sorry, it looks pretty complex. 

Why?

> Have you considered using srcu? It seems to me down/up_read() can
> be replaced by srcu_read_lock/unlock(), and after proper conversion
> of memslots and io_bus to rcu_assign_pointer(), we can just add
> synchronize_srcu() immediately after changing stuff (of course
> mmu_lock still needs to be held when updating slots).

I don't see RCU as being suitable because in certain operations you
want to stop writers (on behalf of vcpus), do something, and let them
continue afterwards. The dirty log, for example. Or any operation that
wants to modify lockless vcpu-specific data.

Also, synchronize_srcu() is limited to preemptible sections.

io_bus could use RCU, but I think being able to stop vcpus is also a
different requirement. Do you have any suggestion on how to do it in a
better way?
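
For reference, the kind of operation meant here, sketched as a simplified
fetch-and-reset of the dirty bitmap that relies on the writers (vcpus) being
held off; this is an illustration, not the actual kvm_vm_ioctl_get_dirty_log():

/* simplified illustration of the "stop writers, do something, resume" pattern */
int get_and_reset_dirty_log_example(struct kvm *kvm, int slot,
                                    unsigned long __user *dest)
{
        struct kvm_memory_slot *memslot;
        unsigned long n;
        int r = 0;

        /* vcpus hold slots_lock for read while running, so taking it
         * for write stops them from dirtying pages under our feet. */
        down_write(&kvm->slots_lock);

        memslot = &kvm->memslots[slot];
        n = ALIGN(memslot->npages, BITS_PER_LONG) / 8;

        if (copy_to_user(dest, memslot->dirty_bitmap, n))
                r = -EFAULT;
        else
                memset(memslot->dirty_bitmap, 0, n);    /* reset while no vcpu
                                                           can set bits        */
        up_write(&kvm->slots_lock);
        return r;
}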



Avi Kivity Aug. 28, 2009, 6:50 a.m. UTC | #3
On 08/28/2009 01:59 AM, Marcelo Tosatti wrote:
> On Thu, Aug 27, 2009 at 07:27:48PM +0300, Avi Kivity wrote:
>    
>> On 08/27/2009 06:54 PM, Marcelo Tosatti wrote:
>>      
>>> perf report shows heavy overhead from down/up of slots_lock.
>>>
>>> Attempted to remove slots_lock by having vcpus stop on a synchronization
>>> point, but this introduced further complexity (a vcpu can be scheduled
>>> out before reaching the synchronization point, and can sched back in at
>>> points which are slots_lock protected, etc).
>>>
>>> This patch changes vcpu_enter_guest to conditionally release/acquire
>>> slots_lock in case a vcpu state bit is set.
>>>
>>> vmexit performance improves by 5-10% on UP guest.
>>>
>>>        
>> Sorry, it looks pretty complex.
>>      
> Why?
>    

There's a new locking protocol in there.  I prefer sticking with the 
existing kernel plumbing, or it gets more and more difficult knowing who 
protects what and in what order you can do things.

>> Have you considered using srcu? It seems to me down/up_read() can
>> be replaced by srcu_read_lock/unlock(), and after proper conversion
>> of memslots and io_bus to rcu_assign_pointer(), we can just add
>> synchronize_srcu() immediately after changing stuff (of course
>> mmu_lock still needs to be held when updating slots).
>>      
> I don't see RCU as being suitable because in certain operations you
> want to stop writers (on behalf of vcpus), do something, and let them
> continue afterwards. The dirty log, for example. Or any operation that
> wants to modify lockless vcpu specific data.
>    

kvm_flush_remote_tlbs() (which you'd call after mmu operations) will 
get cpus out of guest mode, and synchronize_srcu() will wait for them to 
drop the srcu "read lock".  So it really happens naturally: do an RCU 
update, send some request to all vcpus, synchronize_srcu(), done.

> Also, synchronize_srcu() is limited to preemptible sections.
>
> io_bus could use RCU, but I think being able to stop vcpus is also a
> different requirement. Do you have any suggestion on how to do it in a
> better way?
>    

We don't need to stop vcpus, just kick them out of guest mode to let 
them notice the new data.  SRCU does that well.
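
The sequence described above, sketched for reference (kvm_vcpus_request() is
the helper used in the patch below; the srcu_struct field and the
pointer-valued kvm->memslots are assumed parts of the proposed conversion):

void kvm_update_slots_example(struct kvm *kvm, struct kvm_memslots *new)
{
        struct kvm_memslots *old = kvm->memslots;

        rcu_assign_pointer(kvm->memslots, new);         /* 1. RCU update        */
        kvm_vcpus_request(kvm, KVM_REQ_MMU_RELOAD);     /* 2. kick vcpus out of
                                                              guest mode         */
        synchronize_srcu(&kvm->srcu);                   /* 3. wait for readers
                                                              still using 'old'  */
        kfree(old);                                     /* 4. done               */
}
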
Marcelo Tosatti Sept. 10, 2009, 10:30 p.m. UTC | #4
On Fri, Aug 28, 2009 at 09:50:36AM +0300, Avi Kivity wrote:
> On 08/28/2009 01:59 AM, Marcelo Tosatti wrote:
>> On Thu, Aug 27, 2009 at 07:27:48PM +0300, Avi Kivity wrote:
>>    
>>> On 08/27/2009 06:54 PM, Marcelo Tosatti wrote:
>>>      
>>>> perf report shows heavy overhead from down/up of slots_lock.
>>>>
>>>> Attempted to remove slots_lock by having vcpus stop on a synchronization
>>>> point, but this introduced further complexity (a vcpu can be scheduled
>>>> out before reaching the synchronization point, and can sched back in at
>>>> points which are slots_lock protected, etc).
>>>>
>>>> This patch changes vcpu_enter_guest to conditionally release/acquire
>>>> slots_lock in case a vcpu state bit is set.
>>>>
>>>> vmexit performance improves by 5-10% on UP guest.
>>>>
>>>>        
>>> Sorry, it looks pretty complex.
>>>      
>> Why?
>>    
>
> There's a new locking protocol in there.  I prefer sticking with the  
> existing kernel plumbing, or it gets more and more difficult knowing who  
> protects what and in what order you can do things.
>
>>> Have you considered using srcu? It seems to me down/up_read() can
>>> be replaced by srcu_read_lock/unlock(), and after proper conversion
>>> of memslots and io_bus to rcu_assign_pointer(), we can just add
>>> synchronize_srcu() immediately after changing stuff (of course
>>> mmu_lock still needs to be held when updating slots).
>>>      
>> I don't see RCU as being suitable because in certain operations you
>> want to stop writers (on behalf of vcpus), do something, and let them
>> continue afterwards. The dirty log, for example. Or any operation that
>> wants to modify lockless vcpu specific data.
>>    
>
> kvm_flush_remote_tlbs() (which you'd call after mmu operations), will  
> get cpus out of guest mode, and synchronize_srcu() will wait for them to  
> drop the srcu "read lock".  So it really happens naturally: do an RCU  
> update, send some request to all vcpus, synchronize_srcu(), done.
>
>> Also, synchronize_srcu() is limited to preemptible sections.
>>
>> io_bus could use RCU, but I think being able to stop vcpus is also a
>> different requirement. Do you have any suggestion on how to do it in a
>> better way?
>>    
>
> We don't need to stop vcpus, just kick them out of guest mode to let  
> them notice the new data.  SRCU does that well.

Two problems:

1. The removal of memslots/aliases and zapping of mmu (to remove any
shadow pages with stale sp->gfn) must be atomic from the POV of the
pagefault path. Otherwise something like this can happen:

fault path					set_memory_region

walk_addr fetches and validates
table_gfns
						delete memslot
						synchronize_srcu

fetch creates shadow
page with nonexistent sp->gfn

OR

mmu_alloc_roots path				set_memory_region
						
						delete memslot
root_gfn = vcpu->arch.cr3 << PAGE_SHIFT
mmu_check_root(root_gfn)			synchronize_rcu
kvm_mmu_get_page()
						kvm_mmu_zap_all

Accesses between kvm_mmu_get_page and kvm_mmu_zap_all window can see
shadow pages with stale gfn.

But, if you still think it's worthwhile to use RCU, at least handling
gfn_to_memslot / unalias_gfn errors _and_ adding mmu_notifier_retry
invalidation to the set_memory_region path will be necessary (so that
gfn_to_pfn validation, in the fault path, is discarded in case
of memslot/alias update).
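
For reference, one way such an invalidation could look, modelled loosely on
the existing mmu_notifier_seq/mmu_notifier_retry() pattern; the memslot_seq
counter and both helpers are hypothetical:

static inline int memslots_changed(struct kvm *kvm, unsigned long seq)
{
        smp_rmb();
        return kvm->memslot_seq != seq;         /* hypothetical counter, bumped
                                                   by set_memory_region         */
}

int fault_path_example(struct kvm_vcpu *vcpu, gfn_t gfn)
{
        struct kvm *kvm = vcpu->kvm;
        unsigned long seq = kvm->memslot_seq;
        pfn_t pfn;

        smp_rmb();
        pfn = gfn_to_pfn(kvm, gfn);             /* may race with a slot update */

        spin_lock(&kvm->mmu_lock);
        if (memslots_changed(kvm, seq)) {
                spin_unlock(&kvm->mmu_lock);
                kvm_release_pfn_clean(pfn);
                return -EAGAIN;                 /* discard the work, refault */
        }
        /* ... safe to instantiate shadow pages for this gfn here ... */
        spin_unlock(&kvm->mmu_lock);
        return 0;
}

/* set_memory_region side, under mmu_lock:  kvm->memslot_seq++; smp_wmb(); */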

2. Another complication is that memslot creation and kvm_iommu_map_pages
are not atomic.

create memslot
synchronize_srcu
				<----- vcpu grabs gfn reference without
				       iommu mapping.
kvm_iommu_map_pages

Which can be solved by changing kvm_iommu_map_pages (and a new gfn_to_pfn
helper) to use base_gfn, npages and hva information from somewhere other
than the visible kvm->memslots (so that when the slot becomes visible it's
already iommu mapped).

So it appears to me that using RCU introduces more complications / subtle
details than mutual exclusion here. The new request bit which the
original patch introduces is limited to enabling/disabling conditional
acquisition of slots_lock (calling it a "new locking protocol" is unfair)
to improve write acquisition latency.

Better ideas/directions welcome.
Avi Kivity Sept. 13, 2009, 3:42 p.m. UTC | #5
On 09/11/2009 01:30 AM, Marcelo Tosatti wrote:
>
>> We don't need to stop vcpus, just kick them out of guest mode to let
>> them notice the new data.  SRCU does that well.
>>      
> Two problems:
>
> 1. The removal of memslots/aliases and zapping of mmu (to remove any
> shadow pages with stale sp->gfn) must be atomic from the POV of the
> pagefault path. Otherwise something like this can happen:
>
> fault path					set_memory_region
>    

srcu_read_lock()

> walk_addr fetches and validates
> table_gfns
> 						delete memslot
> 						synchronize_srcu
>
> fetch creates shadow
>    

srcu_read_unlock()

> page with nonexistant sp->gfn
>    

I think synchronize_srcu() will be deferred until the fault path is 
complete (and srcu_read_unlock() runs).  Copying someone who knows for sure.


> OR
>
> mmu_alloc_roots path				set_memory_region
>    

srcu_read_lock()

> 						
> 						delete memslot
> root_gfn = vcpu->arch.cr3<<  PAGE_SHIFT
> mmu_check_root(root_gfn)			synchronize_rcu
> kvm_mmu_get_page()
>    

srcu_read_unlock()

> 						kvm_mmu_zap_all
>    

Ditto, srcu_read_lock() protects us.

> Accesses between kvm_mmu_get_page and kvm_mmu_zap_all window can see
> shadow pages with stale gfn.
>
> But, if you still think its worthwhile to use RCU, at least handling
> gfn_to_memslot / unalias_gfn errors _and_ adding mmu_notifier_retry
> invalidation to set_memory_region path will be necessary (so that
> gfn_to_pfn validation, in the fault path, is discarded in case
> of memslot/alias update).
>    

It really is worthwhile to reuse complex infrastructure instead of 
writing new infrastructure.

> 2. Another complication is that memslot creation and kvm_iommu_map_pages
> are not atomic.
>
> create memslot
> synchronize_srcu
> 				<----- vcpu grabs gfn reference without
> 				       iommu mapping.
> kvm_iommu_map_pages
>
> Which can be solved by changing kvm_iommu_map_pages (and new gfn_to_pfn
> helper) to use base_gfn,npages,hva information from somewhere else other
> than visible kvm->memslots (so that when the slot becomes visible its
> already iommu mapped).
>    

Yes.  It can accept a memslots structure instead of deriving it from 
kvm->memslots.  Then we do a rcu_assign_pointer() to switch the tables.
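
A sketch of that ordering, for reference (the separately allocated
kvm_memslots structure, the kvm->srcu field and kvm_iommu_map_slot() are
assumptions used for illustration):

int add_memslot_example(struct kvm *kvm, int id,
                        const struct kvm_memory_slot *new_slot)
{
        struct kvm_memslots *old = kvm->memslots, *new;
        int r;

        new = kmemdup(old, sizeof(*old), GFP_KERNEL);
        if (!new)
                return -ENOMEM;
        new->memslots[id] = *new_slot;

        /* iommu mapping is done while the slot is still invisible ...      */
        r = kvm_iommu_map_slot(kvm, &new->memslots[id]);        /* assumed
                                   variant of kvm_iommu_map_pages()          */
        if (r) {
                kfree(new);
                return r;
        }

        /* ... so by the time vcpus can see it, it is already mapped.       */
        rcu_assign_pointer(kvm->memslots, new);
        synchronize_srcu(&kvm->srcu);
        kfree(old);
        return 0;
}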

> So it appears to me using RCU introduces more complications / subtle
> details than mutual exclusion here. The new request bit which the
> original patch introduces is limited to enabling/disabling conditional
> acquision of slots_lock (calling it a "new locking protocol" is unfair)
> to improve write acquision latency.
>    

It's true that it is not a new locking protocol.  But I feel it is 
worthwhile to try to use rcu for this; at least it will make it easier 
for newcomers (provided they understand rcu).
Paul E. McKenney Sept. 13, 2009, 4:26 p.m. UTC | #6
On Sun, Sep 13, 2009 at 06:42:49PM +0300, Avi Kivity wrote:
> On 09/11/2009 01:30 AM, Marcelo Tosatti wrote:
>>
>>> We don't need to stop vcpus, just kick them out of guest mode to let
>>> them notice the new data.  SRCU does that well.
>>>      
>> Two problems:
>>
>> 1. The removal of memslots/aliases and zapping of mmu (to remove any
>> shadow pages with stale sp->gfn) must be atomic from the POV of the
>> pagefault path. Otherwise something like this can happen:
>>
>> fault path					set_memory_region
>>    
>
> srcu_read_lock()
>
>> walk_addr fetches and validates
>> table_gfns
>> 						delete memslot
>> 						synchronize_srcu
>>
>> fetch creates shadow
>>    
>
> srcu_read_unlock()
>
>> page with nonexistant sp->gfn
>>    
>
> I think synchronize_srcu() will be deferred until the fault path is 
> complete (and srcu_read_unlock() runs).  Copying someone who knows for 
> sure.

Yes, synchronize_srcu() will block until srcu_read_unlock() in this
scenario, assuming that the same srcu_struct is used by both.

>> OR
>>
>> mmu_alloc_roots path				set_memory_region
>>    
>
> srcu_read_lock()
>
>> 						
>> 						delete memslot
>> root_gfn = vcpu->arch.cr3<<  PAGE_SHIFT
>> mmu_check_root(root_gfn)			synchronize_rcu
>> kvm_mmu_get_page()
>>    
>
> srcu_read_unlock()
>
>> 						kvm_mmu_zap_all
>>    
>
> Ditto, srcu_read_lock() protects us.

Yep!

>> Accesses between kvm_mmu_get_page and kvm_mmu_zap_all window can see
>> shadow pages with stale gfn.
>>
>> But, if you still think its worthwhile to use RCU, at least handling
>> gfn_to_memslot / unalias_gfn errors _and_ adding mmu_notifier_retry
>> invalidation to set_memory_region path will be necessary (so that
>> gfn_to_pfn validation, in the fault path, is discarded in case
>> of memslot/alias update).
>
> It really is worthwhile to reuse complex infrastructure instead of writing 
> new infrastructure.

Marcelo, in your first example, is your concern that the fault path
needs to detect the memslot deletion?  Or that the use of sp->gfn "leaks"
out of the SRCU read-side critical section?

							Thanx, Paul

>> 2. Another complication is that memslot creation and kvm_iommu_map_pages
>> are not atomic.
>>
>> create memslot
>> synchronize_srcu
>> 				<----- vcpu grabs gfn reference without
>> 				       iommu mapping.
>> kvm_iommu_map_pages
>>
>> Which can be solved by changing kvm_iommu_map_pages (and new gfn_to_pfn
>> helper) to use base_gfn,npages,hva information from somewhere else other
>> than visible kvm->memslots (so that when the slot becomes visible its
>> already iommu mapped).
>
> Yes.  It can accept a memslots structure instead of deriving it from 
> kvm->memslots.  Then we do a rcu_assign_pointer() to switch the tables.
>
>> So it appears to me using RCU introduces more complications / subtle
>> details than mutual exclusion here. The new request bit which the
>> original patch introduces is limited to enabling/disabling conditional
>> acquision of slots_lock (calling it a "new locking protocol" is unfair)
>> to improve write acquision latency.
>>    
>
> It's true that it is not a new locking protocol.  But I feel it is 
> worthwhile to try to use rcu for this; at least it will make it easier for 
> newcomers (provided they understand rcu).
>
>
> -- 
> error compiling committee.c: too many arguments to function
>
Marcelo Tosatti Sept. 13, 2009, 10:49 p.m. UTC | #7
On Sun, Sep 13, 2009 at 09:26:52AM -0700, Paul E. McKenney wrote:
> On Sun, Sep 13, 2009 at 06:42:49PM +0300, Avi Kivity wrote:
> > On 09/11/2009 01:30 AM, Marcelo Tosatti wrote:
> >>
> >>> We don't need to stop vcpus, just kick them out of guest mode to let
> >>> them notice the new data.  SRCU does that well.
> >>>      
> >> Two problems:
> >>
> >> 1. The removal of memslots/aliases and zapping of mmu (to remove any
> >> shadow pages with stale sp->gfn) must be atomic from the POV of the
> >> pagefault path. Otherwise something like this can happen:
> >>
> >> fault path					set_memory_region
> >>    
> >
> > srcu_read_lock()
> >
> >> walk_addr fetches and validates
> >> table_gfns
> >> 						delete memslot
> >> 						synchronize_srcu
> >>
> >> fetch creates shadow
> >>    
> >
> > srcu_read_unlock()
> >
> >> page with nonexistant sp->gfn
> >>    
> >
> > I think synchronize_srcu() will be deferred until the fault path is 
> > complete (and srcu_read_unlock() runs).  Copying someone who knows for 
> > sure.
> 
> Yes, synchronize_srcu() will block until srcu_read_unlock() in this
> scenario, assuming that the same srcu_struct is used by both.

Right, it will. But this does not stop the fault path from creating
shadow pages with stale sp->gfn (the only way to do that would be mutual
exclusion AFAICS).

> >> OR
> >>
> >> mmu_alloc_roots path				set_memory_region
> >>    
> >
> > srcu_read_lock()
> >
> >> 						
> >> 						delete memslot
> >> root_gfn = vcpu->arch.cr3<<  PAGE_SHIFT
> >> mmu_check_root(root_gfn)			synchronize_rcu
> >> kvm_mmu_get_page()
> >>    
> >
> > srcu_read_unlock()
> >
> >> 						kvm_mmu_zap_all
> >>    
> >
> > Ditto, srcu_read_lock() protects us.
> 
> Yep!

The RCU read-protected side does not stop a new memslots pointer from
being assigned (with rcu_assign_pointer), does it? 

> >> Accesses between kvm_mmu_get_page and kvm_mmu_zap_all window can see
> >> shadow pages with stale gfn.
> >>
> >> But, if you still think its worthwhile to use RCU, at least handling
> >> gfn_to_memslot / unalias_gfn errors _and_ adding mmu_notifier_retry
> >> invalidation to set_memory_region path will be necessary (so that
> >> gfn_to_pfn validation, in the fault path, is discarded in case
> >> of memslot/alias update).
> >
> > It really is worthwhile to reuse complex infrastructure instead of writing 
> > new infrastructure.
> 
> Marcelo, in your first example, is your concern that the fault path
> needs to detect the memslot deletion? 

Yes, it needs to invalidate the leakage, which in this case is a shadow
page data structure that was created containing information from a
now-deleted memslot.

> Or that the use of sp->gfn "leaks" out of the SRCU read-side critical
> section?

Yes, use of a stale sp->gfn leaks outside of the SRCU read-side critical
section, and currently the rest of the code is not ready to deal with
that... but it will have to.
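
Schematically, the leak looks like this (simplified, not the real mmu code):

/*
 *   srcu_read_lock();
 *       slot = gfn_to_memslot(kvm, gfn);       // resolves against old slots
 *       sp = kvm_mmu_get_page(...);            // sp->gfn = gfn is recorded
 *   srcu_read_unlock();
 *
 *   ... later, long after the slot has been deleted ...
 *
 *   kvm_mmu_zap_page(kvm, sp);
 *       unaccount_shadowed(kvm, sp->gfn);      // looks up a memslot that
 *                                              // no longer exists
 */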

> 							Thanx, Paul
> 
> >> 2. Another complication is that memslot creation and kvm_iommu_map_pages
> >> are not atomic.
> >>
> >> create memslot
> >> synchronize_srcu
> >> 				<----- vcpu grabs gfn reference without
> >> 				       iommu mapping.
> >> kvm_iommu_map_pages
> >>
> >> Which can be solved by changing kvm_iommu_map_pages (and new gfn_to_pfn
> >> helper) to use base_gfn,npages,hva information from somewhere else other
> >> than visible kvm->memslots (so that when the slot becomes visible its
> >> already iommu mapped).
> >
> > Yes.  It can accept a memslots structure instead of deriving it from 
> > kvm->memslots.  Then we do a rcu_assign_pointer() to switch the tables.

Alright.

> >> So it appears to me using RCU introduces more complications / subtle
> >> details than mutual exclusion here. The new request bit which the
> >> original patch introduces is limited to enabling/disabling conditional
> >> acquision of slots_lock (calling it a "new locking protocol" is unfair)
> >> to improve write acquision latency.
> >>    
> >
> > It's true that it is not a new locking protocol.  But I feel it is 
> > worthwhile to try to use rcu for this; at least it will make it easier for 
> > newcomers (provided they understand rcu).

Sure.

Avi Kivity Sept. 14, 2009, 5:03 a.m. UTC | #8
On 09/14/2009 01:49 AM, Marcelo Tosatti wrote:
>
>>> I think synchronize_srcu() will be deferred until the fault path is
>>> complete (and srcu_read_unlock() runs).  Copying someone who knows for
>>> sure.
>>>        
>> Yes, synchronize_srcu() will block until srcu_read_unlock() in this
>> scenario, assuming that the same srcu_struct is used by both.
>>      
> Right it will. But this does not stop the fault path from creating
> shadow pages with stale sp->gfn (the only way to do that would be mutual
> exclusion AFAICS).
>    

So we put the kvm_mmu_zap_pages() call as part of the synchronize_srcu() 
callback to take advantage of the srcu guarantees.  We know that when 
the callback is called, all new readers see the new slots and all old 
readers have completed.

> The RCU read-protected side does not stop a new memslots pointer from
> being assigned (with rcu_assign_pointer), does it?
>
>    

It doesn't.  It only gives you a point in time where you know no one is 
using the old pointer, but before it has been deleted.
Avi Kivity Sept. 14, 2009, 7:17 a.m. UTC | #9
On 09/14/2009 08:03 AM, Avi Kivity wrote:
>> Right it will. But this does not stop the fault path from creating
>> shadow pages with stale sp->gfn (the only way to do that would be mutual
>> exclusion AFAICS).
>
> So we put the kvm_mmu_zap_pages() call as part of the 
> synchronize_srcu() callback to take advantage of the srcu guarantees.  
> We know that when when the callback is called all new reads see the 
> new slots and all old readers have completed.

I think I see your concern - assigning sp->gfn leaks information out of 
the srcu critical section.

Two ways out:

1) copy kvm->slots into sp->slots and use it when dropping the shadow 
page.  Intrusive and increases shadow footprint.

1b) Instead of sp->slots, use a 1-bit generation counter.  Even uglier 
but reduces the shadow footprint.

2) instead of removing the slot in rcu_assign_pointer(), mark it 
invalid.  gfn_to_page() will fail on such slots but the teardown paths 
(like unaccount_shadow) continue to work.  Once we've zapped the mmu we 
drop the slot completely (can do in place, no need to rcu_assign_pointer).
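
Roughly, option 2 could look like this (the invalid flag is an assumption
about how the marking might be done; bad_hva() is the helper used by the
existing gfn_to_hva()):

unsigned long gfn_to_hva_example(struct kvm *kvm, gfn_t gfn)
{
        struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);

        /* New translations fail on a slot being deleted, but the slot
         * itself is still there, so teardown paths that only need its
         * base_gfn/npages/rmap keep working.  'invalid' is hypothetical. */
        if (!slot || slot->invalid)
                return bad_hva();
        return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
}

/* Deletion then becomes two phases:
 *   1. mark the slot invalid, synchronize_srcu(), zap the mmu;
 *   2. drop the slot for real - no shadow page can reference it anymore.
 */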

Patch

Index: kvm-requests/arch/x86/kvm/vmx.c
===================================================================
--- kvm-requests.orig/arch/x86/kvm/vmx.c
+++ kvm-requests/arch/x86/kvm/vmx.c
@@ -2169,7 +2169,7 @@  static int alloc_apic_access_page(struct
 	struct kvm_userspace_memory_region kvm_userspace_mem;
 	int r = 0;
 
-	down_write(&kvm->slots_lock);
+	kvm_grab_global_lock(kvm);
 	if (kvm->arch.apic_access_page)
 		goto out;
 	kvm_userspace_mem.slot = APIC_ACCESS_PAGE_PRIVATE_MEMSLOT;
@@ -2191,7 +2191,7 @@  static int alloc_identity_pagetable(stru
 	struct kvm_userspace_memory_region kvm_userspace_mem;
 	int r = 0;
 
-	down_write(&kvm->slots_lock);
+	kvm_grab_global_lock(kvm);
 	if (kvm->arch.ept_identity_pagetable)
 		goto out;
 	kvm_userspace_mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT;
Index: kvm-requests/arch/x86/kvm/x86.c
===================================================================
--- kvm-requests.orig/arch/x86/kvm/x86.c
+++ kvm-requests/arch/x86/kvm/x86.c
@@ -1926,7 +1926,7 @@  static int kvm_vm_ioctl_set_nr_mmu_pages
 	if (kvm_nr_mmu_pages < KVM_MIN_ALLOC_MMU_PAGES)
 		return -EINVAL;
 
-	down_write(&kvm->slots_lock);
+	kvm_grab_global_lock(kvm);
 	spin_lock(&kvm->mmu_lock);
 
 	kvm_mmu_change_mmu_pages(kvm, kvm_nr_mmu_pages);
@@ -1982,7 +1982,7 @@  static int kvm_vm_ioctl_set_memory_alias
 	    < alias->target_phys_addr)
 		goto out;
 
-	down_write(&kvm->slots_lock);
+	kvm_grab_global_lock(kvm);
 	spin_lock(&kvm->mmu_lock);
 
 	p = &kvm->arch.aliases[alias->slot];
@@ -2137,7 +2137,7 @@  int kvm_vm_ioctl_get_dirty_log(struct kv
 	struct kvm_memory_slot *memslot;
 	int is_dirty = 0;
 
-	down_write(&kvm->slots_lock);
+	kvm_grab_global_lock(kvm);
 
 	r = kvm_get_dirty_log(kvm, log, &is_dirty);
 	if (r)
@@ -2253,7 +2253,7 @@  long kvm_arch_vm_ioctl(struct file *filp
 				   sizeof(struct kvm_pit_config)))
 			goto out;
 	create_pit:
-		down_write(&kvm->slots_lock);
+		kvm_grab_global_lock(kvm);
 		r = -EEXIST;
 		if (kvm->arch.vpit)
 			goto create_pit_unlock;
@@ -3548,7 +3548,7 @@  static void inject_pending_event(struct 
 
 static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 {
-	int r;
+	int r, dropped_slots_lock = 0;
 	bool req_int_win = !irqchip_in_kernel(vcpu->kvm) &&
 		vcpu->run->request_interrupt_window;
 
@@ -3616,7 +3616,10 @@  static int vcpu_enter_guest(struct kvm_v
 		kvm_lapic_sync_to_vapic(vcpu);
 	}
 
-	up_read(&vcpu->kvm->slots_lock);
+	if (unlikely(test_bit(KVM_VCPU_DROP_LOCK, &vcpu->vcpu_state))) {
+		dropped_slots_lock = 1;
+		up_read(&vcpu->kvm->slots_lock);
+	}
 
 	kvm_guest_enter();
 
@@ -3668,8 +3671,8 @@  static int vcpu_enter_guest(struct kvm_v
 
 	preempt_enable();
 
-	down_read(&vcpu->kvm->slots_lock);
-
+	if (dropped_slots_lock)
+		down_read(&vcpu->kvm->slots_lock);
 	/*
 	 * Profile KVM exit RIPs:
 	 */
Index: kvm-requests/include/linux/kvm_host.h
===================================================================
--- kvm-requests.orig/include/linux/kvm_host.h
+++ kvm-requests/include/linux/kvm_host.h
@@ -44,6 +44,7 @@ 
 
 #define KVM_VCPU_GUEST_MODE        0
 #define KVM_VCPU_KICKED            1
+#define KVM_VCPU_DROP_LOCK         2
 
 struct kvm;
 struct kvm_vcpu;
@@ -408,6 +409,7 @@  void kvm_unregister_irq_ack_notifier(str
 				   struct kvm_irq_ack_notifier *kian);
 int kvm_request_irq_source_id(struct kvm *kvm);
 void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
+void kvm_grab_global_lock(struct kvm *kvm);
 
 /* For vcpu->arch.iommu_flags */
 #define KVM_IOMMU_CACHE_COHERENCY	0x1
Index: kvm-requests/virt/kvm/coalesced_mmio.c
===================================================================
--- kvm-requests.orig/virt/kvm/coalesced_mmio.c
+++ kvm-requests/virt/kvm/coalesced_mmio.c
@@ -117,7 +117,7 @@  int kvm_vm_ioctl_register_coalesced_mmio
 	if (dev == NULL)
 		return -EINVAL;
 
-	down_write(&kvm->slots_lock);
+	kvm_grab_global_lock(kvm);
 	if (dev->nb_zones >= KVM_COALESCED_MMIO_ZONE_MAX) {
 		up_write(&kvm->slots_lock);
 		return -ENOBUFS;
@@ -140,7 +140,7 @@  int kvm_vm_ioctl_unregister_coalesced_mm
 	if (dev == NULL)
 		return -EINVAL;
 
-	down_write(&kvm->slots_lock);
+	kvm_grab_global_lock(kvm);
 
 	i = dev->nb_zones;
 	while(i) {
Index: kvm-requests/virt/kvm/eventfd.c
===================================================================
--- kvm-requests.orig/virt/kvm/eventfd.c
+++ kvm-requests/virt/kvm/eventfd.c
@@ -498,7 +498,7 @@  kvm_assign_ioeventfd(struct kvm *kvm, st
 	else
 		p->wildcard = true;
 
-	down_write(&kvm->slots_lock);
+	kvm_grab_global_lock(kvm);
 
 	/* Verify that there isnt a match already */
 	if (ioeventfd_check_collision(kvm, p)) {
@@ -541,7 +541,7 @@  kvm_deassign_ioeventfd(struct kvm *kvm, 
 	if (IS_ERR(eventfd))
 		return PTR_ERR(eventfd);
 
-	down_write(&kvm->slots_lock);
+	kvm_grab_global_lock(kvm);
 
 	list_for_each_entry_safe(p, tmp, &kvm->ioeventfds, list) {
 		bool wildcard = !(args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH);
Index: kvm-requests/virt/kvm/kvm_main.c
===================================================================
--- kvm-requests.orig/virt/kvm/kvm_main.c
+++ kvm-requests/virt/kvm/kvm_main.c
@@ -787,6 +787,22 @@  void kvm_reload_remote_mmus(struct kvm *
 	kvm_vcpus_request(kvm, KVM_REQ_MMU_RELOAD);
 }
 
+void kvm_grab_global_lock(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		set_bit(KVM_VCPU_DROP_LOCK, &vcpu->vcpu_state);
+		barrier();
+		kvm_vcpu_ipi(vcpu);
+	}
+	down_write(&kvm->slots_lock);
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		clear_bit(KVM_VCPU_DROP_LOCK, &vcpu->vcpu_state);
+}
+EXPORT_SYMBOL_GPL(kvm_grab_global_lock);
+
 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 {
 	struct page *page;
@@ -1286,7 +1302,7 @@  int kvm_set_memory_region(struct kvm *kv
 {
 	int r;
 
-	down_write(&kvm->slots_lock);
+	kvm_grab_global_lock(kvm);
 	r = __kvm_set_memory_region(kvm, mem, user_alloc);
 	up_write(&kvm->slots_lock);
 	return r;
@@ -2556,7 +2572,7 @@  int kvm_io_bus_register_dev(struct kvm *
 {
 	int ret;
 
-	down_write(&kvm->slots_lock);
+	kvm_grab_global_lock(kvm);
 	ret = __kvm_io_bus_register_dev(bus, dev);
 	up_write(&kvm->slots_lock);
 
@@ -2579,7 +2595,7 @@  void kvm_io_bus_unregister_dev(struct kv
 			       struct kvm_io_bus *bus,
 			       struct kvm_io_device *dev)
 {
-	down_write(&kvm->slots_lock);
+	kvm_grab_global_lock(kvm);
 	__kvm_io_bus_unregister_dev(bus, dev);
 	up_write(&kvm->slots_lock);
 }