| Message ID | 20090312171843.GU27823@random.random (mailing list archive) |
| --- | --- |
| State | Accepted |
Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> While looking at invlpg out of sync code with Izik I think I noticed a
> missing smp tlb flush here. Without this the other cpu can still write
> to a freed host physical page. The smp tlb flush must always happen
> before mmu_lock is released if rmap_remove is called, because the VM
> will take the mmu_lock before it can finally add the page to the
> freelist after swapout. The mmu notifier makes it safe to flush the
> tlb after freeing the page (otherwise it would never be safe), so we
> can do a single flush for multiple invalidated sptes.

Izik pointed out that for invlpg, the guest is responsible for smp tlb
flushes, and mmu notifiers will protect against pageout. We still have
a couple of holes, though, with the current code:

- tlb loaded with an entry
- guest invlpg
- invlpg code drops the spte and rmap entry
- pageout
- mmu notifiers don't find an rmap entry, so the tlb is not flushed

The second hole is much simpler: we need at least a local invlpg. This
doesn't show up on Intel, since a vmexit will flush the entire tlb (and
most AMDs have NPT by now).

I think we can fix this without taking the hit of the IPI by:

- running a local invlpg()
- making need_flush a vm flag instead of a local
- clearing need_flush whenever remote tlbs are flushed
- flushing remote tlbs on an mmu_notifier call when need_flush is set

Since mmu notifier calls are rare, this would collapse many remote tlb
flushes into one.
On Sun, Mar 15, 2009 at 12:35:48PM +0200, Avi Kivity wrote:
> Izik pointed out that for invlpg, the guest is responsible for smp tlb
> flushes, and mmu notifiers will protect against pageout.

How will the mmu notifier protect against pageout if the spte is
already invalid and removed from the rmapp chain? The mmu notifier will
search the rmapp chain, find nothing, and do nothing, so the page will
be freed under the other cpus without any ipi flushing their VT tlbs.

All the mmu notifier does is protect against pageout until the mmu_lock
is released, so that you can send a single ipi to the other physical
cpus after a flood of rmap_remove calls, without having to build an
array of pages like arch/x86/include/asm/tlb.h does. If the VM was in
the process of swapping out that page while we were inside the mmu_lock
protected critical section, the mmu notifier will force the swap path
to take vcpu->kvm->mmu_lock first for each kvm instance registered with
the mmu notifier. But after taking that lock, the mmu notifier will do
nothing if rmap_remove already ran before the mmu_lock was released (as
in this case). The mmu_lock just stops the swap temporarily, so that
the free of the page waits until the ipi has been delivered to all
cpus. As far as I can tell, it's up to the kvm code that takes the lock
to flush the tlbs of any other running guest before it is allowed to
release the mmu_lock.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrea Arcangeli wrote:
> On Sun, Mar 15, 2009 at 12:35:48PM +0200, Avi Kivity wrote:
>
>> Izik pointed out that for invlpg, the guest is responsible for smp tlb
>> flushes, and mmu notifiers will protect against pageout.
>
> How will the mmu notifier protect against pageout if the spte is
> already invalid and removed from the rmapp chain? The mmu notifier
> will search the rmapp chain, find nothing, and do nothing, so the
> page will be freed under the other cpus without any ipi flushing
> their VT tlbs.

I mentioned this:

> I think we can fix this without taking the hit of the IPI by:
> - running a local invlpg()
> - making need_flush a vm flag instead of a local
> - clearing need_flush whenever remote tlbs are flushed
> - flushing remote tlbs on an mmu_notifier call when need_flush is set
On Sun, Mar 15, 2009 at 06:19:58PM +0200, Avi Kivity wrote:
> I mentioned this:
>
>> I think we can fix this without taking the hit of the IPI by:
>> - running a local invlpg()
>> - making need_flush a vm flag instead of a local
>> - clearing need_flush whenever remote tlbs are flushed
>> - flushing remote tlbs on an mmu_notifier call when need_flush is set

Ah, so this was a proposed fix for this bug. I thought you were talking
about different bugs and that you hadn't acknowledged this as a bug,
sorry!

About need_flush: it could also become a per-vcpu bit, cleared at every
exit, so perhaps we'd never have to flush; but it would need to stay in
the vcpu structure to avoid cacheline bouncing.
Andrea Arcangeli wrote:
> Ah, so this was a proposed fix for this bug. I thought you were
> talking about different bugs and that you hadn't acknowledged this as
> a bug, sorry!

If ignoring bugs could make them go away...

> About need_flush: it could also become a per-vcpu bit, cleared at
> every exit, so perhaps we'd never have to flush; but it would need to
> stay in the vcpu structure to avoid cacheline bouncing.

But then we need to set it for all vcpus on every invlpg. I'm assuming
invlpg is much more frequent than mmu notifier calls, so it's better to
keep it global. We've already taken a shared cacheline when we acquired
mmu_lock anyway.

btw, it's probably better to apply your patch first, then adapt it to
the non-IPIing version; your patch is more suitable for -stable.
On Sun, Mar 15, 2009 at 06:35:02PM +0200, Avi Kivity wrote:
> Andrea Arcangeli wrote:
>> Ah, so this was a proposed fix for this bug. I thought you were
>> talking about different bugs and that you hadn't acknowledged this
>> as a bug, sorry!
>
> If ignoring bugs could make them go away...

;)

>> About need_flush: it could also become a per-vcpu bit, cleared at
>> every exit, so perhaps we'd never have to flush; but it would need
>> to stay in the vcpu structure to avoid cacheline bouncing.
>
> But then we need to set it for all vcpus on every invlpg. I'm
> assuming invlpg is much more frequent than mmu notifier calls, so
> it's better to keep it global.
>
> We've already taken a shared cacheline when we acquired mmu_lock.

Ok.

> btw, it's probably better to apply your patch first, then adapt it to
> the non-IPIing version; your patch is more suitable for -stable.

It's up to you; I guess it'll make life easier with the compat code ;).
On Thu, Mar 12, 2009 at 06:18:43PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> While looking at invlpg out of sync code with Izik I think I noticed a
> missing smp tlb flush here. Without this the other cpu can still write
> to a freed host physical page. The smp tlb flush must always happen
> before mmu_lock is released if rmap_remove is called, because the VM
> will take the mmu_lock before it can finally add the page to the
> freelist after swapout. The mmu notifier makes it safe to flush the
> tlb after freeing the page (otherwise it would never be safe), so we
> can do a single flush for multiple invalidated sptes.

I think this fix is more expensive than it needs to be, but better than
being unsafe for now.

Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Marcelo Tosatti wrote:
> On Thu, Mar 12, 2009 at 06:18:43PM +0100, Andrea Arcangeli wrote:
>
>> From: Andrea Arcangeli <aarcange@redhat.com>
>>
>> While looking at invlpg out of sync code with Izik I think I noticed
>> a missing smp tlb flush here. Without this the other cpu can still
>> write to a freed host physical page. The smp tlb flush must always
>> happen before mmu_lock is released if rmap_remove is called, because
>> the VM will take the mmu_lock before it can finally add the page to
>> the freelist after swapout. The mmu notifier makes it safe to flush
>> the tlb after freeing the page (otherwise it would never be safe),
>> so we can do a single flush for multiple invalidated sptes.
>
> I think this fix is more expensive than it needs to be, but better
> than being unsafe for now.
>
> Acked-by: Marcelo Tosatti <mtosatti@redhat.com>

What about this, inside mmu_set_spte():

	} else if (pfn != spte_to_pfn(*shadow_pte)) {
		pgprintk("hfn old %lx new %lx\n",
			 spte_to_pfn(*shadow_pte), pfn);
		rmap_remove(vcpu->kvm, shadow_pte);
	} else

Doesn't this require a tlb flush for all the cpus as well?
Andrea Arcangeli wrote:
> It's up to you, I guess it'll make life easier with the compat code ;).
I applied it, looking now at reducing the cost.
On Sun, Mar 15, 2009 at 10:11:29PM +0200, Izik Eidus wrote:
> Marcelo Tosatti wrote:
>> On Thu, Mar 12, 2009 at 06:18:43PM +0100, Andrea Arcangeli wrote:
>>
>>> From: Andrea Arcangeli <aarcange@redhat.com>
>>>
>>> While looking at invlpg out of sync code with Izik I think I noticed
>>> a missing smp tlb flush here. Without this the other cpu can still
>>> write to a freed host physical page. The smp tlb flush must always
>>> happen before mmu_lock is released if rmap_remove is called, because
>>> the VM will take the mmu_lock before it can finally add the page to
>>> the freelist after swapout. The mmu notifier makes it safe to flush
>>> the tlb after freeing the page (otherwise it would never be safe),
>>> so we can do a single flush for multiple invalidated sptes.
>>
>> I think this fix is more expensive than it needs to be, but better
>> than being unsafe for now.
>>
>> Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
>
> What about this, inside mmu_set_spte():
>
> 	} else if (pfn != spte_to_pfn(*shadow_pte)) {
> 		pgprintk("hfn old %lx new %lx\n",
> 			 spte_to_pfn(*shadow_pte), pfn);
> 		rmap_remove(vcpu->kvm, shadow_pte);
> 	} else
>
> Doesn't this require a tlb flush for all the cpus as well?

Probably. This particular condition can only happen without mmu
notifiers, and when doing madvise(MADV_DONTNEED), though.
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index a0c11ea..855eb71 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -445,6 +445,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 	gpa_t pte_gpa = -1;
 	int level;
 	u64 *sptep;
+	int need_flush = 0;
 
 	spin_lock(&vcpu->kvm->mmu_lock);
 
@@ -464,6 +465,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 			rmap_remove(vcpu->kvm, sptep);
 			if (is_large_pte(*sptep))
 				--vcpu->kvm->stat.lpages;
+			need_flush = 1;
 		}
 		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
 		break;
@@ -473,6 +475,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 		break;
 	}
 
+	if (need_flush)
+		kvm_flush_remote_tlbs(vcpu->kvm);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
 	if (pte_gpa == -1)