| Message ID | 1410811885-17267-1-git-send-email-andreslc@google.com (mailing list archive) |
|---|---|
| State | New, archived |
On 15/09/2014 22:11, Andres Lagar-Cavilla wrote:
> +	if (!locked) {
> +		BUG_ON(npages != -EBUSY);

VM_BUG_ON perhaps?

> @@ -1177,9 +1210,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>  		npages = get_user_page_nowait(current, current->mm,
>  					      addr, write_fault, page);
>  		up_read(&current->mm->mmap_sem);
> -	} else
> -		npages = get_user_pages_fast(addr, 1, write_fault,
> -					     page);
> +	} else {
> +		/*
> +		 * By now we have tried gup_fast, and possible async_pf, and we
> +		 * are certainly not atomic. Time to retry the gup, allowing
> +		 * mmap semaphore to be relinquished in the case of IO.
> +		 */
> +		npages = kvm_get_user_page_retry(current, current->mm, addr,
> +						 write_fault, page);

This is a separate logical change. Was this:

  down_read(&mm->mmap_sem);
  npages = get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
  up_read(&mm->mmap_sem);

the intention rather than get_user_pages_fast?

I think a first patch should introduce kvm_get_user_page_retry ("Retry a
fault after a gup with FOLL_NOWAIT.") and the second would add FOLL_TRIED
("This properly relinquishes mmap semaphore if the filemap/swap has to
wait on page lock (and retries the gup to completion after that").

Apart from this, the patch looks good. The mm/ parts are minimal, so I
think it's best to merge it through the KVM tree with someone's Acked-by.

Paolo
On Tue, Sep 16, 2014 at 9:52 AM, Andres Lagar-Cavilla <andreslc@google.com> wrote:

Apologies to all. Resend as lists rejected my gmail-formatted version.
Now on plain text. Won't happen again.

> On Tue, Sep 16, 2014 at 6:51 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 15/09/2014 22:11, Andres Lagar-Cavilla wrote:
>> > +	if (!locked) {
>> > +		BUG_ON(npages != -EBUSY);
>>
>> VM_BUG_ON perhaps?
>
> Sure.
>
>> > @@ -1177,9 +1210,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>> >  		npages = get_user_page_nowait(current, current->mm,
>> >  					      addr, write_fault, page);
>> >  		up_read(&current->mm->mmap_sem);
>> > -	} else
>> > -		npages = get_user_pages_fast(addr, 1, write_fault,
>> > -					     page);
>> > +	} else {
>> > +		/*
>> > +		 * By now we have tried gup_fast, and possible async_pf, and we
>> > +		 * are certainly not atomic. Time to retry the gup, allowing
>> > +		 * mmap semaphore to be relinquished in the case of IO.
>> > +		 */
>> > +		npages = kvm_get_user_page_retry(current, current->mm, addr,
>> > +						 write_fault, page);
>>
>> This is a separate logical change. Was this:
>>
>>   down_read(&mm->mmap_sem);
>>   npages = get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
>>   up_read(&mm->mmap_sem);
>>
>> the intention rather than get_user_pages_fast?
>
> Nope. The intention was to pass FAULT_FLAG_RETRY to the vma fault handler
> (without _NOWAIT). And once you do that, if you come back without holding
> the mmap sem, you need to call yet again.
>
> By that point in the call chain I felt comfortable dropping the _fast. All
> paths that get there have already tried _fast (and some have tried _NOWAIT).
>
>> I think a first patch should introduce kvm_get_user_page_retry ("Retry a
>> fault after a gup with FOLL_NOWAIT.") and the second would add
>> FOLL_TRIED ("This properly relinquishes mmap semaphore if the
>> filemap/swap has to wait on page lock (and retries the gup to completion
>> after that").
>
> That's not what FOLL_TRIED does. The relinquishing of mmap semaphore is done
> by this patch minus the FOLL_TRIED bits. FOLL_TRIED will let the fault
> handler (e.g. filemap) know that we've been there and waited on the IO
> already, so in the common case we won't need to redo the IO.
>
> Have a look at how FAULT_FLAG_TRIED is used in e.g. arch/x86/mm/fault.c.
>
>> Apart from this, the patch looks good. The mm/ parts are minimal, so I
>> think it's best to merge it through the KVM tree with someone's Acked-by.
>
> Thanks!
> Andres
>
>> Paolo
>
> --
> Andres Lagar-Cavilla | Google Cloud Platform | andreslc@google.com |
> 647-778-4380
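For readers following Andres's pointer to arch/x86/mm/fault.c: the
FAULT_FLAG_TRIED pattern there retries the fault exactly once after
handle_mm_fault() has dropped mmap_sem to wait on IO. A simplified
sketch, paraphrased from the 3.16/3.17-era x86 fault handler, not the
verbatim kernel code:

	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

retry:
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, address);
	/* ... vma sanity checks elided ... */
	fault = handle_mm_fault(mm, vma, address, flags);

	if (unlikely(fault & VM_FAULT_RETRY)) {
		/* handle_mm_fault() dropped mmap_sem and started the IO */
		if (flags & FAULT_FLAG_ALLOW_RETRY) {
			flags &= ~FAULT_FLAG_ALLOW_RETRY;
			flags |= FAULT_FLAG_TRIED;	/* don't re-kick the IO */
			goto retry;
		}
	}
	up_read(&mm->mmap_sem);

The second pass can complete without rescheduling the IO because the
first pass already waited on it.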
On 16/09/2014 18:52, Andres Lagar-Cavilla wrote:
> Was this:
>
>   down_read(&mm->mmap_sem);
>   npages = get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
>   up_read(&mm->mmap_sem);
>
> the intention rather than get_user_pages_fast?

I meant the intention of the original author, not yours.

> By that point in the call chain I felt comfortable dropping the _fast.
> All paths that get there have already tried _fast (and some have tried
> _NOWAIT).

Yes, understood.

> I think a first patch should introduce kvm_get_user_page_retry ("Retry a
> fault after a gup with FOLL_NOWAIT.") and the second would add
> FOLL_TRIED ("This properly relinquishes mmap semaphore if the
> filemap/swap has to wait on page lock (and retries the gup to completion
> after that").
>
> That's not what FOLL_TRIED does. The relinquishing of mmap semaphore is
> done by this patch minus the FOLL_TRIED bits. FOLL_TRIED will let the
> fault handler (e.g. filemap) know that we've been there and waited on
> the IO already, so in the common case we won't need to redo the IO.

Yes, that's not what FOLL_TRIED does. But it's the difference between
get_user_pages and kvm_get_user_page_retry, right?

Paolo
On Tue, Sep 16, 2014 at 11:29 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 16/09/2014 18:52, Andres Lagar-Cavilla wrote:
>> Was this:
>>
>>   down_read(&mm->mmap_sem);
>>   npages = get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
>>   up_read(&mm->mmap_sem);
>>
>> the intention rather than get_user_pages_fast?
>
> I meant the intention of the original author, not yours.

Yes, in all likelihood. I hope!

>> By that point in the call chain I felt comfortable dropping the _fast.
>> All paths that get there have already tried _fast (and some have tried
>> _NOWAIT).
>
> Yes, understood.
>
>> I think a first patch should introduce kvm_get_user_page_retry ("Retry a
>> fault after a gup with FOLL_NOWAIT.") and the second would add
>> FOLL_TRIED ("This properly relinquishes mmap semaphore if the
>> filemap/swap has to wait on page lock (and retries the gup to completion
>> after that").
>>
>> That's not what FOLL_TRIED does. The relinquishing of mmap semaphore is
>> done by this patch minus the FOLL_TRIED bits. FOLL_TRIED will let the
>> fault handler (e.g. filemap) know that we've been there and waited on
>> the IO already, so in the common case we won't need to redo the IO.
>
> Yes, that's not what FOLL_TRIED does. But it's the difference between
> get_user_pages and kvm_get_user_page_retry, right?

Unfortunately get_user_pages does not expose the param (int *nonblocking)
that __gup will use to set FAULT_FLAG_ALLOW_RETRY. So that's one
difference. The second difference is that kvm_gup_retry will call two
times if necessary (the second without _RETRY but with _TRIED).

Thanks
Andres

> Paolo
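For reference, the asymmetry Andres describes is visible in the
3.17-era gup prototypes: only the internal worker takes the out-parameter
that enables FAULT_FLAG_ALLOW_RETRY. Signatures paraphrased from
include/linux/mm.h of that era; the comments are added here:

	/* Public wrapper: no way to ask for FAULT_FLAG_ALLOW_RETRY. */
	long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
			    unsigned long start, unsigned long nr_pages,
			    int write, int force, struct page **pages,
			    struct vm_area_struct **vmas);

	/*
	 * Internal worker: a non-NULL 'nonblocking' makes the fault path set
	 * FAULT_FLAG_ALLOW_RETRY; *nonblocking is cleared to 0 when mmap_sem
	 * was dropped, and the call returns -EBUSY if no page had been
	 * faulted in before that point.
	 */
	long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
			      unsigned long start, unsigned long nr_pages,
			      unsigned int gup_flags, struct page **pages,
			      struct vm_area_struct **vmas, int *nonblocking);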
2014-09-15 13:11-0700, Andres Lagar-Cavilla:
> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,

The suffix '_retry' is not best suited for this.
On first reading, I imagined we will be retrying something from before,
possibly calling it in a loop, but we are actually doing the first and
last try in one call.

Hard to find something that conveys our lock-dropping mechanic,
'_polite' is my best candidate at the moment.

> +	int flags = FOLL_TOUCH | FOLL_HWPOISON |

(FOLL_HWPOISON wasn't used before, but it's harmless.)

2014-09-16 15:51+0200, Paolo Bonzini:
> On 15/09/2014 22:11, Andres Lagar-Cavilla wrote:
> > @@ -1177,9 +1210,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> >  		npages = get_user_page_nowait(current, current->mm,
> >  					      addr, write_fault, page);
> >  		up_read(&current->mm->mmap_sem);
> > -	} else
> > -		npages = get_user_pages_fast(addr, 1, write_fault,
> > -					     page);
> > +	} else {
> > +		/*
> > +		 * By now we have tried gup_fast, and possible async_pf, and we
                                       ^
(If we really tried get_user_pages_fast, we wouldn't be here, so I'd
prepend two underscores here as well.)

> > +		 * are certainly not atomic. Time to retry the gup, allowing
> > +		 * mmap semaphore to be relinquished in the case of IO.
> > +		 */
> > +		npages = kvm_get_user_page_retry(current, current->mm, addr,
> > +						 write_fault, page);
>
> This is a separate logical change. Was this:
>
>   down_read(&mm->mmap_sem);
>   npages = get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
>   up_read(&mm->mmap_sem);
>
> the intention rather than get_user_pages_fast?

I believe so as well.

(Looking at get_user_pages_fast and __get_user_pages_fast made my
abstraction detector very sad.)

> I think a first patch should introduce kvm_get_user_page_retry ("Retry a
> fault after a gup with FOLL_NOWAIT.") and the second would add
> FOLL_TRIED ("This properly relinquishes mmap semaphore if the
> filemap/swap has to wait on page lock (and retries the gup to completion
> after that").

Not sure if that would help to understand the goal ...

> Apart from this, the patch looks good. The mm/ parts are minimal, so I
> think it's best to merge it through the KVM tree with someone's Acked-by.

I would prefer to have the last hunk in a separate patch, but still,

Acked-by: Radim Krčmář <rkrcmar@redhat.com>
On Tue, Sep 16, 2014 at 1:51 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
> 2014-09-15 13:11-0700, Andres Lagar-Cavilla:
>> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
>
> The suffix '_retry' is not best suited for this.
> On first reading, I imagined we will be retrying something from before,
> possibly calling it in a loop, but we are actually doing the first and
> last try in one call.

We are doing ... the second and third in most scenarios. async_pf did
the first with _NOWAIT. We call this from the async pf retrier, or if
async pf couldn't be notified to the guest.

> Hard to find something that conveys our lock-dropping mechanic,
> '_polite' is my best candidate at the moment.

I'm at a loss towards finding a better name than '_retry'.

>> +	int flags = FOLL_TOUCH | FOLL_HWPOISON |
>
> (FOLL_HWPOISON wasn't used before, but it's harmless.)

Ok. Wasn't 100% sure TBH.

> 2014-09-16 15:51+0200, Paolo Bonzini:
>> On 15/09/2014 22:11, Andres Lagar-Cavilla wrote:
>> > @@ -1177,9 +1210,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>> >  		npages = get_user_page_nowait(current, current->mm,
>> >  					      addr, write_fault, page);
>> >  		up_read(&current->mm->mmap_sem);
>> > -	} else
>> > -		npages = get_user_pages_fast(addr, 1, write_fault,
>> > -					     page);
>> > +	} else {
>> > +		/*
>> > +		 * By now we have tried gup_fast, and possible async_pf, and we
>                                        ^
> (If we really tried get_user_pages_fast, we wouldn't be here, so I'd
> prepend two underscores here as well.)

Yes, async pf tries and fails to do fast, and then we fallback to slow,
and so on.

>> > +		 * are certainly not atomic. Time to retry the gup, allowing
>> > +		 * mmap semaphore to be relinquished in the case of IO.
>> > +		 */
>> > +		npages = kvm_get_user_page_retry(current, current->mm, addr,
>> > +						 write_fault, page);
>>
>> This is a separate logical change. Was this:
>>
>>   down_read(&mm->mmap_sem);
>>   npages = get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
>>   up_read(&mm->mmap_sem);
>>
>> the intention rather than get_user_pages_fast?
>
> I believe so as well.
>
> (Looking at get_user_pages_fast and __get_user_pages_fast made my
> abstraction detector very sad.)

It's clunky, but a separate battle.

>> I think a first patch should introduce kvm_get_user_page_retry ("Retry a
>> fault after a gup with FOLL_NOWAIT.") and the second would add
>> FOLL_TRIED ("This properly relinquishes mmap semaphore if the
>> filemap/swap has to wait on page lock (and retries the gup to completion
>> after that").
>
> Not sure if that would help to understand the goal ...
>
>> Apart from this, the patch looks good. The mm/ parts are minimal, so I
>> think it's best to merge it through the KVM tree with someone's Acked-by.
>
> I would prefer to have the last hunk in a separate patch, but still,
>
> Acked-by: Radim Krčmář <rkrcmar@redhat.com>

Awesome, thanks much.

I'll recut with the VM_BUG_ON from Paolo and your Ack. LMK if anything
else from this email should go into the recut.

Andres
[Emergency posting to fix the tag and couldn't find unmangled Cc list,
so some recipients were dropped, sorry. (I guess you are glad though).]

2014-09-16 14:01-0700, Andres Lagar-Cavilla:
> On Tue, Sep 16, 2014 at 1:51 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
>> 2014-09-15 13:11-0700, Andres Lagar-Cavilla:
>>> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
>>
>> The suffix '_retry' is not best suited for this.
>> On first reading, I imagined we will be retrying something from before,
>> possibly calling it in a loop, but we are actually doing the first and
>> last try in one call.
>
> We are doing ... the second and third in most scenarios. async_pf did
> the first with _NOWAIT. We call this from the async pf retrier, or if
> async pf couldn't be notified to the guest.

I was thinking more about what the function does, not how we currently
use it -- nothing prevents us from using it as first somewhere -- but
yeah, even comments would be off then.

>>> Apart from this, the patch looks good. The mm/ parts are minimal, so I
>>> think it's best to merge it through the KVM tree with someone's
>>> Acked-by.
>>
>> I would prefer to have the last hunk in a separate patch, but still,
>>
>> Acked-by: Radim Krčmář <rkrcmar@redhat.com>
>
> Awesome, thanks much.
>
> I'll recut with the VM_BUG_ON from Paolo and your Ack. LMK if anything
> else from this email should go into the recut.

Ah, sorry, I'm not maintaining mm ... what I meant was

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>

and I had to leave before I could find a good apology for
VM_WARN_ON_ONCE(), so if you are replacing BUG_ON, you might want to
look at that one as well.
On Tue, Sep 16, 2014 at 3:34 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
> [Emergency posting to fix the tag and couldn't find unmangled Cc list,
> so some recipients were dropped, sorry. (I guess you are glad though).]
>
> 2014-09-16 14:01-0700, Andres Lagar-Cavilla:
>> On Tue, Sep 16, 2014 at 1:51 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
>>> 2014-09-15 13:11-0700, Andres Lagar-Cavilla:
>>>> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
>>>
>>> The suffix '_retry' is not best suited for this.
>>> On first reading, I imagined we will be retrying something from before,
>>> possibly calling it in a loop, but we are actually doing the first and
>>> last try in one call.
>>
>> We are doing ... the second and third in most scenarios. async_pf did
>> the first with _NOWAIT. We call this from the async pf retrier, or if
>> async pf couldn't be notified to the guest.
>
> I was thinking more about what the function does, not how we currently
> use it -- nothing prevents us from using it as first somewhere -- but
> yeah, even comments would be off then.

Good point. Happy to expand comments. What about _complete? _io? _full?

>>>> Apart from this, the patch looks good. The mm/ parts are minimal, so I
>>>> think it's best to merge it through the KVM tree with someone's
>>>> Acked-by.
>>>
>>> I would prefer to have the last hunk in a separate patch, but still,
>>>
>>> Acked-by: Radim Krčmář <rkrcmar@redhat.com>
>>
>> Awesome, thanks much.
>>
>> I'll recut with the VM_BUG_ON from Paolo and your Ack. LMK if anything
>> else from this email should go into the recut.
>
> Ah, sorry, I'm not maintaining mm ... what I meant was
>
> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>

Cool cool cool

Andres

> and I had to leave before I could find a good apology for
> VM_WARN_ON_ONCE(), so if you are replacing BUG_ON, you might want to
> look at that one as well.
On 16/09/2014 20:42, Andres Lagar-Cavilla wrote:
> On Tue, Sep 16, 2014 at 11:29 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> I think a first patch should introduce kvm_get_user_page_retry ("Retry a
>>> fault after a gup with FOLL_NOWAIT.") and the second would add
>>> FOLL_TRIED ("This properly relinquishes mmap semaphore if the
>>> filemap/swap has to wait on page lock (and retries the gup to completion
>>> after that").
>>>
>>> That's not what FOLL_TRIED does. The relinquishing of mmap semaphore is
>>> done by this patch minus the FOLL_TRIED bits. FOLL_TRIED will let the
>>> fault handler (e.g. filemap) know that we've been there and waited on
>>> the IO already, so in the common case we won't need to redo the IO.
>>
>> Yes, that's not what FOLL_TRIED does. But it's the difference between
>> get_user_pages and kvm_get_user_page_retry, right?
>
> Unfortunately get_user_pages does not expose the param (int
> *nonblocking) that __gup will use to set FAULT_FLAG_ALLOW_RETRY. So
> that's one difference. The second difference is that kvm_gup_retry
> will call two times if necessary (the second without _RETRY but with
> _TRIED).

Yeah, that's how it is in your patch. I can see that.

What I'm saying is that your patch is two changes in one:

1) do not use gup_fast in hva_to_pfn_slow, instead use gup as in
async_pf_execute. This change can already introduce a function called
kvm_get_user_page_retry, and can already use it in async_pf_execute and
hva_to_pfn_slow

2) introduce the two-phase RETRY + TRIED mechanism in
kvm_get_user_page_retry, so that the mmap semaphore is relinquished
properly if the filemap or swap has to wait on the page lock.

I would prefer to split it in two patches. Is it clearer now?

Paolo
On Mon, Sep 15, 2014 at 01:11:25PM -0700, Andres Lagar-Cavilla wrote:
> When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory has been
> swapped out or is behind a filemap, this will trigger async readahead and
> return immediately. The rationale is that KVM will kick back the guest with an
> "async page fault" and allow for some other guest process to take over.
>
> If async PFs are enabled the fault is retried asap from a workqueue, or
> immediately if no async PFs. The retry will not relinquish the mmap semaphore
> and will block on the IO. This is a bad thing, as other mmap semaphore users
> now stall. The fault could take a long time, depending on swap or filemap
> latency.
>
> This patch ensures both the regular and async PF path re-enter the fault
> allowing for the mmap semaphore to be relinquished in the case of IO wait.
>
> Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
> ---
>  include/linux/kvm_host.h |  9 +++++++++
>  include/linux/mm.h       |  1 +
>  mm/gup.c                 |  4 ++++
>  virt/kvm/async_pf.c      |  4 +---
>  virt/kvm/kvm_main.c      | 45 ++++++++++++++++++++++++++++++++++++++++++---
>  5 files changed, 57 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3addcbc..704908d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -198,6 +198,15 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
> +/*
> + * Retry a fault after a gup with FOLL_NOWAIT. This properly relinquishes mmap
> + * semaphore if the filemap/swap has to wait on page lock (and retries the gup
> + * to completion after that).
> + */
> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
> +			    unsigned long addr, bool write_fault,
> +			    struct page **pagep);
> +
>  enum {
>  	OUTSIDE_GUEST_MODE,
>  	IN_GUEST_MODE,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ebc5f90..13e585f7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2011,6 +2011,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
>  #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
>  #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
>  #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
> +#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
>
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>  			void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..332d1c3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
>  	if (*flags & FOLL_NOWAIT)
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> +	if (*flags & FOLL_TRIED) {
> +		WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> +		fault_flags |= FAULT_FLAG_TRIED;
> +	}
>
>  	ret = handle_mm_fault(mm, vma, address, fault_flags);
>  	if (ret & VM_FAULT_ERROR) {
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d6a3d09..17b78b1 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
>
>  	might_sleep();
>
> -	down_read(&mm->mmap_sem);
> -	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> -	up_read(&mm->mmap_sem);
> +	kvm_get_user_page_retry(NULL, mm, addr, 1, NULL);
>  	kvm_async_page_present_sync(vcpu, apf);
>
>  	spin_lock(&vcpu->async_pf.lock);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7ef6b48..43a9ab9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1115,6 +1115,39 @@ static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
>  	return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
>  }
>
> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
> +			    unsigned long addr, bool write_fault,
> +			    struct page **pagep)
> +{
> +	int npages;
> +	int locked = 1;
> +	int flags = FOLL_TOUCH | FOLL_HWPOISON |
> +		    (pagep ? FOLL_GET : 0) |
> +		    (write_fault ? FOLL_WRITE : 0);
> +
> +	/*
> +	 * Retrying fault, we get here *not* having allowed the filemap to wait
> +	 * on the page lock. We should now allow waiting on the IO with the
> +	 * mmap semaphore released.
> +	 */
> +	down_read(&mm->mmap_sem);
> +	npages = __get_user_pages(tsk, mm, addr, 1, flags, pagep, NULL,
> +				  &locked);
> +	if (!locked) {
> +		BUG_ON(npages != -EBUSY);
> +		/*
> +		 * The previous call has now waited on the IO. Now we can
> +		 * retry and complete. Pass TRIED to ensure we do not re
> +		 * schedule async IO (see e.g. filemap_fault).
> +		 */
> +		down_read(&mm->mmap_sem);
> +		npages = __get_user_pages(tsk, mm, addr, 1, flags | FOLL_TRIED,
> +					  pagep, NULL, NULL);

For async_pf_execute() you do not need to even retry. Next guest's page
fault will retry it for you.

> +	}
> +	up_read(&mm->mmap_sem);
> +	return npages;
> +}
> +
>  static inline int check_user_page_hwpoison(unsigned long addr)
>  {
>  	int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
> @@ -1177,9 +1210,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>  		npages = get_user_page_nowait(current, current->mm,
>  					      addr, write_fault, page);
>  		up_read(&current->mm->mmap_sem);
> -	} else
> -		npages = get_user_pages_fast(addr, 1, write_fault,
> -					     page);
> +	} else {
> +		/*
> +		 * By now we have tried gup_fast, and possible async_pf, and we
> +		 * are certainly not atomic. Time to retry the gup, allowing
> +		 * mmap semaphore to be relinquished in the case of IO.
> +		 */
> +		npages = kvm_get_user_page_retry(current, current->mm, addr,
> +						 write_fault, page);
> +	}
>  	if (npages != 1)
>  		return npages;
>
> --
> 2.1.0.rc2.206.gedb03e5

--
	Gleb.
2014-09-17 13:26+0300, Gleb Natapov:
> For async_pf_execute() you do not need to even retry. Next guest's page
> fault will retry it for you.

Wouldn't that be a waste of vmentries?
The guest might be able to handle interrupts while we are waiting, so
if we used async-io-done notifier, this could work without CPU spinning.
(Probably with added latency.)
On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
> 2014-09-17 13:26+0300, Gleb Natapov:
>> For async_pf_execute() you do not need to even retry. Next guest's page
>> fault will retry it for you.
>
> Wouldn't that be a waste of vmentries?

This is how it will work with or without this second gup. Page is not
mapped into a shadow page table on this path, it happens on a next fault.

--
	Gleb.
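To make Gleb's point concrete: in the workqueue path the gup output is
discarded; the call exists only to bring the page into memory, and the
shadow page table entry is built when the guest faults again. An
abridged sketch of async_pf_execute() with this patch applied, not the
verbatim source:

	static void async_pf_execute(struct work_struct *work)
	{
		struct kvm_async_pf *apf =
			container_of(work, struct kvm_async_pf, work);
		struct mm_struct *mm = apf->mm;
		struct kvm_vcpu *vcpu = apf->vcpu;
		unsigned long addr = apf->addr;

		might_sleep();

		/*
		 * NULL pagep: nothing is pinned and no page table is touched;
		 * the page is merely brought into the page/swap cache.
		 */
		kvm_get_user_page_retry(NULL, mm, addr, 1, NULL);
		kvm_async_page_present_sync(vcpu, apf);
		/* ... mark the async pf done and wake the vcpu; the guest's
		 * re-fault then maps the now-present page ... */
	}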
On Wed, Sep 17, 2014 at 12:43 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 16/09/2014 20:42, Andres Lagar-Cavilla wrote:
>> On Tue, Sep 16, 2014 at 11:29 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>> I think a first patch should introduce kvm_get_user_page_retry ("Retry a
>>>> fault after a gup with FOLL_NOWAIT.") and the second would add
>>>> FOLL_TRIED ("This properly relinquishes mmap semaphore if the
>>>> filemap/swap has to wait on page lock (and retries the gup to completion
>>>> after that").
>>>>
>>>> That's not what FOLL_TRIED does. The relinquishing of mmap semaphore is
>>>> done by this patch minus the FOLL_TRIED bits. FOLL_TRIED will let the
>>>> fault handler (e.g. filemap) know that we've been there and waited on
>>>> the IO already, so in the common case we won't need to redo the IO.
>>>
>>> Yes, that's not what FOLL_TRIED does. But it's the difference between
>>> get_user_pages and kvm_get_user_page_retry, right?
>>
>> Unfortunately get_user_pages does not expose the param (int
>> *nonblocking) that __gup will use to set FAULT_FLAG_ALLOW_RETRY. So
>> that's one difference. The second difference is that kvm_gup_retry
>> will call two times if necessary (the second without _RETRY but with
>> _TRIED).
>
> Yeah, that's how it is in your patch. I can see that.
>
> What I'm saying is that your patch is two changes in one:
>
> 1) do not use gup_fast in hva_to_pfn_slow, instead use gup as in
> async_pf_execute. This change can already introduce a function called
> kvm_get_user_page_retry, and can already use it in async_pf_execute and
> hva_to_pfn_slow
>
> 2) introduce the two-phase RETRY + TRIED mechanism in
> kvm_get_user_page_retry, so that the mmap semaphore is relinquished
> properly if the filemap or swap has to wait on the page lock.
>
> I would prefer to split it in two patches. Is it clearer now?

Understood. So in patch 1, would kvm_gup_retry be ... just a wrapper
around gup? That looks thin to me, and the naming of the function will
not be accurate. Plus, considering Radim's suggestion that the naming
is not optimal.

I can have patch 1 just s/gup_fast/gup (one liner), and then patch 2 do
the rest of the work.

Andres

> Paolo
On Wed, Sep 17, 2014 at 4:42 AM, Gleb Natapov <gleb@kernel.org> wrote:
> On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
>> 2014-09-17 13:26+0300, Gleb Natapov:
>>> For async_pf_execute() you do not need to even retry. Next guest's page
>>> fault will retry it for you.
>>
>> Wouldn't that be a waste of vmentries?
>
> This is how it will work with or without this second gup. Page is not
> mapped into a shadow page table on this path, it happens on a next fault.

The point is that the gup in the async pf completion from the work
queue will not relinquish the mmap semaphore. And it most definitely
should, given that we are likely looking at swap/filemap.

Andres

> --
> 	Gleb.
On Wed, Sep 17, 2014 at 10:00:32AM -0700, Andres Lagar-Cavilla wrote:
> On Wed, Sep 17, 2014 at 4:42 AM, Gleb Natapov <gleb@kernel.org> wrote:
>> On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
>>> 2014-09-17 13:26+0300, Gleb Natapov:
>>>> For async_pf_execute() you do not need to even retry. Next guest's page
>>>> fault will retry it for you.
>>>
>>> Wouldn't that be a waste of vmentries?
>>
>> This is how it will work with or without this second gup. Page is not
>> mapped into a shadow page table on this path, it happens on a next fault.
>
> The point is that the gup in the async pf completion from the work
> queue will not relinquish the mmap semaphore. And it most definitely
> should, given that we are likely looking at swap/filemap.

I get this point and the patch looks good in general, but my point is
that when _retry() is called from async_pf_execute() second gup is not
needed. In the original code gup is called to do IO and nothing else.
In your patch this is accomplished by the first gup already, so you
can skip second gup if pagep == nullptr.

--
	Gleb.
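Concretely, Gleb's suggestion is an early return before the second gup
when the caller passed no page pointer; it relies on the fault path
having already dropped mmap_sem when 'locked' is cleared. A sketch of
that shape inside kvm_get_user_page_retry(), not the committed code:

	down_read(&mm->mmap_sem);
	npages = __get_user_pages(tsk, mm, addr, 1, flags, pagep, NULL,
				  &locked);
	if (!locked) {
		/* mmap_sem was released by the fault path during the IO */
		VM_BUG_ON(npages != -EBUSY);
		if (!pagep)
			return 0;	/* async_pf caller: IO done, nothing to pin */
		down_read(&mm->mmap_sem);
		npages = __get_user_pages(tsk, mm, addr, 1, flags | FOLL_TRIED,
					  pagep, NULL, NULL);
	}
	up_read(&mm->mmap_sem);
	return npages;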
On Wed, Sep 17, 2014 at 10:08 AM, Gleb Natapov <gleb@kernel.org> wrote:
> On Wed, Sep 17, 2014 at 10:00:32AM -0700, Andres Lagar-Cavilla wrote:
>> On Wed, Sep 17, 2014 at 4:42 AM, Gleb Natapov <gleb@kernel.org> wrote:
>>> On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
>>>> 2014-09-17 13:26+0300, Gleb Natapov:
>>>>> For async_pf_execute() you do not need to even retry. Next guest's page
>>>>> fault will retry it for you.
>>>>
>>>> Wouldn't that be a waste of vmentries?
>>>
>>> This is how it will work with or without this second gup. Page is not
>>> mapped into a shadow page table on this path, it happens on a next fault.
>>
>> The point is that the gup in the async pf completion from the work
>> queue will not relinquish the mmap semaphore. And it most definitely
>> should, given that we are likely looking at swap/filemap.
>
> I get this point and the patch looks good in general, but my point is
> that when _retry() is called from async_pf_execute() second gup is not
> needed. In the original code gup is called to do IO and nothing else.
> In your patch this is accomplished by the first gup already, so you
> can skip second gup if pagep == nullptr.

I see. However, if this function were to be used elsewhere in the
future, then the "if pagep == NULL don't retry" semantics may not
match the new caller's intention. Would you prefer an explicit flag?

Andres

> --
> 	Gleb.
On Wed, Sep 17, 2014 at 10:13:45AM -0700, Andres Lagar-Cavilla wrote:
> On Wed, Sep 17, 2014 at 10:08 AM, Gleb Natapov <gleb@kernel.org> wrote:
>> On Wed, Sep 17, 2014 at 10:00:32AM -0700, Andres Lagar-Cavilla wrote:
>>> On Wed, Sep 17, 2014 at 4:42 AM, Gleb Natapov <gleb@kernel.org> wrote:
>>>> On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
>>>>> 2014-09-17 13:26+0300, Gleb Natapov:
>>>>>> For async_pf_execute() you do not need to even retry. Next guest's page
>>>>>> fault will retry it for you.
>>>>>
>>>>> Wouldn't that be a waste of vmentries?
>>>>
>>>> This is how it will work with or without this second gup. Page is not
>>>> mapped into a shadow page table on this path, it happens on a next fault.
>>>
>>> The point is that the gup in the async pf completion from the work
>>> queue will not relinquish the mmap semaphore. And it most definitely
>>> should, given that we are likely looking at swap/filemap.
>>
>> I get this point and the patch looks good in general, but my point is
>> that when _retry() is called from async_pf_execute() second gup is not
>> needed. In the original code gup is called to do IO and nothing else.
>> In your patch this is accomplished by the first gup already, so you
>> can skip second gup if pagep == nullptr.
>
> I see. However, if this function were to be used elsewhere in the
> future, then the "if pagep == NULL don't retry" semantics may not
> match the new caller's intention. Would you prefer an explicit flag?

We can add explicit flag whenever such caller will be added, if ever.

--
	Gleb.
On Wed, Sep 17, 2014 at 10:21 AM, Gleb Natapov <gleb@kernel.org> wrote:
> On Wed, Sep 17, 2014 at 10:13:45AM -0700, Andres Lagar-Cavilla wrote:
>> On Wed, Sep 17, 2014 at 10:08 AM, Gleb Natapov <gleb@kernel.org> wrote:
>>> On Wed, Sep 17, 2014 at 10:00:32AM -0700, Andres Lagar-Cavilla wrote:
>>>> On Wed, Sep 17, 2014 at 4:42 AM, Gleb Natapov <gleb@kernel.org> wrote:
>>>>> On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
>>>>>> 2014-09-17 13:26+0300, Gleb Natapov:
>>>>>>> For async_pf_execute() you do not need to even retry. Next guest's page
>>>>>>> fault will retry it for you.
>>>>>>
>>>>>> Wouldn't that be a waste of vmentries?
>>>>>
>>>>> This is how it will work with or without this second gup. Page is not
>>>>> mapped into a shadow page table on this path, it happens on a next fault.
>>>>
>>>> The point is that the gup in the async pf completion from the work
>>>> queue will not relinquish the mmap semaphore. And it most definitely
>>>> should, given that we are likely looking at swap/filemap.
>>>
>>> I get this point and the patch looks good in general, but my point is
>>> that when _retry() is called from async_pf_execute() second gup is not
>>> needed. In the original code gup is called to do IO and nothing else.
>>> In your patch this is accomplished by the first gup already, so you
>>> can skip second gup if pagep == nullptr.
>>
>> I see. However, if this function were to be used elsewhere in the
>> future, then the "if pagep == NULL don't retry" semantics may not
>> match the new caller's intention. Would you prefer an explicit flag?
>
> We can add explicit flag whenever such caller will be added, if ever.

Ok. Patch forthcoming.

Paolo, I'm not sure the split will buy anything, because the first patch
will be a one-liner (no point in the new kvm_gup_something function if
the impl won't match the intended semantics of the function). But if you
push back, I'll cut a v3.

Thanks all,
Andres

> --
> 	Gleb.
On 17/09/2014 18:58, Andres Lagar-Cavilla wrote:
> Understood. So in patch 1, would kvm_gup_retry be ... just a wrapper
> around gup? That looks thin to me, and the naming of the function will
> not be accurate.

Depends on how you interpret "retry" ("with retry" vs. "retry after
_fast"). :)

My point was more to make possible future bisection easier, but I'm not
going to insist. I'll queue the patch as soon as I get the required
Acked-by.

Paolo
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3addcbc..704908d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -198,6 +198,15 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
+/*
+ * Retry a fault after a gup with FOLL_NOWAIT. This properly relinquishes mmap
+ * semaphore if the filemap/swap has to wait on page lock (and retries the gup
+ * to completion after that).
+ */
+int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long addr, bool write_fault,
+			    struct page **pagep);
+
 enum {
 	OUTSIDE_GUEST_MODE,
 	IN_GUEST_MODE,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ebc5f90..13e585f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2011,6 +2011,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..332d1c3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
+	if (*flags & FOLL_TRIED) {
+		WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+		fault_flags |= FAULT_FLAG_TRIED;
+	}
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d6a3d09..17b78b1 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
 
 	might_sleep();
 
-	down_read(&mm->mmap_sem);
-	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
-	up_read(&mm->mmap_sem);
+	kvm_get_user_page_retry(NULL, mm, addr, 1, NULL);
 	kvm_async_page_present_sync(vcpu, apf);
 
 	spin_lock(&vcpu->async_pf.lock);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7ef6b48..43a9ab9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1115,6 +1115,39 @@ static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
 	return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
 }
 
+int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long addr, bool write_fault,
+			    struct page **pagep)
+{
+	int npages;
+	int locked = 1;
+	int flags = FOLL_TOUCH | FOLL_HWPOISON |
+		    (pagep ? FOLL_GET : 0) |
+		    (write_fault ? FOLL_WRITE : 0);
+
+	/*
+	 * Retrying fault, we get here *not* having allowed the filemap to wait
+	 * on the page lock. We should now allow waiting on the IO with the
+	 * mmap semaphore released.
+	 */
+	down_read(&mm->mmap_sem);
+	npages = __get_user_pages(tsk, mm, addr, 1, flags, pagep, NULL,
+				  &locked);
+	if (!locked) {
+		BUG_ON(npages != -EBUSY);
+		/*
+		 * The previous call has now waited on the IO. Now we can
+		 * retry and complete. Pass TRIED to ensure we do not re
+		 * schedule async IO (see e.g. filemap_fault).
+		 */
+		down_read(&mm->mmap_sem);
+		npages = __get_user_pages(tsk, mm, addr, 1, flags | FOLL_TRIED,
+					  pagep, NULL, NULL);
+	}
+	up_read(&mm->mmap_sem);
+	return npages;
+}
+
 static inline int check_user_page_hwpoison(unsigned long addr)
 {
 	int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
@@ -1177,9 +1210,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 		npages = get_user_page_nowait(current, current->mm,
 					      addr, write_fault, page);
 		up_read(&current->mm->mmap_sem);
-	} else
-		npages = get_user_pages_fast(addr, 1, write_fault,
-					     page);
+	} else {
+		/*
+		 * By now we have tried gup_fast, and possible async_pf, and we
+		 * are certainly not atomic. Time to retry the gup, allowing
+		 * mmap semaphore to be relinquished in the case of IO.
+		 */
+		npages = kvm_get_user_page_retry(current, current->mm, addr,
+						 write_fault, page);
+	}
 	if (npages != 1)
 		return npages;
When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory has been
swapped out or is behind a filemap, this will trigger async readahead and
return immediately. The rationale is that KVM will kick back the guest with an
"async page fault" and allow for some other guest process to take over.

If async PFs are enabled the fault is retried asap from a workqueue, or
immediately if no async PFs. The retry will not relinquish the mmap semaphore
and will block on the IO. This is a bad thing, as other mmap semaphore users
now stall. The fault could take a long time, depending on swap or filemap
latency.

This patch ensures both the regular and async PF path re-enter the fault
allowing for the mmap semaphore to be relinquished in the case of IO wait.

Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
---
 include/linux/kvm_host.h |  9 +++++++++
 include/linux/mm.h       |  1 +
 mm/gup.c                 |  4 ++++
 virt/kvm/async_pf.c      |  4 +---
 virt/kvm/kvm_main.c      | 45 ++++++++++++++++++++++++++++++++++++++++++---
 5 files changed, 57 insertions(+), 6 deletions(-)