[v2] kvm: Faults which trigger IO release the mmap_sem

Message ID 1410976308-7683-1-git-send-email-andreslc@google.com (mailing list archive)
State New, archived

Commit Message

Andres Lagar-Cavilla Sept. 17, 2014, 5:51 p.m. UTC
When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
has been swapped out or is behind a filemap, this will trigger async
readahead and return immediately. The rationale is that KVM will kick
back the guest with an "async page fault" and allow for some other
guest process to take over.

If async PFs are enabled the fault is retried asap from an async
workqueue. If not, it's retried immediately in the same code path. In
either case the retry will not relinquish the mmap semaphore and will
block on the IO. This is a bad thing, as other mmap semaphore users
now stall as a function of swap or filemap latency.
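
For reference, this is where the flag combinations diverge, paraphrasing
mm/filemap.c's __lock_page_or_retry() (killable details omitted; a sketch,
not the literal code):

	if (flags & FAULT_FLAG_ALLOW_RETRY) {
		if (flags & FAULT_FLAG_RETRY_NOWAIT)
			return 0;	/* VM_FAULT_RETRY; mmap_sem still held */
		up_read(&mm->mmap_sem);		/* drop the mmap_sem ... */
		wait_on_page_locked(page);	/* ... while the IO completes */
		return 0;	/* VM_FAULT_RETRY; mmap_sem already dropped */
	}
	__lock_page(page);	/* e.g. FAULT_FLAG_TRIED: block with mmap_sem held */
	return 1;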

This patch ensures that both the regular and async PF paths re-enter the
fault, allowing the mmap semaphore to be relinquished in the case of
IO wait.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>

---
v1 -> v2

* WARN_ON_ONCE -> VM_WARN_ON_ONCE
* pagep == NULL skips the final retry
* kvm_gup_retry -> kvm_gup_io
* Comment updates throughout
---
 include/linux/kvm_host.h | 11 +++++++++++
 include/linux/mm.h       |  1 +
 mm/gup.c                 |  4 ++++
 virt/kvm/async_pf.c      |  4 +---
 virt/kvm/kvm_main.c      | 49 +++++++++++++++++++++++++++++++++++++++++++++---
 5 files changed, 63 insertions(+), 6 deletions(-)

Comments

Wanpeng Li Sept. 18, 2014, 12:29 a.m. UTC | #1
Hi Andres,
On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
[...]
> static inline int check_user_page_hwpoison(unsigned long addr)
> {
> 	int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
>@@ -1177,9 +1214,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> 		npages = get_user_page_nowait(current, current->mm,
> 					      addr, write_fault, page);
> 		up_read(&current->mm->mmap_sem);
>-	} else
>-		npages = get_user_pages_fast(addr, 1, write_fault,
>-					     page);
>+	} else {
>+		/*
>+		 * By now we have tried gup_fast, and possibly async_pf, and we
>+		 * are certainly not atomic. Time to retry the gup, allowing
>+		 * mmap semaphore to be relinquished in the case of IO.
>+		 */
>+		npages = kvm_get_user_page_io(current, current->mm, addr,
>+					      write_fault, page);
>+	}

try_async_pf 
 gfn_to_pfn_async 
  __gfn_to_pfn  			async = false 
   __gfn_to_pfn_memslot
    hva_to_pfn 
	 hva_to_pfn_fast 
	 hva_to_pfn_slow 
	  kvm_get_user_page_io

page will always be ready after kvm_get_user_page_io, which means APF
no longer needs to do any work.

Regards,
Wanpeng Li

> 	if (npages != 1)
> 		return npages;
> 
>-- 
>2.1.0.rc2.206.gedb03e5
>
Gleb Natapov Sept. 18, 2014, 6:13 a.m. UTC | #2
On Thu, Sep 18, 2014 at 08:29:17AM +0800, Wanpeng Li wrote:
> Hi Andres,
> On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
> [...]
> > static inline int check_user_page_hwpoison(unsigned long addr)
> > {
> > 	int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
> >@@ -1177,9 +1214,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > 		npages = get_user_page_nowait(current, current->mm,
> > 					      addr, write_fault, page);
> > 		up_read(&current->mm->mmap_sem);
> >-	} else
> >-		npages = get_user_pages_fast(addr, 1, write_fault,
> >-					     page);
> >+	} else {
> >+		/*
> >+		 * By now we have tried gup_fast, and possibly async_pf, and we
> >+		 * are certainly not atomic. Time to retry the gup, allowing
> >+		 * mmap semaphore to be relinquished in the case of IO.
> >+		 */
> >+		npages = kvm_get_user_page_io(current, current->mm, addr,
> >+					      write_fault, page);
> >+	}
> 
> try_async_pf 
>  gfn_to_pfn_async 
>   __gfn_to_pfn  			async = false 
                                        *async = false

>    __gfn_to_pfn_memslot
>     hva_to_pfn 
> 	 hva_to_pfn_fast 
> 	 hva_to_pfn_slow 
hva_to_pfn_slow checks async, not *async.
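
I.e. the branch looks like this (a sketch following the hunk quoted above,
not the literal kvm_main.c code):

	if (async) {	/* tests the pointer, not the value it points to */
		down_read(&current->mm->mmap_sem);
		npages = get_user_page_nowait(current, current->mm,
					      addr, write_fault, page);
		up_read(&current->mm->mmap_sem);
	} else {
		npages = kvm_get_user_page_io(current, current->mm, addr,
					      write_fault, page);
	}

A caller that passes a non-NULL async pointer (the try_async_pf path)
still takes the nowait branch, so APF keeps working.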

> 	  kvm_get_user_page_io
> 
> page will always be ready after kvm_get_user_page_io, which means APF
> no longer needs to do any work.
> 
> Regards,
> Wanpeng Li
> 
> > 	if (npages != 1)
> > 		return npages;
> > 
> >-- 
> >2.1.0.rc2.206.gedb03e5
> >

--
			Gleb.
Gleb Natapov Sept. 18, 2014, 6:15 a.m. UTC | #3
On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
> When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
> has been swapped out or is behind a filemap, this will trigger async
> readahead and return immediately. The rationale is that KVM will kick
> back the guest with an "async page fault" and allow for some other
> guest process to take over.
> 
> If async PFs are enabled the fault is retried asap from an async
> workqueue. If not, it's retried immediately in the same code path. In
> either case the retry will not relinquish the mmap semaphore and will
> block on the IO. This is a bad thing, as other mmap semaphore users
> now stall as a function of swap or filemap latency.
> 
> This patch ensures that both the regular and async PF paths re-enter the
> fault, allowing the mmap semaphore to be relinquished in the case of
> IO wait.
> 
Reviewed-by: Gleb Natapov <gleb@kernel.org>

> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
> Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
> 
> ---
> v1 -> v2
> 
> * WARN_ON_ONCE -> VM_WARN_ON_ONCE
> * pagep == NULL skips the final retry
> * kvm_gup_retry -> kvm_gup_io
> * Comment updates throughout
> ---
>  include/linux/kvm_host.h | 11 +++++++++++
>  include/linux/mm.h       |  1 +
>  mm/gup.c                 |  4 ++++
>  virt/kvm/async_pf.c      |  4 +---
>  virt/kvm/kvm_main.c      | 49 +++++++++++++++++++++++++++++++++++++++++++++---
>  5 files changed, 63 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3addcbc..4c1991b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -198,6 +198,17 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>  
> +/*
> + * Carry out a gup that requires IO. Allow the mm to relinquish the mmap
> + * semaphore if the filemap/swap has to wait on a page lock. pagep == NULL
> + * controls whether we retry the gup one more time to completion in that case.
> + * Typically this is called after a FAULT_FLAG_RETRY_NOWAIT in the main tdp
> + * handler.
> + */
> +int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm,
> +			 unsigned long addr, bool write_fault,
> +			 struct page **pagep);
> +
>  enum {
>  	OUTSIDE_GUEST_MODE,
>  	IN_GUEST_MODE,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ebc5f90..13e585f7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2011,6 +2011,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
>  #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
>  #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
>  #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
> +#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
>  
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>  			void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..af7ea3e 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
>  	if (*flags & FOLL_NOWAIT)
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> +	if (*flags & FOLL_TRIED) {
> +		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> +		fault_flags |= FAULT_FLAG_TRIED;
> +	}
>  
>  	ret = handle_mm_fault(mm, vma, address, fault_flags);
>  	if (ret & VM_FAULT_ERROR) {
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d6a3d09..5ff7f7f 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
>  
>  	might_sleep();
>  
> -	down_read(&mm->mmap_sem);
> -	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> -	up_read(&mm->mmap_sem);
> +	kvm_get_user_page_io(NULL, mm, addr, 1, NULL);
>  	kvm_async_page_present_sync(vcpu, apf);
>  
>  	spin_lock(&vcpu->async_pf.lock);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7ef6b48..fa8a565 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1115,6 +1115,43 @@ static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
>  	return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
>  }
>  
> +int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm,
> +			 unsigned long addr, bool write_fault,
> +			 struct page **pagep)
> +{
> +	int npages;
> +	int locked = 1;
> +	int flags = FOLL_TOUCH | FOLL_HWPOISON |
> +		    (pagep ? FOLL_GET : 0) |
> +		    (write_fault ? FOLL_WRITE : 0);
> +
> +	/*
> +	 * If retrying the fault, we get here *not* having allowed the filemap
> +	 * to wait on the page lock. We should now allow waiting on the IO with
> +	 * the mmap semaphore released.
> +	 */
> +	down_read(&mm->mmap_sem);
> +	npages = __get_user_pages(tsk, mm, addr, 1, flags, pagep, NULL,
> +				  &locked);
> +	if (!locked) {
> +		VM_BUG_ON(npages != -EBUSY);
> +
> +		if (!pagep)
> +			return 0;
> +
> +		/*
> +		 * The previous call has now waited on the IO. Now we can
> +		 * retry and complete. Pass TRIED to ensure we do not re
> +		 * schedule async IO (see e.g. filemap_fault).
> +		 */
> +		down_read(&mm->mmap_sem);
> +		npages = __get_user_pages(tsk, mm, addr, 1, flags | FOLL_TRIED,
> +					  pagep, NULL, NULL);
> +	}
> +	up_read(&mm->mmap_sem);
> +	return npages;
> +}
> +
>  static inline int check_user_page_hwpoison(unsigned long addr)
>  {
>  	int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
> @@ -1177,9 +1214,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>  		npages = get_user_page_nowait(current, current->mm,
>  					      addr, write_fault, page);
>  		up_read(&current->mm->mmap_sem);
> -	} else
> -		npages = get_user_pages_fast(addr, 1, write_fault,
> -					     page);
> +	} else {
> +		/*
> +		 * By now we have tried gup_fast, and possibly async_pf, and we
> +		 * are certainly not atomic. Time to retry the gup, allowing
> +		 * mmap semaphore to be relinquished in the case of IO.
> +		 */
> +		npages = kvm_get_user_page_io(current, current->mm, addr,
> +					      write_fault, page);
> +	}
>  	if (npages != 1)
>  		return npages;
>  
> -- 
> 2.1.0.rc2.206.gedb03e5
> 

--
			Gleb.
Wanpeng Li Sept. 19, 2014, 12:32 a.m. UTC | #4
On Thu, Sep 18, 2014 at 09:13:26AM +0300, Gleb Natapov wrote:
>On Thu, Sep 18, 2014 at 08:29:17AM +0800, Wanpeng Li wrote:
>> Hi Andres,
>> On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
>> [...]
>> > static inline int check_user_page_hwpoison(unsigned long addr)
>> > {
>> > 	int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
>> >@@ -1177,9 +1214,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>> > 		npages = get_user_page_nowait(current, current->mm,
>> > 					      addr, write_fault, page);
>> > 		up_read(&current->mm->mmap_sem);
>> >-	} else
>> >-		npages = get_user_pages_fast(addr, 1, write_fault,
>> >-					     page);
>> >+	} else {
>> >+		/*
>> >+		 * By now we have tried gup_fast, and possibly async_pf, and we
>> >+		 * are certainly not atomic. Time to retry the gup, allowing
>> >+		 * mmap semaphore to be relinquished in the case of IO.
>> >+		 */
>> >+		npages = kvm_get_user_page_io(current, current->mm, addr,
>> >+					      write_fault, page);
>> >+	}
>> 
>> try_async_pf 
>>  gfn_to_pfn_async 
>>   __gfn_to_pfn  			async = false 
>                                        *async = false
>
>>    __gfn_to_pfn_memslot
>>     hva_to_pfn 
>> 	 hva_to_pfn_fast 
>> 	 hva_to_pfn_slow 
>hva_to_pfn_slow checks async, not *async.

Got it, thanks for pointing that out.

Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>

Regards,
Wanpeng Li 

>
>> 	  kvm_get_user_page_io
>> 
>> page will always be ready after kvm_get_user_page_io, which means APF
>> no longer needs to do any work.
>> 
>> Regards,
>> Wanpeng Li
>> 
>> > 	if (npages != 1)
>> > 		return npages;
>> > 
>> >-- 
>> >2.1.0.rc2.206.gedb03e5
>> >
>
>--
>			Gleb.
Andres Lagar-Cavilla Sept. 19, 2014, 3:58 a.m. UTC | #5
On Thu, Sep 18, 2014 at 5:32 PM, Wanpeng Li <wanpeng.li@linux.intel.com> wrote:
> On Thu, Sep 18, 2014 at 09:13:26AM +0300, Gleb Natapov wrote:
>>On Thu, Sep 18, 2014 at 08:29:17AM +0800, Wanpeng Li wrote:
>>> Hi Andres,
>>> On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
>>> [...]
>>> > static inline int check_user_page_hwpoison(unsigned long addr)
>>> > {
>>> >    int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
> Got it, thanks for pointing that out.
>
> Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>
>
> Regards,
> Wanpeng Li
>
Thanks.

Paolo, should I recut including the recent Reviewed-by's?

Thanks
Andres
ps: shrunk cc

>>
>>>        kvm_get_user_page_io
>>>
>>> page will always be ready after kvm_get_user_page_io, which means APF
>>> no longer needs to do any work.
>>>
>>> Regards,
>>> Wanpeng Li
>>>
>>> >    if (npages != 1)
>>> >            return npages;
>>> >
>>> >--
>>> >2.1.0.rc2.206.gedb03e5
>>> >
>>
>>--
>>                       Gleb.
Paolo Bonzini Sept. 19, 2014, 6:08 a.m. UTC | #6
On 19/09/2014 05:58, Andres Lagar-Cavilla wrote:
> Paolo, should I recut including the recent Reviewed-by's?

No, I'll add them myself.

Paolo
Andres Lagar-Cavilla Sept. 22, 2014, 8:49 p.m. UTC | #7
On Thu, Sep 18, 2014 at 11:08 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 19/09/2014 05:58, Andres Lagar-Cavilla wrote:
> > Paolo, should I recut including the recent Reviewed-by's?
>
> No, I'll add them myself.

Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough?

Thanks much
Andres

>
>
> Paolo
Paolo Bonzini Sept. 22, 2014, 9:32 p.m. UTC | #8
On 22/09/2014 22:49, Andres Lagar-Cavilla wrote:
>>> > > Paolo, should I recut including the recent Reviewed-by's?
>> >
>> > No, I'll add them myself.
> Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough?

It's waiting for an Acked-by on the mm/ changes.

Paolo
Andrew Morton Sept. 22, 2014, 9:53 p.m. UTC | #9
On Mon, 22 Sep 2014 23:32:36 +0200 Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 22/09/2014 22:49, Andres Lagar-Cavilla wrote:
> >>> > > Paolo, should I recut including the recent Reviewed-by's?
> >> >
> >> > No, I'll add them myself.
> > Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough?
> 
> It's waiting for an Acked-by on the mm/ changes.
> 

The MM changes look good to me.
Andrea Arcangeli Sept. 25, 2014, 9:16 p.m. UTC | #10
Hi Andres,

On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
> +	if (!locked) {
> +		VM_BUG_ON(npages != -EBUSY);
> +

Shouldn't this be VM_BUG_ON(npages)?

Alternatively we could patch gup to do:

			case -EHWPOISON:
+			case -EBUSY:
				return i ? i : ret;
-			case -EBUSY:
-				return i;

I need to fix the gup_fast slow path to start with FAULT_FLAG_ALLOW_RETRY,
similarly to what you did to the KVM slow path.

gup_fast is called without the mmap_sem (incidentally its whole point
is to only disable irqs and not take the locks), so enabling
FAULT_FLAG_ALLOW_RETRY for the initial pass inside gup_fast should be
entirely self-contained. It shouldn't concern KVM, which should already
be fine with your patch, but it will allow the userfaultfd to intercept
all O_DIRECT gup_fast in addition to the KVM ones covered by your patch.
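
Something along these lines (a rough sketch of the idea, not actual mm
code; partial-pin bookkeeping and error handling elided):

	int get_user_pages_fast(unsigned long start, int nr_pages, int write,
				struct page **pages)
	{
		int locked = 1;
		int nr = __get_user_pages_fast(start, nr_pages, write, pages);

		if (nr == nr_pages)
			return nr;

		/* Slow-path fallback, now allowed to drop mmap_sem on IO. */
		down_read(&current->mm->mmap_sem);
		nr = __get_user_pages(current, current->mm, start, nr_pages,
				      FOLL_TOUCH | FOLL_GET |
				      (write ? FOLL_WRITE : 0),
				      pages, NULL, &locked);
		if (locked)
			up_read(&current->mm->mmap_sem);
		return nr;
	}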

Eventually get_user_pages should be obsoleted in favor of
get_user_pages_locked (or whatever we decide to call it) so the
userfaultfd can intercept all kinds of gups. gup_locked is the same as
gup except for one more "locked" parameter at the end; I called the
parameter locked instead of nonblocking because "nonblocking" more
properly describes the FOLL_NOWAIT kind of gup, which is quite the
opposite (in fact the mmap_sem cannot be dropped in the nonblocking
version).
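
Roughly (a sketch of the proposed interface and calling convention;
this is the proposal in this thread, not merged code):

	long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
				   unsigned long start, unsigned long nr_pages,
				   int write, int force, struct page **pages,
				   int *locked);

	int locked = 1;

	down_read(&mm->mmap_sem);
	npages = get_user_pages_locked(tsk, mm, addr, 1, 1, 0, pages, &locked);
	if (locked)
		up_read(&mm->mmap_sem);	/* gup may have dropped it for us */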

ptrace ironically is better off sticking with a NULL locked parameter
and getting a SIGBUS instead of risking hanging on the userfaultfd
(which would be safe as it can be killed, but it'd be annoying if
erroneously poking into a hole during a gdb session). It's still
possible to pass NULL as the parameter to get_user_pages_locked to
achieve that. So the fact that some callers won't block in
handle_userfault, because FAULT_FLAG_ALLOW_RETRY is not set and the
userfault cannot block, may come in handy.

What I'm trying to solve in this context is that the userfault cannot
safely block without FAULT_FLAG_ALLOW_RETRY, because we can't allow
userland to take the mmap_sem for an unlimited amount of time without
requiring special privileges. So if handle_userfault wants to block
within a gup invocation, it must first release the mmap_sem, hence
FAULT_FLAG_ALLOW_RETRY is always required on the first attempt for any
virtual address.

With regard to the last sentence, there's actually a race with
MADV_DONTNEED too; I'd need to change the code to always pass
FAULT_FLAG_ALLOW_RETRY (your code would also need to loop, insisting
on the __get_user_pages(locked) version, to solve it). The KVM ioctl
worst case would get an -EFAULT if the race materializes, for example.
It's not concerning though, because that can be solved in userland
somehow by separating ballooning and live migration activities.

Thanks,
Andrea
Andres Lagar-Cavilla Sept. 25, 2014, 9:50 p.m. UTC | #11
On Thu, Sep 25, 2014 at 2:16 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi Andres,
>
> On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
>> +     if (!locked) {
>> +             VM_BUG_ON(npages != -EBUSY);
>> +
>
> Shouldn't this be VM_BUG_ON(npages)?

Oh shoot you're right. I was confused by the introduction of -EBUSY in
the forward port.

        if (ret & VM_FAULT_RETRY) {
                if (nonblocking)
                        *nonblocking = 0;
                return -EBUSY;
        }

(gaaah!!!)

>
> Alternatively we could patch gup to do:
>
>                         case -EHWPOISON:
> +                       case -EBUSY:
>                                 return i ? i : ret;
> -                       case -EBUSY:
> -                               return i;
>

No preference. There isn't a lot of semantic room given that we pass 1
as the count to gup. Want to cut the patch, or should I just shoot one
right away?

> I need to fix the gup_fast slow path to start with FAULT_FLAG_ALLOW_RETRY,
> similarly to what you did to the KVM slow path.
>
> gup_fast is called without the mmap_sem (incidentally its whole point
> is to only disable irqs and not take the locks), so enabling
> FAULT_FLAG_ALLOW_RETRY for the initial pass inside gup_fast should be
> entirely self-contained. It shouldn't concern KVM, which should already
> be fine with your patch, but it will allow the userfaultfd to intercept
> all O_DIRECT gup_fast in addition to the KVM ones covered by your patch.
>
> Eventually get_user_pages should be obsoleted in favor of
> get_user_pages_locked (or whatever we decide to call it) so the
> userfaultfd can intercept all kinds of gups. gup_locked is the same as
> gup except for one more "locked" parameter at the end; I called the
> parameter locked instead of nonblocking because "nonblocking" more
> properly describes the FOLL_NOWAIT kind of gup, which is quite the
> opposite (in fact the mmap_sem cannot be dropped in the nonblocking
> version).
>

It's nearly impossible to name it right, because 1) it indicates we can
relinquish the semaphore, and 2) it returns whether we still hold it.

I'd prefer it be called mmap_sem_hold, which conveys immediately what
this is about ("nonblocking" or "locked" could be about a whole lot of
things).

> ptrace ironically is better off sticking with a NULL locked parameter
> and getting a SIGBUS instead of risking hanging on the userfaultfd
> (which would be safe as it can be killed, but it'd be annoying if
> erroneously poking into a hole during a gdb session). It's still
> possible to pass NULL as the parameter to get_user_pages_locked to
> achieve that. So the fact that some callers won't block in
> handle_userfault, because FAULT_FLAG_ALLOW_RETRY is not set and the
> userfault cannot block, may come in handy.
>
> What I'm trying to solve in this context is that the userfault cannot
> safely block without FAULT_FLAG_ALLOW_RETRY, because we can't allow
> userland to take the mmap_sem for an unlimited amount of time without
> requiring special privileges. So if handle_userfault wants to block
> within a gup invocation, it must first release the mmap_sem, hence
> FAULT_FLAG_ALLOW_RETRY is always required on the first attempt for any
> virtual address.

I can see that. My background for coming into this is very similar: in
a previous life we had a file system shim that would kick up into
userspace for servicing VM memory. KVM just wouldn't let the file
system give up the mmap semaphore. We had /proc readers hanging up all
over the place while userspace was servicing. Not happy.

With KVM (now) and the standard x86 fault giving you ALLOW_RETRY, what
stands in your way? Methinks that gup_fast has no slowpath fallback
that turns on ALLOW_RETRY. What would oppose that being the global
behavior?

>
> With regard to the last sentence, there's actually a race with
> MADV_DONTNEED too; I'd need to change the code to always pass
> FAULT_FLAG_ALLOW_RETRY (your code would also need to loop, insisting
> on the __get_user_pages(locked) version, to solve it). The KVM ioctl
> worst case would get an -EFAULT if the race materializes, for example.
> It's not concerning though, because that can be solved in userland
> somehow by separating ballooning and live migration activities.

Well, IIUC every code path that has ALLOW_RETRY dives in the second
time with FAULT_FLAG_TRIED or similar. In the common case, you happily
blaze through the second time, but if someone raced in while all locks
were given up, you pay the price of the second attempt being a full
fault hogging the mmap sem. At some point you need to stop being polite,
otherwise the task starves. Presumably the risk of needing another retry
drops steeply with each new gup attempt. Maybe just trying three times
is good enough. It makes for ugly logic though.
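
Something like this, for illustration only (hypothetical code, not part
of the patch; error handling elided):

	int tries, npages = 0;

	for (tries = 0; tries < 3; tries++) {
		int locked = 1;

		down_read(&mm->mmap_sem);
		npages = __get_user_pages(tsk, mm, addr, 1, flags, pagep,
					  NULL, &locked);
		if (locked) {
			/* done (or failed) without dropping mmap_sem */
			up_read(&mm->mmap_sem);
			break;
		}
		/* mmap_sem was dropped while waiting on IO; retry the gup */
	}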

Thanks
Andres

>
> Thanks,
> Andrea

Patch

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3addcbc..4c1991b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -198,6 +198,17 @@  int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
+/*
+ * Carry out a gup that requires IO. Allow the mm to relinquish the mmap
+ * semaphore if the filemap/swap has to wait on a page lock. pagep == NULL
+ * controls whether we retry the gup one more time to completion in that case.
+ * Typically this is called after a FAULT_FLAG_RETRY_NOWAIT in the main tdp
+ * handler.
+ */
+int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm,
+			 unsigned long addr, bool write_fault,
+			 struct page **pagep);
+
 enum {
 	OUTSIDE_GUEST_MODE,
 	IN_GUEST_MODE,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ebc5f90..13e585f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2011,6 +2011,7 @@  static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..af7ea3e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -281,6 +281,10 @@  static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
+	if (*flags & FOLL_TRIED) {
+		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+		fault_flags |= FAULT_FLAG_TRIED;
+	}
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d6a3d09..5ff7f7f 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -80,9 +80,7 @@  static void async_pf_execute(struct work_struct *work)
 
 	might_sleep();
 
-	down_read(&mm->mmap_sem);
-	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
-	up_read(&mm->mmap_sem);
+	kvm_get_user_page_io(NULL, mm, addr, 1, NULL);
 	kvm_async_page_present_sync(vcpu, apf);
 
 	spin_lock(&vcpu->async_pf.lock);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7ef6b48..fa8a565 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1115,6 +1115,43 @@  static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
 	return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
 }
 
+int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm,
+			 unsigned long addr, bool write_fault,
+			 struct page **pagep)
+{
+	int npages;
+	int locked = 1;
+	int flags = FOLL_TOUCH | FOLL_HWPOISON |
+		    (pagep ? FOLL_GET : 0) |
+		    (write_fault ? FOLL_WRITE : 0);
+
+	/*
+	 * If retrying the fault, we get here *not* having allowed the filemap
+	 * to wait on the page lock. We should now allow waiting on the IO with
+	 * the mmap semaphore released.
+	 */
+	down_read(&mm->mmap_sem);
+	npages = __get_user_pages(tsk, mm, addr, 1, flags, pagep, NULL,
+				  &locked);
+	if (!locked) {
+		VM_BUG_ON(npages != -EBUSY);
+
+		if (!pagep)
+			return 0;
+
+		/*
+		 * The previous call has now waited on the IO. Now we can
+		 * retry and complete. Pass TRIED to ensure we do not re
+		 * schedule async IO (see e.g. filemap_fault).
+		 */
+		down_read(&mm->mmap_sem);
+		npages = __get_user_pages(tsk, mm, addr, 1, flags | FOLL_TRIED,
+					  pagep, NULL, NULL);
+	}
+	up_read(&mm->mmap_sem);
+	return npages;
+}
+
 static inline int check_user_page_hwpoison(unsigned long addr)
 {
 	int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
@@ -1177,9 +1214,15 @@  static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 		npages = get_user_page_nowait(current, current->mm,
 					      addr, write_fault, page);
 		up_read(&current->mm->mmap_sem);
-	} else
-		npages = get_user_pages_fast(addr, 1, write_fault,
-					     page);
+	} else {
+		/*
+		 * By now we have tried gup_fast, and possibly async_pf, and we
+		 * are certainly not atomic. Time to retry the gup, allowing
+		 * mmap semaphore to be relinquished in the case of IO.
+		 */
+		npages = kvm_get_user_page_io(current, current->mm, addr,
+					      write_fault, page);
+	}
 	if (npages != 1)
 		return npages;