Message ID | 1410976308-7683-1-git-send-email-andreslc@google.com (mailing list archive) |
---|---|
State | New, archived |
Hi Andres,

On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
[...]
> static inline int check_user_page_hwpoison(unsigned long addr)
> {
> 	int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
>@@ -1177,9 +1214,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> 		npages = get_user_page_nowait(current, current->mm,
> 					      addr, write_fault, page);
> 		up_read(&current->mm->mmap_sem);
>-	} else
>-		npages = get_user_pages_fast(addr, 1, write_fault,
>-					     page);
>+	} else {
>+		/*
>+		 * By now we have tried gup_fast, and possibly async_pf, and we
>+		 * are certainly not atomic. Time to retry the gup, allowing
>+		 * mmap semaphore to be relinquished in the case of IO.
>+		 */
>+		npages = kvm_get_user_page_io(current, current->mm, addr,
>+					      write_fault, page);
>+	}

try_async_pf
 gfn_to_pfn_async
  __gfn_to_pfn			async = false
   __gfn_to_pfn_memslot
    hva_to_pfn
     hva_to_pfn_fast
     hva_to_pfn_slow
      kvm_get_user_page_io

The page will always be ready after kvm_get_user_page_io, which means
the async page fault machinery no longer has any work to do.

Regards,
Wanpeng Li

> 	if (npages != 1)
> 		return npages;
On Thu, Sep 18, 2014 at 08:29:17AM +0800, Wanpeng Li wrote:
> Hi Andres,
> On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
> [...]
> > static inline int check_user_page_hwpoison(unsigned long addr)
> > {
> > 	int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
> >@@ -1177,9 +1214,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > 		npages = get_user_page_nowait(current, current->mm,
> > 					      addr, write_fault, page);
> > 		up_read(&current->mm->mmap_sem);
> >-	} else
> >-		npages = get_user_pages_fast(addr, 1, write_fault,
> >-					     page);
> >+	} else {
> >+		/*
> >+		 * By now we have tried gup_fast, and possibly async_pf, and we
> >+		 * are certainly not atomic. Time to retry the gup, allowing
> >+		 * mmap semaphore to be relinquished in the case of IO.
> >+		 */
> >+		npages = kvm_get_user_page_io(current, current->mm, addr,
> >+					      write_fault, page);
> >+	}
>
> try_async_pf
>  gfn_to_pfn_async
>   __gfn_to_pfn			async = false
*async = false

>    __gfn_to_pfn_memslot
>     hva_to_pfn
>      hva_to_pfn_fast
>      hva_to_pfn_slow
hva_to_pfn_slow checks async, not *async.

>       kvm_get_user_page_io
>
> The page will always be ready after kvm_get_user_page_io, which means
> the async page fault machinery no longer has any work to do.
>
> Regards,
> Wanpeng Li
>
> > 	if (npages != 1)
> > 		return npages;

--
	Gleb.
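To make Gleb's correction concrete, here is a minimal userspace model of the check in question (a sketch for illustration only, not the kernel source; the function name and return strings are invented):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Model of the branch in hva_to_pfn_slow(): it tests whether the caller
 * passed an async out-parameter at all (async != NULL), not the value
 * stored through it (*async). The async-PF path passes a non-NULL
 * pointer whose pointee starts out false, so it still takes the nowait
 * branch and never reaches kvm_get_user_page_io().
 */
static const char *hva_to_pfn_slow_model(bool *async)
{
	if (async)
		return "get_user_page_nowait";	/* async PF still works */
	return "kvm_get_user_page_io";		/* synchronous, may wait on IO */
}

int main(void)
{
	bool async = false;

	/* the pointer is non-NULL even though *async == false */
	printf("%s\n", hva_to_pfn_slow_model(&async));	/* nowait branch */
	printf("%s\n", hva_to_pfn_slow_model(NULL));	/* sync branch */
	return 0;
}
```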
On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
> When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
> has been swapped out or is behind a filemap, this will trigger async
> readahead and return immediately. The rationale is that KVM will kick
> back the guest with an "async page fault" and allow for some other
> guest process to take over.
>
> If async PFs are enabled the fault is retried asap from an async
> workqueue. If not, it's retried immediately in the same code path. In
> either case the retry will not relinquish the mmap semaphore and will
> block on the IO. This is a bad thing, as other mmap semaphore users
> now stall as a function of swap or filemap latency.
>
> This patch ensures both the regular and async PF path re-enter the
> fault allowing for the mmap semaphore to be relinquished in the case
> of IO wait.
>
> Reviewed-by: Gleb Natapov <gleb@kernel.org>
> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
> Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
>
> ---
> v1 -> v2
>
> * WARN_ON_ONCE -> VM_WARN_ON_ONCE
> * pagep == NULL skips the final retry
> * kvm_gup_retry -> kvm_gup_io
> * Comment updates throughout
> ---
>  include/linux/kvm_host.h | 11 +++++++++++
>  include/linux/mm.h       |  1 +
>  mm/gup.c                 |  4 ++++
>  virt/kvm/async_pf.c      |  4 +---
>  virt/kvm/kvm_main.c      | 49 +++++++++++++++++++++++++++++++++++++++++++++---
>  5 files changed, 63 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3addcbc..4c1991b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -198,6 +198,17 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
> +/*
> + * Carry out a gup that requires IO. Allow the mm to relinquish the mmap
> + * semaphore if the filemap/swap has to wait on a page lock. pagep == NULL
> + * controls whether we retry the gup one more time to completion in that case.
> + * Typically this is called after a FAULT_FLAG_RETRY_NOWAIT in the main tdp
> + * handler.
> + */
> +int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm,
> +			 unsigned long addr, bool write_fault,
> +			 struct page **pagep);
> +
>  enum {
>  	OUTSIDE_GUEST_MODE,
>  	IN_GUEST_MODE,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ebc5f90..13e585f7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2011,6 +2011,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
>  #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
>  #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
>  #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
> +#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
>
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>  			void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..af7ea3e 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
>  	if (*flags & FOLL_NOWAIT)
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> +	if (*flags & FOLL_TRIED) {
> +		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> +		fault_flags |= FAULT_FLAG_TRIED;
> +	}
>
>  	ret = handle_mm_fault(mm, vma, address, fault_flags);
>  	if (ret & VM_FAULT_ERROR) {
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d6a3d09..5ff7f7f 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
>
>  	might_sleep();
>
> -	down_read(&mm->mmap_sem);
> -	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> -	up_read(&mm->mmap_sem);
> +	kvm_get_user_page_io(NULL, mm, addr, 1, NULL);
>  	kvm_async_page_present_sync(vcpu, apf);
>
>  	spin_lock(&vcpu->async_pf.lock);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7ef6b48..fa8a565 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1115,6 +1115,43 @@ static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
>  	return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
>  }
>
> +int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm,
> +			 unsigned long addr, bool write_fault,
> +			 struct page **pagep)
> +{
> +	int npages;
> +	int locked = 1;
> +	int flags = FOLL_TOUCH | FOLL_HWPOISON |
> +		    (pagep ? FOLL_GET : 0) |
> +		    (write_fault ? FOLL_WRITE : 0);
> +
> +	/*
> +	 * If retrying the fault, we get here *not* having allowed the filemap
> +	 * to wait on the page lock. We should now allow waiting on the IO with
> +	 * the mmap semaphore released.
> +	 */
> +	down_read(&mm->mmap_sem);
> +	npages = __get_user_pages(tsk, mm, addr, 1, flags, pagep, NULL,
> +				  &locked);
> +	if (!locked) {
> +		VM_BUG_ON(npages != -EBUSY);
> +
> +		if (!pagep)
> +			return 0;
> +
> +		/*
> +		 * The previous call has now waited on the IO. Now we can
> +		 * retry and complete. Pass TRIED to ensure we do not
> +		 * re-schedule async IO (see e.g. filemap_fault).
> +		 */
> +		down_read(&mm->mmap_sem);
> +		npages = __get_user_pages(tsk, mm, addr, 1, flags | FOLL_TRIED,
> +					  pagep, NULL, NULL);
> +	}
> +	up_read(&mm->mmap_sem);
> +	return npages;
> +}
> +
>  static inline int check_user_page_hwpoison(unsigned long addr)
>  {
>  	int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
> @@ -1177,9 +1214,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>  		npages = get_user_page_nowait(current, current->mm,
>  					      addr, write_fault, page);
>  		up_read(&current->mm->mmap_sem);
> -	} else
> -		npages = get_user_pages_fast(addr, 1, write_fault,
> -					     page);
> +	} else {
> +		/*
> +		 * By now we have tried gup_fast, and possibly async_pf, and we
> +		 * are certainly not atomic. Time to retry the gup, allowing
> +		 * mmap semaphore to be relinquished in the case of IO.
> +		 */
> +		npages = kvm_get_user_page_io(current, current->mm, addr,
> +					      write_fault, page);
> +	}
>  	if (npages != 1)
>  		return npages;
>
> --
> 2.1.0.rc2.206.gedb03e5

--
	Gleb.
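For readers skimming the diff, the two __get_user_pages() passes in kvm_get_user_page_io() follow a simple protocol, modeled below (a userspace toy under assumed semantics; gup_once() and its arguments are invented stand-ins, and -EBUSY is approximated by its usual value):

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of the two-pass gup in kvm_get_user_page_io(). Pass 1 runs
 * with FAULT_FLAG_ALLOW_RETRY: if IO is needed, the fault handler drops
 * mmap_sem, waits on the IO, and reports back by clearing `locked`
 * (returning -EBUSY in the real code). Pass 2 sets FOLL_TRIED so
 * filemap_fault() knows the IO was already waited on and must not
 * schedule async readahead again.
 */
#define EBUSY_MODEL 16

static int gup_once(bool tried, bool page_resident, bool *locked)
{
	if (!page_resident && !tried) {
		*locked = false;	/* mmap_sem dropped, IO waited on */
		return -EBUSY_MODEL;
	}
	return 1;			/* one page pinned */
}

int main(void)
{
	bool locked = true;
	int npages = gup_once(false, false, &locked);	/* first pass */

	if (!locked) {
		/* re-take mmap_sem (elided) and retry to completion */
		locked = true;
		npages = gup_once(true, true, &locked);	/* FOLL_TRIED pass */
	}
	printf("npages = %d\n", npages);	/* prints: npages = 1 */
	return 0;
}
```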
On Thu, Sep 18, 2014 at 09:13:26AM +0300, Gleb Natapov wrote:
>On Thu, Sep 18, 2014 at 08:29:17AM +0800, Wanpeng Li wrote:
>> Hi Andres,
>> On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
>> [...]
>> > static inline int check_user_page_hwpoison(unsigned long addr)
>> > {
>> > 	int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
>> >@@ -1177,9 +1214,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>> > 		npages = get_user_page_nowait(current, current->mm,
>> > 					      addr, write_fault, page);
>> > 		up_read(&current->mm->mmap_sem);
>> >-	} else
>> >-		npages = get_user_pages_fast(addr, 1, write_fault,
>> >-					     page);
>> >+	} else {
>> >+		/*
>> >+		 * By now we have tried gup_fast, and possibly async_pf, and we
>> >+		 * are certainly not atomic. Time to retry the gup, allowing
>> >+		 * mmap semaphore to be relinquished in the case of IO.
>> >+		 */
>> >+		npages = kvm_get_user_page_io(current, current->mm, addr,
>> >+					      write_fault, page);
>> >+	}
>>
>> try_async_pf
>>  gfn_to_pfn_async
>>   __gfn_to_pfn			async = false
> *async = false
>
>>    __gfn_to_pfn_memslot
>>     hva_to_pfn
>>      hva_to_pfn_fast
>>      hva_to_pfn_slow
>hva_to_pfn_slow checks async, not *async.

Got it, thanks for pointing that out.

Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>

Regards,
Wanpeng Li

>
>>       kvm_get_user_page_io
>>
>> The page will always be ready after kvm_get_user_page_io, which means
>> the async page fault machinery no longer has any work to do.
>>
>> Regards,
>> Wanpeng Li
>>
>> > 	if (npages != 1)
>> > 		return npages;
>
>--
>	Gleb.
On Thu, Sep 18, 2014 at 5:32 PM, Wanpeng Li <wanpeng.li@linux.intel.com> wrote:
> On Thu, Sep 18, 2014 at 09:13:26AM +0300, Gleb Natapov wrote:
>>On Thu, Sep 18, 2014 at 08:29:17AM +0800, Wanpeng Li wrote:
>>> Hi Andres,
>>> On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
>>> [...]
>>> > static inline int check_user_page_hwpoison(unsigned long addr)
>>> > {
>>> > 	int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
>
> Got it, thanks for pointing that out.
>
> Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>
>
> Regards,
> Wanpeng Li

Thanks.

Paolo, should I recut including the recent Reviewed-by's?

Thanks,
Andres

ps: shrunk cc

>>
>>> kvm_get_user_page_io
>>>
>>> The page will always be ready after kvm_get_user_page_io, which means
>>> the async page fault machinery no longer has any work to do.
>>>
>>> Regards,
>>> Wanpeng Li
>>>
>>> > 	if (npages != 1)
>>> > 		return npages;
>>
>>--
>> 	Gleb.
On 19/09/2014 05:58, Andres Lagar-Cavilla wrote:
> Paolo, should I recut including the recent Reviewed-by's?
No, I'll add them myself.
Paolo
On Thu, Sep 18, 2014 at 11:08 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 19/09/2014 05:58, Andres Lagar-Cavilla wrote:
> > Paolo, should I recut including the recent Reviewed-by's?
>
> No, I'll add them myself.

Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough?

Thanks much,
Andres

>
> Paolo
On 22/09/2014 22:49, Andres Lagar-Cavilla wrote:
>>>> Paolo, should I recut including the recent Reviewed-by's?
>>>
>>> No, I'll add them myself.
>
> Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough?

It's waiting for an Acked-by on the mm/ changes.

Paolo
On Mon, 22 Sep 2014 23:32:36 +0200 Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 22/09/2014 22:49, Andres Lagar-Cavilla wrote:
> >>>> Paolo, should I recut including the recent Reviewed-by's?
> >>>
> >>> No, I'll add them myself.
> >
> > Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough?
>
> It's waiting for an Acked-by on the mm/ changes.

The MM changes look good to me.
Hi Andres,

On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
> +	if (!locked) {
> +		VM_BUG_ON(npages != -EBUSY);
> +

Shouldn't this be VM_BUG_ON(npages)?

Alternatively we could patch gup to do:

 	case -EHWPOISON:
+	case -EBUSY:
 		return i ? i : ret;
-	case -EBUSY:
-		return i;

I need to fix the gup_fast slow path to start with FAULT_FLAG_ALLOW_RETRY,
similarly to what you did to the KVM slow path.

gup_fast is called without the mmap_sem (incidentally, its whole point is
to only disable irqs and not take the locks), so enabling
FAULT_FLAG_ALLOW_RETRY in the initial pass inside gup_fast should be
entirely self-contained. It shouldn't concern KVM, which should already be
fine with your patch, but it will allow the userfaultfd to intercept all
O_DIRECT gup_fast users in addition to the KVM ones covered by your patch.

Eventually get_user_pages should be obsoleted in favor of
get_user_pages_locked (or whatever we decide to call it) so the
userfaultfd can intercept all kinds of gup. gup_locked is the same as gup
except for one more "locked" parameter at the end. I called the parameter
"locked" instead of "nonblocking" because it'd be more proper to call the
FOLL_NOWAIT kind of gup "nonblocking", and that is quite the opposite (in
fact the mmap_sem cannot be dropped at all in the nonblocking version).

ptrace, ironically, is better off sticking with a NULL locked parameter
and getting a SIGBUS instead of risking hanging on the userfaultfd (which
would be safe, as it can be killed, but it'd be annoying if gdb
erroneously poked into a hole during a debugging session). It's still
possible to pass NULL as the locked parameter to get_user_pages_locked to
achieve that. So the fact that some callers won't block in
handle_userfault, because FAULT_FLAG_ALLOW_RETRY is not set and the
userfault cannot block, may come in handy.

What I'm trying to solve in this context is that the userfault cannot
safely block without FAULT_FLAG_ALLOW_RETRY, because we can't allow
userland to take the mmap_sem for an unlimited amount of time without
requiring special privileges. So if handle_userfault wants to block
within a gup invocation, it must first release the mmap_sem; hence
FAULT_FLAG_ALLOW_RETRY is always required on the first attempt for any
virtual address.

With regard to the last sentence, there's actually a race with
MADV_DONTNEED too: I'd need to change the code to always pass
FAULT_FLAG_ALLOW_RETRY (your code would also need to loop, insisting with
the __get_user_pages(locked) version, to solve it). The KVM ioctl worst
case would get an -EFAULT if the race materializes, for example. It's not
concerning though, because it can be solved in userland by separating
ballooning and live migration activities.

Thanks,
Andrea
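For concreteness, one plausible shape for the get_user_pages_locked() variant Andrea sketches (hypothetical at this point in the thread; the exact name, signature, and parameter order are assumptions, and the caller below is an invented kernel-style fragment):

```c
/*
 * Hypothetical declaration of the "locked" gup variant discussed above.
 * The trailing int *locked reports whether mmap_sem is still held on
 * return; passing NULL opts out of the retry/relinquish behaviour
 * (the ptrace case).
 */
long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
			   unsigned long start, unsigned long nr_pages,
			   int write, int force, struct page **pages,
			   int *locked);

/* Sketch of the caller pattern: */
static int pin_one_page(struct task_struct *tsk, struct mm_struct *mm,
			unsigned long addr, int write, struct page **pages)
{
	int locked = 1;
	long npages;

	down_read(&mm->mmap_sem);
	npages = get_user_pages_locked(tsk, mm, addr, 1, write, 0, pages,
				       &locked);
	if (locked)
		up_read(&mm->mmap_sem);
	/* if !locked, gup already dropped mmap_sem while blocking on IO */
	return npages == 1 ? 0 : -EFAULT;
}
```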
On Thu, Sep 25, 2014 at 2:16 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi Andres,
>
> On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
>> +	if (!locked) {
>> +		VM_BUG_ON(npages != -EBUSY);
>> +
>
> Shouldn't this be VM_BUG_ON(npages)?

Oh shoot, you're right. I was confused by the introduction of -EBUSY in
the forward port:

	if (ret & VM_FAULT_RETRY) {
		if (nonblocking)
			*nonblocking = 0;
		return -EBUSY;
	}

(gaaah!!!)

> Alternatively we could patch gup to do:
>
>  	case -EHWPOISON:
> +	case -EBUSY:
>  		return i ? i : ret;
> -	case -EBUSY:
> -		return i;

No preference. There aren't a lot of semantics available, given that we
pass 1 as the page count to gup. Want to cut the patch, or should I shoot
one right away?

> I need to fix the gup_fast slow path to start with FAULT_FLAG_ALLOW_RETRY,
> similarly to what you did to the KVM slow path.
>
> gup_fast is called without the mmap_sem (incidentally, its whole point is
> to only disable irqs and not take the locks), so enabling
> FAULT_FLAG_ALLOW_RETRY in the initial pass inside gup_fast should be
> entirely self-contained. It shouldn't concern KVM, which should already be
> fine with your patch, but it will allow the userfaultfd to intercept all
> O_DIRECT gup_fast users in addition to the KVM ones covered by your patch.
>
> Eventually get_user_pages should be obsoleted in favor of
> get_user_pages_locked (or whatever we decide to call it) so the
> userfaultfd can intercept all kinds of gup. gup_locked is the same as gup
> except for one more "locked" parameter at the end. I called the parameter
> "locked" instead of "nonblocking" because it'd be more proper to call the
> FOLL_NOWAIT kind of gup "nonblocking", and that is quite the opposite (in
> fact the mmap_sem cannot be dropped at all in the nonblocking version).

It's nearly impossible to name it right, because 1) it indicates whether
the callee may relinquish the mmap semaphore and 2) it returns whether we
still hold the mmap semaphore. I'd prefer it to be called mmap_sem_hold,
which conveys immediately what this is about ("nonblocking" or "locked"
could be about a whole lot of things).

> ptrace, ironically, is better off sticking with a NULL locked parameter
> and getting a SIGBUS instead of risking hanging on the userfaultfd (which
> would be safe, as it can be killed, but it'd be annoying if gdb
> erroneously poked into a hole during a debugging session). It's still
> possible to pass NULL as the locked parameter to get_user_pages_locked to
> achieve that. So the fact that some callers won't block in
> handle_userfault, because FAULT_FLAG_ALLOW_RETRY is not set and the
> userfault cannot block, may come in handy.
>
> What I'm trying to solve in this context is that the userfault cannot
> safely block without FAULT_FLAG_ALLOW_RETRY, because we can't allow
> userland to take the mmap_sem for an unlimited amount of time without
> requiring special privileges. So if handle_userfault wants to block
> within a gup invocation, it must first release the mmap_sem; hence
> FAULT_FLAG_ALLOW_RETRY is always required on the first attempt for any
> virtual address.

I can see that. My background for coming into this is very similar: in a
previous life we had a file system shim that would kick up into userspace
for servicing VM memory. KVM just wouldn't let the file system give up
the mmap semaphore. We had /proc readers hanging up all over the place
while userspace was servicing faults. Not happy.

With KVM (now) and the standard x86 fault giving you ALLOW_RETRY, what
stands in your way? Methinks gup_fast has no slow-path fallback that
turns on ALLOW_RETRY. What would oppose that being the global behavior?

> With regard to the last sentence, there's actually a race with
> MADV_DONTNEED too: I'd need to change the code to always pass
> FAULT_FLAG_ALLOW_RETRY (your code would also need to loop, insisting with
> the __get_user_pages(locked) version, to solve it). The KVM ioctl worst
> case would get an -EFAULT if the race materializes, for example. It's not
> concerning though, because it can be solved in userland by separating
> ballooning and live migration activities.

Well, IIUC every code path that has ALLOW_RETRY dives in the second time
with FAULT_TRIED or similar. In the common case you happily blaze through
the second time, but if someone raced in while all the locks were given
up, you pay the price of the second attempt being a full fault that hogs
the mmap semaphore. At some point you need to stop being polite,
otherwise the task starves. Presumably the risk of needing an extra retry
drops steeply with each new gup attempt, so maybe just trying three times
is good enough. It makes for ugly logic, though.

Thanks,
Andres

> Thanks,
> Andrea
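A sketch of the bounded "stop being polite" retry Andres describes (hypothetical: the posted kvm_get_user_page_io() does a single FOLL_TRIED retry, and the wrapper name and bound here are invented):

```c
/*
 * Hypothetical wrapper illustrating the bounded retry discussed above.
 * If a racing MADV_DONTNEED (or similar) zaps the page between the
 * ALLOW_RETRY pass and the FOLL_TRIED pass, insist a few more times
 * before letting the caller surface the failure (e.g. as -EFAULT).
 */
#define KVM_GUP_IO_MAX_TRIES 3

static int kvm_gup_io_insist(struct task_struct *tsk, struct mm_struct *mm,
			     unsigned long addr, bool write_fault,
			     struct page **pagep)
{
	int tries, npages = 0;

	for (tries = 0; tries < KVM_GUP_IO_MAX_TRIES; tries++) {
		npages = kvm_get_user_page_io(tsk, mm, addr, write_fault,
					      pagep);
		if (npages == 1 || !pagep)
			return npages;	/* pinned, or pagep == NULL mode */
	}
	return npages;	/* still failing: give up, caller handles it */
}
```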
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3addcbc..4c1991b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -198,6 +198,17 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
+/*
+ * Carry out a gup that requires IO. Allow the mm to relinquish the mmap
+ * semaphore if the filemap/swap has to wait on a page lock. pagep == NULL
+ * controls whether we retry the gup one more time to completion in that case.
+ * Typically this is called after a FAULT_FLAG_RETRY_NOWAIT in the main tdp
+ * handler.
+ */
+int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm,
+			 unsigned long addr, bool write_fault,
+			 struct page **pagep);
+
 enum {
 	OUTSIDE_GUEST_MODE,
 	IN_GUEST_MODE,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ebc5f90..13e585f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2011,6 +2011,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..af7ea3e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
+	if (*flags & FOLL_TRIED) {
+		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+		fault_flags |= FAULT_FLAG_TRIED;
+	}
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d6a3d09..5ff7f7f 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
 
 	might_sleep();
 
-	down_read(&mm->mmap_sem);
-	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
-	up_read(&mm->mmap_sem);
+	kvm_get_user_page_io(NULL, mm, addr, 1, NULL);
 	kvm_async_page_present_sync(vcpu, apf);
 
 	spin_lock(&vcpu->async_pf.lock);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7ef6b48..fa8a565 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1115,6 +1115,43 @@ static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
 	return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
 }
 
+int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm,
+			 unsigned long addr, bool write_fault,
+			 struct page **pagep)
+{
+	int npages;
+	int locked = 1;
+	int flags = FOLL_TOUCH | FOLL_HWPOISON |
+		    (pagep ? FOLL_GET : 0) |
+		    (write_fault ? FOLL_WRITE : 0);
+
+	/*
+	 * If retrying the fault, we get here *not* having allowed the filemap
+	 * to wait on the page lock. We should now allow waiting on the IO with
+	 * the mmap semaphore released.
+	 */
+	down_read(&mm->mmap_sem);
+	npages = __get_user_pages(tsk, mm, addr, 1, flags, pagep, NULL,
+				  &locked);
+	if (!locked) {
+		VM_BUG_ON(npages != -EBUSY);
+
+		if (!pagep)
+			return 0;
+
+		/*
+		 * The previous call has now waited on the IO. Now we can
+		 * retry and complete. Pass TRIED to ensure we do not
+		 * re-schedule async IO (see e.g. filemap_fault).
+		 */
+		down_read(&mm->mmap_sem);
+		npages = __get_user_pages(tsk, mm, addr, 1, flags | FOLL_TRIED,
+					  pagep, NULL, NULL);
+	}
+	up_read(&mm->mmap_sem);
+	return npages;
+}
+
 static inline int check_user_page_hwpoison(unsigned long addr)
 {
 	int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE;
@@ -1177,9 +1214,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 		npages = get_user_page_nowait(current, current->mm,
 					      addr, write_fault, page);
 		up_read(&current->mm->mmap_sem);
-	} else
-		npages = get_user_pages_fast(addr, 1, write_fault,
-					     page);
+	} else {
+		/*
+		 * By now we have tried gup_fast, and possibly async_pf, and we
+		 * are certainly not atomic. Time to retry the gup, allowing
+		 * mmap semaphore to be relinquished in the case of IO.
+		 */
+		npages = kvm_get_user_page_io(current, current->mm, addr,
+					      write_fault, page);
+	}
 	if (npages != 1)
 		return npages;
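As a usage note on the helper above, a hypothetical wrapper showing how a caller interprets the return value (the wrapper itself is invented for illustration; the contract is taken from the patch):

```c
/*
 * Hypothetical caller of kvm_get_user_page_io() illustrating the return
 * contract: 1 means one page was pinned in *page; 0 can only happen in
 * the pagep == NULL mode (IO was waited on but no page pinned); negative
 * values are errors propagated from __get_user_pages().
 */
static bool try_pin_user_page(unsigned long addr, bool write_fault,
			      struct page **page)
{
	int npages = kvm_get_user_page_io(current, current->mm, addr,
					  write_fault, page);

	return npages == 1;
}
```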