Message ID: 159017506369.18853.17306023099999811263.stgit@gimli.home (mailing list archive)
State: New, archived
Series: vfio-pci: Block user access to disabled device MMIO
Hi, Alex,

On Fri, May 22, 2020 at 01:17:43PM -0600, Alex Williamson wrote:
> @@ -1346,15 +1526,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> {
> 	struct vm_area_struct *vma = vmf->vma;
> 	struct vfio_pci_device *vdev = vma->vm_private_data;
> +	vm_fault_t ret = VM_FAULT_NOPAGE;
> +
> +	mutex_lock(&vdev->vma_lock);
> +	down_read(&vdev->memory_lock);

I remember seeing the fault() handler deal with FAULT_FLAG_RETRY_NOWAIT at
least in the very first version, but it's not here any more... Could I ask
what the reason is behind that? I have probably missed something along the
way between versions; I'm just not sure whether e.g. this could potentially
block a GUP caller even if it's called with FOLL_NOWAIT.

Side note: another thing I thought about when reading this patch - there
seems to be quite some possibility that the VFIO_DEVICE_PCI_HOT_RESET ioctl
will start to return -EBUSY now. Not a problem for this series, but maybe we
should remember to let userspace handle -EBUSY properly as a follow-up too,
since QEMU does not seem to handle -EBUSY in the host reset path right now.

Thanks,
On Sat, 23 May 2020 15:34:17 -0400
Peter Xu <peterx@redhat.com> wrote:

> Hi, Alex,
>
> On Fri, May 22, 2020 at 01:17:43PM -0600, Alex Williamson wrote:
> > @@ -1346,15 +1526,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > {
> > 	struct vm_area_struct *vma = vmf->vma;
> > 	struct vfio_pci_device *vdev = vma->vm_private_data;
> > +	vm_fault_t ret = VM_FAULT_NOPAGE;
> > +
> > +	mutex_lock(&vdev->vma_lock);
> > +	down_read(&vdev->memory_lock);
>
> I remember seeing the fault() handler deal with FAULT_FLAG_RETRY_NOWAIT at
> least in the very first version, but it's not here any more... Could I ask
> what the reason is behind that? I have probably missed something along the
> way between versions; I'm just not sure whether e.g. this could potentially
> block a GUP caller even if it's called with FOLL_NOWAIT.

This is largely what v2 was about, from the cover letter:

  Locking in 3/ is substantially changed to avoid the retry scenario
  within the fault handler, therefore a caller who does not allow
  retry will no longer receive a SIGBUS on contention.

The discussion thread starts here:

https://lore.kernel.org/kvm/20200501234849.GQ26002@ziepe.ca/

Feel free to interject if there's something that doesn't make sense; the idea
is that since we've fixed the lock ordering we never need to release one lock
to wait for another, therefore we can wait for the lock. I'm under the
impression that we can wait for the lock regardless of the flags under these
conditions.

> Side note: another thing I thought about when reading this patch - there
> seems to be quite some possibility that the VFIO_DEVICE_PCI_HOT_RESET ioctl
> will start to return -EBUSY now. Not a problem for this series, but maybe
> we should remember to let userspace handle -EBUSY properly as a follow-up
> too, since QEMU does not seem to handle -EBUSY in the host reset path right
> now.

I think this has always been the case: whether it's the device lock or this
lock, the only way I know to avoid potential deadlock is to use the 'try'
locking semantics. In normal scenarios I expect access to sibling devices is
quiesced at this point, so a user driver actually wanting to achieve a reset
shouldn't be affected by this. Thanks,

Alex
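The locking order Alex refers to can be summed up in a short sketch below.
This is only an illustration distilled from the hunks quoted in this thread,
using the names introduced by the series; the real handler in patch 3/
differs in detail (for instance it drops vma_lock before remapping).

/*
 * Sketch of the ordering the fault path relies on.  The caller (the page
 * fault path or fixup_user_fault()) already holds mmap_sem, so the nesting
 * is mmap_sem -> vma_lock -> memory_lock.  Every other path in the series
 * takes the locks in the same order, so blocking here cannot deadlock and
 * no retry/SIGBUS result is needed for lock contention.
 */
static vm_fault_t vfio_pci_mmap_fault_sketch(struct vm_fault *vmf)
{
	struct vfio_pci_device *vdev = vmf->vma->vm_private_data;
	vm_fault_t ret = VM_FAULT_NOPAGE;

	mutex_lock(&vdev->vma_lock);	/* nests under caller's mmap_sem */
	down_read(&vdev->memory_lock);	/* nests under vma_lock */

	if (!__vfio_pci_memory_enabled(vdev))
		ret = VM_FAULT_SIGBUS;
	/* ... track the vma and remap_pfn_range() as in the patch ... */

	up_read(&vdev->memory_lock);
	mutex_unlock(&vdev->vma_lock);
	return ret;
}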
On Sat, May 23, 2020 at 05:06:02PM -0600, Alex Williamson wrote: > On Sat, 23 May 2020 15:34:17 -0400 > Peter Xu <peterx@redhat.com> wrote: > > > Hi, Alex, > > > > On Fri, May 22, 2020 at 01:17:43PM -0600, Alex Williamson wrote: > > > @@ -1346,15 +1526,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf) > > > { > > > struct vm_area_struct *vma = vmf->vma; > > > struct vfio_pci_device *vdev = vma->vm_private_data; > > > + vm_fault_t ret = VM_FAULT_NOPAGE; > > > + > > > + mutex_lock(&vdev->vma_lock); > > > + down_read(&vdev->memory_lock); > > > > I remembered to have seen the fault() handling FAULT_FLAG_RETRY_NOWAIT at least > > in the very first version, but it's not here any more... Could I ask what's > > the reason behind? I probably have missed something along with the versions, > > I'm just not sure whether e.g. this would potentially block a GUP caller even > > if it's with FOLL_NOWAIT. > > This is largely what v2 was about, from the cover letter: > > Locking in 3/ is substantially changed to avoid the retry scenario > within the fault handler, therefore a caller who does not allow > retry will no longer receive a SIGBUS on contention. > > The discussion thread starts here: > > https://lore.kernel.org/kvm/20200501234849.GQ26002@ziepe.ca/ [1] > > Feel free to interject if there's something that doesn't make sense, > the idea is that since we've fixed the lock ordering we never need to > release one lock to wait for another, therefore we can wait for the > lock. I'm under the impression that we can wait for the lock > regardless of the flags under these conditions. I see; thanks for the link. Sorry I should probably follow up the discussion and ask the question earlier, anyway... For what I understand now, IMHO we should still need all those handlings of FAULT_FLAG_RETRY_NOWAIT like in the initial version. E.g., IIUC KVM gup will try with FOLL_NOWAIT when async is allowed, before the complete slow path. I'm not sure what would be the side effect of that if fault() blocked it. E.g., the caller could be in an atomic context. But now I also agree that VM_FAULT_SIGBUS is probably not correct there in the initial version [1] - I thought it was OK initially (after all after the multiple fault retry series we should always be with FAULT_FLAG_ALLOW_RETRY..). However after some thinking... it should be the common slow path where retry is simply not allowed. So IMHO instead of SIGBUS there, we should also use all the slow path of the locks. That'll be safe then because it's never going to be with FAULT_FLAG_RETRY_NOWAIT (FAULT_FLAG_RETRY_NOWAIT depends on FAULT_FLAG_ALLOW_RETRY). A reference code could be __lock_page_or_retry() where the lock_page could wait just like we taking the sems/mutexes, and the previous SIGBUS case would corresponds to this chunk of __lock_page_or_retry(): } else { if (flags & FAULT_FLAG_KILLABLE) { int ret; ret = __lock_page_killable(page); if (ret) { up_read(&mm->mmap_sem); return 0; } } else __lock_page(page); return 1; } Thanks,
(CC Andrea too) On Sat, May 23, 2020 at 07:52:57PM -0400, Peter Xu wrote: > On Sat, May 23, 2020 at 05:06:02PM -0600, Alex Williamson wrote: > > On Sat, 23 May 2020 15:34:17 -0400 > > Peter Xu <peterx@redhat.com> wrote: > > > > > Hi, Alex, > > > > > > On Fri, May 22, 2020 at 01:17:43PM -0600, Alex Williamson wrote: > > > > @@ -1346,15 +1526,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf) > > > > { > > > > struct vm_area_struct *vma = vmf->vma; > > > > struct vfio_pci_device *vdev = vma->vm_private_data; > > > > + vm_fault_t ret = VM_FAULT_NOPAGE; > > > > + > > > > + mutex_lock(&vdev->vma_lock); > > > > + down_read(&vdev->memory_lock); > > > > > > I remembered to have seen the fault() handling FAULT_FLAG_RETRY_NOWAIT at least > > > in the very first version, but it's not here any more... Could I ask what's > > > the reason behind? I probably have missed something along with the versions, > > > I'm just not sure whether e.g. this would potentially block a GUP caller even > > > if it's with FOLL_NOWAIT. > > > > This is largely what v2 was about, from the cover letter: > > > > Locking in 3/ is substantially changed to avoid the retry scenario > > within the fault handler, therefore a caller who does not allow > > retry will no longer receive a SIGBUS on contention. > > > > The discussion thread starts here: > > > > https://lore.kernel.org/kvm/20200501234849.GQ26002@ziepe.ca/ > > [1] > > > > > Feel free to interject if there's something that doesn't make sense, > > the idea is that since we've fixed the lock ordering we never need to > > release one lock to wait for another, therefore we can wait for the > > lock. I'm under the impression that we can wait for the lock > > regardless of the flags under these conditions. > > I see; thanks for the link. Sorry I should probably follow up the discussion > and ask the question earlier, anyway... > > For what I understand now, IMHO we should still need all those handlings of > FAULT_FLAG_RETRY_NOWAIT like in the initial version. E.g., IIUC KVM gup will > try with FOLL_NOWAIT when async is allowed, before the complete slow path. I'm > not sure what would be the side effect of that if fault() blocked it. E.g., > the caller could be in an atomic context. > > But now I also agree that VM_FAULT_SIGBUS is probably not correct there in the > initial version [1] - I thought it was OK initially (after all after the > multiple fault retry series we should always be with FAULT_FLAG_ALLOW_RETRY..). > However after some thinking... it should be the common slow path where retry is > simply not allowed. So IMHO instead of SIGBUS there, we should also use all > the slow path of the locks. That'll be safe then because it's never going to > be with FAULT_FLAG_RETRY_NOWAIT (FAULT_FLAG_RETRY_NOWAIT depends on > FAULT_FLAG_ALLOW_RETRY). > > A reference code could be __lock_page_or_retry() where the lock_page could wait > just like we taking the sems/mutexes, and the previous SIGBUS case would > corresponds to this chunk of __lock_page_or_retry(): > > } else { > if (flags & FAULT_FLAG_KILLABLE) { > int ret; > > ret = __lock_page_killable(page); > if (ret) { > up_read(&mm->mmap_sem); > return 0; > } > } else > __lock_page(page); > return 1; > } > > Thanks, > > -- > Peter Xu
On Sat, May 23, 2020 at 07:52:57PM -0400, Peter Xu wrote:

> For what I understand now, IMHO we should still need all those handlings of
> FAULT_FLAG_RETRY_NOWAIT like in the initial version. E.g., IIUC KVM gup will
> try with FOLL_NOWAIT when async is allowed, before the complete slow path.
> I'm not sure what would be the side effect of that if fault() blocked it.
> E.g., the caller could be in an atomic context.

AFAICT FAULT_FLAG_RETRY_NOWAIT only impacts what happens when VM_FAULT_RETRY
is returned, which this doesn't do? It is not a generic 'do not sleep'.

Do you know different?

Jason
On Mon, May 25, 2020 at 09:26:07AM -0300, Jason Gunthorpe wrote:
> On Sat, May 23, 2020 at 07:52:57PM -0400, Peter Xu wrote:
>
> > For what I understand now, IMHO we should still need all those handlings of
> > FAULT_FLAG_RETRY_NOWAIT like in the initial version. E.g., IIUC KVM gup will
> > try with FOLL_NOWAIT when async is allowed, before the complete slow path.
> > I'm not sure what would be the side effect of that if fault() blocked it.
> > E.g., the caller could be in an atomic context.
>
> AFAICT FAULT_FLAG_RETRY_NOWAIT only impacts what happens when
> VM_FAULT_RETRY is returned, which this doesn't do?

Yes, that's why I think we should still properly return VM_FAULT_RETRY if
needed... because IMHO it is still possible that the caller calls with
FAULT_FLAG_RETRY_NOWAIT.

My understanding is that FAULT_FLAG_RETRY_NOWAIT mainly means:

  - We cannot release the mmap_sem, and,
  - We cannot sleep

But we're allowed to return VM_FAULT_RETRY if either of the above would be
necessary.

Thanks,
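In driver terms, one way to read the suggestion above is the sketch below.
This is hypothetical code, not part of the series; the function name and flow
are only for illustration of the semantics being discussed.

/*
 * Hypothetical NOWAIT-aware fault handler: if the caller asked not to wait,
 * try the locks and report VM_FAULT_RETRY on contention (keeping mmap_sem,
 * per FAULT_FLAG_RETRY_NOWAIT semantics); otherwise it is the common slow
 * path and sleeping on the locks is fine.
 */
static vm_fault_t nowait_aware_fault(struct vm_fault *vmf)
{
	struct vfio_pci_device *vdev = vmf->vma->vm_private_data;

	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
		if (!mutex_trylock(&vdev->vma_lock))
			return VM_FAULT_RETRY;
		if (!down_read_trylock(&vdev->memory_lock)) {
			mutex_unlock(&vdev->vma_lock);
			return VM_FAULT_RETRY;
		}
	} else {
		/* Common slow path: retry-with-wait is acceptable. */
		mutex_lock(&vdev->vma_lock);
		down_read(&vdev->memory_lock);
	}

	/* ... fault in the BAR mapping as usual ... */

	up_read(&vdev->memory_lock);
	mutex_unlock(&vdev->vma_lock);
	return VM_FAULT_NOPAGE;
}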
On Mon, May 25, 2020 at 10:28:06AM -0400, Peter Xu wrote:
> On Mon, May 25, 2020 at 09:26:07AM -0300, Jason Gunthorpe wrote:
> > On Sat, May 23, 2020 at 07:52:57PM -0400, Peter Xu wrote:
> >
> > > For what I understand now, IMHO we should still need all those handlings of
> > > FAULT_FLAG_RETRY_NOWAIT like in the initial version. E.g., IIUC KVM gup will
> > > try with FOLL_NOWAIT when async is allowed, before the complete slow path.
> > > I'm not sure what would be the side effect of that if fault() blocked it.
> > > E.g., the caller could be in an atomic context.
> >
> > AFAICT FAULT_FLAG_RETRY_NOWAIT only impacts what happens when
> > VM_FAULT_RETRY is returned, which this doesn't do?
>
> Yes, that's why I think we should still properly return VM_FAULT_RETRY if
> needed... because IMHO it is still possible that the caller calls with
> FAULT_FLAG_RETRY_NOWAIT.
>
> My understanding is that FAULT_FLAG_RETRY_NOWAIT mainly means:
>
>   - We cannot release the mmap_sem, and,
>   - We cannot sleep

Sleeping looks fine; look at any FS implementation of fault, say, xfs. The
first thing it does is xfs_ilock(), which does down_write().

I can't say when VM_FAULT_RETRY comes into play, but it is not so simple as
just sleeping..

Jason
On Mon, May 25, 2020 at 11:46:51AM -0300, Jason Gunthorpe wrote:
> On Mon, May 25, 2020 at 10:28:06AM -0400, Peter Xu wrote:
> > On Mon, May 25, 2020 at 09:26:07AM -0300, Jason Gunthorpe wrote:
> > > On Sat, May 23, 2020 at 07:52:57PM -0400, Peter Xu wrote:
> > >
> > > > For what I understand now, IMHO we should still need all those handlings of
> > > > FAULT_FLAG_RETRY_NOWAIT like in the initial version. E.g., IIUC KVM gup will
> > > > try with FOLL_NOWAIT when async is allowed, before the complete slow path.
> > > > I'm not sure what would be the side effect of that if fault() blocked it.
> > > > E.g., the caller could be in an atomic context.
> > >
> > > AFAICT FAULT_FLAG_RETRY_NOWAIT only impacts what happens when
> > > VM_FAULT_RETRY is returned, which this doesn't do?
> >
> > Yes, that's why I think we should still properly return VM_FAULT_RETRY if
> > needed... because IMHO it is still possible that the caller calls with
> > FAULT_FLAG_RETRY_NOWAIT.
> >
> > My understanding is that FAULT_FLAG_RETRY_NOWAIT mainly means:
> >
> >   - We cannot release the mmap_sem, and,
> >   - We cannot sleep
>
> Sleeping looks fine; look at any FS implementation of fault, say, xfs. The
> first thing it does is xfs_ilock(), which does down_write().

Yeah. My wild guess is that maybe fs code will always run without
FAULT_FLAG_RETRY_NOWAIT, so it's safe to sleep unconditionally (e.g., I think
the general #PF should be fine to sleep in fault(); gup should be special, but
I didn't observe any gup code called on file systems)? Or I must have missed
something important...

Thanks,
On Mon, May 25, 2020 at 11:11:42AM -0400, Peter Xu wrote: > On Mon, May 25, 2020 at 11:46:51AM -0300, Jason Gunthorpe wrote: > > On Mon, May 25, 2020 at 10:28:06AM -0400, Peter Xu wrote: > > > On Mon, May 25, 2020 at 09:26:07AM -0300, Jason Gunthorpe wrote: > > > > On Sat, May 23, 2020 at 07:52:57PM -0400, Peter Xu wrote: > > > > > > > > > For what I understand now, IMHO we should still need all those handlings of > > > > > FAULT_FLAG_RETRY_NOWAIT like in the initial version. E.g., IIUC KVM gup will > > > > > try with FOLL_NOWAIT when async is allowed, before the complete slow path. I'm > > > > > not sure what would be the side effect of that if fault() blocked it. E.g., > > > > > the caller could be in an atomic context. > > > > > > > > AFAICT FAULT_FLAG_RETRY_NOWAIT only impacts what happens when > > > > VM_FAULT_RETRY is returned, which this doesn't do? > > > > > > Yes, that's why I think we should still properly return VM_FAULT_RETRY if > > > needed.. because IMHO it is still possible that the caller calls with > > > FAULT_FLAG_RETRY_NOWAIT. > > > > > > My understanding is that FAULT_FLAG_RETRY_NOWAIT majorly means: > > > > > > - We cannot release the mmap_sem, and, > > > - We cannot sleep > > > > Sleeping looks fine, look at any FS implementation of fault, say, > > xfs. The first thing it does is xfs_ilock() which does down_write(). > > Yeah. My wild guess is that maybe fs code will always be without > FAULT_FLAG_RETRY_NOWAIT so it's safe to sleep unconditionally (e.g., I think > the general #PF should be fine to sleep in fault(); gup should be special, but > I didn't observe any gup code called upon file systems)? get_user_pages is called on filesystem backed pages. I have no idea what FAULT_FLAG_RETRY_NOWAIT is supposed to do. Maybe John was able to guess when he reworked that stuff? Jason
On 2020-05-25 09:56, Jason Gunthorpe wrote: > On Mon, May 25, 2020 at 11:11:42AM -0400, Peter Xu wrote: >> On Mon, May 25, 2020 at 11:46:51AM -0300, Jason Gunthorpe wrote: >>> On Mon, May 25, 2020 at 10:28:06AM -0400, Peter Xu wrote: >>>> On Mon, May 25, 2020 at 09:26:07AM -0300, Jason Gunthorpe wrote: >>>>> On Sat, May 23, 2020 at 07:52:57PM -0400, Peter Xu wrote: >>>>> >>>>>> For what I understand now, IMHO we should still need all those handlings of >>>>>> FAULT_FLAG_RETRY_NOWAIT like in the initial version. E.g., IIUC KVM gup will >>>>>> try with FOLL_NOWAIT when async is allowed, before the complete slow path. I'm >>>>>> not sure what would be the side effect of that if fault() blocked it. E.g., >>>>>> the caller could be in an atomic context. >>>>> >>>>> AFAICT FAULT_FLAG_RETRY_NOWAIT only impacts what happens when >>>>> VM_FAULT_RETRY is returned, which this doesn't do? >>>> >>>> Yes, that's why I think we should still properly return VM_FAULT_RETRY if >>>> needed.. because IMHO it is still possible that the caller calls with >>>> FAULT_FLAG_RETRY_NOWAIT. >>>> >>>> My understanding is that FAULT_FLAG_RETRY_NOWAIT majorly means: >>>> >>>> - We cannot release the mmap_sem, and, >>>> - We cannot sleep >>> >>> Sleeping looks fine, look at any FS implementation of fault, say, >>> xfs. The first thing it does is xfs_ilock() which does down_write(). >> >> Yeah. My wild guess is that maybe fs code will always be without >> FAULT_FLAG_RETRY_NOWAIT so it's safe to sleep unconditionally (e.g., I think >> the general #PF should be fine to sleep in fault(); gup should be special, but >> I didn't observe any gup code called upon file systems)? > > get_user_pages is called on filesystem backed pages. > > I have no idea what FAULT_FLAG_RETRY_NOWAIT is supposed to do. Maybe > John was able to guess when he reworked that stuff? > Although I didn't end up touching that particular area, I'm sure it's going to come up sometime soon, so I poked around just now, and found that FAULT_FLAG_RETRY_NOWAIT was added almost exactly 9 years ago. This flag was intended to make KVM and similar things behave better when doing GUP on file-backed pages that might, or might not be in memory. The idea is described in the changelog, but not in the code comments or Documentation, sigh: commit 318b275fbca1ab9ec0862de71420e0e92c3d1aa7 Author: Gleb Natapov <gleb@redhat.com> Date: Tue Mar 22 16:30:51 2011 -0700 mm: allow GUP to fail instead of waiting on a page GUP user may want to try to acquire a reference to a page if it is already in memory, but not if IO, to bring it in, is needed. For example KVM may tell vcpu to schedule another guest process if current one is trying to access swapped out page. Meanwhile, the page will be swapped in and the guest process, that depends on it, will be able to run again. This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY instead. If that helps, maybe documentation approximately like this might be welcome (against linux-next, so I'm using mmap_lock, instead of mmap_sem), below. Or is this overkill? 
People like minimal documentation in the code, so maybe this belongs in
Documentation, if anywhere:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8429d5aa31e44..e32e8e52a57ac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -430,6 +430,15 @@ extern pgprot_t protection_map[16];
  * continuous faults with flags (b). We should always try to detect pending
  * signals before a retry to make sure the continuous page faults can still be
  * interrupted if necessary.
+ *
+ * About @FAULT_FLAG_RETRY_NOWAIT: this is intended for callers who would like
+ * to acquire a page, but only if the page is already in memory. If, on the
+ * other hand, the page requires IO in order to bring it into memory, then fault
+ * handlers will immediately return VM_FAULT_RETRY ("don't wait"), while leaving
+ * mmap_lock held ("don't drop mmap_lock"). For example, this is useful for
+ * virtual machines that have multiple guests running: if guest A attempts
+ * get_user_pages() on a swapped out page, another guest can be scheduled while
+ * waiting for IO to swap in guest A's page.
  */
 #define FAULT_FLAG_WRITE 0x01
 #define FAULT_FLAG_MKWRITE 0x02

thanks,
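To make the intended caller concrete, here is a hedged sketch of the kind of
GUP user the flag was added for, loosely modelled on KVM's async page-fault
path. The helper name is made up, and the flags and get_user_pages_unlocked()
usage are from memory of kernels of that era, so details may differ.

/*
 * Try to grab a reference to the page backing 'addr' without waiting for
 * I/O.  FOLL_NOWAIT maps to FAULT_FLAG_RETRY_NOWAIT inside the fault path,
 * so if the page has to be swapped or read in, GUP returns early instead of
 * sleeping and the caller can do other work (e.g. run another vCPU task)
 * and retry later with a blocking GUP.
 */
static struct page *try_pin_resident_page(unsigned long addr, bool write)
{
	struct page *page;
	unsigned int gup_flags = FOLL_NOWAIT | (write ? FOLL_WRITE : 0);

	if (get_user_pages_unlocked(addr, 1, &page, gup_flags) == 1)
		return page;

	/* Likely needs I/O; caller retries later without FOLL_NOWAIT. */
	return NULL;
}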
On Mon, May 25, 2020 at 01:56:28PM -0700, John Hubbard wrote: > On 2020-05-25 09:56, Jason Gunthorpe wrote: > > On Mon, May 25, 2020 at 11:11:42AM -0400, Peter Xu wrote: > > > On Mon, May 25, 2020 at 11:46:51AM -0300, Jason Gunthorpe wrote: > > > > On Mon, May 25, 2020 at 10:28:06AM -0400, Peter Xu wrote: > > > > > On Mon, May 25, 2020 at 09:26:07AM -0300, Jason Gunthorpe wrote: > > > > > > On Sat, May 23, 2020 at 07:52:57PM -0400, Peter Xu wrote: > > > > > > > > > > > > > For what I understand now, IMHO we should still need all those handlings of > > > > > > > FAULT_FLAG_RETRY_NOWAIT like in the initial version. E.g., IIUC KVM gup will > > > > > > > try with FOLL_NOWAIT when async is allowed, before the complete slow path. I'm > > > > > > > not sure what would be the side effect of that if fault() blocked it. E.g., > > > > > > > the caller could be in an atomic context. > > > > > > > > > > > > AFAICT FAULT_FLAG_RETRY_NOWAIT only impacts what happens when > > > > > > VM_FAULT_RETRY is returned, which this doesn't do? > > > > > > > > > > Yes, that's why I think we should still properly return VM_FAULT_RETRY if > > > > > needed.. because IMHO it is still possible that the caller calls with > > > > > FAULT_FLAG_RETRY_NOWAIT. > > > > > > > > > > My understanding is that FAULT_FLAG_RETRY_NOWAIT majorly means: > > > > > > > > > > - We cannot release the mmap_sem, and, > > > > > - We cannot sleep > > > > > > > > Sleeping looks fine, look at any FS implementation of fault, say, > > > > xfs. The first thing it does is xfs_ilock() which does down_write(). > > > > > > Yeah. My wild guess is that maybe fs code will always be without > > > FAULT_FLAG_RETRY_NOWAIT so it's safe to sleep unconditionally (e.g., I think > > > the general #PF should be fine to sleep in fault(); gup should be special, but > > > I didn't observe any gup code called upon file systems)? > > > > get_user_pages is called on filesystem backed pages. > > > > I have no idea what FAULT_FLAG_RETRY_NOWAIT is supposed to do. Maybe > > John was able to guess when he reworked that stuff? > > > > Although I didn't end up touching that particular area, I'm sure it's going > to come up sometime soon, so I poked around just now, and found that > FAULT_FLAG_RETRY_NOWAIT was added almost exactly 9 years ago. This flag was > intended to make KVM and similar things behave better when doing GUP on > file-backed pages that might, or might not be in memory. > > The idea is described in the changelog, but not in the code comments or > Documentation, sigh: > > commit 318b275fbca1ab9ec0862de71420e0e92c3d1aa7 > Author: Gleb Natapov <gleb@redhat.com> > Date: Tue Mar 22 16:30:51 2011 -0700 > > mm: allow GUP to fail instead of waiting on a page > > GUP user may want to try to acquire a reference to a page if it is already > in memory, but not if IO, to bring it in, is needed. For example KVM may > tell vcpu to schedule another guest process if current one is trying to > access swapped out page. Meanwhile, the page will be swapped in and the > guest process, that depends on it, will be able to run again. > > This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and > FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in > conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that > it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY > instead. So, from kvm's perspective it was to avoid excessively long blocking in common paths when it could rejoin the completed IO by somehow waiting on a page itself? 
It all seems like it should not be used unless the page is going to go to IO?

Certainly there is no reason to optimize the fringe case of vfio sleeping if
there is an incorrect concurrent attempt to disable a BAR.

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8429d5aa31e44..e32e8e52a57ac 100644
> +++ b/include/linux/mm.h
> @@ -430,6 +430,15 @@ extern pgprot_t protection_map[16];
>   * continuous faults with flags (b). We should always try to detect pending
>   * signals before a retry to make sure the continuous page faults can still be
>   * interrupted if necessary.
> + *
> + * About @FAULT_FLAG_RETRY_NOWAIT: this is intended for callers who would like
> + * to acquire a page, but only if the page is already in memory. If, on the
> + * other hand, the page requires IO in order to bring it into memory, then fault
> + * handlers will immediately return VM_FAULT_RETRY ("don't wait"), while leaving
> + * mmap_lock held ("don't drop mmap_lock"). For example, this is useful for
> + * virtual machines that have multiple guests running: if guest A attempts
> + * get_user_pages() on a swapped out page, another guest can be scheduled while
> + * waiting for IO to swap in guest A's page.
>   */
>  #define FAULT_FLAG_WRITE 0x01
>  #define FAULT_FLAG_MKWRITE 0x02

It seems reasonable, but people might complain about the kvm specifics of the
explanation.

It might be better to explain how the caller is supposed to know when it is OK
to try GUP again and expect success, as it seems to me this is really about
externalizing the sleep for page wait?

Jason
On 2020-05-25 17:37, Jason Gunthorpe wrote: ... >> commit 318b275fbca1ab9ec0862de71420e0e92c3d1aa7 >> Author: Gleb Natapov <gleb@redhat.com> >> Date: Tue Mar 22 16:30:51 2011 -0700 >> >> mm: allow GUP to fail instead of waiting on a page >> >> GUP user may want to try to acquire a reference to a page if it is already >> in memory, but not if IO, to bring it in, is needed. For example KVM may >> tell vcpu to schedule another guest process if current one is trying to >> access swapped out page. Meanwhile, the page will be swapped in and the >> guest process, that depends on it, will be able to run again. >> >> This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and >> FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in >> conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that >> it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY >> instead. > > So, from kvm's perspective it was to avoid excessively long blocking in > common paths when it could rejoin the completed IO by somehow waiting > on a page itself? Or perhaps some variation on that, such as just retrying with an intervening schedule() call. It's not clear just from that commit. > > It all seems like it should not be used unless the page is going to go > to IO? That's my conclusion so far, yes. > > Certainly there is no reason to optimize the fringe case of vfio > sleeping if there is and incorrect concurrnent attempt to disable the > a BAR. Definitely agree with that position. > >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 8429d5aa31e44..e32e8e52a57ac 100644 >> +++ b/include/linux/mm.h >> @@ -430,6 +430,15 @@ extern pgprot_t protection_map[16]; >> * continuous faults with flags (b). We should always try to detect pending >> * signals before a retry to make sure the continuous page faults can still be >> * interrupted if necessary. >> + * >> + * About @FAULT_FLAG_RETRY_NOWAIT: this is intended for callers who would like >> + * to acquire a page, but only if the page is already in memory. If, on the >> + * other hand, the page requires IO in order to bring it into memory, then fault >> + * handlers will immediately return VM_FAULT_RETRY ("don't wait"), while leaving >> + * mmap_lock held ("don't drop mmap_lock"). For example, this is useful for >> + * virtual machines that have multiple guests running: if guest A attempts >> + * get_user_pages() on a swapped out page, another guest can be scheduled while >> + * waiting for IO to swap in guest A's page. >> */ >> #define FAULT_FLAG_WRITE 0x01 >> #define FAULT_FLAG_MKWRITE 0x02 > > It seems reasonable but people might complain about the kvm > specifics of the explanation. > > It might be better to explain how the caller is supposed to know when > it is OK to try GUP again and expect success, as it seems to me this > is really about externalizing the sleep for page wait? > OK, good point. The example was helpful in the commit log, but not quite appropriate in mm.h, yes. thanks,
On Mon, May 25, 2020 at 09:37:05PM -0300, Jason Gunthorpe wrote: > On Mon, May 25, 2020 at 01:56:28PM -0700, John Hubbard wrote: > > On 2020-05-25 09:56, Jason Gunthorpe wrote: > > > On Mon, May 25, 2020 at 11:11:42AM -0400, Peter Xu wrote: > > > > On Mon, May 25, 2020 at 11:46:51AM -0300, Jason Gunthorpe wrote: > > > > > On Mon, May 25, 2020 at 10:28:06AM -0400, Peter Xu wrote: > > > > > > On Mon, May 25, 2020 at 09:26:07AM -0300, Jason Gunthorpe wrote: > > > > > > > On Sat, May 23, 2020 at 07:52:57PM -0400, Peter Xu wrote: > > > > > > > > > > > > > > > For what I understand now, IMHO we should still need all those handlings of > > > > > > > > FAULT_FLAG_RETRY_NOWAIT like in the initial version. E.g., IIUC KVM gup will > > > > > > > > try with FOLL_NOWAIT when async is allowed, before the complete slow path. I'm > > > > > > > > not sure what would be the side effect of that if fault() blocked it. E.g., > > > > > > > > the caller could be in an atomic context. > > > > > > > > > > > > > > AFAICT FAULT_FLAG_RETRY_NOWAIT only impacts what happens when > > > > > > > VM_FAULT_RETRY is returned, which this doesn't do? > > > > > > > > > > > > Yes, that's why I think we should still properly return VM_FAULT_RETRY if > > > > > > needed.. because IMHO it is still possible that the caller calls with > > > > > > FAULT_FLAG_RETRY_NOWAIT. > > > > > > > > > > > > My understanding is that FAULT_FLAG_RETRY_NOWAIT majorly means: > > > > > > > > > > > > - We cannot release the mmap_sem, and, > > > > > > - We cannot sleep > > > > > > > > > > Sleeping looks fine, look at any FS implementation of fault, say, > > > > > xfs. The first thing it does is xfs_ilock() which does down_write(). > > > > > > > > Yeah. My wild guess is that maybe fs code will always be without > > > > FAULT_FLAG_RETRY_NOWAIT so it's safe to sleep unconditionally (e.g., I think > > > > the general #PF should be fine to sleep in fault(); gup should be special, but > > > > I didn't observe any gup code called upon file systems)? > > > > > > get_user_pages is called on filesystem backed pages. > > > > > > I have no idea what FAULT_FLAG_RETRY_NOWAIT is supposed to do. Maybe > > > John was able to guess when he reworked that stuff? > > > > > > > Although I didn't end up touching that particular area, I'm sure it's going > > to come up sometime soon, so I poked around just now, and found that > > FAULT_FLAG_RETRY_NOWAIT was added almost exactly 9 years ago. This flag was > > intended to make KVM and similar things behave better when doing GUP on > > file-backed pages that might, or might not be in memory. > > > > The idea is described in the changelog, but not in the code comments or > > Documentation, sigh: > > > > commit 318b275fbca1ab9ec0862de71420e0e92c3d1aa7 > > Author: Gleb Natapov <gleb@redhat.com> > > Date: Tue Mar 22 16:30:51 2011 -0700 > > > > mm: allow GUP to fail instead of waiting on a page > > > > GUP user may want to try to acquire a reference to a page if it is already > > in memory, but not if IO, to bring it in, is needed. For example KVM may > > tell vcpu to schedule another guest process if current one is trying to > > access swapped out page. Meanwhile, the page will be swapped in and the > > guest process, that depends on it, will be able to run again. > > > > This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and > > FOLL_NOWAIT follow_page flags. 
FAULT_FLAG_RETRY_NOWAIT, when used in > > conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that > > it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY > > instead. > > So, from kvm's perspective it was to avoid excessively long blocking in > common paths when it could rejoin the completed IO by somehow waiting > on a page itself? > > It all seems like it should not be used unless the page is going to go > to IO? I think NOWAIT is used as a common flag for kvm for its initial attempt to fault in a normal page, however... I just noticed another fact that actually __get_user_pages() won't work with PFNMAP (check_vma_flags should fail), but KVM just started to support fault() for PFNMAP from commit add6a0cd1c5b (2016) using fixup_user_fault(), where nvidia seems to have a similar request to have a fault handler on some mapped BARs. > > Certainly there is no reason to optimize the fringe case of vfio > sleeping if there is and incorrect concurrnent attempt to disable the > a BAR. If fixup_user_fault() (which is always with ALLOW_RETRY && !RETRY_NOWAIT) is the only path for the new fault(), then current way seems ok. Not sure whether this would worth a WARN_ON_ONCE(RETRY_NOWAIT) in the fault() to be clear of that fact. Thanks,
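For reference, the fixup_user_fault() path mentioned above looks roughly like
the following on the caller side. This is a paraphrase from memory of KVM's
hva_to_pfn_remapped() around commit add6a0cd1c5b; the helper name is made up
and the exact signature may differ between kernel versions. Since
fixup_user_fault() always allows retry and never sets FAULT_FLAG_RETRY_NOWAIT,
a vm_ops->fault implementation that sleeps on its locks is fine on this path.

/*
 * Sketch of faulting in a PFNMAP mapping via fixup_user_fault().  The
 * caller holds mmap_sem for read; 'unlocked' reports whether the fault
 * path had to drop and re-take mmap_sem, in which case any previously
 * looked-up vma must be revalidated.  FAULT_FLAG_RETRY_NOWAIT is never
 * set on this path.
 */
static int fault_in_pfnmap(unsigned long addr, bool write_fault, bool *unlocked)
{
	return fixup_user_fault(current, current->mm, addr,
				write_fault ? FAULT_FLAG_WRITE : 0,
				unlocked);
}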
On Tue, 26 May 2020 09:49:54 -0400 Peter Xu <peterx@redhat.com> wrote: > On Mon, May 25, 2020 at 09:37:05PM -0300, Jason Gunthorpe wrote: > > On Mon, May 25, 2020 at 01:56:28PM -0700, John Hubbard wrote: > > > On 2020-05-25 09:56, Jason Gunthorpe wrote: > > > > On Mon, May 25, 2020 at 11:11:42AM -0400, Peter Xu wrote: > > > > > On Mon, May 25, 2020 at 11:46:51AM -0300, Jason Gunthorpe wrote: > > > > > > On Mon, May 25, 2020 at 10:28:06AM -0400, Peter Xu wrote: > > > > > > > On Mon, May 25, 2020 at 09:26:07AM -0300, Jason Gunthorpe wrote: > > > > > > > > On Sat, May 23, 2020 at 07:52:57PM -0400, Peter Xu wrote: > > > > > > > > > > > > > > > > > For what I understand now, IMHO we should still need all those handlings of > > > > > > > > > FAULT_FLAG_RETRY_NOWAIT like in the initial version. E.g., IIUC KVM gup will > > > > > > > > > try with FOLL_NOWAIT when async is allowed, before the complete slow path. I'm > > > > > > > > > not sure what would be the side effect of that if fault() blocked it. E.g., > > > > > > > > > the caller could be in an atomic context. > > > > > > > > > > > > > > > > AFAICT FAULT_FLAG_RETRY_NOWAIT only impacts what happens when > > > > > > > > VM_FAULT_RETRY is returned, which this doesn't do? > > > > > > > > > > > > > > Yes, that's why I think we should still properly return VM_FAULT_RETRY if > > > > > > > needed.. because IMHO it is still possible that the caller calls with > > > > > > > FAULT_FLAG_RETRY_NOWAIT. > > > > > > > > > > > > > > My understanding is that FAULT_FLAG_RETRY_NOWAIT majorly means: > > > > > > > > > > > > > > - We cannot release the mmap_sem, and, > > > > > > > - We cannot sleep > > > > > > > > > > > > Sleeping looks fine, look at any FS implementation of fault, say, > > > > > > xfs. The first thing it does is xfs_ilock() which does down_write(). > > > > > > > > > > Yeah. My wild guess is that maybe fs code will always be without > > > > > FAULT_FLAG_RETRY_NOWAIT so it's safe to sleep unconditionally (e.g., I think > > > > > the general #PF should be fine to sleep in fault(); gup should be special, but > > > > > I didn't observe any gup code called upon file systems)? > > > > > > > > get_user_pages is called on filesystem backed pages. > > > > > > > > I have no idea what FAULT_FLAG_RETRY_NOWAIT is supposed to do. Maybe > > > > John was able to guess when he reworked that stuff? > > > > > > > > > > Although I didn't end up touching that particular area, I'm sure it's going > > > to come up sometime soon, so I poked around just now, and found that > > > FAULT_FLAG_RETRY_NOWAIT was added almost exactly 9 years ago. This flag was > > > intended to make KVM and similar things behave better when doing GUP on > > > file-backed pages that might, or might not be in memory. > > > > > > The idea is described in the changelog, but not in the code comments or > > > Documentation, sigh: > > > > > > commit 318b275fbca1ab9ec0862de71420e0e92c3d1aa7 > > > Author: Gleb Natapov <gleb@redhat.com> > > > Date: Tue Mar 22 16:30:51 2011 -0700 > > > > > > mm: allow GUP to fail instead of waiting on a page > > > > > > GUP user may want to try to acquire a reference to a page if it is already > > > in memory, but not if IO, to bring it in, is needed. For example KVM may > > > tell vcpu to schedule another guest process if current one is trying to > > > access swapped out page. Meanwhile, the page will be swapped in and the > > > guest process, that depends on it, will be able to run again. 
> > > > > > This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and > > > FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in > > > conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that > > > it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY > > > instead. > > > > So, from kvm's perspective it was to avoid excessively long blocking in > > common paths when it could rejoin the completed IO by somehow waiting > > on a page itself? > > > > It all seems like it should not be used unless the page is going to go > > to IO? > > I think NOWAIT is used as a common flag for kvm for its initial attempt to > fault in a normal page, however... I just noticed another fact that actually > __get_user_pages() won't work with PFNMAP (check_vma_flags should fail), but > KVM just started to support fault() for PFNMAP from commit add6a0cd1c5b (2016) > using fixup_user_fault(), where nvidia seems to have a similar request to have > a fault handler on some mapped BARs. > > > > > Certainly there is no reason to optimize the fringe case of vfio > > sleeping if there is and incorrect concurrnent attempt to disable the > > a BAR. > > If fixup_user_fault() (which is always with ALLOW_RETRY && !RETRY_NOWAIT) is > the only path for the new fault(), then current way seems ok. Not sure whether > this would worth a WARN_ON_ONCE(RETRY_NOWAIT) in the fault() to be clear of > that fact. Thanks for the discussion over the weekend folks. Peter, I take it you'd be satisfied if this patch were updated as: diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index aabba6439a5b..35bd7cd4e268 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -1528,6 +1528,13 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf) struct vfio_pci_device *vdev = vma->vm_private_data; vm_fault_t ret = VM_FAULT_NOPAGE; + /* + * We don't expect to be called with NOWAIT and there are conflicting + * opinions on whether NOWAIT suggests we shouldn't wait for locks or + * just shouldn't wait for I/O. + */ + WARN_ON_ONCE(vmf->flags & FAULT_FLAG_RETRY_NOWAIT); + mutex_lock(&vdev->vma_lock); down_read(&vdev->memory_lock); Is that correct? Thanks, Alex
On Tue, May 26, 2020 at 08:32:18AM -0600, Alex Williamson wrote: > On Tue, 26 May 2020 09:49:54 -0400 > Peter Xu <peterx@redhat.com> wrote: > > > On Mon, May 25, 2020 at 09:37:05PM -0300, Jason Gunthorpe wrote: > > > On Mon, May 25, 2020 at 01:56:28PM -0700, John Hubbard wrote: > > > > On 2020-05-25 09:56, Jason Gunthorpe wrote: > > > > > On Mon, May 25, 2020 at 11:11:42AM -0400, Peter Xu wrote: > > > > > > On Mon, May 25, 2020 at 11:46:51AM -0300, Jason Gunthorpe wrote: > > > > > > > On Mon, May 25, 2020 at 10:28:06AM -0400, Peter Xu wrote: > > > > > > > > On Mon, May 25, 2020 at 09:26:07AM -0300, Jason Gunthorpe wrote: > > > > > > > > > On Sat, May 23, 2020 at 07:52:57PM -0400, Peter Xu wrote: > > > > > > > > > > > > > > > > > > > For what I understand now, IMHO we should still need all those handlings of > > > > > > > > > > FAULT_FLAG_RETRY_NOWAIT like in the initial version. E.g., IIUC KVM gup will > > > > > > > > > > try with FOLL_NOWAIT when async is allowed, before the complete slow path. I'm > > > > > > > > > > not sure what would be the side effect of that if fault() blocked it. E.g., > > > > > > > > > > the caller could be in an atomic context. > > > > > > > > > > > > > > > > > > AFAICT FAULT_FLAG_RETRY_NOWAIT only impacts what happens when > > > > > > > > > VM_FAULT_RETRY is returned, which this doesn't do? > > > > > > > > > > > > > > > > Yes, that's why I think we should still properly return VM_FAULT_RETRY if > > > > > > > > needed.. because IMHO it is still possible that the caller calls with > > > > > > > > FAULT_FLAG_RETRY_NOWAIT. > > > > > > > > > > > > > > > > My understanding is that FAULT_FLAG_RETRY_NOWAIT majorly means: > > > > > > > > > > > > > > > > - We cannot release the mmap_sem, and, > > > > > > > > - We cannot sleep > > > > > > > > > > > > > > Sleeping looks fine, look at any FS implementation of fault, say, > > > > > > > xfs. The first thing it does is xfs_ilock() which does down_write(). > > > > > > > > > > > > Yeah. My wild guess is that maybe fs code will always be without > > > > > > FAULT_FLAG_RETRY_NOWAIT so it's safe to sleep unconditionally (e.g., I think > > > > > > the general #PF should be fine to sleep in fault(); gup should be special, but > > > > > > I didn't observe any gup code called upon file systems)? > > > > > > > > > > get_user_pages is called on filesystem backed pages. > > > > > > > > > > I have no idea what FAULT_FLAG_RETRY_NOWAIT is supposed to do. Maybe > > > > > John was able to guess when he reworked that stuff? > > > > > > > > > > > > > Although I didn't end up touching that particular area, I'm sure it's going > > > > to come up sometime soon, so I poked around just now, and found that > > > > FAULT_FLAG_RETRY_NOWAIT was added almost exactly 9 years ago. This flag was > > > > intended to make KVM and similar things behave better when doing GUP on > > > > file-backed pages that might, or might not be in memory. > > > > > > > > The idea is described in the changelog, but not in the code comments or > > > > Documentation, sigh: > > > > > > > > commit 318b275fbca1ab9ec0862de71420e0e92c3d1aa7 > > > > Author: Gleb Natapov <gleb@redhat.com> > > > > Date: Tue Mar 22 16:30:51 2011 -0700 > > > > > > > > mm: allow GUP to fail instead of waiting on a page > > > > > > > > GUP user may want to try to acquire a reference to a page if it is already > > > > in memory, but not if IO, to bring it in, is needed. For example KVM may > > > > tell vcpu to schedule another guest process if current one is trying to > > > > access swapped out page. 
Meanwhile, the page will be swapped in and the > > > > guest process, that depends on it, will be able to run again. > > > > > > > > This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and > > > > FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in > > > > conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that > > > > it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY > > > > instead. > > > > > > So, from kvm's perspective it was to avoid excessively long blocking in > > > common paths when it could rejoin the completed IO by somehow waiting > > > on a page itself? > > > > > > It all seems like it should not be used unless the page is going to go > > > to IO? > > > > I think NOWAIT is used as a common flag for kvm for its initial attempt to > > fault in a normal page, however... I just noticed another fact that actually > > __get_user_pages() won't work with PFNMAP (check_vma_flags should fail), but > > KVM just started to support fault() for PFNMAP from commit add6a0cd1c5b (2016) > > using fixup_user_fault(), where nvidia seems to have a similar request to have > > a fault handler on some mapped BARs. > > > > > > > > Certainly there is no reason to optimize the fringe case of vfio > > > sleeping if there is and incorrect concurrnent attempt to disable the > > > a BAR. > > > > If fixup_user_fault() (which is always with ALLOW_RETRY && !RETRY_NOWAIT) is > > the only path for the new fault(), then current way seems ok. Not sure whether > > this would worth a WARN_ON_ONCE(RETRY_NOWAIT) in the fault() to be clear of > > that fact. > > Thanks for the discussion over the weekend folks. Peter, I take it > you'd be satisfied if this patch were updated as: > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c > index aabba6439a5b..35bd7cd4e268 100644 > --- a/drivers/vfio/pci/vfio_pci.c > +++ b/drivers/vfio/pci/vfio_pci.c > @@ -1528,6 +1528,13 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf) > struct vfio_pci_device *vdev = vma->vm_private_data; > vm_fault_t ret = VM_FAULT_NOPAGE; > > + /* > + * We don't expect to be called with NOWAIT and there are conflicting > + * opinions on whether NOWAIT suggests we shouldn't wait for locks or > + * just shouldn't wait for I/O. > + */ > + WARN_ON_ONCE(vmf->flags & FAULT_FLAG_RETRY_NOWAIT); > + > mutex_lock(&vdev->vma_lock); > down_read(&vdev->memory_lock); > > > Is that correct? Thanks, Yes, actually either with or without it. :) Since I've read through the patch too and I don't see any issue from my side besides this one, FWIW.. please also feel free to take my r-b: Reviewed-by: Peter Xu <peterx@redhat.com> Thanks,
On Tue, May 26, 2020 at 08:32:18AM -0600, Alex Williamson wrote: > > > Certainly there is no reason to optimize the fringe case of vfio > > > sleeping if there is and incorrect concurrnent attempt to disable the > > > a BAR. > > > > If fixup_user_fault() (which is always with ALLOW_RETRY && !RETRY_NOWAIT) is > > the only path for the new fault(), then current way seems ok. Not sure whether > > this would worth a WARN_ON_ONCE(RETRY_NOWAIT) in the fault() to be clear of > > that fact. > > Thanks for the discussion over the weekend folks. Peter, I take it > you'd be satisfied if this patch were updated as: > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c > index aabba6439a5b..35bd7cd4e268 100644 > +++ b/drivers/vfio/pci/vfio_pci.c > @@ -1528,6 +1528,13 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf) > struct vfio_pci_device *vdev = vma->vm_private_data; > vm_fault_t ret = VM_FAULT_NOPAGE; > > + /* > + * We don't expect to be called with NOWAIT and there are conflicting > + * opinions on whether NOWAIT suggests we shouldn't wait for locks or > + * just shouldn't wait for I/O. > + */ > + WARN_ON_ONCE(vmf->flags & FAULT_FLAG_RETRY_NOWAIT); I don't think this is right, this implies there is some reason this code fails with FAULT_FLAG_RETRY_NOWAIT - but it is fine as written, AFAICT Jason
On Tue, 26 May 2020 12:53:31 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Tue, May 26, 2020 at 08:32:18AM -0600, Alex Williamson wrote: > > > > Certainly there is no reason to optimize the fringe case of vfio > > > > sleeping if there is and incorrect concurrnent attempt to disable the > > > > a BAR. > > > > > > If fixup_user_fault() (which is always with ALLOW_RETRY && !RETRY_NOWAIT) is > > > the only path for the new fault(), then current way seems ok. Not sure whether > > > this would worth a WARN_ON_ONCE(RETRY_NOWAIT) in the fault() to be clear of > > > that fact. > > > > Thanks for the discussion over the weekend folks. Peter, I take it > > you'd be satisfied if this patch were updated as: > > > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c > > index aabba6439a5b..35bd7cd4e268 100644 > > +++ b/drivers/vfio/pci/vfio_pci.c > > @@ -1528,6 +1528,13 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf) > > struct vfio_pci_device *vdev = vma->vm_private_data; > > vm_fault_t ret = VM_FAULT_NOPAGE; > > > > + /* > > + * We don't expect to be called with NOWAIT and there are conflicting > > + * opinions on whether NOWAIT suggests we shouldn't wait for locks or > > + * just shouldn't wait for I/O. > > + */ > > + WARN_ON_ONCE(vmf->flags & FAULT_FLAG_RETRY_NOWAIT); > > I don't think this is right, this implies there is some reason this > code fails with FAULT_FLAG_RETRY_NOWAIT - but it is fine as written, > AFAICT Ok, Peter said he's fine either way, I'll use the patch as originally posted and include Peter's R-b. Thanks, Alex
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 66a545a01f8f..aabba6439a5b 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -26,6 +26,7 @@ #include <linux/vfio.h> #include <linux/vgaarb.h> #include <linux/nospec.h> +#include <linux/sched/mm.h> #include "vfio_pci_private.h" @@ -184,6 +185,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev); static void vfio_pci_disable(struct vfio_pci_device *vdev); +static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data); /* * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND @@ -736,6 +738,12 @@ int vfio_pci_register_dev_region(struct vfio_pci_device *vdev, return 0; } +struct vfio_devices { + struct vfio_device **devices; + int cur_index; + int max_index; +}; + static long vfio_pci_ioctl(void *device_data, unsigned int cmd, unsigned long arg) { @@ -809,7 +817,7 @@ static long vfio_pci_ioctl(void *device_data, { void __iomem *io; size_t size; - u16 orig_cmd; + u16 cmd; info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index); info.flags = 0; @@ -829,10 +837,7 @@ static long vfio_pci_ioctl(void *device_data, * Is it really there? Enable memory decode for * implicit access in pci_map_rom(). */ - pci_read_config_word(pdev, PCI_COMMAND, &orig_cmd); - pci_write_config_word(pdev, PCI_COMMAND, - orig_cmd | PCI_COMMAND_MEMORY); - + cmd = vfio_pci_memory_lock_and_enable(vdev); io = pci_map_rom(pdev, &size); if (io) { info.flags = VFIO_REGION_INFO_FLAG_READ; @@ -840,8 +845,8 @@ static long vfio_pci_ioctl(void *device_data, } else { info.size = 0; } + vfio_pci_memory_unlock_and_restore(vdev, cmd); - pci_write_config_word(pdev, PCI_COMMAND, orig_cmd); break; } case VFIO_PCI_VGA_REGION_INDEX: @@ -984,8 +989,16 @@ static long vfio_pci_ioctl(void *device_data, return ret; } else if (cmd == VFIO_DEVICE_RESET) { - return vdev->reset_works ? - pci_try_reset_function(vdev->pdev) : -EINVAL; + int ret; + + if (!vdev->reset_works) + return -EINVAL; + + vfio_pci_zap_and_down_write_memory_lock(vdev); + ret = pci_try_reset_function(vdev->pdev); + up_write(&vdev->memory_lock); + + return ret; } else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) { struct vfio_pci_hot_reset_info hdr; @@ -1065,8 +1078,9 @@ static long vfio_pci_ioctl(void *device_data, int32_t *group_fds; struct vfio_pci_group_entry *groups; struct vfio_pci_group_info info; + struct vfio_devices devs = { .cur_index = 0 }; bool slot = false; - int i, count = 0, ret = 0; + int i, group_idx, mem_idx = 0, count = 0, ret = 0; minsz = offsetofend(struct vfio_pci_hot_reset, count); @@ -1118,9 +1132,9 @@ static long vfio_pci_ioctl(void *device_data, * user interface and store the group and iommu ID. This * ensures the group is held across the reset. 
 		 */
-		for (i = 0; i < hdr.count; i++) {
+		for (group_idx = 0; group_idx < hdr.count; group_idx++) {
 			struct vfio_group *group;
-			struct fd f = fdget(group_fds[i]);
+			struct fd f = fdget(group_fds[group_idx]);
 
 			if (!f.file) {
 				ret = -EBADF;
 				break;
@@ -1133,8 +1147,9 @@ static long vfio_pci_ioctl(void *device_data,
 				break;
 			}
 
-			groups[i].group = group;
-			groups[i].id = vfio_external_user_iommu_id(group);
+			groups[group_idx].group = group;
+			groups[group_idx].id =
+					vfio_external_user_iommu_id(group);
 		}
 
 		kfree(group_fds);
@@ -1153,13 +1168,63 @@ static long vfio_pci_ioctl(void *device_data,
 		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
 						    vfio_pci_validate_devs,
 						    &info, slot);
-		if (!ret)
-			/* User has access, do the reset */
-			ret = pci_reset_bus(vdev->pdev);
+		if (ret)
+			goto hot_reset_release;
+
+		devs.max_index = count;
+		devs.devices = kcalloc(count, sizeof(struct vfio_device *),
+				       GFP_KERNEL);
+		if (!devs.devices) {
+			ret = -ENOMEM;
+			goto hot_reset_release;
+		}
+
+		/*
+		 * We need to get memory_lock for each device, but devices
+		 * can share mmap_sem, therefore we need to zap and hold
+		 * the vma_lock for each device, and only then get each
+		 * memory_lock.
+		 */
+		ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
+					    vfio_pci_try_zap_and_vma_lock_cb,
+					    &devs, slot);
+		if (ret)
+			goto hot_reset_release;
+
+		for (; mem_idx < devs.cur_index; mem_idx++) {
+			struct vfio_pci_device *tmp;
+
+			tmp = vfio_device_data(devs.devices[mem_idx]);
+
+			ret = down_write_trylock(&tmp->memory_lock);
+			if (!ret) {
+				ret = -EBUSY;
+				goto hot_reset_release;
+			}
+			mutex_unlock(&tmp->vma_lock);
+		}
+
+		/* User has access, do the reset */
+		ret = pci_reset_bus(vdev->pdev);
 
 hot_reset_release:
-		for (i--; i >= 0; i--)
-			vfio_group_put_external_user(groups[i].group);
+		for (i = 0; i < devs.cur_index; i++) {
+			struct vfio_device *device;
+			struct vfio_pci_device *tmp;
+
+			device = devs.devices[i];
+			tmp = vfio_device_data(device);
+
+			if (i < mem_idx)
+				up_write(&tmp->memory_lock);
+			else
+				mutex_unlock(&tmp->vma_lock);
+			vfio_device_put(device);
+		}
+		kfree(devs.devices);
+
+		for (group_idx--; group_idx >= 0; group_idx--)
+			vfio_group_put_external_user(groups[group_idx].group);
 
 		kfree(groups);
 		return ret;
@@ -1299,8 +1364,126 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
-static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
-			    struct vm_area_struct *vma)
+/* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */
+static int vfio_pci_zap_and_vma_lock(struct vfio_pci_device *vdev, bool try)
+{
+	struct vfio_pci_mmap_vma *mmap_vma, *tmp;
+
+	/*
+	 * Lock ordering:
+	 * vma_lock is nested under mmap_sem for vm_ops callback paths.
+	 * The memory_lock semaphore is used by both code paths calling
+	 * into this function to zap vmas and the vm_ops.fault callback
+	 * to protect the memory enable state of the device.
+	 *
+	 * When zapping vmas we need to maintain the mmap_sem => vma_lock
+	 * ordering, which requires using vma_lock to walk vma_list to
+	 * acquire an mm, then dropping vma_lock to get the mmap_sem and
+	 * reacquiring vma_lock.  This logic is derived from similar
+	 * requirements in uverbs_user_mmap_disassociate().
+	 *
+	 * mmap_sem must always be the top-level lock when it is taken.
+	 * Therefore we can only hold the memory_lock write lock when
+	 * vma_list is empty, as we'd need to take mmap_sem to clear
+	 * entries.  vma_list can only be guaranteed empty when holding
+	 * vma_lock, thus memory_lock is nested under vma_lock.
+	 *
+	 * This enables the vm_ops.fault callback to acquire vma_lock,
+	 * followed by memory_lock read lock, while already holding
+	 * mmap_sem without risk of deadlock.
+	 */
+	while (1) {
+		struct mm_struct *mm = NULL;
+
+		if (try) {
+			if (!mutex_trylock(&vdev->vma_lock))
+				return 0;
+		} else {
+			mutex_lock(&vdev->vma_lock);
+		}
+		while (!list_empty(&vdev->vma_list)) {
+			mmap_vma = list_first_entry(&vdev->vma_list,
+						    struct vfio_pci_mmap_vma,
+						    vma_next);
+			mm = mmap_vma->vma->vm_mm;
+			if (mmget_not_zero(mm))
+				break;
+
+			list_del(&mmap_vma->vma_next);
+			kfree(mmap_vma);
+			mm = NULL;
+		}
+		if (!mm)
+			return 1;
+		mutex_unlock(&vdev->vma_lock);
+
+		if (try) {
+			if (!down_read_trylock(&mm->mmap_sem)) {
+				mmput(mm);
+				return 0;
+			}
+		} else {
+			down_read(&mm->mmap_sem);
+		}
+		if (mmget_still_valid(mm)) {
+			if (try) {
+				if (!mutex_trylock(&vdev->vma_lock)) {
+					up_read(&mm->mmap_sem);
+					mmput(mm);
+					return 0;
+				}
+			} else {
+				mutex_lock(&vdev->vma_lock);
+			}
+			list_for_each_entry_safe(mmap_vma, tmp,
+						 &vdev->vma_list, vma_next) {
+				struct vm_area_struct *vma = mmap_vma->vma;
+
+				if (vma->vm_mm != mm)
+					continue;
+
+				list_del(&mmap_vma->vma_next);
+				kfree(mmap_vma);
+
+				zap_vma_ptes(vma, vma->vm_start,
+					     vma->vm_end - vma->vm_start);
+			}
+			mutex_unlock(&vdev->vma_lock);
+		}
+		up_read(&mm->mmap_sem);
+		mmput(mm);
+	}
+}
+
+void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_device *vdev)
+{
+	vfio_pci_zap_and_vma_lock(vdev, false);
+	down_write(&vdev->memory_lock);
+	mutex_unlock(&vdev->vma_lock);
+}
+
+u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_device *vdev)
+{
+	u16 cmd;
+
+	down_write(&vdev->memory_lock);
+	pci_read_config_word(vdev->pdev, PCI_COMMAND, &cmd);
+	if (!(cmd & PCI_COMMAND_MEMORY))
+		pci_write_config_word(vdev->pdev, PCI_COMMAND,
+				      cmd | PCI_COMMAND_MEMORY);
+
+	return cmd;
+}
+
+void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev, u16 cmd)
+{
+	pci_write_config_word(vdev->pdev, PCI_COMMAND, cmd);
+	up_write(&vdev->memory_lock);
+}
+
+/* Caller holds vma_lock */
+static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
+			      struct vm_area_struct *vma)
 {
 	struct vfio_pci_mmap_vma *mmap_vma;
 
@@ -1309,10 +1492,7 @@ static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
 		return -ENOMEM;
 
 	mmap_vma->vma = vma;
-
-	mutex_lock(&vdev->vma_lock);
 	list_add(&mmap_vma->vma_next, &vdev->vma_list);
-	mutex_unlock(&vdev->vma_lock);
 
 	return 0;
 }
@@ -1346,15 +1526,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_device *vdev = vma->vm_private_data;
+	vm_fault_t ret = VM_FAULT_NOPAGE;
+
+	mutex_lock(&vdev->vma_lock);
+	down_read(&vdev->memory_lock);
+
+	if (!__vfio_pci_memory_enabled(vdev)) {
+		ret = VM_FAULT_SIGBUS;
+		mutex_unlock(&vdev->vma_lock);
+		goto up_out;
+	}
 
-	if (vfio_pci_add_vma(vdev, vma))
-		return VM_FAULT_OOM;
+	if (__vfio_pci_add_vma(vdev, vma)) {
+		ret = VM_FAULT_OOM;
+		mutex_unlock(&vdev->vma_lock);
+		goto up_out;
+	}
+
+	mutex_unlock(&vdev->vma_lock);
 
 	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
 			    vma->vm_end - vma->vm_start, vma->vm_page_prot))
-		return VM_FAULT_SIGBUS;
+		ret = VM_FAULT_SIGBUS;
 
-	return VM_FAULT_NOPAGE;
+up_out:
+	up_read(&vdev->memory_lock);
+	return ret;
 }
 
 static const struct vm_operations_struct vfio_pci_mmap_ops = {
@@ -1680,6 +1877,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
 	mutex_init(&vdev->vma_lock);
 	INIT_LIST_HEAD(&vdev->vma_list);
+	init_rwsem(&vdev->memory_lock);
 
 	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
 	if (ret)
@@ -1933,12 +2131,6 @@ static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
 	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
 }
 
-struct vfio_devices {
-	struct vfio_device **devices;
-	int cur_index;
-	int max_index;
-};
-
 static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 {
 	struct vfio_devices *devs = data;
@@ -1969,6 +2161,39 @@ static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 	return 0;
 }
 
+static int vfio_pci_try_zap_and_vma_lock_cb(struct pci_dev *pdev, void *data)
+{
+	struct vfio_devices *devs = data;
+	struct vfio_device *device;
+	struct vfio_pci_device *vdev;
+
+	if (devs->cur_index == devs->max_index)
+		return -ENOSPC;
+
+	device = vfio_device_get_from_dev(&pdev->dev);
+	if (!device)
+		return -EINVAL;
+
+	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
+	vdev = vfio_device_data(device);
+
+	/*
+	 * Locking multiple devices is prone to deadlock, runaway and
+	 * unwind if we hit contention.
+	 */
+	if (!vfio_pci_zap_and_vma_lock(vdev, true)) {
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
+	devs->devices[devs->cur_index++] = device;
+	return 0;
+}
+
 /*
  * If a bus or slot reset is available for the provided device and:
  *  - All of the devices affected by that bus or slot reset are unused
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 90c0b80f8acf..3dcddbd572e6 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -395,6 +395,14 @@ static inline void p_setd(struct perm_bits *p, int off, u32 virt, u32 write)
 	*(__le32 *)(&p->write[off]) = cpu_to_le32(write);
 }
 
+/* Caller should hold memory_lock semaphore */
+bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev)
+{
+	u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
+
+	return cmd & PCI_COMMAND_MEMORY;
+}
+
 /*
  * Restore the *real* BARs after we detect a FLR or backdoor reset.
  * (backdoor = some device specific technique that we didn't catch)
@@ -556,13 +564,18 @@ static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
 
 	new_cmd = le32_to_cpu(val);
 
+	phys_io = !!(phys_cmd & PCI_COMMAND_IO);
+	virt_io = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_IO);
+	new_io = !!(new_cmd & PCI_COMMAND_IO);
+
 	phys_mem = !!(phys_cmd & PCI_COMMAND_MEMORY);
 	virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
 	new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
 
-	phys_io = !!(phys_cmd & PCI_COMMAND_IO);
-	virt_io = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_IO);
-	new_io = !!(new_cmd & PCI_COMMAND_IO);
+	if (!new_mem)
+		vfio_pci_zap_and_down_write_memory_lock(vdev);
+	else
+		down_write(&vdev->memory_lock);
 
 	/*
 	 * If the user is writing mem/io enable (new_mem/io) and we
@@ -579,8 +592,11 @@ static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
 	}
 
 	count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
-	if (count < 0)
+	if (count < 0) {
+		if (offset == PCI_COMMAND)
+			up_write(&vdev->memory_lock);
 		return count;
+	}
 
 	/*
 	 * Save current memory/io enable bits in vconfig to allow for
@@ -591,6 +607,8 @@ static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
 
 		*virt_cmd &= cpu_to_le16(~mask);
 		*virt_cmd |= cpu_to_le16(new_cmd & mask);
+
+		up_write(&vdev->memory_lock);
 	}
 
 	/* Emulate INTx disable */
@@ -828,8 +846,11 @@ static int vfio_exp_config_write(struct vfio_pci_device *vdev, int pos,
 						 pos - offset + PCI_EXP_DEVCAP,
 						 &cap);
 
-		if (!ret && (cap & PCI_EXP_DEVCAP_FLR))
+		if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
+			vfio_pci_zap_and_down_write_memory_lock(vdev);
 			pci_try_reset_function(vdev->pdev);
+			up_write(&vdev->memory_lock);
+		}
 	}
 
 	/*
@@ -907,8 +928,11 @@ static int vfio_af_config_write(struct vfio_pci_device *vdev, int pos,
 						pos - offset + PCI_AF_CAP,
 						&cap);
 
-		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP))
+		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
+			vfio_pci_zap_and_down_write_memory_lock(vdev);
 			pci_try_reset_function(vdev->pdev);
+			up_write(&vdev->memory_lock);
+		}
 	}
 
 	return count;
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 2056f3f85f59..1d9fb2592945 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -249,6 +249,7 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
 	struct pci_dev *pdev = vdev->pdev;
 	unsigned int flag = msix ? PCI_IRQ_MSIX : PCI_IRQ_MSI;
 	int ret;
+	u16 cmd;
 
 	if (!is_irq_none(vdev))
 		return -EINVAL;
@@ -258,13 +259,16 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
 		return -ENOMEM;
 
 	/* return the number of supported vectors if we can't get all: */
+	cmd = vfio_pci_memory_lock_and_enable(vdev);
 	ret = pci_alloc_irq_vectors(pdev, 1, nvec, flag);
 	if (ret < nvec) {
 		if (ret > 0)
 			pci_free_irq_vectors(pdev);
+		vfio_pci_memory_unlock_and_restore(vdev, cmd);
 		kfree(vdev->ctx);
 		return ret;
 	}
+	vfio_pci_memory_unlock_and_restore(vdev, cmd);
 
 	vdev->num_ctx = nvec;
 	vdev->irq_type = msix ? VFIO_PCI_MSIX_IRQ_INDEX :
@@ -287,6 +291,7 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 	struct pci_dev *pdev = vdev->pdev;
 	struct eventfd_ctx *trigger;
 	int irq, ret;
+	u16 cmd;
 
 	if (vector < 0 || vector >= vdev->num_ctx)
 		return -EINVAL;
@@ -295,7 +300,11 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 
 	if (vdev->ctx[vector].trigger) {
 		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+
+		cmd = vfio_pci_memory_lock_and_enable(vdev);
 		free_irq(irq, vdev->ctx[vector].trigger);
+		vfio_pci_memory_unlock_and_restore(vdev, cmd);
+
 		kfree(vdev->ctx[vector].name);
 		eventfd_ctx_put(vdev->ctx[vector].trigger);
 		vdev->ctx[vector].trigger = NULL;
@@ -323,6 +332,7 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 	 * such a reset it would be unsuccessful. To avoid this, restore the
 	 * cached value of the message prior to enabling.
 	 */
+	cmd = vfio_pci_memory_lock_and_enable(vdev);
 	if (msix) {
 		struct msi_msg msg;
 
@@ -332,6 +342,7 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 
 	ret = request_irq(irq, vfio_msihandler, 0,
 			  vdev->ctx[vector].name, trigger);
+	vfio_pci_memory_unlock_and_restore(vdev, cmd);
 	if (ret) {
 		kfree(vdev->ctx[vector].name);
 		eventfd_ctx_put(trigger);
@@ -376,6 +387,7 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	int i;
+	u16 cmd;
 
 	for (i = 0; i < vdev->num_ctx; i++) {
 		vfio_virqfd_disable(&vdev->ctx[i].unmask);
@@ -384,7 +396,9 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix)
 
 	vfio_msi_set_block(vdev, 0, vdev->num_ctx, NULL, msix);
 
+	cmd = vfio_pci_memory_lock_and_enable(vdev);
 	pci_free_irq_vectors(pdev);
+	vfio_pci_memory_unlock_and_restore(vdev, cmd);
 
 	/*
 	 * Both disable paths above use pci_intx_for_msi() to clear DisINTx
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 9b25f9f6ce1d..86a02aff8735 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -139,6 +139,7 @@ struct vfio_pci_device {
 	struct notifier_block	nb;
 	struct mutex		vma_lock;
 	struct list_head	vma_list;
+	struct rw_semaphore	memory_lock;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
@@ -181,6 +182,13 @@ extern int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 extern int vfio_pci_set_power_state(struct vfio_pci_device *vdev,
 				    pci_power_t state);
 
+extern bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev);
+extern void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_device
+						    *vdev);
+extern u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_device *vdev);
+extern void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev,
+					       u16 cmd);
+
 #ifdef CONFIG_VFIO_PCI_IGD
 extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
 #else
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index a87992892a9f..916b184df3a5 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -162,6 +162,7 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 	size_t x_start = 0, x_end = 0;
 	resource_size_t end;
 	void __iomem *io;
+	struct resource *res = &vdev->pdev->resource[bar];
 	ssize_t done;
 
 	if (pci_resource_start(pdev, bar))
@@ -177,6 +178,14 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 
 	count = min(count, (size_t)(end - pos));
 
+	if (res->flags & IORESOURCE_MEM) {
+		down_read(&vdev->memory_lock);
+		if (!__vfio_pci_memory_enabled(vdev)) {
+			up_read(&vdev->memory_lock);
+			return -EIO;
+		}
+	}
+
 	if (bar == PCI_ROM_RESOURCE) {
 		/*
 		 * The ROM can fill less space than the BAR, so we start the
@@ -184,13 +193,17 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 		 * filling large ROM BARs much faster.
 		 */
 		io = pci_map_rom(pdev, &x_start);
-		if (!io)
-			return -ENOMEM;
+		if (!io) {
+			done = -ENOMEM;
+			goto out;
+		}
 		x_end = end;
 	} else {
 		int ret = vfio_pci_setup_barmap(vdev, bar);
-		if (ret)
-			return ret;
+		if (ret) {
+			done = ret;
+			goto out;
+		}
 
 		io = vdev->barmap[bar];
 	}
@@ -207,6 +220,9 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_device *vdev, char __user *buf,
 	if (bar == PCI_ROM_RESOURCE)
 		pci_unmap_rom(pdev, io);
 
+out:
+	if (res->flags & IORESOURCE_MEM)
+		up_read(&vdev->memory_lock);
 
 	return done;
 }
Accessing the disabled memory space of a PCI device would typically result in a master abort response on conventional PCI, or an unsupported request on PCI express. The user would generally see these as a -1 response for the read return data and the write would be silently discarded, possibly with an uncorrected, non-fatal AER error triggered on the host. Some systems however take it upon themselves to bring down the entire system when they see something that might indicate a loss of data, such as this discarded write to a disabled memory space.

To avoid this, we want to try to block the user from accessing memory spaces while they're disabled. We start with a semaphore around the memory enable bit, where writers modify the memory enable state and must be serialized, while readers make use of the memory region and can access in parallel. Writers include both direct manipulation via the command register, as well as any reset path where the internal mechanics of the reset may both explicitly and implicitly disable memory access, and manipulation of the MSI-X configuration, where the MSI-X vector table resides in MMIO space of the device. Readers include the read and write file ops to access the vfio device fd offsets as well as memory mapped access. In the latter case, we make use of our new vma list support to zap, or invalidate, those memory mappings in order to force them to be faulted back in on access.

Our semaphore usage will stall user access to MMIO spaces across internal operations like reset, but the user might experience new behavior when trying to access the MMIO space while disabled via the PCI command register. Access via read or write while disabled will return -EIO and access via memory maps will result in a SIGBUS. This is expected to be compatible with known use cases and potentially provides better error handling capabilities than present in the hardware, while avoiding the more readily accessible and severe platform error responses that might otherwise occur.

Fixes: CVE-2020-12888
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c         | 291 +++++++++++++++++++++++++++++++----
 drivers/vfio/pci/vfio_pci_config.c  |  36 ++++
 drivers/vfio/pci/vfio_pci_intrs.c   |  14 ++
 drivers/vfio/pci/vfio_pci_private.h |   8 +
 drivers/vfio/pci/vfio_pci_rdwr.c    |  24 ++-
 5 files changed, 330 insertions(+), 43 deletions(-)
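For illustration only, here is a minimal user-space sketch of the behavior described above. It is hypothetical code, not part of this series: it assumes a vfio-pci device fd already obtained through the usual container/group setup, the helper name is invented for the example, BAR0 is chosen arbitrarily, and endianness handling of config space data is omitted.

/*
 * Hypothetical example: clear PCI_COMMAND_MEMORY through the config
 * space region, then read BAR0 through the vfio device fd.  With this
 * series applied, the BAR read is expected to fail with errno == EIO;
 * touching an existing mmap() of the BAR would instead raise SIGBUS.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/pci_regs.h>
#include <linux/vfio.h>

static int read_bar0_with_memory_disabled(int device_fd)
{
	struct vfio_region_info cfg = { .argsz = sizeof(cfg),
					.index = VFIO_PCI_CONFIG_REGION_INDEX };
	struct vfio_region_info bar = { .argsz = sizeof(bar),
					.index = VFIO_PCI_BAR0_REGION_INDEX };
	uint16_t cmd;
	uint32_t val;

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &cfg) ||
	    ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &bar))
		return -errno;

	/* Clear the memory enable bit in the PCI command register */
	if (pread(device_fd, &cmd, sizeof(cmd),
		  cfg.offset + PCI_COMMAND) != sizeof(cmd))
		return -errno;
	cmd &= ~PCI_COMMAND_MEMORY;
	if (pwrite(device_fd, &cmd, sizeof(cmd),
		   cfg.offset + PCI_COMMAND) != sizeof(cmd))
		return -errno;

	/* Expect EIO here while memory decode remains disabled */
	if (pread(device_fd, &val, sizeof(val), bar.offset) != sizeof(val)) {
		fprintf(stderr, "BAR0 read: %s\n", strerror(errno));
		return -errno;
	}

	printf("BAR0[0] = 0x%x\n", val);
	return 0;
}

The same access performed through an existing mmap of the BAR, rather than through the device fd, would have its mapping zapped by the config write above and would then fault with SIGBUS on the next touch while memory remains disabled.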