
[v2,2/3] vfio-pci: Fault mmaps to enable vma tracking

Message ID: 158871569380.15589.16950418949340311053.stgit@gimli.home
State: New, archived
Series: vfio-pci: Block user access to disabled device MMIO

Commit Message

Alex Williamson May 5, 2020, 9:54 p.m. UTC
Rather than calling remap_pfn_range() when a region is mmap'd, set up
a vm_ops handler to support dynamic faulting of the range on access.
This allows us to manage a list of vmas actively mapping the area that
we can later use to invalidate those mappings.  The open callback
invalidates the vma range so that all tracking is inserted in the
fault handler and removed in the close handler.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c         |   76 ++++++++++++++++++++++++++++++++++-
 drivers/vfio/pci/vfio_pci_private.h |    7 +++
 2 files changed, 81 insertions(+), 2 deletions(-)
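
From userspace the flow is unchanged; only the point at which the PTEs are
populated moves from mmap() time into vfio_pci_mmap_fault().  A minimal
sketch of that flow (not part of the patch; map_bar0() is a hypothetical
helper and error handling is kept minimal):

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

/*
 * Map BAR 0 of an already-opened VFIO device fd.  With this patch,
 * mmap() only installs vfio_pci_mmap_ops; the first load/store through
 * the returned pointer triggers vfio_pci_mmap_fault(), which records
 * the vma in vdev->vma_list and remaps the whole range.
 */
static volatile uint32_t *map_bar0(int device_fd)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_BAR0_REGION_INDEX,
	};
	void *bar;

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info) < 0)
		return NULL;

	bar = mmap(NULL, info.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		   device_fd, info.offset);

	return bar == MAP_FAILED ? NULL : bar;
}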

Comments

Peter Xu May 7, 2020, 9:47 p.m. UTC | #1
Hi, Alex,

On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> +/*
> + * Zap mmaps on open so that we can fault them in on access and therefore
> + * our vma_list only tracks mappings accessed since last zap.
> + */
> +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> +{
> +	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);

A pure question: is this only a safety-belt or is it required in some known
scenarios?

In all cases:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks,
Alex Williamson May 7, 2020, 10:03 p.m. UTC | #2
On Thu, 7 May 2020 17:47:44 -0400
Peter Xu <peterx@redhat.com> wrote:

> Hi, Alex,
> 
> On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> > +/*
> > + * Zap mmaps on open so that we can fault them in on access and therefore
> > + * our vma_list only tracks mappings accessed since last zap.
> > + */
> > +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> > +{
> > +	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);  
> 
> A pure question: is this only a safety-belt or is it required in some known
> scenarios?

It's not required.  I originally did this so that I'm not allocating a
vma_list entry in a path where I can't return error, but as Jason
suggested I could zap here only in the case that I do encounter that
allocation fault.  However I still like consolidating the vma_list
handling to the vm_ops .fault and .close callbacks and potentially we
reduce the zap latency by keeping the vma_list to actual users, which
we'll get to eventually anyway in the VM case as memory BARs are sized
and assigned addresses.

> In all cases:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks!
Alex
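
For reference, the alternative mentioned above could look roughly like the
hypothetical sketch below (not part of the posted patch): track the copied
vma directly in .open() and fall back to zapping only when the tracking
allocation fails, since .open() cannot return an error.

/*
 * Hypothetical variant of vfio_pci_mmap_open(): try to track the new
 * (split/copied) vma immediately, and only zap its PTEs if the
 * allocation fails so the fault handler can re-establish tracking on
 * the next access.
 */
static void vfio_pci_mmap_open_alt(struct vm_area_struct *vma)
{
	struct vfio_pci_device *vdev = vma->vm_private_data;

	if (vfio_pci_add_vma(vdev, vma))
		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
}
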
Peter Xu May 7, 2020, 10:22 p.m. UTC | #3
On Thu, May 07, 2020 at 04:03:34PM -0600, Alex Williamson wrote:
> On Thu, 7 May 2020 17:47:44 -0400
> Peter Xu <peterx@redhat.com> wrote:
> 
> > Hi, Alex,
> > 
> > On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> > > +/*
> > > + * Zap mmaps on open so that we can fault them in on access and therefore
> > > + * our vma_list only tracks mappings accessed since last zap.
> > > + */
> > > +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> > > +{
> > > +	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);  
> > 
> > A pure question: is this only a safety-belt or is it required in some known
> > scenarios?
> 
> It's not required.  I originally did this so that I'm not allocating a
> vma_list entry in a path where I can't return error, but as Jason
> suggested I could zap here only in the case that I do encounter that
> allocation fault.  However I still like consolidating the vma_list
> handling to the vm_ops .fault and .close callbacks and potentially we
> reduce the zap latency by keeping the vma_list to actual users, which
> we'll get to eventually anyway in the VM case as memory BARs are sized
> and assigned addresses.

Yes, I don't see much problem either with doing the vma_list maintenance only in
.fault() and .close().  My understanding is that the worst case is that perf
critical applications (e.g. DPDK) could pre-fault these MMIO regions easily
during setup if they want.  My question was mainly about whether the vma
should be guaranteed to have no mapping at all when .open() is called.  But I
agree with you that it's always good to have that as a safety belt anyway.

Thanks!
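
The pre-faulting mentioned above could be as simple as the hypothetical
helper below (illustrative only: with this patch a single access already
populates the whole vma, and reads of real device registers can have side
effects, so the exact access used would be device specific):

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Touch one byte per page so that vfio_pci_mmap_fault() runs during
 * setup rather than on the hot path.
 */
static void prefault_region(volatile uint8_t *base, size_t size)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t off;

	for (off = 0; off < size; off += page)
		(void)base[off];
}
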
Jason Gunthorpe May 7, 2020, 11:56 p.m. UTC | #4
On Thu, May 07, 2020 at 06:22:23PM -0400, Peter Xu wrote:
> On Thu, May 07, 2020 at 04:03:34PM -0600, Alex Williamson wrote:
> > On Thu, 7 May 2020 17:47:44 -0400
> > Peter Xu <peterx@redhat.com> wrote:
> > 
> > > Hi, Alex,
> > > 
> > > On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> > > > +/*
> > > > + * Zap mmaps on open so that we can fault them in on access and therefore
> > > > + * our vma_list only tracks mappings accessed since last zap.
> > > > + */
> > > > +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> > > > +{
> > > > +	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);  
> > > 
> > > A pure question: is this only a safety-belt or is it required in some known
> > > scenarios?
> > 
> > It's not required.  I originally did this so that I'm not allocating a
> > vma_list entry in a path where I can't return error, but as Jason
> > suggested I could zap here only in the case that I do encounter that
> > allocation fault.  However I still like consolidating the vma_list
> > handling to the vm_ops .fault and .close callbacks and potentially we
> > reduce the zap latency by keeping the vma_list to actual users, which
> > we'll get to eventually anyway in the VM case as memory BARs are sized
> > and assigned addresses.
> 
> Yes, I don't see much problem either with doing the vma_list maintenance only in
> .fault() and .close().  My understanding is that the worst case is that perf
> critical applications (e.g. DPDK) could pre-fault these MMIO regions easily
> during setup if they want.  My question was mainly about whether the vma
> should be guaranteed to have no mapping at all when .open() is called.  But I
> agree with you that it's always good to have that as a safety belt anyway.

If the VMA has a mapping then that specific VMA has to be in the
linked list.

So if the zap is skipped then you have to allocate something and
add to the linked list to track the VMA with mapping.

It is not a 'safety belt'

Jason
Peter Xu May 8, 2020, 2:16 a.m. UTC | #5
On Thu, May 07, 2020 at 08:56:33PM -0300, Jason Gunthorpe wrote:
> On Thu, May 07, 2020 at 06:22:23PM -0400, Peter Xu wrote:
> > On Thu, May 07, 2020 at 04:03:34PM -0600, Alex Williamson wrote:
> > > On Thu, 7 May 2020 17:47:44 -0400
> > > Peter Xu <peterx@redhat.com> wrote:
> > > 
> > > > Hi, Alex,
> > > > 
> > > > On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> > > > > +/*
> > > > > + * Zap mmaps on open so that we can fault them in on access and therefore
> > > > > + * our vma_list only tracks mappings accessed since last zap.
> > > > > + */
> > > > > +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> > > > > +{
> > > > > +	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);  
> > > > 
> > > > A pure question: is this only a safety-belt or is it required in some known
> > > > scenarios?
> > > 
> > > It's not required.  I originally did this so that I'm not allocating a
> > > vma_list entry in a path where I can't return error, but as Jason
> > > suggested I could zap here only in the case that I do encounter that
> > > allocation fault.  However I still like consolidating the vma_list
> > > handling to the vm_ops .fault and .close callbacks and potentially we
> > > reduce the zap latency by keeping the vma_list to actual users, which
> > > we'll get to eventually anyway in the VM case as memory BARs are sized
> > > and assigned addresses.
> > 
> > Yes, I don't see much problem either with doing the vma_list maintenance only in
> > .fault() and .close().  My understanding is that the worst case is that perf
> > critical applications (e.g. DPDK) could pre-fault these MMIO regions easily
> > during setup if they want.  My question was mainly about whether the vma
> > should be guaranteed to have no mapping at all when .open() is called.  But I
> > agree with you that it's always good to have that as a safety belt anyway.
> 
> If the VMA has a mapping then that specific VMA has to be in the
> linked list.
> 
> So if the zap is skipped then you have to allocate something and
> add to the linked list to track the VMA with mapping.
> 
> It is not a 'safety belt'

But shouldn't open() only be called when the VMA is created for a memory range?
If so, does it also mean that the address range must have not been mapped yet?

Thanks,
Jason Wang May 8, 2020, 6:44 a.m. UTC | #6
On 2020/5/8 10:16 AM, Peter Xu wrote:
> On Thu, May 07, 2020 at 08:56:33PM -0300, Jason Gunthorpe wrote:
>> On Thu, May 07, 2020 at 06:22:23PM -0400, Peter Xu wrote:
>>> On Thu, May 07, 2020 at 04:03:34PM -0600, Alex Williamson wrote:
>>>> On Thu, 7 May 2020 17:47:44 -0400
>>>> Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>>> Hi, Alex,
>>>>>
>>>>> On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
>>>>>> +/*
>>>>>> + * Zap mmaps on open so that we can fault them in on access and therefore
>>>>>> + * our vma_list only tracks mappings accessed since last zap.
>>>>>> + */
>>>>>> +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
>>>>>> +{
>>>>>> +	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
>>>>> A pure question: is this only a safety-belt or is it required in some known
>>>>> scenarios?
>>>> It's not required.  I originally did this so that I'm not allocating a
>>>> vma_list entry in a path where I can't return error, but as Jason
>>>> suggested I could zap here only in the case that I do encounter that
>>>> allocation fault.  However I still like consolidating the vma_list
>>>> handling to the vm_ops .fault and .close callbacks and potentially we
>>>> reduce the zap latency by keeping the vma_list to actual users, which
>>>> we'll get to eventually anyway in the VM case as memory BARs are sized
>>>> and assigned addresses.
>>> Yes, I don't see much problem either with doing the vma_list maintenance only in
>>> .fault() and .close().  My understanding is that the worst case is that perf
>>> critical applications (e.g. DPDK) could pre-fault these MMIO regions easily
>>> during setup if they want.  My question was mainly about whether the vma
>>> should be guaranteed to have no mapping at all when .open() is called.  But I
>>> agree with you that it's always good to have that as a safety belt anyway.
>> If the VMA has a mapping then that specific VMA has to be in the
>> linked list.
>>
>> So if the zap is skipped then you have to allocate something and
>> add to the linked list to track the VMA with mapping.
>>
>> It is not a 'safety belt'
> But shouldn't open() only be called when the VMA is created for a memory range?
> If so, does it also mean that the address range must have not been mapped yet?


Probably not, e.g. when a VMA is being split.

Thanks


>
> Thanks,
>
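
To make the split case concrete, here is a hypothetical userspace sequence
(punch_hole() and the mapping are illustrative) in which vm_ops->open() is
called on a vma that already has populated PTEs:

#include <sys/mman.h>
#include <unistd.h>

/*
 * 'bar' is an existing, already-faulted BAR mapping of at least three
 * pages.  Unmapping the middle page forces the kernel to split the vma;
 * the surviving tail piece is created by copying, so
 * vfio_pci_mmap_open() runs on a vma whose PTEs are still populated.
 * Without the zap (or immediate tracking) that vma would be missing
 * from vdev->vma_list, which is why the zap is more than a safety belt.
 */
static void punch_hole(void *bar)
{
	long page = sysconf(_SC_PAGESIZE);

	munmap((char *)bar + page, page);
}
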
Jason Gunthorpe May 8, 2020, 12:08 p.m. UTC | #7
On Thu, May 07, 2020 at 10:16:56PM -0400, Peter Xu wrote:
> On Thu, May 07, 2020 at 08:56:33PM -0300, Jason Gunthorpe wrote:
> > On Thu, May 07, 2020 at 06:22:23PM -0400, Peter Xu wrote:
> > > On Thu, May 07, 2020 at 04:03:34PM -0600, Alex Williamson wrote:
> > > > On Thu, 7 May 2020 17:47:44 -0400
> > > > Peter Xu <peterx@redhat.com> wrote:
> > > > 
> > > > > Hi, Alex,
> > > > > 
> > > > > On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> > > > > > +/*
> > > > > > + * Zap mmaps on open so that we can fault them in on access and therefore
> > > > > > + * our vma_list only tracks mappings accessed since last zap.
> > > > > > + */
> > > > > > +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> > > > > > +{
> > > > > > +	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);  
> > > > > 
> > > > > A pure question: is this only a safety-belt or is it required in some known
> > > > > scenarios?
> > > > 
> > > > It's not required.  I originally did this so that I'm not allocating a
> > > > vma_list entry in a path where I can't return error, but as Jason
> > > > suggested I could zap here only in the case that I do encounter that
> > > > allocation fault.  However I still like consolidating the vma_list
> > > > handling to the vm_ops .fault and .close callbacks and potentially we
> > > > reduce the zap latency by keeping the vma_list to actual users, which
> > > > we'll get to eventually anyway in the VM case as memory BARs are sized
> > > > and assigned addresses.
> > > 
> > > Yes, I don't see much problem either with doing the vma_list maintenance only in
> > > .fault() and .close().  My understanding is that the worst case is that perf
> > > critical applications (e.g. DPDK) could pre-fault these MMIO regions easily
> > > during setup if they want.  My question was mainly about whether the vma
> > > should be guaranteed to have no mapping at all when .open() is called.  But I
> > > agree with you that it's always good to have that as a safety belt anyway.
> > 
> > If the VMA has a mapping then that specific VMA has to be in the
> > linked list.
> > 
> > So if the zap is skipped then you have to allocate something and
> > add to the linked list to track the VMA with mapping.
> > 
> > It is not a 'safety belt'
> 
> But shouldn't open() only be called when the VMA is created for a memory range?
> If so, does it also mean that the address range must have not been mapped yet?

open is called whenever a VMA is copied, I don't think it is called
when the VMA is first created?

Jason
Peter Xu May 8, 2020, 2:26 p.m. UTC | #8
On Fri, May 08, 2020 at 09:08:01AM -0300, Jason Gunthorpe wrote:
> On Thu, May 07, 2020 at 10:16:56PM -0400, Peter Xu wrote:
> > On Thu, May 07, 2020 at 08:56:33PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 07, 2020 at 06:22:23PM -0400, Peter Xu wrote:
> > > > On Thu, May 07, 2020 at 04:03:34PM -0600, Alex Williamson wrote:
> > > > > On Thu, 7 May 2020 17:47:44 -0400
> > > > > Peter Xu <peterx@redhat.com> wrote:
> > > > > 
> > > > > > Hi, Alex,
> > > > > > 
> > > > > > On Tue, May 05, 2020 at 03:54:53PM -0600, Alex Williamson wrote:
> > > > > > > +/*
> > > > > > > + * Zap mmaps on open so that we can fault them in on access and therefore
> > > > > > > + * our vma_list only tracks mappings accessed since last zap.
> > > > > > > + */
> > > > > > > +static void vfio_pci_mmap_open(struct vm_area_struct *vma)
> > > > > > > +{
> > > > > > > +	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);  
> > > > > > 
> > > > > > A pure question: is this only a safety-belt or is it required in some known
> > > > > > scenarios?
> > > > > 
> > > > > It's not required.  I originally did this so that I'm not allocating a
> > > > > vma_list entry in a path where I can't return error, but as Jason
> > > > > suggested I could zap here only in the case that I do encounter that
> > > > > allocation fault.  However I still like consolidating the vma_list
> > > > > handling to the vm_ops .fault and .close callbacks and potentially we
> > > > > reduce the zap latency by keeping the vma_list to actual users, which
> > > > > we'll get to eventually anyway in the VM case as memory BARs are sized
> > > > > and assigned addresses.
> > > > 
> > > > Yes, I don't see much problem either with doing the vma_list maintenance only in
> > > > .fault() and .close().  My understanding is that the worst case is that perf
> > > > critical applications (e.g. DPDK) could pre-fault these MMIO regions easily
> > > > during setup if they want.  My question was mainly about whether the vma
> > > > should be guaranteed to have no mapping at all when .open() is called.  But I
> > > > agree with you that it's always good to have that as a safety belt anyway.
> > > 
> > > If the VMA has a mapping then that specific VMA has to be in the
> > > linked list.
> > > 
> > > So if the zap is skipped then you have to allocate something and
> > > add to the linked list to track the VMA with mapping.
> > > 
> > > It is not a 'safety belt'
> > 
> > But shouldn't open() only be called when the VMA is created for a memory range?
> > If so, does it also mean that the address range must have not been mapped yet?
> 
> open is called whenever a VMA is copied, I don't think it is called
> when the VMA is first created?

Ah, I think you're right. I misunderstood open(), which I thought was always
paired with close(), but it seems not.  Thanks,
Peter Xu May 8, 2020, 2:27 p.m. UTC | #9
Hi, Jason!

On Fri, May 08, 2020 at 02:44:44PM +0800, Jason Wang wrote:
> Probably not, e.g. when a VMA is being split.

I see it now, thanks. :)

Patch

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 6c6b37b5c04e..66a545a01f8f 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1299,6 +1299,70 @@  static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
+static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
+			    struct vm_area_struct *vma)
+{
+	struct vfio_pci_mmap_vma *mmap_vma;
+
+	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
+	if (!mmap_vma)
+		return -ENOMEM;
+
+	mmap_vma->vma = vma;
+
+	mutex_lock(&vdev->vma_lock);
+	list_add(&mmap_vma->vma_next, &vdev->vma_list);
+	mutex_unlock(&vdev->vma_lock);
+
+	return 0;
+}
+
+/*
+ * Zap mmaps on open so that we can fault them in on access and therefore
+ * our vma_list only tracks mappings accessed since last zap.
+ */
+static void vfio_pci_mmap_open(struct vm_area_struct *vma)
+{
+	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+}
+
+static void vfio_pci_mmap_close(struct vm_area_struct *vma)
+{
+	struct vfio_pci_device *vdev = vma->vm_private_data;
+	struct vfio_pci_mmap_vma *mmap_vma;
+
+	mutex_lock(&vdev->vma_lock);
+	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+		if (mmap_vma->vma == vma) {
+			list_del(&mmap_vma->vma_next);
+			kfree(mmap_vma);
+			break;
+		}
+	}
+	mutex_unlock(&vdev->vma_lock);
+}
+
+static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct vfio_pci_device *vdev = vma->vm_private_data;
+
+	if (vfio_pci_add_vma(vdev, vma))
+		return VM_FAULT_OOM;
+
+	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+			    vma->vm_end - vma->vm_start, vma->vm_page_prot))
+		return VM_FAULT_SIGBUS;
+
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vfio_pci_mmap_ops = {
+	.open = vfio_pci_mmap_open,
+	.close = vfio_pci_mmap_close,
+	.fault = vfio_pci_mmap_fault,
+};
+
 static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 {
 	struct vfio_pci_device *vdev = device_data;
@@ -1357,8 +1421,14 @@  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
 
-	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       req_len, vma->vm_page_prot);
+	/*
+	 * See remap_pfn_range(), called from vfio_pci_fault() but we can't
+	 * change vm_flags within the fault handler.  Set them now.
+	 */
+	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+	vma->vm_ops = &vfio_pci_mmap_ops;
+
+	return 0;
 }
 
 static void vfio_pci_request(void *device_data, unsigned int count)
@@ -1608,6 +1678,8 @@  static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	spin_lock_init(&vdev->irqlock);
 	mutex_init(&vdev->ioeventfds_lock);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+	mutex_init(&vdev->vma_lock);
+	INIT_LIST_HEAD(&vdev->vma_list);
 
 	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
 	if (ret)
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 36ec69081ecd..9b25f9f6ce1d 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -92,6 +92,11 @@  struct vfio_pci_vf_token {
 	int			users;
 };
 
+struct vfio_pci_mmap_vma {
+	struct vm_area_struct	*vma;
+	struct list_head	vma_next;
+};
+
 struct vfio_pci_device {
 	struct pci_dev		*pdev;
 	void __iomem		*barmap[PCI_STD_NUM_BARS];
@@ -132,6 +137,8 @@  struct vfio_pci_device {
 	struct list_head	ioeventfds_list;
 	struct vfio_pci_vf_token	*vf_token;
 	struct notifier_block	nb;
+	struct mutex		vma_lock;
+	struct list_head	vma_list;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)