diff mbox

VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

Message ID 20160126102003.GA14400@nvidia.com (mailing list archive)
State New, archived
Headers show

Commit Message

Neo Jia Jan. 26, 2016, 10:20 a.m. UTC
On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, January 26, 2016 5:30 AM
> > 
> > [cc +Neo @Nvidia]
> > 
> > Hi Jike,
> > 
> > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > I would expect we can spell out next level tasks toward above
> > > > direction, upon which Alex can easily judge whether there are
> > > > some common VFIO framework changes that he can help :-)
> > >
> > > Hi Alex,
> > >
> > > Here is a draft task list after a short discussion w/ Kevin,
> > > would you please have a look?
> > >
> > > 	Bus Driver
> > >
> > > 		{ in i915/vgt/xxx.c }
> > >
> > > 		- define a subset of vfio_pci interfaces
> > > 		- selective pass-through (say aperture)
> > > 		- trap MMIO: interface w/ QEMU
> > 
> > What's included in the subset?  Certainly the bus reset ioctls really
> > don't apply, but you'll need to support the full device interface,
> > right?  That includes the region info ioctl and access through the vfio
> > device file descriptor as well as the interrupt info and setup ioctls.
> 
> That is the next level detail Jike will figure out and discuss soon.
> 
> yes, basic region info/access should be necessary. For interrupt, could
> you elaborate a bit what current interface is doing? If just about creating
> an eventfd for virtual interrupt injection, it applies to vgpu too.
> 
> > 
> > > 	IOMMU
> > >
> > > 		{ in a new vfio_xxx.c }
> > >
> > > 		- allocate: struct device & IOMMU group
> > 
> > It seems like the vgpu instance management would do this.
> > 
> > > 		- map/unmap functions for vgpu
> > > 		- rb-tree to maintain iova/hpa mappings
> > 
> > Yep, pretty much what type1 does now, but without mapping through the
> > IOMMU API.  Essentially just a database of the current userspace
> > mappings that can be accessed for page pinning and IOVA->HPA
> > translation.
> 
> The thought is to reuse iommu_type1.c, by abstracting several underlying
> operations and then put vgpu specific implementation in a vfio_vgpu.c (e.g.
> for map/unmap instead of using IOMMU API, an iova/hpa mapping is updated
> accordingly), etc.
> 
> This file will also connect between VFIO and vendor specific vgpu driver,
> e.g. exposing interfaces to allow the latter querying iova<->hpa and also 
> creating necessary VFIO structures like aforementioned device/IOMMUas...
> 
> > 
> > > 		- interacts with kvmgt.c
> > >
> > >
> > > 	vgpu instance management
> > >
> > > 		{ in i915 }
> > >
> > > 		- path, create/destroy
> > >
> > 
> > Yes, and since you're creating and destroying the vgpu here, this is
> > where I'd expect a struct device to be created and added to an IOMMU
> > group.  The lifecycle management should really include links between
> > the vGPU and physical GPU, which would be much, much easier to do with
> > struct devices create here rather than at the point where we start
> > doing vfio "stuff".
> 
> It's invoked here, but expecting the function exposed by vfio_vgpu.c. It's
> not good to touch vfio internal structures from another module (such as
> i915.ko)
> 
> > 
> > Nvidia has also been looking at this and has some ideas how we might
> > standardize on some of the interfaces and create a vgpu framework to
> > help share code between vendors and hopefully make a more consistent
> > userspace interface for libvirt as well.  I'll let Neo provide some
> > details.  Thanks,
> > 
> 
> Nice to know that. Neo, please share your thought here.

Hi Alex, Kevin and Jike,

(Seems I shouldn't use attachment, resend it again to the list, patches are
inline at the end)

Thanks for adding me to this technical discussion, a great opportunity
for us to design together which can bring both Intel and NVIDIA vGPU solution to
KVM platform.

Instead of directly jumping to the proposal that we have been working on
recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple
quick comments / thoughts regarding the existing discussions on this thread as
fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.

Then we can look at what we have, hopefully we can reach some consensus soon.

> Yes, and since you're creating and destroying the vgpu here, this is
> where I'd expect a struct device to be created and added to an IOMMU
> group.  The lifecycle management should really include links between
> the vGPU and physical GPU, which would be much, much easier to do with
> struct devices create here rather than at the point where we start
> doing vfio "stuff".

Infact to keep vfio-vgpu to be more generic, vgpu device creation and management
can be centralized and done in vfio-vgpu. That also include adding to IOMMU
group and VFIO group.

Graphics driver can register with vfio-vgpu to get management and emulation call
backs to graphics driver.   

We already have struct vgpu_device in our proposal that keeps pointer to
physical device.  

> - vfio_pci will inject an IRQ to guest only when physical IRQ
> generated; whereas vfio_vgpu may inject an IRQ for emulation
> purpose. Anyway they can share the same injection interface;

eventfd to inject the interrupt is known to vfio-vgpu, that fd should be
available to graphics driver so that graphics driver can inject interrupts
directly when physical device triggers interrupt. 

Here is the proposal we have, please review.

Please note the patches we have put out here is mainly for POC purpose to
verify our understanding also can serve the purpose to reduce confusions and speed up 
our design, although we are very happy to refine that to something eventually
can be used for both parties and upstreamed.

Linux vGPU kernel design

Comments

Tian, Kevin Jan. 26, 2016, 7:24 p.m. UTC | #1
> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Tuesday, January 26, 2016 6:21 PM
> 
> 0. High level overview
> =====================================================
> =============================
> 
> 
>   user space:
>                                 +-----------+  VFIO IOMMU IOCTLs
>                       +---------| QEMU VFIO |-------------------------+
>         VFIO IOCTLs   |         +-----------+                         |
>                       |                                               |
>  ---------------------|-----------------------------------------------|---------
>                       |                                               |
>   kernel space:       |  +--->----------->---+  (callback)            V
>                       |  |                   v                 +------V-----+
>   +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
>   |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
>   | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+
>   |          |   |          |     | (register)           ^         ||
>   +----------+   +-------+--+     |    +-----------+     |         ||
>                          V        +----| i915.ko   +-----+     +---VV-------+
>                          |             +-----^-----+           | TYPE1      |
>                          |  (callback)       |                 | IOMMU      |
>                          +-->------------>---+                 +------------+
>  access flow:
> 
>   Guest MMIO / PCI config access
>   |
>   -------------------------------------------------
>   |
>   +-----> KVM VM_EXITs  (kernel)
>           |
>   -------------------------------------------------
>           |
>           +-----> QEMU VFIO driver (user)
>                   |
>   -------------------------------------------------
>                   |
>                   +---->  VGPU kernel driver (kernel)
>                           |
>                           |
>                           +----> vendor driver callback
> 
> 

There is one difference between nvidia and intel implementations. We have
vgpu device model in kernel, as part of i915.ko. So I/O emulation requests
are forwarded directly in kernel side. 

Thanks
Kevin
Neo Jia Jan. 26, 2016, 7:29 p.m. UTC | #2
On Tue, Jan 26, 2016 at 07:24:52PM +0000, Tian, Kevin wrote:
> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Tuesday, January 26, 2016 6:21 PM
> > 
> > 0. High level overview
> > =====================================================
> > =============================
> > 
> > 
> >   user space:
> >                                 +-----------+  VFIO IOMMU IOCTLs
> >                       +---------| QEMU VFIO |-------------------------+
> >         VFIO IOCTLs   |         +-----------+                         |
> >                       |                                               |
> >  ---------------------|-----------------------------------------------|---------
> >                       |                                               |
> >   kernel space:       |  +--->----------->---+  (callback)            V
> >                       |  |                   v                 +------V-----+
> >   +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
> >   |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
> >   | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+
> >   |          |   |          |     | (register)           ^         ||
> >   +----------+   +-------+--+     |    +-----------+     |         ||
> >                          V        +----| i915.ko   +-----+     +---VV-------+
> >                          |             +-----^-----+           | TYPE1      |
> >                          |  (callback)       |                 | IOMMU      |
> >                          +-->------------>---+                 +------------+
> >  access flow:
> > 
> >   Guest MMIO / PCI config access
> >   |
> >   -------------------------------------------------
> >   |
> >   +-----> KVM VM_EXITs  (kernel)
> >           |
> >   -------------------------------------------------
> >           |
> >           +-----> QEMU VFIO driver (user)
> >                   |
> >   -------------------------------------------------
> >                   |
> >                   +---->  VGPU kernel driver (kernel)
> >                           |
> >                           |
> >                           +----> vendor driver callback
> > 
> > 
> 
> There is one difference between nvidia and intel implementations. We have
> vgpu device model in kernel, as part of i915.ko. So I/O emulation requests
> are forwarded directly in kernel side. 

Hi Kevin,

With the vendor driver callback, it will always forward to the kernel driver. If
you are talking about the QEMU VFIO driver (user) part I put on the above
diagram, that is how QEMU VFIO handles MMIO or pci config access today, which we
don't change anything here in this design.

Thanks,
Neo


> 
> Thanks
> Kevin
Alex Williamson Jan. 26, 2016, 8:06 p.m. UTC | #3
On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]

> Hi Alex, Kevin and Jike,

> (Seems I shouldn't use attachment, resend it again to the list, patches are
> inline at the end)

> Thanks for adding me to this technical discussion, a great opportunity
> for us to design together which can bring both Intel and NVIDIA vGPU solution to
> KVM platform.

> Instead of directly jumping to the proposal that we have been working on
> recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple
> quick comments / thoughts regarding the existing discussions on this thread as
> fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.

> Then we can look at what we have, hopefully we can reach some consensus soon.

> > Yes, and since you're creating and destroying the vgpu here, this is
> > where I'd expect a struct device to be created and added to an IOMMU
> > group.  The lifecycle management should really include links between
> > the vGPU and physical GPU, which would be much, much easier to do with
> > struct devices create here rather than at the point where we start
> > doing vfio "stuff".

> Infact to keep vfio-vgpu to be more generic, vgpu device creation and management
> can be centralized and done in vfio-vgpu. That also include adding to IOMMU
> group and VFIO group.

Is this really a good idea?  The concept of a vgpu is not unique to
vfio, we want vfio to be a driver for a vgpu, not an integral part of
the lifecycle of a vgpu.  That certainly doesn't exclude adding
infrastructure to make lifecycle management of a vgpu more consistent
between drivers, but it should be done independently of vfio.  I'll go
back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
does not create the VF, that's done in coordination with the PF making
use of some PCI infrastructure for consistency between drivers.

It seems like we need to take more advantage of the class and driver
core support to perhaps setup a vgpu bus and class with vfio-vgpu just
being a driver for those devices.

> Graphics driver can register with vfio-vgpu to get management and emulation call
> backs to graphics driver.   

> We already have struct vgpu_device in our proposal that keeps pointer to
> physical device.  

> > - vfio_pci will inject an IRQ to guest only when physical IRQ
> > generated; whereas vfio_vgpu may inject an IRQ for emulation
> > purpose. Anyway they can share the same injection interface;

> eventfd to inject the interrupt is known to vfio-vgpu, that fd should be
> available to graphics driver so that graphics driver can inject interrupts
> directly when physical device triggers interrupt. 

> Here is the proposal we have, please review.

> Please note the patches we have put out here is mainly for POC purpose to
> verify our understanding also can serve the purpose to reduce confusions and speed up 
> our design, although we are very happy to refine that to something eventually
> can be used for both parties and upstreamed.

> Linux vGPU kernel design
> ==================================================================================

> Here we are proposing a generic Linux kernel module based on VFIO framework
> which allows different GPU vendors to plugin and provide their GPU virtualization
> solution on KVM, the benefits of having such generic kernel module are:

> 1) Reuse QEMU VFIO driver, supporting VFIO UAPI

> 2) GPU HW agnostic management API for upper layer software such as libvirt

> 3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendor

> 0. High level overview
> ==================================================================================

>  
>   user space:
>                                 +-----------+  VFIO IOMMU IOCTLs
>                       +---------| QEMU VFIO |-------------------------+
>         VFIO IOCTLs   |         +-----------+                         |
>                       |                                               | 
>  ---------------------|-----------------------------------------------|---------
>                       |                                               |
>   kernel space:       |  +--->----------->---+  (callback)            V
>                       |  |                   v                 +------V-----+
>   +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
>   |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
>   | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+ 
>   |          |   |          |     | (register)           ^         ||
>   +----------+   +-------+--+     |    +-----------+     |         ||
>                          V        +----| i915.ko   +-----+     +---VV-------+ 
>                          |             +-----^-----+           | TYPE1      |
>                          |  (callback)       |                 | IOMMU      |
>                          +-->------------>---+                 +------------+
>  access flow:

>   Guest MMIO / PCI config access
>   |
>   -------------------------------------------------
>   |
>   +-----> KVM VM_EXITs  (kernel)
>           |
>   -------------------------------------------------
>           |
>           +-----> QEMU VFIO driver (user)
>                   | 
>   -------------------------------------------------
>                   |
>                   +---->  VGPU kernel driver (kernel)
>                           |  
>                           | 
>                           +----> vendor driver callback


> 1. VGPU management interface
> ==================================================================================

> This is the interface allows upper layer software (mostly libvirt) to query and
> configure virtual GPU device in a HW agnostic fashion. Also, this management
> interface has provided flexibility to underlying GPU vendor to support virtual
> device hotplug, multiple virtual devices per VM, multiple virtual devices from
> different physical devices, etc.

> 1.1 Under per-physical device sysfs:
> ----------------------------------------------------------------------------------

> vgpu_supported_types - RO, list the current supported virtual GPU types and its
> VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> "vgpu_supported_types".
>                             
> vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> gpu device on a target physical GPU. idx: virtual device index inside a VM

> vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> target physical GPU


I've noted in previous discussions that we need to separate user policy
from kernel policy here, the kernel policy should not require a "VM
UUID".  A UUID simply represents a set of one or more devices and an
index picks the device within the set.  Whether that UUID matches a VM
or is independently used is up to the user policy when creating the
device.

Personally I'd also prefer to get rid of the concept of indexes within a
UUID set of devices and instead have each device be independent.  This
seems to be an imposition on the nvidia implementation into the kernel
interface design.


> 1.3 Under vgpu class sysfs:
> ----------------------------------------------------------------------------------

> vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
> interface to notify the GPU vendor driver to commit virtual GPU resource for
> this target VM. 

> Also, the vgpu_start function is a synchronized call, the successful return of
> this call will indicate all the requested vGPU resource has been fully
> committed, the VMM should continue.

> vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration
> interface to notify the GPU vendor driver to release virtual GPU resource of
> this target VM.

> 1.4 Virtual device Hotplug
> ----------------------------------------------------------------------------------

> To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
> accessed during VM runtime, and the corresponding registration callback will be
> invoked to allow GPU vendor support hotplug.

> To support hotplug, vendor driver would take necessary action to handle the
> situation when a vgpu_create is done on a VM_UUID after vgpu_start, and that
> implies both create and start for that vgpu device.

> Same, vgpu_destroy implies a vgpu_shudown on a running VM only if vendor driver
> supports vgpu hotplug.

> If hotplug is not supported and VM is still running, vendor driver can return
> error code to indicate not supported.

> Separate create from start gives flixibility to have:

> - multiple vgpu instances for single VM and
> - hotplug feature.

> 2. GPU driver vendor registration interface
> ==================================================================================

> 2.1 Registration interface definition (include/linux/vgpu.h)
> ----------------------------------------------------------------------------------

> extern int vgpu_register_device(struct pci_dev *dev, 
>                                 const struct gpu_device_ops *ops);

> extern void vgpu_unregister_device(struct pci_dev *dev);

> /**
>  * struct gpu_device_ops - Structure to be registered for each physical GPU to
>  * register the device to vgpu module.
>  *
>  * @owner:                      The module owner.
>  * @vgpu_supported_config:      Called to get information about supported vgpu
>  * types.
>  *                              @dev : pci device structure of physical GPU. 
>  *                              @config: should return string listing supported
>  *                              config
>  *                              Returns integer: success (0) or error (< 0)
>  * @vgpu_create:                Called to allocate basic resouces in graphics
>  *                              driver for a particular vgpu.
>  *                              @dev: physical pci device structure on which
>  *                              vgpu 
>  *                                    should be created
>  *                              @vm_uuid: VM's uuid for which VM it is intended
>  *                              to
>  *                              @instance: vgpu instance in that VM
>  *                              @vgpu_id: This represents the type of vgpu to be
>  *                                        created
>  *                              Returns integer: success (0) or error (< 0)
>  * @vgpu_destroy:               Called to free resources in graphics driver for
>  *                              a vgpu instance of that VM.
>  *                              @dev: physical pci device structure to which
>  *                              this vgpu points to.
>  *                              @vm_uuid: VM's uuid for which the vgpu belongs
>  *                              to.
>  *                              @instance: vgpu instance in that VM
>  *                              Returns integer: success (0) or error (< 0)
>  *                              If VM is running and vgpu_destroy is called that 
>  *                              means the vGPU is being hotunpluged. Return
>  *                              error
>  *                              if VM is running and graphics driver doesn't
>  *                              support vgpu hotplug.
>  * @vgpu_start:                 Called to do initiate vGPU initialization
>  *                              process in graphics driver when VM boots before
>  *                              qemu starts.
>  *                              @vm_uuid: VM's UUID which is booting.
>  *                              Returns integer: success (0) or error (< 0)
>  * @vgpu_shutdown:              Called to teardown vGPU related resources for
>  *                              the VM
>  *                              @vm_uuid: VM's UUID which is shutting down .
>  *                              Returns integer: success (0) or error (< 0)
>  * @read:                       Read emulation callback
>  *                              @vdev: vgpu device structure
>  *                              @buf: read buffer
>  *                              @count: number bytes to read 
>  *                              @address_space: specifies for which address
>  *                              space
>  *                              the request is: pci_config_space, IO register
>  *                              space or MMIO space.
>  *                              Retuns number on bytes read on success or error.
>  * @write:                      Write emulation callback
>  *                              @vdev: vgpu device structure
>  *                              @buf: write buffer
>  *                              @count: number bytes to be written
>  *                              @address_space: specifies for which address
>  *                              space
>  *                              the request is: pci_config_space, IO register
>  *                              space or MMIO space.
>  *                              Retuns number on bytes written on success or
>  *                              error.
>  * @vgpu_set_irqs:              Called to send about interrupts configuration
>  *                              information that qemu set. 
>  *                              @vdev: vgpu device structure
>  *                              @flags, index, start, count and *data : same as
>  *                              that of struct vfio_irq_set of
>  *                              VFIO_DEVICE_SET_IRQS API. 
>  *
>  * Physical GPU that support vGPU should be register with vgpu module with 
>  * gpu_device_ops structure.
>  */

> struct gpu_device_ops {
>         struct module   *owner;
>         int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
>         int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
>                                uint32_t instance, uint32_t vgpu_id);
>         int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
>                                 uint32_t instance);
>         int     (*vgpu_start)(uuid_le vm_uuid);
>         int     (*vgpu_shutdown)(uuid_le vm_uuid);
>         ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
>                          uint32_t address_space, loff_t pos);
>         ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
>                          uint32_t address_space,loff_t pos);
>         int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
>                                  unsigned index, unsigned start, unsigned count,
>                                  void *data);

> };


I wonder if it shouldn't be vfio-vgpu sub-drivers (ie, Intel and Nvidia)
that register these ops with the main vfio-vgpu driver and they should
also include a probe() function which allows us to associate a given
vgpu device with a set of vendor ops.

> 2.2 Details for callbacks we haven't mentioned above.
> ---------------------------------------------------------------------------------

> vgpu_supported_config: allows the vendor driver to specify the supported vGPU
>                        type/configuration

> vgpu_create          : create a virtual GPU device, can be used for device hotplug.

> vgpu_destroy         : destroy a virtual GPU device, can be used for device hotplug.

> vgpu_start           : callback function to notify vendor driver vgpu device
>                        come to live for a given virtual machine.

> vgpu_shutdown        : callback function to notify vendor driver 

> read                 : callback to vendor driver to handle virtual device config
>                        space or MMIO read access

> write                : callback to vendor driver to handle virtual device config
>                        space or MMIO write access

> vgpu_set_irqs        : callback to vendor driver to pass along the interrupt
>                        information for the target virtual device, then vendor
>                        driver can inject interrupt into virtual machine for this
>                        device.

> 2.3 Potential additional virtual device configuration registration interface:
> ---------------------------------------------------------------------------------

> callback function to describe the MMAP behavior of the virtual GPU 

> callback function to allow GPU vendor driver to provide PCI config space backing
> memory.

> 3. VGPU TYPE1 IOMMU
> ==================================================================================

> Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track the 
> <iova, hva, size, flag> and save the QEMU mm for later reference.

> You can find the quick/ugly implementation in the attached patch file, which is
> actually just a simple version Alex's type1 IOMMU without actual real
> mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called. 

> We have thought about providing another vendor driver registration interface so
> such tracking information will be sent to vendor driver and he will use the QEMU
> mm to do the get_user_pages / remap_pfn_range when it is required. After doing a
> quick implementation within our driver, I noticed following issues:

> 1) OS/VFIO logic into vendor driver which will be a maintenance issue.

> 2) Every driver vendor has to implement their own RB tree, instead of reusing
> the common existing VFIO code (vfio_find/link/unlink_dma) 

> 3) IOMMU_UNMAP_DMA is expecting to get "unmapped bytes" back to the caller/QEMU,
> better not have anything inside a vendor driver that the VFIO caller immediately
> depends on.

> Based on the above consideration, we decide to implement the DMA tracking logic
> within VGPU TYPE1 IOMMU code (ideally, this should be merged into current TYPE1
> IOMMU code) and expose two symbols to outside for MMIO mapping and page
> translation and pinning. 

> Also, with a mmap MMIO interface between virtual and physical, this allows
> para-virtualized guest driver can access his virtual MMIO without taking a MMAP
> fault hit, also we can support different MMIO size between virtual and physical
> device.

> int vgpu_map_virtual_bar
> (
>     uint64_t virt_bar_addr,
>     uint64_t phys_bar_addr,
>     uint32_t len,
>     uint32_t flags
> )

> EXPORT_SYMBOL(vgpu_map_virtual_bar);


Per the implementation provided, this needs to be implemented in the
vfio device driver, not in the iommu interface.  Finding the DMA mapping
of the device and replacing it is wrong.  It should be remapped at the
vfio device file interface using vm_ops.


> int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

> EXPORT_SYMBOL(vgpu_dma_do_translate);

> Still a lot to be added and modified, such as supporting multiple VMs and 
> multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
> kernel driver, error handling, roll-back and locked memory size per user, etc. 

Particularly, handling of mapping changes is completely missing.  This
cannot be a point in time translation, the user is free to remap
addresses whenever they wish and device translations need to be updated
accordingly.


> 4. Modules
> ==================================================================================

> Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko

> vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
>                            TYPE1 v1 and v2 interface. 

Depending on how intrusive it is, this can possibly by done within the
existing type1 driver.  Either that or we can split out common code for
use by a separate module.

> vgpu.ko                  - provide registration interface and virtual device
>                            VFIO access.

> 5. QEMU note
> ==================================================================================

> To allow us focus on the VGPU kernel driver prototyping, we have introduced a new VFIO 
> class - vgpu inside QEMU, so we don't have to change the existing vfio/pci.c file and 
> use it as a reference for our implementation. It is basically just a quick c & p
> from vfio/pci.c to quickly meet our needs.

> Once this proposal is finalized, we will move to vfio/pci.c instead of a new
> class, and probably the only thing required is to have a new way to discover the
> device.

> 6. Examples
> ==================================================================================

> On this server, we have two NVIDIA M60 GPUs.

> [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)

> After nvidia.ko gets initialized, we can query the supported vGPU type by
> accessing the "vgpu_supported_types" like following:

> [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
> 11:GRID M60-0B
> 12:GRID M60-0Q
> 13:GRID M60-1B
> 14:GRID M60-1Q
> 15:GRID M60-2B
> 16:GRID M60-2Q
> 17:GRID M60-4Q
> 18:GRID M60-8Q

> For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
> like to create "GRID M60-4Q" VM on it.

> echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create

> Note: the number 0 here is for vGPU device index. So far the change is not tested
> for multiple vgpu devices yet, but we will support it.

> At this moment, if you query the "vgpu_supported_types" it will still show all
> supported virtual GPU types as no virtual GPU resource is committed yet.

> Starting VM:

> echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start

> then, the supported vGPU type query will return:

> [root@cjia-vgx-kvm /home/cjia]$
> > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> 17:GRID M60-4Q

> So vgpu_supported_config needs to be called whenever a new virtual device gets
> created as the underlying HW might limit the supported types if there are
> any existing VM runnings.

> Then, VM gets shutdown, writes to /sys/class/vgpu/vgpu_shutdown will info the
> GPU driver vendor to clean up resource.

> Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
> device sysfs.


I'd like to hear Intel's thoughts on this interface.  Are there
different vgpu capacities or priority classes that would necessitate
different types of vcpus on Intel?

I think there are some gaps in translating from named vgpu types to
indexes here, along with my previous mention of the UUID/set oddity.

Does Intel have a need for start and shutdown interfaces?

Neo, wasn't there at some point information about how many of each type
could be supported through these interfaces?  How does a user know their
capacity limits?

Thanks,
Alex
Tian, Kevin Jan. 26, 2016, 9:38 p.m. UTC | #4
> From: Alex Williamson [mailto:alex.williamson@redhat.com]

> Sent: Wednesday, January 27, 2016 4:06 AM

> 

> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:

> > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:

> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]

> >

> > Hi Alex, Kevin and Jike,

> >

> > (Seems I shouldn't use attachment, resend it again to the list, patches are

> > inline at the end)

> >

> > Thanks for adding me to this technical discussion, a great opportunity

> > for us to design together which can bring both Intel and NVIDIA vGPU solution to

> > KVM platform.

> >

> > Instead of directly jumping to the proposal that we have been working on

> > recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple

> > quick comments / thoughts regarding the existing discussions on this thread as

> > fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.

> >

> > Then we can look at what we have, hopefully we can reach some consensus soon.

> >

> > > Yes, and since you're creating and destroying the vgpu here, this is

> > > where I'd expect a struct device to be created and added to an IOMMU

> > > group.  The lifecycle management should really include links between

> > > the vGPU and physical GPU, which would be much, much easier to do with

> > > struct devices create here rather than at the point where we start

> > > doing vfio "stuff".

> >

> > Infact to keep vfio-vgpu to be more generic, vgpu device creation and management

> > can be centralized and done in vfio-vgpu. That also include adding to IOMMU

> > group and VFIO group.

> 

> Is this really a good idea?  The concept of a vgpu is not unique to

> vfio, we want vfio to be a driver for a vgpu, not an integral part of

> the lifecycle of a vgpu.  That certainly doesn't exclude adding

> infrastructure to make lifecycle management of a vgpu more consistent

> between drivers, but it should be done independently of vfio.  I'll go

> back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio

> does not create the VF, that's done in coordination with the PF making

> use of some PCI infrastructure for consistency between drivers.

> 

> It seems like we need to take more advantage of the class and driver

> core support to perhaps setup a vgpu bus and class with vfio-vgpu just

> being a driver for those devices.


Agree with Alex here. Even if we want to do more abstraction of overall
vgpu management, here let's stick to necessary changes within VFIO 
scope.


> >

> > 6. Examples

> >

> =====================================================

> =============================

> >

> > On this server, we have two NVIDIA M60 GPUs.

> >

> > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2

> > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)

> > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)

> >

> > After nvidia.ko gets initialized, we can query the supported vGPU type by

> > accessing the "vgpu_supported_types" like following:

> >

> > [root@cjia-vgx-kvm ~]# cat

> /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types

> > 11:GRID M60-0B

> > 12:GRID M60-0Q

> > 13:GRID M60-1B

> > 14:GRID M60-1Q

> > 15:GRID M60-2B

> > 16:GRID M60-2Q

> > 17:GRID M60-4Q

> > 18:GRID M60-8Q

> >

> > For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would

> > like to create "GRID M60-4Q" VM on it.

> >

> > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" >

> /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create

> >

> > Note: the number 0 here is for vGPU device index. So far the change is not tested

> > for multiple vgpu devices yet, but we will support it.

> >

> > At this moment, if you query the "vgpu_supported_types" it will still show all

> > supported virtual GPU types as no virtual GPU resource is committed yet.

> >

> > Starting VM:

> >

> > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start

> >

> > then, the supported vGPU type query will return:

> >

> > [root@cjia-vgx-kvm /home/cjia]$

> > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types

> > 17:GRID M60-4Q

> >

> > So vgpu_supported_config needs to be called whenever a new virtual device gets

> > created as the underlying HW might limit the supported types if there are

> > any existing VM runnings.

> >

> > Then, VM gets shutdown, writes to /sys/class/vgpu/vgpu_shutdown will info the

> > GPU driver vendor to clean up resource.

> >

> > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under

> > device sysfs.

> 

> 

> I'd like to hear Intel's thoughts on this interface.  Are there

> different vgpu capacities or priority classes that would necessitate

> different types of vcpus on Intel?


We'll evaluate this proposal with our requirement. A quick comment is
that we don't have such type thing required. We just expose the same
type of vgpu as the underlying platform. On the other hand, our
implementation gives flexibility to user to control resource allocation 
(e.g. video memory) to different VMs, instead of a fixed partition
scheme, so we have an interface to query remaining free resources.

> 

> Does Intel have a need for start and shutdown interfaces?


No for now. But we can extend to support such interface which provides
more flexibility to separate resource allocation from run-time control.

Given that nvidia/intel do have specific requirement on vgpu management,
I'd suggest that we focus on VFIO change first. After that we can evaluate
how much commonality of vgpu management upon which to evaluate 
whether to have a common vgpu framework or just stay with vendor
specific implementation for that part.

Thanks,
Kevin
Neo Jia Jan. 26, 2016, 10:28 p.m. UTC | #5
On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > 
> > Hi Alex, Kevin and Jike,
> > 
> > (Seems I shouldn't use attachment, resend it again to the list, patches are
> > inline at the end)
> > 
> > Thanks for adding me to this technical discussion, a great opportunity
> > for us to design together which can bring both Intel and NVIDIA vGPU solution to
> > KVM platform.
> > 
> > Instead of directly jumping to the proposal that we have been working on
> > recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple
> > quick comments / thoughts regarding the existing discussions on this thread as
> > fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.
> > 
> > Then we can look at what we have, hopefully we can reach some consensus soon.
> > 
> > > Yes, and since you're creating and destroying the vgpu here, this is
> > > where I'd expect a struct device to be created and added to an IOMMU
> > > group.  The lifecycle management should really include links between
> > > the vGPU and physical GPU, which would be much, much easier to do with
> > > struct devices create here rather than at the point where we start
> > > doing vfio "stuff".
> > 
> > Infact to keep vfio-vgpu to be more generic, vgpu device creation and management
> > can be centralized and done in vfio-vgpu. That also include adding to IOMMU
> > group and VFIO group.
> 
> Is this really a good idea?  The concept of a vgpu is not unique to
> vfio, we want vfio to be a driver for a vgpu, not an integral part of
> the lifecycle of a vgpu.  That certainly doesn't exclude adding
> infrastructure to make lifecycle management of a vgpu more consistent
> between drivers, but it should be done independently of vfio.  I'll go
> back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
> does not create the VF, that's done in coordination with the PF making
> use of some PCI infrastructure for consistency between drivers.
> 
> It seems like we need to take more advantage of the class and driver
> core support to perhaps setup a vgpu bus and class with vfio-vgpu just
> being a driver for those devices.
> 
> > Graphics driver can register with vfio-vgpu to get management and emulation call
> > backs to graphics driver.   
> > 
> > We already have struct vgpu_device in our proposal that keeps pointer to
> > physical device.  
> > 
> > > - vfio_pci will inject an IRQ to guest only when physical IRQ
> > > generated; whereas vfio_vgpu may inject an IRQ for emulation
> > > purpose. Anyway they can share the same injection interface;
> > 
> > eventfd to inject the interrupt is known to vfio-vgpu, that fd should be
> > available to graphics driver so that graphics driver can inject interrupts
> > directly when physical device triggers interrupt. 
> > 
> > Here is the proposal we have, please review.
> > 
> > Please note the patches we have put out here is mainly for POC purpose to
> > verify our understanding also can serve the purpose to reduce confusions and speed up 
> > our design, although we are very happy to refine that to something eventually
> > can be used for both parties and upstreamed.
> > 
> > Linux vGPU kernel design
> > ==================================================================================
> > 
> > Here we are proposing a generic Linux kernel module based on VFIO framework
> > which allows different GPU vendors to plugin and provide their GPU virtualization
> > solution on KVM, the benefits of having such generic kernel module are:
> > 
> > 1) Reuse QEMU VFIO driver, supporting VFIO UAPI
> > 
> > 2) GPU HW agnostic management API for upper layer software such as libvirt
> > 
> > 3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendor
> > 
> > 0. High level overview
> > ==================================================================================
> > 
> >  
> >   user space:
> >                                 +-----------+  VFIO IOMMU IOCTLs
> >                       +---------| QEMU VFIO |-------------------------+
> >         VFIO IOCTLs   |         +-----------+                         |
> >                       |                                               | 
> >  ---------------------|-----------------------------------------------|---------
> >                       |                                               |
> >   kernel space:       |  +--->----------->---+  (callback)            V
> >                       |  |                   v                 +------V-----+
> >   +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
> >   |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
> >   | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+ 
> >   |          |   |          |     | (register)           ^         ||
> >   +----------+   +-------+--+     |    +-----------+     |         ||
> >                          V        +----| i915.ko   +-----+     +---VV-------+ 
> >                          |             +-----^-----+           | TYPE1      |
> >                          |  (callback)       |                 | IOMMU      |
> >                          +-->------------>---+                 +------------+
> >  access flow:
> > 
> >   Guest MMIO / PCI config access
> >   |
> >   -------------------------------------------------
> >   |
> >   +-----> KVM VM_EXITs  (kernel)
> >           |
> >   -------------------------------------------------
> >           |
> >           +-----> QEMU VFIO driver (user)
> >                   | 
> >   -------------------------------------------------
> >                   |
> >                   +---->  VGPU kernel driver (kernel)
> >                           |  
> >                           | 
> >                           +----> vendor driver callback
> > 
> > 
> > 1. VGPU management interface
> > ==================================================================================
> > 
> > This is the interface allows upper layer software (mostly libvirt) to query and
> > configure virtual GPU device in a HW agnostic fashion. Also, this management
> > interface has provided flexibility to underlying GPU vendor to support virtual
> > device hotplug, multiple virtual devices per VM, multiple virtual devices from
> > different physical devices, etc.
> > 
> > 1.1 Under per-physical device sysfs:
> > ----------------------------------------------------------------------------------
> > 
> > vgpu_supported_types - RO, list the current supported virtual GPU types and its
> > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> > "vgpu_supported_types".
> >                             
> > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > 
> > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> > target physical GPU
> 
> 
> I've noted in previous discussions that we need to separate user policy
> from kernel policy here, the kernel policy should not require a "VM
> UUID".  A UUID simply represents a set of one or more devices and an
> index picks the device within the set.  Whether that UUID matches a VM
> or is independently used is up to the user policy when creating the
> device.
> 
> Personally I'd also prefer to get rid of the concept of indexes within a
> UUID set of devices and instead have each device be independent.  This
> seems to be an imposition on the nvidia implementation into the kernel
> interface design.
> 

Hi Alex,

I agree with you that we should not put UUID concept into a kernel API. At
this point (without any prototyping), I am thinking of using a list of virtual
devices instead of UUID.

> 
> > 1.3 Under vgpu class sysfs:
> > ----------------------------------------------------------------------------------
> > 
> > vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
> > interface to notify the GPU vendor driver to commit virtual GPU resource for
> > this target VM. 
> > 
> > Also, the vgpu_start function is a synchronized call, the successful return of
> > this call will indicate all the requested vGPU resource has been fully
> > committed, the VMM should continue.
> > 
> > vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration
> > interface to notify the GPU vendor driver to release virtual GPU resource of
> > this target VM.
> > 
> > 1.4 Virtual device Hotplug
> > ----------------------------------------------------------------------------------
> > 
> > To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
> > accessed during VM runtime, and the corresponding registration callback will be
> > invoked to allow GPU vendor support hotplug.
> > 
> > To support hotplug, vendor driver would take necessary action to handle the
> > situation when a vgpu_create is done on a VM_UUID after vgpu_start, and that
> > implies both create and start for that vgpu device.
> > 
> > Same, vgpu_destroy implies a vgpu_shudown on a running VM only if vendor driver
> > supports vgpu hotplug.
> > 
> > If hotplug is not supported and VM is still running, vendor driver can return
> > error code to indicate not supported.
> > 
> > Separate create from start gives flixibility to have:
> > 
> > - multiple vgpu instances for single VM and
> > - hotplug feature.
> > 
> > 2. GPU driver vendor registration interface
> > ==================================================================================
> > 
> > 2.1 Registration interface definition (include/linux/vgpu.h)
> > ----------------------------------------------------------------------------------
> > 
> > extern int vgpu_register_device(struct pci_dev *dev, 
> >                                 const struct gpu_device_ops *ops);
> > 
> > extern void vgpu_unregister_device(struct pci_dev *dev);
> > 
> > /**
> >  * struct gpu_device_ops - Structure to be registered for each physical GPU to
> >  * register the device to vgpu module.
> >  *
> >  * @owner:                      The module owner.
> >  * @vgpu_supported_config:      Called to get information about supported vgpu
> >  * types.
> >  *                              @dev : pci device structure of physical GPU. 
> >  *                              @config: should return string listing supported
> >  *                              config
> >  *                              Returns integer: success (0) or error (< 0)
> >  * @vgpu_create:                Called to allocate basic resouces in graphics
> >  *                              driver for a particular vgpu.
> >  *                              @dev: physical pci device structure on which
> >  *                              vgpu 
> >  *                                    should be created
> >  *                              @vm_uuid: VM's uuid for which VM it is intended
> >  *                              to
> >  *                              @instance: vgpu instance in that VM
> >  *                              @vgpu_id: This represents the type of vgpu to be
> >  *                                        created
> >  *                              Returns integer: success (0) or error (< 0)
> >  * @vgpu_destroy:               Called to free resources in graphics driver for
> >  *                              a vgpu instance of that VM.
> >  *                              @dev: physical pci device structure to which
> >  *                              this vgpu points to.
> >  *                              @vm_uuid: VM's uuid for which the vgpu belongs
> >  *                              to.
> >  *                              @instance: vgpu instance in that VM
> >  *                              Returns integer: success (0) or error (< 0)
> >  *                              If VM is running and vgpu_destroy is called that 
> >  *                              means the vGPU is being hotunpluged. Return
> >  *                              error
> >  *                              if VM is running and graphics driver doesn't
> >  *                              support vgpu hotplug.
> >  * @vgpu_start:                 Called to do initiate vGPU initialization
> >  *                              process in graphics driver when VM boots before
> >  *                              qemu starts.
> >  *                              @vm_uuid: VM's UUID which is booting.
> >  *                              Returns integer: success (0) or error (< 0)
> >  * @vgpu_shutdown:              Called to teardown vGPU related resources for
> >  *                              the VM
> >  *                              @vm_uuid: VM's UUID which is shutting down .
> >  *                              Returns integer: success (0) or error (< 0)
> >  * @read:                       Read emulation callback
> >  *                              @vdev: vgpu device structure
> >  *                              @buf: read buffer
> >  *                              @count: number bytes to read 
> >  *                              @address_space: specifies for which address
> >  *                              space
> >  *                              the request is: pci_config_space, IO register
> >  *                              space or MMIO space.
> >  *                              Retuns number on bytes read on success or error.
> >  * @write:                      Write emulation callback
> >  *                              @vdev: vgpu device structure
> >  *                              @buf: write buffer
> >  *                              @count: number bytes to be written
> >  *                              @address_space: specifies for which address
> >  *                              space
> >  *                              the request is: pci_config_space, IO register
> >  *                              space or MMIO space.
> >  *                              Retuns number on bytes written on success or
> >  *                              error.
> >  * @vgpu_set_irqs:              Called to send about interrupts configuration
> >  *                              information that qemu set. 
> >  *                              @vdev: vgpu device structure
> >  *                              @flags, index, start, count and *data : same as
> >  *                              that of struct vfio_irq_set of
> >  *                              VFIO_DEVICE_SET_IRQS API. 
> >  *
> >  * Physical GPU that support vGPU should be register with vgpu module with 
> >  * gpu_device_ops structure.
> >  */
> > 
> > struct gpu_device_ops {
> >         struct module   *owner;
> >         int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
> >         int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
> >                                uint32_t instance, uint32_t vgpu_id);
> >         int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
> >                                 uint32_t instance);
> >         int     (*vgpu_start)(uuid_le vm_uuid);
> >         int     (*vgpu_shutdown)(uuid_le vm_uuid);
> >         ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
> >                          uint32_t address_space, loff_t pos);
> >         ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
> >                          uint32_t address_space,loff_t pos);
> >         int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
> >                                  unsigned index, unsigned start, unsigned count,
> >                                  void *data);
> > 
> > };
> 
> 
> I wonder if it shouldn't be vfio-vgpu sub-drivers (ie, Intel and Nvidia)
> that register these ops with the main vfio-vgpu driver and they should
> also include a probe() function which allows us to associate a given
> vgpu device with a set of vendor ops.
> 
> 
> > 
> > 2.2 Details for callbacks we haven't mentioned above.
> > ---------------------------------------------------------------------------------
> > 
> > vgpu_supported_config: allows the vendor driver to specify the supported vGPU
> >                        type/configuration
> > 
> > vgpu_create          : create a virtual GPU device, can be used for device hotplug.
> > 
> > vgpu_destroy         : destroy a virtual GPU device, can be used for device hotplug.
> > 
> > vgpu_start           : callback function to notify vendor driver vgpu device
> >                        come to live for a given virtual machine.
> > 
> > vgpu_shutdown        : callback function to notify vendor driver 
> > 
> > read                 : callback to vendor driver to handle virtual device config
> >                        space or MMIO read access
> > 
> > write                : callback to vendor driver to handle virtual device config
> >                        space or MMIO write access
> > 
> > vgpu_set_irqs        : callback to vendor driver to pass along the interrupt
> >                        information for the target virtual device, then vendor
> >                        driver can inject interrupt into virtual machine for this
> >                        device.
> > 
> > 2.3 Potential additional virtual device configuration registration interface:
> > ---------------------------------------------------------------------------------
> > 
> > callback function to describe the MMAP behavior of the virtual GPU 
> > 
> > callback function to allow GPU vendor driver to provide PCI config space backing
> > memory.
> > 
> > 3. VGPU TYPE1 IOMMU
> > ==================================================================================
> > 
> > Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track the 
> > <iova, hva, size, flag> and save the QEMU mm for later reference.
> > 
> > You can find the quick/ugly implementation in the attached patch file, which is
> > actually just a simple version Alex's type1 IOMMU without actual real
> > mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called. 
> > 
> > We have thought about providing another vendor driver registration interface so
> > such tracking information will be sent to vendor driver and he will use the QEMU
> > mm to do the get_user_pages / remap_pfn_range when it is required. After doing a
> > quick implementation within our driver, I noticed following issues:
> > 
> > 1) OS/VFIO logic into vendor driver which will be a maintenance issue.
> > 
> > 2) Every driver vendor has to implement their own RB tree, instead of reusing
> > the common existing VFIO code (vfio_find/link/unlink_dma) 
> > 
> > 3) IOMMU_UNMAP_DMA is expecting to get "unmapped bytes" back to the caller/QEMU,
> > better not have anything inside a vendor driver that the VFIO caller immediately
> > depends on.
> > 
> > Based on the above consideration, we decide to implement the DMA tracking logic
> > within VGPU TYPE1 IOMMU code (ideally, this should be merged into current TYPE1
> > IOMMU code) and expose two symbols to outside for MMIO mapping and page
> > translation and pinning. 
> > 
> > Also, with a mmap MMIO interface between virtual and physical, this allows
> > para-virtualized guest driver can access his virtual MMIO without taking a MMAP
> > fault hit, also we can support different MMIO size between virtual and physical
> > device.
> > 
> > int vgpu_map_virtual_bar
> > (
> >     uint64_t virt_bar_addr,
> >     uint64_t phys_bar_addr,
> >     uint32_t len,
> >     uint32_t flags
> > )
> > 
> > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> 
> 
> Per the implementation provided, this needs to be implemented in the
> vfio device driver, not in the iommu interface.  Finding the DMA mapping
> of the device and replacing it is wrong.  It should be remapped at the
> vfio device file interface using vm_ops.
> 

So you are basically suggesting that we are going to take a mmap fault and
within that fault handler, we will go into vendor driver to look up the
"pre-registered" mapping and remap there.

Is my understanding correct?

> 
> > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > 
> > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > 
> > Still a lot to be added and modified, such as supporting multiple VMs and 
> > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
> > kernel driver, error handling, roll-back and locked memory size per user, etc. 
> 
> Particularly, handling of mapping changes is completely missing.  This
> cannot be a point in time translation, the user is free to remap
> addresses whenever they wish and device translations need to be updated
> accordingly.
> 

When you say "user", do you mean the QEMU? Here, whenever the DMA that
the guest driver is going to launch will be first pinned within VM, and then
registered to QEMU, therefore the IOMMU memory listener, eventually the pages
will be pinned by the GPU or DMA engine.

Since we are keeping the upper level code same, thinking about passthru case,
where the GPU has already put the real IOVA into his PTEs, I don't know how QEMU
can change that mapping without causing an IOMMU fault on a active DMA device.

> 
> > 4. Modules
> > ==================================================================================
> > 
> > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
> > 
> > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
> >                            TYPE1 v1 and v2 interface. 
> 
> Depending on how intrusive it is, this can possibly by done within the
> existing type1 driver.  Either that or we can split out common code for
> use by a separate module.
> 
> > vgpu.ko                  - provide registration interface and virtual device
> >                            VFIO access.
> > 
> > 5. QEMU note
> > ==================================================================================
> > 
> > To allow us focus on the VGPU kernel driver prototyping, we have introduced a new VFIO 
> > class - vgpu inside QEMU, so we don't have to change the existing vfio/pci.c file and 
> > use it as a reference for our implementation. It is basically just a quick c & p
> > from vfio/pci.c to quickly meet our needs.
> > 
> > Once this proposal is finalized, we will move to vfio/pci.c instead of a new
> > class, and probably the only thing required is to have a new way to discover the
> > device.
> > 
> > 6. Examples
> > ==================================================================================
> > 
> > On this server, we have two NVIDIA M60 GPUs.
> > 
> > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > 
> > After nvidia.ko gets initialized, we can query the supported vGPU type by
> > accessing the "vgpu_supported_types" like following:
> > 
> > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
> > 11:GRID M60-0B
> > 12:GRID M60-0Q
> > 13:GRID M60-1B
> > 14:GRID M60-1Q
> > 15:GRID M60-2B
> > 16:GRID M60-2Q
> > 17:GRID M60-4Q
> > 18:GRID M60-8Q
> > 
> > For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
> > like to create "GRID M60-4Q" VM on it.
> > 
> > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> > 
> > Note: the number 0 here is for vGPU device index. So far the change is not tested
> > for multiple vgpu devices yet, but we will support it.
> > 
> > At this moment, if you query the "vgpu_supported_types" it will still show all
> > supported virtual GPU types as no virtual GPU resource is committed yet.
> > 
> > Starting VM:
> > 
> > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> > 
> > then, the supported vGPU type query will return:
> > 
> > [root@cjia-vgx-kvm /home/cjia]$
> > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > 17:GRID M60-4Q
> > 
> > So vgpu_supported_config needs to be called whenever a new virtual device gets
> > created as the underlying HW might limit the supported types if there are
> > any existing VM runnings.
> > 
> > Then, VM gets shutdown, writes to /sys/class/vgpu/vgpu_shutdown will info the
> > GPU driver vendor to clean up resource.
> > 
> > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
> > device sysfs.
> 
> 
> I'd like to hear Intel's thoughts on this interface.  Are there
> different vgpu capacities or priority classes that would necessitate
> different types of vcpus on Intel?
> 
> I think there are some gaps in translating from named vgpu types to
> indexes here, along with my previous mention of the UUID/set oddity.
> 
> Does Intel have a need for start and shutdown interfaces?
> 
> Neo, wasn't there at some point information about how many of each type
> could be supported through these interfaces?  How does a user know their
> capacity limits?
> 

Thanks for reminding me that, I think we probably forget to put that *important*
information as the output of "vgpu_supported_types".

Regarding the capacity, we can provide the frame buffer size as part of the
"vgpu_supported_types" output as well, I would imagine those will be eventually
show up on the openstack management interface or virt-mgr.

Basically, yes there would be a separate col to show the number of instance you
can create for each type of VGPU on a specific physical GPU.

Thanks,
Neo


> Thanks,
> Alex
>
Alex Williamson Jan. 26, 2016, 11:30 p.m. UTC | #6
On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > 1.1 Under per-physical device sysfs:
> > > ----------------------------------------------------------------------------------
> > >  
> > > vgpu_supported_types - RO, list the current supported virtual GPU types and its
> > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> > > "vgpu_supported_types".
> > >                             
> > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> > > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > >  
> > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> > > target physical GPU
> > 
> > 
> > I've noted in previous discussions that we need to separate user policy
> > from kernel policy here, the kernel policy should not require a "VM
> > UUID".  A UUID simply represents a set of one or more devices and an
> > index picks the device within the set.  Whether that UUID matches a VM
> > or is independently used is up to the user policy when creating the
> > device.
> > 
> > Personally I'd also prefer to get rid of the concept of indexes within a
> > UUID set of devices and instead have each device be independent.  This
> > seems to be an imposition on the nvidia implementation into the kernel
> > interface design.
> > 

> Hi Alex,

> I agree with you that we should not put UUID concept into a kernel API. At
> this point (without any prototyping), I am thinking of using a list of virtual
> devices instead of UUID.

Hi Neo,

A UUID is a perfectly fine name, so long as we let it be just a UUID and
not the UUID matching some specific use case.

> > >  
> > > int vgpu_map_virtual_bar
> > > (
> > >     uint64_t virt_bar_addr,
> > >     uint64_t phys_bar_addr,
> > >     uint32_t len,
> > >     uint32_t flags
> > > )
> > >  
> > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > 
> > 
> > Per the implementation provided, this needs to be implemented in the
> > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > of the device and replacing it is wrong.  It should be remapped at the
> > vfio device file interface using vm_ops.
> > 

> So you are basically suggesting that we are going to take a mmap fault and
> within that fault handler, we will go into vendor driver to look up the
> "pre-registered" mapping and remap there.

> Is my understanding correct?

Essentially, hopefully the vendor driver will have already registered
the backing for the mmap prior to the fault, but either way could work.
I think the key though is that you want to remap it onto the vma
accessing the vfio device file, not scanning it out of an IOVA mapping
that might be dynamic and doing a vma lookup based on the point in time
mapping of the BAR.  The latter doesn't give me much confidence that
mappings couldn't change while the former should be a one time fault.

In case it's not clear to folks at Intel, the purpose of this is that a
vGPU may directly map a segment of the physical GPU MMIO space, but we
may not know what segment that is at setup time, when QEMU does an mmap
of the vfio device file descriptor.  The thought is that we can create
an invalid mapping when QEMU calls mmap(), knowing that it won't be
accessed until later, then we can fault in the real mmap on demand.  Do
you need anything similar?

> > 
> > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > >  
> > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > >  
> > > Still a lot to be added and modified, such as supporting multiple VMs and 
> > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
> > > kernel driver, error handling, roll-back and locked memory size per user, etc. 
> > 
> > Particularly, handling of mapping changes is completely missing.  This
> > cannot be a point in time translation, the user is free to remap
> > addresses whenever they wish and device translations need to be updated
> > accordingly.
> > 

> When you say "user", do you mean the QEMU?

vfio is a generic userspace driver interface, QEMU is a very, very
important user of the interface, but not the only user.  So for this
conversation, we're mostly talking about QEMU as the user, but we should
be careful about assuming QEMU is the only user.

> Here, whenever the DMA that
> the guest driver is going to launch will be first pinned within VM, and then
> registered to QEMU, therefore the IOMMU memory listener, eventually the pages
> will be pinned by the GPU or DMA engine.

> Since we are keeping the upper level code same, thinking about passthru case,
> where the GPU has already put the real IOVA into his PTEs, I don't know how QEMU
> can change that mapping without causing an IOMMU fault on a active DMA device.

For the virtual BAR mapping above, it's easy to imagine that mapping a
BAR to a given address is at the guest discretion, it may be mapped and
unmapped, it may be mapped to different addresses at different points in
time, the guest BIOS may choose to map it at yet another address, etc.
So if somehow we were trying to setup a mapping for peer-to-peer, there
are lots of ways that IOVA could change.  But even with RAM, we can
support memory hotplug in a VM.  What was once a DMA target may be
removed or may now be backed by something else.  Chipset configuration
on the emulated platform may change how guest physical memory appears
and that might change between VM boots.

Currently with physical device assignment the memory listener watches
for both maps and unmaps and updates the iotlb to match.  Just like real
hardware doing these same sorts of things, we rely on the guest to stop
using memory that's going to be moved as a DMA target prior to moving
it.

> > > 4. Modules
> > > ==================================================================================
> > >  
> > > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
> > >  
> > > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
> > >                            TYPE1 v1 and v2 interface. 
> > 
> > Depending on how intrusive it is, this can possibly by done within the
> > existing type1 driver.  Either that or we can split out common code for
> > use by a separate module.
> > 
> > > vgpu.ko                  - provide registration interface and virtual device
> > >                            VFIO access.
> > >  
> > > 5. QEMU note
> > > ==================================================================================
> > >  
> > > To allow us focus on the VGPU kernel driver prototyping, we have introduced a new VFIO 
> > > class - vgpu inside QEMU, so we don't have to change the existing vfio/pci.c file and 
> > > use it as a reference for our implementation. It is basically just a quick c & p
> > > from vfio/pci.c to quickly meet our needs.
> > >  
> > > Once this proposal is finalized, we will move to vfio/pci.c instead of a new
> > > class, and probably the only thing required is to have a new way to discover the
> > > device.
> > >  
> > > 6. Examples
> > > ==================================================================================
> > >  
> > > On this server, we have two NVIDIA M60 GPUs.
> > >  
> > > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> > > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > >  
> > > After nvidia.ko gets initialized, we can query the supported vGPU type by
> > > accessing the "vgpu_supported_types" like following:
> > >  
> > > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
> > > 11:GRID M60-0B
> > > 12:GRID M60-0Q
> > > 13:GRID M60-1B
> > > 14:GRID M60-1Q
> > > 15:GRID M60-2B
> > > 16:GRID M60-2Q
> > > 17:GRID M60-4Q
> > > 18:GRID M60-8Q
> > >  
> > > For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
> > > like to create "GRID M60-4Q" VM on it.
> > >  
> > > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" >
> > > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> > >  
> > > Note: the number 0 here is for vGPU device index. So far the change is not tested
> > > for multiple vgpu devices yet, but we will support it.
> > >  
> > > At this moment, if you query the "vgpu_supported_types" it will still show all
> > > supported virtual GPU types as no virtual GPU resource is committed yet.
> > >  
> > > Starting VM:
> > >  
> > > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> > >  
> > > then, the supported vGPU type query will return:
> > >  
> > > [root@cjia-vgx-kvm /home/cjia]$
> > > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > > 17:GRID M60-4Q
> > >  
> > > So vgpu_supported_config needs to be called whenever a new virtual device gets
> > > created as the underlying HW might limit the supported types if there are
> > > any existing VM runnings.
> > >  
> > > Then, VM gets shutdown, writes to /sys/class/vgpu/vgpu_shutdown will info the
> > > GPU driver vendor to clean up resource.
> > >  
> > > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
> > > device sysfs.
> > 
> > 
> > I'd like to hear Intel's thoughts on this interface.  Are there
> > different vgpu capacities or priority classes that would necessitate
> > different types of vcpus on Intel?
> > 
> > I think there are some gaps in translating from named vgpu types to
> > indexes here, along with my previous mention of the UUID/set oddity.
> > 
> > Does Intel have a need for start and shutdown interfaces?
> > 
> > Neo, wasn't there at some point information about how many of each type
> > could be supported through these interfaces?  How does a user know their
> > capacity limits?
> > 

> Thanks for reminding me that, I think we probably forget to put that *important*
> information as the output of "vgpu_supported_types".

> Regarding the capacity, we can provide the frame buffer size as part of the
> "vgpu_supported_types" output as well, I would imagine those will be eventually
> show up on the openstack management interface or virt-mgr.

> Basically, yes there would be a separate col to show the number of instance you
> can create for each type of VGPU on a specific physical GPU.

Ok, Thanks,

Alex
Kirti Wankhede Jan. 27, 2016, 8:06 a.m. UTC | #7
On 1/27/2016 1:36 AM, Alex Williamson wrote:
> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
>> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>   
>> Hi Alex, Kevin and Jike,
>>   
>> (Seems I shouldn't use attachment, resend it again to the list, patches are
>> inline at the end)
>>   
>> Thanks for adding me to this technical discussion, a great opportunity
>> for us to design together which can bring both Intel and NVIDIA vGPU solution to
>> KVM platform.
>>   
>> Instead of directly jumping to the proposal that we have been working on
>> recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple
>> quick comments / thoughts regarding the existing discussions on this thread as
>> fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.
>>   
>> Then we can look at what we have, hopefully we can reach some consensus soon.
>>   
>>> Yes, and since you're creating and destroying the vgpu here, this is
>>> where I'd expect a struct device to be created and added to an IOMMU
>>> group.  The lifecycle management should really include links between
>>> the vGPU and physical GPU, which would be much, much easier to do with
>>> struct devices create here rather than at the point where we start
>>> doing vfio "stuff".
>>   
>> Infact to keep vfio-vgpu to be more generic, vgpu device creation and management
>> can be centralized and done in vfio-vgpu. That also include adding to IOMMU
>> group and VFIO group.
> Is this really a good idea?  The concept of a vgpu is not unique to
> vfio, we want vfio to be a driver for a vgpu, not an integral part of
> the lifecycle of a vgpu.  That certainly doesn't exclude adding
> infrastructure to make lifecycle management of a vgpu more consistent
> between drivers, but it should be done independently of vfio.  I'll go
> back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
> does not create the VF, that's done in coordination with the PF making
> use of some PCI infrastructure for consistency between drivers.
>
> It seems like we need to take more advantage of the class and driver
> core support to perhaps setup a vgpu bus and class with vfio-vgpu just
> being a driver for those devices.

For device passthrough or SR-IOV model, PCI devices are created by PCI 
bus driver and from the probe routine each device is added in vfio group.

For vgpu, there should be a common module that create vgpu device, say 
vgpu module, add vgpu device to an IOMMU group and then add it to vfio 
group.  This module can handle management of vgpus. Advantage of keeping 
this module a separate module than doing device creation in vendor 
modules is to have generic interface for vgpu management, for example, 
files /sys/class/vgpu/vgpu_start and  /sys/class/vgpu/vgpu_shudown and 
vgpu driver registration interface.

In the patch, vgpu_dev.c + vgpu_sysfs.c form such vgpu module and 
vgpu_vfio.c is for VFIO interface. Each vgpu device should be added to 
vfio group, so vgpu_group_init() from vgpu_vfio.c should be called per 
device. In the vgpu module, vgpu devices are created on request, so 
vgpu_group_init() should be called explicitly for per vgpu device. 
  That’s why had merged the 2 modules, vgpu + vgpu_vfio to form one vgpu 
module.  Vgpu_vfio would remain separate entity but merged with vgpu 
module.


Thanks,
Kirti
Neo Jia Jan. 27, 2016, 9:14 a.m. UTC | #8
On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > 1.1 Under per-physical device sysfs:
> > > > ----------------------------------------------------------------------------------
> > > >  
> > > > vgpu_supported_types - RO, list the current supported virtual GPU types and its
> > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> > > > "vgpu_supported_types".
> > > >                             
> > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> > > > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > > >  
> > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> > > > target physical GPU
> > > 
> > > 
> > > I've noted in previous discussions that we need to separate user policy
> > > from kernel policy here, the kernel policy should not require a "VM
> > > UUID".  A UUID simply represents a set of one or more devices and an
> > > index picks the device within the set.  Whether that UUID matches a VM
> > > or is independently used is up to the user policy when creating the
> > > device.
> > > 
> > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > UUID set of devices and instead have each device be independent.  This
> > > seems to be an imposition on the nvidia implementation into the kernel
> > > interface design.
> > > 
> > 
> > Hi Alex,
> > 
> > I agree with you that we should not put UUID concept into a kernel API. At
> > this point (without any prototyping), I am thinking of using a list of virtual
> > devices instead of UUID.
> 
> Hi Neo,
> 
> A UUID is a perfectly fine name, so long as we let it be just a UUID and
> not the UUID matching some specific use case.
> 
> > > >  
> > > > int vgpu_map_virtual_bar
> > > > (
> > > >     uint64_t virt_bar_addr,
> > > >     uint64_t phys_bar_addr,
> > > >     uint32_t len,
> > > >     uint32_t flags
> > > > )
> > > >  
> > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > > 
> > > 
> > > Per the implementation provided, this needs to be implemented in the
> > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > of the device and replacing it is wrong.  It should be remapped at the
> > > vfio device file interface using vm_ops.
> > > 
> > 
> > So you are basically suggesting that we are going to take a mmap fault and
> > within that fault handler, we will go into vendor driver to look up the
> > "pre-registered" mapping and remap there.
> > 
> > Is my understanding correct?
> 
> Essentially, hopefully the vendor driver will have already registered
> the backing for the mmap prior to the fault, but either way could work.
> I think the key though is that you want to remap it onto the vma
> accessing the vfio device file, not scanning it out of an IOVA mapping
> that might be dynamic and doing a vma lookup based on the point in time
> mapping of the BAR.  The latter doesn't give me much confidence that
> mappings couldn't change while the former should be a one time fault.

Hi Alex,

The fact is that the vendor driver can only prevent such mmap fault by looking
up the <iova, hva> mapping table that we have saved from IOMMU memory listerner
when the guest region gets programmed. Also, like you have mentioned below, such
mapping between iova and hva shouldn't be changed as long as the SBIOS and
guest OS are done with their job. 

Yes, you are right it is one time fault, but the gpu work is heavily pipelined. 

Probably we should just limit this interface to guest MMIO region and we can have
some crosscheck between the VFIO driver who has monitored the config spcae
access to make sure nothing getting moved around?

> 
> In case it's not clear to folks at Intel, the purpose of this is that a
> vGPU may directly map a segment of the physical GPU MMIO space, but we
> may not know what segment that is at setup time, when QEMU does an mmap
> of the vfio device file descriptor.  The thought is that we can create
> an invalid mapping when QEMU calls mmap(), knowing that it won't be
> accessed until later, then we can fault in the real mmap on demand.  Do
> you need anything similar?
> 
> > > 
> > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > > >  
> > > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > > >  
> > > > Still a lot to be added and modified, such as supporting multiple VMs and 
> > > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
> > > > kernel driver, error handling, roll-back and locked memory size per user, etc. 
> > > 
> > > Particularly, handling of mapping changes is completely missing.  This
> > > cannot be a point in time translation, the user is free to remap
> > > addresses whenever they wish and device translations need to be updated
> > > accordingly.
> > > 
> > 
> > When you say "user", do you mean the QEMU?
> 
> vfio is a generic userspace driver interface, QEMU is a very, very
> important user of the interface, but not the only user.  So for this
> conversation, we're mostly talking about QEMU as the user, but we should
> be careful about assuming QEMU is the only user.
> 

Understood. I have to say that our focus at this moment is to support QEMU and
KVM, but I know VFIO interface is much more than that, and that is why I think
it is right to leverage this framework so we can together explore future use
case in the userland.


> > Here, whenever the DMA that
> > the guest driver is going to launch will be first pinned within VM, and then
> > registered to QEMU, therefore the IOMMU memory listener, eventually the pages
> > will be pinned by the GPU or DMA engine.
> > 
> > Since we are keeping the upper level code same, thinking about passthru case,
> > where the GPU has already put the real IOVA into his PTEs, I don't know how QEMU
> > can change that mapping without causing an IOMMU fault on a active DMA device.
> 
> For the virtual BAR mapping above, it's easy to imagine that mapping a
> BAR to a given address is at the guest discretion, it may be mapped and
> unmapped, it may be mapped to different addresses at different points in
> time, the guest BIOS may choose to map it at yet another address, etc.
> So if somehow we were trying to setup a mapping for peer-to-peer, there
> are lots of ways that IOVA could change.  But even with RAM, we can
> support memory hotplug in a VM.  What was once a DMA target may be
> removed or may now be backed by something else.  Chipset configuration
> on the emulated platform may change how guest physical memory appears
> and that might change between VM boots.
> 
> Currently with physical device assignment the memory listener watches
> for both maps and unmaps and updates the iotlb to match.  Just like real
> hardware doing these same sorts of things, we rely on the guest to stop
> using memory that's going to be moved as a DMA target prior to moving
> it.

Right,  you can only do that when the device is quiescent.

As long as this will be notified to the guest, I think we should be able to
support it although the real implementation will depend on how the device gets into 
quiescent state.

This is definitely a very interesting feature we should explore, but I hope we
probably can first focus on the most basic functionality.

Thanks,
Neo

> 
> > > > 4. Modules
> > > > ==================================================================================
> > > >  
> > > > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
> > > >  
> > > > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
> > > >                            TYPE1 v1 and v2 interface. 
> > > 
> > > Depending on how intrusive it is, this can possibly by done within the
> > > existing type1 driver.  Either that or we can split out common code for
> > > use by a separate module.
> > > 
> > > > vgpu.ko                  - provide registration interface and virtual device
> > > >                            VFIO access.
> > > >  
> > > > 5. QEMU note
> > > > ==================================================================================
> > > >  
> > > > To allow us focus on the VGPU kernel driver prototyping, we have introduced a new VFIO 
> > > > class - vgpu inside QEMU, so we don't have to change the existing vfio/pci.c file and 
> > > > use it as a reference for our implementation. It is basically just a quick c & p
> > > > from vfio/pci.c to quickly meet our needs.
> > > >  
> > > > Once this proposal is finalized, we will move to vfio/pci.c instead of a new
> > > > class, and probably the only thing required is to have a new way to discover the
> > > > device.
> > > >  
> > > > 6. Examples
> > > > ==================================================================================
> > > >  
> > > > On this server, we have two NVIDIA M60 GPUs.
> > > >  
> > > > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> > > > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > > > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > > >  
> > > > After nvidia.ko gets initialized, we can query the supported vGPU type by
> > > > accessing the "vgpu_supported_types" like following:
> > > >  
> > > > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
> > > > 11:GRID M60-0B
> > > > 12:GRID M60-0Q
> > > > 13:GRID M60-1B
> > > > 14:GRID M60-1Q
> > > > 15:GRID M60-2B
> > > > 16:GRID M60-2Q
> > > > 17:GRID M60-4Q
> > > > 18:GRID M60-8Q
> > > >  
> > > > For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
> > > > like to create "GRID M60-4Q" VM on it.
> > > >  
> > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" >
> > > > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> > > >  
> > > > Note: the number 0 here is for vGPU device index. So far the change is not tested
> > > > for multiple vgpu devices yet, but we will support it.
> > > >  
> > > > At this moment, if you query the "vgpu_supported_types" it will still show all
> > > > supported virtual GPU types as no virtual GPU resource is committed yet.
> > > >  
> > > > Starting VM:
> > > >  
> > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> > > >  
> > > > then, the supported vGPU type query will return:
> > > >  
> > > > [root@cjia-vgx-kvm /home/cjia]$
> > > > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > > > 17:GRID M60-4Q
> > > >  
> > > > So vgpu_supported_config needs to be called whenever a new virtual device gets
> > > > created as the underlying HW might limit the supported types if there are
> > > > any existing VM runnings.
> > > >  
> > > > Then, VM gets shutdown, writes to /sys/class/vgpu/vgpu_shutdown will info the
> > > > GPU driver vendor to clean up resource.
> > > >  
> > > > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
> > > > device sysfs.
> > > 
> > > 
> > > I'd like to hear Intel's thoughts on this interface.  Are there
> > > different vgpu capacities or priority classes that would necessitate
> > > different types of vcpus on Intel?
> > > 
> > > I think there are some gaps in translating from named vgpu types to
> > > indexes here, along with my previous mention of the UUID/set oddity.
> > > 
> > > Does Intel have a need for start and shutdown interfaces?
> > > 
> > > Neo, wasn't there at some point information about how many of each type
> > > could be supported through these interfaces?  How does a user know their
> > > capacity limits?
> > > 
> > 
> > Thanks for reminding me that, I think we probably forget to put that *important*
> > information as the output of "vgpu_supported_types".
> > 
> > Regarding the capacity, we can provide the frame buffer size as part of the
> > "vgpu_supported_types" output as well, I would imagine those will be eventually
> > show up on the openstack management interface or virt-mgr.
> > 
> > Basically, yes there would be a separate col to show the number of instance you
> > can create for each type of VGPU on a specific physical GPU.
> 
> Ok, Thanks,
> 
> Alex
>
Alex Williamson Jan. 27, 2016, 4 p.m. UTC | #9
On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote:

> On 1/27/2016 1:36 AM, Alex Williamson wrote:
> > On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> > > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > >   
> > > Hi Alex, Kevin and Jike,
> > >   
> > > (Seems I shouldn't use attachment, resend it again to the list, patches are
> > > inline at the end)
> > >   
> > > Thanks for adding me to this technical discussion, a great opportunity
> > > for us to design together which can bring both Intel and NVIDIA vGPU solution to
> > > KVM platform.
> > >   
> > > Instead of directly jumping to the proposal that we have been working on
> > > recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple
> > > quick comments / thoughts regarding the existing discussions on this thread as
> > > fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.
> > >   
> > > Then we can look at what we have, hopefully we can reach some consensus soon.
> > >   
> > > > Yes, and since you're creating and destroying the vgpu here, this is
> > > > where I'd expect a struct device to be created and added to an IOMMU
> > > > group.  The lifecycle management should really include links between
> > > > the vGPU and physical GPU, which would be much, much easier to do with
> > > > struct devices create here rather than at the point where we start
> > > > doing vfio "stuff".
> > >   
> > > Infact to keep vfio-vgpu to be more generic, vgpu device creation and management
> > > can be centralized and done in vfio-vgpu. That also include adding to IOMMU
> > > group and VFIO group.
> > Is this really a good idea?  The concept of a vgpu is not unique to
> > vfio, we want vfio to be a driver for a vgpu, not an integral part of
> > the lifecycle of a vgpu.  That certainly doesn't exclude adding
> > infrastructure to make lifecycle management of a vgpu more consistent
> > between drivers, but it should be done independently of vfio.  I'll go
> > back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
> > does not create the VF, that's done in coordination with the PF making
> > use of some PCI infrastructure for consistency between drivers.
> > 
> > It seems like we need to take more advantage of the class and driver
> > core support to perhaps setup a vgpu bus and class with vfio-vgpu just
> > being a driver for those devices.

> For device passthrough or SR-IOV model, PCI devices are created by PCI 
> bus driver and from the probe routine each device is added in vfio group.

An SR-IOV VF is created by the PF driver using standard interfaces
provided by the PCI core.  The IOMMU group for a VF is added by the
IOMMU driver when the device is created on the pci_bus_type.  The probe
routine of the vfio bus driver (vfio-pci) is what adds the device into
the vfio group.

> For vgpu, there should be a common module that create vgpu device, say 
> vgpu module, add vgpu device to an IOMMU group and then add it to vfio 
> group.  This module can handle management of vgpus. Advantage of keeping 
> this module a separate module than doing device creation in vendor 
> modules is to have generic interface for vgpu management, for example, 
> files /sys/class/vgpu/vgpu_start and  /sys/class/vgpu/vgpu_shudown and 
> vgpu driver registration interface.

But you're suggesting something very different from the SR-IOV model.
If we wanted to mimic that model, the GPU specific driver should create
the vgpu using services provided by a common interface.  For instance
i915 could call a new vgpu_device_create() which creates the device,
adds it to the vgpu class, etc.  That vgpu device should not be assumed
to be used with vfio though, that should happen via a separate probe
using a vfio-vgpu driver.  It's that vfio bus driver that will add the
device to a vfio group.

> In the patch, vgpu_dev.c + vgpu_sysfs.c form such vgpu module and 
> vgpu_vfio.c is for VFIO interface. Each vgpu device should be added to 
> vfio group, so vgpu_group_init() from vgpu_vfio.c should be called per 
> device. In the vgpu module, vgpu devices are created on request, so 
> vgpu_group_init() should be called explicitly for per vgpu device. 
>   That’s why had merged the 2 modules, vgpu + vgpu_vfio to form one vgpu 
> module.  Vgpu_vfio would remain separate entity but merged with vgpu 
> module.

I disagree with this design, creation of a vgpu necessarily involves the
GPU driver and should not be tied to use of the vgpu with vfio.  vfio
should be a driver for the device, maybe eventually not the only driver
for the device.  Thanks,

Alex
Alex Williamson Jan. 27, 2016, 4:10 p.m. UTC | #10
On Wed, 2016-01-27 at 01:14 -0800, Neo Jia wrote:
> On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> > On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > > 1.1 Under per-physical device sysfs:
> > > > > ----------------------------------------------------------------------------------
> > > > >  
> > > > > vgpu_supported_types - RO, list the current supported virtual GPU types and its
> > > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> > > > > "vgpu_supported_types".
> > > > >                             
> > > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> > > > > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > > > >  
> > > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> > > > > target physical GPU
> > > >  
> > > >  
> > > > I've noted in previous discussions that we need to separate user policy
> > > > from kernel policy here, the kernel policy should not require a "VM
> > > > UUID".  A UUID simply represents a set of one or more devices and an
> > > > index picks the device within the set.  Whether that UUID matches a VM
> > > > or is independently used is up to the user policy when creating the
> > > > device.
> > > >  
> > > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > > UUID set of devices and instead have each device be independent.  This
> > > > seems to be an imposition on the nvidia implementation into the kernel
> > > > interface design.
> > > >  
> > >  
> > > Hi Alex,
> > >  
> > > I agree with you that we should not put UUID concept into a kernel API. At
> > > this point (without any prototyping), I am thinking of using a list of virtual
> > > devices instead of UUID.
> > 
> > Hi Neo,
> > 
> > A UUID is a perfectly fine name, so long as we let it be just a UUID and
> > not the UUID matching some specific use case.
> > 
> > > > >  
> > > > > int vgpu_map_virtual_bar
> > > > > (
> > > > >     uint64_t virt_bar_addr,
> > > > >     uint64_t phys_bar_addr,
> > > > >     uint32_t len,
> > > > >     uint32_t flags
> > > > > )
> > > > >  
> > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > > >  
> > > >  
> > > > Per the implementation provided, this needs to be implemented in the
> > > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > > of the device and replacing it is wrong.  It should be remapped at the
> > > > vfio device file interface using vm_ops.
> > > >  
> > >  
> > > So you are basically suggesting that we are going to take a mmap fault and
> > > within that fault handler, we will go into vendor driver to look up the
> > > "pre-registered" mapping and remap there.
> > >  
> > > Is my understanding correct?
> > 
> > Essentially, hopefully the vendor driver will have already registered
> > the backing for the mmap prior to the fault, but either way could work.
> > I think the key though is that you want to remap it onto the vma
> > accessing the vfio device file, not scanning it out of an IOVA mapping
> > that might be dynamic and doing a vma lookup based on the point in time
> > mapping of the BAR.  The latter doesn't give me much confidence that
> > mappings couldn't change while the former should be a one time fault.

> Hi Alex,

> The fact is that the vendor driver can only prevent such mmap fault by looking
> up the <iova, hva> mapping table that we have saved from IOMMU memory listerner

Why do we need to prevent the fault?  We need to handle the fault when
it occurs.

> when the guest region gets programmed. Also, like you have mentioned below, such
> mapping between iova and hva shouldn't be changed as long as the SBIOS and
> guest OS are done with their job. 

But you don't know they're done with their job.

> Yes, you are right it is one time fault, but the gpu work is heavily pipelined. 

Why does that matter?  We're talking about the first time the VM
accesses the range of the BAR that will be direct mapped to the physical
GPU.  This isn't going to happen in the middle of a benchmark, it's
going to happen during driver initialization in the guest.

> Probably we should just limit this interface to guest MMIO region and we can have
> some crosscheck between the VFIO driver who has monitored the config spcae
> access to make sure nothing getting moved around?

No, the solution for the bar is very clear, map on fault to the vma
accessing the mmap and be done with it for the remainder of this
instance of the VM.

> > In case it's not clear to folks at Intel, the purpose of this is that a
> > vGPU may directly map a segment of the physical GPU MMIO space, but we
> > may not know what segment that is at setup time, when QEMU does an mmap
> > of the vfio device file descriptor.  The thought is that we can create
> > an invalid mapping when QEMU calls mmap(), knowing that it won't be
> > accessed until later, then we can fault in the real mmap on demand.  Do
> > you need anything similar?
> > 
> > > >  
> > > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > > > >  
> > > > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > > > >  
> > > > > Still a lot to be added and modified, such as supporting multiple VMs and 
> > > > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
> > > > > kernel driver, error handling, roll-back and locked memory size per user, etc. 
> > > >  
> > > > Particularly, handling of mapping changes is completely missing.  This
> > > > cannot be a point in time translation, the user is free to remap
> > > > addresses whenever they wish and device translations need to be updated
> > > > accordingly.
> > > >  
> > >  
> > > When you say "user", do you mean the QEMU?
> > 
> > vfio is a generic userspace driver interface, QEMU is a very, very
> > important user of the interface, but not the only user.  So for this
> > conversation, we're mostly talking about QEMU as the user, but we should
> > be careful about assuming QEMU is the only user.
> > 

> Understood. I have to say that our focus at this moment is to support QEMU and
> KVM, but I know VFIO interface is much more than that, and that is why I think
> it is right to leverage this framework so we can together explore future use
> case in the userland.


> > > Here, whenever the DMA that
> > > the guest driver is going to launch will be first pinned within VM, and then
> > > registered to QEMU, therefore the IOMMU memory listener, eventually the pages
> > > will be pinned by the GPU or DMA engine.
> > >  
> > > Since we are keeping the upper level code same, thinking about passthru case,
> > > where the GPU has already put the real IOVA into his PTEs, I don't know how QEMU
> > > can change that mapping without causing an IOMMU fault on a active DMA device.
> > 
> > For the virtual BAR mapping above, it's easy to imagine that mapping a
> > BAR to a given address is at the guest discretion, it may be mapped and
> > unmapped, it may be mapped to different addresses at different points in
> > time, the guest BIOS may choose to map it at yet another address, etc.
> > So if somehow we were trying to setup a mapping for peer-to-peer, there
> > are lots of ways that IOVA could change.  But even with RAM, we can
> > support memory hotplug in a VM.  What was once a DMA target may be
> > removed or may now be backed by something else.  Chipset configuration
> > on the emulated platform may change how guest physical memory appears
> > and that might change between VM boots.
> > 
> > Currently with physical device assignment the memory listener watches
> > for both maps and unmaps and updates the iotlb to match.  Just like real
> > hardware doing these same sorts of things, we rely on the guest to stop
> > using memory that's going to be moved as a DMA target prior to moving
> > it.

> Right,  you can only do that when the device is quiescent.

> As long as this will be notified to the guest, I think we should be able to
> support it although the real implementation will depend on how the device gets into 
> quiescent state.

> This is definitely a very interesting feature we should explore, but I hope we
> probably can first focus on the most basic functionality.

If we only do a point-in-time translation and assume it never changes,
that's good enough for a proof of concept, but it's not a complete
solution.  I think this is  practical problem, not just an academic
problem.  There needs to be a mechanism mappings to be invalidated based
on VM memory changes.  Thanks,

Alex
Kirti Wankhede Jan. 27, 2016, 8:55 p.m. UTC | #11
On 1/27/2016 9:30 PM, Alex Williamson wrote:
> On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote:
>>
>> On 1/27/2016 1:36 AM, Alex Williamson wrote:
>>> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
>>>> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
>>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>>
>>>> Hi Alex, Kevin and Jike,
>>>>
>>>> (Seems I shouldn't use attachment, resend it again to the list, patches are
>>>> inline at the end)
>>>>
>>>> Thanks for adding me to this technical discussion, a great opportunity
>>>> for us to design together which can bring both Intel and NVIDIA vGPU solution to
>>>> KVM platform.
>>>>
>>>> Instead of directly jumping to the proposal that we have been working on
>>>> recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple
>>>> quick comments / thoughts regarding the existing discussions on this thread as
>>>> fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.
>>>>
>>>> Then we can look at what we have, hopefully we can reach some consensus soon.
>>>>
>>>>> Yes, and since you're creating and destroying the vgpu here, this is
>>>>> where I'd expect a struct device to be created and added to an IOMMU
>>>>> group.  The lifecycle management should really include links between
>>>>> the vGPU and physical GPU, which would be much, much easier to do with
>>>>> struct devices create here rather than at the point where we start
>>>>> doing vfio "stuff".
>>>>
>>>> Infact to keep vfio-vgpu to be more generic, vgpu device creation and management
>>>> can be centralized and done in vfio-vgpu. That also include adding to IOMMU
>>>> group and VFIO group.
>>> Is this really a good idea?  The concept of a vgpu is not unique to
>>> vfio, we want vfio to be a driver for a vgpu, not an integral part of
>>> the lifecycle of a vgpu.  That certainly doesn't exclude adding
>>> infrastructure to make lifecycle management of a vgpu more consistent
>>> between drivers, but it should be done independently of vfio.  I'll go
>>> back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
>>> does not create the VF, that's done in coordination with the PF making
>>> use of some PCI infrastructure for consistency between drivers.
>>>
>>> It seems like we need to take more advantage of the class and driver
>>> core support to perhaps setup a vgpu bus and class with vfio-vgpu just
>>> being a driver for those devices.
>>
>> For device passthrough or SR-IOV model, PCI devices are created by PCI
>> bus driver and from the probe routine each device is added in vfio group.
>
> An SR-IOV VF is created by the PF driver using standard interfaces
> provided by the PCI core.  The IOMMU group for a VF is added by the
> IOMMU driver when the device is created on the pci_bus_type.  The probe
> routine of the vfio bus driver (vfio-pci) is what adds the device into
> the vfio group.
>
>> For vgpu, there should be a common module that create vgpu device, say
>> vgpu module, add vgpu device to an IOMMU group and then add it to vfio
>> group.  This module can handle management of vgpus. Advantage of keeping
>> this module a separate module than doing device creation in vendor
>> modules is to have generic interface for vgpu management, for example,
>> files /sys/class/vgpu/vgpu_start and  /sys/class/vgpu/vgpu_shudown and
>> vgpu driver registration interface.
>
> But you're suggesting something very different from the SR-IOV model.
> If we wanted to mimic that model, the GPU specific driver should create
> the vgpu using services provided by a common interface.  For instance
> i915 could call a new vgpu_device_create() which creates the device,
> adds it to the vgpu class, etc.  That vgpu device should not be assumed
> to be used with vfio though, that should happen via a separate probe
> using a vfio-vgpu driver.  It's that vfio bus driver that will add the
> device to a vfio group.
>

In that case vgpu driver should provide a driver registration interface 
to register vfio-vgpu driver.

struct vgpu_driver {
	const char *name;
	int (*probe) (struct vgpu_device *vdev);
	void (*remove) (struct vgpu_device *vdev);
}

int vgpu_register_driver(struct vgpu_driver *driver)
{
...
}
EXPORT_SYMBOL(vgpu_register_driver);

int vgpu_unregister_driver(struct vgpu_driver *driver)
{
...
}
EXPORT_SYMBOL(vgpu_unregister_driver);

vfio-vgpu driver registers to vgpu driver. Then from 
vgpu_device_create(), after creating the device it calls 
vgpu_driver->probe(vgpu_device) and vfio-vgpu driver adds the device to 
vfio group.

+--------------+    vgpu_register_driver()+---------------+
|     __init() +------------------------->+               |
|              |                          |               |
|              +<-------------------------+    vgpu.ko    |
| vfio_vgpu.ko |   probe()/remove()       |               |
|              |                +---------+               +---------+
+--------------+                |         +-------+-------+         |
                                 |                 ^                 |
                                 | callback        |                 |
                                 |         +-------+--------+        |
                                 |         |vgpu_register_device()   |
                                 |         |                |        |
                                 +---^-----+-----+    +-----+------+-+
                                     | nvidia.ko |    |  i915.ko   |
                                     |           |    |            |
                                     +-----------+    +------------+

Is my understanding correct?

Thanks,
Kirti


>> In the patch, vgpu_dev.c + vgpu_sysfs.c form such vgpu module and
>> vgpu_vfio.c is for VFIO interface. Each vgpu device should be added to
>> vfio group, so vgpu_group_init() from vgpu_vfio.c should be called per
>> device. In the vgpu module, vgpu devices are created on request, so
>> vgpu_group_init() should be called explicitly for per vgpu device.
>>    That’s why had merged the 2 modules, vgpu + vgpu_vfio to form one vgpu
>> module.  Vgpu_vfio would remain separate entity but merged with vgpu
>> module.
>
> I disagree with this design, creation of a vgpu necessarily involves the
> GPU driver and should not be tied to use of the vgpu with vfio.  vfio
> should be a driver for the device, maybe eventually not the only driver
> for the device.  Thanks,
>
> Alex
>
Neo Jia Jan. 27, 2016, 9:48 p.m. UTC | #12
On Wed, Jan 27, 2016 at 09:10:16AM -0700, Alex Williamson wrote:
> On Wed, 2016-01-27 at 01:14 -0800, Neo Jia wrote:
> > On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> > > On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > > > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > > > 1.1 Under per-physical device sysfs:
> > > > > > ----------------------------------------------------------------------------------
> > > > > >  
> > > > > > vgpu_supported_types - RO, list the current supported virtual GPU types and its
> > > > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> > > > > > "vgpu_supported_types".
> > > > > >                             
> > > > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> > > > > > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > > > > >  
> > > > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> > > > > > target physical GPU
> > > > >  
> > > > >  
> > > > > I've noted in previous discussions that we need to separate user policy
> > > > > from kernel policy here, the kernel policy should not require a "VM
> > > > > UUID".  A UUID simply represents a set of one or more devices and an
> > > > > index picks the device within the set.  Whether that UUID matches a VM
> > > > > or is independently used is up to the user policy when creating the
> > > > > device.
> > > > >  
> > > > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > > > UUID set of devices and instead have each device be independent.  This
> > > > > seems to be an imposition on the nvidia implementation into the kernel
> > > > > interface design.
> > > > >  
> > > >  
> > > > Hi Alex,
> > > >  
> > > > I agree with you that we should not put UUID concept into a kernel API. At
> > > > this point (without any prototyping), I am thinking of using a list of virtual
> > > > devices instead of UUID.
> > > 
> > > Hi Neo,
> > > 
> > > A UUID is a perfectly fine name, so long as we let it be just a UUID and
> > > not the UUID matching some specific use case.
> > > 
> > > > > >  
> > > > > > int vgpu_map_virtual_bar
> > > > > > (
> > > > > >     uint64_t virt_bar_addr,
> > > > > >     uint64_t phys_bar_addr,
> > > > > >     uint32_t len,
> > > > > >     uint32_t flags
> > > > > > )
> > > > > >  
> > > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > > > >  
> > > > >  
> > > > > Per the implementation provided, this needs to be implemented in the
> > > > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > > > of the device and replacing it is wrong.  It should be remapped at the
> > > > > vfio device file interface using vm_ops.
> > > > >  
> > > >  
> > > > So you are basically suggesting that we are going to take a mmap fault and
> > > > within that fault handler, we will go into vendor driver to look up the
> > > > "pre-registered" mapping and remap there.
> > > >  
> > > > Is my understanding correct?
> > > 
> > > Essentially, hopefully the vendor driver will have already registered
> > > the backing for the mmap prior to the fault, but either way could work.
> > > I think the key though is that you want to remap it onto the vma
> > > accessing the vfio device file, not scanning it out of an IOVA mapping
> > > that might be dynamic and doing a vma lookup based on the point in time
> > > mapping of the BAR.  The latter doesn't give me much confidence that
> > > mappings couldn't change while the former should be a one time fault.
> > 
> > Hi Alex,
> > 
> > The fact is that the vendor driver can only prevent such mmap fault by looking
> > up the <iova, hva> mapping table that we have saved from IOMMU memory listerner
> 
> Why do we need to prevent the fault?  We need to handle the fault when
> it occurs.
> 
> > when the guest region gets programmed. Also, like you have mentioned below, such
> > mapping between iova and hva shouldn't be changed as long as the SBIOS and
> > guest OS are done with their job. 
> 
> But you don't know they're done with their job.
> 
> > Yes, you are right it is one time fault, but the gpu work is heavily pipelined. 
> 
> Why does that matter?  We're talking about the first time the VM
> accesses the range of the BAR that will be direct mapped to the physical
> GPU.  This isn't going to happen in the middle of a benchmark, it's
> going to happen during driver initialization in the guest.
> 
> > Probably we should just limit this interface to guest MMIO region and we can have
> > some crosscheck between the VFIO driver who has monitored the config spcae
> > access to make sure nothing getting moved around?
> 
> No, the solution for the bar is very clear, map on fault to the vma
> accessing the mmap and be done with it for the remainder of this
> instance of the VM.
> 

Hi Alex,

I totally get your points, my previous comments were just trying to explain the
reasoning behind our current implementation. I think I have found a way to hide
the latency of the mmap fault, which might happen in the middle of running a
benchmark.

I will add a new registration interface to allow the driver vendor to provide a
fault handler callback, and from there pages will be installed. 

> > > In case it's not clear to folks at Intel, the purpose of this is that a
> > > vGPU may directly map a segment of the physical GPU MMIO space, but we
> > > may not know what segment that is at setup time, when QEMU does an mmap
> > > of the vfio device file descriptor.  The thought is that we can create
> > > an invalid mapping when QEMU calls mmap(), knowing that it won't be
> > > accessed until later, then we can fault in the real mmap on demand.  Do
> > > you need anything similar?
> > > 
> > > > >  
> > > > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > > > > >  
> > > > > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > > > > >  
> > > > > > Still a lot to be added and modified, such as supporting multiple VMs and 
> > > > > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
> > > > > > kernel driver, error handling, roll-back and locked memory size per user, etc. 
> > > > >  
> > > > > Particularly, handling of mapping changes is completely missing.  This
> > > > > cannot be a point in time translation, the user is free to remap
> > > > > addresses whenever they wish and device translations need to be updated
> > > > > accordingly.
> > > > >  
> > > >  
> > > > When you say "user", do you mean the QEMU?
> > > 
> > > vfio is a generic userspace driver interface, QEMU is a very, very
> > > important user of the interface, but not the only user.  So for this
> > > conversation, we're mostly talking about QEMU as the user, but we should
> > > be careful about assuming QEMU is the only user.
> > > 
> > 
> > Understood. I have to say that our focus at this moment is to support QEMU and
> > KVM, but I know VFIO interface is much more than that, and that is why I think
> > it is right to leverage this framework so we can together explore future use
> > case in the userland.
> > 
> > 
> > > > Here, whenever the DMA that
> > > > the guest driver is going to launch will be first pinned within VM, and then
> > > > registered to QEMU, therefore the IOMMU memory listener, eventually the pages
> > > > will be pinned by the GPU or DMA engine.
> > > >  
> > > > Since we are keeping the upper level code same, thinking about passthru case,
> > > > where the GPU has already put the real IOVA into his PTEs, I don't know how QEMU
> > > > can change that mapping without causing an IOMMU fault on a active DMA device.
> > > 
> > > For the virtual BAR mapping above, it's easy to imagine that mapping a
> > > BAR to a given address is at the guest discretion, it may be mapped and
> > > unmapped, it may be mapped to different addresses at different points in
> > > time, the guest BIOS may choose to map it at yet another address, etc.
> > > So if somehow we were trying to setup a mapping for peer-to-peer, there
> > > are lots of ways that IOVA could change.  But even with RAM, we can
> > > support memory hotplug in a VM.  What was once a DMA target may be
> > > removed or may now be backed by something else.  Chipset configuration
> > > on the emulated platform may change how guest physical memory appears
> > > and that might change between VM boots.
> > > 
> > > Currently with physical device assignment the memory listener watches
> > > for both maps and unmaps and updates the iotlb to match.  Just like real
> > > hardware doing these same sorts of things, we rely on the guest to stop
> > > using memory that's going to be moved as a DMA target prior to moving
> > > it.
> > 
> > Right,  you can only do that when the device is quiescent.
> > 
> > As long as this will be notified to the guest, I think we should be able to
> > support it although the real implementation will depend on how the device gets into 
> > quiescent state.
> > 
> > This is definitely a very interesting feature we should explore, but I hope we
> > probably can first focus on the most basic functionality.
> 
> If we only do a point-in-time translation and assume it never changes,
> that's good enough for a proof of concept, but it's not a complete
> solution.  I think this is  practical problem, not just an academic
> problem.  There needs to be a mechanism mappings to be invalidated based
> on VM memory changes.  Thanks,
> 

Sorry, probably my previous comment is not very clear. I highly value your input and
the information related to the memory hotplug scenarios, and I never exclude the
support of such feature. The only question is when, that is why I would like to
defer such VM memory hotplug feature to phase 2 after the initial official
launch.

Thanks,
Neo

> Alex
>
Alex Williamson Jan. 27, 2016, 9:58 p.m. UTC | #13
On Thu, 2016-01-28 at 02:25 +0530, Kirti Wankhede wrote:

> On 1/27/2016 9:30 PM, Alex Williamson wrote:
> > On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote:
> > > 
> > > On 1/27/2016 1:36 AM, Alex Williamson wrote:
> > > > On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> > > > > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > 
> > > > > Hi Alex, Kevin and Jike,
> > > > > 
> > > > > (Seems I shouldn't use attachment, resend it again to the list, patches are
> > > > > inline at the end)
> > > > > 
> > > > > Thanks for adding me to this technical discussion, a great opportunity
> > > > > for us to design together which can bring both Intel and NVIDIA vGPU solution to
> > > > > KVM platform.
> > > > > 
> > > > > Instead of directly jumping to the proposal that we have been working on
> > > > > recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple
> > > > > quick comments / thoughts regarding the existing discussions on this thread as
> > > > > fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.
> > > > > 
> > > > > Then we can look at what we have, hopefully we can reach some consensus soon.
> > > > > 
> > > > > > Yes, and since you're creating and destroying the vgpu here, this is
> > > > > > where I'd expect a struct device to be created and added to an IOMMU
> > > > > > group.  The lifecycle management should really include links between
> > > > > > the vGPU and physical GPU, which would be much, much easier to do with
> > > > > > struct devices create here rather than at the point where we start
> > > > > > doing vfio "stuff".
> > > > > 
> > > > > Infact to keep vfio-vgpu to be more generic, vgpu device creation and management
> > > > > can be centralized and done in vfio-vgpu. That also include adding to IOMMU
> > > > > group and VFIO group.
> > > > Is this really a good idea?  The concept of a vgpu is not unique to
> > > > vfio, we want vfio to be a driver for a vgpu, not an integral part of
> > > > the lifecycle of a vgpu.  That certainly doesn't exclude adding
> > > > infrastructure to make lifecycle management of a vgpu more consistent
> > > > between drivers, but it should be done independently of vfio.  I'll go
> > > > back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
> > > > does not create the VF, that's done in coordination with the PF making
> > > > use of some PCI infrastructure for consistency between drivers.
> > > > 
> > > > It seems like we need to take more advantage of the class and driver
> > > > core support to perhaps setup a vgpu bus and class with vfio-vgpu just
> > > > being a driver for those devices.
> > > 
> > > For device passthrough or SR-IOV model, PCI devices are created by PCI
> > > bus driver and from the probe routine each device is added in vfio group.
> > 
> > An SR-IOV VF is created by the PF driver using standard interfaces
> > provided by the PCI core.  The IOMMU group for a VF is added by the
> > IOMMU driver when the device is created on the pci_bus_type.  The probe
> > routine of the vfio bus driver (vfio-pci) is what adds the device into
> > the vfio group.
> > 
> > > For vgpu, there should be a common module that create vgpu device, say
> > > vgpu module, add vgpu device to an IOMMU group and then add it to vfio
> > > group.  This module can handle management of vgpus. Advantage of keeping
> > > this module a separate module than doing device creation in vendor
> > > modules is to have generic interface for vgpu management, for example,
> > > files /sys/class/vgpu/vgpu_start and  /sys/class/vgpu/vgpu_shudown and
> > > vgpu driver registration interface.
> > 
> > But you're suggesting something very different from the SR-IOV model.
> > If we wanted to mimic that model, the GPU specific driver should create
> > the vgpu using services provided by a common interface.  For instance
> > i915 could call a new vgpu_device_create() which creates the device,
> > adds it to the vgpu class, etc.  That vgpu device should not be assumed
> > to be used with vfio though, that should happen via a separate probe
> > using a vfio-vgpu driver.  It's that vfio bus driver that will add the
> > device to a vfio group.
> > 

> In that case vgpu driver should provide a driver registration interface 
> to register vfio-vgpu driver.

> struct vgpu_driver {
> 	const char *name;
> 	int (*probe) (struct vgpu_device *vdev);
> 	void (*remove) (struct vgpu_device *vdev);
> }

> int vgpu_register_driver(struct vgpu_driver *driver)
> {
> ...
> }
> EXPORT_SYMBOL(vgpu_register_driver);

> int vgpu_unregister_driver(struct vgpu_driver *driver)
> {
> ...
> }
> EXPORT_SYMBOL(vgpu_unregister_driver);

> vfio-vgpu driver registers to vgpu driver. Then from 
> vgpu_device_create(), after creating the device it calls 
> vgpu_driver->probe(vgpu_device) and vfio-vgpu driver adds the device to 
> vfio group.

> +--------------+    vgpu_register_driver()+---------------+
> >     __init() +------------------------->+               |
> >              |                          |               |
> >              +<-------------------------+    vgpu.ko    |
> > vfio_vgpu.ko |   probe()/remove()       |               |
> >              |                +---------+               +---------+
> +--------------+                |         +-------+-------+         |
>                                  |                 ^                 |
>                                  | callback        |                 |
>                                  |         +-------+--------+        |
>                                  |         |vgpu_register_device()   |
>                                  |         |                |        |
>                                  +---^-----+-----+    +-----+------+-+
>                                      | nvidia.ko |    |  i915.ko   |
>                                      |           |    |            |
>                                      +-----------+    +------------+

> Is my understanding correct?

We have an entire driver core subsystem in Linux for the purpose of
matching devices to drivers, I don't think we should be re-inventing
that.  That's why I'm suggesting that we should have infrastructure
which facilitates GPU drivers to create vGPU devices in a common way,
perhaps even placing the devices on a virtual vgpu bus, and then allow a
vfio-vgpu driver to register as a driver for devices of that bus/class
and use the existing driver callbacks.  Thanks,

Alex
Kirti Wankhede Jan. 28, 2016, 3:01 a.m. UTC | #14
On 1/28/2016 3:28 AM, Alex Williamson wrote:
> On Thu, 2016-01-28 at 02:25 +0530, Kirti Wankhede wrote:
>>
>> On 1/27/2016 9:30 PM, Alex Williamson wrote:
>>> On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote:
>>>>
>>>> On 1/27/2016 1:36 AM, Alex Williamson wrote:
>>>>> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
>>>>>> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
>>>>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>>>>
>>>>>> Hi Alex, Kevin and Jike,
>>>>>>
>>>>>> (Seems I shouldn't use attachment, resend it again to the list, patches are
>>>>>> inline at the end)
>>>>>>
>>>>>> Thanks for adding me to this technical discussion, a great opportunity
>>>>>> for us to design together which can bring both Intel and NVIDIA vGPU solution to
>>>>>> KVM platform.
>>>>>>
>>>>>> Instead of directly jumping to the proposal that we have been working on
>>>>>> recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple
>>>>>> quick comments / thoughts regarding the existing discussions on this thread as
>>>>>> fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.
>>>>>>
>>>>>> Then we can look at what we have, hopefully we can reach some consensus soon.
>>>>>>
>>>>>>> Yes, and since you're creating and destroying the vgpu here, this is
>>>>>>> where I'd expect a struct device to be created and added to an IOMMU
>>>>>>> group.  The lifecycle management should really include links between
>>>>>>> the vGPU and physical GPU, which would be much, much easier to do with
>>>>>>> struct devices create here rather than at the point where we start
>>>>>>> doing vfio "stuff".
>>>>>>
>>>>>> Infact to keep vfio-vgpu to be more generic, vgpu device creation and management
>>>>>> can be centralized and done in vfio-vgpu. That also include adding to IOMMU
>>>>>> group and VFIO group.
>>>>> Is this really a good idea?  The concept of a vgpu is not unique to
>>>>> vfio, we want vfio to be a driver for a vgpu, not an integral part of
>>>>> the lifecycle of a vgpu.  That certainly doesn't exclude adding
>>>>> infrastructure to make lifecycle management of a vgpu more consistent
>>>>> between drivers, but it should be done independently of vfio.  I'll go
>>>>> back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
>>>>> does not create the VF, that's done in coordination with the PF making
>>>>> use of some PCI infrastructure for consistency between drivers.
>>>>>
>>>>> It seems like we need to take more advantage of the class and driver
>>>>> core support to perhaps setup a vgpu bus and class with vfio-vgpu just
>>>>> being a driver for those devices.
>>>>
>>>> For device passthrough or SR-IOV model, PCI devices are created by PCI
>>>> bus driver and from the probe routine each device is added in vfio group.
>>>
>>> An SR-IOV VF is created by the PF driver using standard interfaces
>>> provided by the PCI core.  The IOMMU group for a VF is added by the
>>> IOMMU driver when the device is created on the pci_bus_type.  The probe
>>> routine of the vfio bus driver (vfio-pci) is what adds the device into
>>> the vfio group.
>>>
>>>> For vgpu, there should be a common module that create vgpu device, say
>>>> vgpu module, add vgpu device to an IOMMU group and then add it to vfio
>>>> group.  This module can handle management of vgpus. Advantage of keeping
>>>> this module a separate module than doing device creation in vendor
>>>> modules is to have generic interface for vgpu management, for example,
>>>> files /sys/class/vgpu/vgpu_start and  /sys/class/vgpu/vgpu_shudown and
>>>> vgpu driver registration interface.
>>>
>>> But you're suggesting something very different from the SR-IOV model.
>>> If we wanted to mimic that model, the GPU specific driver should create
>>> the vgpu using services provided by a common interface.  For instance
>>> i915 could call a new vgpu_device_create() which creates the device,
>>> adds it to the vgpu class, etc.  That vgpu device should not be assumed
>>> to be used with vfio though, that should happen via a separate probe
>>> using a vfio-vgpu driver.  It's that vfio bus driver that will add the
>>> device to a vfio group.
>>>
>>
>> In that case vgpu driver should provide a driver registration interface
>> to register vfio-vgpu driver.
>>
>> struct vgpu_driver {
>>   	const char *name;
>>   	int (*probe) (struct vgpu_device *vdev);
>>   	void (*remove) (struct vgpu_device *vdev);
>> }
>>
>> int vgpu_register_driver(struct vgpu_driver *driver)
>> {
>> ...
>> }
>> EXPORT_SYMBOL(vgpu_register_driver);
>>
>> int vgpu_unregister_driver(struct vgpu_driver *driver)
>> {
>> ...
>> }
>> EXPORT_SYMBOL(vgpu_unregister_driver);
>>
>> vfio-vgpu driver registers to vgpu driver. Then from
>> vgpu_device_create(), after creating the device it calls
>> vgpu_driver->probe(vgpu_device) and vfio-vgpu driver adds the device to
>> vfio group.
>>
>> +--------------+    vgpu_register_driver()+---------------+
>>>      __init() +------------------------->+               |
>>>               |                          |               |
>>>               +<-------------------------+    vgpu.ko    |
>>> vfio_vgpu.ko |   probe()/remove()       |               |
>>>               |                +---------+               +---------+
>> +--------------+                |         +-------+-------+         |
>>                                   |                 ^                 |
>>                                   | callback        |                 |
>>                                   |         +-------+--------+        |
>>                                   |         |vgpu_register_device()   |
>>                                   |         |                |        |
>>                                   +---^-----+-----+    +-----+------+-+
>>                                       | nvidia.ko |    |  i915.ko   |
>>                                       |           |    |            |
>>                                       +-----------+    +------------+
>>
>> Is my understanding correct?
>
> We have an entire driver core subsystem in Linux for the purpose of
> matching devices to drivers, I don't think we should be re-inventing
> that.  That's why I'm suggesting that we should have infrastructure
> which facilitates GPU drivers to create vGPU devices in a common way,
> perhaps even placing the devices on a virtual vgpu bus, and then allow a
> vfio-vgpu driver to register as a driver for devices of that bus/class
> and use the existing driver callbacks.  Thanks,
>
> Alex
>

We will use Linux core subsystem, my point is we have to introduce vgpu 
module to provide such infrastructure to GPU drivers in common way. This 
module helps GPU drivers to create vGPU devices and allow vfio-vgpu 
driver to register for vGPU devices.

Kirti.
diff mbox

Patch

==================================================================================

Here we are proposing a generic Linux kernel module based on VFIO framework
which allows different GPU vendors to plugin and provide their GPU virtualization
solution on KVM, the benefits of having such generic kernel module are:

1) Reuse QEMU VFIO driver, supporting VFIO UAPI

2) GPU HW agnostic management API for upper layer software such as libvirt

3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendor

0. High level overview
==================================================================================

 
  user space:
                                +-----------+  VFIO IOMMU IOCTLs
                      +---------| QEMU VFIO |-------------------------+
        VFIO IOCTLs   |         +-----------+                         |
                      |                                               | 
 ---------------------|-----------------------------------------------|---------
                      |                                               |
  kernel space:       |  +--->----------->---+  (callback)            V
                      |  |                   v                 +------V-----+
  +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
  |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
  | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+ 
  |          |   |          |     | (register)           ^         ||
  +----------+   +-------+--+     |    +-----------+     |         ||
                         V        +----| i915.ko   +-----+     +---VV-------+ 
                         |             +-----^-----+           | TYPE1      |
                         |  (callback)       |                 | IOMMU      |
                         +-->------------>---+                 +------------+
 access flow:

  Guest MMIO / PCI config access
  |
  -------------------------------------------------
  |
  +-----> KVM VM_EXITs  (kernel)
          |
  -------------------------------------------------
          |
          +-----> QEMU VFIO driver (user)
                  | 
  -------------------------------------------------
                  |
                  +---->  VGPU kernel driver (kernel)
                          |  
                          | 
                          +----> vendor driver callback


1. VGPU management interface
==================================================================================

This is the interface allows upper layer software (mostly libvirt) to query and
configure virtual GPU device in a HW agnostic fashion. Also, this management
interface has provided flexibility to underlying GPU vendor to support virtual
device hotplug, multiple virtual devices per VM, multiple virtual devices from
different physical devices, etc.

1.1 Under per-physical device sysfs:
----------------------------------------------------------------------------------

vgpu_supported_types - RO, list the current supported virtual GPU types and its
VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
"vgpu_supported_types".
                            
vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
gpu device on a target physical GPU. idx: virtual device index inside a VM

vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
target physical GPU

1.3 Under vgpu class sysfs:
----------------------------------------------------------------------------------

vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
interface to notify the GPU vendor driver to commit virtual GPU resource for
this target VM. 

Also, the vgpu_start function is a synchronized call, the successful return of
this call will indicate all the requested vGPU resource has been fully
committed, the VMM should continue.

vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration
interface to notify the GPU vendor driver to release virtual GPU resource of
this target VM.

1.4 Virtual device Hotplug
----------------------------------------------------------------------------------

To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
accessed during VM runtime, and the corresponding registration callback will be
invoked to allow GPU vendor support hotplug.

To support hotplug, vendor driver would take necessary action to handle the
situation when a vgpu_create is done on a VM_UUID after vgpu_start, and that
implies both create and start for that vgpu device.

Same, vgpu_destroy implies a vgpu_shudown on a running VM only if vendor driver
supports vgpu hotplug.

If hotplug is not supported and VM is still running, vendor driver can return
error code to indicate not supported.

Separate create from start gives flixibility to have:

- multiple vgpu instances for single VM and
- hotplug feature.

2. GPU driver vendor registration interface
==================================================================================

2.1 Registration interface definition (include/linux/vgpu.h)
----------------------------------------------------------------------------------

extern int vgpu_register_device(struct pci_dev *dev, 
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu
 * types.
 *                              @dev : pci device structure of physical GPU. 
 *                              @config: should return string listing supported
 *                              config
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resouces in graphics
 *                              driver for a particular vgpu.
 *                              @dev: physical pci device structure on which
 *                              vgpu 
 *                                    should be created
 *                              @vm_uuid: VM's uuid for which VM it is intended
 *                              to
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to be
 *                                        created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in graphics driver for
 *                              a vgpu instance of that VM.
 *                              @dev: physical pci device structure to which
 *                              this vgpu points to.
 *                              @vm_uuid: VM's uuid for which the vgpu belongs
 *                              to.
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If VM is running and vgpu_destroy is called that 
 *                              means the vGPU is being hotunpluged. Return
 *                              error
 *                              if VM is running and graphics driver doesn't
 *                              support vgpu hotplug.
 * @vgpu_start:                 Called to do initiate vGPU initialization
 *                              process in graphics driver when VM boots before
 *                              qemu starts.
 *                              @vm_uuid: VM's UUID which is booting.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to teardown vGPU related resources for
 *                              the VM
 *                              @vm_uuid: VM's UUID which is shutting down .
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number bytes to read 
 *                              @address_space: specifies for which address
 *                              space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Retuns number on bytes read on success or error.
 * @write:                      Write emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number bytes to be written
 *                              @address_space: specifies for which address
 *                              space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Retuns number on bytes written on success or
 *                              error.
 * @vgpu_set_irqs:              Called to send about interrupts configuration
 *                              information that qemu set. 
 *                              @vdev: vgpu device structure
 *                              @flags, index, start, count and *data : same as
 *                              that of struct vfio_irq_set of
 *                              VFIO_DEVICE_SET_IRQS API. 
 *
 * Physical GPU that support vGPU should be register with vgpu module with 
 * gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space,loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);

};

2.2 Details for callbacks we haven't mentioned above.
---------------------------------------------------------------------------------

vgpu_supported_config: allows the vendor driver to specify the supported vGPU
                       type/configuration

vgpu_create          : create a virtual GPU device, can be used for device hotplug.

vgpu_destroy         : destroy a virtual GPU device, can be used for device hotplug.

vgpu_start           : callback function to notify vendor driver vgpu device
                       come to live for a given virtual machine.

vgpu_shutdown        : callback function to notify vendor driver 

read                 : callback to vendor driver to handle virtual device config
                       space or MMIO read access

write                : callback to vendor driver to handle virtual device config
                       space or MMIO write access

vgpu_set_irqs        : callback to vendor driver to pass along the interrupt
                       information for the target virtual device, then vendor
                       driver can inject interrupt into virtual machine for this
                       device.

2.3 Potential additional virtual device configuration registration interface:
---------------------------------------------------------------------------------

callback function to describe the MMAP behavior of the virtual GPU 

callback function to allow GPU vendor driver to provide PCI config space backing
memory.

3. VGPU TYPE1 IOMMU
==================================================================================

Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track the 
<iova, hva, size, flag> and save the QEMU mm for later reference.

You can find the quick/ugly implementation in the attached patch file, which is
actually just a simple version Alex's type1 IOMMU without actual real
mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called. 

We have thought about providing another vendor driver registration interface so
such tracking information will be sent to vendor driver and he will use the QEMU
mm to do the get_user_pages / remap_pfn_range when it is required. After doing a
quick implementation within our driver, I noticed following issues:

1) OS/VFIO logic into vendor driver which will be a maintenance issue.

2) Every driver vendor has to implement their own RB tree, instead of reusing
the common existing VFIO code (vfio_find/link/unlink_dma) 

3) IOMMU_UNMAP_DMA is expecting to get "unmapped bytes" back to the caller/QEMU,
better not have anything inside a vendor driver that the VFIO caller immediately
depends on.

Based on the above consideration, we decide to implement the DMA tracking logic
within VGPU TYPE1 IOMMU code (ideally, this should be merged into current TYPE1
IOMMU code) and expose two symbols to outside for MMIO mapping and page
translation and pinning. 

Also, with a mmap MMIO interface between virtual and physical, this allows
para-virtualized guest driver can access his virtual MMIO without taking a MMAP
fault hit, also we can support different MMIO size between virtual and physical
device.

int vgpu_map_virtual_bar
(
    uint64_t virt_bar_addr,
    uint64_t phys_bar_addr,
    uint32_t len,
    uint32_t flags
)

EXPORT_SYMBOL(vgpu_map_virtual_bar);

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

EXPORT_SYMBOL(vgpu_dma_do_translate);

Still a lot to be added and modified, such as supporting multiple VMs and 
multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
kernel driver, error handling, roll-back and locked memory size per user, etc. 

4. Modules
==================================================================================

Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko

vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
                           TYPE1 v1 and v2 interface. 

vgpu.ko                  - provide registration interface and virtual device
                           VFIO access.

5. QEMU note
==================================================================================

To allow us focus on the VGPU kernel driver prototyping, we have introduced a new VFIO 
class - vgpu inside QEMU, so we don't have to change the existing vfio/pci.c file and 
use it as a reference for our implementation. It is basically just a quick c & p
from vfio/pci.c to quickly meet our needs.

Once this proposal is finalized, we will move to vfio/pci.c instead of a new
class, and probably the only thing required is to have a new way to discover the
device.

6. Examples
==================================================================================

On this server, we have two NVIDIA M60 GPUs.

[root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)

After nvidia.ko gets initialized, we can query the supported vGPU type by
accessing the "vgpu_supported_types" like following:

[root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
11:GRID M60-0B
12:GRID M60-0Q
13:GRID M60-1B
14:GRID M60-1Q
15:GRID M60-2B
16:GRID M60-2Q
17:GRID M60-4Q
18:GRID M60-8Q

For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
like to create "GRID M60-4Q" VM on it.

echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create

Note: the number 0 here is for vGPU device index. So far the change is not tested
for multiple vgpu devices yet, but we will support it.

At this moment, if you query the "vgpu_supported_types" it will still show all
supported virtual GPU types as no virtual GPU resource is committed yet.

Starting VM:

echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start

then, the supported vGPU type query will return:

[root@cjia-vgx-kvm /home/cjia]$
> cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
17:GRID M60-4Q

So vgpu_supported_config needs to be called whenever a new virtual device gets
created as the underlying HW might limit the supported types if there are
any existing VM runnings.

Then, VM gets shutdown, writes to /sys/class/vgpu/vgpu_shutdown will info the
GPU driver vendor to clean up resource.

Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
device sysfs.

7. What is not covered:
==================================================================================

7.1 QEMU console VNC

QEMU console VNC is not covered in this RFC as it is a pretty isolated module
and not impacting the basic vGPU functionality, also we already have a good
discussion about the new VFIO interface that Alex is going to introduce to allow us 
describe a region for VM surface.

8 Patches
==================================================================================

0001-Add-VGPU-VFIO-driver-class-support-in-QEMU.patch - against QEMU 2.5.0

0001-Add-VGPU-and-its-TYPE1-IOMMU-kernel-module-support.patch  - against 4.4.0-rc5

Thanks,
Kirti and Neo

From dc8ca387f7b06c6dfc85fb4bd79a760dca76e831 Mon Sep 17 00:00:00 2001
From: Neo Jia <cjia@nvidia.com>
Date: Tue, 26 Jan 2016 01:21:11 -0800
Subject: [PATCH] Add VGPU and its TYPE1 IOMMU kernel module support

This is just a quick POV implementation to allow GPU driver vendor to plugin
into VFIO framework to provide their virtual GPU support. This kernel is
providing a registration interface for GPU vendor and generic DMA tracking APIs.

extern int vgpu_register_device(struct pci_dev *dev,
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu types.
 *                              @dev : pci device structure of physical GPU.
 *                              @config: should return string listing supported config
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resouces in graphics
 *                              driver for a particular vgpu.
 *                              @dev: physical pci device structure on which vgpu
 *                                    should be created
 *                              @vm_uuid: VM's uuid for which VM it is intended to
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to be
 *                                        created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in graphics driver for
 *                              a vgpu instance of that VM.
 *                              @dev: physical pci device structure to which
 *                              this vgpu points to.
 *                              @vm_uuid: VM's uuid for which the vgpu belongs to.
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If VM is running and vgpu_destroy is called that
 *                              means the vGPU is being hotunpluged. Return error
 *                              if VM is running and graphics driver doesn't
 *                              support vgpu hotplug.
 * @vgpu_start:                 Called to do initiate vGPU initialization
 *                              process in graphics driver when VM boots before
 *                              qemu starts.
 *                              @vm_uuid: VM's UUID which is booting.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to teardown vGPU related resources for
 *                              the VM
 *                              @vm_uuid: VM's UUID which is shutting down .
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number bytes to read
 *                              @address_space: specifies for which address space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Retuns number on bytes read on success or error.
 * @write:                      Write emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number bytes to be written
 *                              @address_space: specifies for which address space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Retuns number on bytes written on success or error.
 * @vgpu_set_irqs:              Called to send about interrupts configuration
 *                              information that qemu set.
 *                              @vdev: vgpu device structure
 *                              @flags, index, start, count and *data : same as
 *                              that of struct vfio_irq_set of
 *                              VFIO_DEVICE_SET_IRQS API.
 *
 * Physical GPU that support vGPU should be register with vgpu module with
 * gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space,loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);

};

int vgpu_map_virtual_bar
(
    uint64_t virt_bar_addr,
    uint64_t phys_bar_addr,
    uint32_t len,
    uint32_t flags
)

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

Change-Id: Ib70304d9a600c311d5107a94b3fffa938926275b
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
---
 drivers/Kconfig                      |   2 +
 drivers/Makefile                     |   1 +
 drivers/vfio/vfio.c                  |   5 +-
 drivers/vgpu/Kconfig                 |  26 ++
 drivers/vgpu/Makefile                |   5 +
 drivers/vgpu/vfio_iommu_type1_vgpu.c | 511 ++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_dev.c              | 550 +++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_private.h          |  47 +++
 drivers/vgpu/vgpu_sysfs.c            | 322 ++++++++++++++++++++
 drivers/vgpu/vgpu_vfio.c             | 521 +++++++++++++++++++++++++++++++++
 include/linux/vgpu.h                 | 157 ++++++++++
 11 files changed, 2144 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vfio_iommu_type1_vgpu.c
 create mode 100644 drivers/vgpu/vgpu_dev.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 drivers/vgpu/vgpu_sysfs.c
 create mode 100644 drivers/vgpu/vgpu_vfio.c
 create mode 100644 include/linux/vgpu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339de85f..5fd9eae79914 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@  source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 795d0ca714bf..142256b4358b 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@  obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
 obj-$(CONFIG_VFIO)		+= vfio/
+obj-$(CONFIG_VGPU)              += vgpu/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6070b793cbcb..af3ab413e119 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -947,19 +947,18 @@  static long vfio_ioctl_set_iommu(struct vfio_container *container,
 		if (IS_ERR(data)) {
 			ret = PTR_ERR(data);
 			module_put(driver->ops->owner);
-			goto skip_drivers_unlock;
+			continue;
 		}
 
 		ret = __vfio_container_attach_groups(container, driver, data);
 		if (!ret) {
 			container->iommu_driver = driver;
 			container->iommu_data = data;
+			goto skip_drivers_unlock;
 		} else {
 			driver->ops->release(data);
 			module_put(driver->ops->owner);
 		}
-
-		goto skip_drivers_unlock;
 	}
 
 	mutex_unlock(&vfio.iommu_drivers_lock);
diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
new file mode 100644
index 000000000000..698ddf907a16
--- /dev/null
+++ b/drivers/vgpu/Kconfig
@@ -0,0 +1,26 @@ 
+
+menuconfig VGPU
+    tristate "VGPU driver framework"
+    depends on VFIO
+    select VGPU_VFIO
+    select VFIO_IOMMU_TYPE1_VGPU
+    help
+        VGPU provides a framework to virtualize GPU without SR-IOV cap
+        See Documentation/vgpu.txt for more details.
+
+        If you don't know what do here, say N.
+
+config VGPU
+    tristate
+    depends on VFIO
+    default n
+
+config VGPU_VFIO
+    tristate
+    depends on VGPU 
+    default n
+
+config VFIO_IOMMU_TYPE1_VGPU
+    tristate
+    depends on VGPU_VFIO
+    default n
diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
new file mode 100644
index 000000000000..098a3591a535
--- /dev/null
+++ b/drivers/vgpu/Makefile
@@ -0,0 +1,5 @@ 
+
+vgpu-y := vgpu_sysfs.o vgpu_dev.o vgpu_vfio.o
+
+obj-$(CONFIG_VGPU)	+= vgpu.o
+obj-$(CONFIG_VFIO_IOMMU_TYPE1_VGPU) += vfio_iommu_type1_vgpu.o
diff --git a/drivers/vgpu/vfio_iommu_type1_vgpu.c b/drivers/vgpu/vfio_iommu_type1_vgpu.c
new file mode 100644
index 000000000000..6b20f1374b3b
--- /dev/null
+++ b/drivers/vgpu/vfio_iommu_type1_vgpu.c
@@ -0,0 +1,511 @@ 
+/*
+ * VGPU : IOMMU DMA mapping support for VGPU
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC     "VGPU Type1 IOMMU driver for VFIO"
+
+// VFIO structures
+
+struct vfio_iommu_vgpu {
+	struct mutex lock;
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+	struct rb_root dma_list;
+	struct mm_struct * vm_mm;
+};
+
+struct vgpu_vfio_dma {
+	struct rb_node node;
+	dma_addr_t iova;
+	unsigned long vaddr;
+	size_t size;
+	int prot;
+};
+
+/*
+ * VGPU VFIO FOPs definition
+ *
+ */
+
+/*
+ * Duplicated from vfio_link_dma, just quick hack ... should
+ * reuse code later
+ */
+
+static void vgpu_link_dma(struct vfio_iommu_vgpu *iommu,
+			  struct vgpu_vfio_dma *new)
+{
+	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
+	struct vgpu_vfio_dma *dma;
+
+	while (*link) {
+		parent = *link;
+		dma = rb_entry(parent, struct vgpu_vfio_dma, node);
+
+		if (new->iova + new->size <= dma->iova)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &iommu->dma_list);
+}
+
+static struct vgpu_vfio_dma *vgpu_find_dma(struct vfio_iommu_vgpu *iommu,
+					   dma_addr_t start, size_t size)
+{
+	struct rb_node *node = iommu->dma_list.rb_node;
+
+	while (node) {
+		struct vgpu_vfio_dma *dma = rb_entry(node, struct vgpu_vfio_dma, node);
+
+		if (start + size <= dma->iova)
+			node = node->rb_left;
+		else if (start >= dma->iova + dma->size)
+			node = node->rb_right;
+		else
+			return dma;
+	}
+
+	return NULL;
+}
+
+static void vgpu_unlink_dma(struct vfio_iommu_vgpu *iommu, struct vgpu_vfio_dma *old)
+{
+	rb_erase(&old->node, &iommu->dma_list);
+}
+
+static void vgpu_dump_dma(struct vfio_iommu_vgpu *iommu)
+{
+	struct vgpu_vfio_dma *c, *n;
+	uint32_t i = 0;
+
+	rbtree_postorder_for_each_entry_safe(c, n, &iommu->dma_list, node)
+		printk(KERN_INFO "%s: dma[%d] iova:0x%llx, vaddr:0x%lx, size:0x%lx\n",
+		       __FUNCTION__, i++, c->iova, c->vaddr, c->size);
+}
+
+static int vgpu_dma_do_track(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_map *map)
+{
+	dma_addr_t iova = map->iova;
+	unsigned long vaddr = map->vaddr;
+	int ret = 0, prot = 0;
+	struct vgpu_vfio_dma *vgpu_dma;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	if (vgpu_find_dma(vgpu_iommu, map->iova, map->size)) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -EEXIST;
+	}
+
+	vgpu_dma = kzalloc(sizeof(*vgpu_dma), GFP_KERNEL);
+
+	if (!vgpu_dma) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -ENOMEM;
+	}
+
+	vgpu_dma->iova = iova;
+	vgpu_dma->vaddr = vaddr;
+	vgpu_dma->prot = prot;
+	vgpu_dma->size = map->size;
+
+	vgpu_link_dma(vgpu_iommu, vgpu_dma);
+
+	mutex_unlock(&vgpu_iommu->lock);
+	return ret;
+}
+
+static int vgpu_dma_do_untrack(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_unmap *unmap)
+{
+	struct vgpu_vfio_dma *vgpu_dma;
+	size_t unmapped = 0;
+	int ret = 0;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, 0);
+	if (vgpu_dma && vgpu_dma->iova != unmap->iova) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova + unmap->size - 1, 0);
+	if (vgpu_dma && vgpu_dma->iova + vgpu_dma->size != unmap->iova + unmap->size) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	while (( vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, unmap->size))) {
+		unmapped += vgpu_dma->size;
+		vgpu_unlink_dma(vgpu_iommu, vgpu_dma);
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	unmap->size = unmapped;
+
+	return ret;
+}
+
+/* Ugly hack to quickly test single deivce ... */
+
+static struct vfio_iommu_vgpu *_local_iommu = NULL;
+
+int vgpu_map_virtual_bar
+(
+	uint64_t virt_bar_addr,
+        uint64_t phys_bar_addr,
+	uint32_t len,
+	uint32_t flags
+)
+{
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	unsigned long remote_vaddr = 0;
+	struct vgpu_vfio_dma *vgpu_dma = NULL;
+	struct vm_area_struct *remote_vma = NULL;
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+	int ret = 0;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	down_write(&mm->mmap_sem);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, virt_bar_addr, len /*  size */);
+	if (!vgpu_dma) {
+		printk(KERN_INFO "%s: fail locate guest physical:0x%llx\n",
+		       __FUNCTION__, virt_bar_addr);
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	remote_vaddr = vgpu_dma->vaddr + virt_bar_addr - vgpu_dma->iova;
+
+        remote_vma = find_vma(mm, remote_vaddr);
+
+	if (remote_vma == NULL) {
+		printk(KERN_INFO "%s: fail locate vma, physical addr:0x%llx\n",
+		       __FUNCTION__, virt_bar_addr);
+		ret = -EINVAL;
+		goto unlock;
+	}
+	else {
+		printk(KERN_INFO "%s: locate vma, addr:0x%lx\n",
+		       __FUNCTION__, remote_vma->vm_start);
+	}
+
+	remote_vma->vm_page_prot = pgprot_noncached(remote_vma->vm_page_prot);
+
+	remote_vma->vm_pgoff = phys_bar_addr >> PAGE_SHIFT;
+
+	ret = remap_pfn_range(remote_vma, virt_bar_addr, remote_vma->vm_pgoff,
+			len, remote_vma->vm_page_prot);
+
+	if (ret) {
+		printk(KERN_INFO "%s: fail to remap vma:%d\n", __FUNCTION__, ret);
+		goto unlock;
+	}
+
+unlock:
+
+	up_write(&mm->mmap_sem);
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+
+	return ret;
+}
+
+EXPORT_SYMBOL(vgpu_map_virtual_bar);
+
+int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
+{
+	int i = 0, ret = 0, prot = 0;
+	unsigned long remote_vaddr = 0, pfn = 0;
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	struct vgpu_vfio_dma *vgpu_dma;
+	struct page *page[1];
+	// unsigned long * addr = NULL;
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+
+	prot = IOMMU_READ | IOMMU_WRITE;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	for (i = 0; i < count; i++) {
+		dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
+		vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
+
+		if (!vgpu_dma) {
+			printk(KERN_INFO "%s: fail locate iova[%d]:0x%llx\n", __FUNCTION__, i, iova);
+			ret = -EINVAL;
+			goto unlock;
+		}
+
+		remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
+		printk(KERN_INFO "%s: find dma iova[%d]:0x%llx, vaddr:0x%lx, size:0x%lx, remote_vaddr:0x%lx\n",
+			__FUNCTION__, i, vgpu_dma->iova,
+			vgpu_dma->vaddr, vgpu_dma->size, remote_vaddr);
+
+		if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
+			pfn = page_to_pfn(page[0]);
+			printk(KERN_INFO "%s: pfn[%d]:0x%lx\n", __FUNCTION__, i, pfn);
+			// addr = vmap(page, 1, VM_MAP, PAGE_KERNEL);
+		}
+		else {
+			printk(KERN_INFO "%s: fail to pin pfn[%d]\n", __FUNCTION__, i);
+			ret = -ENOMEM;
+			goto unlock;
+		}
+
+		gfn_buffer[i] = pfn;
+		// vunmap(addr);
+
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+	return ret;
+}
+
+EXPORT_SYMBOL(vgpu_dma_do_translate);
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+static void *vfio_iommu_vgpu_open(unsigned long arg)
+{
+	struct vfio_iommu_vgpu *iommu;
+
+	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+
+	if (!iommu)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&iommu->lock);
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	/* TODO: Keep track the v2 vs. v1, for now only assume
+	 * we are v2 due to QEMU code */
+	_local_iommu = iommu;
+	return iommu;
+}
+
+static void vfio_iommu_vgpu_release(void *iommu_data)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	kfree(iommu);
+	printk(KERN_INFO "%s", __FUNCTION__);
+}
+
+static long vfio_iommu_vgpu_ioctl(void *iommu_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	unsigned long minsz;
+	struct vfio_iommu_vgpu *vgpu_iommu = iommu_data;
+
+	switch (cmd) {
+	case VFIO_CHECK_EXTENSION:
+	{
+		if ((arg == VFIO_TYPE1_IOMMU) || (arg == VFIO_TYPE1v2_IOMMU))
+			return 1;
+		else
+			return 0;
+	}
+
+	case VFIO_IOMMU_GET_INFO:
+	{
+		struct vfio_iommu_type1_info info;
+		minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_IOMMU_MAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_map map;
+		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+
+		if (copy_from_user(&map, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (map.argsz < minsz)
+			return -EINVAL;
+
+		printk(KERN_INFO "VGPU-IOMMU:MAP_DMA flags:%d, vaddr:0x%llx, iova:0x%llx, size:0x%llx\n",
+			map.flags, map.vaddr, map.iova, map.size);
+
+		/*
+		 * TODO: Tracking code is mostly duplicated from TYPE1 IOMMU, ideally,
+		 * this should be merged into one single file and reuse data
+		 * structure
+		 *
+		 */
+		ret = vgpu_dma_do_track(vgpu_iommu, &map);
+		break;
+	}
+	case VFIO_IOMMU_UNMAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_unmap unmap;
+
+		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+
+		if (copy_from_user(&unmap, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (unmap.argsz < minsz)
+			return -EINVAL;
+
+		ret = vgpu_dma_do_untrack(vgpu_iommu, &unmap);
+		break;
+	}
+	default:
+	{
+		printk(KERN_INFO "%s cmd default ", __FUNCTION__);
+		ret = -ENOTTY;
+		break;
+	}
+	}
+
+	return ret;
+}
+
+
+static int vfio_iommu_vgpu_attach_group(void *iommu_data,
+		                        struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	vgpu_dev = get_vgpu_device_from_group(iommu_group);
+	if (vgpu_dev) {
+		iommu->vgpu_dev = vgpu_dev;
+		iommu->group = iommu_group;
+
+		/* IOMMU shares the same life cylce as VM MM */
+		iommu->vm_mm = current->mm;
+
+		printk(KERN_INFO "%s index %d", __FUNCTION__, vgpu_dev->minor);
+		return 0;
+	}
+	iommu->group = iommu_group;
+	return 1;
+}
+
+static void vfio_iommu_vgpu_detach_group(void *iommu_data,
+		struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+	iommu->vm_mm = NULL;
+	iommu->group = NULL;
+
+	return;
+}
+
+
+static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_driver_ops = {
+	.name           = "vgpu_vfio",
+	.owner          = THIS_MODULE,
+	.open           = vfio_iommu_vgpu_open,
+	.release        = vfio_iommu_vgpu_release,
+	.ioctl          = vfio_iommu_vgpu_ioctl,
+	.attach_group   = vfio_iommu_vgpu_attach_group,
+	.detach_group   = vfio_iommu_vgpu_detach_group,
+};
+
+
+int vgpu_vfio_iommu_init(void)
+{
+	int rc = vfio_register_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vfio iommu, err:%d\n", rc);
+	}
+
+	return rc;
+}
+
+void vgpu_vfio_iommu_exit(void)
+{
+	// unregister vgpu_vfio driver
+	vfio_unregister_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+}
+
+
+module_init(vgpu_vfio_iommu_init);
+module_exit(vgpu_vfio_iommu_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
diff --git a/drivers/vgpu/vgpu_dev.c b/drivers/vgpu/vgpu_dev.c
new file mode 100644
index 000000000000..1d4eb235122c
--- /dev/null
+++ b/drivers/vgpu/vgpu_dev.c
@@ -0,0 +1,550 @@ 
+/*
+ * VGPU core
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC	"VGPU driver"
+
+/*
+ * #defines
+ */
+
+#define VGPU_CLASS_NAME		"vgpu"
+
+#define VGPU_DEV_NAME		"vgpu"
+
+// TODO remove these defines
+// minor number reserved for control device
+#define VGPU_CONTROL_DEVICE       0
+
+#define VGPU_CONTROL_DEVICE_NAME  "vgpuctl"
+
+/*
+ * Global Structures
+ */
+
+static struct vgpu {
+	dev_t               vgpu_devt;
+	struct class        *class;
+	struct cdev         vgpu_cdev;
+	struct list_head    vgpu_devices_list;  // Head entry for the doubly linked vgpu_device list
+	struct mutex        vgpu_devices_lock;
+	struct idr          vgpu_idr;
+	struct list_head    gpu_devices_list;
+	struct mutex        gpu_devices_lock;
+} vgpu;
+
+
+/*
+ * Function prototypes
+ */
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev);
+
+unsigned int vgpu_poll(struct file *file, poll_table *wait);
+long vgpu_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long i_arg);
+int vgpu_mmap(struct file *file, struct vm_area_struct *vma);
+
+int vgpu_open(struct inode *inode, struct file *file);
+int vgpu_close(struct inode *inode, struct file *file);
+ssize_t vgpu_read(struct file *file, char __user * buf,
+		      size_t len, loff_t * ppos);
+ssize_t vgpu_write(struct file *file, const char __user *data,
+		       size_t len, loff_t *ppos);
+
+/*
+ * Functions
+ */
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group)
+{
+
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->group) {
+			if (iommu_group_id(vdev->group) == iommu_group_id(group)) {
+				mutex_unlock(&vgpu.vgpu_devices_lock);
+				return vdev;
+			}
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+EXPORT_SYMBOL_GPL(get_vgpu_device_from_group);
+
+int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev, *tmp;
+
+	if (!dev)
+		return -EINVAL;
+
+        gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL);
+        if (!gpu_dev)
+                return -ENOMEM;
+
+	gpu_dev->dev = dev;
+        gpu_dev->ops = ops;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+
+        /* Check for duplicates */
+        list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) {
+                if (tmp->dev == dev) {
+                        mutex_unlock(&vgpu.gpu_devices_lock);
+                        kfree(gpu_dev);
+                        return -EINVAL;
+                }
+        }
+
+	ret = vgpu_create_pci_device_files(dev);
+	if (ret) {
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		kfree(gpu_dev);
+		return ret;
+	}
+        list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list);
+
+	printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class);
+        mutex_unlock(&vgpu.gpu_devices_lock);
+
+        return 0;
+}
+EXPORT_SYMBOL(vgpu_register_device);
+
+void vgpu_unregister_device(struct pci_dev *dev)
+{
+        struct gpu_device *gpu_dev;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+        list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+                if (gpu_dev->dev == dev) {
+			printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class);
+			vgpu_remove_pci_device_files(dev);
+                        list_del(&gpu_dev->gpu_next);
+                        mutex_unlock(&vgpu.gpu_devices_lock);
+                        kfree(gpu_dev);
+                        return;
+                }
+        }
+        mutex_unlock(&vgpu.gpu_devices_lock);
+}
+EXPORT_SYMBOL(vgpu_unregister_device);
+
+
+/*
+ *  Static functions
+ */
+
+static struct file_operations vgpu_fops = {
+	.owner          = THIS_MODULE,
+};
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev)
+{
+	if (vgpu_dev->dev) {
+		device_destroy(vgpu.class, vgpu_dev->dev->devt);
+		vgpu_dev->dev = NULL;
+	}
+}
+
+/*
+ * Helper Functions
+ */
+
+static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char *name)
+{
+	struct vgpu_device *vgpu_dev = NULL;
+
+	vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL);
+	if (!vgpu_dev)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&vgpu_dev->kref);
+	memcpy(&vgpu_dev->vm_uuid, &uuid, sizeof(uuid_le));
+	vgpu_dev->vgpu_instance = instance;
+	strcpy(vgpu_dev->dev_name, name);
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	return vgpu_dev;
+}
+
+static void vgpu_device_free(struct vgpu_device *vgpu_dev)
+{
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_del(&vgpu_dev->list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	kfree(vgpu_dev);
+}
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+				(vdev->vgpu_instance == instance)) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+struct vgpu_device *find_vgpu_device(struct device *dev)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->dev == dev) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id)
+{
+	int minor;
+	char name[64];
+	int numChar = 0;
+	int retval = 0;
+
+	struct iommu_group *group = NULL;
+	struct device *dev = NULL;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	struct gpu_device *gpu_dev;
+
+	printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__);
+
+	numChar = sprintf(name, "%pUb-%d", vm_uuid.b, instance);
+	name[numChar] = '\0';
+
+	vgpu_dev = vgpu_device_alloc(vm_uuid, instance, name);
+	if (IS_ERR(vgpu_dev)) {
+		return PTR_ERR(vgpu_dev);
+	}
+
+	// check if VM device is present
+	// if not present, create with devt=0 and parent=NULL
+	// create device for instance with devt= MKDEV(vgpu.major, minor)
+	// and parent=VM device
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	vgpu_dev->vgpu_id = vgpu_id;
+
+	// TODO on removing control device change the 3rd parameter to 0
+	minor = idr_alloc(&vgpu.vgpu_idr, vgpu_dev, 1, MINORMASK + 1, GFP_KERNEL);
+	if (minor < 0) {
+		retval = minor;
+		goto create_failed;
+	}
+
+	dev = device_create(vgpu.class, NULL, MKDEV(MAJOR(vgpu.vgpu_devt), minor), NULL, "%s", name);
+	if (IS_ERR(dev)) {
+		retval = PTR_ERR(dev);
+		goto create_failed1;
+	}
+
+	vgpu_dev->dev = dev;
+	vgpu_dev->minor = minor;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev == pdev) {
+			vgpu_dev->gpu_dev = gpu_dev;
+			if (gpu_dev->ops->vgpu_create) {
+				retval = gpu_dev->ops->vgpu_create(pdev, vgpu_dev->vm_uuid,
+								   instance, vgpu_id);
+				if (retval)
+				{
+					mutex_unlock(&vgpu.gpu_devices_lock);
+					goto create_failed2;
+				}
+			}
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	if (!vgpu_dev->gpu_dev) {
+		retval = -EINVAL;
+		goto create_failed2;
+	}
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	printk(KERN_INFO "UUID %pUb \n", vgpu_dev->vm_uuid.b);
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		printk(KERN_ERR "VGPU: failed to allocate group!\n");
+		retval = PTR_ERR(group);
+		goto create_failed2;
+	}
+
+	retval = iommu_group_add_device(group, dev);
+	if (retval) {
+		printk(KERN_ERR "VGPU: failed to add dev to group!\n");
+		iommu_group_put(group);
+		goto create_failed2;
+	}
+
+	retval = vgpu_group_init(vgpu_dev, group);
+	if (retval) {
+		printk(KERN_ERR "VGPU: failed vgpu_group_init \n");
+		iommu_group_put(group);
+		iommu_group_remove_device(dev);
+		goto create_failed2;
+	}
+
+	vgpu_dev->group = group;
+	printk(KERN_INFO "VGPU: group_id = %d \n", iommu_group_id(group));
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return retval;
+
+create_failed2:
+	vgpu_device_destroy(vgpu_dev);
+
+create_failed1:
+	idr_remove(&vgpu.vgpu_idr, minor);
+
+create_failed:
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	vgpu_device_free(vgpu_dev);
+
+	return retval;
+}
+
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev)
+{
+	struct device *dev = vgpu_dev->dev;
+
+	if (!dev) {
+		return;
+	}
+
+	printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name);
+	if (vgpu_dev->gpu_dev->ops->vgpu_destroy) {
+		int retval = 0;
+		retval = vgpu_dev->gpu_dev->ops->vgpu_destroy(vgpu_dev->gpu_dev->dev,
+							      vgpu_dev->vm_uuid,
+							      vgpu_dev->vgpu_instance);
+	/* if vendor driver doesn't return success that means vendor driver doesn't
+	 * support hot-unplug */
+		if (retval)
+			return;
+	}
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	vgpu_group_free(vgpu_dev);
+	iommu_group_put(dev->iommu_group);
+	iommu_group_remove_device(dev);
+	vgpu_device_destroy(vgpu_dev);
+	idr_remove(&vgpu.vgpu_idr, vgpu_dev->minor);
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	vgpu_device_free(vgpu_dev);
+}
+
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev, *vgpu_dev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	// search VGPU device
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+				(vdev->vgpu_instance == instance)) {
+			vgpu_dev = vdev;
+			break;
+		}
+	}
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	if (vgpu_dev)
+		destroy_vgpu_device(vgpu_dev);
+}
+
+void get_vgpu_supported_types(struct device *dev, char *str)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (&gpu_dev->dev->dev == dev) {
+			if (gpu_dev->ops->vgpu_supported_config)
+				gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str);
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+
+int vgpu_start_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_start)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_start(vgpu_dev->vm_uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_shutdown)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_shutdown(vgpu_dev->vm_uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+                           unsigned index, unsigned start, unsigned count,
+                           void *data)
+{
+       int ret = 0;
+
+       mutex_lock(&vgpu.gpu_devices_lock);
+       if (vgpu_dev->gpu_dev->ops->vgpu_set_irqs)
+               ret = vgpu_dev->gpu_dev->ops->vgpu_set_irqs(vgpu_dev, flags,
+                                                          index, start, count, data);
+       mutex_unlock(&vgpu.gpu_devices_lock);
+       return ret;
+}
+
+char *vgpu_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev));
+}
+
+static struct class vgpu_class = {
+	.name		= VGPU_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= vgpu_class_attrs,
+	.dev_groups	= vgpu_dev_groups,
+	.devnode	= vgpu_devnode,
+};
+
+static int __init vgpu_init(void)
+{
+	int rc = 0;
+
+	memset(&vgpu, 0 , sizeof(vgpu));
+
+	idr_init(&vgpu.vgpu_idr);
+	mutex_init(&vgpu.vgpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
+	mutex_init(&vgpu.gpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.gpu_devices_list);
+
+	// get major number from kernel
+	rc = alloc_chrdev_region(&vgpu.vgpu_devt, 0, MINORMASK, VGPU_DEV_NAME);
+
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu drv, err:%d\n", rc);
+		return rc;
+	}
+
+	cdev_init(&vgpu.vgpu_cdev, &vgpu_fops);
+	cdev_add(&vgpu.vgpu_cdev, vgpu.vgpu_devt, MINORMASK);
+
+	printk(KERN_ALERT "major_number:%d is allocated for vgpu\n", MAJOR(vgpu.vgpu_devt));
+
+	rc = class_register(&vgpu_class);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu class\n");
+		goto failed1;
+	}
+
+	vgpu.class = &vgpu_class;
+
+	return rc;
+
+failed1:
+	cdev_del(&vgpu.vgpu_cdev);
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+
+	return rc;
+}
+
+static void __exit vgpu_exit(void)
+{
+	// TODO: Release all unclosed fd
+	struct vgpu_device *vdev = NULL, *tmp;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry_safe(vdev, tmp, &vgpu.vgpu_devices_list, list) {
+		printk(KERN_INFO "VGPU: exit destroying device %s ", vdev->dev_name);
+		mutex_unlock(&vgpu.vgpu_devices_lock);
+		destroy_vgpu_device(vdev);
+		mutex_lock(&vgpu.vgpu_devices_lock);
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	idr_destroy(&vgpu.vgpu_idr);
+	cdev_del(&vgpu.vgpu_cdev);
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+	class_destroy(vgpu.class);
+	vgpu.class = NULL;
+}
+
+module_init(vgpu_init)
+module_exit(vgpu_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
new file mode 100644
index 000000000000..7e3c400d29f7
--- /dev/null
+++ b/drivers/vgpu/vgpu_private.h
@@ -0,0 +1,47 @@ 
+/*
+ * VGPU interal definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_PRIVATE_H
+#define VGPU_PRIVATE_H
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group * group);
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev);
+
+struct vgpu_device *find_vgpu_device(struct device *dev);
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id);
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+/* Function prototypes for vgpu_sysfs */
+
+extern struct class_attribute vgpu_class_attrs[];
+extern const struct attribute_group *vgpu_dev_groups[];
+
+int vgpu_create_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_notify_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_remove_status_file(struct vgpu_device *vgpu_dev);
+
+int vgpu_create_pci_device_files(struct pci_dev *dev);
+void vgpu_remove_pci_device_files(struct pci_dev *dev);
+
+void get_vgpu_supported_types(struct device *dev, char *str);
+int vgpu_start_callback(struct vgpu_device *vgpu_dev);
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+                           unsigned index, unsigned start, unsigned count,
+                           void *data);
+
+#endif /* VGPU_PRIVATE_H */
diff --git a/drivers/vgpu/vgpu_sysfs.c b/drivers/vgpu/vgpu_sysfs.c
new file mode 100644
index 000000000000..e48cbcd6948d
--- /dev/null
+++ b/drivers/vgpu/vgpu_sysfs.c
@@ -0,0 +1,322 @@ 
+/*
+ * File attributes for vGPU devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+/* Prototypes */
+
+static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf);
+static DEVICE_ATTR_RO(vgpu_supported_types);
+
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_create);
+
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_destroy);
+
+
+/* Static functions */
+
+static bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < 36)
+		return -1;
+
+	for (i = 0; i < 16; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			printk(KERN_ERR "%s err", __FUNCTION__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+
+/* Functions */
+static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	char *str;
+	ssize_t n;
+
+        str = kzalloc(sizeof(*str) * 512, GFP_KERNEL);
+        if (!str)
+                return -ENOMEM;
+
+	get_vgpu_supported_types(dev, str);
+
+	n = sprintf(buf,"%s\n", str);
+	kfree(str);
+
+	return n;
+}
+
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
+{
+	char *vm_uuid_str, *instance_str, *str;
+	uuid_le vm_uuid;
+	uint32_t instance, vgpu_id;
+	struct pci_dev *pdev;
+
+	str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s vgpu type and instance not specified %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	if ((instance_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty instance or string %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s vgpu type not specified %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+
+	}
+
+	instance = (unsigned int)simple_strtoul(instance_str, NULL, 0);
+
+	vgpu_id = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	if (dev_is_pci(dev)) {
+		pdev = to_pci_dev(dev);
+
+		if (create_vgpu_device(pdev, vm_uuid, instance, vgpu_id) < 0) {
+			printk(KERN_ERR "%s vgpu create error \n", __FUNCTION__);
+			return -EINVAL;
+		}
+	}
+
+	return count;
+}
+
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
+{
+	char *vm_uuid_str, *str;
+	uuid_le vm_uuid;
+	unsigned int instance;
+
+	str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s instance not specified %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	instance = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	printk(KERN_INFO "%s UUID %pUb - %d \n", __FUNCTION__, vm_uuid.b, instance);
+
+	destroy_vgpu_device_by_uuid(vm_uuid, instance);
+
+	return count;
+}
+
+static ssize_t
+vgpu_vm_uuid_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = find_vgpu_device(dev);
+
+	if (drv)
+		return sprintf(buf, "%pUb \n", drv->vm_uuid.b);
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_vm_uuid);
+
+static ssize_t
+vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = find_vgpu_device(dev);
+
+	if (drv && drv->group)
+		return sprintf(buf, "%d \n", iommu_group_id(drv->group));
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_group_id);
+
+
+static struct attribute *vgpu_dev_attrs[] = {
+	&dev_attr_vgpu_vm_uuid.attr,
+	&dev_attr_vgpu_group_id.attr,
+	NULL,
+};
+
+static const struct attribute_group vgpu_dev_group = {
+	.attrs = vgpu_dev_attrs,
+};
+
+const struct attribute_group *vgpu_dev_groups[] = {
+	&vgpu_dev_group,
+	NULL,
+};
+
+
+ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr,
+		const char *buf, size_t count)
+{
+	char *vm_uuid_str;
+	uuid_le vm_uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!vm_uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+
+	if (vgpu_dev && vgpu_dev->dev) {
+		kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_ONLINE);
+
+		ret = vgpu_start_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_start callback failed  %d \n", __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr,
+		const char *buf, size_t count)
+{
+	char *vm_uuid_str;
+	uuid_le vm_uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!vm_uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+	vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+
+	if (vgpu_dev && vgpu_dev->dev) {
+		kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_OFFLINE);
+
+		ret = vgpu_shutdown_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_shutdown callback failed  %d \n", __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+struct class_attribute vgpu_class_attrs[] = {
+	__ATTR_WO(vgpu_start),
+	__ATTR_WO(vgpu_shutdown),
+	__ATTR_NULL
+};
+
+int vgpu_create_pci_device_files(struct pci_dev *dev)
+{
+	int retval;
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_supported_types sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs entry\n");
+		return retval;
+	}
+
+	return 0;
+}
+
+
+void vgpu_remove_pci_device_files(struct pci_dev *dev)
+{
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+}
+
diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c
new file mode 100644
index 000000000000..ef0833140d84
--- /dev/null
+++ b/drivers/vgpu/vgpu_vfio.c
@@ -0,0 +1,521 @@ 
+/*
+ * VGPU VFIO device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+struct vfio_vgpu_device {
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+};
+
+static int vgpu_dev_open(void *device_data)
+{
+	printk(KERN_INFO "%s ", __FUNCTION__);
+	return 0;
+}
+
+static void vgpu_dev_close(void *device_data)
+{
+
+}
+
+static uint64_t resource_len(struct vgpu_device *vgpu_dev, int bar_index)
+{
+	uint64_t size = 0;
+
+	switch (bar_index) {
+	case VFIO_PCI_BAR0_REGION_INDEX:
+		size = 16 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR1_REGION_INDEX:
+		size = 256 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR2_REGION_INDEX:
+		size = 32 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR5_REGION_INDEX:
+		size = 128;
+		break;
+	default:
+		size = 0;
+		break;
+	}
+	return size;
+}
+
+static int vgpu_get_irq_count(struct vfio_vgpu_device *vdev, int irq_type)
+{
+       return 1;
+}
+
+static long vgpu_dev_unlocked_ioctl(void *device_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd)
+	{
+		case VFIO_DEVICE_GET_INFO:
+		{
+			struct vfio_device_info info;
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_INFO cmd index = %d", __FUNCTION__, vdev->vgpu_dev->minor);
+			minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz)
+				return -EINVAL;
+
+			info.flags = VFIO_DEVICE_FLAGS_PCI;
+			info.num_regions = VFIO_PCI_NUM_REGIONS;
+			info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+			return copy_to_user((void __user *)arg, &info, minsz);
+		}
+
+		case VFIO_DEVICE_GET_REGION_INFO:
+		{
+			struct vfio_region_info info;
+
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_REGION_INFO cmd", __FUNCTION__);
+
+			minsz = offsetofend(struct vfio_region_info, offset);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz)
+				return -EINVAL;
+
+			switch (info.index) {
+				case VFIO_PCI_CONFIG_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = 0x100;     // 4K
+					//                    info.size = sizeof(vdev->vgpu_dev->config_space);
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+							VFIO_REGION_INFO_FLAG_WRITE;
+					break;
+				case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = resource_len(vdev->vgpu_dev, info.index);
+					if (!info.size) {
+						info.flags = 0;
+						break;
+					}
+
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+						VFIO_REGION_INFO_FLAG_WRITE;
+
+					if ((info.index == VFIO_PCI_BAR1_REGION_INDEX) ||
+					     (info.index == VFIO_PCI_BAR2_REGION_INDEX)) {
+						info.flags |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+					}
+
+					/* TODO: provides configurable setups to
+					 * GPU vendor
+					 */
+
+					if (info.index == VFIO_PCI_BAR1_REGION_INDEX)
+						info.flags = VFIO_REGION_INFO_FLAG_MMAP;
+
+					break;
+				case VFIO_PCI_VGA_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = 0xc0000;
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+						VFIO_REGION_INFO_FLAG_WRITE;
+					break;
+
+				case VFIO_PCI_ROM_REGION_INDEX:
+				default:
+					return -EINVAL;
+			}
+
+			return copy_to_user((void __user *)arg, &info, minsz);
+
+		}
+		case VFIO_DEVICE_GET_IRQ_INFO:
+		{
+			struct vfio_irq_info info;
+
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_IRQ_INFO cmd", __FUNCTION__);
+			minsz = offsetofend(struct vfio_irq_info, count);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+				return -EINVAL;
+
+			switch (info.index) {
+				case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
+				case VFIO_PCI_REQ_IRQ_INDEX:
+					break;
+					/* pass thru to return error */
+				default:
+					return -EINVAL;
+			}
+
+			info.count = VFIO_PCI_NUM_IRQS;
+
+			info.flags = VFIO_IRQ_INFO_EVENTFD;
+			info.count = vgpu_get_irq_count(vdev, info.index);
+
+			if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+				info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+						VFIO_IRQ_INFO_AUTOMASKED);
+			else
+				info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+			return copy_to_user((void __user *)arg, &info, minsz);
+		}
+
+		case VFIO_DEVICE_SET_IRQS:
+		{
+			struct vfio_irq_set hdr;
+			u8 *data = NULL;
+			int ret = 0;
+
+			minsz = offsetofend(struct vfio_irq_set, count);
+
+			if (copy_from_user(&hdr, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+					hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+						VFIO_IRQ_SET_ACTION_TYPE_MASK))
+				return -EINVAL;
+
+			if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+				size_t size;
+				int max = vgpu_get_irq_count(vdev, hdr.index);
+
+				if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+					size = sizeof(uint8_t);
+				else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+					size = sizeof(int32_t);
+				else
+					return -EINVAL;
+
+				if (hdr.argsz - minsz < hdr.count * size ||
+				    hdr.start >= max || hdr.start + hdr.count > max)
+					return -EINVAL;
+
+				data = memdup_user((void __user *)(arg + minsz),
+						hdr.count * size);
+				if (IS_ERR(data))
+					return PTR_ERR(data);
+
+			}
+			ret = vgpu_set_irqs_callback(vdev->vgpu_dev, hdr.flags, hdr.index,
+					hdr.start, hdr.count, data);
+			kfree(data);
+
+
+			return ret;
+		}
+
+		default:
+			return -EINVAL;
+	}
+	return ret;
+}
+
+
+ssize_t vgpu_dev_config_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	int cfg_size = sizeof(vgpu_dev->config_space);
+	int ret = 0;
+	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos < 0 || pos >= cfg_size ||
+	    pos + count > cfg_size) {
+		printk(KERN_ERR "%s pos 0x%llx out of range\n", __FUNCTION__, pos);
+		ret = -EFAULT;
+		goto config_rw_exit;
+	}
+
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto config_rw_exit;
+		}
+
+		/* FIXME: Need to save the BAR value properly */
+		switch (pos) {
+		case PCI_BASE_ADDRESS_0:
+			vgpu_dev->bar[0].start = *((uint32_t *)user_data);
+			break;
+		case PCI_BASE_ADDRESS_1:
+			vgpu_dev->bar[1].start = *((uint32_t *)user_data);
+			break;
+		case PCI_BASE_ADDRESS_2:
+			vgpu_dev->bar[2].start = *((uint32_t *)user_data);
+			break;
+		}
+
+		if (vgpu_dev->gpu_dev->ops->write) {
+			ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+							    user_data,
+							    count,
+							    vgpu_emul_space_config,
+							    pos);
+		}
+
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (vgpu_dev->gpu_dev->ops->read) {
+			ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+							   ret_data,
+							   count,
+							   vgpu_emul_space_config,
+							   pos);
+		}
+
+		if (ret > 0 ) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+				kfree(ret_data);
+				goto config_rw_exit;
+			}
+		}
+		kfree(ret_data);
+	}
+
+config_rw_exit:
+
+	return ret;
+}
+
+ssize_t vgpu_dev_bar_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t pos;
+	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	uint64_t end;
+	int ret = 0;
+
+	if (!vgpu_dev->bar[bar_index].start) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	end = resource_len(vgpu_dev, bar_index);
+
+	if (offset >= end) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	pos = vgpu_dev->bar[bar_index].start + offset;
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto bar_rw_exit;
+		}
+
+		if (vgpu_dev->gpu_dev->ops->write) {
+			ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+							    user_data,
+							    count,
+							    vgpu_emul_space_mmio,
+							    pos);
+		}
+
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (vgpu_dev->gpu_dev->ops->read) {
+			ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+							   ret_data,
+							   count,
+							   vgpu_emul_space_mmio,
+							   pos);
+		}
+
+		if (ret > 0 ) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+			}
+		}
+		kfree(ret_data);
+	}
+
+bar_rw_exit:
+	return ret;
+}
+
+
+static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_vgpu_device *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+			return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
+
+
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
+
+		case VFIO_PCI_ROM_REGION_INDEX:
+		case VFIO_PCI_VGA_REGION_INDEX:
+			break;
+	}
+
+	return -EINVAL;
+}
+
+
+static ssize_t vgpu_dev_read(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, buf, count, ppos, false);
+
+	return ret;
+}
+
+static ssize_t vgpu_dev_write(void *device_data, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, (char *)buf, count, ppos, true);
+
+	return ret;
+}
+
+/* Just create an invalid mapping without providing a fault handler */
+
+static int vgpu_dev_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	printk(KERN_INFO "%s ", __FUNCTION__);
+	return 0;
+}
+
+static const struct vfio_device_ops vgpu_vfio_dev_ops = {
+	.name		= "vfio-vgpu-grp",
+	.open		= vgpu_dev_open,
+	.release	= vgpu_dev_close,
+	.ioctl		= vgpu_dev_unlocked_ioctl,
+	.read		= vgpu_dev_read,
+	.write		= vgpu_dev_write,
+	.mmap		= vgpu_dev_mmap,
+};
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group *group)
+{
+	struct vfio_vgpu_device *vdev;
+	int ret = 0;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		return -ENOMEM;
+	}
+
+	vdev->group = group;
+	vdev->vgpu_dev = vgpu_dev;
+
+	ret = vfio_add_group_dev(vgpu_dev->dev, &vgpu_vfio_dev_ops, vdev);
+	if (ret)
+		kfree(vdev);
+
+	return ret;
+}
+
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev)
+{
+	struct vfio_vgpu_device *vdev;
+
+	vdev = vfio_del_group_dev(vgpu_dev->dev);
+	if (!vdev)
+		return -1;
+
+	kfree(vdev);
+	return 0;
+}
+
diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
new file mode 100644
index 000000000000..a2861c3f42e5
--- /dev/null
+++ b/include/linux/vgpu.h
@@ -0,0 +1,157 @@ 
+/*
+ * VGPU definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_H
+#define VGPU_H
+
+// Common Data structures
+
+struct pci_bar_info {
+	uint64_t start;
+	uint64_t end;
+	int flags;
+};
+
+enum vgpu_emul_space_e {
+	vgpu_emul_space_config = 0, /*!< PCI configuration space */
+	vgpu_emul_space_io = 1,     /*!< I/O register space */
+	vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
+};
+
+struct gpu_device;
+
+/*
+ * VGPU device
+ */
+struct vgpu_device {
+	struct kref		kref;
+	struct device		*dev;
+	int minor;
+	struct gpu_device	*gpu_dev;
+	struct iommu_group	*group;
+#define DEVICE_NAME_LEN		(64)
+	char			dev_name[DEVICE_NAME_LEN];
+	uuid_le			vm_uuid;
+	uint32_t		vgpu_instance;
+	uint32_t		vgpu_id;
+	atomic_t		usage_count;
+	char			config_space[0x100];          // 4KB PCI cfg space
+	struct pci_bar_info	bar[VFIO_PCI_NUM_REGIONS];
+	struct device_attribute	*dev_attr_vgpu_status;
+	int			vgpu_device_status;
+
+	struct list_head	list;
+};
+
+
+/**
+ * struct gpu_device_ops - Structure to be registered for each physical GPU to
+ * register the device to vgpu module.
+ *
+ * @owner:			The module owner.
+ * @vgpu_supported_config:	Called to get information about supported vgpu types.
+ *				@dev : pci device structure of physical GPU.
+ *				@config: should return string listing supported config
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_create:		Called to allocate basic resouces in graphics
+ *				driver for a particular vgpu.
+ *				@dev: physical pci device structure on which vgpu
+ *				      should be created
+ *				@vm_uuid: VM's uuid for which VM it is intended to
+ *				@instance: vgpu instance in that VM
+ *				@vgpu_id: This represents the type of vgpu to be
+ *					  created
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_destroy:		Called to free resources in graphics driver for
+ *				a vgpu instance of that VM.
+ *				@dev: physical pci device structure to which
+ *				this vgpu points to.
+ *				@vm_uuid: VM's uuid for which the vgpu belongs to.
+ *				@instance: vgpu instance in that VM
+ *				Returns integer: success (0) or error (< 0)
+ *				If VM is running and vgpu_destroy is called that
+ *				means the vGPU is being hotunpluged. Return error
+ *				if VM is running and graphics driver doesn't
+ *				support vgpu hotplug.
+ * @vgpu_start:			Called to do initiate vGPU initialization
+ *				process in graphics driver when VM boots before
+ *				qemu starts.
+ *				@vm_uuid: VM's UUID which is booting.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_shutdown:		Called to teardown vGPU related resources for
+ *				the VM
+ *				@vm_uuid: VM's UUID which is shutting down .
+ *				Returns integer: success (0) or error (< 0)
+ * @read:			Read emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: read buffer
+ *				@count: number bytes to read
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				Retuns number on bytes read on success or error.
+ * @write:			Write emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: write buffer
+ *				@count: number bytes to be written
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				Retuns number on bytes written on success or error.
+ * @vgpu_set_irqs:		Called to send about interrupts configuration
+ *				information that qemu set.
+ *				@vdev: vgpu device structure
+ *				@flags, index, start, count and *data : same as
+ *				that of struct vfio_irq_set of
+ *				VFIO_DEVICE_SET_IRQS API.
+ *
+ * Physical GPU that support vGPU should be register with vgpu module with
+ * gpu_device_ops structure.
+ */
+
+struct gpu_device_ops {
+	struct module   *owner;
+	int	(*vgpu_supported_config)(struct pci_dev *dev, char *config);
+	int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
+			       uint32_t instance, uint32_t vgpu_id);
+	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
+			        uint32_t instance);
+	int     (*vgpu_start)(uuid_le vm_uuid);
+	int     (*vgpu_shutdown)(uuid_le vm_uuid);
+	ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space,loff_t pos);
+	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
+				 unsigned index, unsigned start, unsigned count,
+				 void *data);
+
+};
+
+/*
+ * Physical GPU
+ */
+struct gpu_device {
+	struct pci_dev                  *dev;
+	const struct gpu_device_ops     *ops;
+	struct list_head                gpu_next;
+};
+
+extern int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops);
+extern void vgpu_unregister_device(struct pci_dev *dev);
+
+extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr, uint32_t len, uint32_t flags);
+extern int vgpu_dma_do_translate(dma_addr_t * gfn_buffer, uint32_t count);
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
+
+#endif /* VGPU_H */
+
-- 
1.8.1.4

From 380156ade7053664bdb318af0659708357f40050 Mon Sep 17 00:00:00 2001
From: Neo Jia <cjia@nvidia.com>
Date: Sun, 24 Jan 2016 11:24:13 -0800
Subject: [PATCH] Add VGPU VFIO driver class support in QEMU

This is just a quick POV change to allow us experiment the VGPU VFIO support,
the next step is to merge this into the current vfio/pci.c which currently has a
physical backing devices.

Within current POC implementation, we have copy & paste lots function directly
from the vfio/pci.c code, we should merge them together later.

    - Basic MMIO and PCI config apccess are supported

    - MMAP'ed GPU bar is supported

    - INTx and MSI using eventfd is supported, don't think we should support
      interrupt when vector->kvm_interrupt is not enabled.

Change-Id: I99c34ac44524cd4d7d2abbcc4d43634297b96e80

Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/vfio/Makefile.objs |   1 +
 hw/vfio/vgpu.c        | 991 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci/pci.h  |   3 +
 3 files changed, 995 insertions(+)
 create mode 100644 hw/vfio/vgpu.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index d324863..17f2ef1 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,7 @@ 
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
 obj-$(CONFIG_PCI) += pci.o pci-quirks.o
+obj-$(CONFIG_PCI) += vgpu.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 endif
diff --git a/hw/vfio/vgpu.c b/hw/vfio/vgpu.c
new file mode 100644
index 0000000..56ebce0
--- /dev/null
+++ b/hw/vfio/vgpu.c
@@ -0,0 +1,991 @@ 
+/*
+ * vGPU VFIO device
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <dirent.h>
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "config.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/pci.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+#include "qemu/queue.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/sysemu.h"
+#include "trace.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/pci.h"
+#include "hw/vfio/vfio-common.h"
+#include "qmp-commands.h"
+
+#define TYPE_VFIO_VGPU "vfio-vgpu"
+
+typedef struct VFIOvGPUDevice {
+    PCIDevice pdev;
+    VFIODevice vbasedev;
+    VFIOINTx intx;
+    VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
+    uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
+    unsigned int config_size;
+    char  *vgpu_type;
+    char *vm_uuid;
+    off_t config_offset; /* Offset of config space region within device fd */
+    int msi_cap_size;
+    EventNotifier req_notifier;
+    int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
+    int interrupt; /* Current interrupt type */
+    VFIOMSIVector *msi_vectors;
+} VFIOvGPUDevice;
+
+/*
+ * Local functions
+ */
+
+// function prototypes
+static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev);
+static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len);
+
+
+// INTx functions
+
+static void vfio_vgpu_intx_interrupt(void *opaque)
+{
+    VFIOvGPUDevice *vdev = opaque;
+
+    if (!event_notifier_test_and_clear(&vdev->intx.interrupt)) {
+        return;
+    }
+
+    vdev->intx.pending = true;
+    pci_irq_assert(&vdev->pdev);
+//    vfio_mmap_set_enabled(vdev, false);
+
+}
+
+static void vfio_vgpu_intx_eoi(VFIODevice *vbasedev)
+{
+    VFIOvGPUDevice *vdev = container_of(vbasedev, VFIOvGPUDevice, vbasedev);
+
+    if (!vdev->intx.pending) {
+        return;
+    }
+
+    trace_vfio_intx_eoi(vbasedev->name);
+
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+    vfio_unmask_single_irqindex(vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+}
+
+static void vfio_vgpu_intx_enable_kvm(VFIOvGPUDevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_RESAMPLE,
+    };
+    struct vfio_irq_set *irq_set;
+    int ret, argsz;
+    int32_t *pfd;
+
+    if (!kvm_irqfds_enabled() ||
+        vdev->intx.route.mode != PCI_INTX_ENABLED ||
+        !kvm_resamplefds_enabled()) {
+        return;
+    }
+
+    /* Get to a known interrupt state */
+    qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
+    vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+
+    /* Get an eventfd for resample/unmask */
+    if (event_notifier_init(&vdev->intx.unmask, 0)) {
+        error_report("vfio: Error: event_notifier_init failed eoi");
+        goto fail;
+    }
+
+    /* KVM triggers it, VFIO listens for it */
+    irqfd.resamplefd = event_notifier_get_fd(&vdev->intx.unmask);
+
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to setup resample irqfd: %m");
+        goto fail_irqfd;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_UNMASK;
+    irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = irqfd.resamplefd;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to setup INTx unmask fd: %m");
+        goto fail_vfio;
+    }
+
+    /* Let'em rip */
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+
+    vdev->intx.kvm_accel = true;
+
+    trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
+
+    return;
+
+fail_vfio:
+    irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
+    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
+fail_irqfd:
+    event_notifier_cleanup(&vdev->intx.unmask);
+fail:
+    qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev);
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+#endif
+}
+
+static void vfio_vgpu_intx_disable_kvm(VFIOvGPUDevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_DEASSIGN,
+    };
+
+    if (!vdev->intx.kvm_accel) {
+        return;
+    }
+
+    /*
+     * Get to a known state, hardware masked, QEMU ready to accept new
+     * interrupts, QEMU IRQ de-asserted.
+     */
+    vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+
+    /* Tell KVM to stop listening for an INTx irqfd */
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to disable INTx irqfd: %m");
+    }
+
+    /* We only need to close the eventfd for VFIO to cleanup the kernel side */
+    event_notifier_cleanup(&vdev->intx.unmask);
+
+    /* QEMU starts listening for interrupt events. */
+    qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev);
+
+    vdev->intx.kvm_accel = false;
+
+    /* If we've missed an event, let it re-fire through QEMU */
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+
+    trace_vfio_intx_disable_kvm(vdev->vbasedev.name);
+#endif
+}
+
+static void vfio_vgpu_intx_update(PCIDevice *pdev)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    PCIINTxRoute route;
+
+    if (vdev->interrupt != VFIO_INT_INTx) {
+        return;
+    }
+
+    route = pci_device_route_intx_to_irq(&vdev->pdev, vdev->intx.pin);
+
+    if (!pci_intx_route_changed(&vdev->intx.route, &route)) {
+        return; /* Nothing changed */
+    }
+
+    trace_vfio_intx_update(vdev->vbasedev.name,
+                           vdev->intx.route.irq, route.irq);
+
+    vfio_vgpu_intx_disable_kvm(vdev);
+
+    vdev->intx.route = route;
+
+    if (route.mode != PCI_INTX_ENABLED) {
+        return;
+    }
+
+    vfio_vgpu_intx_enable_kvm(vdev);
+
+    /* Re-enable the interrupt in cased we missed an EOI */
+    vfio_vgpu_intx_eoi(&vdev->vbasedev);
+}
+
+static int vfio_vgpu_intx_enable(VFIOvGPUDevice *vdev)
+{
+    uint8_t pin = vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
+    int ret, argsz;
+    struct vfio_irq_set *irq_set;
+    int32_t *pfd;
+
+    if (!pin) {
+        return 0;
+    }
+
+    vfio_vgpu_disable_interrupts(vdev);
+
+    vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
+    pci_config_set_interrupt_pin(vdev->pdev.config, pin);
+
+#ifdef CONFIG_KVM
+    /*
+     * Only conditional to avoid generating error messages on platforms
+     * where we won't actually use the result anyway.
+     */
+    if (kvm_irqfds_enabled() && kvm_resamplefds_enabled()) {
+        vdev->intx.route = pci_device_route_intx_to_irq(&vdev->pdev,
+                                                        vdev->intx.pin);
+    }
+#endif
+
+    ret = event_notifier_init(&vdev->intx.interrupt, 0);
+    if (ret) {
+        error_report("vfio: Error: event_notifier_init failed");
+        return ret;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(*pfd, vfio_vgpu_intx_interrupt, NULL, vdev);
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to setup INTx fd: %m");
+        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
+        event_notifier_cleanup(&vdev->intx.interrupt);
+        return -errno;
+    }
+
+    vfio_vgpu_intx_enable_kvm(vdev);
+
+    vdev->interrupt = VFIO_INT_INTx;
+
+    trace_vfio_intx_enable(vdev->vbasedev.name);
+
+    return 0;
+}
+
+static void vfio_vgpu_intx_disable(VFIOvGPUDevice *vdev)
+{
+    int fd;
+
+    vfio_vgpu_intx_disable_kvm(vdev);
+    vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+//    vfio_mmap_set_enabled(vdev, true);
+
+    fd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(fd, NULL, NULL, vdev);
+    event_notifier_cleanup(&vdev->intx.interrupt);
+
+    vdev->interrupt = VFIO_INT_NONE;
+
+    trace_vfio_intx_disable(vdev->vbasedev.name);
+}
+
+//MSI functions
+static void vfio_vgpu_remove_kvm_msi_virq(VFIOMSIVector *vector)
+{
+    kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
+                                          vector->virq);
+    kvm_irqchip_release_virq(kvm_state, vector->virq);
+    vector->virq = -1;
+    event_notifier_cleanup(&vector->kvm_interrupt);
+}
+
+static void vfio_vgpu_msi_disable_common(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        if (vdev->msi_vectors[i].use) {
+            if (vector->virq >= 0) {
+                vfio_vgpu_remove_kvm_msi_virq(vector);
+            }
+            qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                                NULL, NULL, NULL);
+            event_notifier_cleanup(&vector->interrupt);
+        }
+    }
+
+    g_free(vdev->msi_vectors);
+    vdev->msi_vectors = NULL;
+    vdev->nr_vectors = 0;
+    vdev->interrupt = VFIO_INT_NONE;
+
+   vfio_vgpu_intx_enable(vdev);
+}
+
+static void vfio_vgpu_msi_disable(VFIOvGPUDevice *vdev)
+{
+    vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSI_IRQ_INDEX);
+    vfio_vgpu_msi_disable_common(vdev);
+}
+
+static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev)
+{
+
+    if (vdev->interrupt == VFIO_INT_MSI) {
+        vfio_vgpu_msi_disable(vdev);
+    }
+
+    if (vdev->interrupt == VFIO_INT_INTx) {
+        vfio_vgpu_intx_disable(vdev);
+    }
+}
+
+
+static void vfio_vgpu_msi_interrupt(void *opaque)
+{
+    VFIOMSIVector *vector = opaque;
+    VFIOvGPUDevice *vdev = (VFIOvGPUDevice *)vector->vdev;
+    MSIMessage (*get_msg)(PCIDevice *dev, unsigned vector);
+    void (*notify)(PCIDevice *dev, unsigned vector);
+    MSIMessage msg;
+    int nr = vector - vdev->msi_vectors;
+
+    if (!event_notifier_test_and_clear(&vector->interrupt)) {
+        return;
+    }
+
+    if (vdev->interrupt == VFIO_INT_MSIX) {
+        get_msg = msix_get_message;
+        notify = msix_notify;
+    } else if (vdev->interrupt == VFIO_INT_MSI) {
+        get_msg = msi_get_message;
+        notify = msi_notify;
+    } else {
+        abort();
+    }
+
+    msg = get_msg(&vdev->pdev, nr);
+    trace_vfio_msi_interrupt(vdev->vbasedev.name, nr, msg.address, msg.data);
+    notify(&vdev->pdev, nr);
+}
+
+static int vfio_vgpu_enable_vectors(VFIOvGPUDevice *vdev, bool msix)
+{
+    struct vfio_irq_set *irq_set;
+    int ret = 0, i, argsz;
+    int32_t *fds;
+
+    argsz = sizeof(*irq_set) + (vdev->nr_vectors * sizeof(*fds));
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = msix ? VFIO_PCI_MSIX_IRQ_INDEX : VFIO_PCI_MSI_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = vdev->nr_vectors;
+    fds = (int32_t *)&irq_set->data;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        int fd = -1;
+
+        /*
+         * MSI vs MSI-X - The guest has direct access to MSI mask and pending
+         * bits, therefore we always use the KVM signaling path when setup.
+         * MSI-X mask and pending bits are emulated, so we want to use the
+         * KVM signaling path only when configured and unmasked.
+         */
+        if (vdev->msi_vectors[i].use) {
+            if (vdev->msi_vectors[i].virq < 0 ||
+                (msix && msix_is_masked(&vdev->pdev, i))) {
+                fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
+            } else {
+                fd = event_notifier_get_fd(&vdev->msi_vectors[i].kvm_interrupt);
+            }
+        }
+
+        fds[i] = fd;
+    }
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+
+    g_free(irq_set);
+
+    return ret;
+}
+
+static void vfio_vgpu_add_kvm_msi_virq(VFIOvGPUDevice *vdev, VFIOMSIVector *vector,
+                                  MSIMessage *msg, bool msix)
+{
+    int virq;
+
+    if (!msg) {
+        return;
+    }
+
+    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+        return;
+    }
+
+    virq = kvm_irqchip_add_msi_route(kvm_state, *msg, &vdev->pdev);
+    if (virq < 0) {
+        event_notifier_cleanup(&vector->kvm_interrupt);
+        return;
+    }
+
+    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
+                                       NULL, virq) < 0) {
+        kvm_irqchip_release_virq(kvm_state, virq);
+        event_notifier_cleanup(&vector->kvm_interrupt);
+        return;
+    }
+
+    vector->virq = virq;
+}
+
+static void vfio_vgpu_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
+                                     PCIDevice *pdev)
+{
+    kvm_irqchip_update_msi_route(kvm_state, vector->virq, msg, pdev);
+}
+
+
+static void vfio_vgpu_msi_enable(VFIOvGPUDevice *vdev)
+{
+   int ret, i;
+
+    vfio_vgpu_disable_interrupts(vdev);
+
+    vdev->nr_vectors = msi_nr_vectors_allocated(&vdev->pdev);
+retry:
+    vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->nr_vectors);
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        MSIMessage msg = msi_get_message(&vdev->pdev, i);
+
+        vector->vdev = (VFIOPCIDevice *)vdev;
+        vector->virq = -1;
+        vector->use = true;
+
+        if (event_notifier_init(&vector->interrupt, 0)) {
+            error_report("vfio: Error: event_notifier_init failed");
+        }
+        qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                            vfio_vgpu_msi_interrupt, NULL, vector);
+
+        /*
+         * Attempt to enable route through KVM irqchip,
+         * default to userspace handling if unavailable.
+         */
+        vfio_vgpu_add_kvm_msi_virq(vdev, vector, &msg, false);
+    }
+
+    /* Set interrupt type prior to possible interrupts */
+    vdev->interrupt = VFIO_INT_MSI;
+
+    ret = vfio_vgpu_enable_vectors(vdev, false);
+    if (ret) {
+        if (ret < 0) {
+            error_report("vfio: Error: Failed to setup MSI fds: %m");
+        } else if (ret != vdev->nr_vectors) {
+            error_report("vfio: Error: Failed to enable %d "
+                         "MSI vectors, retry with %d", vdev->nr_vectors, ret);
+        }
+
+        for (i = 0; i < vdev->nr_vectors; i++) {
+            VFIOMSIVector *vector = &vdev->msi_vectors[i];
+            if (vector->virq >= 0) {
+                vfio_vgpu_remove_kvm_msi_virq(vector);
+            }
+            qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                                NULL, NULL, NULL);
+            event_notifier_cleanup(&vector->interrupt);
+        }
+
+        g_free(vdev->msi_vectors);
+
+        if (ret > 0 && ret != vdev->nr_vectors) {
+            vdev->nr_vectors = ret;
+            goto retry;
+        }
+        vdev->nr_vectors = 0;
+
+        /*
+         * Failing to setup MSI doesn't really fall within any specification.
+         * Let's try leaving interrupts disabled and hope the guest figures
+         * out to fall back to INTx for this device.
+         */
+        error_report("vfio: Error: Failed to enable MSI");
+        vdev->interrupt = VFIO_INT_NONE;
+
+        return;
+    }
+}
+
+static void vfio_vgpu_update_msi(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        MSIMessage msg;
+
+        if (!vector->use || vector->virq < 0) {
+            continue;
+        }
+
+        msg = msi_get_message(&vdev->pdev, i);
+        vfio_vgpu_update_kvm_msi_virq(vector, msg, &vdev->pdev);
+    }
+}
+
+static int vfio_vgpu_msi_setup(VFIOvGPUDevice *vdev, int pos)
+{
+    uint16_t ctrl;
+    bool msi_64bit, msi_maskbit;
+    int ret, entries;
+
+    if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
+              vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
+        return -errno;
+    }
+    ctrl = le16_to_cpu(ctrl);
+
+    msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT);
+    msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
+    entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
+
+    ret = msi_init(&vdev->pdev, pos, entries, msi_64bit, msi_maskbit);
+    if (ret < 0) {
+        if (ret == -ENOTSUP) {
+            return 0;
+        }
+        error_report("vfio: msi_init failed");
+        return ret;
+    }
+    vdev->msi_cap_size = 0xa + (msi_maskbit ? 0xa : 0) + (msi_64bit ? 0x4 : 0);
+
+    return 0;
+}
+
+
+static int vfio_vgpu_msi_init(VFIOvGPUDevice *vdev)
+{
+    uint8_t pos;
+    int ret;
+
+    pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSI);
+    if (!pos) {
+        return 0;
+    }
+
+    ret = vfio_vgpu_msi_setup(vdev, pos);
+    if (ret < 0) {
+        error_report("vgpu: Error setting MSI@0x%x: %d", pos, ret);
+        return ret;
+    }
+
+    return 0;
+}
+
+/*
+ * VGPU device class functions
+ */
+
+static void vfio_vgpu_reset(DeviceState *dev)
+{
+
+
+}
+
+static void vfio_vgpu_eoi(VFIODevice *vbasedev)
+{
+    return;
+}
+
+static int vfio_vgpu_hot_reset_multi(VFIODevice *vbasedev)
+{
+    // Nothing to be reset 
+    return 0;
+}
+
+static void vfio_vgpu_compute_needs_reset(VFIODevice *vbasedev)
+{
+    vbasedev->needs_reset = false;
+}
+
+static VFIODeviceOps vfio_vgpu_ops = {
+    .vfio_compute_needs_reset = vfio_vgpu_compute_needs_reset,
+    .vfio_hot_reset_multi = vfio_vgpu_hot_reset_multi,
+    .vfio_eoi = vfio_vgpu_eoi,
+};
+
+static int vfio_vgpu_populate_device(VFIOvGPUDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
+    int i, ret = -1;
+
+    for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
+        reg_info.index = i;
+
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+        if (ret) {
+            error_report("vfio: Error getting region %d info: %m", i);
+            return ret;
+        }
+
+        trace_vfio_populate_device_region(vbasedev->name, i,
+                                          (unsigned long)reg_info.size,
+                                          (unsigned long)reg_info.offset,
+                                          (unsigned long)reg_info.flags);
+
+        vdev->bars[i].region.vbasedev = vbasedev;
+        vdev->bars[i].region.flags = reg_info.flags;
+        vdev->bars[i].region.size = reg_info.size;
+        vdev->bars[i].region.fd_offset = reg_info.offset;
+        vdev->bars[i].region.nr = i;
+        QLIST_INIT(&vdev->bars[i].quirks);
+    }
+
+    reg_info.index = VFIO_PCI_CONFIG_REGION_INDEX;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+    if (ret) {
+        error_report("vfio: Error getting config info: %m");
+        return ret;
+    }
+
+    vdev->config_size = reg_info.size;
+    if (vdev->config_size == PCI_CONFIG_SPACE_SIZE) {
+        vdev->pdev.cap_present &= ~QEMU_PCI_CAP_EXPRESS;
+    }
+    vdev->config_offset = reg_info.offset;
+
+    return 0;
+}
+
+static void vfio_vgpu_create_virtual_bar(VFIOvGPUDevice *vdev, int nr)
+{
+    VFIOBAR *bar = &vdev->bars[nr];
+    uint64_t size = bar->region.size;
+    char name[64];
+    uint32_t pci_bar;
+    uint8_t type;
+    int ret;
+
+    /* Skip both unimplemented BARs and the upper half of 64bit BARS. */
+    if (!size) 
+        return;
+
+    /* Determine what type of BAR this is for registration */
+    ret = pread(vdev->vbasedev.fd, &pci_bar, sizeof(pci_bar),
+                vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr));
+    if (ret != sizeof(pci_bar)) {
+        error_report("vfio: Failed to read BAR %d (%m)", nr);
+        return;
+    }
+
+    pci_bar = le32_to_cpu(pci_bar);
+    bar->ioport = (pci_bar & PCI_BASE_ADDRESS_SPACE_IO);
+    bar->mem64 = bar->ioport ? 0 : (pci_bar & PCI_BASE_ADDRESS_MEM_TYPE_64);
+    type = pci_bar & (bar->ioport ? ~PCI_BASE_ADDRESS_IO_MASK :
+                                    ~PCI_BASE_ADDRESS_MEM_MASK);
+
+    /* A "slow" read/write mapping underlies all BARs */
+    memory_region_init_io(&bar->region.mem, OBJECT(vdev), &vfio_region_ops,
+                          bar, name, size);
+    pci_register_bar(&vdev->pdev, nr, type, &bar->region.mem);
+
+    // Create an invalid BAR1 mapping
+    if (bar->region.flags & VFIO_REGION_INFO_FLAG_MMAP) {
+        strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
+        vfio_mmap_region(OBJECT(vdev), &bar->region, &bar->region.mem,
+                         &bar->region.mmap_mem, &bar->region.mmap,
+                         size, 0, name);
+    }
+}
+
+static void vfio_vgpu_create_virtual_bars(VFIOvGPUDevice *vdev)
+{
+
+    int i = 0;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        vfio_vgpu_create_virtual_bar(vdev, i);
+    }
+}
+
+static int vfio_vgpu_initfn(PCIDevice *pdev)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    VFIOGroup *group;
+    ssize_t len;
+    int groupid;
+    struct stat st;
+    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
+    int ret;
+    UuidInfo *uuid_info;
+
+    uuid_info = qmp_query_uuid(NULL);
+    if (strcmp(uuid_info->UUID, UUID_NONE) == 0) {
+        return -EINVAL;
+    } else {
+        vdev->vm_uuid = uuid_info->UUID;
+    }
+
+
+    snprintf(path, sizeof(path), 
+             "/sys/devices/virtual/vgpu/%s-0/", vdev->vm_uuid);
+
+    if (stat(path, &st) < 0) {
+        error_report("vfio-vgpu: error: no such vgpu device: %s", path);
+        return -errno;
+    } 
+
+    vdev->vbasedev.ops = &vfio_vgpu_ops;
+
+    vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
+    vdev->vbasedev.name = g_strdup_printf("%s-0", vdev->vm_uuid);
+
+    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
+
+    len = readlink(path, iommu_group_path, sizeof(path));
+    if (len <= 0 || len >= sizeof(path)) {
+        error_report("vfio-vgpu: error no iommu_group for device");
+        return len < 0 ? -errno : -ENAMETOOLONG;
+    }
+
+    iommu_group_path[len] = 0;
+    group_name = basename(iommu_group_path);
+
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_report("vfio-vgpu: error reading %s: %m", path);
+        return -errno;
+    }
+
+    // TODO: This will only work if we *only* have VFIO_VGPU_IOMMU enabled
+
+    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev));
+    if (!group) {
+        error_report("vfio: failed to get group %d", groupid);
+        return -ENOENT;
+    }
+
+    snprintf(path, sizeof(path), "%s-0", vdev->vm_uuid);
+
+    ret = vfio_get_device(group, path, &vdev->vbasedev);
+    if (ret) {
+        error_report("vfio-vgpu; failed to get device %s", vdev->vgpu_type);
+        vfio_put_group(group);
+        return ret;
+    }
+
+    ret = vfio_vgpu_populate_device(vdev);
+    if (ret) {
+        return ret;
+    }
+
+    /* Get a copy of config space */
+    ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
+                MIN(pci_config_size(&vdev->pdev), vdev->config_size),
+                vdev->config_offset);
+    if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
+        ret = ret < 0 ? -errno : -EFAULT;
+        error_report("vfio: Failed to read device config space");
+        return ret;
+    }
+
+    vfio_vgpu_create_virtual_bars(vdev);
+
+    ret = vfio_vgpu_msi_init(vdev);
+    if (ret < 0) {
+        error_report("%s: Error setting MSI %d", __FUNCTION__, ret);
+        return ret;
+    }
+
+    if (vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
+        pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_vgpu_intx_update);
+        ret = vfio_vgpu_intx_enable(vdev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+
+static void vfio_vgpu_exitfn(PCIDevice *pdev)
+{
+
+
+}
+
+static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    ssize_t ret;
+    uint32_t val = 0;
+
+    ret = pread(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr);
+
+    if (ret != len) {
+        error_report("%s: failed at offset:0x%0x %m", __func__, addr);
+        return 0xFFFFFFFF;
+    }
+
+    // memcpy(&vdev->emulated_config_bits + addr, &val, len);
+    return val;
+}
+
+static void vfio_vgpu_write_config(PCIDevice *pdev, uint32_t addr,
+                                  uint32_t val, int len)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    ssize_t ret;
+
+    ret = pwrite(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr);
+
+    if (ret != len) {
+        error_report("%s: failed at offset:0x%0x, val:0x%0x %m",
+                     __func__, addr, val);
+        return;
+    }
+
+    if (pdev->cap_present & QEMU_PCI_CAP_MSI &&
+        ranges_overlap(addr, len, pdev->msi_cap, vdev->msi_cap_size)) {
+        int is_enabled, was_enabled = msi_enabled(pdev);
+
+        pci_default_write_config(pdev, addr, val, len);
+
+        is_enabled = msi_enabled(pdev);
+
+        if (!was_enabled) {
+            if (is_enabled) {
+                vfio_vgpu_msi_enable(vdev);
+            }
+        } else {
+            if (!is_enabled) {
+                vfio_vgpu_msi_disable(vdev);
+            } else {
+                vfio_vgpu_update_msi(vdev);
+            }
+        }
+    }
+    else {
+        /* Write everything to QEMU to keep emulated bits correct */
+        pci_default_write_config(pdev, addr, val, len);
+    }
+
+    pci_default_write_config(pdev, addr, val, len);
+
+    return;
+}
+
+static const VMStateDescription vfio_vgpu_vmstate = {
+    .name = TYPE_VFIO_VGPU,
+    .unmigratable = 1,
+};
+
+//
+// We don't actually need the vfio_vgpu_properties
+// as we can just simply rely on VM UUID to find
+// the IOMMU group for this VM
+//
+
+
+static Property vfio_vgpu_properties[] = {
+
+    DEFINE_PROP_STRING("vgpu", VFIOvGPUDevice, vgpu_type),
+    DEFINE_PROP_END_OF_LIST()
+};
+
+#if 0
+
+static void vfio_vgpu_instance_init(Object *obj)
+{
+
+}
+
+static void vfio_vgpu_instance_finalize(Object *obj)
+{
+
+
+}
+
+#endif
+
+static void vfio_vgpu_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+    // vgpudc->parent_realize = dc->realize;
+    // dc->realize = calxeda_xgmac_realize;
+    dc->desc = "VFIO-based vGPU";
+    dc->vmsd = &vfio_vgpu_vmstate;
+    dc->reset = vfio_vgpu_reset;
+    // dc->cannot_instantiate_with_device_add_yet = true; 
+    dc->props = vfio_vgpu_properties;
+    set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories);
+    pdc->init = vfio_vgpu_initfn;
+    pdc->exit = vfio_vgpu_exitfn;
+    pdc->config_read = vfio_vgpu_read_config;
+    pdc->config_write = vfio_vgpu_write_config;
+    pdc->is_express = 0; /* For now, we are not */
+
+    pdc->vendor_id = PCI_DEVICE_ID_NVIDIA;
+    // pdc->device_id = 0x11B0;
+    pdc->class_id = PCI_CLASS_DISPLAY_VGA;
+}
+
+static const TypeInfo vfio_vgpu_dev_info = {
+    .name = TYPE_VFIO_VGPU,
+    .parent = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(VFIOvGPUDevice),
+    .class_init = vfio_vgpu_class_init,
+};
+
+static void register_vgpu_dev_type(void)
+{
+    type_register_static(&vfio_vgpu_dev_info);
+}
+
+type_init(register_vgpu_dev_type)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 379b6e1..9af5e17 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -64,6 +64,9 @@ 
 #define PCI_DEVICE_ID_VMWARE_IDE         0x1729
 #define PCI_DEVICE_ID_VMWARE_VMXNET3     0x07B0
 
+/* NVIDIA (0x10de) */
+#define PCI_DEVICE_ID_NVIDIA             0x10de
+
 /* Intel (0x8086) */
 #define PCI_DEVICE_ID_INTEL_82551IT      0x1209
 #define PCI_DEVICE_ID_INTEL_82557        0x1229