From patchwork Tue Jan 26 09:48:19 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Neo Jia <cjia@nvidia.com>
X-Patchwork-Id: 8119961
Date: Tue, 26 Jan 2016 01:48:19 -0800
From: Neo Jia <cjia@nvidia.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Message-ID: <20160126094819.GA14079@nvidia.com>
References: <569C5071.6080004@intel.com>
	<1453092476.32741.67.camel@redhat.com>
	<569CA8AD.6070200@intel.com>
	<1453143919.32741.169.camel@redhat.com>
	<569F4C86.2070501@intel.com>
	<AADFC41AFE54684AB9EE6CBC0274A5D15F786B4B@SHSMSX101.ccr.corp.intel.com>
	<56A6083E.10703@intel.com> <1453757426.32741.614.camel@redhat.com>
	<AADFC41AFE54684AB9EE6CBC0274A5D15F78D2A3@SHSMSX101.ccr.corp.intel.com>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: 
 <AADFC41AFE54684AB9EE6CBC0274A5D15F78D2A3@SHSMSX101.ccr.corp.intel.com>
X-NVConfidentiality: public
User-Agent: Mutt/1.5.24 (2015-08-30)
X-Originating-IP: [172.16.168.254]
X-ClientProxiedBy: HQMAIL102.nvidia.com (172.18.146.10) To
	HQMAIL107.nvidia.com (172.20.187.13)
X-detected-operating-system: by eggs.gnu.org: Windows 7 or 8
X-Received-From: 216.228.121.143
X-Mailman-Approved-At: Tue, 26 Jan 2016 05:34:04 -0500
Cc: "Ruan, Shuai" <shuai.ruan@intel.com>, "Song, Jike" <jike.song@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"igvt-g@lists.01.org" <igvt-g@ml01.01.org>,
	qemu-devel <qemu-devel@nongnu.org>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Alex Williamson <alex.williamson@redhat.com>, "Lv,
	Zhiyuan" <zhiyuan.lv@intel.com>, Paolo Bonzini <pbonzini@redhat.com>,
	Gerd Hoffmann <kraxel@redhat.com>
Subject: Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3
	release of XenGT - a Mediated ...)
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>

On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, January 26, 2016 5:30 AM
> > 
> > [cc +Neo @Nvidia]
> > 
> > Hi Jike,
> > 
> > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > I would expect we can spell out next level tasks toward above
> > > > direction, upon which Alex can easily judge whether there are
> > > > some common VFIO framework changes that he can help :-)
> > >
> > > Hi Alex,
> > >
> > > Here is a draft task list after a short discussion w/ Kevin,
> > > would you please have a look?
> > >
> > > 	Bus Driver
> > >
> > > 		{ in i915/vgt/xxx.c }
> > >
> > > 		- define a subset of vfio_pci interfaces
> > > 		- selective pass-through (say aperture)
> > > 		- trap MMIO: interface w/ QEMU
> > 
> > What's included in the subset?  Certainly the bus reset ioctls really
> > don't apply, but you'll need to support the full device interface,
> > right?  That includes the region info ioctl and access through the vfio
> > device file descriptor as well as the interrupt info and setup ioctls.
> 
> That is the next level detail Jike will figure out and discuss soon.
> 
> yes, basic region info/access should be necessary. For interrupt, could
> you elaborate a bit what current interface is doing? If just about creating
> an eventfd for virtual interrupt injection, it applies to vgpu too.
> 
> > 
> > > 	IOMMU
> > >
> > > 		{ in a new vfio_xxx.c }
> > >
> > > 		- allocate: struct device & IOMMU group
> > 
> > It seems like the vgpu instance management would do this.
> > 
> > > 		- map/unmap functions for vgpu
> > > 		- rb-tree to maintain iova/hpa mappings
> > 
> > Yep, pretty much what type1 does now, but without mapping through the
> > IOMMU API.  Essentially just a database of the current userspace
> > mappings that can be accessed for page pinning and IOVA->HPA
> > translation.
> 
> The thought is to reuse iommu_type1.c, by abstracting several underlying
> operations and then put vgpu specific implementation in a vfio_vgpu.c (e.g.
> for map/unmap instead of using IOMMU API, an iova/hpa mapping is updated
> accordingly), etc.
> 
> This file will also connect between VFIO and vendor specific vgpu driver,
> e.g. exposing interfaces to allow the latter querying iova<->hpa and also 
> creating necessary VFIO structures like aforementioned device/IOMMUas...
> 
> > 
> > > 		- interacts with kvmgt.c
> > >
> > >
> > > 	vgpu instance management
> > >
> > > 		{ in i915 }
> > >
> > > 		- path, create/destroy
> > >
> > 
> > Yes, and since you're creating and destroying the vgpu here, this is
> > where I'd expect a struct device to be created and added to an IOMMU
> > group.  The lifecycle management should really include links between
> > the vGPU and physical GPU, which would be much, much easier to do with
> > struct devices create here rather than at the point where we start
> > doing vfio "stuff".
> 
> It's invoked here, but expecting the function exposed by vfio_vgpu.c. It's
> not good to touch vfio internal structures from another module (such as
> i915.ko)
> 
> > 
> > Nvidia has also been looking at this and has some ideas how we might
> > standardize on some of the interfaces and create a vgpu framework to
> > help share code between vendors and hopefully make a more consistent
> > userspace interface for libvirt as well.  I'll let Neo provide some
> > details.  Thanks,
> > 
> 
> Nice to know that. Neo, please share your thought here.

Hi Alex, Kevin and Jike,

Thanks for adding me to this technical discussion. It is a great opportunity
for us to design together something that can bring both the Intel and NVIDIA
vGPU solutions to the KVM platform.

Instead of jumping directly to the proposal that we have been working on
recently for NVIDIA vGPU on KVM, I think it is better to first offer a couple of
quick comments on the existing discussion in this thread, as fundamentally I
think we are solving the same problems: DMA, interrupts and MMIO.

Then we can look at what we have; hopefully we can reach some consensus soon.

> Yes, and since you're creating and destroying the vgpu here, this is
> where I'd expect a struct device to be created and added to an IOMMU
> group.  The lifecycle management should really include links between
> the vGPU and physical GPU, which would be much, much easier to do with
> struct devices create here rather than at the point where we start
> doing vfio "stuff".

In fact, to keep vfio-vgpu more generic, vgpu device creation and management
can be centralized and done in vfio-vgpu. That also includes adding the device
to the IOMMU group and the VFIO group.

The graphics driver can register with vfio-vgpu to receive management and
emulation callbacks.

We already have a struct vgpu_device in our proposal that keeps a pointer to the
physical device.

> - vfio_pci will inject an IRQ to guest only when physical IRQ
> generated; whereas vfio_vgpu may inject an IRQ for emulation
> purpose. Anyway they can share the same injection interface;

The eventfd used to inject the interrupt is known to vfio-vgpu; that fd should
be made available to the graphics driver so that it can inject interrupts
directly when the physical device triggers an interrupt.
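
To illustrate the eventfd mechanics only, here is a small userspace sketch. It
is not code from our patches: the irq_fd_* helper names are made up for this
example, and a kernel driver would signal the eventfd context rather than call
the userspace wrappers used here. It just shows that each "injection" bumps a
counter that the consumer later drains.

```c
#include <stdint.h>
#include <sys/eventfd.h>

/* Create a non-blocking eventfd standing in for the per-IRQ trigger fd. */
int irq_fd_create(void)
{
    return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/* "Inject" an interrupt: add 1 to the eventfd counter, waking any waiter. */
int irq_fd_inject(int fd)
{
    return eventfd_write(fd, 1);
}

/*
 * Consume pending signals. On success *pending holds the number of
 * injections accumulated since the last read and the counter is reset.
 * Returns -1 if nothing is pending (EAGAIN) or on a real error.
 */
int irq_fd_consume(int fd, uint64_t *pending)
{
    eventfd_t value;

    if (eventfd_read(fd, &value) < 0)
        return -1;
    *pending = value;
    return 0;
}
```

Two injections before one consume are reported as a count of 2, so the consumer
side has to tolerate coalesced signals.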

Here is the proposal we have, please review.

Please note that the patches we have put out here are mainly for POC purposes:
to verify our understanding, and to reduce confusion and speed up our design.
We are very happy to refine them into something that can eventually be used by
both parties and upstreamed.

Linux vGPU kernel design
==================================================================================

Here we are proposing a generic Linux kernel module based on the VFIO framework
which allows different GPU vendors to plug in and provide their GPU
virtualization solutions on KVM. The benefits of having such a generic kernel
module are:

1) Reuse of the QEMU VFIO driver, supporting the VFIO UAPI

2) A GPU HW agnostic management API for upper layer software such as libvirt

3) No duplicated VFIO kernel logic reimplemented by different GPU vendor drivers

0. High level overview
==================================================================================

 
  user space:
                                +-----------+  VFIO IOMMU IOCTLs
                      +---------| QEMU VFIO |-------------------------+
        VFIO IOCTLs   |         +-----------+                         |
                      |                                               | 
 ---------------------|-----------------------------------------------|---------
                      |                                               |
  kernel space:       |  +--->----------->---+  (callback)            V
                      |  |                   v                 +------V-----+
  +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
  |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
  | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+ 
  |          |   |          |     | (register)           ^         ||
  +----------+   +-------+--+     |    +-----------+     |         ||
                         V        +----| i915.ko   +-----+     +---VV-------+ 
                         |             +-----^-----+           | TYPE1      |
                         |  (callback)       |                 | IOMMU      |
                         +-->------------>---+                 +------------+
 access flow:

  Guest MMIO / PCI config access
  |
  -------------------------------------------------
  |
  +-----> KVM VM_EXITs  (kernel)
          |
  -------------------------------------------------
          |
          +-----> QEMU VFIO driver (user)
                  | 
  -------------------------------------------------
                  |
                  +---->  VGPU kernel driver (kernel)
                          |  
                          | 
                          +----> vendor driver callback


1. VGPU management interface
==================================================================================

This interface allows upper layer software (mostly libvirt) to query and
configure virtual GPU devices in a HW agnostic fashion. It also gives the
underlying GPU vendor the flexibility to support virtual device hotplug,
multiple virtual devices per VM, multiple virtual devices from different
physical devices, etc.

1.1 Under per-physical device sysfs:
----------------------------------------------------------------------------------

vgpu_supported_types - RO, lists the currently supported virtual GPU types and
their VGPU_IDs. VGPU_ID - a vGPU type identifier returned from reads of
"vgpu_supported_types".

vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, creates a virtual
GPU device on a target physical GPU. idx: virtual device index inside a VM

vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroys a virtual GPU device on
a target physical GPU
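
As a rough illustration of the <VM_UUID:idx:VGPU_ID> syntax, the userspace
sketch below splits such a string into its three fields. The struct and
function names are hypothetical and not taken from the patch; the UUID is only
checked for length, not for valid hex digits, and a trailing newline (as
produced by echo) is tolerated.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define VM_UUID_STR_LEN 36

/* Hypothetical parsed form of a vgpu_create request. */
struct vgpu_create_req {
    char     vm_uuid[VM_UUID_STR_LEN + 1];
    uint32_t idx;
    uint32_t vgpu_id;
};

/* Parse an unsigned decimal field; *end is set just past the digits. */
static int parse_u32(const char *s, uint32_t *out, const char **end)
{
    char *e;
    unsigned long v;

    if (!isdigit((unsigned char)*s))
        return -EINVAL;
    errno = 0;
    v = strtoul(s, &e, 10);
    if (errno || v > UINT32_MAX)
        return -EINVAL;
    *out = (uint32_t)v;
    *end = e;
    return 0;
}

/* Split "<VM_UUID>:<idx>:<VGPU_ID>" into req; returns 0 or -EINVAL. */
int vgpu_parse_create(const char *buf, struct vgpu_create_req *req)
{
    const char *p;

    /* The UUID is a fixed-width 36 character field followed by ':'. */
    if (strlen(buf) < VM_UUID_STR_LEN + 1 || buf[VM_UUID_STR_LEN] != ':')
        return -EINVAL;
    memcpy(req->vm_uuid, buf, VM_UUID_STR_LEN);
    req->vm_uuid[VM_UUID_STR_LEN] = '\0';

    if (parse_u32(buf + VM_UUID_STR_LEN + 1, &req->idx, &p) || *p != ':')
        return -EINVAL;
    if (parse_u32(p + 1, &req->vgpu_id, &p))
        return -EINVAL;

    /* Accept an optional trailing newline, nothing else. */
    if (*p == '\n')
        p++;
    return *p == '\0' ? 0 : -EINVAL;
}
```

The vgpu_destroy input <VM_UUID:idx> could be handled the same way by stopping
after the second field.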

1.3 Under vgpu class sysfs:
----------------------------------------------------------------------------------

vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
interface to notify the GPU vendor driver to commit virtual GPU resources for
this target VM.

Also, vgpu_start is a synchronous call: a successful return indicates that all
the requested vGPU resources have been fully committed and the VMM can continue.

vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration
interface to notify the GPU vendor driver to release the virtual GPU resources
of this target VM.

1.4 Virtual device Hotplug
----------------------------------------------------------------------------------

To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
accessed during VM runtime, and the corresponding registration callback will be
invoked to allow the GPU vendor driver to support hotplug.

To support hotplug, the vendor driver would take the necessary action to handle
the situation when a vgpu_create is done on a VM_UUID after vgpu_start, which
implies both create and start for that vgpu device.

Likewise, vgpu_destroy implies a vgpu_shutdown on a running VM, but only if the
vendor driver supports vgpu hotplug.

If hotplug is not supported and the VM is still running, the vendor driver can
return an error code to indicate that the operation is not supported.

Separating create from start gives the flexibility to have:

- multiple vgpu instances for a single VM and
- the hotplug feature.
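
The create/destroy rules above can be summarized in a small decision sketch.
This is only an illustration: the enum and function names are invented here,
and -EOPNOTSUPP is an assumed choice for the "not supported" error code, which
the proposal does not specify.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome of a create/destroy request. */
enum vgpu_action {
    VGPU_ACT_CREATE_ONLY,          /* VM not started: just create        */
    VGPU_ACT_CREATE_AND_START,     /* hotplug: create implies start      */
    VGPU_ACT_DESTROY_ONLY,         /* VM not running: just destroy       */
    VGPU_ACT_SHUTDOWN_AND_DESTROY, /* hot-unplug: destroy implies shutdown */
};

/* Decide what a vgpu_create on this VM means. */
int vgpu_plan_create(bool vm_started, bool hotplug_supported,
                     enum vgpu_action *act)
{
    if (!vm_started) {
        *act = VGPU_ACT_CREATE_ONLY;
        return 0;
    }
    if (!hotplug_supported)
        return -EOPNOTSUPP;   /* assumed error code */
    *act = VGPU_ACT_CREATE_AND_START;
    return 0;
}

/* Decide what a vgpu_destroy on this VM means. */
int vgpu_plan_destroy(bool vm_started, bool hotplug_supported,
                      enum vgpu_action *act)
{
    if (!vm_started) {
        *act = VGPU_ACT_DESTROY_ONLY;
        return 0;
    }
    if (!hotplug_supported)
        return -EOPNOTSUPP;   /* assumed error code */
    *act = VGPU_ACT_SHUTDOWN_AND_DESTROY;
    return 0;
}
```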

2. GPU driver vendor registration interface
==================================================================================

2.1 Registration interface definition (include/linux/vgpu.h)
----------------------------------------------------------------------------------

extern int vgpu_register_device(struct pci_dev *dev, 
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu
 *                              types.
 *                              @dev: pci device structure of physical GPU.
 *                              @config: should return string listing supported
 *                              config
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resources in graphics
 *                              driver for a particular vgpu.
 *                              @dev: physical pci device structure on which
 *                              vgpu should be created
 *                              @vm_uuid: uuid of the VM the vgpu is intended
 *                              for
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to be
 *                              created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in graphics driver for
 *                              a vgpu instance of that VM.
 *                              @dev: physical pci device structure to which
 *                              this vgpu points.
 *                              @vm_uuid: uuid of the VM the vgpu belongs to.
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If VM is running and vgpu_destroy is called,
 *                              that means the vGPU is being hot-unplugged.
 *                              Return error if VM is running and graphics
 *                              driver doesn't support vgpu hotplug.
 * @vgpu_start:                 Called to initiate the vGPU initialization
 *                              process in graphics driver when VM boots, before
 *                              qemu starts.
 *                              @vm_uuid: VM's UUID which is booting.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to tear down vGPU related resources for
 *                              the VM
 *                              @vm_uuid: VM's UUID which is shutting down.
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number of bytes to read
 *                              @address_space: specifies the address space the
 *                              request is for: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Returns number of bytes read on success or
 *                              error.
 * @write:                      Write emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number of bytes to be written
 *                              @address_space: specifies the address space the
 *                              request is for: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Returns number of bytes written on success or
 *                              error.
 * @vgpu_set_irqs:              Called to pass along the interrupt configuration
 *                              information that qemu set.
 *                              @vdev: vgpu device structure
 *                              @flags, index, start, count and *data: same as
 *                              that of struct vfio_irq_set of
 *                              VFIO_DEVICE_SET_IRQS API.
 *
 * A physical GPU that supports vGPU should be registered with the vgpu module
 * with a gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read)(struct vgpu_device *vdev, char *buf, size_t count,
                        uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);
};
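
To make the registration flow concrete, here is a userspace mock of both sides.
It is a sketch under stated assumptions, not the code in the attached patch:
struct pci_dev and uuid_le are stand-in types, gpu_device_ops is reduced to two
callbacks, the registry is a fixed array, and mock_query_supported_types and
the demo_* names are invented for this example. The ops interface carries no
buffer length, so the mock assumes the caller passes a page-sized buffer.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for the kernel types used by the interface. */
struct pci_dev { const char *name; };
typedef struct { uint8_t b[16]; } uuid_le;

/* Reduced copy of gpu_device_ops: two callbacks only. */
struct gpu_device_ops {
    int (*vgpu_supported_config)(struct pci_dev *dev, char *config);
    int (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                       uint32_t instance, uint32_t vgpu_id);
};

#define MOCK_MAX_GPUS 4

static struct {
    struct pci_dev *dev;
    const struct gpu_device_ops *ops;
} mock_registry[MOCK_MAX_GPUS];

/* Mock core side: remember which ops belong to which physical GPU. */
int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)
{
    int i, free_slot = -1;

    if (!dev || !ops)
        return -EINVAL;
    for (i = 0; i < MOCK_MAX_GPUS; i++) {
        if (mock_registry[i].dev == dev)
            return -EEXIST;
        if (!mock_registry[i].dev && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -ENOSPC;
    mock_registry[free_slot].dev = dev;
    mock_registry[free_slot].ops = ops;
    return 0;
}

void vgpu_unregister_device(struct pci_dev *dev)
{
    int i;

    for (i = 0; i < MOCK_MAX_GPUS; i++) {
        if (dev && mock_registry[i].dev == dev) {
            mock_registry[i].dev = NULL;
            mock_registry[i].ops = NULL;
        }
    }
}

/* What a read of vgpu_supported_types might boil down to in the core. */
int mock_query_supported_types(struct pci_dev *dev, char *config)
{
    int i;

    for (i = 0; i < MOCK_MAX_GPUS; i++) {
        if (mock_registry[i].dev == dev && dev &&
            mock_registry[i].ops->vgpu_supported_config)
            return mock_registry[i].ops->vgpu_supported_config(dev, config);
    }
    return -ENODEV;
}

/* Mock vendor side: report one type and accept every create request. */
static int demo_supported_config(struct pci_dev *dev, char *config)
{
    (void)dev;
    strcpy(config, "17:GRID M60-4Q\n"); /* assumes a page-sized buffer */
    return 0;
}

static int demo_create(struct pci_dev *dev, uuid_le vm_uuid,
                       uint32_t instance, uint32_t vgpu_id)
{
    (void)dev; (void)vm_uuid; (void)instance; (void)vgpu_id;
    return 0;
}

const struct gpu_device_ops demo_ops = {
    .vgpu_supported_config = demo_supported_config,
    .vgpu_create           = demo_create,
};
```

The point of the sketch is the direction of the calls: the vendor driver hands
its ops to the core once per physical GPU, and the core later calls back into
the vendor driver when userspace touches the management interface.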

2.2 Details for callbacks we haven't mentioned above.
---------------------------------------------------------------------------------

vgpu_supported_config: allows the vendor driver to specify the supported vGPU
                       type/configuration

vgpu_create          : create a virtual GPU device, can be used for device hotplug.

vgpu_destroy         : destroy a virtual GPU device, can be used for device hotplug.

vgpu_start           : callback function to notify vendor driver that the vgpu
                       device has come to life for a given virtual machine.

vgpu_shutdown        : callback function to notify vendor driver that the vgpu
                       resources of a given virtual machine are being torn down.

read                 : callback to vendor driver to handle virtual device config
                       space or MMIO read access

write                : callback to vendor driver to handle virtual device config
                       space or MMIO write access

vgpu_set_irqs        : callback to vendor driver to pass along the interrupt
                       information for the target virtual device, then vendor
                       driver can inject interrupt into virtual machine for this
                       device.
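
As an illustration of how a vendor driver might dispatch the read callback on
address_space, here is a minimal userspace sketch. The enum values, the
config-space-only struct vgpu_device and the demo_vgpu_read name are
assumptions made for this example (the proposal does not define the selector
values), and off_t stands in for the kernel's loff_t.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical address-space selectors; the proposal defines no values. */
enum vgpu_emul_space {
    VGPU_EMUL_CONFIG_SPACE,
    VGPU_EMUL_IO,
    VGPU_EMUL_MMIO,
};

#define VGPU_CFG_SIZE 256

/* Stand-in for struct vgpu_device: only a config space backing store. */
struct vgpu_device {
    uint8_t cfg[VGPU_CFG_SIZE];
};

/* Same shape as the proposed read callback, with off_t instead of loff_t. */
ssize_t demo_vgpu_read(struct vgpu_device *vdev, char *buf, size_t count,
                       uint32_t address_space, off_t pos)
{
    switch (address_space) {
    case VGPU_EMUL_CONFIG_SPACE:
        if (pos < 0 || (size_t)pos >= VGPU_CFG_SIZE)
            return -EINVAL;
        /* Clamp to the end of config space: a short read, not an error. */
        if (count > VGPU_CFG_SIZE - (size_t)pos)
            count = VGPU_CFG_SIZE - (size_t)pos;
        memcpy(buf, vdev->cfg + pos, count);
        return (ssize_t)count;
    case VGPU_EMUL_IO:
    case VGPU_EMUL_MMIO:
        /* Not emulated in this sketch. */
        return -EOPNOTSUPP;
    default:
        return -EINVAL;
    }
}
```

A write callback would mirror this, with the vendor driver deciding per
register whether the write updates the backing store or triggers emulation.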

2.3 Potential additional virtual device configuration registration interface:
---------------------------------------------------------------------------------

callback function to describe the MMAP behavior of the virtual GPU 

callback function to allow GPU vendor driver to provide PCI config space backing
memory.

3. VGPU TYPE1 IOMMU
==================================================================================

Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track of
the <iova, hva, size, flag> tuples and save the QEMU mm for later reference.

You can find the quick/ugly implementation in the attached patch file, which is
actually just a simplified version of Alex's type1 IOMMU without any real
mapping when VFIO_IOMMU_MAP_DMA / VFIO_IOMMU_UNMAP_DMA is called.

We have thought about providing another vendor driver registration interface so
that such tracking information is sent to the vendor driver, which would use the
QEMU mm to do the get_user_pages / remap_pfn_range when required. After doing a
quick implementation within our driver, I noticed the following issues:

1) It moves OS/VFIO logic into the vendor driver, which will be a maintenance
issue.

2) Every vendor driver has to implement its own RB tree, instead of reusing
the existing common VFIO code (vfio_find/link/unlink_dma)

3) VFIO_IOMMU_UNMAP_DMA is expected to return the "unmapped bytes" to the
caller/QEMU, so it is better not to have anything inside a vendor driver that
the VFIO caller immediately depends on.

Based on the above considerations, we decided to implement the DMA tracking
logic within the VGPU TYPE1 IOMMU code (ideally, this should be merged into the
current TYPE1 IOMMU code) and expose two symbols for MMIO mapping and for page
translation and pinning.

Also, with an mmap MMIO interface between virtual and physical, a
para-virtualized guest driver can access its virtual MMIO without taking an mmap
fault hit, and we can also support different MMIO sizes between the virtual and
physical devices.

int vgpu_map_virtual_bar(uint64_t virt_bar_addr,
                         uint64_t phys_bar_addr,
                         uint32_t len,
                         uint32_t flags);

EXPORT_SYMBOL(vgpu_map_virtual_bar);

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count);

EXPORT_SYMBOL(vgpu_dma_do_translate);

There is still a lot to be added and modified, such as supporting multiple VMs
and multiple virtual devices, tracking the mapped / pinned regions within the
VGPU IOMMU kernel driver, error handling, roll-back, locked memory size per
user, etc.

4. Modules
==================================================================================

Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko

vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
                           TYPE1 v1 and v2 interface. 

vgpu.ko                  - provides the registration interface and virtual
                           device VFIO access.

5. QEMU note
==================================================================================

To allow us to focus on the VGPU kernel driver prototyping, we have introduced a
new VFIO class - vgpu - inside QEMU, so we don't have to change the existing
vfio/pci.c file and can use it as a reference for our implementation. It is
basically just a quick copy and paste from vfio/pci.c to quickly meet our needs.

Once this proposal is finalized, we will move to vfio/pci.c instead of a new
class, and probably the only thing required will be a new way to discover the
device.

6. Examples
==================================================================================

On this server, we have two NVIDIA M60 GPUs.

[root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)

After nvidia.ko gets initialized, we can query the supported vGPU types by
reading "vgpu_supported_types" as follows:

[root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
11:GRID M60-0B
12:GRID M60-0Q
13:GRID M60-1B
14:GRID M60-1Q
15:GRID M60-2B
16:GRID M60-2Q
17:GRID M60-4Q
18:GRID M60-8Q
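
For management software that needs to turn a type name into its VGPU_ID, a
userspace helper along these lines could scan the listing. This is only a
sketch: the function name is made up, and it assumes the "<VGPU_ID>:<name>"
one-entry-per-line format shown in the output above.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up `name` in a vgpu_supported_types listing ("<id>:<name>" per line)
 * and store its VGPU_ID in *id. Returns 0 on success, -ENOENT if absent.
 */
int vgpu_find_type_id(const char *listing, const char *name, uint32_t *id)
{
    const char *line = listing;
    size_t name_len = strlen(name);

    while (*line) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        /* Only lines starting with a digit can be "<id>:<name>" entries. */
        if (isdigit((unsigned char)*line)) {
            char *sep;
            unsigned long v;

            errno = 0;
            v = strtoul(line, &sep, 10);
            if (!errno && v <= UINT32_MAX && *sep == ':') {
                size_t label_len = len - (size_t)(sep + 1 - line);

                if (label_len == name_len &&
                    memcmp(sep + 1, name, name_len) == 0) {
                    *id = (uint32_t)v;
                    return 0;
                }
            }
        }
        line = eol ? eol + 1 : line + len;
    }
    return -ENOENT;
}
```

With the listing above, looking up "GRID M60-4Q" yields 17, which is the value
used as VGPU_ID when writing to vgpu_create.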

For example, say the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we
would like to create a "GRID M60-4Q" vGPU for it:

echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create

Note: the number 0 here is the vGPU device index. So far the change has not been
tested with multiple vgpu devices, but we will support it.

At this moment, if you query "vgpu_supported_types" it will still show all
supported virtual GPU types, as no virtual GPU resource is committed yet.

Starting VM:

echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start

then, the supported vGPU type query will return:

[root@cjia-vgx-kvm /home/cjia]$ cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
17:GRID M60-4Q

So vgpu_supported_config needs to be called whenever a new virtual device gets
created, as the underlying HW might limit the supported types if any VMs are
already running.

Then, when the VM is shut down, a write to /sys/class/vgpu/vgpu_shutdown will
inform the GPU vendor driver to clean up resources.

Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
the device sysfs.

7. What is not covered:
==================================================================================

7.1 QEMU console VNC

QEMU console VNC is not covered in this RFC as it is a fairly isolated module
that does not impact the basic vGPU functionality. We have also already had a
good discussion about the new VFIO interface that Alex is going to introduce to
allow us to describe a region for the VM surface.

8. Patches
==================================================================================

0001-Add-VGPU-VFIO-driver-class-support-in-QEMU.patch - against QEMU 2.5.0

0001-Add-VGPU-and-its-TYPE1-IOMMU-kernel-module-support.patch  - against 4.4.0-rc5

Thanks,
Kirti and Neo


> 
> Jike will provide next level API definitions based on KVMGT requirement. 
> We can further refine it to match requirements of multi-vendors.
> 
> Thanks
> Kevin
From dc8ca387f7b06c6dfc85fb4bd79a760dca76e831 Mon Sep 17 00:00:00 2001
From: Neo Jia <cjia@nvidia.com>
Date: Tue, 26 Jan 2016 01:21:11 -0800
Subject: [PATCH] Add VGPU and its TYPE1 IOMMU kernel module support

This is just a quick POC implementation to allow GPU vendor drivers to plug
into the VFIO framework to provide their virtual GPU support. This patch
provides a registration interface for GPU vendors and generic DMA tracking APIs.

extern int vgpu_register_device(struct pci_dev *dev,
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu types.
 *                              @dev : pci device structure of physical GPU.
 *                              @config: should return string listing supported config
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resources in graphics
 *                              driver for a particular vgpu.
 *                              @dev: physical pci device structure on which vgpu
 *                                    should be created
 *                              @vm_uuid: VM's uuid for which VM it is intended to
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to be
 *                                        created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in graphics driver for
 *                              a vgpu instance of that VM.
 *                              @dev: physical pci device structure to which
 *                              this vgpu points to.
 *                              @vm_uuid: VM's uuid for which the vgpu belongs to.
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If VM is running and vgpu_destroy is called, that
 *                              means the vGPU is being hot-unplugged. Return error
 *                              if VM is running and graphics driver doesn't
 *                              support vgpu hotplug.
 * @vgpu_start:                 Called to initiate the vGPU initialization
 *                              process in graphics driver when VM boots, before
 *                              qemu starts.
 *                              @vm_uuid: VM's UUID which is booting.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to teardown vGPU related resources for
 *                              the VM
 *                              @vm_uuid: VM's UUID which is shutting down .
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number of bytes to read
 *                              @address_space: specifies which address space
 *                              the request targets: pci_config_space, IO
 *                              register space or MMIO space.
 *                              Returns the number of bytes read on success, or
 *                              an error.
 * @write:                      Write emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number of bytes to be written
 *                              @address_space: specifies which address space
 *                              the request targets: pci_config_space, IO
 *                              register space or MMIO space.
 *                              Returns the number of bytes written on success,
 *                              or an error.
 * @vgpu_set_irqs:              Called to pass on the interrupt configuration
 *                              information that qemu set.
 *                              @vdev: vgpu device structure
 *                              @flags, @index, @start, @count and @data: same
 *                              as the fields of struct vfio_irq_set in the
 *                              VFIO_DEVICE_SET_IRQS API.
 *
 * A physical GPU that supports vGPU should be registered with the vgpu module
 * using a gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space,loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);

};
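
As a rough illustration of how a vendor driver might fill in this table, the
user-space sketch below uses stand-in types and a reduced copy of the ops
structure. Everything prefixed with demo_ is hypothetical, as are the vgpu
type ids 11 and 12; only the callback signatures are taken from the structure
above. The helper demo_core_create() mimics the way the core treats a missing
vgpu_create callback as success.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for kernel types so the sketch builds in user space. */
struct pci_dev { int unused; };
typedef struct { uint8_t b[16]; } uuid_le;

/* Reduced ops table: only the callbacks exercised below. */
struct gpu_device_ops {
    int (*vgpu_supported_config)(struct pci_dev *dev, char *config);
    int (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                       uint32_t instance, uint32_t vgpu_id);
    int (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                        uint32_t instance);
};

static int demo_live_vgpus;    /* hypothetical vendor-side bookkeeping */

static int demo_supported_config(struct pci_dev *dev, char *config)
{
    (void)dev;
    strcpy(config, "11,12");   /* made-up vgpu type ids */
    return 0;
}

static int demo_create(struct pci_dev *dev, uuid_le vm_uuid,
                       uint32_t instance, uint32_t vgpu_id)
{
    (void)dev; (void)vm_uuid; (void)instance;
    if (vgpu_id != 11 && vgpu_id != 12)
        return -EINVAL;        /* unknown vgpu type */
    demo_live_vgpus++;
    return 0;
}

static int demo_destroy(struct pci_dev *dev, uuid_le vm_uuid,
                        uint32_t instance)
{
    (void)dev; (void)vm_uuid; (void)instance;
    if (demo_live_vgpus == 0)
        return -ENODEV;        /* nothing to destroy */
    demo_live_vgpus--;
    return 0;
}

static const struct gpu_device_ops demo_ops = {
    .vgpu_supported_config = demo_supported_config,
    .vgpu_create           = demo_create,
    .vgpu_destroy          = demo_destroy,
};

/* Optional callback handling: an absent vgpu_create counts as success. */
static int demo_core_create(const struct gpu_device_ops *ops,
                            struct pci_dev *dev, uuid_le uuid,
                            uint32_t instance, uint32_t vgpu_id)
{
    if (!ops->vgpu_create)
        return 0;
    return ops->vgpu_create(dev, uuid, instance, vgpu_id);
}
```

In the real module the table would be handed to vgpu_register_device()
together with the struct pci_dev of the physical GPU; the sketch leaves that
step out because it cannot run outside the kernel.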

int vgpu_map_virtual_bar
(
    uint64_t virt_bar_addr,
    uint64_t phys_bar_addr,
    uint32_t len,
    uint32_t flags
);

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count);
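
The address arithmetic behind vgpu_dma_do_translate() can be sketched in user
space as follows. This is a simplified model, not the code in this patch: it
scans a plain array instead of the rb-tree, assumes 4 KiB pages, and stops at
the process virtual address rather than pinning the page and returning a pfn.
All demo_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12     /* assumption: 4 KiB pages */

/* One tracked mapping, as recorded at VFIO_IOMMU_MAP_DMA time. */
struct demo_dma_range {
    uint64_t iova;             /* guest physical start */
    uint64_t vaddr;            /* start in the mapping process (QEMU) */
    uint64_t size;             /* length in bytes */
};

/* Return the range containing iova, or NULL if none is tracked. */
static const struct demo_dma_range *
demo_find_range(const struct demo_dma_range *r, size_t n, uint64_t iova)
{
    for (size_t i = 0; i < n; i++)
        if (iova >= r[i].iova && iova - r[i].iova < r[i].size)
            return &r[i];
    return NULL;
}

/*
 * Replace each guest frame number in buf with the matching process
 * virtual address. Returns 0 on success, -1 on the first untracked gfn
 * (entries before that one have already been rewritten).
 */
static int demo_translate(const struct demo_dma_range *r, size_t n,
                          uint64_t *buf, uint32_t count)
{
    for (uint32_t i = 0; i < count; i++) {
        uint64_t iova = buf[i] << DEMO_PAGE_SHIFT;
        const struct demo_dma_range *hit = demo_find_range(r, n, iova);

        if (!hit)
            return -1;
        buf[i] = hit->vaddr + (iova - hit->iova);
    }
    return 0;
}
```

The buffer is rewritten in place, mirroring how gfn_buffer is both the input
and the output of the exported function.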

Change-Id: Ib70304d9a600c311d5107a94b3fffa938926275b
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
---
 drivers/Kconfig                      |   2 +
 drivers/Makefile                     |   1 +
 drivers/vfio/vfio.c                  |   5 +-
 drivers/vgpu/Kconfig                 |  26 ++
 drivers/vgpu/Makefile                |   5 +
 drivers/vgpu/vfio_iommu_type1_vgpu.c | 511 ++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_dev.c              | 550 +++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_private.h          |  47 +++
 drivers/vgpu/vgpu_sysfs.c            | 322 ++++++++++++++++++++
 drivers/vgpu/vgpu_vfio.c             | 521 +++++++++++++++++++++++++++++++++
 include/linux/vgpu.h                 | 157 ++++++++++
 11 files changed, 2144 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vfio_iommu_type1_vgpu.c
 create mode 100644 drivers/vgpu/vgpu_dev.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 drivers/vgpu/vgpu_sysfs.c
 create mode 100644 drivers/vgpu/vgpu_vfio.c
 create mode 100644 include/linux/vgpu.h
diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339de85f..5fd9eae79914 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 795d0ca714bf..142256b4358b 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
 obj-$(CONFIG_VFIO)		+= vfio/
+obj-$(CONFIG_VGPU)              += vgpu/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6070b793cbcb..af3ab413e119 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -947,19 +947,18 @@ static long vfio_ioctl_set_iommu(struct vfio_container *container,
 		if (IS_ERR(data)) {
 			ret = PTR_ERR(data);
 			module_put(driver->ops->owner);
-			goto skip_drivers_unlock;
+			continue;
 		}
 
 		ret = __vfio_container_attach_groups(container, driver, data);
 		if (!ret) {
 			container->iommu_driver = driver;
 			container->iommu_data = data;
+			goto skip_drivers_unlock;
 		} else {
 			driver->ops->release(data);
 			module_put(driver->ops->owner);
 		}
-
-		goto skip_drivers_unlock;
 	}
 
 	mutex_unlock(&vfio.iommu_drivers_lock);
diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
new file mode 100644
index 000000000000..698ddf907a16
--- /dev/null
+++ b/drivers/vgpu/Kconfig
@@ -0,0 +1,26 @@
+
+menuconfig VGPU
+    tristate "VGPU driver framework"
+    depends on VFIO
+    select VGPU_VFIO
+    select VFIO_IOMMU_TYPE1_VGPU
+    help
+        VGPU provides a framework to virtualize GPUs that lack SR-IOV
+        capability. See Documentation/vgpu.txt for more details.
+
+        If you don't know what to do here, say N.
+
+config VGPU
+    tristate
+    depends on VFIO
+    default n
+
+config VGPU_VFIO
+    tristate
+    depends on VGPU 
+    default n
+
+config VFIO_IOMMU_TYPE1_VGPU
+    tristate
+    depends on VGPU_VFIO
+    default n
diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
new file mode 100644
index 000000000000..098a3591a535
--- /dev/null
+++ b/drivers/vgpu/Makefile
@@ -0,0 +1,5 @@
+
+vgpu-y := vgpu_sysfs.o vgpu_dev.o vgpu_vfio.o
+
+obj-$(CONFIG_VGPU)	+= vgpu.o
+obj-$(CONFIG_VFIO_IOMMU_TYPE1_VGPU) += vfio_iommu_type1_vgpu.o
diff --git a/drivers/vgpu/vfio_iommu_type1_vgpu.c b/drivers/vgpu/vfio_iommu_type1_vgpu.c
new file mode 100644
index 000000000000..6b20f1374b3b
--- /dev/null
+++ b/drivers/vgpu/vfio_iommu_type1_vgpu.c
@@ -0,0 +1,511 @@
+/*
+ * VGPU : IOMMU DMA mapping support for VGPU
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC     "VGPU Type1 IOMMU driver for VFIO"
+
+// VFIO structures
+
+struct vfio_iommu_vgpu {
+	struct mutex lock;
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+	struct rb_root dma_list;
+	struct mm_struct * vm_mm;
+};
+
+struct vgpu_vfio_dma {
+	struct rb_node node;
+	dma_addr_t iova;
+	unsigned long vaddr;
+	size_t size;
+	int prot;
+};
+
+/*
+ * VGPU VFIO FOPs definition
+ *
+ */
+
+/*
+ * Duplicated from vfio_link_dma as a quick hack ... should
+ * reuse code later
+ */
+
+static void vgpu_link_dma(struct vfio_iommu_vgpu *iommu,
+			  struct vgpu_vfio_dma *new)
+{
+	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
+	struct vgpu_vfio_dma *dma;
+
+	while (*link) {
+		parent = *link;
+		dma = rb_entry(parent, struct vgpu_vfio_dma, node);
+
+		if (new->iova + new->size <= dma->iova)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &iommu->dma_list);
+}
+
+static struct vgpu_vfio_dma *vgpu_find_dma(struct vfio_iommu_vgpu *iommu,
+					   dma_addr_t start, size_t size)
+{
+	struct rb_node *node = iommu->dma_list.rb_node;
+
+	while (node) {
+		struct vgpu_vfio_dma *dma = rb_entry(node, struct vgpu_vfio_dma, node);
+
+		if (start + size <= dma->iova)
+			node = node->rb_left;
+		else if (start >= dma->iova + dma->size)
+			node = node->rb_right;
+		else
+			return dma;
+	}
+
+	return NULL;
+}
+
+static void vgpu_unlink_dma(struct vfio_iommu_vgpu *iommu, struct vgpu_vfio_dma *old)
+{
+	rb_erase(&old->node, &iommu->dma_list);
+}
+
+static void vgpu_dump_dma(struct vfio_iommu_vgpu *iommu)
+{
+	struct vgpu_vfio_dma *c, *n;
+	uint32_t i = 0;
+
+	rbtree_postorder_for_each_entry_safe(c, n, &iommu->dma_list, node)
+		printk(KERN_INFO "%s: dma[%d] iova:0x%llx, vaddr:0x%lx, size:0x%lx\n",
+		       __FUNCTION__, i++, c->iova, c->vaddr, c->size);
+}
+
+static int vgpu_dma_do_track(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_map *map)
+{
+	dma_addr_t iova = map->iova;
+	unsigned long vaddr = map->vaddr;
+	int ret = 0, prot = 0;
+	struct vgpu_vfio_dma *vgpu_dma;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	if (vgpu_find_dma(vgpu_iommu, map->iova, map->size)) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -EEXIST;
+	}
+
+	vgpu_dma = kzalloc(sizeof(*vgpu_dma), GFP_KERNEL);
+
+	if (!vgpu_dma) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -ENOMEM;
+	}
+
+	vgpu_dma->iova = iova;
+	vgpu_dma->vaddr = vaddr;
+	vgpu_dma->prot = prot;
+	vgpu_dma->size = map->size;
+
+	vgpu_link_dma(vgpu_iommu, vgpu_dma);
+
+	mutex_unlock(&vgpu_iommu->lock);
+	return ret;
+}
+
+static int vgpu_dma_do_untrack(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_unmap *unmap)
+{
+	struct vgpu_vfio_dma *vgpu_dma;
+	size_t unmapped = 0;
+	int ret = 0;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, 0);
+	if (vgpu_dma && vgpu_dma->iova != unmap->iova) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova + unmap->size - 1, 0);
+	if (vgpu_dma && vgpu_dma->iova + vgpu_dma->size != unmap->iova + unmap->size) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	while (( vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, unmap->size))) {
+		unmapped += vgpu_dma->size;
+		vgpu_unlink_dma(vgpu_iommu, vgpu_dma);
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	unmap->size = unmapped;
+
+	return ret;
+}
+
+/* Ugly hack to quickly test single device ... */
+
+static struct vfio_iommu_vgpu *_local_iommu = NULL;
+
+int vgpu_map_virtual_bar
+(
+	uint64_t virt_bar_addr,
+        uint64_t phys_bar_addr,
+	uint32_t len,
+	uint32_t flags
+)
+{
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	unsigned long remote_vaddr = 0;
+	struct vgpu_vfio_dma *vgpu_dma = NULL;
+	struct vm_area_struct *remote_vma = NULL;
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+	int ret = 0;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	down_write(&mm->mmap_sem);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, virt_bar_addr, len /*  size */);
+	if (!vgpu_dma) {
+		printk(KERN_INFO "%s: fail locate guest physical:0x%llx\n",
+		       __FUNCTION__, virt_bar_addr);
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	remote_vaddr = vgpu_dma->vaddr + virt_bar_addr - vgpu_dma->iova;
+
+        remote_vma = find_vma(mm, remote_vaddr);
+
+	if (remote_vma == NULL) {
+		printk(KERN_INFO "%s: fail locate vma, physical addr:0x%llx\n",
+		       __FUNCTION__, virt_bar_addr);
+		ret = -EINVAL;
+		goto unlock;
+	}
+	else {
+		printk(KERN_INFO "%s: locate vma, addr:0x%lx\n",
+		       __FUNCTION__, remote_vma->vm_start);
+	}
+
+	remote_vma->vm_page_prot = pgprot_noncached(remote_vma->vm_page_prot);
+
+	remote_vma->vm_pgoff = phys_bar_addr >> PAGE_SHIFT;
+
+	ret = remap_pfn_range(remote_vma, remote_vaddr, remote_vma->vm_pgoff,
+			len, remote_vma->vm_page_prot);
+
+	if (ret) {
+		printk(KERN_INFO "%s: fail to remap vma:%d\n", __FUNCTION__, ret);
+		goto unlock;
+	}
+
+unlock:
+
+	up_write(&mm->mmap_sem);
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+
+	return ret;
+}
+
+EXPORT_SYMBOL(vgpu_map_virtual_bar);
+
+int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
+{
+	int i = 0, ret = 0, prot = 0;
+	unsigned long remote_vaddr = 0, pfn = 0;
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	struct vgpu_vfio_dma *vgpu_dma;
+	struct page *page[1];
+	// unsigned long * addr = NULL;
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+
+	prot = IOMMU_READ | IOMMU_WRITE;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	for (i = 0; i < count; i++) {
+		dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
+		vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, PAGE_SIZE /* size */);
+
+		if (!vgpu_dma) {
+			printk(KERN_INFO "%s: fail locate iova[%d]:0x%llx\n", __FUNCTION__, i, iova);
+			ret = -EINVAL;
+			goto unlock;
+		}
+
+		remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
+		printk(KERN_INFO "%s: find dma iova[%d]:0x%llx, vaddr:0x%lx, size:0x%lx, remote_vaddr:0x%lx\n",
+			__FUNCTION__, i, vgpu_dma->iova,
+			vgpu_dma->vaddr, vgpu_dma->size, remote_vaddr);
+
+		if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
+			pfn = page_to_pfn(page[0]);
+			printk(KERN_INFO "%s: pfn[%d]:0x%lx\n", __FUNCTION__, i, pfn);
+			// addr = vmap(page, 1, VM_MAP, PAGE_KERNEL);
+		}
+		else {
+			printk(KERN_INFO "%s: fail to pin pfn[%d]\n", __FUNCTION__, i);
+			ret = -ENOMEM;
+			goto unlock;
+		}
+
+		gfn_buffer[i] = pfn;
+		// vunmap(addr);
+
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+	return ret;
+}
+
+EXPORT_SYMBOL(vgpu_dma_do_translate);
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+static void *vfio_iommu_vgpu_open(unsigned long arg)
+{
+	struct vfio_iommu_vgpu *iommu;
+
+	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+
+	if (!iommu)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&iommu->lock);
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	/* TODO: Keep track of v2 vs. v1; for now just assume
+	 * we are v2 because of the QEMU code */
+	_local_iommu = iommu;
+	return iommu;
+}
+
+static void vfio_iommu_vgpu_release(void *iommu_data)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	kfree(iommu);
+	printk(KERN_INFO "%s", __FUNCTION__);
+}
+
+static long vfio_iommu_vgpu_ioctl(void *iommu_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	unsigned long minsz;
+	struct vfio_iommu_vgpu *vgpu_iommu = iommu_data;
+
+	switch (cmd) {
+	case VFIO_CHECK_EXTENSION:
+	{
+		if ((arg == VFIO_TYPE1_IOMMU) || (arg == VFIO_TYPE1v2_IOMMU))
+			return 1;
+		else
+			return 0;
+	}
+
+	case VFIO_IOMMU_GET_INFO:
+	{
+		struct vfio_iommu_type1_info info;
+		minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+	}
+	case VFIO_IOMMU_MAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_map map;
+		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+
+		if (copy_from_user(&map, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (map.argsz < minsz)
+			return -EINVAL;
+
+		printk(KERN_INFO "VGPU-IOMMU:MAP_DMA flags:%d, vaddr:0x%llx, iova:0x%llx, size:0x%llx\n",
+			map.flags, map.vaddr, map.iova, map.size);
+
+		/*
+		 * TODO: Tracking code is mostly duplicated from TYPE1 IOMMU, ideally,
+		 * this should be merged into one single file and reuse data
+		 * structure
+		 *
+		 */
+		ret = vgpu_dma_do_track(vgpu_iommu, &map);
+		break;
+	}
+	case VFIO_IOMMU_UNMAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_unmap unmap;
+
+		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+
+		if (copy_from_user(&unmap, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (unmap.argsz < minsz)
+			return -EINVAL;
+
+		ret = vgpu_dma_do_untrack(vgpu_iommu, &unmap);
+		break;
+	}
+	default:
+	{
+		printk(KERN_INFO "%s cmd default ", __FUNCTION__);
+		ret = -ENOTTY;
+		break;
+	}
+	}
+
+	return ret;
+}
+
+
+static int vfio_iommu_vgpu_attach_group(void *iommu_data,
+		                        struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	vgpu_dev = get_vgpu_device_from_group(iommu_group);
+	if (vgpu_dev) {
+		iommu->vgpu_dev = vgpu_dev;
+		iommu->group = iommu_group;
+
+		/* IOMMU shares the same life cycle as VM MM */
+		iommu->vm_mm = current->mm;
+
+		printk(KERN_INFO "%s index %d", __FUNCTION__, vgpu_dev->minor);
+		return 0;
+	}
+	iommu->group = iommu_group;
+	return 1;
+}
+
+static void vfio_iommu_vgpu_detach_group(void *iommu_data,
+		struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+	iommu->vm_mm = NULL;
+	iommu->group = NULL;
+
+	return;
+}
+
+
+static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_driver_ops = {
+	.name           = "vgpu_vfio",
+	.owner          = THIS_MODULE,
+	.open           = vfio_iommu_vgpu_open,
+	.release        = vfio_iommu_vgpu_release,
+	.ioctl          = vfio_iommu_vgpu_ioctl,
+	.attach_group   = vfio_iommu_vgpu_attach_group,
+	.detach_group   = vfio_iommu_vgpu_detach_group,
+};
+
+
+int vgpu_vfio_iommu_init(void)
+{
+	int rc = vfio_register_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vfio iommu, err:%d\n", rc);
+	}
+
+	return rc;
+}
+
+void vgpu_vfio_iommu_exit(void)
+{
+	// unregister vgpu_vfio driver
+	vfio_unregister_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+}
+
+
+module_init(vgpu_vfio_iommu_init);
+module_exit(vgpu_vfio_iommu_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
diff --git a/drivers/vgpu/vgpu_dev.c b/drivers/vgpu/vgpu_dev.c
new file mode 100644
index 000000000000..1d4eb235122c
--- /dev/null
+++ b/drivers/vgpu/vgpu_dev.c
@@ -0,0 +1,550 @@
+/*
+ * VGPU core
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC	"VGPU driver"
+
+/*
+ * #defines
+ */
+
+#define VGPU_CLASS_NAME		"vgpu"
+
+#define VGPU_DEV_NAME		"vgpu"
+
+// TODO remove these defines
+// minor number reserved for control device
+#define VGPU_CONTROL_DEVICE       0
+
+#define VGPU_CONTROL_DEVICE_NAME  "vgpuctl"
+
+/*
+ * Global Structures
+ */
+
+static struct vgpu {
+	dev_t               vgpu_devt;
+	struct class        *class;
+	struct cdev         vgpu_cdev;
+	struct list_head    vgpu_devices_list;  // Head entry for the doubly linked vgpu_device list
+	struct mutex        vgpu_devices_lock;
+	struct idr          vgpu_idr;
+	struct list_head    gpu_devices_list;
+	struct mutex        gpu_devices_lock;
+} vgpu;
+
+
+/*
+ * Function prototypes
+ */
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev);
+
+unsigned int vgpu_poll(struct file *file, poll_table *wait);
+long vgpu_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long i_arg);
+int vgpu_mmap(struct file *file, struct vm_area_struct *vma);
+
+int vgpu_open(struct inode *inode, struct file *file);
+int vgpu_close(struct inode *inode, struct file *file);
+ssize_t vgpu_read(struct file *file, char __user * buf,
+		      size_t len, loff_t * ppos);
+ssize_t vgpu_write(struct file *file, const char __user *data,
+		       size_t len, loff_t *ppos);
+
+/*
+ * Functions
+ */
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group)
+{
+
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->group) {
+			if (iommu_group_id(vdev->group) == iommu_group_id(group)) {
+				mutex_unlock(&vgpu.vgpu_devices_lock);
+				return vdev;
+			}
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+EXPORT_SYMBOL_GPL(get_vgpu_device_from_group);
+
+int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev, *tmp;
+
+	if (!dev)
+		return -EINVAL;
+
+        gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL);
+        if (!gpu_dev)
+                return -ENOMEM;
+
+	gpu_dev->dev = dev;
+        gpu_dev->ops = ops;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+
+        /* Check for duplicates */
+        list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) {
+                if (tmp->dev == dev) {
+                        mutex_unlock(&vgpu.gpu_devices_lock);
+                        kfree(gpu_dev);
+                        return -EINVAL;
+                }
+        }
+
+	ret = vgpu_create_pci_device_files(dev);
+	if (ret) {
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		kfree(gpu_dev);
+		return ret;
+	}
+        list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list);
+
+	printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class);
+        mutex_unlock(&vgpu.gpu_devices_lock);
+
+        return 0;
+}
+EXPORT_SYMBOL(vgpu_register_device);
+
+void vgpu_unregister_device(struct pci_dev *dev)
+{
+        struct gpu_device *gpu_dev;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+        list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+                if (gpu_dev->dev == dev) {
+			printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class);
+			vgpu_remove_pci_device_files(dev);
+                        list_del(&gpu_dev->gpu_next);
+                        mutex_unlock(&vgpu.gpu_devices_lock);
+                        kfree(gpu_dev);
+                        return;
+                }
+        }
+        mutex_unlock(&vgpu.gpu_devices_lock);
+}
+EXPORT_SYMBOL(vgpu_unregister_device);
+
+
+/*
+ *  Static functions
+ */
+
+static struct file_operations vgpu_fops = {
+	.owner          = THIS_MODULE,
+};
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev)
+{
+	if (vgpu_dev->dev) {
+		device_destroy(vgpu.class, vgpu_dev->dev->devt);
+		vgpu_dev->dev = NULL;
+	}
+}
+
+/*
+ * Helper Functions
+ */
+
+static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char *name)
+{
+	struct vgpu_device *vgpu_dev = NULL;
+
+	vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL);
+	if (!vgpu_dev)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&vgpu_dev->kref);
+	memcpy(&vgpu_dev->vm_uuid, &uuid, sizeof(uuid_le));
+	vgpu_dev->vgpu_instance = instance;
+	strcpy(vgpu_dev->dev_name, name);
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	return vgpu_dev;
+}
+
+static void vgpu_device_free(struct vgpu_device *vgpu_dev)
+{
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_del(&vgpu_dev->list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	kfree(vgpu_dev);
+}
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+				(vdev->vgpu_instance == instance)) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+struct vgpu_device *find_vgpu_device(struct device *dev)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->dev == dev) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id)
+{
+	int minor;
+	char name[64];
+	int numChar = 0;
+	int retval = 0;
+
+	struct iommu_group *group = NULL;
+	struct device *dev = NULL;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	struct gpu_device *gpu_dev;
+
+	printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__);
+
+	numChar = sprintf(name, "%pUb-%d", vm_uuid.b, instance);
+	name[numChar] = '\0';
+
+	vgpu_dev = vgpu_device_alloc(vm_uuid, instance, name);
+	if (IS_ERR(vgpu_dev)) {
+		return PTR_ERR(vgpu_dev);
+	}
+
+	// check if VM device is present
+	// if not present, create with devt=0 and parent=NULL
+	// create device for instance with devt= MKDEV(vgpu.major, minor)
+	// and parent=VM device
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	vgpu_dev->vgpu_id = vgpu_id;
+
+	// TODO on removing control device change the 3rd parameter to 0
+	minor = idr_alloc(&vgpu.vgpu_idr, vgpu_dev, 1, MINORMASK + 1, GFP_KERNEL);
+	if (minor < 0) {
+		retval = minor;
+		goto create_failed;
+	}
+
+	dev = device_create(vgpu.class, NULL, MKDEV(MAJOR(vgpu.vgpu_devt), minor), NULL, "%s", name);
+	if (IS_ERR(dev)) {
+		retval = PTR_ERR(dev);
+		goto create_failed1;
+	}
+
+	vgpu_dev->dev = dev;
+	vgpu_dev->minor = minor;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev == pdev) {
+			vgpu_dev->gpu_dev = gpu_dev;
+			if (gpu_dev->ops->vgpu_create) {
+				retval = gpu_dev->ops->vgpu_create(pdev, vgpu_dev->vm_uuid,
+								   instance, vgpu_id);
+				if (retval)
+				{
+					mutex_unlock(&vgpu.gpu_devices_lock);
+					goto create_failed2;
+				}
+			}
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	if (!vgpu_dev->gpu_dev) {
+		retval = -EINVAL;
+		goto create_failed2;
+	}
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	printk(KERN_INFO "UUID %pUb \n", vgpu_dev->vm_uuid.b);
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		printk(KERN_ERR "VGPU: failed to allocate group!\n");
+		retval = PTR_ERR(group);
+		goto create_failed2;
+	}
+
+	retval = iommu_group_add_device(group, dev);
+	if (retval) {
+		printk(KERN_ERR "VGPU: failed to add dev to group!\n");
+		iommu_group_put(group);
+		goto create_failed2;
+	}
+
+	retval = vgpu_group_init(vgpu_dev, group);
+	if (retval) {
+		printk(KERN_ERR "VGPU: failed vgpu_group_init \n");
+		iommu_group_put(group);
+		iommu_group_remove_device(dev);
+		goto create_failed2;
+	}
+
+	vgpu_dev->group = group;
+	printk(KERN_INFO "VGPU: group_id = %d \n", iommu_group_id(group));
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return retval;
+
+create_failed2:
+	vgpu_device_destroy(vgpu_dev);
+
+create_failed1:
+	idr_remove(&vgpu.vgpu_idr, minor);
+
+create_failed:
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	vgpu_device_free(vgpu_dev);
+
+	return retval;
+}
+
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev)
+{
+	struct device *dev = vgpu_dev->dev;
+
+	if (!dev) {
+		return;
+	}
+
+	printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name);
+	if (vgpu_dev->gpu_dev->ops->vgpu_destroy) {
+		int retval = 0;
+		retval = vgpu_dev->gpu_dev->ops->vgpu_destroy(vgpu_dev->gpu_dev->dev,
+							      vgpu_dev->vm_uuid,
+							      vgpu_dev->vgpu_instance);
+	/* if vendor driver doesn't return success that means vendor driver doesn't
+	 * support hot-unplug */
+		if (retval)
+			return;
+	}
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	vgpu_group_free(vgpu_dev);
+	iommu_group_put(dev->iommu_group);
+	iommu_group_remove_device(dev);
+	vgpu_device_destroy(vgpu_dev);
+	idr_remove(&vgpu.vgpu_idr, vgpu_dev->minor);
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	vgpu_device_free(vgpu_dev);
+}
+
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev, *vgpu_dev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	// search VGPU device
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+				(vdev->vgpu_instance == instance)) {
+			vgpu_dev = vdev;
+			break;
+		}
+	}
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	if (vgpu_dev)
+		destroy_vgpu_device(vgpu_dev);
+}
+
+void get_vgpu_supported_types(struct device *dev, char *str)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (&gpu_dev->dev->dev == dev) {
+			if (gpu_dev->ops->vgpu_supported_config)
+				gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str);
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+
+int vgpu_start_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_start)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_start(vgpu_dev->vm_uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_shutdown)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_shutdown(vgpu_dev->vm_uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+                           unsigned index, unsigned start, unsigned count,
+                           void *data)
+{
+       int ret = 0;
+
+       mutex_lock(&vgpu.gpu_devices_lock);
+       if (vgpu_dev->gpu_dev->ops->vgpu_set_irqs)
+               ret = vgpu_dev->gpu_dev->ops->vgpu_set_irqs(vgpu_dev, flags,
+                                                          index, start, count, data);
+       mutex_unlock(&vgpu.gpu_devices_lock);
+       return ret;
+}
+
+char *vgpu_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev));
+}
+
+static struct class vgpu_class = {
+	.name		= VGPU_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= vgpu_class_attrs,
+	.dev_groups	= vgpu_dev_groups,
+	.devnode	= vgpu_devnode,
+};
+
+static int __init vgpu_init(void)
+{
+	int rc = 0;
+
+	memset(&vgpu, 0 , sizeof(vgpu));
+
+	idr_init(&vgpu.vgpu_idr);
+	mutex_init(&vgpu.vgpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
+	mutex_init(&vgpu.gpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.gpu_devices_list);
+
+	// get major number from kernel
+	rc = alloc_chrdev_region(&vgpu.vgpu_devt, 0, MINORMASK, VGPU_DEV_NAME);
+
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu drv, err:%d\n", rc);
+		return rc;
+	}
+
+	cdev_init(&vgpu.vgpu_cdev, &vgpu_fops);
+	rc = cdev_add(&vgpu.vgpu_cdev, vgpu.vgpu_devt, MINORMASK);
+	if (rc < 0)
+		goto failed2;
+
+	printk(KERN_INFO "major_number:%d is allocated for vgpu\n", MAJOR(vgpu.vgpu_devt));
+	rc = class_register(&vgpu_class);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu class\n");
+		goto failed1;
+	}
+
+	vgpu.class = &vgpu_class;
+	return rc;
+
+failed1:
+	cdev_del(&vgpu.vgpu_cdev);
+failed2:
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+	return rc;
+}
+
+static void __exit vgpu_exit(void)
+{
+	// TODO: Release all unclosed fd
+	struct vgpu_device *vdev = NULL, *tmp;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry_safe(vdev, tmp, &vgpu.vgpu_devices_list, list) {
+		printk(KERN_INFO "VGPU: exit destroying device %s\n", vdev->dev_name);
+		mutex_unlock(&vgpu.vgpu_devices_lock);
+		destroy_vgpu_device(vdev);
+		mutex_lock(&vgpu.vgpu_devices_lock);
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	idr_destroy(&vgpu.vgpu_idr);
+	cdev_del(&vgpu.vgpu_cdev);
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+	class_destroy(vgpu.class);
+	vgpu.class = NULL;
+}
+
+module_init(vgpu_init)
+module_exit(vgpu_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
new file mode 100644
index 000000000000..7e3c400d29f7
--- /dev/null
+++ b/drivers/vgpu/vgpu_private.h
@@ -0,0 +1,47 @@
+/*
+ * VGPU internal definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_PRIVATE_H
+#define VGPU_PRIVATE_H
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group *group);
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev);
+
+struct vgpu_device *find_vgpu_device(struct device *dev);
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id);
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+/* Function prototypes for vgpu_sysfs */
+
+extern struct class_attribute vgpu_class_attrs[];
+extern const struct attribute_group *vgpu_dev_groups[];
+
+int vgpu_create_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_notify_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_remove_status_file(struct vgpu_device *vgpu_dev);
+
+int vgpu_create_pci_device_files(struct pci_dev *dev);
+void vgpu_remove_pci_device_files(struct pci_dev *dev);
+
+void get_vgpu_supported_types(struct device *dev, char *str);
+int vgpu_start_callback(struct vgpu_device *vgpu_dev);
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+                           unsigned index, unsigned start, unsigned count,
+                           void *data);
+
+#endif /* VGPU_PRIVATE_H */
diff --git a/drivers/vgpu/vgpu_sysfs.c b/drivers/vgpu/vgpu_sysfs.c
new file mode 100644
index 000000000000..e48cbcd6948d
--- /dev/null
+++ b/drivers/vgpu/vgpu_sysfs.c
@@ -0,0 +1,322 @@
+/*
+ * File attributes for vGPU devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+/* Prototypes */
+
+static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf);
+static DEVICE_ATTR_RO(vgpu_supported_types);
+
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_create);
+
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_destroy);
+
+
+/* Static functions */
+
+static bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':')
+		return true;
+	return false;
+}
+
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < 36)
+		return -EINVAL;
+
+	for (i = 0; i < 16; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			printk(KERN_ERR "%s: invalid UUID string\n", __FUNCTION__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+
+/* Functions */
+static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	char *str;
+	ssize_t n;
+
+	str = kzalloc(sizeof(*str) * 512, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	get_vgpu_supported_types(dev, str);
+
+	n = sprintf(buf, "%s\n", str);
+	kfree(str);
+
+	return n;
+}
+
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
+{
+	char *vm_uuid_str, *instance_str, *str, *orig;
+	uuid_le vm_uuid;
+	uint32_t instance, vgpu_id;
+	struct pci_dev *pdev;
+	ssize_t ret = -EINVAL;
+
+	orig = kstrndup(buf, count, GFP_KERNEL);
+	if (!orig)
+		return -ENOMEM;
+	str = orig;	/* strsep() advances str; orig is kept for kfree() */
+
+	if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s vgpu type and instance not specified %s \n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if ((instance_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty instance or string %s \n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s vgpu type not specified %s \n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	instance = (unsigned int)simple_strtoul(instance_str, NULL, 0);
+	vgpu_id = (unsigned int)simple_strtoul(str, NULL, 0);
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if (dev_is_pci(dev)) {
+		pdev = to_pci_dev(dev);
+		if (create_vgpu_device(pdev, vm_uuid, instance, vgpu_id) < 0) {
+			printk(KERN_ERR "%s vgpu create error \n", __FUNCTION__);
+			goto out;
+		}
+	}
+
+	ret = count;
+out:
+	kfree(orig);
+	return ret;
+}
+
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
+{
+	char *vm_uuid_str, *str, *orig;
+	uuid_le vm_uuid;
+	unsigned int instance;
+	ssize_t ret = -EINVAL;
+
+	orig = kstrndup(buf, count, GFP_KERNEL);
+	if (!orig)
+		return -ENOMEM;
+	str = orig;	/* strsep() advances str; orig is kept for kfree() */
+
+	if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s instance not specified %s \n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	instance = (unsigned int)simple_strtoul(str, NULL, 0);
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		goto out;
+	}
+	printk(KERN_INFO "%s UUID %pUb - %d \n", __FUNCTION__, vm_uuid.b, instance);
+	destroy_vgpu_device_by_uuid(vm_uuid, instance);
+	ret = count;
+out:
+	kfree(orig);
+	return ret;
+}
+
+static ssize_t
+vgpu_vm_uuid_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = find_vgpu_device(dev);
+
+	if (drv)
+		return sprintf(buf, "%pUb \n", drv->vm_uuid.b);
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_vm_uuid);
+
+static ssize_t
+vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = find_vgpu_device(dev);
+
+	if (drv && drv->group)
+		return sprintf(buf, "%d \n", iommu_group_id(drv->group));
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_group_id);
+
+
+static struct attribute *vgpu_dev_attrs[] = {
+	&dev_attr_vgpu_vm_uuid.attr,
+	&dev_attr_vgpu_group_id.attr,
+	NULL,
+};
+
+static const struct attribute_group vgpu_dev_group = {
+	.attrs = vgpu_dev_attrs,
+};
+
+const struct attribute_group *vgpu_dev_groups[] = {
+	&vgpu_dev_group,
+	NULL,
+};
+
+
+ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr,
+		const char *buf, size_t count)
+{
+	char *vm_uuid_str;
+	uuid_le vm_uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+	if (!vm_uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_parse(vm_uuid_str, &vm_uuid);
+	kfree(vm_uuid_str);	/* only needed while parsing */
+	if (ret < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+	if (vgpu_dev && vgpu_dev->dev) {
+		kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_ONLINE);
+
+		ret = vgpu_start_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_start callback failed  %d \n", __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr,
+		const char *buf, size_t count)
+{
+	char *vm_uuid_str;
+	uuid_le vm_uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+	if (!vm_uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_parse(vm_uuid_str, &vm_uuid);
+	kfree(vm_uuid_str);	/* only needed while parsing */
+	if (ret < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+	vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+	if (vgpu_dev && vgpu_dev->dev) {
+		kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_OFFLINE);
+
+		ret = vgpu_shutdown_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_shutdown callback failed  %d \n", __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+struct class_attribute vgpu_class_attrs[] = {
+	__ATTR_WO(vgpu_start),
+	__ATTR_WO(vgpu_shutdown),
+	__ATTR_NULL
+};
+
+int vgpu_create_pci_device_files(struct pci_dev *dev)
+{
+	int retval;
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_supported_types sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs entry\n");
+		return retval;
+	}
+
+	return 0;
+}
+
+
+void vgpu_remove_pci_device_files(struct pci_dev *dev)
+{
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+}
+
diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c
new file mode 100644
index 000000000000..ef0833140d84
--- /dev/null
+++ b/drivers/vgpu/vgpu_vfio.c
@@ -0,0 +1,521 @@
+/*
+ * VGPU VFIO device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)	((off) >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+struct vfio_vgpu_device {
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+};
+
+static int vgpu_dev_open(void *device_data)
+{
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+	return 0;
+}
+
+static void vgpu_dev_close(void *device_data)
+{
+
+}
+
+static uint64_t resource_len(struct vgpu_device *vgpu_dev, int bar_index)
+{
+	uint64_t size = 0;
+
+	switch (bar_index) {
+	case VFIO_PCI_BAR0_REGION_INDEX:
+		size = 16 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR1_REGION_INDEX:
+		size = 256 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR2_REGION_INDEX:
+		size = 32 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR5_REGION_INDEX:
+		size = 128;
+		break;
+	default:
+		size = 0;
+		break;
+	}
+	return size;
+}
+
+static int vgpu_get_irq_count(struct vfio_vgpu_device *vdev, int irq_type)
+{
+	return 1;
+}
+
+static long vgpu_dev_unlocked_ioctl(void *device_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd)
+	{
+		case VFIO_DEVICE_GET_INFO:
+		{
+			struct vfio_device_info info;
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_INFO cmd index = %d\n", __FUNCTION__, vdev->vgpu_dev->minor);
+			minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz)
+				return -EINVAL;
+
+			info.flags = VFIO_DEVICE_FLAGS_PCI;
+			info.num_regions = VFIO_PCI_NUM_REGIONS;
+			info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+			return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+		}
+
+		case VFIO_DEVICE_GET_REGION_INFO:
+		{
+			struct vfio_region_info info;
+
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_REGION_INFO cmd\n", __FUNCTION__);
+
+			minsz = offsetofend(struct vfio_region_info, offset);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz)
+				return -EINVAL;
+
+			switch (info.index) {
+				case VFIO_PCI_CONFIG_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = 0x100;	/* 256 bytes of PCI config space */
+					//                    info.size = sizeof(vdev->vgpu_dev->config_space);
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+							VFIO_REGION_INFO_FLAG_WRITE;
+					break;
+				case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = resource_len(vdev->vgpu_dev, info.index);
+					if (!info.size) {
+						info.flags = 0;
+						break;
+					}
+
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+						VFIO_REGION_INFO_FLAG_WRITE;
+
+					if ((info.index == VFIO_PCI_BAR1_REGION_INDEX) ||
+					     (info.index == VFIO_PCI_BAR2_REGION_INDEX)) {
+						info.flags |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+					}
+
+					/* TODO: provides configurable setups to
+					 * GPU vendor
+					 */
+
+					if (info.index == VFIO_PCI_BAR1_REGION_INDEX)
+						info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
+
+					break;
+				case VFIO_PCI_VGA_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = 0xc0000;
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+						VFIO_REGION_INFO_FLAG_WRITE;
+					break;
+
+				case VFIO_PCI_ROM_REGION_INDEX:
+				default:
+					return -EINVAL;
+			}
+
+			return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+
+		}
+		case VFIO_DEVICE_GET_IRQ_INFO:
+		{
+			struct vfio_irq_info info;
+
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_IRQ_INFO cmd\n", __FUNCTION__);
+			minsz = offsetofend(struct vfio_irq_info, count);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+				return -EINVAL;
+
+			switch (info.index) {
+				case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
+				case VFIO_PCI_REQ_IRQ_INDEX:
+					break;
+					/* any other index is rejected below */
+				default:
+					return -EINVAL;
+			}
+
+			info.count = VFIO_PCI_NUM_IRQS;
+
+			info.flags = VFIO_IRQ_INFO_EVENTFD;
+			info.count = vgpu_get_irq_count(vdev, info.index);
+
+			if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+				info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+						VFIO_IRQ_INFO_AUTOMASKED);
+			else
+				info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+			return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+		}
+
+		case VFIO_DEVICE_SET_IRQS:
+		{
+			struct vfio_irq_set hdr;
+			u8 *data = NULL;
+			int ret = 0;
+
+			minsz = offsetofend(struct vfio_irq_set, count);
+
+			if (copy_from_user(&hdr, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+					hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+						VFIO_IRQ_SET_ACTION_TYPE_MASK))
+				return -EINVAL;
+
+			if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+				size_t size;
+				int max = vgpu_get_irq_count(vdev, hdr.index);
+
+				if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+					size = sizeof(uint8_t);
+				else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+					size = sizeof(int32_t);
+				else
+					return -EINVAL;
+
+				if (hdr.argsz - minsz < hdr.count * size ||
+				    hdr.start >= max || hdr.start + hdr.count > max)
+					return -EINVAL;
+
+				data = memdup_user((void __user *)(arg + minsz),
+						hdr.count * size);
+				if (IS_ERR(data))
+					return PTR_ERR(data);
+
+			}
+			ret = vgpu_set_irqs_callback(vdev->vgpu_dev, hdr.flags, hdr.index,
+					hdr.start, hdr.count, data);
+			kfree(data);
+
+
+			return ret;
+		}
+
+		default:
+			return -EINVAL;
+	}
+	return ret;
+}
+
+
+ssize_t vgpu_dev_config_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	int cfg_size = sizeof(vgpu_dev->config_space);
+	int ret = 0;
+	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= cfg_size ||
+	    pos + count > cfg_size) {
+		printk(KERN_ERR "%s pos 0x%llx out of range\n", __FUNCTION__, pos);
+		ret = -EFAULT;
+		goto config_rw_exit;
+	}
+
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto config_rw_exit;
+		}
+
+		/* FIXME: Need to save the BAR value properly */
+		switch (pos) {
+		case PCI_BASE_ADDRESS_0:
+			vgpu_dev->bar[0].start = *((uint32_t *)user_data);
+			break;
+		case PCI_BASE_ADDRESS_1:
+			vgpu_dev->bar[1].start = *((uint32_t *)user_data);
+			break;
+		case PCI_BASE_ADDRESS_2:
+			vgpu_dev->bar[2].start = *((uint32_t *)user_data);
+			break;
+		}
+
+		if (vgpu_dev->gpu_dev->ops->write) {
+			ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+							    user_data,
+							    count,
+							    vgpu_emul_space_config,
+							    pos);
+		}
+
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (vgpu_dev->gpu_dev->ops->read) {
+			ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+							   ret_data,
+							   count,
+							   vgpu_emul_space_config,
+							   pos);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+				kfree(ret_data);
+				goto config_rw_exit;
+			}
+		}
+		kfree(ret_data);
+	}
+
+config_rw_exit:
+
+	return ret;
+}
+
+ssize_t vgpu_dev_bar_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t pos;
+	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	uint64_t end;
+	int ret = 0;
+
+	if (!vgpu_dev->bar[bar_index].start) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	end = resource_len(vgpu_dev, bar_index);
+
+	if (offset >= end || count > end - offset) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	pos = vgpu_dev->bar[bar_index].start + offset;
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto bar_rw_exit;
+		}
+
+		if (vgpu_dev->gpu_dev->ops->write) {
+			ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+							    user_data,
+							    count,
+							    vgpu_emul_space_mmio,
+							    pos);
+		}
+
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (vgpu_dev->gpu_dev->ops->read) {
+			ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+							   ret_data,
+							   count,
+							   vgpu_emul_space_mmio,
+							   pos);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+			}
+		}
+		kfree(ret_data);
+	}
+
+bar_rw_exit:
+	return ret;
+}
+
+
+static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_vgpu_device *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+			return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
+
+
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
+
+		case VFIO_PCI_ROM_REGION_INDEX:
+		case VFIO_PCI_VGA_REGION_INDEX:
+			break;
+	}
+
+	return -EINVAL;
+}
+
+
+static ssize_t vgpu_dev_read(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, buf, count, ppos, false);
+
+	return ret;
+}
+
+static ssize_t vgpu_dev_write(void *device_data, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, (char __user *)buf, count, ppos, true);
+
+	return ret;
+}
+
+/* Just create an invalid mapping without providing a fault handler */
+
+static int vgpu_dev_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+	return 0;
+}
+
+static const struct vfio_device_ops vgpu_vfio_dev_ops = {
+	.name		= "vfio-vgpu-grp",
+	.open		= vgpu_dev_open,
+	.release	= vgpu_dev_close,
+	.ioctl		= vgpu_dev_unlocked_ioctl,
+	.read		= vgpu_dev_read,
+	.write		= vgpu_dev_write,
+	.mmap		= vgpu_dev_mmap,
+};
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group *group)
+{
+	struct vfio_vgpu_device *vdev;
+	int ret = 0;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		return -ENOMEM;
+	}
+
+	vdev->group = group;
+	vdev->vgpu_dev = vgpu_dev;
+
+	ret = vfio_add_group_dev(vgpu_dev->dev, &vgpu_vfio_dev_ops, vdev);
+	if (ret)
+		kfree(vdev);
+
+	return ret;
+}
+
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev)
+{
+	struct vfio_vgpu_device *vdev;
+
+	vdev = vfio_del_group_dev(vgpu_dev->dev);
+	if (!vdev)
+		return -1;
+
+	kfree(vdev);
+	return 0;
+}
+
diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
new file mode 100644
index 000000000000..a2861c3f42e5
--- /dev/null
+++ b/include/linux/vgpu.h
@@ -0,0 +1,157 @@
+/*
+ * VGPU definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_H
+#define VGPU_H
+
+// Common Data structures
+
+struct pci_bar_info {
+	uint64_t start;
+	uint64_t end;
+	int flags;
+};
+
+enum vgpu_emul_space_e {
+	vgpu_emul_space_config = 0, /*!< PCI configuration space */
+	vgpu_emul_space_io = 1,     /*!< I/O register space */
+	vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
+};
+
+struct gpu_device;
+
+/*
+ * VGPU device
+ */
+struct vgpu_device {
+	struct kref		kref;
+	struct device		*dev;
+	int minor;
+	struct gpu_device	*gpu_dev;
+	struct iommu_group	*group;
+#define DEVICE_NAME_LEN		(64)
+	char			dev_name[DEVICE_NAME_LEN];
+	uuid_le			vm_uuid;
+	uint32_t		vgpu_instance;
+	uint32_t		vgpu_id;
+	atomic_t		usage_count;
+	char			config_space[0x100];          /* 256-byte PCI cfg space */
+	struct pci_bar_info	bar[VFIO_PCI_NUM_REGIONS];
+	struct device_attribute	*dev_attr_vgpu_status;
+	int			vgpu_device_status;
+
+	struct list_head	list;
+};
+
+
+/**
+ * struct gpu_device_ops - Structure to be registered for each physical GPU to
+ * register the device to vgpu module.
+ *
+ * @owner:			The module owner.
+ * @vgpu_supported_config:	Called to get information about supported vgpu types.
+ *				@dev : pci device structure of physical GPU.
+ *				@config: should return string listing supported config
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_create:		Called to allocate basic resources in graphics
+ *				driver for a particular vgpu.
+ *				@dev: physical pci device structure on which vgpu
+ *				      should be created
+ *				@vm_uuid: UUID of the VM the vgpu is intended for
+ *				@instance: vgpu instance in that VM
+ *				@vgpu_id: This represents the type of vgpu to be
+ *					  created
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_destroy:		Called to free resources in graphics driver for
+ *				a vgpu instance of that VM.
+ *				@dev: physical pci device structure that this
+ *				vgpu points to.
+ *				@vm_uuid: UUID of the VM the vgpu belongs to.
+ *				@instance: vgpu instance in that VM
+ *				Returns integer: success (0) or error (< 0)
+ *				If the VM is running when vgpu_destroy is called,
+ *				the vGPU is being hot-unplugged. Return an error
+ *				if the VM is running and the graphics driver does
+ *				not support vgpu hotplug.
+ * @vgpu_start:			Called to initiate the vGPU initialization
+ *				process in graphics driver when VM boots before
+ *				qemu starts.
+ *				@vm_uuid: UUID of the VM which is booting.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_shutdown:		Called to tear down vGPU related resources for
+ *				the VM
+ *				@vm_uuid: UUID of the VM which is shutting down.
+ *				Returns integer: success (0) or error (< 0)
+ * @read:			Read emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: read buffer
+ *				@count: number of bytes to read
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				Returns number of bytes read on success, or an error.
+ * @write:			Write emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: write buffer
+ *				@count: number of bytes to be written
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				Returns number of bytes written on success, or an error.
+ * @vgpu_set_irqs:		Called to pass on the interrupt configuration
+ *				that qemu set.
+ *				@vdev: vgpu device structure
+ *				@flags, index, start, count and *data : same as
+ *				that of struct vfio_irq_set of
+ *				VFIO_DEVICE_SET_IRQS API.
+ *
+ * A physical GPU that supports vGPU should be registered with the vgpu module
+ * using a gpu_device_ops structure.
+ */
+
+struct gpu_device_ops {
+	struct module   *owner;
+	int	(*vgpu_supported_config)(struct pci_dev *dev, char *config);
+	int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
+			       uint32_t instance, uint32_t vgpu_id);
+	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
+			        uint32_t instance);
+	int     (*vgpu_start)(uuid_le vm_uuid);
+	int     (*vgpu_shutdown)(uuid_le vm_uuid);
+	ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
+				 unsigned index, unsigned start, unsigned count,
+				 void *data);
+
+};
+
+/*
+ * Physical GPU
+ */
+struct gpu_device {
+	struct pci_dev                  *dev;
+	const struct gpu_device_ops     *ops;
+	struct list_head                gpu_next;
+};
+
+extern int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops);
+extern void vgpu_unregister_device(struct pci_dev *dev);
+
+extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr, uint32_t len, uint32_t flags);
+extern int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count);
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
+
+#endif /* VGPU_H */
+
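Note on the UUID parsing in vgpu_sysfs.c: the sketch below is a user-space illustration only, not part of the patch. It mirrors the loop in uuid_parse() (at least 36 characters, 16 hex byte pairs, an optional separator after each pair), with one difference: it never steps past the string terminator, so a short input made up mostly of separators is rejected instead of being read out of bounds. The names parse_uuid(), hex_val() and is_sep() are hypothetical helpers introduced for this example; the kernel code would keep using isxdigit() and hex_to_bin().

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Separators accepted between byte pairs; the terminator is NOT one of them. */
static int is_sep(char c)
{
	return c == '\n' || c == '-' || c == ':';
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	c = (char)tolower((unsigned char)c);
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/* Parse a textual UUID into 16 bytes. Returns 0 on success, -1 on bad input. */
static int parse_uuid(const char *str, unsigned char out[16])
{
	int i;

	assert(str != NULL && out != NULL);

	if (strlen(str) < 36)
		return -1;

	for (i = 0; i < 16; i++) {
		int hi = hex_val(str[0]);
		/* Only look at str[1] if str[0] was a digit (and so not the terminator). */
		int lo = hi < 0 ? -1 : hex_val(str[1]);

		if (hi < 0 || lo < 0)
			return -1;

		out[i] = (unsigned char)((hi << 4) | lo);
		str += 2;
		/* Skip at most one separator after each byte pair. */
		if (is_sep(*str))
			str++;
	}

	return 0;
}
```

A caller would pass the UUID portion of the sysfs string and a 16-byte buffer, and treat any non-zero return as -EINVAL.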