Message ID | 20160126102003.GA14400@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Tuesday, January 26, 2016 6:21 PM
>
> 0. High level overview
> ==================================================================================
>
>
> user space:
> +-----------+ VFIO IOMMU IOCTLs
> +---------| QEMU VFIO |-------------------------+
> VFIO IOCTLs | +-----------+ |
> | |
> ---------------------|-----------------------------------------------|---------
> | |
> kernel space: | +--->----------->---+ (callback) V
> | | v +------V-----+
> +----------+ +----V--^--+ +--+--+-----+ | VGPU |
> | | | | +----| nvidia.ko +----->-----> TYPE1 IOMMU|
> | VFIO Bus <===| VGPU.ko |<----| +-----------+ | +---++-------+
> | | | | | (register) ^ ||
> +----------+ +-------+--+ | +-----------+ | ||
> V +----| i915.ko +-----+ +---VV-------+
> | +-----^-----+ | TYPE1 |
> | (callback) | | IOMMU |
> +-->------------>---+ +------------+
> access flow:
>
> Guest MMIO / PCI config access
>        |
> -------------------------------------------------
>        |
>        +-----> KVM VM_EXITs (kernel)
>        |
> -------------------------------------------------
>        |
>        +-----> QEMU VFIO driver (user)
>        |
> -------------------------------------------------
>        |
>        +----> VGPU kernel driver (kernel)
>        |
>        |
>        +----> vendor driver callback
>

There is one difference between the nvidia and intel implementations: we have
the vgpu device model in the kernel, as part of i915.ko, so I/O emulation
requests are forwarded directly on the kernel side.

Thanks
Kevin
On Tue, Jan 26, 2016 at 07:24:52PM +0000, Tian, Kevin wrote: > > From: Neo Jia [mailto:cjia@nvidia.com] > > Sent: Tuesday, January 26, 2016 6:21 PM > > > > 0. High level overview > > ===================================================== > > ============================= > > > > > > user space: > > +-----------+ VFIO IOMMU IOCTLs > > +---------| QEMU VFIO |-------------------------+ > > VFIO IOCTLs | +-----------+ | > > | | > > ---------------------|-----------------------------------------------|--------- > > | | > > kernel space: | +--->----------->---+ (callback) V > > | | v +------V-----+ > > +----------+ +----V--^--+ +--+--+-----+ | VGPU | > > | | | | +----| nvidia.ko +----->-----> TYPE1 IOMMU| > > | VFIO Bus <===| VGPU.ko |<----| +-----------+ | +---++-------+ > > | | | | | (register) ^ || > > +----------+ +-------+--+ | +-----------+ | || > > V +----| i915.ko +-----+ +---VV-------+ > > | +-----^-----+ | TYPE1 | > > | (callback) | | IOMMU | > > +-->------------>---+ +------------+ > > access flow: > > > > Guest MMIO / PCI config access > > | > > ------------------------------------------------- > > | > > +-----> KVM VM_EXITs (kernel) > > | > > ------------------------------------------------- > > | > > +-----> QEMU VFIO driver (user) > > | > > ------------------------------------------------- > > | > > +----> VGPU kernel driver (kernel) > > | > > | > > +----> vendor driver callback > > > > > > There is one difference between nvidia and intel implementations. We have > vgpu device model in kernel, as part of i915.ko. So I/O emulation requests > are forwarded directly in kernel side. Hi Kevin, With the vendor driver callback, it will always forward to the kernel driver. If you are talking about the QEMU VFIO driver (user) part I put on the above diagram, that is how QEMU VFIO handles MMIO or pci config access today, which we don't change anything here in this design. Thanks, Neo > > Thanks > Kevin
On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
>
> Hi Alex, Kevin and Jike,
>
> (Seems I shouldn't use an attachment; resending it again to the list, patches are
> inline at the end)
>
> Thanks for adding me to this technical discussion, a great opportunity
> for us to design together, which can bring both the Intel and NVIDIA vGPU solutions
> to the KVM platform.
>
> Instead of directly jumping to the proposal that we have been working on
> recently for NVIDIA vGPU on KVM, I think it is better for me to put out a couple of
> quick comments / thoughts regarding the existing discussions on this thread, as
> fundamentally I think we are solving the same problems: DMA, interrupts and MMIO.
>
> Then we can look at what we have; hopefully we can reach some consensus soon.
>
> > Yes, and since you're creating and destroying the vgpu here, this is
> > where I'd expect a struct device to be created and added to an IOMMU
> > group.  The lifecycle management should really include links between
> > the vGPU and physical GPU, which would be much, much easier to do with
> > struct devices created here rather than at the point where we start
> > doing vfio "stuff".
>
> In fact, to keep vfio-vgpu more generic, vgpu device creation and management
> can be centralized and done in vfio-vgpu. That also includes adding to the IOMMU
> group and VFIO group.

Is this really a good idea?  The concept of a vgpu is not unique to
vfio, we want vfio to be a driver for a vgpu, not an integral part of
the lifecycle of a vgpu.  That certainly doesn't exclude adding
infrastructure to make lifecycle management of a vgpu more consistent
between drivers, but it should be done independently of vfio.
I'll go back to the SR-IOV model: vfio is often used with SR-IOV VFs, but vfio
does not create the VF; that's done in coordination with the PF, making
use of some PCI infrastructure for consistency between drivers.

It seems like we need to take more advantage of the class and driver
core support to perhaps set up a vgpu bus and class with vfio-vgpu just
being a driver for those devices.

> The graphics driver can register with vfio-vgpu to get management and emulation
> callbacks into the graphics driver.
>
> We already have struct vgpu_device in our proposal that keeps a pointer to the
> physical device.
>
> > - vfio_pci will inject an IRQ to guest only when physical IRQ
> > generated; whereas vfio_vgpu may inject an IRQ for emulation
> > purpose. Anyway they can share the same injection interface;
>
> The eventfd to inject the interrupt is known to vfio-vgpu; that fd should be
> available to the graphics driver so that the graphics driver can inject interrupts
> directly when the physical device triggers an interrupt.
>
> Here is the proposal we have, please review.
>
> Please note the patches we have put out here are mainly for POC purposes, to
> verify our understanding and also to reduce confusion and speed up
> our design, although we are very happy to refine this into something that can
> eventually be used by both parties and upstreamed.
>
> Linux vGPU kernel design
> ==================================================================================
>
> Here we are proposing a generic Linux kernel module based on the VFIO framework
> which allows different GPU vendors to plug in and provide their GPU virtualization
> solution on KVM; the benefits of having such a generic kernel module are:
>
> 1) Reuse of the QEMU VFIO driver, supporting the VFIO UAPI
>
> 2) A GPU HW agnostic management API for upper layer software such as libvirt
>
> 3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendors
>
> 0. High level overview
> ==================================================================================
>
>
> user space:
> +-----------+ VFIO IOMMU IOCTLs
> +---------| QEMU VFIO |-------------------------+
> VFIO IOCTLs | +-----------+ |
> | |
> ---------------------|-----------------------------------------------|---------
> | |
> kernel space: | +--->----------->---+ (callback) V
> | | v +------V-----+
> +----------+ +----V--^--+ +--+--+-----+ | VGPU |
> | | | | +----| nvidia.ko +----->-----> TYPE1 IOMMU|
> | VFIO Bus <===| VGPU.ko |<----| +-----------+ | +---++-------+
> | | | | | (register) ^ ||
> +----------+ +-------+--+ | +-----------+ | ||
> V +----| i915.ko +-----+ +---VV-------+
> | +-----^-----+ | TYPE1 |
> | (callback) | | IOMMU |
> +-->------------>---+ +------------+
> access flow:
>
> Guest MMIO / PCI config access
>        |
> -------------------------------------------------
>        |
>        +-----> KVM VM_EXITs (kernel)
>        |
> -------------------------------------------------
>        |
>        +-----> QEMU VFIO driver (user)
>        |
> -------------------------------------------------
>        |
>        +----> VGPU kernel driver (kernel)
>        |
>        |
>        +----> vendor driver callback
>
>
> 1. VGPU management interface
> ==================================================================================
>
> This is the interface that allows upper layer software (mostly libvirt) to query and
> configure the virtual GPU device in a HW agnostic fashion. Also, this management
> interface provides flexibility to the underlying GPU vendor to support virtual
> device hotplug, multiple virtual devices per VM, multiple virtual devices from
> different physical devices, etc.
>
> 1.1 Under per-physical device sysfs:
> ----------------------------------------------------------------------------------
>
> vgpu_supported_types - RO, lists the currently supported virtual GPU types and their
> VGPU_IDs. VGPU_ID - a vGPU type identifier returned from reads of
> "vgpu_supported_types".
>
> vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> gpu device on a target physical GPU. idx: virtual device index inside a VM
>
> vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> target physical GPU

I've noted in previous discussions that we need to separate user policy
from kernel policy here; the kernel policy should not require a "VM
UUID".  A UUID simply represents a set of one or more devices and an
index picks the device within the set.  Whether that UUID matches a VM
or is independently used is up to the user policy when creating the
device.

Personally I'd also prefer to get rid of the concept of indexes within a
UUID set of devices and instead have each device be independent.  This
seems to be an imposition of the nvidia implementation on the kernel
interface design.

> 1.3 Under vgpu class sysfs:
> ----------------------------------------------------------------------------------
>
> vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
> interface to notify the GPU vendor driver to commit virtual GPU resources for
> this target VM.
>
> Also, the vgpu_start function is a synchronous call; its successful return
> indicates that all the requested vGPU resources have been fully
> committed and the VMM may continue.
>
> vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration
> interface to notify the GPU vendor driver to release the virtual GPU resources of
> this target VM.
>
> 1.4 Virtual device Hotplug
> ----------------------------------------------------------------------------------
>
> To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
> accessed during VM runtime, and the corresponding registration callback will be
> invoked to allow the GPU vendor to support hotplug.
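For illustration, the <VM_UUID:idx:VGPU_ID> create syntax above can be modeled with a small userspace parser; the helper name and error convention below are invented for this sketch, not part of the proposal:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical tokenizer for the vgpu_create sysfs input
 * "<VM_UUID:idx:VGPU_ID>".  In the kernel this would live in the sysfs
 * store handler; here it is plain userspace so the rules are easy to see. */
static int parse_vgpu_create(const char *input, char *uuid, size_t uuid_len,
                             uint32_t *idx, uint32_t *vgpu_id)
{
    const char *first = strchr(input, ':');
    const char *last  = strrchr(input, ':');

    /* Need two distinct ':' separators: UUID, then idx, then VGPU_ID.
     * The UUID itself only contains '-', so the first ':' ends it. */
    if (!first || first == last)
        return -1;
    if ((size_t)(first - input) >= uuid_len)
        return -1;

    memcpy(uuid, input, first - input);
    uuid[first - input] = '\0';

    if (sscanf(first + 1, "%u:%u", idx, vgpu_id) != 2)
        return -1;
    return 0;
}
```

With the example from section 6, "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" splits into the UUID, device index 0, and VGPU_ID 17.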
>
> To support hotplug, the vendor driver would take the necessary action to handle the
> situation when a vgpu_create is done on a VM_UUID after vgpu_start, and that
> implies both create and start for that vgpu device.
>
> Similarly, vgpu_destroy implies a vgpu_shutdown on a running VM, but only if the
> vendor driver supports vgpu hotplug.
>
> If hotplug is not supported and the VM is still running, the vendor driver can return
> an error code to indicate it is not supported.
>
> Separating create from start gives the flexibility to have:
>
> - multiple vgpu instances for a single VM, and
> - the hotplug feature.
>
> 2. GPU driver vendor registration interface
> ==================================================================================
>
> 2.1 Registration interface definition (include/linux/vgpu.h)
> ----------------------------------------------------------------------------------
>
> extern int vgpu_register_device(struct pci_dev *dev,
>                                 const struct gpu_device_ops *ops);
>
> extern void vgpu_unregister_device(struct pci_dev *dev);
>
> /**
>  * struct gpu_device_ops - Structure to be registered for each physical GPU to
>  * register the device to the vgpu module.
>  *
>  * @owner:                 The module owner.
>  * @vgpu_supported_config: Called to get information about supported vgpu types.
>  *                         @dev: pci device structure of physical GPU.
>  *                         @config: should return string listing supported config
>  *                         Returns integer: success (0) or error (< 0)
>  * @vgpu_create:           Called to allocate basic resources in graphics
>  *                         driver for a particular vgpu.
>  *                         @dev: physical pci device structure on which vgpu
>  *                               should be created
>  *                         @vm_uuid: uuid of the VM for which it is intended
>  *                         @instance: vgpu instance in that VM
>  *                         @vgpu_id: the type of vgpu to be created
>  *                         Returns integer: success (0) or error (< 0)
>  * @vgpu_destroy:          Called to free resources in graphics driver for
>  *                         a vgpu instance of that VM.
>  *                         @dev: physical pci device structure to which
>  *                               this vgpu points.
>  *                         @vm_uuid: uuid of the VM to which the vgpu belongs
>  *                         @instance: vgpu instance in that VM
>  *                         Returns integer: success (0) or error (< 0)
>  *                         If the VM is running and vgpu_destroy is called, that
>  *                         means the vGPU is being hot-unplugged. Return an error
>  *                         if the VM is running and the graphics driver doesn't
>  *                         support vgpu hotplug.
>  * @vgpu_start:            Called to initiate the vGPU initialization
>  *                         process in the graphics driver when the VM boots,
>  *                         before qemu starts.
>  *                         @vm_uuid: UUID of the VM which is booting.
>  *                         Returns integer: success (0) or error (< 0)
>  * @vgpu_shutdown:         Called to tear down vGPU-related resources for the VM.
>  *                         @vm_uuid: UUID of the VM which is shutting down.
>  *                         Returns integer: success (0) or error (< 0)
>  * @read:                  Read emulation callback.
>  *                         @vdev: vgpu device structure
>  *                         @buf: read buffer
>  *                         @count: number of bytes to read
>  *                         @address_space: specifies which address space
>  *                         the request is for: pci_config_space, IO register
>  *                         space or MMIO space.
>  *                         Returns number of bytes read on success, or error.
>  * @write:                 Write emulation callback.
>  *                         @vdev: vgpu device structure
>  *                         @buf: write buffer
>  *                         @count: number of bytes to be written
>  *                         @address_space: specifies which address space
>  *                         the request is for: pci_config_space, IO register
>  *                         space or MMIO space.
>  *                         Returns number of bytes written on success, or error.
>  * @vgpu_set_irqs:         Called to convey the interrupt configuration
>  *                         information that qemu set.
>  *                         @vdev: vgpu device structure
>  *                         @flags, index, start, count and *data: same as
>  *                         those of struct vfio_irq_set of the
>  *                         VFIO_DEVICE_SET_IRQS API.
>  *
>  * A physical GPU that supports vGPU should be registered with the vgpu module
>  * with a gpu_device_ops structure.
>  */
>
> struct gpu_device_ops {
>         struct module *owner;
>         int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
>         int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
>                                uint32_t instance, uint32_t vgpu_id);
>         int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
>                                 uint32_t instance);
>         int     (*vgpu_start)(uuid_le vm_uuid);
>         int     (*vgpu_shutdown)(uuid_le vm_uuid);
>         ssize_t (*read)(struct vgpu_device *vdev, char *buf, size_t count,
>                         uint32_t address_space, loff_t pos);
>         ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
>                          uint32_t address_space, loff_t pos);
>         int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
>                                  unsigned index, unsigned start, unsigned count,
>                                  void *data);
> };

I wonder if it shouldn't be vfio-vgpu sub-drivers (i.e., Intel and
Nvidia) that register these ops with the main vfio-vgpu driver, and they
should also include a probe() function which allows us to associate a
given vgpu device with a set of vendor ops.

> 2.2 Details for callbacks we haven't mentioned above.
> ---------------------------------------------------------------------------------
>
> vgpu_supported_config: allows the vendor driver to specify the supported vGPU
>                        type/configuration
>
> vgpu_create :          create a virtual GPU device; can be used for device hotplug.
>
> vgpu_destroy :         destroy a virtual GPU device; can be used for device hotplug.
>
> vgpu_start :           callback function to notify the vendor driver that a vgpu
>                        device has come to life for a given virtual machine.
>
> vgpu_shutdown :        callback function to notify the vendor driver that a vgpu
>                        device is being shut down.
>
> read :                 callback to the vendor driver to handle virtual device config
>                        space or MMIO read access
>
> write :                callback to the vendor driver to handle virtual device config
>                        space or MMIO write access
>
> vgpu_set_irqs :        callback to the vendor driver to pass along the interrupt
>                        information for the target virtual device; the vendor
>                        driver can then inject interrupts into the virtual machine
>                        for this device.
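To make the registration flow concrete, here is a compilable userspace mock of how a vendor module would hand its ops to the core and how the core would route a create request. struct pci_dev, uuid_le, the one-slot registry, and the fake_* / vgpu_core_create names are all stand-ins invented for this sketch; only the ops shape (trimmed to two callbacks) follows the proposal:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for kernel types so the sketch is self-contained. */
struct pci_dev { const char *name; };
typedef struct { unsigned char b[16]; } uuid_le;

struct gpu_device_ops {
    int (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                       uint32_t instance, uint32_t vgpu_id);
    int (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                        uint32_t instance);
};

/* One-slot registry: vgpu_register_device() remembers which ops belong to
 * which physical GPU so the core can route create/destroy to the vendor. */
static struct {
    struct pci_dev *dev;
    const struct gpu_device_ops *ops;
} registry;

int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)
{
    if (!dev || !ops || !ops->vgpu_create)
        return -1;
    registry.dev = dev;
    registry.ops = ops;
    return 0;
}

/* A fake vendor implementation standing in for nvidia.ko / i915.ko. */
static int created;
static int fake_create(struct pci_dev *dev, uuid_le vm_uuid,
                       uint32_t instance, uint32_t vgpu_id)
{
    (void)dev; (void)vm_uuid; (void)instance; (void)vgpu_id;
    created++;
    return 0;
}
static const struct gpu_device_ops fake_ops = { .vgpu_create = fake_create };

/* What the core might do when userspace writes to vgpu_create in sysfs:
 * look up the registered ops for the GPU and forward the request. */
int vgpu_core_create(uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id)
{
    if (!registry.ops)
        return -1;
    return registry.ops->vgpu_create(registry.dev, vm_uuid, instance, vgpu_id);
}
```

The point of the shape is that the core owns the sysfs entry points while the vendor module only supplies callbacks, which is what keeps the VFIO-facing logic unduplicated across vendors.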
>
> 2.3 Potential additional virtual device configuration registration interface:
> ---------------------------------------------------------------------------------
>
> callback function to describe the MMAP behavior of the virtual GPU
>
> callback function to allow the GPU vendor driver to provide PCI config space
> backing memory.
>
> 3. VGPU TYPE1 IOMMU
> ==================================================================================
>
> Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track of the
> <iova, hva, size, flag> tuples and save the QEMU mm for later reference.
>
> You can find the quick/ugly implementation in the attached patch file, which is
> actually just a simplified version of Alex's type1 IOMMU without the actual
> mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called.
>
> We have thought about providing another vendor driver registration interface so
> such tracking information would be sent to the vendor driver, and it would use the
> QEMU mm to do the get_user_pages / remap_pfn_range when required. After doing a
> quick implementation within our driver, I noticed the following issues:
>
> 1) OS/VFIO logic moves into the vendor driver, which will be a maintenance issue.
>
> 2) Every driver vendor has to implement their own RB tree, instead of reusing
> the common existing VFIO code (vfio_find/link/unlink_dma)
>
> 3) IOMMU_UNMAP_DMA is expected to return "unmapped bytes" back to the caller/QEMU;
> better not to have anything inside a vendor driver that the VFIO caller immediately
> depends on.
>
> Based on the above considerations, we decided to implement the DMA tracking logic
> within the VGPU TYPE1 IOMMU code (ideally, this should be merged into the current
> TYPE1 IOMMU code) and expose two symbols to the outside for MMIO mapping and page
> translation and pinning.
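The tracking described above can be modeled in a few lines of userspace C. This sketch uses a fixed array where the real code would reuse the vfio_find/link/unlink_dma RB tree, and the function names are illustrative only, not the proposed symbols:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal model of the <iova, hva, size> tracking a VGPU TYPE1 IOMMU would
 * keep on IOMMU_MAP_DMA.  A fixed array stands in for the RB tree. */
struct vgpu_dma {
    uint64_t iova;   /* guest-physical start */
    uint64_t vaddr;  /* QEMU process virtual start (hva) */
    uint64_t size;
};

#define MAX_DMA 16
static struct vgpu_dma dmas[MAX_DMA];
static int ndmas;

int vgpu_dma_map(uint64_t iova, uint64_t vaddr, uint64_t size)
{
    if (ndmas == MAX_DMA)
        return -1;
    dmas[ndmas++] = (struct vgpu_dma){ iova, vaddr, size };
    return 0;
}

/* Translate one guest address to the QEMU hva backing it, in the spirit of
 * the proposed translation symbol; returns 0 and fills *hva on success. */
int vgpu_translate(uint64_t iova, uint64_t *hva)
{
    for (int i = 0; i < ndmas; i++) {
        if (iova >= dmas[i].iova && iova < dmas[i].iova + dmas[i].size) {
            *hva = dmas[i].vaddr + (iova - dmas[i].iova);
            return 0;
        }
    }
    return -1; /* not mapped: real code would fail the DMA */
}

/* IOMMU_UNMAP_DMA must report how many bytes were actually unmapped,
 * which is why the tracking has to live where the VFIO caller can see it. */
uint64_t vgpu_dma_unmap(uint64_t iova)
{
    for (int i = 0; i < ndmas; i++) {
        if (dmas[i].iova == iova) {
            uint64_t sz = dmas[i].size;
            dmas[i] = dmas[--ndmas];
            return sz;
        }
    }
    return 0;
}
```

Point 3) above falls out directly: because unmap must return the unmapped byte count synchronously, the range bookkeeping belongs in the common IOMMU code rather than behind a vendor callback.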
>
> Also, with an mmap MMIO interface between virtual and physical, this allows a
> para-virtualized guest driver to access its virtual MMIO without taking an mmap
> fault, and also lets us support different MMIO sizes between the virtual and
> physical device.
>
> int vgpu_map_virtual_bar
> (
>         uint64_t virt_bar_addr,
>         uint64_t phys_bar_addr,
>         uint32_t len,
>         uint32_t flags
> )
>
> EXPORT_SYMBOL(vgpu_map_virtual_bar);

Per the implementation provided, this needs to be implemented in the
vfio device driver, not in the iommu interface.  Finding the DMA mapping
of the device and replacing it is wrong.  It should be remapped at the
vfio device file interface using vm_ops.

> int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
>
> EXPORT_SYMBOL(vgpu_dma_do_translate);
>
> There is still a lot to be added and modified, such as supporting multiple VMs and
> multiple virtual devices, tracking the mapped / pinned regions within the VGPU
> IOMMU kernel driver, error handling, roll-back, locked memory size per user, etc.

Particularly, handling of mapping changes is completely missing.  This
cannot be a point-in-time translation; the user is free to remap
addresses whenever they wish, and device translations need to be updated
accordingly.

> 4. Modules
> ==================================================================================
>
> Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
>
> vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU
>                            TYPE1 v1 and v2 interface.

Depending on how intrusive it is, this can possibly be done within the
existing type1 driver.  Either that or we can split out common code for
use by a separate module.

> vgpu.ko - provides the registration interface and virtual device
>           VFIO access.
>
> 5. QEMU note
> ==================================================================================
>
> To allow us to focus on the VGPU kernel driver prototyping, we have introduced a
> new VFIO class - vgpu - inside QEMU, so we don't have to change the existing
> vfio/pci.c file, and we use it as a reference for our implementation. It is
> basically just a quick copy & paste from vfio/pci.c to quickly meet our needs.
>
> Once this proposal is finalized, we will move to vfio/pci.c instead of a new
> class, and probably the only thing required is to have a new way to discover the
> device.
>
> 6. Examples
> ==================================================================================
>
> On this server, we have two NVIDIA M60 GPUs.
>
> [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
>
> After nvidia.ko gets initialized, we can query the supported vGPU types by
> accessing "vgpu_supported_types" as follows:
>
> [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> 11:GRID M60-0B
> 12:GRID M60-0Q
> 13:GRID M60-1B
> 14:GRID M60-1Q
> 15:GRID M60-2B
> 16:GRID M60-2Q
> 17:GRID M60-4Q
> 18:GRID M60-8Q
>
> For example, if the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818 and we would
> like to create a "GRID M60-4Q" VM on it:
>
> echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
>
> Note: the number 0 here is the vGPU device index. So far the change has not been
> tested with multiple vgpu devices yet, but we will support that.
>
> At this moment, if you query "vgpu_supported_types" it will still show all
> supported virtual GPU types, as no virtual GPU resource has been committed yet.
>
> Starting the VM:
>
> echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
>
> then, the supported vGPU type query will return:
>
> [root@cjia-vgx-kvm /home/cjia]$
> cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> 17:GRID M60-4Q
>
> So vgpu_supported_config needs to be called whenever a new virtual device gets
> created, as the underlying HW might limit the supported types if there are
> any existing VMs running.
>
> Then, when the VM gets shut down, writes to /sys/class/vgpu/vgpu_shutdown will
> inform the GPU vendor driver to clean up resources.
>
> Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
> the device sysfs.

I'd like to hear Intel's thoughts on this interface.  Are there
different vgpu capacities or priority classes that would necessitate
different types of vgpus on Intel?  I think there are some gaps in
translating from named vgpu types to indexes here, along with my
previous mention of the UUID/set oddity.  Does Intel have a need for
start and shutdown interfaces?

Neo, wasn't there at some point information about how many of each type
could be supported through these interfaces?  How does a user know their
capacity limits?

Thanks,
Alex
> From: Alex Williamson [mailto:alex.williamson@redhat.com] > Sent: Wednesday, January 27, 2016 4:06 AM > > On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote: > > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote: > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com] > > > > Hi Alex, Kevin and Jike, > > > > (Seems I shouldn't use attachment, resend it again to the list, patches are > > inline at the end) > > > > Thanks for adding me to this technical discussion, a great opportunity > > for us to design together which can bring both Intel and NVIDIA vGPU solution to > > KVM platform. > > > > Instead of directly jumping to the proposal that we have been working on > > recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple > > quick comments / thoughts regarding the existing discussions on this thread as > > fundamentally I think we are solving the same problem, DMA, interrupt and MMIO. > > > > Then we can look at what we have, hopefully we can reach some consensus soon. > > > > > Yes, and since you're creating and destroying the vgpu here, this is > > > where I'd expect a struct device to be created and added to an IOMMU > > > group. The lifecycle management should really include links between > > > the vGPU and physical GPU, which would be much, much easier to do with > > > struct devices create here rather than at the point where we start > > > doing vfio "stuff". > > > > Infact to keep vfio-vgpu to be more generic, vgpu device creation and management > > can be centralized and done in vfio-vgpu. That also include adding to IOMMU > > group and VFIO group. > > Is this really a good idea? The concept of a vgpu is not unique to > vfio, we want vfio to be a driver for a vgpu, not an integral part of > the lifecycle of a vgpu. That certainly doesn't exclude adding > infrastructure to make lifecycle management of a vgpu more consistent > between drivers, but it should be done independently of vfio. 
I'll go > back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio > does not create the VF, that's done in coordination with the PF making > use of some PCI infrastructure for consistency between drivers. > > It seems like we need to take more advantage of the class and driver > core support to perhaps setup a vgpu bus and class with vfio-vgpu just > being a driver for those devices. Agree with Alex here. Even if we want to do more abstraction of overall vgpu management, here let's stick to necessary changes within VFIO scope. > > > > 6. Examples > > > ===================================================== > ============================= > > > > On this server, we have two NVIDIA M60 GPUs. > > > > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2 > > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1) > > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1) > > > > After nvidia.ko gets initialized, we can query the supported vGPU type by > > accessing the "vgpu_supported_types" like following: > > > > [root@cjia-vgx-kvm ~]# cat > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types > > 11:GRID M60-0B > > 12:GRID M60-0Q > > 13:GRID M60-1B > > 14:GRID M60-1Q > > 15:GRID M60-2B > > 16:GRID M60-2Q > > 17:GRID M60-4Q > > 18:GRID M60-8Q > > > > For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would > > like to create "GRID M60-4Q" VM on it. > > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create > > > > Note: the number 0 here is for vGPU device index. So far the change is not tested > > for multiple vgpu devices yet, but we will support it. > > > > At this moment, if you query the "vgpu_supported_types" it will still show all > > supported virtual GPU types as no virtual GPU resource is committed yet. 
> > > > Starting VM: > > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start > > > > then, the supported vGPU type query will return: > > > > [root@cjia-vgx-kvm /home/cjia]$ > > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types > > 17:GRID M60-4Q > > > > So vgpu_supported_config needs to be called whenever a new virtual device gets > > created as the underlying HW might limit the supported types if there are > > any existing VM runnings. > > > > Then, VM gets shutdown, writes to /sys/class/vgpu/vgpu_shutdown will info the > > GPU driver vendor to clean up resource. > > > > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under > > device sysfs. > > > I'd like to hear Intel's thoughts on this interface. Are there > different vgpu capacities or priority classes that would necessitate > different types of vcpus on Intel? We'll evaluate this proposal with our requirement. A quick comment is that we don't have such type thing required. We just expose the same type of vgpu as the underlying platform. On the other hand, our implementation gives flexibility to user to control resource allocation (e.g. video memory) to different VMs, instead of a fixed partition scheme, so we have an interface to query remaining free resources. > > Does Intel have a need for start and shutdown interfaces? No for now. But we can extend to support such interface which provides more flexibility to separate resource allocation from run-time control. Given that nvidia/intel do have specific requirement on vgpu management, I'd suggest that we focus on VFIO change first. After that we can evaluate how much commonality of vgpu management upon which to evaluate whether to have a common vgpu framework or just stay with vendor specific implementation for that part. Thanks, Kevin
On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote: > On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote: > > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote: > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com] > > > > Hi Alex, Kevin and Jike, > > > > (Seems I shouldn't use attachment, resend it again to the list, patches are > > inline at the end) > > > > Thanks for adding me to this technical discussion, a great opportunity > > for us to design together which can bring both Intel and NVIDIA vGPU solution to > > KVM platform. > > > > Instead of directly jumping to the proposal that we have been working on > > recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple > > quick comments / thoughts regarding the existing discussions on this thread as > > fundamentally I think we are solving the same problem, DMA, interrupt and MMIO. > > > > Then we can look at what we have, hopefully we can reach some consensus soon. > > > > > Yes, and since you're creating and destroying the vgpu here, this is > > > where I'd expect a struct device to be created and added to an IOMMU > > > group. The lifecycle management should really include links between > > > the vGPU and physical GPU, which would be much, much easier to do with > > > struct devices create here rather than at the point where we start > > > doing vfio "stuff". > > > > Infact to keep vfio-vgpu to be more generic, vgpu device creation and management > > can be centralized and done in vfio-vgpu. That also include adding to IOMMU > > group and VFIO group. > > Is this really a good idea? The concept of a vgpu is not unique to > vfio, we want vfio to be a driver for a vgpu, not an integral part of > the lifecycle of a vgpu. That certainly doesn't exclude adding > infrastructure to make lifecycle management of a vgpu more consistent > between drivers, but it should be done independently of vfio. 
I'll go > back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio > does not create the VF, that's done in coordination with the PF making > use of some PCI infrastructure for consistency between drivers. > > It seems like we need to take more advantage of the class and driver > core support to perhaps setup a vgpu bus and class with vfio-vgpu just > being a driver for those devices. > > > Graphics driver can register with vfio-vgpu to get management and emulation call > > backs to graphics driver. > > > > We already have struct vgpu_device in our proposal that keeps pointer to > > physical device. > > > > > - vfio_pci will inject an IRQ to guest only when physical IRQ > > > generated; whereas vfio_vgpu may inject an IRQ for emulation > > > purpose. Anyway they can share the same injection interface; > > > > eventfd to inject the interrupt is known to vfio-vgpu, that fd should be > > available to graphics driver so that graphics driver can inject interrupts > > directly when physical device triggers interrupt. > > > > Here is the proposal we have, please review. > > > > Please note the patches we have put out here is mainly for POC purpose to > > verify our understanding also can serve the purpose to reduce confusions and speed up > > our design, although we are very happy to refine that to something eventually > > can be used for both parties and upstreamed. 
> > > > Linux vGPU kernel design > > ================================================================================== > > > > Here we are proposing a generic Linux kernel module based on VFIO framework > > which allows different GPU vendors to plugin and provide their GPU virtualization > > solution on KVM, the benefits of having such generic kernel module are: > > > > 1) Reuse QEMU VFIO driver, supporting VFIO UAPI > > > > 2) GPU HW agnostic management API for upper layer software such as libvirt > > > > 3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendor > > > > 0. High level overview > > ================================================================================== > > > > > > user space: > > +-----------+ VFIO IOMMU IOCTLs > > +---------| QEMU VFIO |-------------------------+ > > VFIO IOCTLs | +-----------+ | > > | | > > ---------------------|-----------------------------------------------|--------- > > | | > > kernel space: | +--->----------->---+ (callback) V > > | | v +------V-----+ > > +----------+ +----V--^--+ +--+--+-----+ | VGPU | > > | | | | +----| nvidia.ko +----->-----> TYPE1 IOMMU| > > | VFIO Bus <===| VGPU.ko |<----| +-----------+ | +---++-------+ > > | | | | | (register) ^ || > > +----------+ +-------+--+ | +-----------+ | || > > V +----| i915.ko +-----+ +---VV-------+ > > | +-----^-----+ | TYPE1 | > > | (callback) | | IOMMU | > > +-->------------>---+ +------------+ > > access flow: > > > > Guest MMIO / PCI config access > > | > > ------------------------------------------------- > > | > > +-----> KVM VM_EXITs (kernel) > > | > > ------------------------------------------------- > > | > > +-----> QEMU VFIO driver (user) > > | > > ------------------------------------------------- > > | > > +----> VGPU kernel driver (kernel) > > | > > | > > +----> vendor driver callback > > > > > > 1. 
VGPU management interface > > ================================================================================== > > > > This is the interface allows upper layer software (mostly libvirt) to query and > > configure virtual GPU device in a HW agnostic fashion. Also, this management > > interface has provided flexibility to underlying GPU vendor to support virtual > > device hotplug, multiple virtual devices per VM, multiple virtual devices from > > different physical devices, etc. > > > > 1.1 Under per-physical device sysfs: > > ---------------------------------------------------------------------------------- > > > > vgpu_supported_types - RO, list the current supported virtual GPU types and its > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of > > "vgpu_supported_types". > > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual > > gpu device on a target physical GPU. idx: virtual device index inside a VM > > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a > > target physical GPU > > > I've noted in previous discussions that we need to separate user policy > from kernel policy here, the kernel policy should not require a "VM > UUID". A UUID simply represents a set of one or more devices and an > index picks the device within the set. Whether that UUID matches a VM > or is independently used is up to the user policy when creating the > device. > > Personally I'd also prefer to get rid of the concept of indexes within a > UUID set of devices and instead have each device be independent. This > seems to be an imposition on the nvidia implementation into the kernel > interface design. > Hi Alex, I agree with you that we should not put UUID concept into a kernel API. At this point (without any prototyping), I am thinking of using a list of virtual devices instead of UUID. 
> > > 1.3 Under vgpu class sysfs: > > ---------------------------------------------------------------------------------- > > > > vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration > > interface to notify the GPU vendor driver to commit virtual GPU resources for > > this target VM. > > > > Also, the vgpu_start function is a synchronous call; its successful return > > indicates that all the requested vGPU resources have been fully > > committed and the VMM should continue. > > > > vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration > > interface to notify the GPU vendor driver to release the virtual GPU resources of > > this target VM. > > > > 1.4 Virtual device hotplug > > ---------------------------------------------------------------------------------- > > > > To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be > > accessed during VM runtime, and the corresponding registration callback will be > > invoked to allow the GPU vendor to support hotplug. > > > > To support hotplug, the vendor driver would take the necessary action to handle the > > situation when a vgpu_create is done on a VM_UUID after vgpu_start, which > > implies both create and start for that vgpu device. > > > > Similarly, vgpu_destroy implies a vgpu_shutdown on a running VM, but only if the vendor driver > > supports vgpu hotplug. > > > > If hotplug is not supported and the VM is still running, the vendor driver can return an > > error code to indicate it is not supported. > > > > Separating create from start gives the flexibility to have: > > > > - multiple vgpu instances for a single VM and > > - the hotplug feature. > > > > 2.
GPU driver vendor registration interface > > ================================================================================== > > > > 2.1 Registration interface definition (include/linux/vgpu.h) > > ---------------------------------------------------------------------------------- > > > > extern int vgpu_register_device(struct pci_dev *dev, > > const struct gpu_device_ops *ops); > > > > extern void vgpu_unregister_device(struct pci_dev *dev); > > > > /** > > * struct gpu_device_ops - Structure to be registered for each physical GPU to > > * register the device to vgpu module. > > * > > * @owner: The module owner. > > * @vgpu_supported_config: Called to get information about supported vgpu > > * types. > > * @dev : pci device structure of physical GPU. > > * @config: should return string listing supported > > * config > > * Returns integer: success (0) or error (< 0) > > * @vgpu_create: Called to allocate basic resouces in graphics > > * driver for a particular vgpu. > > * @dev: physical pci device structure on which > > * vgpu > > * should be created > > * @vm_uuid: VM's uuid for which VM it is intended > > * to > > * @instance: vgpu instance in that VM > > * @vgpu_id: This represents the type of vgpu to be > > * created > > * Returns integer: success (0) or error (< 0) > > * @vgpu_destroy: Called to free resources in graphics driver for > > * a vgpu instance of that VM. > > * @dev: physical pci device structure to which > > * this vgpu points to. > > * @vm_uuid: VM's uuid for which the vgpu belongs > > * to. > > * @instance: vgpu instance in that VM > > * Returns integer: success (0) or error (< 0) > > * If VM is running and vgpu_destroy is called that > > * means the vGPU is being hotunpluged. Return > > * error > > * if VM is running and graphics driver doesn't > > * support vgpu hotplug. > > * @vgpu_start: Called to do initiate vGPU initialization > > * process in graphics driver when VM boots before > > * qemu starts. 
> > * @vm_uuid: VM's UUID which is booting. > > * Returns integer: success (0) or error (< 0) > > * @vgpu_shutdown: Called to tear down vGPU-related resources for > > * the VM > > * @vm_uuid: VM's UUID which is shutting down. > > * Returns integer: success (0) or error (< 0) > > * @read: Read emulation callback > > * @vdev: vgpu device structure > > * @buf: read buffer > > * @count: number of bytes to read > > * @address_space: specifies for which address > > * space > > * the request is: pci_config_space, IO register > > * space or MMIO space. > > * Returns number of bytes read on success or error. > > * @write: Write emulation callback > > * @vdev: vgpu device structure > > * @buf: write buffer > > * @count: number of bytes to be written > > * @address_space: specifies for which address > > * space > > * the request is: pci_config_space, IO register > > * space or MMIO space. > > * Returns number of bytes written on success or > > * error. > > * @vgpu_set_irqs: Called to convey the interrupt configuration > > * information that QEMU set. > > * @vdev: vgpu device structure > > * @flags, index, start, count and *data : same as > > * that of struct vfio_irq_set of the > > * VFIO_DEVICE_SET_IRQS API. > > * > > * A physical GPU that supports vGPU should be registered with the vgpu module > > * with a gpu_device_ops structure.
> > */ > > > > struct gpu_device_ops { > > struct module *owner; > > int (*vgpu_supported_config)(struct pci_dev *dev, char *config); > > int (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid, > > uint32_t instance, uint32_t vgpu_id); > > int (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid, > > uint32_t instance); > > int (*vgpu_start)(uuid_le vm_uuid); > > int (*vgpu_shutdown)(uuid_le vm_uuid); > > ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count, > > uint32_t address_space, loff_t pos); > > ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count, > > uint32_t address_space,loff_t pos); > > int (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags, > > unsigned index, unsigned start, unsigned count, > > void *data); > > > > }; > > > I wonder if it shouldn't be vfio-vgpu sub-drivers (ie, Intel and Nvidia) > that register these ops with the main vfio-vgpu driver and they should > also include a probe() function which allows us to associate a given > vgpu device with a set of vendor ops. > > > > > > 2.2 Details for callbacks we haven't mentioned above. > > --------------------------------------------------------------------------------- > > > > vgpu_supported_config: allows the vendor driver to specify the supported vGPU > > type/configuration > > > > vgpu_create : create a virtual GPU device, can be used for device hotplug. > > > > vgpu_destroy : destroy a virtual GPU device, can be used for device hotplug. > > > > vgpu_start : callback function to notify vendor driver vgpu device > > come to live for a given virtual machine. 
> > > > vgpu_shutdown : callback function to notify vendor driver > > > > read : callback to vendor driver to handle virtual device config > > space or MMIO read access > > > > write : callback to vendor driver to handle virtual device config > > space or MMIO write access > > > > vgpu_set_irqs : callback to vendor driver to pass along the interrupt > > information for the target virtual device, then vendor > > driver can inject interrupt into virtual machine for this > > device. > > > > 2.3 Potential additional virtual device configuration registration interface: > > --------------------------------------------------------------------------------- > > > > callback function to describe the MMAP behavior of the virtual GPU > > > > callback function to allow GPU vendor driver to provide PCI config space backing > > memory. > > > > 3. VGPU TYPE1 IOMMU > > ================================================================================== > > > > Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track the > > <iova, hva, size, flag> and save the QEMU mm for later reference. > > > > You can find the quick/ugly implementation in the attached patch file, which is > > actually just a simple version Alex's type1 IOMMU without actual real > > mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called. > > > > We have thought about providing another vendor driver registration interface so > > such tracking information will be sent to vendor driver and he will use the QEMU > > mm to do the get_user_pages / remap_pfn_range when it is required. After doing a > > quick implementation within our driver, I noticed following issues: > > > > 1) OS/VFIO logic into vendor driver which will be a maintenance issue. 
> > > > 2) Every driver vendor has to implement their own RB tree, instead of reusing > > the common existing VFIO code (vfio_find/link/unlink_dma) > > > > 3) IOMMU_UNMAP_DMA is expecting to get "unmapped bytes" back to the caller/QEMU, > > better not have anything inside a vendor driver that the VFIO caller immediately > > depends on. > > > > Based on the above consideration, we decide to implement the DMA tracking logic > > within VGPU TYPE1 IOMMU code (ideally, this should be merged into current TYPE1 > > IOMMU code) and expose two symbols to outside for MMIO mapping and page > > translation and pinning. > > > > Also, with a mmap MMIO interface between virtual and physical, this allows > > para-virtualized guest driver can access his virtual MMIO without taking a MMAP > > fault hit, also we can support different MMIO size between virtual and physical > > device. > > > > int vgpu_map_virtual_bar > > ( > > uint64_t virt_bar_addr, > > uint64_t phys_bar_addr, > > uint32_t len, > > uint32_t flags > > ) > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar); > > > Per the implementation provided, this needs to be implemented in the > vfio device driver, not in the iommu interface. Finding the DMA mapping > of the device and replacing it is wrong. It should be remapped at the > vfio device file interface using vm_ops. > So you are basically suggesting that we are going to take a mmap fault and within that fault handler, we will go into vendor driver to look up the "pre-registered" mapping and remap there. Is my understanding correct? > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count) > > > > EXPORT_SYMBOL(vgpu_dma_do_translate); > > > > Still a lot to be added and modified, such as supporting multiple VMs and > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU > > kernel driver, error handling, roll-back and locked memory size per user, etc. > > Particularly, handling of mapping changes is completely missing. 
This > cannot be a point in time translation, the user is free to remap > addresses whenever they wish and device translations need to be updated > accordingly. > When you say "user", do you mean the QEMU? Here, whenever the DMA that the guest driver is going to launch will be first pinned within VM, and then registered to QEMU, therefore the IOMMU memory listener, eventually the pages will be pinned by the GPU or DMA engine. Since we are keeping the upper level code same, thinking about passthru case, where the GPU has already put the real IOVA into his PTEs, I don't know how QEMU can change that mapping without causing an IOMMU fault on a active DMA device. > > > 4. Modules > > ================================================================================== > > > > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko > > > > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU > > TYPE1 v1 and v2 interface. > > Depending on how intrusive it is, this can possibly by done within the > existing type1 driver. Either that or we can split out common code for > use by a separate module. > > > vgpu.ko - provide registration interface and virtual device > > VFIO access. > > > > 5. QEMU note > > ================================================================================== > > > > To allow us focus on the VGPU kernel driver prototyping, we have introduced a new VFIO > > class - vgpu inside QEMU, so we don't have to change the existing vfio/pci.c file and > > use it as a reference for our implementation. It is basically just a quick c & p > > from vfio/pci.c to quickly meet our needs. > > > > Once this proposal is finalized, we will move to vfio/pci.c instead of a new > > class, and probably the only thing required is to have a new way to discover the > > device. > > > > 6. Examples > > ================================================================================== > > > > On this server, we have two NVIDIA M60 GPUs. 
> > > > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2 > > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1) > > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1) > > > > After nvidia.ko gets initialized, we can query the supported vGPU types by > > accessing "vgpu_supported_types" as follows: > > > > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types > > 11:GRID M60-0B > > 12:GRID M60-0Q > > 13:GRID M60-1B > > 14:GRID M60-1Q > > 15:GRID M60-2B > > 16:GRID M60-2Q > > 17:GRID M60-4Q > > 18:GRID M60-8Q > > > > For example, the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would > > like to create a "GRID M60-4Q" VM on it. > > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create > > > > Note: the number 0 here is the vGPU device index. So far the change is not tested > > for multiple vgpu devices yet, but we will support that. > > > > At this moment, if you query "vgpu_supported_types" it will still show all > > supported virtual GPU types, as no virtual GPU resource is committed yet. > > > > Starting the VM: > > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start > > > > then the supported vGPU type query will return: > > > > [root@cjia-vgx-kvm /home/cjia]$ > > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types > > 17:GRID M60-4Q > > > > So vgpu_supported_config needs to be called whenever a new virtual device gets > > created, as the underlying HW might limit the supported types if there are > > any existing VMs running. > > > > Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown will inform the > > GPU vendor driver to clean up resources. > > > > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under > > device sysfs. > > > I'd like to hear Intel's thoughts on this interface.
Are there > different vgpu capacities or priority classes that would necessitate > different types of vgpus on Intel? > > I think there are some gaps in translating from named vgpu types to > indexes here, along with my previous mention of the UUID/set oddity. > > Does Intel have a need for start and shutdown interfaces? > > Neo, wasn't there at some point information about how many of each type > could be supported through these interfaces? How does a user know their > capacity limits? > Thanks for reminding me of that, I think we probably forgot to put that *important* information in the output of "vgpu_supported_types". Regarding the capacity, we can provide the frame buffer size as part of the "vgpu_supported_types" output as well; I would imagine those will eventually show up on the OpenStack management interface or virt-manager. Basically, yes, there would be a separate column to show the number of instances you can create for each type of vGPU on a specific physical GPU. Thanks, Neo > Thanks, > Alex >
On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote: > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote: > > > 1.1 Under per-physical device sysfs: > > > ---------------------------------------------------------------------------------- > > > > > > vgpu_supported_types - RO, list the current supported virtual GPU types and its > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of > > > "vgpu_supported_types". > > > > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual > > > gpu device on a target physical GPU. idx: virtual device index inside a VM > > > > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a > > > target physical GPU > > > > > > I've noted in previous discussions that we need to separate user policy > > from kernel policy here, the kernel policy should not require a "VM > > UUID". A UUID simply represents a set of one or more devices and an > > index picks the device within the set. Whether that UUID matches a VM > > or is independently used is up to the user policy when creating the > > device. > > > > Personally I'd also prefer to get rid of the concept of indexes within a > > UUID set of devices and instead have each device be independent. This > > seems to be an imposition on the nvidia implementation into the kernel > > interface design. > > > > Hi Alex, > > I agree with you that we should not put UUID concept into a kernel API. At > this point (without any prototyping), I am thinking of using a list of virtual > devices instead of UUID. Hi Neo, A UUID is a perfectly fine name, so long as we let it be just a UUID and not the UUID matching some specific use case. 
> > > > > > int vgpu_map_virtual_bar > > > ( > > > uint64_t virt_bar_addr, > > > uint64_t phys_bar_addr, > > > uint32_t len, > > > uint32_t flags > > > ) > > > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar); > > > > > > Per the implementation provided, this needs to be implemented in the > > vfio device driver, not in the iommu interface. Finding the DMA mapping > > of the device and replacing it is wrong. It should be remapped at the > > vfio device file interface using vm_ops. > > > > So you are basically suggesting that we are going to take a mmap fault and > within that fault handler, we will go into vendor driver to look up the > "pre-registered" mapping and remap there. > > Is my understanding correct? Essentially, hopefully the vendor driver will have already registered the backing for the mmap prior to the fault, but either way could work. I think the key though is that you want to remap it onto the vma accessing the vfio device file, not scanning it out of an IOVA mapping that might be dynamic and doing a vma lookup based on the point in time mapping of the BAR. The latter doesn't give me much confidence that mappings couldn't change while the former should be a one time fault. In case it's not clear to folks at Intel, the purpose of this is that a vGPU may directly map a segment of the physical GPU MMIO space, but we may not know what segment that is at setup time, when QEMU does an mmap of the vfio device file descriptor. The thought is that we can create an invalid mapping when QEMU calls mmap(), knowing that it won't be accessed until later, then we can fault in the real mmap on demand. Do you need anything similar? 
> > > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count) > > > > > > EXPORT_SYMBOL(vgpu_dma_do_translate); > > > > > > Still a lot to be added and modified, such as supporting multiple VMs and > > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU > > > kernel driver, error handling, roll-back and locked memory size per user, etc. > > > > Particularly, handling of mapping changes is completely missing. This > > cannot be a point in time translation, the user is free to remap > > addresses whenever they wish and device translations need to be updated > > accordingly. > > > > When you say "user", do you mean the QEMU? vfio is a generic userspace driver interface, QEMU is a very, very important user of the interface, but not the only user. So for this conversation, we're mostly talking about QEMU as the user, but we should be careful about assuming QEMU is the only user. > Here, whenever the DMA that > the guest driver is going to launch will be first pinned within VM, and then > registered to QEMU, therefore the IOMMU memory listener, eventually the pages > will be pinned by the GPU or DMA engine. > > Since we are keeping the upper level code same, thinking about passthru case, > where the GPU has already put the real IOVA into his PTEs, I don't know how QEMU > can change that mapping without causing an IOMMU fault on a active DMA device. For the virtual BAR mapping above, it's easy to imagine that mapping a BAR to a given address is at the guest discretion, it may be mapped and unmapped, it may be mapped to different addresses at different points in time, the guest BIOS may choose to map it at yet another address, etc. So if somehow we were trying to setup a mapping for peer-to-peer, there are lots of ways that IOVA could change. But even with RAM, we can support memory hotplug in a VM. What was once a DMA target may be removed or may now be backed by something else. 
Chipset configuration on the emulated platform may change how guest physical memory appears and that might change between VM boots. Currently with physical device assignment the memory listener watches for both maps and unmaps and updates the iotlb to match. Just like real hardware doing these same sorts of things, we rely on the guest to stop using memory that's going to be moved as a DMA target prior to moving it. > > > 4. Modules > > > ================================================================================== > > > > > > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko > > > > > > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU > > > TYPE1 v1 and v2 interface. > > > > Depending on how intrusive it is, this can possibly by done within the > > existing type1 driver. Either that or we can split out common code for > > use by a separate module. > > > > > vgpu.ko - provide registration interface and virtual device > > > VFIO access. > > > > > > 5. QEMU note > > > ================================================================================== > > > > > > To allow us focus on the VGPU kernel driver prototyping, we have introduced a new VFIO > > > class - vgpu inside QEMU, so we don't have to change the existing vfio/pci.c file and > > > use it as a reference for our implementation. It is basically just a quick c & p > > > from vfio/pci.c to quickly meet our needs. > > > > > > Once this proposal is finalized, we will move to vfio/pci.c instead of a new > > > class, and probably the only thing required is to have a new way to discover the > > > device. > > > > > > 6. Examples > > > ================================================================================== > > > > > > On this server, we have two NVIDIA M60 GPUs. 
> > > > > > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2 > > > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1) > > > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1) > > > > > > After nvidia.ko gets initialized, we can query the supported vGPU type by > > > accessing the "vgpu_supported_types" like following: > > > > > > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types > > > 11:GRID M60-0B > > > 12:GRID M60-0Q > > > 13:GRID M60-1B > > > 14:GRID M60-1Q > > > 15:GRID M60-2B > > > 16:GRID M60-2Q > > > 17:GRID M60-4Q > > > 18:GRID M60-8Q > > > > > > For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would > > > like to create "GRID M60-4Q" VM on it. > > > > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > > > > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create > > > > > > Note: the number 0 here is for vGPU device index. So far the change is not tested > > > for multiple vgpu devices yet, but we will support it. > > > > > > At this moment, if you query the "vgpu_supported_types" it will still show all > > > supported virtual GPU types as no virtual GPU resource is committed yet. > > > > > > Starting VM: > > > > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start > > > > > > then, the supported vGPU type query will return: > > > > > > [root@cjia-vgx-kvm /home/cjia]$ > > > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types > > > 17:GRID M60-4Q > > > > > > So vgpu_supported_config needs to be called whenever a new virtual device gets > > > created as the underlying HW might limit the supported types if there are > > > any existing VM runnings. > > > > > > Then, VM gets shutdown, writes to /sys/class/vgpu/vgpu_shutdown will info the > > > GPU driver vendor to clean up resource. > > > > > > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under > > > device sysfs. 
> > > > > > I'd like to hear Intel's thoughts on this interface. Are there > > different vgpu capacities or priority classes that would necessitate > > different types of vcpus on Intel? > > > > I think there are some gaps in translating from named vgpu types to > > indexes here, along with my previous mention of the UUID/set oddity. > > > > Does Intel have a need for start and shutdown interfaces? > > > > Neo, wasn't there at some point information about how many of each type > > could be supported through these interfaces? How does a user know their > > capacity limits? > > > > Thanks for reminding me that, I think we probably forget to put that *important* > information as the output of "vgpu_supported_types". > > Regarding the capacity, we can provide the frame buffer size as part of the > "vgpu_supported_types" output as well, I would imagine those will be eventually > show up on the openstack management interface or virt-mgr. > > Basically, yes there would be a separate col to show the number of instance you > can create for each type of VGPU on a specific physical GPU. Ok, Thanks, Alex
On 1/27/2016 1:36 AM, Alex Williamson wrote: > On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote: >> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote: >>>> From: Alex Williamson [mailto:alex.williamson@redhat.com] >> >> Hi Alex, Kevin and Jike, >> >> (Seems I shouldn't use attachment, resend it again to the list, patches are >> inline at the end) >> >> Thanks for adding me to this technical discussion, a great opportunity >> for us to design together which can bring both Intel and NVIDIA vGPU solution to >> KVM platform. >> >> Instead of directly jumping to the proposal that we have been working on >> recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple >> quick comments / thoughts regarding the existing discussions on this thread as >> fundamentally I think we are solving the same problem, DMA, interrupt and MMIO. >> >> Then we can look at what we have, hopefully we can reach some consensus soon. >> >>> Yes, and since you're creating and destroying the vgpu here, this is >>> where I'd expect a struct device to be created and added to an IOMMU >>> group. The lifecycle management should really include links between >>> the vGPU and physical GPU, which would be much, much easier to do with >>> struct devices create here rather than at the point where we start >>> doing vfio "stuff". >> >> Infact to keep vfio-vgpu to be more generic, vgpu device creation and management >> can be centralized and done in vfio-vgpu. That also include adding to IOMMU >> group and VFIO group. > Is this really a good idea? The concept of a vgpu is not unique to > vfio, we want vfio to be a driver for a vgpu, not an integral part of > the lifecycle of a vgpu. That certainly doesn't exclude adding > infrastructure to make lifecycle management of a vgpu more consistent > between drivers, but it should be done independently of vfio. 
I'll go > back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio > does not create the VF, that's done in coordination with the PF making > use of some PCI infrastructure for consistency between drivers. > > It seems like we need to take more advantage of the class and driver > core support to perhaps setup a vgpu bus and class with vfio-vgpu just > being a driver for those devices. For the device passthrough or SR-IOV model, PCI devices are created by the PCI bus driver, and from the probe routine each device is added to a vfio group. For vgpu, there should be a common module, say a vgpu module, that creates a vgpu device, adds it to an IOMMU group and then adds it to a vfio group. This module can handle management of vgpus. The advantage of keeping this a separate module, rather than doing device creation in vendor modules, is to have a generic interface for vgpu management, for example, the files /sys/class/vgpu/vgpu_start and /sys/class/vgpu/vgpu_shutdown and the vgpu driver registration interface. In the patch, vgpu_dev.c + vgpu_sysfs.c form such a vgpu module and vgpu_vfio.c is for the VFIO interface. Each vgpu device should be added to a vfio group, so vgpu_group_init() from vgpu_vfio.c should be called per device. In the vgpu module, vgpu devices are created on request, so vgpu_group_init() should be called explicitly per vgpu device. That's why we had merged the two modules, vgpu + vgpu_vfio, to form one vgpu module. vgpu_vfio would remain a separate entity but merged with the vgpu module. Thanks, Kirti
On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote: > On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote: > > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote: > > > > 1.1 Under per-physical device sysfs: > > > > ---------------------------------------------------------------------------------- > > > > > > > > vgpu_supported_types - RO, list the current supported virtual GPU types and its > > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of > > > > "vgpu_supported_types". > > > > > > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual > > > > gpu device on a target physical GPU. idx: virtual device index inside a VM > > > > > > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a > > > > target physical GPU > > > > > > > > > I've noted in previous discussions that we need to separate user policy > > > from kernel policy here, the kernel policy should not require a "VM > > > UUID". A UUID simply represents a set of one or more devices and an > > > index picks the device within the set. Whether that UUID matches a VM > > > or is independently used is up to the user policy when creating the > > > device. > > > > > > Personally I'd also prefer to get rid of the concept of indexes within a > > > UUID set of devices and instead have each device be independent. This > > > seems to be an imposition on the nvidia implementation into the kernel > > > interface design. > > > > > > > Hi Alex, > > > > I agree with you that we should not put UUID concept into a kernel API. At > > this point (without any prototyping), I am thinking of using a list of virtual > > devices instead of UUID. > > Hi Neo, > > A UUID is a perfectly fine name, so long as we let it be just a UUID and > not the UUID matching some specific use case. 
> > > > > > > > > int vgpu_map_virtual_bar > > > > > ( > > > > > uint64_t virt_bar_addr, > > > > > uint64_t phys_bar_addr, > > > > > uint32_t len, > > > > > uint32_t flags > > > > > ) > > > > > > > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar); > > > > > > > > > > > > Per the implementation provided, this needs to be implemented in the > > > > vfio device driver, not in the iommu interface. Finding the DMA mapping > > > > of the device and replacing it is wrong. It should be remapped at the > > > > vfio device file interface using vm_ops. > > > > > > > > > > So you are basically suggesting that we are going to take a mmap fault and > > > within that fault handler, we will go into vendor driver to look up the > > > "pre-registered" mapping and remap there. > > > > > > Is my understanding correct? > > > > Essentially, hopefully the vendor driver will have already registered > > the backing for the mmap prior to the fault, but either way could work. > > I think the key though is that you want to remap it onto the vma > > accessing the vfio device file, not scanning it out of an IOVA mapping > > that might be dynamic and doing a vma lookup based on the point in time > > mapping of the BAR. The latter doesn't give me much confidence that > > mappings couldn't change while the former should be a one time fault.

Hi Alex,

The fact is that the vendor driver can only prevent such an mmap fault by looking up the <iova, hva> mapping table that we have saved from the IOMMU memory listener when the guest region gets programmed. Also, as you have mentioned below, such a mapping between iova and hva shouldn't change as long as the SBIOS and guest OS are done with their job.

Yes, you are right that it is a one-time fault, but the GPU work is heavily pipelined. Probably we should just limit this interface to the guest MMIO region, and we can have some crosscheck with the VFIO driver, which has monitored the config space access, to make sure nothing gets moved around?
> > In case it's not clear to folks at Intel, the purpose of this is that a > vGPU may directly map a segment of the physical GPU MMIO space, but we > may not know what segment that is at setup time, when QEMU does an mmap > of the vfio device file descriptor. The thought is that we can create > an invalid mapping when QEMU calls mmap(), knowing that it won't be > accessed until later, then we can fault in the real mmap on demand. Do > you need anything similar? > > > > > > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count) > > > > > > > > EXPORT_SYMBOL(vgpu_dma_do_translate); > > > > > > > > Still a lot to be added and modified, such as supporting multiple VMs and > > > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU > > > > kernel driver, error handling, roll-back and locked memory size per user, etc. > > > > > > Particularly, handling of mapping changes is completely missing. This > > > cannot be a point in time translation, the user is free to remap > > > addresses whenever they wish and device translations need to be updated > > > accordingly. > > > > > > > When you say "user", do you mean the QEMU? > > vfio is a generic userspace driver interface, QEMU is a very, very > important user of the interface, but not the only user. So for this > conversation, we're mostly talking about QEMU as the user, but we should > be careful about assuming QEMU is the only user. > Understood. I have to say that our focus at this moment is to support QEMU and KVM, but I know VFIO interface is much more than that, and that is why I think it is right to leverage this framework so we can together explore future use case in the userland. > > Here, whenever the DMA that > > the guest driver is going to launch will be first pinned within VM, and then > > registered to QEMU, therefore the IOMMU memory listener, eventually the pages > > will be pinned by the GPU or DMA engine. 
> > > > Since we are keeping the upper level code same, thinking about passthru case, > > where the GPU has already put the real IOVA into his PTEs, I don't know how QEMU > > can change that mapping without causing an IOMMU fault on a active DMA device. > > For the virtual BAR mapping above, it's easy to imagine that mapping a > BAR to a given address is at the guest discretion, it may be mapped and > unmapped, it may be mapped to different addresses at different points in > time, the guest BIOS may choose to map it at yet another address, etc. > So if somehow we were trying to setup a mapping for peer-to-peer, there > are lots of ways that IOVA could change. But even with RAM, we can > support memory hotplug in a VM. What was once a DMA target may be > removed or may now be backed by something else. Chipset configuration > on the emulated platform may change how guest physical memory appears > and that might change between VM boots. > > Currently with physical device assignment the memory listener watches > for both maps and unmaps and updates the iotlb to match. Just like real > hardware doing these same sorts of things, we rely on the guest to stop > using memory that's going to be moved as a DMA target prior to moving > it.

Right, you can only do that when the device is quiescent.

As long as the guest is notified of this, I think we should be able to support it, although the real implementation will depend on how the device gets into a quiescent state.

This is definitely a very interesting feature we should explore, but I hope we can first focus on the most basic functionality.

Thanks,
Neo

> > > > > 4. Modules > > > > ================================================================================== > > > > > > > > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko > > > > > > > > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU > > > > TYPE1 v1 and v2 interface.
> > > > > > Depending on how intrusive it is, this can possibly by done within the > > > existing type1 driver. Either that or we can split out common code for > > > use by a separate module. > > > > > > > vgpu.ko - provide registration interface and virtual device > > > > VFIO access. > > > > > > > > 5. QEMU note > > > > ================================================================================== > > > > > > > > To allow us focus on the VGPU kernel driver prototyping, we have introduced a new VFIO > > > > class - vgpu inside QEMU, so we don't have to change the existing vfio/pci.c file and > > > > use it as a reference for our implementation. It is basically just a quick c & p > > > > from vfio/pci.c to quickly meet our needs. > > > > > > > > Once this proposal is finalized, we will move to vfio/pci.c instead of a new > > > > class, and probably the only thing required is to have a new way to discover the > > > > device. > > > > > > > > 6. Examples > > > > ================================================================================== > > > > > > > > On this server, we have two NVIDIA M60 GPUs. > > > > > > > > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2 > > > > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1) > > > > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1) > > > > > > > > After nvidia.ko gets initialized, we can query the supported vGPU type by > > > > accessing the "vgpu_supported_types" like following: > > > > > > > > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types > > > > 11:GRID M60-0B > > > > 12:GRID M60-0Q > > > > 13:GRID M60-1B > > > > 14:GRID M60-1Q > > > > 15:GRID M60-2B > > > > 16:GRID M60-2Q > > > > 17:GRID M60-4Q > > > > 18:GRID M60-8Q > > > > > > > > For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would > > > > like to create "GRID M60-4Q" VM on it. 
> > > > > > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > > > > > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create > > > > > > > > Note: the number 0 here is for vGPU device index. So far the change is not tested > > > > for multiple vgpu devices yet, but we will support it. > > > > > > > > At this moment, if you query the "vgpu_supported_types" it will still show all > > > > supported virtual GPU types as no virtual GPU resource is committed yet. > > > > > > > > Starting VM: > > > > > > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start > > > > > > > > then, the supported vGPU type query will return: > > > > > > > > [root@cjia-vgx-kvm /home/cjia]$ > > > > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types > > > > 17:GRID M60-4Q > > > > > > > > So vgpu_supported_config needs to be called whenever a new virtual device gets > > > > created as the underlying HW might limit the supported types if there are > > > > any existing VM runnings. > > > > > > > > Then, VM gets shutdown, writes to /sys/class/vgpu/vgpu_shutdown will info the > > > > GPU driver vendor to clean up resource. > > > > > > > > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under > > > > device sysfs. > > > > > > > > > I'd like to hear Intel's thoughts on this interface. Are there > > > different vgpu capacities or priority classes that would necessitate > > > different types of vcpus on Intel? > > > > > > I think there are some gaps in translating from named vgpu types to > > > indexes here, along with my previous mention of the UUID/set oddity. > > > > > > Does Intel have a need for start and shutdown interfaces? > > > > > > Neo, wasn't there at some point information about how many of each type > > > could be supported through these interfaces? How does a user know their > > > capacity limits? 
> > > > > > > Thanks for reminding me that, I think we probably forget to put that *important* > > information as the output of "vgpu_supported_types". > > > > Regarding the capacity, we can provide the frame buffer size as part of the > > "vgpu_supported_types" output as well, I would imagine those will be eventually > > show up on the openstack management interface or virt-mgr. > > > > Basically, yes there would be a separate col to show the number of instance you > > can create for each type of VGPU on a specific physical GPU. > > Ok, Thanks, > > Alex >
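As an aside, the capacity information discussed here could ride along in the existing sysfs file. The document's own examples use `cat` on sysfs, so the same style is used below; the two extra colon-separated columns (frame buffer size, remaining instances) are purely a hypothetical extension for illustration, not part of the posted interface:

```shell
# Hypothetical extended "vgpu_supported_types" format with frame buffer
# size and remaining instance capacity appended as extra columns
# (both columns are an assumption, not the posted NVIDIA interface):
cat > /tmp/vgpu_supported_types <<'EOF'
17:GRID M60-4Q:4096M:2
18:GRID M60-8Q:8192M:1
EOF

# A management stack (openstack, virt-manager) could then parse capacity:
awk -F: '{ printf "type=%s name=\"%s\" fb=%s instances_left=%s\n", $1, $2, $3, $4 }' \
    /tmp/vgpu_supported_types
```

With the sample data above this prints one line per vGPU type, e.g. `type=17 name="GRID M60-4Q" fb=4096M instances_left=2`, which answers the "how does a user know their capacity limits" question without a second sysfs file.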
On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote: > > On 1/27/2016 1:36 AM, Alex Williamson wrote: > > On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote: > > > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote: > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com] > > > > > > Hi Alex, Kevin and Jike, > > > > > > (Seems I shouldn't use attachment, resend it again to the list, patches are > > > inline at the end) > > > > > > Thanks for adding me to this technical discussion, a great opportunity > > > for us to design together which can bring both Intel and NVIDIA vGPU solution to > > > KVM platform. > > > > > > Instead of directly jumping to the proposal that we have been working on > > > recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple > > > quick comments / thoughts regarding the existing discussions on this thread as > > > fundamentally I think we are solving the same problem, DMA, interrupt and MMIO. > > > > > > Then we can look at what we have, hopefully we can reach some consensus soon. > > > > > > > Yes, and since you're creating and destroying the vgpu here, this is > > > > where I'd expect a struct device to be created and added to an IOMMU > > > > group. The lifecycle management should really include links between > > > > the vGPU and physical GPU, which would be much, much easier to do with > > > > struct devices create here rather than at the point where we start > > > > doing vfio "stuff". > > > > > > Infact to keep vfio-vgpu to be more generic, vgpu device creation and management > > > can be centralized and done in vfio-vgpu. That also include adding to IOMMU > > > group and VFIO group. > > Is this really a good idea? The concept of a vgpu is not unique to > > vfio, we want vfio to be a driver for a vgpu, not an integral part of > > the lifecycle of a vgpu. 
That certainly doesn't exclude adding > > infrastructure to make lifecycle management of a vgpu more consistent > > between drivers, but it should be done independently of vfio. I'll go > > back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio > > does not create the VF, that's done in coordination with the PF making > > use of some PCI infrastructure for consistency between drivers. > > > > It seems like we need to take more advantage of the class and driver > > core support to perhaps setup a vgpu bus and class with vfio-vgpu just > > being a driver for those devices. > > For device passthrough or SR-IOV model, PCI devices are created by PCI > bus driver and from the probe routine each device is added in vfio group. An SR-IOV VF is created by the PF driver using standard interfaces provided by the PCI core. The IOMMU group for a VF is added by the IOMMU driver when the device is created on the pci_bus_type. The probe routine of the vfio bus driver (vfio-pci) is what adds the device into the vfio group. > For vgpu, there should be a common module that create vgpu device, say > vgpu module, add vgpu device to an IOMMU group and then add it to vfio > group. This module can handle management of vgpus. Advantage of keeping > this module a separate module than doing device creation in vendor > modules is to have generic interface for vgpu management, for example, > files /sys/class/vgpu/vgpu_start and /sys/class/vgpu/vgpu_shudown and > vgpu driver registration interface. But you're suggesting something very different from the SR-IOV model. If we wanted to mimic that model, the GPU specific driver should create the vgpu using services provided by a common interface. For instance i915 could call a new vgpu_device_create() which creates the device, adds it to the vgpu class, etc. That vgpu device should not be assumed to be used with vfio though, that should happen via a separate probe using a vfio-vgpu driver. 
It's that vfio bus driver that will add the device to a vfio group. > In the patch, vgpu_dev.c + vgpu_sysfs.c form such vgpu module and > vgpu_vfio.c is for VFIO interface. Each vgpu device should be added to > vfio group, so vgpu_group_init() from vgpu_vfio.c should be called per > device. In the vgpu module, vgpu devices are created on request, so > vgpu_group_init() should be called explicitly for per vgpu device. > That’s why had merged the 2 modules, vgpu + vgpu_vfio to form one vgpu > module. Vgpu_vfio would remain separate entity but merged with vgpu > module. I disagree with this design, creation of a vgpu necessarily involves the GPU driver and should not be tied to use of the vgpu with vfio. vfio should be a driver for the device, maybe eventually not the only driver for the device. Thanks, Alex
On Wed, 2016-01-27 at 01:14 -0800, Neo Jia wrote: > On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote: > > On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote: > > > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote: > > > > > 1.1 Under per-physical device sysfs: > > > > > ---------------------------------------------------------------------------------- > > > > > > > > > > vgpu_supported_types - RO, list the current supported virtual GPU types and its > > > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of > > > > > "vgpu_supported_types". > > > > > > > > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual > > > > > gpu device on a target physical GPU. idx: virtual device index inside a VM > > > > > > > > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a > > > > > target physical GPU > > > > > > > > > > > > I've noted in previous discussions that we need to separate user policy > > > > from kernel policy here, the kernel policy should not require a "VM > > > > UUID". A UUID simply represents a set of one or more devices and an > > > > index picks the device within the set. Whether that UUID matches a VM > > > > or is independently used is up to the user policy when creating the > > > > device. > > > > > > > > Personally I'd also prefer to get rid of the concept of indexes within a > > > > UUID set of devices and instead have each device be independent. This > > > > seems to be an imposition on the nvidia implementation into the kernel > > > > interface design. > > > > > > > > > > Hi Alex, > > > > > > I agree with you that we should not put UUID concept into a kernel API. At > > > this point (without any prototyping), I am thinking of using a list of virtual > > > devices instead of UUID. > > > > Hi Neo, > > > > A UUID is a perfectly fine name, so long as we let it be just a UUID and > > not the UUID matching some specific use case. 
> > > > > > > > > > > > int vgpu_map_virtual_bar > > > > > ( > > > > > uint64_t virt_bar_addr, > > > > > uint64_t phys_bar_addr, > > > > > uint32_t len, > > > > > uint32_t flags > > > > > ) > > > > > > > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar); > > > > > > > > > > > > Per the implementation provided, this needs to be implemented in the > > > > vfio device driver, not in the iommu interface. Finding the DMA mapping > > > > of the device and replacing it is wrong. It should be remapped at the > > > > vfio device file interface using vm_ops. > > > > > > > > > > So you are basically suggesting that we are going to take a mmap fault and > > > within that fault handler, we will go into vendor driver to look up the > > > "pre-registered" mapping and remap there. > > > > > > Is my understanding correct? > > > > Essentially, hopefully the vendor driver will have already registered > > the backing for the mmap prior to the fault, but either way could work. > > I think the key though is that you want to remap it onto the vma > > accessing the vfio device file, not scanning it out of an IOVA mapping > > that might be dynamic and doing a vma lookup based on the point in time > > mapping of the BAR. The latter doesn't give me much confidence that > > mappings couldn't change while the former should be a one time fault. > > Hi Alex, > > The fact is that the vendor driver can only prevent such mmap fault by looking > up the <iova, hva> mapping table that we have saved from IOMMU memory listerner Why do we need to prevent the fault? We need to handle the fault when it occurs. > when the guest region gets programmed. Also, like you have mentioned below, such > mapping between iova and hva shouldn't be changed as long as the SBIOS and > guest OS are done with their job. But you don't know they're done with their job. > Yes, you are right it is one time fault, but the gpu work is heavily pipelined. Why does that matter? 
We're talking about the first time the VM accesses the range of the BAR that will be direct mapped to the physical GPU. This isn't going to happen in the middle of a benchmark, it's going to happen during driver initialization in the guest. > Probably we should just limit this interface to guest MMIO region and we can have > some crosscheck between the VFIO driver who has monitored the config spcae > access to make sure nothing getting moved around? No, the solution for the bar is very clear, map on fault to the vma accessing the mmap and be done with it for the remainder of this instance of the VM. > > In case it's not clear to folks at Intel, the purpose of this is that a > > vGPU may directly map a segment of the physical GPU MMIO space, but we > > may not know what segment that is at setup time, when QEMU does an mmap > > of the vfio device file descriptor. The thought is that we can create > > an invalid mapping when QEMU calls mmap(), knowing that it won't be > > accessed until later, then we can fault in the real mmap on demand. Do > > you need anything similar? > > > > > > > > > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count) > > > > > > > > > > EXPORT_SYMBOL(vgpu_dma_do_translate); > > > > > > > > > > Still a lot to be added and modified, such as supporting multiple VMs and > > > > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU > > > > > kernel driver, error handling, roll-back and locked memory size per user, etc. > > > > > > > > Particularly, handling of mapping changes is completely missing. This > > > > cannot be a point in time translation, the user is free to remap > > > > addresses whenever they wish and device translations need to be updated > > > > accordingly. > > > > > > > > > > When you say "user", do you mean the QEMU? > > > > vfio is a generic userspace driver interface, QEMU is a very, very > > important user of the interface, but not the only user. 
So for this > > conversation, we're mostly talking about QEMU as the user, but we should > > be careful about assuming QEMU is the only user. > > > > Understood. I have to say that our focus at this moment is to support QEMU and > KVM, but I know VFIO interface is much more than that, and that is why I think > it is right to leverage this framework so we can together explore future use > case in the userland. > > > > > Here, whenever the DMA that > > > the guest driver is going to launch will be first pinned within VM, and then > > > registered to QEMU, therefore the IOMMU memory listener, eventually the pages > > > will be pinned by the GPU or DMA engine. > > > > > > Since we are keeping the upper level code same, thinking about passthru case, > > > where the GPU has already put the real IOVA into his PTEs, I don't know how QEMU > > > can change that mapping without causing an IOMMU fault on a active DMA device. > > > > For the virtual BAR mapping above, it's easy to imagine that mapping a > > BAR to a given address is at the guest discretion, it may be mapped and > > unmapped, it may be mapped to different addresses at different points in > > time, the guest BIOS may choose to map it at yet another address, etc. > > So if somehow we were trying to setup a mapping for peer-to-peer, there > > are lots of ways that IOVA could change. But even with RAM, we can > > support memory hotplug in a VM. What was once a DMA target may be > > removed or may now be backed by something else. Chipset configuration > > on the emulated platform may change how guest physical memory appears > > and that might change between VM boots. > > > > Currently with physical device assignment the memory listener watches > > for both maps and unmaps and updates the iotlb to match. Just like real > > hardware doing these same sorts of things, we rely on the guest to stop > > using memory that's going to be moved as a DMA target prior to moving > > it. 
> > Right, you can only do that when the device is quiescent. > > As long as this will be notified to the guest, I think we should be able to > support it although the real implementation will depend on how the device gets into > quiescent state. > > This is definitely a very interesting feature we should explore, but I hope we > probably can first focus on the most basic functionality.

If we only do a point-in-time translation and assume it never changes, that's good enough for a proof of concept, but it's not a complete solution. I think this is a practical problem, not just an academic one. There needs to be a mechanism for mappings to be invalidated based on VM memory changes.

Thanks,
Alex
On 1/27/2016 9:30 PM, Alex Williamson wrote: > On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote: >> >> On 1/27/2016 1:36 AM, Alex Williamson wrote: >>> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote: >>>> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote: >>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com] >>>> >>>> Hi Alex, Kevin and Jike, >>>> >>>> (Seems I shouldn't use attachment, resend it again to the list, patches are >>>> inline at the end) >>>> >>>> Thanks for adding me to this technical discussion, a great opportunity >>>> for us to design together which can bring both Intel and NVIDIA vGPU solution to >>>> KVM platform. >>>> >>>> Instead of directly jumping to the proposal that we have been working on >>>> recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple >>>> quick comments / thoughts regarding the existing discussions on this thread as >>>> fundamentally I think we are solving the same problem, DMA, interrupt and MMIO. >>>> >>>> Then we can look at what we have, hopefully we can reach some consensus soon. >>>> >>>>> Yes, and since you're creating and destroying the vgpu here, this is >>>>> where I'd expect a struct device to be created and added to an IOMMU >>>>> group. The lifecycle management should really include links between >>>>> the vGPU and physical GPU, which would be much, much easier to do with >>>>> struct devices create here rather than at the point where we start >>>>> doing vfio "stuff". >>>> >>>> Infact to keep vfio-vgpu to be more generic, vgpu device creation and management >>>> can be centralized and done in vfio-vgpu. That also include adding to IOMMU >>>> group and VFIO group. >>> Is this really a good idea? The concept of a vgpu is not unique to >>> vfio, we want vfio to be a driver for a vgpu, not an integral part of >>> the lifecycle of a vgpu. 
That certainly doesn't exclude adding >>> infrastructure to make lifecycle management of a vgpu more consistent >>> between drivers, but it should be done independently of vfio. I'll go >>> back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio >>> does not create the VF, that's done in coordination with the PF making >>> use of some PCI infrastructure for consistency between drivers. >>> >>> It seems like we need to take more advantage of the class and driver >>> core support to perhaps setup a vgpu bus and class with vfio-vgpu just >>> being a driver for those devices. >> >> For device passthrough or SR-IOV model, PCI devices are created by PCI >> bus driver and from the probe routine each device is added in vfio group. > > An SR-IOV VF is created by the PF driver using standard interfaces > provided by the PCI core. The IOMMU group for a VF is added by the > IOMMU driver when the device is created on the pci_bus_type. The probe > routine of the vfio bus driver (vfio-pci) is what adds the device into > the vfio group. > >> For vgpu, there should be a common module that create vgpu device, say >> vgpu module, add vgpu device to an IOMMU group and then add it to vfio >> group. This module can handle management of vgpus. Advantage of keeping >> this module a separate module than doing device creation in vendor >> modules is to have generic interface for vgpu management, for example, >> files /sys/class/vgpu/vgpu_start and /sys/class/vgpu/vgpu_shudown and >> vgpu driver registration interface. > > But you're suggesting something very different from the SR-IOV model. > If we wanted to mimic that model, the GPU specific driver should create > the vgpu using services provided by a common interface. For instance > i915 could call a new vgpu_device_create() which creates the device, > adds it to the vgpu class, etc. That vgpu device should not be assumed > to be used with vfio though, that should happen via a separate probe > using a vfio-vgpu driver. 
It's that vfio bus driver that will add the > device to a vfio group. >

In that case, the vgpu module should provide a driver registration interface to register the vfio-vgpu driver:

struct vgpu_driver {
	const char *name;
	int  (*probe) (struct vgpu_device *vdev);
	void (*remove)(struct vgpu_device *vdev);
};

int vgpu_register_driver(struct vgpu_driver *driver)
{
	...
}
EXPORT_SYMBOL(vgpu_register_driver);

int vgpu_unregister_driver(struct vgpu_driver *driver)
{
	...
}
EXPORT_SYMBOL(vgpu_unregister_driver);

The vfio-vgpu driver registers with the vgpu module. Then from vgpu_device_create(), after creating the device, the vgpu module calls vgpu_driver->probe(vgpu_device) and the vfio-vgpu driver adds the device to a vfio group.

 +--------------+  vgpu_register_driver() +---------------+
 |  __init()    +------------------------>+               |
 |              |                         |               |
 |              +<------------------------+    vgpu.ko    |
 | vfio_vgpu.ko |  probe()/remove()       |               |
 |              |               +---------+               +---------+
 +--------------+               |         +-------+-------+         |
                                |                 ^                 |
                                | callback        |                 |
                                |         +-------+--------+        |
                                |         |vgpu_register_device()   |
                                |         |                |        |
                                +---^-----+-----+    +-----+------+-+
                                    | nvidia.ko |    |   i915.ko   |
                                    |           |    |             |
                                    +-----------+    +-------------+

Is my understanding correct?

Thanks,
Kirti

>> In the patch, vgpu_dev.c + vgpu_sysfs.c form such vgpu module and >> vgpu_vfio.c is for VFIO interface. Each vgpu device should be added to >> vfio group, so vgpu_group_init() from vgpu_vfio.c should be called per >> device. In the vgpu module, vgpu devices are created on request, so >> vgpu_group_init() should be called explicitly for per vgpu device. >> That’s why had merged the 2 modules, vgpu + vgpu_vfio to form one vgpu >> module. Vgpu_vfio would remain separate entity but merged with vgpu >> module. > > I disagree with this design, creation of a vgpu necessarily involves the > GPU driver and should not be tied to use of the vgpu with vfio. vfio > should be a driver for the device, maybe eventually not the only driver > for the device. Thanks, > > Alex >
On Wed, Jan 27, 2016 at 09:10:16AM -0700, Alex Williamson wrote: > On Wed, 2016-01-27 at 01:14 -0800, Neo Jia wrote: > > On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote: > > > On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote: > > > > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote: > > > > > > 1.1 Under per-physical device sysfs: > > > > > > ---------------------------------------------------------------------------------- > > > > > > > > > > > > vgpu_supported_types - RO, list the current supported virtual GPU types and its > > > > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of > > > > > > "vgpu_supported_types". > > > > > > > > > > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual > > > > > > gpu device on a target physical GPU. idx: virtual device index inside a VM > > > > > > > > > > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a > > > > > > target physical GPU > > > > > > > > > > > > > > > I've noted in previous discussions that we need to separate user policy > > > > > from kernel policy here, the kernel policy should not require a "VM > > > > > UUID". A UUID simply represents a set of one or more devices and an > > > > > index picks the device within the set. Whether that UUID matches a VM > > > > > or is independently used is up to the user policy when creating the > > > > > device. > > > > > > > > > > Personally I'd also prefer to get rid of the concept of indexes within a > > > > > UUID set of devices and instead have each device be independent. This > > > > > seems to be an imposition on the nvidia implementation into the kernel > > > > > interface design. > > > > > > > > > > > > > Hi Alex, > > > > > > > > I agree with you that we should not put UUID concept into a kernel API. At > > > > this point (without any prototyping), I am thinking of using a list of virtual > > > > devices instead of UUID. 
> > > > > > Hi Neo, > > > > > > A UUID is a perfectly fine name, so long as we let it be just a UUID and > > > not the UUID matching some specific use case. > > > > > > > > > > > > > > > int vgpu_map_virtual_bar > > > > > > ( > > > > > > uint64_t virt_bar_addr, > > > > > > uint64_t phys_bar_addr, > > > > > > uint32_t len, > > > > > > uint32_t flags > > > > > > ) > > > > > > > > > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar); > > > > > > > > > > > > > > > Per the implementation provided, this needs to be implemented in the > > > > > vfio device driver, not in the iommu interface. Finding the DMA mapping > > > > > of the device and replacing it is wrong. It should be remapped at the > > > > > vfio device file interface using vm_ops. > > > > > > > > > > > > > So you are basically suggesting that we are going to take a mmap fault and > > > > within that fault handler, we will go into vendor driver to look up the > > > > "pre-registered" mapping and remap there. > > > > > > > > Is my understanding correct? > > > > > > Essentially, hopefully the vendor driver will have already registered > > > the backing for the mmap prior to the fault, but either way could work. > > > I think the key though is that you want to remap it onto the vma > > > accessing the vfio device file, not scanning it out of an IOVA mapping > > > that might be dynamic and doing a vma lookup based on the point in time > > > mapping of the BAR. The latter doesn't give me much confidence that > > > mappings couldn't change while the former should be a one time fault. > > > > Hi Alex, > > > > The fact is that the vendor driver can only prevent such mmap fault by looking > > up the <iova, hva> mapping table that we have saved from IOMMU memory listerner > > Why do we need to prevent the fault? We need to handle the fault when > it occurs. > > > when the guest region gets programmed. 
Also, like you have mentioned below, such > > mapping between iova and hva shouldn't be changed as long as the SBIOS and > > guest OS are done with their job. > > But you don't know they're done with their job. > > > Yes, you are right it is one time fault, but the gpu work is heavily pipelined. > > Why does that matter? We're talking about the first time the VM > accesses the range of the BAR that will be direct mapped to the physical > GPU. This isn't going to happen in the middle of a benchmark, it's > going to happen during driver initialization in the guest. > > > Probably we should just limit this interface to guest MMIO region and we can have > > some crosscheck between the VFIO driver who has monitored the config spcae > > access to make sure nothing getting moved around? > > No, the solution for the bar is very clear, map on fault to the vma > accessing the mmap and be done with it for the remainder of this > instance of the VM. > Hi Alex, I totally get your points, my previous comments were just trying to explain the reasoning behind our current implementation. I think I have found a way to hide the latency of the mmap fault, which might happen in the middle of running a benchmark. I will add a new registration interface to allow the driver vendor to provide a fault handler callback, and from there pages will be installed. > > > In case it's not clear to folks at Intel, the purpose of this is that a > > > vGPU may directly map a segment of the physical GPU MMIO space, but we > > > may not know what segment that is at setup time, when QEMU does an mmap > > > of the vfio device file descriptor. The thought is that we can create > > > an invalid mapping when QEMU calls mmap(), knowing that it won't be > > > accessed until later, then we can fault in the real mmap on demand. Do > > > you need anything similar? 
> > > > > > > > > > > > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count) > > > > > > > > > > > > EXPORT_SYMBOL(vgpu_dma_do_translate); > > > > > > > > > > > > Still a lot to be added and modified, such as supporting multiple VMs and > > > > > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU > > > > > > kernel driver, error handling, roll-back and locked memory size per user, etc. > > > > > > > > > > Particularly, handling of mapping changes is completely missing. This > > > > > cannot be a point in time translation, the user is free to remap > > > > > addresses whenever they wish and device translations need to be updated > > > > > accordingly. > > > > > > > > > > > > > When you say "user", do you mean the QEMU? > > > > > > vfio is a generic userspace driver interface, QEMU is a very, very > > > important user of the interface, but not the only user. So for this > > > conversation, we're mostly talking about QEMU as the user, but we should > > > be careful about assuming QEMU is the only user. > > > > > > > Understood. I have to say that our focus at this moment is to support QEMU and > > KVM, but I know VFIO interface is much more than that, and that is why I think > > it is right to leverage this framework so we can together explore future use > > case in the userland. > > > > > > > > Here, whenever the DMA that > > > > the guest driver is going to launch will be first pinned within VM, and then > > > > registered to QEMU, therefore the IOMMU memory listener, eventually the pages > > > > will be pinned by the GPU or DMA engine. > > > > > > > > Since we are keeping the upper level code same, thinking about passthru case, > > > > where the GPU has already put the real IOVA into his PTEs, I don't know how QEMU > > > > can change that mapping without causing an IOMMU fault on a active DMA device. 
> > > > > > For the virtual BAR mapping above, it's easy to imagine that mapping a > > > BAR to a given address is at the guest discretion, it may be mapped and > > > unmapped, it may be mapped to different addresses at different points in > > > time, the guest BIOS may choose to map it at yet another address, etc. > > > So if somehow we were trying to setup a mapping for peer-to-peer, there > > > are lots of ways that IOVA could change. But even with RAM, we can > > > support memory hotplug in a VM. What was once a DMA target may be > > > removed or may now be backed by something else. Chipset configuration > > > on the emulated platform may change how guest physical memory appears > > > and that might change between VM boots. > > > > > > Currently with physical device assignment the memory listener watches > > > for both maps and unmaps and updates the iotlb to match. Just like real > > > hardware doing these same sorts of things, we rely on the guest to stop > > > using memory that's going to be moved as a DMA target prior to moving > > > it. > > > > Right, you can only do that when the device is quiescent. > > > > As long as this will be notified to the guest, I think we should be able to > > support it although the real implementation will depend on how the device gets into > > quiescent state. > > > > This is definitely a very interesting feature we should explore, but I hope we > > probably can first focus on the most basic functionality. > > If we only do a point-in-time translation and assume it never changes, > that's good enough for a proof of concept, but it's not a complete > solution. I think this is practical problem, not just an academic > problem. There needs to be a mechanism mappings to be invalidated based > on VM memory changes. Thanks, > Sorry, probably my previous comment is not very clear. I highly value your input and the information related to the memory hotplug scenarios, and I never exclude the support of such feature. 
The only question is when; that is why I would like to defer the VM memory
hotplug feature to phase 2, after the initial official launch.

Thanks,
Neo

> Alex
>
On Thu, 2016-01-28 at 02:25 +0530, Kirti Wankhede wrote: > > On 1/27/2016 9:30 PM, Alex Williamson wrote: > > On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote: > > > > > > On 1/27/2016 1:36 AM, Alex Williamson wrote: > > > > On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote: > > > > > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote: > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com] > > > > > > > > > > Hi Alex, Kevin and Jike, > > > > > > > > > > (Seems I shouldn't use attachment, resend it again to the list, patches are > > > > > inline at the end) > > > > > > > > > > Thanks for adding me to this technical discussion, a great opportunity > > > > > for us to design together which can bring both Intel and NVIDIA vGPU solution to > > > > > KVM platform. > > > > > > > > > > Instead of directly jumping to the proposal that we have been working on > > > > > recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple > > > > > quick comments / thoughts regarding the existing discussions on this thread as > > > > > fundamentally I think we are solving the same problem, DMA, interrupt and MMIO. > > > > > > > > > > Then we can look at what we have, hopefully we can reach some consensus soon. > > > > > > > > > > > Yes, and since you're creating and destroying the vgpu here, this is > > > > > > where I'd expect a struct device to be created and added to an IOMMU > > > > > > group. The lifecycle management should really include links between > > > > > > the vGPU and physical GPU, which would be much, much easier to do with > > > > > > struct devices create here rather than at the point where we start > > > > > > doing vfio "stuff". > > > > > > > > > > Infact to keep vfio-vgpu to be more generic, vgpu device creation and management > > > > > can be centralized and done in vfio-vgpu. That also include adding to IOMMU > > > > > group and VFIO group. > > > > Is this really a good idea? 
The concept of a vgpu is not unique to > > > > vfio, we want vfio to be a driver for a vgpu, not an integral part of > > > > the lifecycle of a vgpu. That certainly doesn't exclude adding > > > > infrastructure to make lifecycle management of a vgpu more consistent > > > > between drivers, but it should be done independently of vfio. I'll go > > > > back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio > > > > does not create the VF, that's done in coordination with the PF making > > > > use of some PCI infrastructure for consistency between drivers. > > > > > > > > It seems like we need to take more advantage of the class and driver > > > > core support to perhaps setup a vgpu bus and class with vfio-vgpu just > > > > being a driver for those devices. > > > > > > For device passthrough or SR-IOV model, PCI devices are created by PCI > > > bus driver and from the probe routine each device is added in vfio group. > > > > An SR-IOV VF is created by the PF driver using standard interfaces > > provided by the PCI core. The IOMMU group for a VF is added by the > > IOMMU driver when the device is created on the pci_bus_type. The probe > > routine of the vfio bus driver (vfio-pci) is what adds the device into > > the vfio group. > > > > > For vgpu, there should be a common module that create vgpu device, say > > > vgpu module, add vgpu device to an IOMMU group and then add it to vfio > > > group. This module can handle management of vgpus. Advantage of keeping > > > this module a separate module than doing device creation in vendor > > > modules is to have generic interface for vgpu management, for example, > > > files /sys/class/vgpu/vgpu_start and /sys/class/vgpu/vgpu_shudown and > > > vgpu driver registration interface. > > > > But you're suggesting something very different from the SR-IOV model. > > If we wanted to mimic that model, the GPU specific driver should create > > the vgpu using services provided by a common interface. 
For instance > > i915 could call a new vgpu_device_create() which creates the device, > > adds it to the vgpu class, etc. That vgpu device should not be assumed > > to be used with vfio though, that should happen via a separate probe > > using a vfio-vgpu driver. It's that vfio bus driver that will add the > > device to a vfio group. > > > > In that case vgpu driver should provide a driver registration interface > to register vfio-vgpu driver. > > struct vgpu_driver { > const char *name; > int (*probe) (struct vgpu_device *vdev); > void (*remove) (struct vgpu_device *vdev); > } > > int vgpu_register_driver(struct vgpu_driver *driver) > { > ... > } > EXPORT_SYMBOL(vgpu_register_driver); > > int vgpu_unregister_driver(struct vgpu_driver *driver) > { > ... > } > EXPORT_SYMBOL(vgpu_unregister_driver); > > vfio-vgpu driver registers to vgpu driver. Then from > vgpu_device_create(), after creating the device it calls > vgpu_driver->probe(vgpu_device) and vfio-vgpu driver adds the device to > vfio group. > > +--------------+ vgpu_register_driver()+---------------+ > > __init() +------------------------->+ | > > | | | > > +<-------------------------+ vgpu.ko | > > vfio_vgpu.ko | probe()/remove() | | > > | +---------+ +---------+ > +--------------+ | +-------+-------+ | > | ^ | > | callback | | > | +-------+--------+ | > | |vgpu_register_device() | > | | | | > +---^-----+-----+ +-----+------+-+ > | nvidia.ko | | i915.ko | > | | | | > +-----------+ +------------+ > > Is my understanding correct? We have an entire driver core subsystem in Linux for the purpose of matching devices to drivers, I don't think we should be re-inventing that. That's why I'm suggesting that we should have infrastructure which facilitates GPU drivers to create vGPU devices in a common way, perhaps even placing the devices on a virtual vgpu bus, and then allow a vfio-vgpu driver to register as a driver for devices of that bus/class and use the existing driver callbacks. Thanks, Alex
On 1/28/2016 3:28 AM, Alex Williamson wrote: > On Thu, 2016-01-28 at 02:25 +0530, Kirti Wankhede wrote: >> >> On 1/27/2016 9:30 PM, Alex Williamson wrote: >>> On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote: >>>> >>>> On 1/27/2016 1:36 AM, Alex Williamson wrote: >>>>> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote: >>>>>> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote: >>>>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com] >>>>>> >>>>>> Hi Alex, Kevin and Jike, >>>>>> >>>>>> (Seems I shouldn't use attachment, resend it again to the list, patches are >>>>>> inline at the end) >>>>>> >>>>>> Thanks for adding me to this technical discussion, a great opportunity >>>>>> for us to design together which can bring both Intel and NVIDIA vGPU solution to >>>>>> KVM platform. >>>>>> >>>>>> Instead of directly jumping to the proposal that we have been working on >>>>>> recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple >>>>>> quick comments / thoughts regarding the existing discussions on this thread as >>>>>> fundamentally I think we are solving the same problem, DMA, interrupt and MMIO. >>>>>> >>>>>> Then we can look at what we have, hopefully we can reach some consensus soon. >>>>>> >>>>>>> Yes, and since you're creating and destroying the vgpu here, this is >>>>>>> where I'd expect a struct device to be created and added to an IOMMU >>>>>>> group. The lifecycle management should really include links between >>>>>>> the vGPU and physical GPU, which would be much, much easier to do with >>>>>>> struct devices create here rather than at the point where we start >>>>>>> doing vfio "stuff". >>>>>> >>>>>> Infact to keep vfio-vgpu to be more generic, vgpu device creation and management >>>>>> can be centralized and done in vfio-vgpu. That also include adding to IOMMU >>>>>> group and VFIO group. >>>>> Is this really a good idea? 
The concept of a vgpu is not unique to >>>>> vfio, we want vfio to be a driver for a vgpu, not an integral part of >>>>> the lifecycle of a vgpu. That certainly doesn't exclude adding >>>>> infrastructure to make lifecycle management of a vgpu more consistent >>>>> between drivers, but it should be done independently of vfio. I'll go >>>>> back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio >>>>> does not create the VF, that's done in coordination with the PF making >>>>> use of some PCI infrastructure for consistency between drivers. >>>>> >>>>> It seems like we need to take more advantage of the class and driver >>>>> core support to perhaps setup a vgpu bus and class with vfio-vgpu just >>>>> being a driver for those devices. >>>> >>>> For device passthrough or SR-IOV model, PCI devices are created by PCI >>>> bus driver and from the probe routine each device is added in vfio group. >>> >>> An SR-IOV VF is created by the PF driver using standard interfaces >>> provided by the PCI core. The IOMMU group for a VF is added by the >>> IOMMU driver when the device is created on the pci_bus_type. The probe >>> routine of the vfio bus driver (vfio-pci) is what adds the device into >>> the vfio group. >>> >>>> For vgpu, there should be a common module that create vgpu device, say >>>> vgpu module, add vgpu device to an IOMMU group and then add it to vfio >>>> group. This module can handle management of vgpus. Advantage of keeping >>>> this module a separate module than doing device creation in vendor >>>> modules is to have generic interface for vgpu management, for example, >>>> files /sys/class/vgpu/vgpu_start and /sys/class/vgpu/vgpu_shudown and >>>> vgpu driver registration interface. >>> >>> But you're suggesting something very different from the SR-IOV model. >>> If we wanted to mimic that model, the GPU specific driver should create >>> the vgpu using services provided by a common interface. 
For instance >>> i915 could call a new vgpu_device_create() which creates the device, >>> adds it to the vgpu class, etc. That vgpu device should not be assumed >>> to be used with vfio though, that should happen via a separate probe >>> using a vfio-vgpu driver. It's that vfio bus driver that will add the >>> device to a vfio group. >>> >> >> In that case vgpu driver should provide a driver registration interface >> to register vfio-vgpu driver. >> >> struct vgpu_driver { >> const char *name; >> int (*probe) (struct vgpu_device *vdev); >> void (*remove) (struct vgpu_device *vdev); >> } >> >> int vgpu_register_driver(struct vgpu_driver *driver) >> { >> ... >> } >> EXPORT_SYMBOL(vgpu_register_driver); >> >> int vgpu_unregister_driver(struct vgpu_driver *driver) >> { >> ... >> } >> EXPORT_SYMBOL(vgpu_unregister_driver); >> >> vfio-vgpu driver registers to vgpu driver. Then from >> vgpu_device_create(), after creating the device it calls >> vgpu_driver->probe(vgpu_device) and vfio-vgpu driver adds the device to >> vfio group. >> >> +--------------+ vgpu_register_driver()+---------------+ >>> __init() +------------------------->+ | >>> | | | >>> +<-------------------------+ vgpu.ko | >>> vfio_vgpu.ko | probe()/remove() | | >>> | +---------+ +---------+ >> +--------------+ | +-------+-------+ | >> | ^ | >> | callback | | >> | +-------+--------+ | >> | |vgpu_register_device() | >> | | | | >> +---^-----+-----+ +-----+------+-+ >> | nvidia.ko | | i915.ko | >> | | | | >> +-----------+ +------------+ >> >> Is my understanding correct? > > We have an entire driver core subsystem in Linux for the purpose of > matching devices to drivers, I don't think we should be re-inventing > that. 
> That's why I'm suggesting that we should have infrastructure
> which facilitates GPU drivers to create vGPU devices in a common way,
> perhaps even placing the devices on a virtual vgpu bus, and then allow a
> vfio-vgpu driver to register as a driver for devices of that bus/class
> and use the existing driver callbacks. Thanks,
>
> Alex
>

We will use the Linux driver core subsystem; my point is that we have to
introduce a vgpu module to provide such infrastructure to GPU drivers in a
common way. This module helps GPU drivers create vGPU devices and allows
the vfio-vgpu driver to register for vGPU devices.

Kirti.
==================================================================================

Here we are proposing a generic Linux kernel module based on the VFIO
framework which allows different GPU vendors to plug in and provide their
GPU virtualization solution on KVM. The benefits of having such a generic
kernel module are:

1) Reuse of the QEMU VFIO driver, supporting the VFIO UAPI

2) GPU HW agnostic management API for upper layer software such as libvirt

3) No duplicated VFIO kernel logic reimplemented by different GPU driver
   vendors

0. High level overview
==================================================================================

user space:
+-----------+ VFIO IOMMU IOCTLs
+---------| QEMU VFIO |-------------------------+
VFIO IOCTLs | +-----------+ |
| |
---------------------|-----------------------------------------------|---------
| |
kernel space: | +--->----------->---+ (callback) V
| | v +------V-----+
+----------+ +----V--^--+ +--+--+-----+ | VGPU |
| | | | +----| nvidia.ko +----->-----> TYPE1 IOMMU|
| VFIO Bus <===| VGPU.ko |<----| +-----------+ | +---++-------+
| | | | | (register) ^ ||
+----------+ +-------+--+ | +-----------+ | ||
V +----| i915.ko +-----+ +---VV-------+
| +-----^-----+ | TYPE1 |
| (callback) | | IOMMU |
+-->------------>---+ +------------+

access flow:

Guest MMIO / PCI config access
        |
        -------------------------------------------------
        |
        +-----> KVM VM_EXITs (kernel)
        |
        -------------------------------------------------
        |
        +-----> QEMU VFIO driver (user)
        |
        -------------------------------------------------
        |
        +----> VGPU kernel driver (kernel)
        |
        |
        +----> vendor driver callback

1. VGPU management interface
==================================================================================

This is the interface that allows upper layer software (mostly libvirt) to
query and configure virtual GPU devices in a HW agnostic fashion.
Also, this management interface gives the underlying GPU vendor the
flexibility to support virtual device hotplug, multiple virtual devices
per VM, multiple virtual devices from different physical devices, etc.

1.1 Under per-physical device sysfs:
----------------------------------------------------------------------------------

vgpu_supported_types - RO, lists the currently supported virtual GPU types
                       and their VGPU_ID. VGPU_ID - a vGPU type identifier
                       returned from reads of "vgpu_supported_types".

vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, creates a virtual
              gpu device on a target physical GPU. idx: virtual device
              index inside a VM

vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroys a virtual gpu
               device on a target physical GPU

1.3 Under vgpu class sysfs:
----------------------------------------------------------------------------------

vgpu_start - WO, input syntax <VM_UUID>, triggers the registration
             interface to notify the GPU vendor driver to commit virtual
             GPU resources for this target VM. vgpu_start is a
             synchronous call; its successful return indicates that all
             the requested vGPU resources have been fully committed, and
             the VMM should continue.

vgpu_shutdown - WO, input syntax <VM_UUID>, triggers the registration
                interface to notify the GPU vendor driver to release the
                virtual GPU resources of this target VM.

1.4 Virtual device Hotplug
----------------------------------------------------------------------------------

To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
accessed during VM runtime, and the corresponding registration callback
will be invoked to allow the GPU vendor to support hotplug.

To support hotplug, the vendor driver should take the necessary action to
handle the case where vgpu_create is invoked for a VM_UUID after
vgpu_start; that implies both create and start for that vgpu device.
Similarly, vgpu_destroy implies a vgpu_shutdown on a running VM, but only
if the vendor driver supports vgpu hotplug. If hotplug is not supported
and the VM is still running, the vendor driver can return an error code to
indicate that it is not supported.

Separating create from start gives the flexibility to have:
- multiple vgpu instances for a single VM, and
- the hotplug feature.

2. GPU driver vendor registration interface
==================================================================================

2.1 Registration interface definition (include/linux/vgpu.h)
----------------------------------------------------------------------------------

extern int vgpu_register_device(struct pci_dev *dev,
                                const struct gpu_device_ops *ops);
extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical
 * GPU to register the device to the vgpu module.
 *
 * @owner:                 The module owner.
 * @vgpu_supported_config: Called to get information about supported vgpu
 *                         types.
 *                         @dev: pci device structure of physical GPU.
 *                         @config: should return a string listing the
 *                         supported configs.
 *                         Returns integer: success (0) or error (< 0)
 * @vgpu_create:           Called to allocate basic resources in the
 *                         graphics driver for a particular vgpu.
 *                         @dev: physical pci device structure on which
 *                         the vgpu should be created
 *                         @vm_uuid: uuid of the VM the vgpu is intended
 *                         for
 *                         @instance: vgpu instance in that VM
 *                         @vgpu_id: the type of vgpu to be created
 *                         Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:          Called to free resources in the graphics driver
 *                         for a vgpu instance of that VM.
 *                         @dev: physical pci device structure this vgpu
 *                         points to.
 *                         @vm_uuid: uuid of the VM the vgpu belongs to.
 *                         @instance: vgpu instance in that VM
 *                         Returns integer: success (0) or error (< 0)
 *                         If the VM is running when vgpu_destroy is
 *                         called, the vGPU is being hot-unplugged. Return
 *                         an error if the VM is running and the graphics
 *                         driver doesn't support vgpu hotplug.
 * @vgpu_start:            Called to initiate the vGPU initialization
 *                         process in the graphics driver when the VM
 *                         boots, before qemu starts.
 *                         @vm_uuid: UUID of the VM which is booting.
 *                         Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:         Called to tear down vGPU related resources for
 *                         the VM.
 *                         @vm_uuid: UUID of the VM which is shutting
 *                         down.
 *                         Returns integer: success (0) or error (< 0)
 * @read:                  Read emulation callback.
 *                         @vdev: vgpu device structure
 *                         @buf: read buffer
 *                         @count: number of bytes to read
 *                         @address_space: specifies which address space
 *                         the request is for: pci_config_space, IO
 *                         register space or MMIO space.
 *                         Returns number of bytes read on success or
 *                         error.
 * @write:                 Write emulation callback.
 *                         @vdev: vgpu device structure
 *                         @buf: write buffer
 *                         @count: number of bytes to be written
 *                         @address_space: specifies which address space
 *                         the request is for: pci_config_space, IO
 *                         register space or MMIO space.
 *                         Returns number of bytes written on success or
 *                         error.
 * @vgpu_set_irqs:         Called to send the interrupt configuration
 *                         information that qemu has set.
 *                         @vdev: vgpu device structure
 *                         @flags, index, start, count and *data: same as
 *                         those of struct vfio_irq_set of the
 *                         VFIO_DEVICE_SET_IRQS API.
 *
 * A physical GPU that supports vGPU should be registered with the vgpu
 * module with a gpu_device_ops structure.
 */

struct gpu_device_ops {
	struct module *owner;
	int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
	int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
			       uint32_t instance, uint32_t vgpu_id);
	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
				uint32_t instance);
	int     (*vgpu_start)(uuid_le vm_uuid);
	int     (*vgpu_shutdown)(uuid_le vm_uuid);
	ssize_t (*read)(struct vgpu_device *vdev, char *buf, size_t count,
			uint32_t address_space, loff_t pos);
	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
			 uint32_t address_space, loff_t pos);
	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
				 unsigned index, unsigned start,
				 unsigned count, void *data);
};

2.2 Details for callbacks we haven't mentioned above.
---------------------------------------------------------------------------------

vgpu_supported_config: allows the vendor driver to specify the supported
                       vGPU types/configurations

vgpu_create:   create a virtual GPU device; can be used for device hotplug.

vgpu_destroy:  destroy a virtual GPU device; can be used for device hotplug.

vgpu_start:    callback to notify the vendor driver that a vgpu device has
               come alive for a given virtual machine.

vgpu_shutdown: callback to notify the vendor driver to tear down the vgpu
               device for a given virtual machine.

read:  callback to the vendor driver to handle a virtual device config
       space or MMIO read access

write: callback to the vendor driver to handle a virtual device config
       space or MMIO write access

vgpu_set_irqs: callback to the vendor driver to pass along the interrupt
               information for the target virtual device; the vendor
               driver can then inject interrupts into the virtual machine
               for this device.

2.3 Potential additional virtual device configuration registration interface:
---------------------------------------------------------------------------------

- callback function to describe the MMAP behavior of the virtual GPU
- callback function to allow the GPU vendor driver to provide PCI config
  space backing memory

3. VGPU TYPE1 IOMMU
==================================================================================

Here we are providing a TYPE1 IOMMU for vGPU which will basically keep
track of the <iova, hva, size, flag> tuples and save the QEMU mm for later
reference.

You can find the quick/ugly implementation in the attached patch file,
which is essentially a simplified version of Alex's type1 IOMMU, without
doing real mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called.

We thought about providing another vendor driver registration interface so
that such tracking information would be sent to the vendor driver, which
would then use the QEMU mm to do the get_user_pages / remap_pfn_range when
required. After doing a quick implementation within our driver, we noticed
the following issues:

1) It pushes OS/VFIO logic into the vendor driver, which will be a
   maintenance issue.

2) Every driver vendor has to implement their own RB tree, instead of
   reusing the existing common VFIO code (vfio_find/link/unlink_dma).

3) IOMMU_UNMAP_DMA is expected to return "unmapped bytes" to the
   caller/QEMU; better not to have anything inside a vendor driver that
   the VFIO caller immediately depends on.

Based on the above considerations, we decided to implement the DMA
tracking logic within the VGPU TYPE1 IOMMU code (ideally, this should be
merged into the current TYPE1 IOMMU code) and expose two symbols to the
outside for MMIO mapping and for page translation and pinning.

Also, with a mmap MMIO interface between virtual and physical, a
para-virtualized guest driver can access its virtual MMIO without taking a
MMAP fault hit, and we can support different MMIO sizes between the
virtual and physical device.
int vgpu_map_virtual_bar
(
	uint64_t virt_bar_addr,
	uint64_t phys_bar_addr,
	uint32_t len,
	uint32_t flags
)
EXPORT_SYMBOL(vgpu_map_virtual_bar);

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
EXPORT_SYMBOL(vgpu_dma_do_translate);

There is still a lot to be added and modified, such as supporting multiple
VMs and multiple virtual devices, tracking the mapped / pinned regions
within the VGPU IOMMU kernel driver, error handling, rollback, locked
memory size per user, etc.

4. Modules
==================================================================================

Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko

vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU TYPE1
                           v1 and v2 interface.

vgpu.ko - provides the registration interface and virtual device VFIO
          access.

5. QEMU note
==================================================================================

To let us focus on prototyping the VGPU kernel driver, we have introduced
a new VFIO class - vgpu - inside QEMU, so we don't have to change the
existing vfio/pci.c file, which we use as a reference for our
implementation. It is basically a quick c & p from vfio/pci.c to meet our
needs.

Once this proposal is finalized, we will move to vfio/pci.c instead of a
new class, and probably the only thing required will be a new way to
discover the device.

6. Examples
==================================================================================

On this server, we have two NVIDIA M60 GPUs.
[root@cjia-vgx-kvm ~]# lspci -d 10de:13f2 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1) 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1) After nvidia.ko gets initialized, we can query the supported vGPU type by accessing the "vgpu_supported_types" like following: [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 11:GRID M60-0B 12:GRID M60-0Q 13:GRID M60-1B 14:GRID M60-1Q 15:GRID M60-2B 16:GRID M60-2Q 17:GRID M60-4Q 18:GRID M60-8Q For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would like to create "GRID M60-4Q" VM on it. echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create Note: the number 0 here is for vGPU device index. So far the change is not tested for multiple vgpu devices yet, but we will support it. At this moment, if you query the "vgpu_supported_types" it will still show all supported virtual GPU types as no virtual GPU resource is committed yet. Starting VM: echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start then, the supported vGPU type query will return: [root@cjia-vgx-kvm /home/cjia]$ > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 17:GRID M60-4Q So vgpu_supported_config needs to be called whenever a new virtual device gets created as the underlying HW might limit the supported types if there are any existing VM runnings. Then, VM gets shutdown, writes to /sys/class/vgpu/vgpu_shutdown will info the GPU driver vendor to clean up resource. Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under device sysfs. 7. 
What is not covered
==================================================================================

7.1 QEMU console VNC

QEMU console VNC is not covered in this RFC, as it is a fairly isolated module
that does not impact basic vGPU functionality; also, we already had a good
discussion about the new VFIO interface that Alex is going to introduce to let
us describe a region for the VM surface.

8. Patches
==================================================================================

0001-Add-VGPU-VFIO-driver-class-support-in-QEMU.patch - against QEMU 2.5.0
0001-Add-VGPU-and-its-TYPE1-IOMMU-kernel-module-support.patch - against 4.4.0-rc5

Thanks,
Kirti and Neo

From dc8ca387f7b06c6dfc85fb4bd79a760dca76e831 Mon Sep 17 00:00:00 2001
From: Neo Jia <cjia@nvidia.com>
Date: Tue, 26 Jan 2016 01:21:11 -0800
Subject: [PATCH] Add VGPU and its TYPE1 IOMMU kernel module support

This is just a quick proof-of-concept implementation that allows a GPU driver
vendor to plug into the VFIO framework to provide virtual GPU support. This
kernel module provides a registration interface for GPU vendors and generic
DMA tracking APIs.

extern int vgpu_register_device(struct pci_dev *dev,
                                const struct gpu_device_ops *ops);
extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to the vgpu module.
 *
 * @owner:			The module owner.
 * @vgpu_supported_config:	Called to get information about supported
 *				vgpu types.
 *				@dev: pci device structure of physical GPU.
 *				@config: should return a string listing
 *				supported configs.
 *				Returns integer: success (0) or error (< 0)
 * @vgpu_create:		Called to allocate basic resources in the
 *				graphics driver for a particular vgpu.
 *				@dev: physical pci device structure on which
 *				the vgpu should be created.
 *				@vm_uuid: uuid of the VM for which this vgpu
 *				is intended.
 *				@instance: vgpu instance in that VM.
 *				@vgpu_id: the type of vgpu to be created.
 *				Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:		Called to free resources in the graphics
 *				driver for a vgpu instance of that VM.
 *				@dev: physical pci device structure to which
 *				this vgpu points.
 *				@vm_uuid: uuid of the VM to which the vgpu
 *				belongs.
 *				@instance: vgpu instance in that VM.
 *				Returns integer: success (0) or error (< 0)
 *				If the VM is running and vgpu_destroy is
 *				called, the vGPU is being hot-unplugged.
 *				Return an error if the VM is running and the
 *				graphics driver doesn't support vgpu hotplug.
 * @vgpu_start:			Called to initiate the vGPU initialization
 *				process in the graphics driver when the VM
 *				boots, before qemu starts.
 *				@vm_uuid: uuid of the VM which is booting.
 *				Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:		Called to tear down vGPU-related resources
 *				for the VM.
 *				@vm_uuid: uuid of the VM which is shutting
 *				down.
 *				Returns integer: success (0) or error (< 0)
 * @read:			Read emulation callback.
 *				@vdev: vgpu device structure
 *				@buf: read buffer
 *				@count: number of bytes to read
 *				@address_space: specifies for which address
 *				space the request is: pci_config_space, IO
 *				register space or MMIO space.
 *				Returns number of bytes read on success, or
 *				an error.
 * @write:			Write emulation callback.
 *				@vdev: vgpu device structure
 *				@buf: write buffer
 *				@count: number of bytes to be written
 *				@address_space: specifies for which address
 *				space the request is: pci_config_space, IO
 *				register space or MMIO space.
 *				Returns number of bytes written on success,
 *				or an error.
 * @vgpu_set_irqs:		Called to pass the interrupt configuration
 *				information that qemu set.
 *				@vdev: vgpu device structure
 *				@flags, index, start, count and *data: same
 *				as in struct vfio_irq_set of the
 *				VFIO_DEVICE_SET_IRQS API.
 *
 * A physical GPU that supports vGPU should be registered with the vgpu module
 * with a gpu_device_ops structure.
 */

struct gpu_device_ops {
	struct module *owner;
	int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
	int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
			       uint32_t instance, uint32_t vgpu_id);
	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
				uint32_t instance);
	int     (*vgpu_start)(uuid_le vm_uuid);
	int     (*vgpu_shutdown)(uuid_le vm_uuid);
	ssize_t (*read)(struct vgpu_device *vdev, char *buf, size_t count,
			uint32_t address_space, loff_t pos);
	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
			 uint32_t address_space, loff_t pos);
	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
				 unsigned index, unsigned start,
				 unsigned count, void *data);
};

int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
			 uint32_t len, uint32_t flags);

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count);

Change-Id: Ib70304d9a600c311d5107a94b3fffa938926275b
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
---
 drivers/Kconfig                      |   2 +
 drivers/Makefile                     |   1 +
 drivers/vfio/vfio.c                  |   5 +-
 drivers/vgpu/Kconfig                 |  26 ++
 drivers/vgpu/Makefile                |   5 +
 drivers/vgpu/vfio_iommu_type1_vgpu.c | 511 ++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_dev.c              | 550 +++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_private.h          |  47 +++
 drivers/vgpu/vgpu_sysfs.c            | 322 ++++++++++++++++++++
 drivers/vgpu/vgpu_vfio.c             | 521 +++++++++++++++++++++++++++++++++
 include/linux/vgpu.h                 | 157 ++++++++++
 11 files changed, 2144 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vfio_iommu_type1_vgpu.c
 create mode 100644 drivers/vgpu/vgpu_dev.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 drivers/vgpu/vgpu_sysfs.c
 create mode 100644 drivers/vgpu/vgpu_vfio.c
 create mode 100644 include/linux/vgpu.h
diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339de85f..5fd9eae79914 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 795d0ca714bf..142256b4358b 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
 obj-$(CONFIG_VFIO)		+= vfio/
+obj-$(CONFIG_VGPU)		+= vgpu/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6070b793cbcb..af3ab413e119 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -947,19 +947,18 @@ static long vfio_ioctl_set_iommu(struct vfio_container *container,
 		if (IS_ERR(data)) {
 			ret = PTR_ERR(data);
 			module_put(driver->ops->owner);
-			goto skip_drivers_unlock;
+			continue;
 		}
 
 		ret = __vfio_container_attach_groups(container, driver, data);
 		if (!ret) {
 			container->iommu_driver = driver;
 			container->iommu_data = data;
+			goto skip_drivers_unlock;
 		} else {
 			driver->ops->release(data);
 			module_put(driver->ops->owner);
 		}
-
-		goto skip_drivers_unlock;
 	}
 
 	mutex_unlock(&vfio.iommu_drivers_lock);
diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
new file mode 100644
index 000000000000..698ddf907a16
--- /dev/null
+++ b/drivers/vgpu/Kconfig
@@ -0,0 +1,26 @@
+
+menuconfig VGPU
+	tristate "VGPU driver framework"
+	depends on VFIO
+	select VGPU_VFIO
+	select VFIO_IOMMU_TYPE1_VGPU
+	help
+	  VGPU provides a framework to virtualize GPU without SR-IOV cap
+	  See Documentation/vgpu.txt for more details.
+
+	  If you don't know what to do here, say N.
+
+config VGPU
+	tristate
+	depends on VFIO
+	default n
+
+config VGPU_VFIO
+	tristate
+	depends on VGPU
+	default n
+
+config VFIO_IOMMU_TYPE1_VGPU
+	tristate
+	depends on VGPU_VFIO
+	default n
diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
new file mode 100644
index 000000000000..098a3591a535
--- /dev/null
+++ b/drivers/vgpu/Makefile
@@ -0,0 +1,5 @@
+
+vgpu-y := vgpu_sysfs.o vgpu_dev.o vgpu_vfio.o
+
+obj-$(CONFIG_VGPU)			+= vgpu.o
+obj-$(CONFIG_VFIO_IOMMU_TYPE1_VGPU)	+= vfio_iommu_type1_vgpu.o
diff --git a/drivers/vgpu/vfio_iommu_type1_vgpu.c b/drivers/vgpu/vfio_iommu_type1_vgpu.c
new file mode 100644
index 000000000000..6b20f1374b3b
--- /dev/null
+++ b/drivers/vgpu/vfio_iommu_type1_vgpu.c
@@ -0,0 +1,511 @@
+/*
+ * VGPU : IOMMU DMA mapping support for VGPU
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *             Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC	"VGPU Type1 IOMMU driver for VFIO"
+
+/* VFIO structures */
+
+struct vfio_iommu_vgpu {
+	struct mutex lock;
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+	struct rb_root dma_list;
+	struct mm_struct *vm_mm;
+};
+
+struct vgpu_vfio_dma {
+	struct rb_node node;
+	dma_addr_t iova;
+	unsigned long vaddr;
+	size_t size;
+	int prot;
+};
+
+/*
+ * VGPU VFIO FOPs definition
+ */
+
+/*
+ * Duplicated from vfio_link_dma, just a quick hack ... should
+ * reuse code later
+ */
+
+static void vgpu_link_dma(struct vfio_iommu_vgpu *iommu,
+			  struct vgpu_vfio_dma *new)
+{
+	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
+	struct vgpu_vfio_dma *dma;
+
+	while (*link) {
+		parent = *link;
+		dma = rb_entry(parent, struct vgpu_vfio_dma, node);
+
+		if (new->iova + new->size <= dma->iova)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &iommu->dma_list);
+}
+
+static struct vgpu_vfio_dma *vgpu_find_dma(struct vfio_iommu_vgpu *iommu,
+					   dma_addr_t start, size_t size)
+{
+	struct rb_node *node = iommu->dma_list.rb_node;
+
+	while (node) {
+		struct vgpu_vfio_dma *dma = rb_entry(node, struct vgpu_vfio_dma, node);
+
+		if (start + size <= dma->iova)
+			node = node->rb_left;
+		else if (start >= dma->iova + dma->size)
+			node = node->rb_right;
+		else
+			return dma;
+	}
+
+	return NULL;
+}
+
+static void vgpu_unlink_dma(struct vfio_iommu_vgpu *iommu, struct vgpu_vfio_dma *old)
+{
+	rb_erase(&old->node,
		 &iommu->dma_list);
+}
+
+static void vgpu_dump_dma(struct vfio_iommu_vgpu *iommu)
+{
+	struct vgpu_vfio_dma *c, *n;
+	uint32_t i = 0;
+
+	rbtree_postorder_for_each_entry_safe(c, n, &iommu->dma_list, node)
+		printk(KERN_INFO "%s: dma[%d] iova:0x%llx, vaddr:0x%lx, size:0x%lx\n",
+		       __FUNCTION__, i++, c->iova, c->vaddr, c->size);
+}
+
+static int vgpu_dma_do_track(struct vfio_iommu_vgpu *vgpu_iommu,
+			     struct vfio_iommu_type1_dma_map *map)
+{
+	dma_addr_t iova = map->iova;
+	unsigned long vaddr = map->vaddr;
+	int ret = 0, prot = 0;
+	struct vgpu_vfio_dma *vgpu_dma;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	if (vgpu_find_dma(vgpu_iommu, map->iova, map->size)) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -EEXIST;
+	}
+
+	vgpu_dma = kzalloc(sizeof(*vgpu_dma), GFP_KERNEL);
+	if (!vgpu_dma) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -ENOMEM;
+	}
+
+	vgpu_dma->iova = iova;
+	vgpu_dma->vaddr = vaddr;
+	vgpu_dma->prot = prot;
+	vgpu_dma->size = map->size;
+
+	vgpu_link_dma(vgpu_iommu, vgpu_dma);
+
+	mutex_unlock(&vgpu_iommu->lock);
+	return ret;
+}
+
+static int vgpu_dma_do_untrack(struct vfio_iommu_vgpu *vgpu_iommu,
+			       struct vfio_iommu_type1_dma_unmap *unmap)
+{
+	struct vgpu_vfio_dma *vgpu_dma;
+	size_t unmapped = 0;
+	int ret = 0;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, 0);
+	if (vgpu_dma && vgpu_dma->iova != unmap->iova) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova + unmap->size - 1, 0);
+	if (vgpu_dma && vgpu_dma->iova + vgpu_dma->size != unmap->iova + unmap->size) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	while ((vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, unmap->size))) {
+		unmapped += vgpu_dma->size;
+		vgpu_unlink_dma(vgpu_iommu, vgpu_dma);
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	unmap->size = unmapped;
+
+	return ret;
+}
+
+/* Ugly hack to quickly test a single device ...
*/ + +static struct vfio_iommu_vgpu *_local_iommu = NULL; + +int vgpu_map_virtual_bar +( + uint64_t virt_bar_addr, + uint64_t phys_bar_addr, + uint32_t len, + uint32_t flags +) +{ + struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu; + unsigned long remote_vaddr = 0; + struct vgpu_vfio_dma *vgpu_dma = NULL; + struct vm_area_struct *remote_vma = NULL; + struct mm_struct *mm = vgpu_iommu->vm_mm; + int ret = 0; + + printk(KERN_INFO "%s: >>>>\n", __FUNCTION__); + + mutex_lock(&vgpu_iommu->lock); + + vgpu_dump_dma(vgpu_iommu); + + down_write(&mm->mmap_sem); + + vgpu_dma = vgpu_find_dma(vgpu_iommu, virt_bar_addr, len /* size */); + if (!vgpu_dma) { + printk(KERN_INFO "%s: fail locate guest physical:0x%llx\n", + __FUNCTION__, virt_bar_addr); + ret = -EINVAL; + goto unlock; + } + + remote_vaddr = vgpu_dma->vaddr + virt_bar_addr - vgpu_dma->iova; + + remote_vma = find_vma(mm, remote_vaddr); + + if (remote_vma == NULL) { + printk(KERN_INFO "%s: fail locate vma, physical addr:0x%llx\n", + __FUNCTION__, virt_bar_addr); + ret = -EINVAL; + goto unlock; + } + else { + printk(KERN_INFO "%s: locate vma, addr:0x%lx\n", + __FUNCTION__, remote_vma->vm_start); + } + + remote_vma->vm_page_prot = pgprot_noncached(remote_vma->vm_page_prot); + + remote_vma->vm_pgoff = phys_bar_addr >> PAGE_SHIFT; + + ret = remap_pfn_range(remote_vma, virt_bar_addr, remote_vma->vm_pgoff, + len, remote_vma->vm_page_prot); + + if (ret) { + printk(KERN_INFO "%s: fail to remap vma:%d\n", __FUNCTION__, ret); + goto unlock; + } + +unlock: + + up_write(&mm->mmap_sem); + mutex_unlock(&vgpu_iommu->lock); + printk(KERN_INFO "%s: <<<<\n", __FUNCTION__); + + return ret; +} + +EXPORT_SYMBOL(vgpu_map_virtual_bar); + +int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count) +{ + int i = 0, ret = 0, prot = 0; + unsigned long remote_vaddr = 0, pfn = 0; + struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu; + struct vgpu_vfio_dma *vgpu_dma; + struct page *page[1]; + // unsigned long * addr = NULL; + struct mm_struct 
*mm = vgpu_iommu->vm_mm; + + prot = IOMMU_READ | IOMMU_WRITE; + + printk(KERN_INFO "%s: >>>>\n", __FUNCTION__); + + mutex_lock(&vgpu_iommu->lock); + + vgpu_dump_dma(vgpu_iommu); + + for (i = 0; i < count; i++) { + dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT; + vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /* size */); + + if (!vgpu_dma) { + printk(KERN_INFO "%s: fail locate iova[%d]:0x%llx\n", __FUNCTION__, i, iova); + ret = -EINVAL; + goto unlock; + } + + remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova; + printk(KERN_INFO "%s: find dma iova[%d]:0x%llx, vaddr:0x%lx, size:0x%lx, remote_vaddr:0x%lx\n", + __FUNCTION__, i, vgpu_dma->iova, + vgpu_dma->vaddr, vgpu_dma->size, remote_vaddr); + + if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) { + pfn = page_to_pfn(page[0]); + printk(KERN_INFO "%s: pfn[%d]:0x%lx\n", __FUNCTION__, i, pfn); + // addr = vmap(page, 1, VM_MAP, PAGE_KERNEL); + } + else { + printk(KERN_INFO "%s: fail to pin pfn[%d]\n", __FUNCTION__, i); + ret = -ENOMEM; + goto unlock; + } + + gfn_buffer[i] = pfn; + // vunmap(addr); + + } + +unlock: + mutex_unlock(&vgpu_iommu->lock); + printk(KERN_INFO "%s: <<<<\n", __FUNCTION__); + return ret; +} + +EXPORT_SYMBOL(vgpu_dma_do_translate); + + + + + + + + + + + + + + + + + + +static void *vfio_iommu_vgpu_open(unsigned long arg) +{ + struct vfio_iommu_vgpu *iommu; + + iommu = kzalloc(sizeof(*iommu), GFP_KERNEL); + + if (!iommu) + return ERR_PTR(-ENOMEM); + + mutex_init(&iommu->lock); + + printk(KERN_INFO "%s", __FUNCTION__); + + /* TODO: Keep track the v2 vs. 
v1, for now only assume + * we are v2 due to QEMU code */ + _local_iommu = iommu; + return iommu; +} + +static void vfio_iommu_vgpu_release(void *iommu_data) +{ + struct vfio_iommu_vgpu *iommu = iommu_data; + kfree(iommu); + printk(KERN_INFO "%s", __FUNCTION__); +} + +static long vfio_iommu_vgpu_ioctl(void *iommu_data, + unsigned int cmd, unsigned long arg) +{ + int ret = 0; + unsigned long minsz; + struct vfio_iommu_vgpu *vgpu_iommu = iommu_data; + + switch (cmd) { + case VFIO_CHECK_EXTENSION: + { + if ((arg == VFIO_TYPE1_IOMMU) || (arg == VFIO_TYPE1v2_IOMMU)) + return 1; + else + return 0; + } + + case VFIO_IOMMU_GET_INFO: + { + struct vfio_iommu_type1_info info; + minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes); + + if (copy_from_user(&info, (void __user *)arg, minsz)) + return -EFAULT; + + if (info.argsz < minsz) + return -EINVAL; + + info.flags = 0; + + return copy_to_user((void __user *)arg, &info, minsz); + } + case VFIO_IOMMU_MAP_DMA: + { + // TODO + struct vfio_iommu_type1_dma_map map; + minsz = offsetofend(struct vfio_iommu_type1_dma_map, size); + + if (copy_from_user(&map, (void __user *)arg, minsz)) + return -EFAULT; + + if (map.argsz < minsz) + return -EINVAL; + + printk(KERN_INFO "VGPU-IOMMU:MAP_DMA flags:%d, vaddr:0x%llx, iova:0x%llx, size:0x%llx\n", + map.flags, map.vaddr, map.iova, map.size); + + /* + * TODO: Tracking code is mostly duplicated from TYPE1 IOMMU, ideally, + * this should be merged into one single file and reuse data + * structure + * + */ + ret = vgpu_dma_do_track(vgpu_iommu, &map); + break; + } + case VFIO_IOMMU_UNMAP_DMA: + { + // TODO + struct vfio_iommu_type1_dma_unmap unmap; + + minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size); + + if (copy_from_user(&unmap, (void __user *)arg, minsz)) + return -EFAULT; + + if (unmap.argsz < minsz) + return -EINVAL; + + ret = vgpu_dma_do_untrack(vgpu_iommu, &unmap); + break; + } + default: + { + printk(KERN_INFO "%s cmd default ", __FUNCTION__); + ret = -ENOTTY; + 
break; + } + } + + return ret; +} + + +static int vfio_iommu_vgpu_attach_group(void *iommu_data, + struct iommu_group *iommu_group) +{ + struct vfio_iommu_vgpu *iommu = iommu_data; + struct vgpu_device *vgpu_dev = NULL; + + printk(KERN_INFO "%s", __FUNCTION__); + + vgpu_dev = get_vgpu_device_from_group(iommu_group); + if (vgpu_dev) { + iommu->vgpu_dev = vgpu_dev; + iommu->group = iommu_group; + + /* IOMMU shares the same life cylce as VM MM */ + iommu->vm_mm = current->mm; + + printk(KERN_INFO "%s index %d", __FUNCTION__, vgpu_dev->minor); + return 0; + } + iommu->group = iommu_group; + return 1; +} + +static void vfio_iommu_vgpu_detach_group(void *iommu_data, + struct iommu_group *iommu_group) +{ + struct vfio_iommu_vgpu *iommu = iommu_data; + + printk(KERN_INFO "%s", __FUNCTION__); + iommu->vm_mm = NULL; + iommu->group = NULL; + + return; +} + + +static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_driver_ops = { + .name = "vgpu_vfio", + .owner = THIS_MODULE, + .open = vfio_iommu_vgpu_open, + .release = vfio_iommu_vgpu_release, + .ioctl = vfio_iommu_vgpu_ioctl, + .attach_group = vfio_iommu_vgpu_attach_group, + .detach_group = vfio_iommu_vgpu_detach_group, +}; + + +int vgpu_vfio_iommu_init(void) +{ + int rc = vfio_register_iommu_driver(&vfio_iommu_vgpu_driver_ops); + + printk(KERN_INFO "%s\n", __FUNCTION__); + if (rc < 0) { + printk(KERN_ERR "Error: failed to register vfio iommu, err:%d\n", rc); + } + + return rc; +} + +void vgpu_vfio_iommu_exit(void) +{ + // unregister vgpu_vfio driver + vfio_unregister_iommu_driver(&vfio_iommu_vgpu_driver_ops); + printk(KERN_INFO "%s\n", __FUNCTION__); +} + + +module_init(vgpu_vfio_iommu_init); +module_exit(vgpu_vfio_iommu_exit); + +MODULE_VERSION(DRIVER_VERSION); +MODULE_LICENSE("GPL"); +MODULE_AUTHOR(DRIVER_AUTHOR); +MODULE_DESCRIPTION(DRIVER_DESC); + diff --git a/drivers/vgpu/vgpu_dev.c b/drivers/vgpu/vgpu_dev.c new file mode 100644 index 000000000000..1d4eb235122c --- /dev/null +++ b/drivers/vgpu/vgpu_dev.c @@ -0,0 
+1,550 @@ +/* + * VGPU core + * + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved. + * Author: Neo Jia <cjia@nvidia.com> + * Kirti Wankhede <kwankhede@nvidia.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/device.h> +#include <linux/kernel.h> +#include <linux/fs.h> +#include <linux/poll.h> +#include <linux/slab.h> +#include <linux/cdev.h> +#include <linux/sched.h> +#include <linux/wait.h> +#include <linux/uuid.h> +#include <linux/vfio.h> +#include <linux/iommu.h> +#include <linux/sysfs.h> +#include <linux/ctype.h> +#include <linux/vgpu.h> + +#include "vgpu_private.h" + +#define DRIVER_VERSION "0.1" +#define DRIVER_AUTHOR "NVIDIA Corporation" +#define DRIVER_DESC "VGPU driver" + +/* + * #defines + */ + +#define VGPU_CLASS_NAME "vgpu" + +#define VGPU_DEV_NAME "vgpu" + +// TODO remove these defines +// minor number reserved for control device +#define VGPU_CONTROL_DEVICE 0 + +#define VGPU_CONTROL_DEVICE_NAME "vgpuctl" + +/* + * Global Structures + */ + +static struct vgpu { + dev_t vgpu_devt; + struct class *class; + struct cdev vgpu_cdev; + struct list_head vgpu_devices_list; // Head entry for the doubly linked vgpu_device list + struct mutex vgpu_devices_lock; + struct idr vgpu_idr; + struct list_head gpu_devices_list; + struct mutex gpu_devices_lock; +} vgpu; + + +/* + * Function prototypes + */ + +static void vgpu_device_destroy(struct vgpu_device *vgpu_dev); + +unsigned int vgpu_poll(struct file *file, poll_table *wait); +long vgpu_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long i_arg); +int vgpu_mmap(struct file *file, struct vm_area_struct *vma); + +int vgpu_open(struct inode *inode, struct file *file); +int vgpu_close(struct inode *inode, struct file *file); +ssize_t vgpu_read(struct file *file, char 
__user * buf, + size_t len, loff_t * ppos); +ssize_t vgpu_write(struct file *file, const char __user *data, + size_t len, loff_t *ppos); + +/* + * Functions + */ + +struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group) +{ + + struct vgpu_device *vdev = NULL; + + mutex_lock(&vgpu.vgpu_devices_lock); + list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) { + if (vdev->group) { + if (iommu_group_id(vdev->group) == iommu_group_id(group)) { + mutex_unlock(&vgpu.vgpu_devices_lock); + return vdev; + } + } + } + mutex_unlock(&vgpu.vgpu_devices_lock); + return NULL; +} + +EXPORT_SYMBOL_GPL(get_vgpu_device_from_group); + +int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops) +{ + int ret = 0; + struct gpu_device *gpu_dev, *tmp; + + if (!dev) + return -EINVAL; + + gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL); + if (!gpu_dev) + return -ENOMEM; + + gpu_dev->dev = dev; + gpu_dev->ops = ops; + + mutex_lock(&vgpu.gpu_devices_lock); + + /* Check for duplicates */ + list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) { + if (tmp->dev == dev) { + mutex_unlock(&vgpu.gpu_devices_lock); + kfree(gpu_dev); + return -EINVAL; + } + } + + ret = vgpu_create_pci_device_files(dev); + if (ret) { + mutex_unlock(&vgpu.gpu_devices_lock); + kfree(gpu_dev); + return ret; + } + list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list); + + printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class); + mutex_unlock(&vgpu.gpu_devices_lock); + + return 0; +} +EXPORT_SYMBOL(vgpu_register_device); + +void vgpu_unregister_device(struct pci_dev *dev) +{ + struct gpu_device *gpu_dev; + + mutex_lock(&vgpu.gpu_devices_lock); + list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) { + if (gpu_dev->dev == dev) { + printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class); + vgpu_remove_pci_device_files(dev); + list_del(&gpu_dev->gpu_next); + 
mutex_unlock(&vgpu.gpu_devices_lock); + kfree(gpu_dev); + return; + } + } + mutex_unlock(&vgpu.gpu_devices_lock); +} +EXPORT_SYMBOL(vgpu_unregister_device); + + +/* + * Static functions + */ + +static struct file_operations vgpu_fops = { + .owner = THIS_MODULE, +}; + +static void vgpu_device_destroy(struct vgpu_device *vgpu_dev) +{ + if (vgpu_dev->dev) { + device_destroy(vgpu.class, vgpu_dev->dev->devt); + vgpu_dev->dev = NULL; + } +} + +/* + * Helper Functions + */ + +static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char *name) +{ + struct vgpu_device *vgpu_dev = NULL; + + vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL); + if (!vgpu_dev) + return ERR_PTR(-ENOMEM); + + kref_init(&vgpu_dev->kref); + memcpy(&vgpu_dev->vm_uuid, &uuid, sizeof(uuid_le)); + vgpu_dev->vgpu_instance = instance; + strcpy(vgpu_dev->dev_name, name); + + mutex_lock(&vgpu.vgpu_devices_lock); + list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list); + mutex_unlock(&vgpu.vgpu_devices_lock); + + return vgpu_dev; +} + +static void vgpu_device_free(struct vgpu_device *vgpu_dev) +{ + mutex_lock(&vgpu.vgpu_devices_lock); + list_del(&vgpu_dev->list); + mutex_unlock(&vgpu.vgpu_devices_lock); + kfree(vgpu_dev); +} + +struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance) +{ + struct vgpu_device *vdev = NULL; + + mutex_lock(&vgpu.vgpu_devices_lock); + list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) { + if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) && + (vdev->vgpu_instance == instance)) { + mutex_unlock(&vgpu.vgpu_devices_lock); + return vdev; + } + } + mutex_unlock(&vgpu.vgpu_devices_lock); + return NULL; +} + +struct vgpu_device *find_vgpu_device(struct device *dev) +{ + struct vgpu_device *vdev = NULL; + + mutex_lock(&vgpu.vgpu_devices_lock); + list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) { + if (vdev->dev == dev) { + mutex_unlock(&vgpu.vgpu_devices_lock); + return vdev; + } + } + + mutex_unlock(&vgpu.vgpu_devices_lock); + return NULL; 
+} + +int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id) +{ + int minor; + char name[64]; + int numChar = 0; + int retval = 0; + + struct iommu_group *group = NULL; + struct device *dev = NULL; + struct vgpu_device *vgpu_dev = NULL; + + struct gpu_device *gpu_dev; + + printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__); + + numChar = sprintf(name, "%pUb-%d", vm_uuid.b, instance); + name[numChar] = '\0'; + + vgpu_dev = vgpu_device_alloc(vm_uuid, instance, name); + if (IS_ERR(vgpu_dev)) { + return PTR_ERR(vgpu_dev); + } + + // check if VM device is present + // if not present, create with devt=0 and parent=NULL + // create device for instance with devt= MKDEV(vgpu.major, minor) + // and parent=VM device + + mutex_lock(&vgpu.vgpu_devices_lock); + + vgpu_dev->vgpu_id = vgpu_id; + + // TODO on removing control device change the 3rd parameter to 0 + minor = idr_alloc(&vgpu.vgpu_idr, vgpu_dev, 1, MINORMASK + 1, GFP_KERNEL); + if (minor < 0) { + retval = minor; + goto create_failed; + } + + dev = device_create(vgpu.class, NULL, MKDEV(MAJOR(vgpu.vgpu_devt), minor), NULL, "%s", name); + if (IS_ERR(dev)) { + retval = PTR_ERR(dev); + goto create_failed1; + } + + vgpu_dev->dev = dev; + vgpu_dev->minor = minor; + + mutex_lock(&vgpu.gpu_devices_lock); + list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) { + if (gpu_dev->dev == pdev) { + vgpu_dev->gpu_dev = gpu_dev; + if (gpu_dev->ops->vgpu_create) { + retval = gpu_dev->ops->vgpu_create(pdev, vgpu_dev->vm_uuid, + instance, vgpu_id); + if (retval) + { + mutex_unlock(&vgpu.gpu_devices_lock); + goto create_failed2; + } + } + break; + } + } + mutex_unlock(&vgpu.gpu_devices_lock); + + if (!vgpu_dev->gpu_dev) { + retval = -EINVAL; + goto create_failed2; + } + + mutex_lock(&vgpu.gpu_devices_lock); + mutex_unlock(&vgpu.gpu_devices_lock); + + printk(KERN_INFO "UUID %pUb \n", vgpu_dev->vm_uuid.b); + + group = iommu_group_alloc(); + if (IS_ERR(group)) { + printk(KERN_ERR "VGPU: 
failed to allocate group!\n"); + retval = PTR_ERR(group); + goto create_failed2; + } + + retval = iommu_group_add_device(group, dev); + if (retval) { + printk(KERN_ERR "VGPU: failed to add dev to group!\n"); + iommu_group_put(group); + goto create_failed2; + } + + retval = vgpu_group_init(vgpu_dev, group); + if (retval) { + printk(KERN_ERR "VGPU: failed vgpu_group_init \n"); + iommu_group_put(group); + iommu_group_remove_device(dev); + goto create_failed2; + } + + vgpu_dev->group = group; + printk(KERN_INFO "VGPU: group_id = %d \n", iommu_group_id(group)); + + mutex_unlock(&vgpu.vgpu_devices_lock); + return retval; + +create_failed2: + vgpu_device_destroy(vgpu_dev); + +create_failed1: + idr_remove(&vgpu.vgpu_idr, minor); + +create_failed: + mutex_unlock(&vgpu.vgpu_devices_lock); + vgpu_device_free(vgpu_dev); + + return retval; +} + +void destroy_vgpu_device(struct vgpu_device *vgpu_dev) +{ + struct device *dev = vgpu_dev->dev; + + if (!dev) { + return; + } + + printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name); + if (vgpu_dev->gpu_dev->ops->vgpu_destroy) { + int retval = 0; + retval = vgpu_dev->gpu_dev->ops->vgpu_destroy(vgpu_dev->gpu_dev->dev, + vgpu_dev->vm_uuid, + vgpu_dev->vgpu_instance); + /* if vendor driver doesn't return success that means vendor driver doesn't + * support hot-unplug */ + if (retval) + return; + } + + mutex_lock(&vgpu.vgpu_devices_lock); + + vgpu_group_free(vgpu_dev); + iommu_group_put(dev->iommu_group); + iommu_group_remove_device(dev); + vgpu_device_destroy(vgpu_dev); + idr_remove(&vgpu.vgpu_idr, vgpu_dev->minor); + + mutex_unlock(&vgpu.vgpu_devices_lock); + vgpu_device_free(vgpu_dev); +} + +void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance) +{ + struct vgpu_device *vdev, *vgpu_dev = NULL; + + mutex_lock(&vgpu.vgpu_devices_lock); + + // search VGPU device + list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) { + if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) && + (vdev->vgpu_instance == instance)) { + vgpu_dev 
= vdev; + break; + } + } + + mutex_unlock(&vgpu.vgpu_devices_lock); + if (vgpu_dev) + destroy_vgpu_device(vgpu_dev); +} + +void get_vgpu_supported_types(struct device *dev, char *str) +{ + struct gpu_device *gpu_dev; + + mutex_lock(&vgpu.gpu_devices_lock); + list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) { + if (&gpu_dev->dev->dev == dev) { + if (gpu_dev->ops->vgpu_supported_config) + gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str); + break; + } + } + mutex_unlock(&vgpu.gpu_devices_lock); +} + +int vgpu_start_callback(struct vgpu_device *vgpu_dev) +{ + int ret = 0; + + mutex_lock(&vgpu.gpu_devices_lock); + if (vgpu_dev->gpu_dev->ops->vgpu_start) + ret = vgpu_dev->gpu_dev->ops->vgpu_start(vgpu_dev->vm_uuid); + mutex_unlock(&vgpu.gpu_devices_lock); + return ret; +} + +int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev) +{ + int ret = 0; + + mutex_lock(&vgpu.gpu_devices_lock); + if (vgpu_dev->gpu_dev->ops->vgpu_shutdown) + ret = vgpu_dev->gpu_dev->ops->vgpu_shutdown(vgpu_dev->vm_uuid); + mutex_unlock(&vgpu.gpu_devices_lock); + return ret; +} + +int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags, + unsigned index, unsigned start, unsigned count, + void *data) +{ + int ret = 0; + + mutex_lock(&vgpu.gpu_devices_lock); + if (vgpu_dev->gpu_dev->ops->vgpu_set_irqs) + ret = vgpu_dev->gpu_dev->ops->vgpu_set_irqs(vgpu_dev, flags, + index, start, count, data); + mutex_unlock(&vgpu.gpu_devices_lock); + return ret; +} + +char *vgpu_devnode(struct device *dev, umode_t *mode) +{ + return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev)); +} + +static struct class vgpu_class = { + .name = VGPU_CLASS_NAME, + .owner = THIS_MODULE, + .class_attrs = vgpu_class_attrs, + .dev_groups = vgpu_dev_groups, + .devnode = vgpu_devnode, +}; + +static int __init vgpu_init(void) +{ + int rc = 0; + + memset(&vgpu, 0 , sizeof(vgpu)); + + idr_init(&vgpu.vgpu_idr); + mutex_init(&vgpu.vgpu_devices_lock); + INIT_LIST_HEAD(&vgpu.vgpu_devices_list); + 
+	mutex_init(&vgpu.gpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.gpu_devices_list);
+
+	// get major number from kernel
+	rc = alloc_chrdev_region(&vgpu.vgpu_devt, 0, MINORMASK, VGPU_DEV_NAME);
+
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu drv, err:%d\n", rc);
+		return rc;
+	}
+
+	cdev_init(&vgpu.vgpu_cdev, &vgpu_fops);
+	cdev_add(&vgpu.vgpu_cdev, vgpu.vgpu_devt, MINORMASK);
+
+	printk(KERN_ALERT "major_number:%d is allocated for vgpu\n", MAJOR(vgpu.vgpu_devt));
+
+	rc = class_register(&vgpu_class);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu class\n");
+		goto failed1;
+	}
+
+	vgpu.class = &vgpu_class;
+
+	return rc;
+
+failed1:
+	cdev_del(&vgpu.vgpu_cdev);
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+
+	return rc;
+}
+
+static void __exit vgpu_exit(void)
+{
+	// TODO: Release all unclosed fds
+	struct vgpu_device *vdev = NULL, *tmp;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry_safe(vdev, tmp, &vgpu.vgpu_devices_list, list) {
+		printk(KERN_INFO "VGPU: exit destroying device %s ", vdev->dev_name);
+		mutex_unlock(&vgpu.vgpu_devices_lock);
+		destroy_vgpu_device(vdev);
+		mutex_lock(&vgpu.vgpu_devices_lock);
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	idr_destroy(&vgpu.vgpu_idr);
+	cdev_del(&vgpu.vgpu_cdev);
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+	class_destroy(vgpu.class);
+	vgpu.class = NULL;
+}
+
+module_init(vgpu_init)
+module_exit(vgpu_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
new file mode 100644
index 000000000000..7e3c400d29f7
--- /dev/null
+++ b/drivers/vgpu/vgpu_private.h
@@ -0,0 +1,47 @@
+/*
+ * VGPU internal definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_PRIVATE_H
+#define VGPU_PRIVATE_H
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group *group);
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev);
+
+struct vgpu_device *find_vgpu_device(struct device *dev);
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid,
+		       uint32_t instance, uint32_t vgpu_id);
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+/* Function prototypes for vgpu_sysfs */
+
+extern struct class_attribute vgpu_class_attrs[];
+extern const struct attribute_group *vgpu_dev_groups[];
+
+int vgpu_create_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_notify_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_remove_status_file(struct vgpu_device *vgpu_dev);
+
+int vgpu_create_pci_device_files(struct pci_dev *dev);
+void vgpu_remove_pci_device_files(struct pci_dev *dev);
+
+void get_vgpu_supported_types(struct device *dev, char *str);
+int vgpu_start_callback(struct vgpu_device *vgpu_dev);
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+			   unsigned index, unsigned start, unsigned count,
+			   void *data);
+
+#endif /* VGPU_PRIVATE_H */
diff --git a/drivers/vgpu/vgpu_sysfs.c b/drivers/vgpu/vgpu_sysfs.c
new file mode 100644
index 000000000000..e48cbcd6948d
--- /dev/null
+++ b/drivers/vgpu/vgpu_sysfs.c
@@ -0,0 +1,322 @@
+/*
+ * File attributes for vGPU devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Author: Neo Jia <cjia@nvidia.com> + * Kirti Wankhede <kwankhede@nvidia.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/kernel.h> +#include <linux/sched.h> +#include <linux/fs.h> +#include <linux/sysfs.h> +#include <linux/ctype.h> +#include <linux/uuid.h> +#include <linux/vfio.h> +#include <linux/vgpu.h> + +#include "vgpu_private.h" + +/* Prototypes */ + +static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf); +static DEVICE_ATTR_RO(vgpu_supported_types); + +static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count); +static DEVICE_ATTR_WO(vgpu_create); + +static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count); +static DEVICE_ATTR_WO(vgpu_destroy); + + +/* Static functions */ + +static bool is_uuid_sep(char sep) +{ + if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0') + return true; + return false; +} + + +static int uuid_parse(const char *str, uuid_le *uuid) +{ + int i; + + if (strlen(str) < 36) + return -1; + + for (i = 0; i < 16; i++) { + if (!isxdigit(str[0]) || !isxdigit(str[1])) { + printk(KERN_ERR "%s err", __FUNCTION__); + return -EINVAL; + } + + uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]); + str += 2; + if (is_uuid_sep(*str)) + str++; + } + + return 0; +} + + +/* Functions */ +static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + char *str; + ssize_t n; + + str = kzalloc(sizeof(*str) * 512, GFP_KERNEL); + if (!str) + return -ENOMEM; + + get_vgpu_supported_types(dev, str); + + n = sprintf(buf,"%s\n", str); + kfree(str); + + return n; +} + +static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, 
size_t count) +{ + char *vm_uuid_str, *instance_str, *str; + uuid_le vm_uuid; + uint32_t instance, vgpu_id; + struct pci_dev *pdev; + + str = kstrndup(buf, count, GFP_KERNEL); + + if (!str) + return -ENOMEM; + + if ((vm_uuid_str = strsep(&str, ":")) == NULL) { + printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf); + return -EINVAL; + } + + if (str == NULL) { + printk(KERN_ERR "%s vgpu type and instance not specified %s \n", __FUNCTION__, buf); + return -EINVAL; + } + + if ((instance_str = strsep(&str, ":")) == NULL) { + printk(KERN_ERR "%s Empty instance or string %s \n", __FUNCTION__, buf); + return -EINVAL; + } + + if (str == NULL) { + printk(KERN_ERR "%s vgpu type not specified %s \n", __FUNCTION__, buf); + return -EINVAL; + + } + + instance = (unsigned int)simple_strtoul(instance_str, NULL, 0); + + vgpu_id = (unsigned int)simple_strtoul(str, NULL, 0); + + if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) { + printk(KERN_ERR "%s UUID parse error %s \n", __FUNCTION__, buf); + return -EINVAL; + } + + if (dev_is_pci(dev)) { + pdev = to_pci_dev(dev); + + if (create_vgpu_device(pdev, vm_uuid, instance, vgpu_id) < 0) { + printk(KERN_ERR "%s vgpu create error \n", __FUNCTION__); + return -EINVAL; + } + } + + return count; +} + +static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) +{ + char *vm_uuid_str, *str; + uuid_le vm_uuid; + unsigned int instance; + + str = kstrndup(buf, count, GFP_KERNEL); + + if (!str) + return -ENOMEM; + + if ((vm_uuid_str = strsep(&str, ":")) == NULL) { + printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf); + return -EINVAL; + } + + if (str == NULL) { + printk(KERN_ERR "%s instance not specified %s \n", __FUNCTION__, buf); + return -EINVAL; + } + + instance = (unsigned int)simple_strtoul(str, NULL, 0); + + if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) { + printk(KERN_ERR "%s UUID parse error %s \n", __FUNCTION__, buf); + return -EINVAL; + } + + 
printk(KERN_INFO "%s UUID %pUb - %d \n", __FUNCTION__, vm_uuid.b, instance); + + destroy_vgpu_device_by_uuid(vm_uuid, instance); + + return count; +} + +static ssize_t +vgpu_vm_uuid_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + struct vgpu_device *drv = find_vgpu_device(dev); + + if (drv) + return sprintf(buf, "%pUb \n", drv->vm_uuid.b); + + return sprintf(buf, " \n"); +} + +static DEVICE_ATTR_RO(vgpu_vm_uuid); + +static ssize_t +vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + struct vgpu_device *drv = find_vgpu_device(dev); + + if (drv && drv->group) + return sprintf(buf, "%d \n", iommu_group_id(drv->group)); + + return sprintf(buf, " \n"); +} + +static DEVICE_ATTR_RO(vgpu_group_id); + + +static struct attribute *vgpu_dev_attrs[] = { + &dev_attr_vgpu_vm_uuid.attr, + &dev_attr_vgpu_group_id.attr, + NULL, +}; + +static const struct attribute_group vgpu_dev_group = { + .attrs = vgpu_dev_attrs, +}; + +const struct attribute_group *vgpu_dev_groups[] = { + &vgpu_dev_group, + NULL, +}; + + +ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr, + const char *buf, size_t count) +{ + char *vm_uuid_str; + uuid_le vm_uuid; + struct vgpu_device *vgpu_dev = NULL; + int ret; + + vm_uuid_str = kstrndup(buf, count, GFP_KERNEL); + + if (!vm_uuid_str) + return -ENOMEM; + + if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) { + printk(KERN_ERR "%s UUID parse error %s \n", __FUNCTION__, buf); + return -EINVAL; + } + + vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0); + + if (vgpu_dev && vgpu_dev->dev) { + kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_ONLINE); + + ret = vgpu_start_callback(vgpu_dev); + if (ret < 0) { + printk(KERN_ERR "%s vgpu_start callback failed %d \n", __FUNCTION__, ret); + return ret; + } + } + + return count; +} + +ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr, + const char *buf, size_t count) +{ + char *vm_uuid_str; + uuid_le vm_uuid; + struct 
vgpu_device *vgpu_dev = NULL; + int ret; + + vm_uuid_str = kstrndup(buf, count, GFP_KERNEL); + + if (!vm_uuid_str) + return -ENOMEM; + + if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) { + printk(KERN_ERR "%s UUID parse error %s \n", __FUNCTION__, buf); + return -EINVAL; + } + vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0); + + if (vgpu_dev && vgpu_dev->dev) { + kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_OFFLINE); + + ret = vgpu_shutdown_callback(vgpu_dev); + if (ret < 0) { + printk(KERN_ERR "%s vgpu_shutdown callback failed %d \n", __FUNCTION__, ret); + return ret; + } + } + + return count; +} + +struct class_attribute vgpu_class_attrs[] = { + __ATTR_WO(vgpu_start), + __ATTR_WO(vgpu_shutdown), + __ATTR_NULL +}; + +int vgpu_create_pci_device_files(struct pci_dev *dev) +{ + int retval; + + retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr); + if (retval) { + printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_supported_types sysfs entry\n"); + return retval; + } + + retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr); + if (retval) { + printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs entry\n"); + return retval; + } + + retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr); + if (retval) { + printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs entry\n"); + return retval; + } + + return 0; +} + + +void vgpu_remove_pci_device_files(struct pci_dev *dev) +{ + sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr); + sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr); + sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr); +} + diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c new file mode 100644 index 000000000000..ef0833140d84 --- /dev/null +++ b/drivers/vgpu/vgpu_vfio.c @@ -0,0 +1,521 @@ +/* + * VGPU VFIO device + * + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved. 
+ * Author: Neo Jia <cjia@nvidia.com> + * Kirti Wankhede <kwankhede@nvidia.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/device.h> +#include <linux/kernel.h> +#include <linux/fs.h> +#include <linux/poll.h> +#include <linux/slab.h> +#include <linux/cdev.h> +#include <linux/sched.h> +#include <linux/wait.h> +#include <linux/uuid.h> +#include <linux/vfio.h> +#include <linux/iommu.h> +#include <linux/vgpu.h> + +#include "vgpu_private.h" + +#define VFIO_PCI_OFFSET_SHIFT 40 + +#define VFIO_PCI_OFFSET_TO_INDEX(off) (off >> VFIO_PCI_OFFSET_SHIFT) +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT) +#define VFIO_PCI_OFFSET_MASK (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1) + +struct vfio_vgpu_device { + struct iommu_group *group; + struct vgpu_device *vgpu_dev; +}; + +static int vgpu_dev_open(void *device_data) +{ + printk(KERN_INFO "%s ", __FUNCTION__); + return 0; +} + +static void vgpu_dev_close(void *device_data) +{ + +} + +static uint64_t resource_len(struct vgpu_device *vgpu_dev, int bar_index) +{ + uint64_t size = 0; + + switch (bar_index) { + case VFIO_PCI_BAR0_REGION_INDEX: + size = 16 * 1024 * 1024; + break; + case VFIO_PCI_BAR1_REGION_INDEX: + size = 256 * 1024 * 1024; + break; + case VFIO_PCI_BAR2_REGION_INDEX: + size = 32 * 1024 * 1024; + break; + case VFIO_PCI_BAR5_REGION_INDEX: + size = 128; + break; + default: + size = 0; + break; + } + return size; +} + +static int vgpu_get_irq_count(struct vfio_vgpu_device *vdev, int irq_type) +{ + return 1; +} + +static long vgpu_dev_unlocked_ioctl(void *device_data, + unsigned int cmd, unsigned long arg) +{ + int ret = 0; + struct vfio_vgpu_device *vdev = device_data; + unsigned long minsz; + + switch (cmd) + { + case VFIO_DEVICE_GET_INFO: + { + struct vfio_device_info 
info;
+		printk(KERN_INFO "%s VFIO_DEVICE_GET_INFO cmd index = %d", __FUNCTION__, vdev->vgpu_dev->minor);
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+
+		printk(KERN_INFO "%s VFIO_DEVICE_GET_REGION_INFO cmd", __FUNCTION__);
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = 0x100;	// 256-byte PCI config space
+			// info.size = sizeof(vdev->vgpu_dev->config_space);
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+			break;
+		case VFIO_PCI_BAR0_REGION_INDEX ...
VFIO_PCI_BAR5_REGION_INDEX: + info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index); + info.size = resource_len(vdev->vgpu_dev, info.index); + if (!info.size) { + info.flags = 0; + break; + } + + info.flags = VFIO_REGION_INFO_FLAG_READ | + VFIO_REGION_INFO_FLAG_WRITE; + + if ((info.index == VFIO_PCI_BAR1_REGION_INDEX) || + (info.index == VFIO_PCI_BAR2_REGION_INDEX)) { + info.flags |= PCI_BASE_ADDRESS_MEM_PREFETCH; + } + + /* TODO: provides configurable setups to + * GPU vendor + */ + + if (info.index == VFIO_PCI_BAR1_REGION_INDEX) + info.flags = VFIO_REGION_INFO_FLAG_MMAP; + + break; + case VFIO_PCI_VGA_REGION_INDEX: + info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index); + info.size = 0xc0000; + info.flags = VFIO_REGION_INFO_FLAG_READ | + VFIO_REGION_INFO_FLAG_WRITE; + break; + + case VFIO_PCI_ROM_REGION_INDEX: + default: + return -EINVAL; + } + + return copy_to_user((void __user *)arg, &info, minsz); + + } + case VFIO_DEVICE_GET_IRQ_INFO: + { + struct vfio_irq_info info; + + printk(KERN_INFO "%s VFIO_DEVICE_GET_IRQ_INFO cmd", __FUNCTION__); + minsz = offsetofend(struct vfio_irq_info, count); + + if (copy_from_user(&info, (void __user *)arg, minsz)) + return -EFAULT; + + if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS) + return -EINVAL; + + switch (info.index) { + case VFIO_PCI_INTX_IRQ_INDEX ... 
VFIO_PCI_MSIX_IRQ_INDEX: + case VFIO_PCI_REQ_IRQ_INDEX: + break; + /* pass thru to return error */ + default: + return -EINVAL; + } + + info.count = VFIO_PCI_NUM_IRQS; + + info.flags = VFIO_IRQ_INFO_EVENTFD; + info.count = vgpu_get_irq_count(vdev, info.index); + + if (info.index == VFIO_PCI_INTX_IRQ_INDEX) + info.flags |= (VFIO_IRQ_INFO_MASKABLE | + VFIO_IRQ_INFO_AUTOMASKED); + else + info.flags |= VFIO_IRQ_INFO_NORESIZE; + + return copy_to_user((void __user *)arg, &info, minsz); + } + + case VFIO_DEVICE_SET_IRQS: + { + struct vfio_irq_set hdr; + u8 *data = NULL; + int ret = 0; + + minsz = offsetofend(struct vfio_irq_set, count); + + if (copy_from_user(&hdr, (void __user *)arg, minsz)) + return -EFAULT; + + if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS || + hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK | + VFIO_IRQ_SET_ACTION_TYPE_MASK)) + return -EINVAL; + + if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) { + size_t size; + int max = vgpu_get_irq_count(vdev, hdr.index); + + if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL) + size = sizeof(uint8_t); + else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD) + size = sizeof(int32_t); + else + return -EINVAL; + + if (hdr.argsz - minsz < hdr.count * size || + hdr.start >= max || hdr.start + hdr.count > max) + return -EINVAL; + + data = memdup_user((void __user *)(arg + minsz), + hdr.count * size); + if (IS_ERR(data)) + return PTR_ERR(data); + + } + ret = vgpu_set_irqs_callback(vdev->vgpu_dev, hdr.flags, hdr.index, + hdr.start, hdr.count, data); + kfree(data); + + + return ret; + } + + default: + return -EINVAL; + } + return ret; +} + + +ssize_t vgpu_dev_config_rw(struct vfio_vgpu_device *vdev, char __user *buf, + size_t count, loff_t *ppos, bool iswrite) +{ + struct vgpu_device *vgpu_dev = vdev->vgpu_dev; + int cfg_size = sizeof(vgpu_dev->config_space); + int ret = 0; + uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK; + + if (pos < 0 || pos >= cfg_size || + pos + count > cfg_size) { + printk(KERN_ERR "%s pos 0x%llx out of range\n", 
__FUNCTION__, pos); + ret = -EFAULT; + goto config_rw_exit; + } + + if (iswrite) { + char *user_data = kmalloc(count, GFP_KERNEL); + + if (user_data == NULL) { + ret = -ENOMEM; + goto config_rw_exit; + } + + if (copy_from_user(user_data, buf, count)) { + ret = -EFAULT; + kfree(user_data); + goto config_rw_exit; + } + + /* FIXME: Need to save the BAR value properly */ + switch (pos) { + case PCI_BASE_ADDRESS_0: + vgpu_dev->bar[0].start = *((uint32_t *)user_data); + break; + case PCI_BASE_ADDRESS_1: + vgpu_dev->bar[1].start = *((uint32_t *)user_data); + break; + case PCI_BASE_ADDRESS_2: + vgpu_dev->bar[2].start = *((uint32_t *)user_data); + break; + } + + if (vgpu_dev->gpu_dev->ops->write) { + ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev, + user_data, + count, + vgpu_emul_space_config, + pos); + } + + kfree(user_data); + } + else + { + char *ret_data = kmalloc(count, GFP_KERNEL); + + if (ret_data == NULL) { + ret = -ENOMEM; + goto config_rw_exit; + } + + memset(ret_data, 0, count); + + if (vgpu_dev->gpu_dev->ops->read) { + ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev, + ret_data, + count, + vgpu_emul_space_config, + pos); + } + + if (ret > 0 ) { + if (copy_to_user(buf, ret_data, ret)) { + ret = -EFAULT; + kfree(ret_data); + goto config_rw_exit; + } + } + kfree(ret_data); + } + +config_rw_exit: + + return ret; +} + +ssize_t vgpu_dev_bar_rw(struct vfio_vgpu_device *vdev, char __user *buf, + size_t count, loff_t *ppos, bool iswrite) +{ + struct vgpu_device *vgpu_dev = vdev->vgpu_dev; + loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK; + loff_t pos; + int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos); + uint64_t end; + int ret = 0; + + if (!vgpu_dev->bar[bar_index].start) { + ret = -EINVAL; + goto bar_rw_exit; + } + + end = resource_len(vgpu_dev, bar_index); + + if (offset >= end) { + ret = -EINVAL; + goto bar_rw_exit; + } + + pos = vgpu_dev->bar[bar_index].start + offset; + if (iswrite) { + char *user_data = kmalloc(count, GFP_KERNEL); + + if (user_data == NULL) { + ret = 
-ENOMEM; + goto bar_rw_exit; + } + + if (copy_from_user(user_data, buf, count)) { + ret = -EFAULT; + kfree(user_data); + goto bar_rw_exit; + } + + if (vgpu_dev->gpu_dev->ops->write) { + ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev, + user_data, + count, + vgpu_emul_space_mmio, + pos); + } + + kfree(user_data); + } + else + { + char *ret_data = kmalloc(count, GFP_KERNEL); + + if (ret_data == NULL) { + ret = -ENOMEM; + goto bar_rw_exit; + } + + memset(ret_data, 0, count); + + if (vgpu_dev->gpu_dev->ops->read) { + ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev, + ret_data, + count, + vgpu_emul_space_mmio, + pos); + } + + if (ret > 0 ) { + if (copy_to_user(buf, ret_data, ret)) { + ret = -EFAULT; + } + } + kfree(ret_data); + } + +bar_rw_exit: + return ret; +} + + +static ssize_t vgpu_dev_rw(void *device_data, char __user *buf, + size_t count, loff_t *ppos, bool iswrite) +{ + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos); + struct vfio_vgpu_device *vdev = device_data; + + if (index >= VFIO_PCI_NUM_REGIONS) + return -EINVAL; + + switch (index) { + case VFIO_PCI_CONFIG_REGION_INDEX: + return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite); + + + case VFIO_PCI_BAR0_REGION_INDEX ... 
VFIO_PCI_BAR5_REGION_INDEX: + return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite); + + case VFIO_PCI_ROM_REGION_INDEX: + case VFIO_PCI_VGA_REGION_INDEX: + break; + } + + return -EINVAL; +} + + +static ssize_t vgpu_dev_read(void *device_data, char __user *buf, + size_t count, loff_t *ppos) +{ + int ret = 0; + + if (count) + ret = vgpu_dev_rw(device_data, buf, count, ppos, false); + + return ret; +} + +static ssize_t vgpu_dev_write(void *device_data, const char __user *buf, + size_t count, loff_t *ppos) +{ + int ret = 0; + + if (count) + ret = vgpu_dev_rw(device_data, (char *)buf, count, ppos, true); + + return ret; +} + +/* Just create an invalid mapping without providing a fault handler */ + +static int vgpu_dev_mmap(void *device_data, struct vm_area_struct *vma) +{ + printk(KERN_INFO "%s ", __FUNCTION__); + return 0; +} + +static const struct vfio_device_ops vgpu_vfio_dev_ops = { + .name = "vfio-vgpu-grp", + .open = vgpu_dev_open, + .release = vgpu_dev_close, + .ioctl = vgpu_dev_unlocked_ioctl, + .read = vgpu_dev_read, + .write = vgpu_dev_write, + .mmap = vgpu_dev_mmap, +}; + +int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group *group) +{ + struct vfio_vgpu_device *vdev; + int ret = 0; + + vdev = kzalloc(sizeof(*vdev), GFP_KERNEL); + if (!vdev) { + return -ENOMEM; + } + + vdev->group = group; + vdev->vgpu_dev = vgpu_dev; + + ret = vfio_add_group_dev(vgpu_dev->dev, &vgpu_vfio_dev_ops, vdev); + if (ret) + kfree(vdev); + + return ret; +} + + +int vgpu_group_free(struct vgpu_device *vgpu_dev) +{ + struct vfio_vgpu_device *vdev; + + vdev = vfio_del_group_dev(vgpu_dev->dev); + if (!vdev) + return -1; + + kfree(vdev); + return 0; +} + diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h new file mode 100644 index 000000000000..a2861c3f42e5 --- /dev/null +++ b/include/linux/vgpu.h @@ -0,0 +1,157 @@ +/* + * VGPU definition + * + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved. 
+ * Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_H
+#define VGPU_H
+
+// Common data structures
+
+struct pci_bar_info {
+	uint64_t start;
+	uint64_t end;
+	int flags;
+};
+
+enum vgpu_emul_space_e {
+	vgpu_emul_space_config = 0,	/*!< PCI configuration space */
+	vgpu_emul_space_io = 1,		/*!< I/O register space */
+	vgpu_emul_space_mmio = 2	/*!< Memory-mapped I/O space */
+};
+
+struct gpu_device;
+
+/*
+ * VGPU device
+ */
+struct vgpu_device {
+	struct kref		kref;
+	struct device		*dev;
+	int			minor;
+	struct gpu_device	*gpu_dev;
+	struct iommu_group	*group;
+#define DEVICE_NAME_LEN		(64)
+	char			dev_name[DEVICE_NAME_LEN];
+	uuid_le			vm_uuid;
+	uint32_t		vgpu_instance;
+	uint32_t		vgpu_id;
+	atomic_t		usage_count;
+	char			config_space[0x100];	// 256-byte PCI config space
+	struct pci_bar_info	bar[VFIO_PCI_NUM_REGIONS];
+	struct device_attribute	*dev_attr_vgpu_status;
+	int			vgpu_device_status;
+
+	struct list_head	list;
+};
+
+
+/**
+ * struct gpu_device_ops - Structure to be registered for each physical GPU to
+ * register the device to the vgpu module.
+ *
+ * @owner:			The module owner.
+ * @vgpu_supported_config:	Called to get information about supported vgpu types.
+ *				@dev: pci device structure of the physical GPU.
+ *				@config: should return a string listing supported configs.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_create:		Called to allocate basic resources in the graphics
+ *				driver for a particular vgpu.
+ *				@dev: physical pci device structure on which the vgpu
+ *				should be created
+ *				@vm_uuid: uuid of the VM the vgpu is intended for
+ *				@instance: vgpu instance in that VM
+ *				@vgpu_id: the type of vgpu to be created
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_destroy:		Called to free resources in the graphics driver for
+ *				a vgpu instance of that VM.
+ *				@dev: physical pci device structure to which
+ *				this vgpu points
+ *				@vm_uuid: uuid of the VM the vgpu belongs to
+ *				@instance: vgpu instance in that VM
+ *				Returns integer: success (0) or error (< 0)
+ *				If vgpu_destroy is called while the VM is
+ *				running, the vGPU is being hot-unplugged.
+ *				Return an error if the VM is running and the
+ *				graphics driver doesn't support vgpu hotplug.
+ * @vgpu_start:			Called to initiate the vGPU initialization
+ *				process in the graphics driver when the VM
+ *				boots, before qemu starts.
+ *				@vm_uuid: uuid of the VM which is booting.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_shutdown:		Called to tear down vGPU-related resources for
+ *				the VM.
+ *				@vm_uuid: uuid of the VM which is shutting down.
+ *				Returns integer: success (0) or error (< 0)
+ * @read:			Read emulation callback.
+ *				@vdev: vgpu device structure
+ *				@buf: read buffer
+ *				@count: number of bytes to read
+ *				@address_space: specifies which address space
+ *				the request is for: pci config space, I/O
+ *				register space or MMIO space.
+ *				Returns number of bytes read on success, or an error.
+ * @write:			Write emulation callback.
+ *				@vdev: vgpu device structure
+ *				@buf: write buffer
+ *				@count: number of bytes to be written
+ *				@address_space: specifies which address space
+ *				the request is for: pci config space, I/O
+ *				register space or MMIO space.
+ *				Returns number of bytes written on success, or an error.
+ * @vgpu_set_irqs:		Called to pass down the interrupt configuration
+ *				that qemu set.
+ *				@vdev: vgpu device structure
+ *				@flags, index, start, count and *data: same as
+ *				in struct vfio_irq_set of the
+ *				VFIO_DEVICE_SET_IRQS API.
+ *
+ * A physical GPU that supports vGPU should be registered with the vgpu module
+ * with a gpu_device_ops structure.
+ */
+
+struct gpu_device_ops {
+	struct module	*owner;
+	int	(*vgpu_supported_config)(struct pci_dev *dev, char *config);
+	int	(*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
+			       uint32_t instance, uint32_t vgpu_id);
+	int	(*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
+				uint32_t instance);
+	int	(*vgpu_start)(uuid_le vm_uuid);
+	int	(*vgpu_shutdown)(uuid_le vm_uuid);
+	ssize_t	(*read)(struct vgpu_device *vdev, char *buf, size_t count,
+			uint32_t address_space, loff_t pos);
+	ssize_t	(*write)(struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	int	(*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
+				 unsigned index, unsigned start, unsigned count,
+				 void *data);
+};
+
+/*
+ * Physical GPU
+ */
+struct gpu_device {
+	struct pci_dev			*dev;
+	const struct gpu_device_ops	*ops;
+	struct list_head		gpu_next;
+};
+
+extern int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops);
+extern void vgpu_unregister_device(struct pci_dev *dev);
+
+extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr, uint32_t len, uint32_t flags);
+extern int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count);
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
+
+#endif /* VGPU_H */
--
1.8.1.4

From 380156ade7053664bdb318af0659708357f40050 Mon Sep 17 00:00:00 2001
From: Neo Jia <cjia@nvidia.com>
Date: Sun, 24 Jan 2016 11:24:13 -0800
Subject: [PATCH] Add VGPU VFIO driver class support in QEMU

This is just a quick POC change to allow us to experiment with the VGPU
VFIO support; the next step is to merge this into the current vfio/pci.c,
which currently assumes a physical backing device.

In the current POC implementation we have copied many functions directly
from the vfio/pci.c code; we should merge them together later.
- Basic MMIO and PCI config access are supported
- MMAP'ed GPU BAR is supported
- INTx and MSI using eventfd are supported; we don't think we should
  inject interrupts when vector->kvm_interrupt is not enabled.

Change-Id: I99c34ac44524cd4d7d2abbcc4d43634297b96e80
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/vfio/Makefile.objs |   1 +
 hw/vfio/vgpu.c        | 991 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci/pci.h  |   3 +
 3 files changed, 995 insertions(+)
 create mode 100644 hw/vfio/vgpu.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index d324863..17f2ef1 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,7 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
 obj-$(CONFIG_PCI) += pci.o pci-quirks.o
+obj-$(CONFIG_PCI) += vgpu.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 endif
diff --git a/hw/vfio/vgpu.c b/hw/vfio/vgpu.c
new file mode 100644
index 0000000..56ebce0
--- /dev/null
+++ b/hw/vfio/vgpu.c
@@ -0,0 +1,991 @@
+/*
+ * vGPU VFIO device
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ * + */ + +#include <dirent.h> +#include <linux/vfio.h> +#include <sys/ioctl.h> +#include <sys/mman.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <unistd.h> + +#include "config.h" +#include "exec/address-spaces.h" +#include "exec/memory.h" +#include "hw/pci/msi.h" +#include "hw/pci/msix.h" +#include "hw/pci/pci.h" +#include "qemu-common.h" +#include "qemu/error-report.h" +#include "qemu/event_notifier.h" +#include "qemu/queue.h" +#include "qemu/range.h" +#include "sysemu/kvm.h" +#include "sysemu/sysemu.h" +#include "trace.h" +#include "hw/vfio/vfio.h" +#include "hw/vfio/pci.h" +#include "hw/vfio/vfio-common.h" +#include "qmp-commands.h" + +#define TYPE_VFIO_VGPU "vfio-vgpu" + +typedef struct VFIOvGPUDevice { + PCIDevice pdev; + VFIODevice vbasedev; + VFIOINTx intx; + VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */ + uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */ + unsigned int config_size; + char *vgpu_type; + char *vm_uuid; + off_t config_offset; /* Offset of config space region within device fd */ + int msi_cap_size; + EventNotifier req_notifier; + int nr_vectors; /* Number of MSI/MSIX vectors currently in use */ + int interrupt; /* Current interrupt type */ + VFIOMSIVector *msi_vectors; +} VFIOvGPUDevice; + +/* + * Local functions + */ + +// function prototypes +static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev); +static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len); + + +// INTx functions + +static void vfio_vgpu_intx_interrupt(void *opaque) +{ + VFIOvGPUDevice *vdev = opaque; + + if (!event_notifier_test_and_clear(&vdev->intx.interrupt)) { + return; + } + + vdev->intx.pending = true; + pci_irq_assert(&vdev->pdev); +// vfio_mmap_set_enabled(vdev, false); + +} + +static void vfio_vgpu_intx_eoi(VFIODevice *vbasedev) +{ + VFIOvGPUDevice *vdev = container_of(vbasedev, VFIOvGPUDevice, vbasedev); + + if (!vdev->intx.pending) { + return; + } + + trace_vfio_intx_eoi(vbasedev->name); + + 
vdev->intx.pending = false; + pci_irq_deassert(&vdev->pdev); + vfio_unmask_single_irqindex(vbasedev, VFIO_PCI_INTX_IRQ_INDEX); +} + +static void vfio_vgpu_intx_enable_kvm(VFIOvGPUDevice *vdev) +{ +#ifdef CONFIG_KVM + struct kvm_irqfd irqfd = { + .fd = event_notifier_get_fd(&vdev->intx.interrupt), + .gsi = vdev->intx.route.irq, + .flags = KVM_IRQFD_FLAG_RESAMPLE, + }; + struct vfio_irq_set *irq_set; + int ret, argsz; + int32_t *pfd; + + if (!kvm_irqfds_enabled() || + vdev->intx.route.mode != PCI_INTX_ENABLED || + !kvm_resamplefds_enabled()) { + return; + } + + /* Get to a known interrupt state */ + qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev); + vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX); + vdev->intx.pending = false; + pci_irq_deassert(&vdev->pdev); + + /* Get an eventfd for resample/unmask */ + if (event_notifier_init(&vdev->intx.unmask, 0)) { + error_report("vfio: Error: event_notifier_init failed eoi"); + goto fail; + } + + /* KVM triggers it, VFIO listens for it */ + irqfd.resamplefd = event_notifier_get_fd(&vdev->intx.unmask); + + if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) { + error_report("vfio: Error: Failed to setup resample irqfd: %m"); + goto fail_irqfd; + } + + argsz = sizeof(*irq_set) + sizeof(*pfd); + + irq_set = g_malloc0(argsz); + irq_set->argsz = argsz; + irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_UNMASK; + irq_set->index = VFIO_PCI_INTX_IRQ_INDEX; + irq_set->start = 0; + irq_set->count = 1; + pfd = (int32_t *)&irq_set->data; + + *pfd = irqfd.resamplefd; + + ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set); + g_free(irq_set); + if (ret) { + error_report("vfio: Error: Failed to setup INTx unmask fd: %m"); + goto fail_vfio; + } + + /* Let'em rip */ + vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX); + + vdev->intx.kvm_accel = true; + + trace_vfio_intx_enable_kvm(vdev->vbasedev.name); + + return; + +fail_vfio: + irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN; + 
kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd); +fail_irqfd: + event_notifier_cleanup(&vdev->intx.unmask); +fail: + qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev); + vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX); +#endif +} + +static void vfio_vgpu_intx_disable_kvm(VFIOvGPUDevice *vdev) +{ +#ifdef CONFIG_KVM + struct kvm_irqfd irqfd = { + .fd = event_notifier_get_fd(&vdev->intx.interrupt), + .gsi = vdev->intx.route.irq, + .flags = KVM_IRQFD_FLAG_DEASSIGN, + }; + + if (!vdev->intx.kvm_accel) { + return; + } + + /* + * Get to a known state, hardware masked, QEMU ready to accept new + * interrupts, QEMU IRQ de-asserted. + */ + vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX); + vdev->intx.pending = false; + pci_irq_deassert(&vdev->pdev); + + /* Tell KVM to stop listening for an INTx irqfd */ + if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) { + error_report("vfio: Error: Failed to disable INTx irqfd: %m"); + } + + /* We only need to close the eventfd for VFIO to cleanup the kernel side */ + event_notifier_cleanup(&vdev->intx.unmask); + + /* QEMU starts listening for interrupt events. 
*/ + qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev); + + vdev->intx.kvm_accel = false; + + /* If we've missed an event, let it re-fire through QEMU */ + vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX); + + trace_vfio_intx_disable_kvm(vdev->vbasedev.name); +#endif +} + +static void vfio_vgpu_intx_update(PCIDevice *pdev) +{ + VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev); + PCIINTxRoute route; + + if (vdev->interrupt != VFIO_INT_INTx) { + return; + } + + route = pci_device_route_intx_to_irq(&vdev->pdev, vdev->intx.pin); + + if (!pci_intx_route_changed(&vdev->intx.route, &route)) { + return; /* Nothing changed */ + } + + trace_vfio_intx_update(vdev->vbasedev.name, + vdev->intx.route.irq, route.irq); + + vfio_vgpu_intx_disable_kvm(vdev); + + vdev->intx.route = route; + + if (route.mode != PCI_INTX_ENABLED) { + return; + } + + vfio_vgpu_intx_enable_kvm(vdev); + + /* Re-enable the interrupt in cased we missed an EOI */ + vfio_vgpu_intx_eoi(&vdev->vbasedev); +} + +static int vfio_vgpu_intx_enable(VFIOvGPUDevice *vdev) +{ + uint8_t pin = vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1); + int ret, argsz; + struct vfio_irq_set *irq_set; + int32_t *pfd; + + if (!pin) { + return 0; + } + + vfio_vgpu_disable_interrupts(vdev); + + vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */ + pci_config_set_interrupt_pin(vdev->pdev.config, pin); + +#ifdef CONFIG_KVM + /* + * Only conditional to avoid generating error messages on platforms + * where we won't actually use the result anyway. 
+ */ + if (kvm_irqfds_enabled() && kvm_resamplefds_enabled()) { + vdev->intx.route = pci_device_route_intx_to_irq(&vdev->pdev, + vdev->intx.pin); + } +#endif + + ret = event_notifier_init(&vdev->intx.interrupt, 0); + if (ret) { + error_report("vfio: Error: event_notifier_init failed"); + return ret; + } + + argsz = sizeof(*irq_set) + sizeof(*pfd); + + irq_set = g_malloc0(argsz); + irq_set->argsz = argsz; + irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER; + irq_set->index = VFIO_PCI_INTX_IRQ_INDEX; + irq_set->start = 0; + irq_set->count = 1; + pfd = (int32_t *)&irq_set->data; + + *pfd = event_notifier_get_fd(&vdev->intx.interrupt); + qemu_set_fd_handler(*pfd, vfio_vgpu_intx_interrupt, NULL, vdev); + + ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set); + g_free(irq_set); + if (ret) { + error_report("vfio: Error: Failed to setup INTx fd: %m"); + qemu_set_fd_handler(*pfd, NULL, NULL, vdev); + event_notifier_cleanup(&vdev->intx.interrupt); + return -errno; + } + + vfio_vgpu_intx_enable_kvm(vdev); + + vdev->interrupt = VFIO_INT_INTx; + + trace_vfio_intx_enable(vdev->vbasedev.name); + + return 0; +} + +static void vfio_vgpu_intx_disable(VFIOvGPUDevice *vdev) +{ + int fd; + + vfio_vgpu_intx_disable_kvm(vdev); + vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX); + vdev->intx.pending = false; + pci_irq_deassert(&vdev->pdev); +// vfio_mmap_set_enabled(vdev, true); + + fd = event_notifier_get_fd(&vdev->intx.interrupt); + qemu_set_fd_handler(fd, NULL, NULL, vdev); + event_notifier_cleanup(&vdev->intx.interrupt); + + vdev->interrupt = VFIO_INT_NONE; + + trace_vfio_intx_disable(vdev->vbasedev.name); +} + +//MSI functions +static void vfio_vgpu_remove_kvm_msi_virq(VFIOMSIVector *vector) +{ + kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt, + vector->virq); + kvm_irqchip_release_virq(kvm_state, vector->virq); + vector->virq = -1; + event_notifier_cleanup(&vector->kvm_interrupt); +} + +static void 
vfio_vgpu_msi_disable_common(VFIOvGPUDevice *vdev) +{ + int i; + + for (i = 0; i < vdev->nr_vectors; i++) { + VFIOMSIVector *vector = &vdev->msi_vectors[i]; + if (vdev->msi_vectors[i].use) { + if (vector->virq >= 0) { + vfio_vgpu_remove_kvm_msi_virq(vector); + } + qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt), + NULL, NULL, NULL); + event_notifier_cleanup(&vector->interrupt); + } + } + + g_free(vdev->msi_vectors); + vdev->msi_vectors = NULL; + vdev->nr_vectors = 0; + vdev->interrupt = VFIO_INT_NONE; + + vfio_vgpu_intx_enable(vdev); +} + +static void vfio_vgpu_msi_disable(VFIOvGPUDevice *vdev) +{ + vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSI_IRQ_INDEX); + vfio_vgpu_msi_disable_common(vdev); +} + +static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev) +{ + + if (vdev->interrupt == VFIO_INT_MSI) { + vfio_vgpu_msi_disable(vdev); + } + + if (vdev->interrupt == VFIO_INT_INTx) { + vfio_vgpu_intx_disable(vdev); + } +} + + +static void vfio_vgpu_msi_interrupt(void *opaque) +{ + VFIOMSIVector *vector = opaque; + VFIOvGPUDevice *vdev = (VFIOvGPUDevice *)vector->vdev; + MSIMessage (*get_msg)(PCIDevice *dev, unsigned vector); + void (*notify)(PCIDevice *dev, unsigned vector); + MSIMessage msg; + int nr = vector - vdev->msi_vectors; + + if (!event_notifier_test_and_clear(&vector->interrupt)) { + return; + } + + if (vdev->interrupt == VFIO_INT_MSIX) { + get_msg = msix_get_message; + notify = msix_notify; + } else if (vdev->interrupt == VFIO_INT_MSI) { + get_msg = msi_get_message; + notify = msi_notify; + } else { + abort(); + } + + msg = get_msg(&vdev->pdev, nr); + trace_vfio_msi_interrupt(vdev->vbasedev.name, nr, msg.address, msg.data); + notify(&vdev->pdev, nr); +} + +static int vfio_vgpu_enable_vectors(VFIOvGPUDevice *vdev, bool msix) +{ + struct vfio_irq_set *irq_set; + int ret = 0, i, argsz; + int32_t *fds; + + argsz = sizeof(*irq_set) + (vdev->nr_vectors * sizeof(*fds)); + + irq_set = g_malloc0(argsz); + irq_set->argsz = argsz; + 
irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER; + irq_set->index = msix ? VFIO_PCI_MSIX_IRQ_INDEX : VFIO_PCI_MSI_IRQ_INDEX; + irq_set->start = 0; + irq_set->count = vdev->nr_vectors; + fds = (int32_t *)&irq_set->data; + + for (i = 0; i < vdev->nr_vectors; i++) { + int fd = -1; + + /* + * MSI vs MSI-X - The guest has direct access to MSI mask and pending + * bits, therefore we always use the KVM signaling path when setup. + * MSI-X mask and pending bits are emulated, so we want to use the + * KVM signaling path only when configured and unmasked. + */ + if (vdev->msi_vectors[i].use) { + if (vdev->msi_vectors[i].virq < 0 || + (msix && msix_is_masked(&vdev->pdev, i))) { + fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt); + } else { + fd = event_notifier_get_fd(&vdev->msi_vectors[i].kvm_interrupt); + } + } + + fds[i] = fd; + } + + ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set); + + g_free(irq_set); + + return ret; +} + +static void vfio_vgpu_add_kvm_msi_virq(VFIOvGPUDevice *vdev, VFIOMSIVector *vector, + MSIMessage *msg, bool msix) +{ + int virq; + + if (!msg) { + return; + } + + if (event_notifier_init(&vector->kvm_interrupt, 0)) { + return; + } + + virq = kvm_irqchip_add_msi_route(kvm_state, *msg, &vdev->pdev); + if (virq < 0) { + event_notifier_cleanup(&vector->kvm_interrupt); + return; + } + + if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt, + NULL, virq) < 0) { + kvm_irqchip_release_virq(kvm_state, virq); + event_notifier_cleanup(&vector->kvm_interrupt); + return; + } + + vector->virq = virq; +} + +static void vfio_vgpu_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg, + PCIDevice *pdev) +{ + kvm_irqchip_update_msi_route(kvm_state, vector->virq, msg, pdev); +} + + +static void vfio_vgpu_msi_enable(VFIOvGPUDevice *vdev) +{ + int ret, i; + + vfio_vgpu_disable_interrupts(vdev); + + vdev->nr_vectors = msi_nr_vectors_allocated(&vdev->pdev); +retry: + vdev->msi_vectors = 
g_new0(VFIOMSIVector, vdev->nr_vectors); + + for (i = 0; i < vdev->nr_vectors; i++) { + VFIOMSIVector *vector = &vdev->msi_vectors[i]; + MSIMessage msg = msi_get_message(&vdev->pdev, i); + + vector->vdev = (VFIOPCIDevice *)vdev; + vector->virq = -1; + vector->use = true; + + if (event_notifier_init(&vector->interrupt, 0)) { + error_report("vfio: Error: event_notifier_init failed"); + } + qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt), + vfio_vgpu_msi_interrupt, NULL, vector); + + /* + * Attempt to enable route through KVM irqchip, + * default to userspace handling if unavailable. + */ + vfio_vgpu_add_kvm_msi_virq(vdev, vector, &msg, false); + } + + /* Set interrupt type prior to possible interrupts */ + vdev->interrupt = VFIO_INT_MSI; + + ret = vfio_vgpu_enable_vectors(vdev, false); + if (ret) { + if (ret < 0) { + error_report("vfio: Error: Failed to setup MSI fds: %m"); + } else if (ret != vdev->nr_vectors) { + error_report("vfio: Error: Failed to enable %d " + "MSI vectors, retry with %d", vdev->nr_vectors, ret); + } + + for (i = 0; i < vdev->nr_vectors; i++) { + VFIOMSIVector *vector = &vdev->msi_vectors[i]; + if (vector->virq >= 0) { + vfio_vgpu_remove_kvm_msi_virq(vector); + } + qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt), + NULL, NULL, NULL); + event_notifier_cleanup(&vector->interrupt); + } + + g_free(vdev->msi_vectors); + + if (ret > 0 && ret != vdev->nr_vectors) { + vdev->nr_vectors = ret; + goto retry; + } + vdev->nr_vectors = 0; + + /* + * Failing to setup MSI doesn't really fall within any specification. + * Let's try leaving interrupts disabled and hope the guest figures + * out to fall back to INTx for this device. 
+ */ + error_report("vfio: Error: Failed to enable MSI"); + vdev->interrupt = VFIO_INT_NONE; + + return; + } +} + +static void vfio_vgpu_update_msi(VFIOvGPUDevice *vdev) +{ + int i; + + for (i = 0; i < vdev->nr_vectors; i++) { + VFIOMSIVector *vector = &vdev->msi_vectors[i]; + MSIMessage msg; + + if (!vector->use || vector->virq < 0) { + continue; + } + + msg = msi_get_message(&vdev->pdev, i); + vfio_vgpu_update_kvm_msi_virq(vector, msg, &vdev->pdev); + } +} + +static int vfio_vgpu_msi_setup(VFIOvGPUDevice *vdev, int pos) +{ + uint16_t ctrl; + bool msi_64bit, msi_maskbit; + int ret, entries; + + if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl), + vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) { + return -errno; + } + ctrl = le16_to_cpu(ctrl); + + msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT); + msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT); + entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1); + + ret = msi_init(&vdev->pdev, pos, entries, msi_64bit, msi_maskbit); + if (ret < 0) { + if (ret == -ENOTSUP) { + return 0; + } + error_report("vfio: msi_init failed"); + return ret; + } + vdev->msi_cap_size = 0xa + (msi_maskbit ? 0xa : 0) + (msi_64bit ? 
0x4 : 0); + + return 0; +} + + +static int vfio_vgpu_msi_init(VFIOvGPUDevice *vdev) +{ + uint8_t pos; + int ret; + + pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSI); + if (!pos) { + return 0; + } + + ret = vfio_vgpu_msi_setup(vdev, pos); + if (ret < 0) { + error_report("vgpu: Error setting MSI@0x%x: %d", pos, ret); + return ret; + } + + return 0; +} + +/* + * VGPU device class functions + */ + +static void vfio_vgpu_reset(DeviceState *dev) +{ + + +} + +static void vfio_vgpu_eoi(VFIODevice *vbasedev) +{ + return; +} + +static int vfio_vgpu_hot_reset_multi(VFIODevice *vbasedev) +{ + // Nothing to be reset + return 0; +} + +static void vfio_vgpu_compute_needs_reset(VFIODevice *vbasedev) +{ + vbasedev->needs_reset = false; +} + +static VFIODeviceOps vfio_vgpu_ops = { + .vfio_compute_needs_reset = vfio_vgpu_compute_needs_reset, + .vfio_hot_reset_multi = vfio_vgpu_hot_reset_multi, + .vfio_eoi = vfio_vgpu_eoi, +}; + +static int vfio_vgpu_populate_device(VFIOvGPUDevice *vdev) +{ + VFIODevice *vbasedev = &vdev->vbasedev; + struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) }; + int i, ret = -1; + + for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) { + reg_info.index = i; + + ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, ®_info); + if (ret) { + error_report("vfio: Error getting region %d info: %m", i); + return ret; + } + + trace_vfio_populate_device_region(vbasedev->name, i, + (unsigned long)reg_info.size, + (unsigned long)reg_info.offset, + (unsigned long)reg_info.flags); + + vdev->bars[i].region.vbasedev = vbasedev; + vdev->bars[i].region.flags = reg_info.flags; + vdev->bars[i].region.size = reg_info.size; + vdev->bars[i].region.fd_offset = reg_info.offset; + vdev->bars[i].region.nr = i; + QLIST_INIT(&vdev->bars[i].quirks); + } + + reg_info.index = VFIO_PCI_CONFIG_REGION_INDEX; + + ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_REGION_INFO, ®_info); + if (ret) { + error_report("vfio: Error getting config info: %m"); + 
return ret; + } + + vdev->config_size = reg_info.size; + if (vdev->config_size == PCI_CONFIG_SPACE_SIZE) { + vdev->pdev.cap_present &= ~QEMU_PCI_CAP_EXPRESS; + } + vdev->config_offset = reg_info.offset; + + return 0; +} + +static void vfio_vgpu_create_virtual_bar(VFIOvGPUDevice *vdev, int nr) +{ + VFIOBAR *bar = &vdev->bars[nr]; + uint64_t size = bar->region.size; + char name[64]; + uint32_t pci_bar; + uint8_t type; + int ret; + + /* Skip both unimplemented BARs and the upper half of 64bit BARS. */ + if (!size) + return; + + /* Determine what type of BAR this is for registration */ + ret = pread(vdev->vbasedev.fd, &pci_bar, sizeof(pci_bar), + vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr)); + if (ret != sizeof(pci_bar)) { + error_report("vfio: Failed to read BAR %d (%m)", nr); + return; + } + + pci_bar = le32_to_cpu(pci_bar); + bar->ioport = (pci_bar & PCI_BASE_ADDRESS_SPACE_IO); + bar->mem64 = bar->ioport ? 0 : (pci_bar & PCI_BASE_ADDRESS_MEM_TYPE_64); + type = pci_bar & (bar->ioport ? 
~PCI_BASE_ADDRESS_IO_MASK : + ~PCI_BASE_ADDRESS_MEM_MASK); + + /* A "slow" read/write mapping underlies all BARs */ + memory_region_init_io(&bar->region.mem, OBJECT(vdev), &vfio_region_ops, + bar, name, size); + pci_register_bar(&vdev->pdev, nr, type, &bar->region.mem); + + // Create an invalid BAR1 mapping + if (bar->region.flags & VFIO_REGION_INFO_FLAG_MMAP) { + strncat(name, " mmap", sizeof(name) - strlen(name) - 1); + vfio_mmap_region(OBJECT(vdev), &bar->region, &bar->region.mem, + &bar->region.mmap_mem, &bar->region.mmap, + size, 0, name); + } +} + +static void vfio_vgpu_create_virtual_bars(VFIOvGPUDevice *vdev) +{ + + int i = 0; + + for (i = 0; i < PCI_ROM_SLOT; i++) { + vfio_vgpu_create_virtual_bar(vdev, i); + } +} + +static int vfio_vgpu_initfn(PCIDevice *pdev) +{ + VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev); + VFIOGroup *group; + ssize_t len; + int groupid; + struct stat st; + char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name; + int ret; + UuidInfo *uuid_info; + + uuid_info = qmp_query_uuid(NULL); + if (strcmp(uuid_info->UUID, UUID_NONE) == 0) { + return -EINVAL; + } else { + vdev->vm_uuid = uuid_info->UUID; + } + + + snprintf(path, sizeof(path), + "/sys/devices/virtual/vgpu/%s-0/", vdev->vm_uuid); + + if (stat(path, &st) < 0) { + error_report("vfio-vgpu: error: no such vgpu device: %s", path); + return -errno; + } + + vdev->vbasedev.ops = &vfio_vgpu_ops; + + vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI; + vdev->vbasedev.name = g_strdup_printf("%s-0", vdev->vm_uuid); + + strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1); + + len = readlink(path, iommu_group_path, sizeof(path)); + if (len <= 0 || len >= sizeof(path)) { + error_report("vfio-vgpu: error no iommu_group for device"); + return len < 0 ? 
-errno : -ENAMETOOLONG; + } + + iommu_group_path[len] = 0; + group_name = basename(iommu_group_path); + + if (sscanf(group_name, "%d", &groupid) != 1) { + error_report("vfio-vgpu: error reading %s: %m", path); + return -errno; + } + + // TODO: This will only work if we *only* have VFIO_VGPU_IOMMU enabled + + group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev)); + if (!group) { + error_report("vfio: failed to get group %d", groupid); + return -ENOENT; + } + + snprintf(path, sizeof(path), "%s-0", vdev->vm_uuid); + + ret = vfio_get_device(group, path, &vdev->vbasedev); + if (ret) { + error_report("vfio-vgpu; failed to get device %s", vdev->vgpu_type); + vfio_put_group(group); + return ret; + } + + ret = vfio_vgpu_populate_device(vdev); + if (ret) { + return ret; + } + + /* Get a copy of config space */ + ret = pread(vdev->vbasedev.fd, vdev->pdev.config, + MIN(pci_config_size(&vdev->pdev), vdev->config_size), + vdev->config_offset); + if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) { + ret = ret < 0 ? 
-errno : -EFAULT; + error_report("vfio: Failed to read device config space"); + return ret; + } + + vfio_vgpu_create_virtual_bars(vdev); + + ret = vfio_vgpu_msi_init(vdev); + if (ret < 0) { + error_report("%s: Error setting MSI %d", __FUNCTION__, ret); + return ret; + } + + if (vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) { + pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_vgpu_intx_update); + ret = vfio_vgpu_intx_enable(vdev); + if (ret) { + return ret; + } + } + + return 0; +} + + +static void vfio_vgpu_exitfn(PCIDevice *pdev) +{ + + +} + +static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len) +{ + VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev); + ssize_t ret; + uint32_t val = 0; + + ret = pread(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr); + + if (ret != len) { + error_report("%s: failed at offset:0x%0x %m", __func__, addr); + return 0xFFFFFFFF; + } + + // memcpy(&vdev->emulated_config_bits + addr, &val, len); + return val; +} + +static void vfio_vgpu_write_config(PCIDevice *pdev, uint32_t addr, + uint32_t val, int len) +{ + VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev); + ssize_t ret; + + ret = pwrite(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr); + + if (ret != len) { + error_report("%s: failed at offset:0x%0x, val:0x%0x %m", + __func__, addr, val); + return; + } + + if (pdev->cap_present & QEMU_PCI_CAP_MSI && + ranges_overlap(addr, len, pdev->msi_cap, vdev->msi_cap_size)) { + int is_enabled, was_enabled = msi_enabled(pdev); + + pci_default_write_config(pdev, addr, val, len); + + is_enabled = msi_enabled(pdev); + + if (!was_enabled) { + if (is_enabled) { + vfio_vgpu_msi_enable(vdev); + } + } else { + if (!is_enabled) { + vfio_vgpu_msi_disable(vdev); + } else { + vfio_vgpu_update_msi(vdev); + } + } + } + else { + /* Write everything to QEMU to keep emulated bits correct */ + pci_default_write_config(pdev, addr, val, len); + } + + pci_default_write_config(pdev, 
addr, val, len); + + return; +} + +static const VMStateDescription vfio_vgpu_vmstate = { + .name = TYPE_VFIO_VGPU, + .unmigratable = 1, +}; + +// +// We don't actually need the vfio_vgpu_properties +// as we can just simply rely on VM UUID to find +// the IOMMU group for this VM +// + + +static Property vfio_vgpu_properties[] = { + + DEFINE_PROP_STRING("vgpu", VFIOvGPUDevice, vgpu_type), + DEFINE_PROP_END_OF_LIST() +}; + +#if 0 + +static void vfio_vgpu_instance_init(Object *obj) +{ + +} + +static void vfio_vgpu_instance_finalize(Object *obj) +{ + + +} + +#endif + +static void vfio_vgpu_class_init(ObjectClass *klass, void *data) +{ + DeviceClass *dc = DEVICE_CLASS(klass); + PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass); + // vgpudc->parent_realize = dc->realize; + // dc->realize = calxeda_xgmac_realize; + dc->desc = "VFIO-based vGPU"; + dc->vmsd = &vfio_vgpu_vmstate; + dc->reset = vfio_vgpu_reset; + // dc->cannot_instantiate_with_device_add_yet = true; + dc->props = vfio_vgpu_properties; + set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories); + pdc->init = vfio_vgpu_initfn; + pdc->exit = vfio_vgpu_exitfn; + pdc->config_read = vfio_vgpu_read_config; + pdc->config_write = vfio_vgpu_write_config; + pdc->is_express = 0; /* For now, we are not */ + + pdc->vendor_id = PCI_DEVICE_ID_NVIDIA; + // pdc->device_id = 0x11B0; + pdc->class_id = PCI_CLASS_DISPLAY_VGA; +} + +static const TypeInfo vfio_vgpu_dev_info = { + .name = TYPE_VFIO_VGPU, + .parent = TYPE_PCI_DEVICE, + .instance_size = sizeof(VFIOvGPUDevice), + .class_init = vfio_vgpu_class_init, +}; + +static void register_vgpu_dev_type(void) +{ + type_register_static(&vfio_vgpu_dev_info); +} + +type_init(register_vgpu_dev_type) diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h index 379b6e1..9af5e17 100644 --- a/include/hw/pci/pci.h +++ b/include/hw/pci/pci.h @@ -64,6 +64,9 @@ #define PCI_DEVICE_ID_VMWARE_IDE 0x1729 #define PCI_DEVICE_ID_VMWARE_VMXNET3 0x07B0 +/* NVIDIA (0x10de) */ +#define PCI_DEVICE_ID_NVIDIA 0x10de + 
/* Intel (0x8086) */ #define PCI_DEVICE_ID_INTEL_82551IT 0x1209 #define PCI_DEVICE_ID_INTEL_82557 0x1229