From patchwork Tue Oct 18 12:38:21 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jike Song <jike.song@intel.com>
X-Patchwork-Id: 9382099
Message-ID: <580617BD.8000300@intel.com>
Date: Tue, 18 Oct 2016 20:38:21 +0800
From: Jike Song <jike.song@intel.com>
User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64;
	rv:17.0) Gecko/20130801 Thunderbird/17.0.8
MIME-Version: 1.0
To: Alex Williamson <alex.williamson@redhat.com>
References: <c9d63f52-9b6f-5752-2111-773b33adc426@redhat.com>
	<1259cdba-c137-c3da-abe2-ecf51aec6738@linux.intel.com>
	<e992eb4e-0806-8f6e-851d-36eaf389a897@redhat.com>
	<ea9dffe6-7afa-4862-e46f-6f780a309e46@linux.intel.com>
	<523e1446-75f1-fe3a-d818-f7d238d57751@redhat.com>
	<5800B579.9000705@intel.com> <20161014084158.623087aa@t450s.home>
	<20161014084601.2a50ba87@t450s.home>
	<20161014163545.GA6121@nvidia.com>
	<20161014105124.42b438a6@t450s.home>
	<20161014221901.GA8865@nvidia.com>
	<20161017100229.1474ae33@t450s.home>
In-Reply-To: <20161017100229.1474ae33@t450s.home>
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-Received-From: 192.55.52.93
Subject: Re: [Qemu-devel] [PATCH 1/2] KVM: page track: add a new notifier
	type: track_flush_slot
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Cc: "Tian, Kevin" <kevin.tian@intel.com>, Neo Jia <cjia@nvidia.com>,
	kvm@vger.kernel.org, guangrong.xiao@intel.com,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	Xiaoguang Chen <xiaoguang.chen@intel.com>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Errors-To: 
 qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org
Sender: "Qemu-devel"
	<qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>
X-Virus-Scanned: ClamAV using ClamSMTP

On 10/18/2016 12:02 AM, Alex Williamson wrote:
> On Fri, 14 Oct 2016 15:19:01 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
>> On Fri, Oct 14, 2016 at 10:51:24AM -0600, Alex Williamson wrote:
>>> On Fri, 14 Oct 2016 09:35:45 -0700
>>> Neo Jia <cjia@nvidia.com> wrote:
>>>   
>>>> On Fri, Oct 14, 2016 at 08:46:01AM -0600, Alex Williamson wrote:  
>>>>> On Fri, 14 Oct 2016 08:41:58 -0600
>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>>     
>>>>>> On Fri, 14 Oct 2016 18:37:45 +0800
>>>>>> Jike Song <jike.song@intel.com> wrote:
>>>>>>     
>>>>>>> On 10/11/2016 05:47 PM, Paolo Bonzini wrote:      
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/10/2016 11:21, Xiao Guangrong wrote:        
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/11/2016 04:54 PM, Paolo Bonzini wrote:        
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 11/10/2016 04:39, Xiao Guangrong wrote:        
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 10/11/2016 02:32 AM, Paolo Bonzini wrote:        
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/10/2016 20:01, Neo Jia wrote:        
>>>>>>>>>>>>>> Hi Neo,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> AFAIK this is needed because KVMGT doesn't paravirtualize the PPGTT,
>>>>>>>>>>>>>> while nVidia does.        
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Paolo and Xiaoguang,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am just wondering how device driver can register a notifier so he
>>>>>>>>>>>>> can be
>>>>>>>>>>>>> notified for write-protected pages when writes are happening.        
>>>>>>>>>>>>
>>>>>>>>>>>> It can't yet, but the API is ready for that.  kvm_vfio_set_group is
>>>>>>>>>>>> currently where a struct kvm_device* and struct vfio_group* touch.
>>>>>>>>>>>> Given
>>>>>>>>>>>> a struct kvm_device*, dev->kvm provides the struct kvm to be passed to
>>>>>>>>>>>> kvm_page_track_register_notifier.  So I guess you could add a callback
>>>>>>>>>>>> that passes the struct kvm_device* to the mdev device.
>>>>>>>>>>>>
>>>>>>>>>>>> Xiaoguang and Guangrong, what were your plans?  We discussed it briefly
>>>>>>>>>>>> at KVM Forum but I don't remember the details.        
>>>>>>>>>>>
>>>>>>>>>>> Your suggestion was that pass kvm fd to KVMGT via VFIO, so that we can
>>>>>>>>>>> figure out the kvm instance based on the fd.
>>>>>>>>>>>
>>>>>>>>>>> We got a new idea, how about search the kvm instance by mm_struct, it
>>>>>>>>>>> can work as KVMGT is running in the vcpu context and it is much more
>>>>>>>>>>> straightforward.        
>>>>>>>>>>
>>>>>>>>>> Perhaps I didn't understand your suggestion, but the same mm_struct can
>>>>>>>>>> have more than 1 struct kvm so I'm not sure that it can work.        
>>>>>>>>>
>>>>>>>>> vcpu->pid is valid during vcpu running so that it can be used to figure
>>>>>>>>> out which kvm instance owns the vcpu whose pid is the one as current
>>>>>>>>> thread, i think it can work. :)        
>>>>>>>>
>>>>>>>> No, don't do that.  There's no reason for a thread to run a single VCPU,
>>>>>>>> and if you can have multiple VCPUs you can also have multiple VCPUs from
>>>>>>>> multiple VMs.
>>>>>>>>
>>>>>>>> Passing file descriptors around are the right way to connect subsystems.        
>>>>>>>
>>>>>>> [CC Alex, Kevin and Qemu-devel]
>>>>>>>
>>>>>>> Hi Paolo & Alex,
>>>>>>>
>>>>>>> IIUC, passing file descriptors means touching QEMU and the UAPI between
>>>>>>> QEMU and VFIO. Would you guys have a look at below draft patch? If it's
>>>>>>> on the correct direction, I'll send the split ones. Thanks!
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> Jike
>>>>>>>
>>>>>>>
>>>>>>> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
>>>>>>> index bec694c..f715d37 100644
>>>>>>> --- a/hw/vfio/pci-quirks.c
>>>>>>> +++ b/hw/vfio/pci-quirks.c
>>>>>>> @@ -10,12 +10,14 @@
>>>>>>>   * the COPYING file in the top-level directory.
>>>>>>>   */
>>>>>>>  
>>>>>>> +#include <sys/ioctl.h>
>>>>>>>  #include "qemu/osdep.h"
>>>>>>>  #include "qemu/error-report.h"
>>>>>>>  #include "qemu/range.h"
>>>>>>>  #include "qapi/error.h"
>>>>>>>  #include "hw/nvram/fw_cfg.h"
>>>>>>>  #include "pci.h"
>>>>>>> +#include "sysemu/kvm.h"
>>>>>>>  #include "trace.h"
>>>>>>>  
>>>>>>>  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
>>>>>>> @@ -1844,3 +1846,15 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
>>>>>>>          break;
>>>>>>>      }
>>>>>>>  }
>>>>>>> +
>>>>>>> +void vfio_quirk_kvmgt(VFIOPCIDevice *vdev)
>>>>>>> +{
>>>>>>> +    int vmfd;
>>>>>>> +
>>>>>>> +    if (!kvm_enabled() || !vdev->kvmgt)
>>>>>>> +        return;
>>>>>>> +
>>>>>>> +    /* Tell the device what KVM it attached */
>>>>>>> +    vmfd = kvm_get_vmfd(kvm_state);
>>>>>>> +    ioctl(vdev->vbasedev.fd, VFIO_SET_KVMFD, vmfd);
>>>>>>> +}
>>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>>> index a5a620a..8732552 100644
>>>>>>> --- a/hw/vfio/pci.c
>>>>>>> +++ b/hw/vfio/pci.c
>>>>>>> @@ -2561,6 +2561,8 @@ static int vfio_initfn(PCIDevice *pdev)
>>>>>>>          return ret;
>>>>>>>      }
>>>>>>>  
>>>>>>> +    vfio_quirk_kvmgt(vdev);
>>>>>>> +
>>>>>>>      /* Get a copy of config space */
>>>>>>>      ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
>>>>>>>                  MIN(pci_config_size(&vdev->pdev), vdev->config_size),
>>>>>>> @@ -2832,6 +2834,7 @@ static Property vfio_pci_dev_properties[] = {
>>>>>>>      DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
>>>>>>>                         sub_device_id, PCI_ANY_ID),
>>>>>>>      DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
>>>>>>> +    DEFINE_PROP_BOOL("kvmgt", VFIOPCIDevice, kvmgt, false),      
>>>>>>
>>>>>> Just a side note, device options are a headache, users are prone to get
>>>>>> them wrong and minimally it requires an entire round to get libvirt
>>>>>> support.  We should be able to detect from the device or vfio API
>>>>>> whether such a call is required.  Obviously if we can use the existing
>>>>>> kvm-vfio device, that's the better option anyway.  Thanks,    
>>>>>
>>>>> Also, vfio devices currently have no hard dependencies on KVM, if kvmgt
>>>>> does, it needs to produce a device failure when unavailable.  Thanks,    
>>>>
>>>> Also, I would like to see this as an generic feature instead of
>>>> kvmgt specific interface, so we don't have to add new options to QEMU and it is
>>>> up to the vendor driver to proceed with or without it.  
>>>
>>> In general this should be decided by lack of some required feature
>>> exclusively provided by KVM.  I would not want to add a generic opt-out
>>> for mdev vendor drivers to decide that they arbitrarily want to disable
>>> that path.  Thanks,  
>>
>> IIUC, you are suggesting that this path should be controlled by KVM feature cap
>> and it will be accessible to VFIO users when such checking is satisfied.
> 
> Maybe we're getting too loose with our pronouns here, I'm starting to
> lose track of what "this" is referring to.  I agree that there's no
> reason for the ioctl, as proposed to be kvmgt specific.  I would hope
> that going through the kvm-vfio device to create that linkage would
> eliminate that, but we'll need to see what Jike can come up with to
> plumb between KVM and vfio.  Vendor drivers can implement their own
> ioctls, now that we pass them through the mdev layer, but someone needs
> to call those ioctls.  Ideally we want something programmatic to
> trigger that, without requiring a user to pass an extra device
> parameter.  Additionally, if there is any hope of making use of the
> device with userspace drivers other than QEMU, hard dependencies on KVM
> should be avoided.  Thanks,
> 
> Alex
> 

Thanks for the advice. I have cooked up another patch for your comments.
Basically, a 'void *usrdata' field is added to vfio_group; external users
can set it (kvm) or get it (kvm, or other users such as kvmgt).

BTW, in the device model, the open method will return failure to vfio-mdev
if this kvm information is not available.
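
To make the intended usage concrete, here is a minimal userspace sketch of
the set/get flow described above. It is only an illustration under stated
assumptions: the structs are simplified stand-ins rather than the real kernel
definitions, and example_mdev_open() and the choice of -ENODEV are
hypothetical, not taken from the kvmgt code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumption, not real). */
struct kvm {
	int users_count;
};

struct vfio_group {
	void *usrdata;	/* opaque pointer stored by an external user */
};

/* Same shape as vfio_group_set_usrdata() in the patch below. */
void vfio_group_set_usrdata(struct vfio_group *group, void *data)
{
	assert(group != NULL);
	group->usrdata = data;
}

/* Same shape as vfio_group_get_usrdata() in the patch below. */
void *vfio_group_get_usrdata(struct vfio_group *group)
{
	assert(group != NULL);
	return group->usrdata;
}

/*
 * Hypothetical vendor-driver open method: refuse to open the device
 * when no kvm instance has been attached to the group yet.
 * The error code is an assumption chosen for this sketch.
 */
int example_mdev_open(struct vfio_group *group)
{
	struct kvm *kvm = vfio_group_get_usrdata(group);

	if (!kvm)
		return -ENODEV;

	return 0;
}
```

In this sketch the open fails until the kvm side has called
vfio_group_set_usrdata(), which is the ordering the real patch relies on
through KVM_DEV_VFIO_GROUP_ADD.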
---
Thanks,
Jike

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index d1d70e0..6b8d1d2 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -86,6 +86,7 @@ struct vfio_group {
 	struct mutex			unbound_lock;
 	atomic_t			opened;
 	bool				noiommu;
+	void				*usrdata;
 };
 
 struct vfio_device {
@@ -447,14 +448,13 @@ static struct vfio_group *vfio_group_try_get(struct vfio_group *group)
 }
 
 static
-struct vfio_group *vfio_group_get_from_iommu(struct iommu_group *iommu_group)
+struct vfio_group *__vfio_group_get_from_iommu(struct iommu_group *iommu_group)
 {
 	struct vfio_group *group;
 
 	mutex_lock(&vfio.group_lock);
 	list_for_each_entry(group, &vfio.group_list, vfio_next) {
 		if (group->iommu_group == iommu_group) {
-			vfio_group_get(group);
 			mutex_unlock(&vfio.group_lock);
 			return group;
 		}
@@ -464,6 +464,17 @@ struct vfio_group *vfio_group_get_from_iommu(struct iommu_group *iommu_group)
 	return NULL;
 }
 
+static
+struct vfio_group *vfio_group_get_from_iommu(struct iommu_group *iommu_group)
+{
+	struct vfio_group *group = __vfio_group_get_from_iommu(iommu_group);
+	if (!group)
+		return NULL;
+
+	vfio_group_get(group);
+	return group;
+}
+
 static struct vfio_group *vfio_group_get_from_minor(int minor)
 {
 	struct vfio_group *group;
@@ -1728,6 +1739,31 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
 }
 EXPORT_SYMBOL_GPL(vfio_external_check_extension);
 
+void vfio_group_set_usrdata(struct vfio_group *group, void *data)
+{
+	group->usrdata = data;
+}
+EXPORT_SYMBOL_GPL(vfio_group_set_usrdata);
+
+void *vfio_group_get_usrdata(struct vfio_group *group)
+{
+	return group->usrdata;
+}
+EXPORT_SYMBOL_GPL(vfio_group_get_usrdata);
+
+void *vfio_group_get_usrdata_by_device(struct device *dev)
+{
+	struct vfio_group *vfio_group;
+
+	vfio_group = __vfio_group_get_from_iommu(dev->iommu_group);
+	if (!vfio_group)
+		return NULL;
+
+	return vfio_group_get_usrdata(vfio_group);
+}
+EXPORT_SYMBOL_GPL(vfio_group_get_usrdata_by_device);
+
+
 /**
  * Sub-module support
  */
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b..712588f 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -91,6 +91,10 @@ extern void vfio_unregister_iommu_driver(
 extern int vfio_external_user_iommu_id(struct vfio_group *group);
 extern long vfio_external_check_extension(struct vfio_group *group,
 					  unsigned long arg);
+extern void vfio_group_set_usrdata(struct vfio_group *group, void *data);
+extern void *vfio_group_get_usrdata(struct vfio_group *group);
+extern void *vfio_group_get_usrdata_by_device(struct device *dev);
+
 
 /*
  * Sub-module helpers
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index 1dd087d..e00d401 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -60,6 +60,20 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
 	symbol_put(vfio_group_put_external_user);
 }
 
+static void kvm_vfio_group_set_kvm(struct vfio_group *group, void *kvm)
+{
+	void (*fn)(struct vfio_group *, void *);
+
+	fn = symbol_get(vfio_group_set_usrdata);
+	if (!fn)
+		return;
+
+	fn(group, kvm);
+	kvm_get_kvm(kvm);
+
+	symbol_put(vfio_group_set_usrdata);
+}
+
 static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
 {
 	long (*fn)(struct vfio_group *, unsigned long);
@@ -161,6 +175,8 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 
 		kvm_vfio_update_coherency(dev);
 
+		kvm_vfio_group_set_kvm(vfio_group, dev->kvm);
+
 		return 0;
 
 	case KVM_DEV_VFIO_GROUP_DEL:
@@ -200,6 +216,8 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 
 		kvm_vfio_update_coherency(dev);
 
+		kvm_put_kvm(dev->kvm);
+
 		return ret;
 	}