diff mbox

[v8,4/6] docs: Add Documentation for Mediated devices

Message ID 20161012154457.5a05b4fb@t450s.home (mailing list archive)
State New, archived
Headers show

Commit Message

Alex Williamson Oct. 12, 2016, 9:44 p.m. UTC
On Thu, 13 Oct 2016 00:32:48 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/12/2016 9:29 PM, Alex Williamson wrote:
> > On Wed, 12 Oct 2016 20:43:48 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 10/12/2016 7:22 AM, Tian, Kevin wrote:  
> >>>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> >>>> Sent: Wednesday, October 12, 2016 4:45 AM    
> >>>>>> +* mdev_supported_types:
> >>>>>> +    List of current supported mediated device types and its details are added
> >>>>>> +in this directory in following format:
> >>>>>> +
> >>>>>> +|- <parent phy device>
> >>>>>> +|--- Vendor-specific-attributes [optional]
> >>>>>> +|--- mdev_supported_types
> >>>>>> +|     |--- <type id>
> >>>>>> +|     |   |--- create
> >>>>>> +|     |   |--- name
> >>>>>> +|     |   |--- available_instances
> >>>>>> +|     |   |--- description /class
> >>>>>> +|     |   |--- [devices]
> >>>>>> +|     |--- <type id>
> >>>>>> +|     |   |--- create
> >>>>>> +|     |   |--- name
> >>>>>> +|     |   |--- available_instances
> >>>>>> +|     |   |--- description /class
> >>>>>> +|     |   |--- [devices]
> >>>>>> +|     |--- <type id>
> >>>>>> +|          |--- create
> >>>>>> +|          |--- name
> >>>>>> +|          |--- available_instances
> >>>>>> +|          |--- description /class
> >>>>>> +|          |--- [devices]
> >>>>>> +
> >>>>>> +[TBD : description or class is yet to be decided. This will change.]    
> >>>>>
> >>>>> I thought that in previous discussions we had agreed to drop
> >>>>> the <type id> concept and use the name as the unique identifier.
> >>>>> When reporting these types in libvirt we won't want to report
> >>>>> the type id values - we'll want the name strings to be unique.
> >>>>>    
> >>>>
> >>>> The 'name' might not be unique but type_id will be. For example that Neo
> >>>> pointed out in earlier discussion, virtual devices can come from two
> >>>> different physical devices, end user would be presented with what they
> >>>> had selected but there will be internal implementation differences. In
> >>>> that case 'type_id' will be unique.
> >>>>    
> >>>
> >>> Hi, Kirti, my understanding is that Neo agreed to use an unique type
> >>> string (if you still called it <type id>), and then no need of additional
> >>> 'name' field which can be put inside 'description' field. See below quote:
> >>>     
> >>
> >> We had internal discussions about this within NVIDIA and found that
> >> 'name' might not be unique where as 'type_id' would be unique. I'm
> >> refering to Neo's mail after that, where Neo do pointed that out.
> >>
> >> https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html  
> > 
> > Everyone not privy to those internal discussions, including me, seems to
> > think we dropped type_id and that if a vendor does not have a stable
> > name, they can compose some sort of stable type description based on the
> > name+id, or even vendor+id, ex. NVIDIA-11.  So please share why we
> > haven't managed to kill off type_id yet.  No matter what internal
> > representation each vendor driver has of "type_id" it seems possible
> > for it to come up with stable string to define a given configuration.  
> 
> 
> The 'type_id' is unique and the 'name' are not, the name is just a
> virtual device name/ human readable name. Because at this moment Intel
> can't define a proper GPU class, we have to add a 'description' field
> there as well to represent the features of this virtual device, once we
> have all agreed with the GPU class and its mandatory attributes, the
> 'description' field can be removed. Here is an example,
> type_id/type_name = NVIDIA_11,
> name=M60-M0Q,
> description=2560x1600, 2 displays, 512MB"
> 
> Neo's previous comment only applies to the situation where we will have
> the GPU class or optional attributes defined and recognized by libvirt,
> since that is not going to happen any time soon, we will have to have
> the new 'description' field, and we don't want to have it mixed up with
> 'name' field.
> 
> We can definitely have something like name+id as Alex recommended to
> remove the 'name' field, but it will just require libvirt to have more
> logic to parse that string.

Let's use the mtty example driver provided in patch 5 so we can all
more clearly see how the interfaces work.  I'll start from the
beginning of my experience and work my way to the type/name thing.

(please add a modules_install target to the Makefile)

# modprobe mtty

Now what?  It seems like I need to have prior knowledge that this
drivers supports mdev devices and I need to go hunt for them.  We need
to create a class (ex. /sys/class/mdev/) where a user can find all the
devices that participate in this mediated device infrastructure.  That
would point me to /sys/devices/mtty.

# tree /sys/devices/mtty
/sys/devices/mtty
|-- mdev_supported_types
|   `-- mtty1
|       |-- available_instances (1)
|       |-- create
|       |-- devices
|       `-- name ("Dual-port-serial")
|-- mtty_dev
|   `-- sample_mtty_dev ("This is phy device")
|-- power
|   |-- async
|   |-- autosuspend_delay_ms
|   |-- control
|   |-- runtime_active_kids
|   |-- runtime_active_time
|   |-- runtime_enabled
|   |-- runtime_status
|   |-- runtime_suspended_time
|   `-- runtime_usage
`-- uevent

Ok, but that was boring, we really need to have at least 2 supported
types to validate the interface, so without changing the actual device
backing, I pretended to have a single port vs dual port:

/sys/devices/mtty
|-- mdev_supported_types
|   |-- mtty1
|   |   |-- available_instances (24)
|   |   |-- create
|   |   |-- devices
|   |   `-- name (Single-port-serial)
|   `-- mtty2
|       |-- available_instances (12)
|       |-- create
|       |-- devices
|       `-- name (Dual-port-serial)
[snip]

I arbitrarily decides I have 24 ports and each single port uses 1 port
and each dual port uses 2 ports.

Before I start creating devices, what are we going to key the libvirt
XML on?  Can we do anything to prevent vendors from colliding or do we
have any way to suggest meaningful and unique type_ids?  Presumably if
we had a PCI device hosting this, we would be rooted at that parent
device, ex. /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0.  Maybe
the type_id should automatically be prefixed by the vendor module name,
ex. mtty-1, i915-foo, nvidia-bar.  There's something missing for
deterministically creating a "XYZ" device and knowing exactly what that
means and finding a parent device that supports it.

Let's get to mdev creating...

# uuidgen > mdev_supported_types/mtty2/create
# tree /sys/devices/mtty
/sys/devices/mtty
|-- e68189be-700e-41f7-93a3-b5351e79c470
|   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
|   |-- iommu_group -> ../../../kernel/iommu_groups/63
|   |-- mtty2 -> ../mdev_supported_types/mtty2
|   |-- power
|   |   |-- async
|   |   |-- autosuspend_delay_ms
|   |   |-- control
|   |   |-- runtime_active_kids
|   |   |-- runtime_active_time
|   |   |-- runtime_enabled
|   |   |-- runtime_status
|   |   |-- runtime_suspended_time
|   |   `-- runtime_usage
|   |-- remove
|   |-- subsystem -> ../../../bus/mdev
|   |-- uevent
|   `-- vendor
|       `-- sample_mdev_dev ("This is MDEV e68189be-700e-41f7-93a3-b5351e79c470")
|-- mdev_supported_types
|   |-- mtty1
|   |   |-- available_instances (22)
|   |   |-- create
|   |   |-- devices
|   |   `-- name
|   `-- mtty2
|       |-- available_instances (11)
|       |-- create
|       |-- devices
|       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> ../../../e68189be-700e-41f7-93a3-b5351e79c470
|       `-- name

The mdev device was created directly under the parent, which seems like
it's going to get messy to me (ie. imagine dropping a bunch of uuids
into a PCI parent device's sysfs directory, how does a user know what
they are?).

Under the device we have "mtty2", shouldn't that be
"mdev_supported_type", which then links to mtty2?  Otherwise a user
needs to decode from the link what this attribute is.

Also here's an example of those vendor sysfs entries per device.  So
long as the vendor never expects a tool like libvirt to manipulate
attributes there, I can see how that could be pretty powerful.

Moving down to the mdev_supported_types, I've updated mtty so that it
actually adjusts available instance, and we can now see a link under
the devices for mtty2.

Also worth noting is that a link for the device appears
in /sys/bus/mdev/devices.

BTW, specifying this device for QEMU vfio-pci is where the sysfsdev
option comes into play:

-device
vfio-pci,sysfsdev=/sys/devices/mtty/e68189be-700e-41f7-93a3-b5351e79c470

Which raises another question, we can tell through the vfio interfaces
that this is exposes as a PCI device, by creating a container
(open(/dev/vfio/vfio)), setting an iommu (ioctl(VFIO_SET_IOMMU)),
adding the group to the container (ioctl(VFIO_GROUP_SET_CONTAINER)),
getting the device (ioctl(VFIO_GROUP_GET_DEVICE_FD)), and finally
getting the device info (ioctl(VFIO_DEVICE_GET_INFO)) and checking the
flag bit that says the API is PCI.  That's a long path to go and has
stumbling blocks like the type of iommu that's available for the given
platform.  How do we make that manageable?  I don't think we want to
create some artificial relationship that the type of the parent
necessarily match the type of the child mdev, we've already broken that
with a simple mdev tty driver.

One more:

# uuidgen > mdev_supported_types/mtty1/create
# tree /sys/devices/mtty
/sys/devices/mtty
|-- a7ae17d1-2de4-44c2-ae58-20ae0a0befe8
|   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
|   |-- iommu_group -> ../../../kernel/iommu_groups/64
|   |-- mtty1 -> ../mdev_supported_types/mtty1
|   |-- power
[snip]
|   |-- remove
|   |-- subsystem -> ../../../bus/mdev
|   |-- uevent
|   `-- vendor
|       `-- sample_mdev_dev ("This is MDEV a7ae17d1-2de4-44c2-ae58-20ae0a0befe8")
|-- e68189be-700e-41f7-93a3-b5351e79c470
|   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
|   |-- iommu_group -> ../../../kernel/iommu_groups/63
|   |-- mtty2 -> ../mdev_supported_types/mtty2
|   |-- power
[snip]
|   |-- remove
|   |-- subsystem -> ../../../bus/mdev
|   |-- uevent
|   `-- vendor
|       `-- sample_mdev_dev ("This is MDEV e68189be-700e-41f7-93a3-b5351e79c470")
|-- mdev_supported_types
|   |-- mtty1
|   |   |-- available_instances (21)
|   |   |-- create
|   |   |-- devices
|   |   |   `-- a7ae17d1-2de4-44c2-ae58-20ae0a0befe8 -> ../../../a7ae17d1-2de4-44c2-ae58-20ae0a0befe8
|   |   `-- name
|   `-- mtty2
|       |-- available_instances (10)
|       |-- create
|       |-- devices
|       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> ../../../e68189be-700e-41f7-93a3-b5351e79c470
|       `-- name

Hopefully as expected with the caveats for the first example.

# echo 1 > a7ae17d1-2de4-44c2-ae58-20ae0a0befe8/remove
# echo 1 > e68189be-700e-41f7-93a3-b5351e79c470/remove

These do what they're supposed to, the devices are gone.

Ok, I've identified some issues, let's figure out how to resolve them.
Thanks,

Alex

(hack multi-port mtty patch attached)

Comments

Kirti Wankhede Oct. 13, 2016, 9:22 a.m. UTC | #1
On 10/13/2016 3:14 AM, Alex Williamson wrote:
> On Thu, 13 Oct 2016 00:32:48 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 10/12/2016 9:29 PM, Alex Williamson wrote:
>>> On Wed, 12 Oct 2016 20:43:48 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 10/12/2016 7:22 AM, Tian, Kevin wrote:  
>>>>>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>>>>>> Sent: Wednesday, October 12, 2016 4:45 AM    
>>>>>>>> +* mdev_supported_types:
>>>>>>>> +    List of current supported mediated device types and its details are added
>>>>>>>> +in this directory in following format:
>>>>>>>> +
>>>>>>>> +|- <parent phy device>
>>>>>>>> +|--- Vendor-specific-attributes [optional]
>>>>>>>> +|--- mdev_supported_types
>>>>>>>> +|     |--- <type id>
>>>>>>>> +|     |   |--- create
>>>>>>>> +|     |   |--- name
>>>>>>>> +|     |   |--- available_instances
>>>>>>>> +|     |   |--- description /class
>>>>>>>> +|     |   |--- [devices]
>>>>>>>> +|     |--- <type id>
>>>>>>>> +|     |   |--- create
>>>>>>>> +|     |   |--- name
>>>>>>>> +|     |   |--- available_instances
>>>>>>>> +|     |   |--- description /class
>>>>>>>> +|     |   |--- [devices]
>>>>>>>> +|     |--- <type id>
>>>>>>>> +|          |--- create
>>>>>>>> +|          |--- name
>>>>>>>> +|          |--- available_instances
>>>>>>>> +|          |--- description /class
>>>>>>>> +|          |--- [devices]
>>>>>>>> +
>>>>>>>> +[TBD : description or class is yet to be decided. This will change.]    
>>>>>>>
>>>>>>> I thought that in previous discussions we had agreed to drop
>>>>>>> the <type id> concept and use the name as the unique identifier.
>>>>>>> When reporting these types in libvirt we won't want to report
>>>>>>> the type id values - we'll want the name strings to be unique.
>>>>>>>    
>>>>>>
>>>>>> The 'name' might not be unique but type_id will be. For example that Neo
>>>>>> pointed out in earlier discussion, virtual devices can come from two
>>>>>> different physical devices, end user would be presented with what they
>>>>>> had selected but there will be internal implementation differences. In
>>>>>> that case 'type_id' will be unique.
>>>>>>    
>>>>>
>>>>> Hi, Kirti, my understanding is that Neo agreed to use an unique type
>>>>> string (if you still called it <type id>), and then no need of additional
>>>>> 'name' field which can be put inside 'description' field. See below quote:
>>>>>     
>>>>
>>>> We had internal discussions about this within NVIDIA and found that
>>>> 'name' might not be unique where as 'type_id' would be unique. I'm
>>>> refering to Neo's mail after that, where Neo do pointed that out.
>>>>
>>>> https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html  
>>>
>>> Everyone not privy to those internal discussions, including me, seems to
>>> think we dropped type_id and that if a vendor does not have a stable
>>> name, they can compose some sort of stable type description based on the
>>> name+id, or even vendor+id, ex. NVIDIA-11.  So please share why we
>>> haven't managed to kill off type_id yet.  No matter what internal
>>> representation each vendor driver has of "type_id" it seems possible
>>> for it to come up with stable string to define a given configuration.  
>>
>>
>> The 'type_id' is unique and the 'name' are not, the name is just a
>> virtual device name/ human readable name. Because at this moment Intel
>> can't define a proper GPU class, we have to add a 'description' field
>> there as well to represent the features of this virtual device, once we
>> have all agreed with the GPU class and its mandatory attributes, the
>> 'description' field can be removed. Here is an example,
>> type_id/type_name = NVIDIA_11,
>> name=M60-M0Q,
>> description=2560x1600, 2 displays, 512MB"
>>
>> Neo's previous comment only applies to the situation where we will have
>> the GPU class or optional attributes defined and recognized by libvirt,
>> since that is not going to happen any time soon, we will have to have
>> the new 'description' field, and we don't want to have it mixed up with
>> 'name' field.
>>
>> We can definitely have something like name+id as Alex recommended to
>> remove the 'name' field, but it will just require libvirt to have more
>> logic to parse that string.
> 
> Let's use the mtty example driver provided in patch 5 so we can all
> more clearly see how the interfaces work.  I'll start from the
> beginning of my experience and work my way to the type/name thing.
> 

Thanks for looking into it and getting feel of it. And I hope this helps
to understand that 'name' and 'type_id' are different.


> (please add a modules_install target to the Makefile)
>

This is an example and I feel it should not be installed in
/lib/modules/../build path. This should be used to understand the
interface and the flow of mdev device management life cycle. User can
use insmod to load driver:

# insmod mtty.ko

> # modprobe mtty
> 
> Now what?  It seems like I need to have prior knowledge that this
> drivers supports mdev devices and I need to go hunt for them.  We need
> to create a class (ex. /sys/class/mdev/) where a user can find all the
> devices that participate in this mediated device infrastructure.  That
> would point me to /sys/devices/mtty.
> 

You can find devices registered to mdev framework by searching for
'mdev_supported_types' directory at the leaf nodes of devices in
/sys/devices directory. Yes, we can have 'mdev' class and links to
devices which are registered to mdev framework in /sys/class/mdev/.


> # tree /sys/devices/mtty
> /sys/devices/mtty
> |-- mdev_supported_types
> |   `-- mtty1
> |       |-- available_instances (1)
> |       |-- create
> |       |-- devices
> |       `-- name ("Dual-port-serial")
> |-- mtty_dev
> |   `-- sample_mtty_dev ("This is phy device")
> |-- power
> |   |-- async
> |   |-- autosuspend_delay_ms
> |   |-- control
> |   |-- runtime_active_kids
> |   |-- runtime_active_time
> |   |-- runtime_enabled
> |   |-- runtime_status
> |   |-- runtime_suspended_time
> |   `-- runtime_usage
> `-- uevent
> 
> Ok, but that was boring, we really need to have at least 2 supported
> types to validate the interface, so without changing the actual device
> backing, I pretended to have a single port vs dual port:
> 
> /sys/devices/mtty
> |-- mdev_supported_types
> |   |-- mtty1
> |   |   |-- available_instances (24)
> |   |   |-- create
> |   |   |-- devices
> |   |   `-- name (Single-port-serial)
> |   `-- mtty2
> |       |-- available_instances (12)
> |       |-- create
> |       |-- devices
> |       `-- name (Dual-port-serial)
> [snip]
> 
> I arbitrarily decides I have 24 ports and each single port uses 1 port
> and each dual port uses 2 ports.
> 
> Before I start creating devices, what are we going to key the libvirt
> XML on?  Can we do anything to prevent vendors from colliding or do we
> have any way to suggest meaningful and unique type_ids? 

Libvirt would have parent and type_id in XML. No two vendors can own
same parent device. So I don't think vendors would collide even having
same type_id, since <parent, type_id> pair would always be unique.


 Presumably if
> we had a PCI device hosting this, we would be rooted at that parent
> device, ex. /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0.  Maybe
> the type_id should automatically be prefixed by the vendor module name,
> ex. mtty-1, i915-foo, nvidia-bar.  There's something missing for
> deterministically creating a "XYZ" device and knowing exactly what that
> means and finding a parent device that supports it.
> 

We can prefix type_id with module name, i.e using dev->driver->name, but
<parent, type_id> pair is unique so I don't see much benefit in doing
that.


> Let's get to mdev creating...
> 
> # uuidgen > mdev_supported_types/mtty2/create
> # tree /sys/devices/mtty
> /sys/devices/mtty
> |-- e68189be-700e-41f7-93a3-b5351e79c470
> |   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
> |   |-- iommu_group -> ../../../kernel/iommu_groups/63
> |   |-- mtty2 -> ../mdev_supported_types/mtty2
> |   |-- power
> |   |   |-- async
> |   |   |-- autosuspend_delay_ms
> |   |   |-- control
> |   |   |-- runtime_active_kids
> |   |   |-- runtime_active_time
> |   |   |-- runtime_enabled
> |   |   |-- runtime_status
> |   |   |-- runtime_suspended_time
> |   |   `-- runtime_usage
> |   |-- remove
> |   |-- subsystem -> ../../../bus/mdev
> |   |-- uevent
> |   `-- vendor
> |       `-- sample_mdev_dev ("This is MDEV e68189be-700e-41f7-93a3-b5351e79c470")
> |-- mdev_supported_types
> |   |-- mtty1
> |   |   |-- available_instances (22)
> |   |   |-- create
> |   |   |-- devices
> |   |   `-- name
> |   `-- mtty2
> |       |-- available_instances (11)
> |       |-- create
> |       |-- devices
> |       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> ../../../e68189be-700e-41f7-93a3-b5351e79c470
> |       `-- name
> 
> The mdev device was created directly under the parent, which seems like
> it's going to get messy to me (ie. imagine dropping a bunch of uuids
> into a PCI parent device's sysfs directory, how does a user know what
> they are?).
> 

That is the way devices are placed in sysfs. For example below devices:

80:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
Root Port 1a (rev 07)
80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
Root Port 2a (rev 07)
80:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
Root Port 3a in PCI Express Mode (rev 07)
80:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
0 (rev 07)
80:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
1 (rev 07)
80:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
2 (rev 07)
80:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
3 (rev 07)
80:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
4 (rev 07)
80:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
5 (rev 07)
80:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
6 (rev 07)
80:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
7 (rev 07)
80:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address
Map, VTd_Misc, System Management (rev 07)
80:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control
Status and Global Errors (rev 07)
80:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)

In sysfs, those are in same parent folder of its parent root port:

# ls /sys/devices/pci0000\:80/ -l
total 0
drwxr-xr-x 8 root root    0 Oct 13 12:08 0000:80:01.0
drwxr-xr-x 7 root root    0 Oct 13 12:08 0000:80:02.0
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:03.0
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.0
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.1
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.2
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.3
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.4
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.5
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.6
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.7
drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.0
drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.2
drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.4
lrwxrwxrwx 1 root root    0 Oct 13 13:25 firmware_node ->
../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:01
drwxr-xr-x 3 root root    0 Oct 13 12:08 pci_bus
drwxr-xr-x 2 root root    0 Oct 13 12:08 power
-rw-r--r-- 1 root root 4096 Oct 13 13:25 uevent


> Under the device we have "mtty2", shouldn't that be
> "mdev_supported_type", which then links to mtty2?  Otherwise a user
> needs to decode from the link what this attribute is.
> 

I thought it should show type, so that by looking at 'ls' output user
should be able to find type_id.

> Also here's an example of those vendor sysfs entries per device.  So
> long as the vendor never expects a tool like libvirt to manipulate
> attributes there, I can see how that could be pretty powerful.
> 

Yes, it is good to have vendor specific entries, libvirt might not
report/use it. That would be more useful for system admin to get extra
information manually that libvirt doesn't report.


> Moving down to the mdev_supported_types, I've updated mtty so that it
> actually adjusts available instance, and we can now see a link under
> the devices for mtty2.
> 
> Also worth noting is that a link for the device appears
> in /sys/bus/mdev/devices.
> 
> BTW, specifying this device for QEMU vfio-pci is where the sysfsdev
> option comes into play:
> 
> -device
> vfio-pci,sysfsdev=/sys/devices/mtty/e68189be-700e-41f7-93a3-b5351e79c470
> 
> Which raises another question, we can tell through the vfio interfaces
> that this is exposes as a PCI device, by creating a container
> (open(/dev/vfio/vfio)), setting an iommu (ioctl(VFIO_SET_IOMMU)),
> adding the group to the container (ioctl(VFIO_GROUP_SET_CONTAINER)),
> getting the device (ioctl(VFIO_GROUP_GET_DEVICE_FD)), and finally
> getting the device info (ioctl(VFIO_DEVICE_GET_INFO)) and checking the
> flag bit that says the API is PCI.  That's a long path to go and has
> stumbling blocks like the type of iommu that's available for the given
> platform.  How do we make that manageable? 

Do you want device type to be expressed in sysfs? Then that should be
done from vendor driver. vfio_mdev module is now a shim layer, so mdev
core or vfio_mdev module don't know what device type flag vendor driver
had set.

Thanks,
Kirti

> I don't think we want to
> create some artificial relationship that the type of the parent
> necessarily match the type of the child mdev, we've already broken that
> with a simple mdev tty driver.
> 
> One more:
> 
> # uuidgen > mdev_supported_types/mtty1/create
> # tree /sys/devices/mtty
> /sys/devices/mtty
> |-- a7ae17d1-2de4-44c2-ae58-20ae0a0befe8
> |   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
> |   |-- iommu_group -> ../../../kernel/iommu_groups/64
> |   |-- mtty1 -> ../mdev_supported_types/mtty1
> |   |-- power
> [snip]
> |   |-- remove
> |   |-- subsystem -> ../../../bus/mdev
> |   |-- uevent
> |   `-- vendor
> |       `-- sample_mdev_dev ("This is MDEV a7ae17d1-2de4-44c2-ae58-20ae0a0befe8")
> |-- e68189be-700e-41f7-93a3-b5351e79c470
> |   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
> |   |-- iommu_group -> ../../../kernel/iommu_groups/63
> |   |-- mtty2 -> ../mdev_supported_types/mtty2
> |   |-- power
> [snip]
> |   |-- remove
> |   |-- subsystem -> ../../../bus/mdev
> |   |-- uevent
> |   `-- vendor
> |       `-- sample_mdev_dev ("This is MDEV e68189be-700e-41f7-93a3-b5351e79c470")
> |-- mdev_supported_types
> |   |-- mtty1
> |   |   |-- available_instances (21)
> |   |   |-- create
> |   |   |-- devices
> |   |   |   `-- a7ae17d1-2de4-44c2-ae58-20ae0a0befe8 -> ../../../a7ae17d1-2de4-44c2-ae58-20ae0a0befe8
> |   |   `-- name
> |   `-- mtty2
> |       |-- available_instances (10)
> |       |-- create
> |       |-- devices
> |       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> ../../../e68189be-700e-41f7-93a3-b5351e79c470
> |       `-- name
> 
> Hopefully as expected with the caveats for the first example.
> 
> # echo 1 > a7ae17d1-2de4-44c2-ae58-20ae0a0befe8/remove
> # echo 1 > e68189be-700e-41f7-93a3-b5351e79c470/remove
> 
> These do what they're supposed to, the devices are gone.
> 
> Ok, I've identified some issues, let's figure out how to resolve them.
> Thanks,
> 
> Alex
> 
> (hack multi-port mtty patch attached)
>
Alex Williamson Oct. 13, 2016, 2:36 p.m. UTC | #2
On Thu, 13 Oct 2016 14:52:09 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/13/2016 3:14 AM, Alex Williamson wrote:
> > On Thu, 13 Oct 2016 00:32:48 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 10/12/2016 9:29 PM, Alex Williamson wrote:  
> >>> On Wed, 12 Oct 2016 20:43:48 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 10/12/2016 7:22 AM, Tian, Kevin wrote:    
> >>>>>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> >>>>>> Sent: Wednesday, October 12, 2016 4:45 AM      
> >>>>>>>> +* mdev_supported_types:
> >>>>>>>> +    List of current supported mediated device types and its details are added
> >>>>>>>> +in this directory in following format:
> >>>>>>>> +
> >>>>>>>> +|- <parent phy device>
> >>>>>>>> +|--- Vendor-specific-attributes [optional]
> >>>>>>>> +|--- mdev_supported_types
> >>>>>>>> +|     |--- <type id>
> >>>>>>>> +|     |   |--- create
> >>>>>>>> +|     |   |--- name
> >>>>>>>> +|     |   |--- available_instances
> >>>>>>>> +|     |   |--- description /class
> >>>>>>>> +|     |   |--- [devices]
> >>>>>>>> +|     |--- <type id>
> >>>>>>>> +|     |   |--- create
> >>>>>>>> +|     |   |--- name
> >>>>>>>> +|     |   |--- available_instances
> >>>>>>>> +|     |   |--- description /class
> >>>>>>>> +|     |   |--- [devices]
> >>>>>>>> +|     |--- <type id>
> >>>>>>>> +|          |--- create
> >>>>>>>> +|          |--- name
> >>>>>>>> +|          |--- available_instances
> >>>>>>>> +|          |--- description /class
> >>>>>>>> +|          |--- [devices]
> >>>>>>>> +
> >>>>>>>> +[TBD : description or class is yet to be decided. This will change.]      
> >>>>>>>
> >>>>>>> I thought that in previous discussions we had agreed to drop
> >>>>>>> the <type id> concept and use the name as the unique identifier.
> >>>>>>> When reporting these types in libvirt we won't want to report
> >>>>>>> the type id values - we'll want the name strings to be unique.
> >>>>>>>      
> >>>>>>
> >>>>>> The 'name' might not be unique but type_id will be. For example that Neo
> >>>>>> pointed out in earlier discussion, virtual devices can come from two
> >>>>>> different physical devices, end user would be presented with what they
> >>>>>> had selected but there will be internal implementation differences. In
> >>>>>> that case 'type_id' will be unique.
> >>>>>>      
> >>>>>
> >>>>> Hi, Kirti, my understanding is that Neo agreed to use an unique type
> >>>>> string (if you still called it <type id>), and then no need of additional
> >>>>> 'name' field which can be put inside 'description' field. See below quote:
> >>>>>       
> >>>>
> >>>> We had internal discussions about this within NVIDIA and found that
> >>>> 'name' might not be unique where as 'type_id' would be unique. I'm
> >>>> refering to Neo's mail after that, where Neo do pointed that out.
> >>>>
> >>>> https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html    
> >>>
> >>> Everyone not privy to those internal discussions, including me, seems to
> >>> think we dropped type_id and that if a vendor does not have a stable
> >>> name, they can compose some sort of stable type description based on the
> >>> name+id, or even vendor+id, ex. NVIDIA-11.  So please share why we
> >>> haven't managed to kill off type_id yet.  No matter what internal
> >>> representation each vendor driver has of "type_id" it seems possible
> >>> for it to come up with stable string to define a given configuration.    
> >>
> >>
> >> The 'type_id' is unique and the 'name' are not, the name is just a
> >> virtual device name/ human readable name. Because at this moment Intel
> >> can't define a proper GPU class, we have to add a 'description' field
> >> there as well to represent the features of this virtual device, once we
> >> have all agreed with the GPU class and its mandatory attributes, the
> >> 'description' field can be removed. Here is an example,
> >> type_id/type_name = NVIDIA_11,
> >> name=M60-M0Q,
> >> description=2560x1600, 2 displays, 512MB"
> >>
> >> Neo's previous comment only applies to the situation where we will have
> >> the GPU class or optional attributes defined and recognized by libvirt,
> >> since that is not going to happen any time soon, we will have to have
> >> the new 'description' field, and we don't want to have it mixed up with
> >> 'name' field.
> >>
> >> We can definitely have something like name+id as Alex recommended to
> >> remove the 'name' field, but it will just require libvirt to have more
> >> logic to parse that string.  
> > 
> > Let's use the mtty example driver provided in patch 5 so we can all
> > more clearly see how the interfaces work.  I'll start from the
> > beginning of my experience and work my way to the type/name thing.
> >   
> 
> Thanks for looking into it and getting feel of it. And I hope this helps
> to understand that 'name' and 'type_id' are different.
> 
> 
> > (please add a modules_install target to the Makefile)
> >  
> 
> This is an example and I feel it should not be installed in
> /lib/modules/../build path. This should be used to understand the
> interface and the flow of mdev device management life cycle. User can
> use insmod to load driver:
> 
> # insmod mtty.ko

It's not built by default, that's sufficient.  Providing a
modules_install target makes it more accessible for testing and allows
easier testing of module dependencies with modprobe.  insmod does not
exercise the automatic module dependency loading.

> > # modprobe mtty
> > 
> > Now what?  It seems like I need to have prior knowledge that this
> > drivers supports mdev devices and I need to go hunt for them.  We need
> > to create a class (ex. /sys/class/mdev/) where a user can find all the
> > devices that participate in this mediated device infrastructure.  That
> > would point me to /sys/devices/mtty.
> >   
> 
> You can find devices registered to mdev framework by searching for
> 'mdev_supported_types' directory at the leaf nodes of devices in
> /sys/devices directory. Yes, we can have 'mdev' class and links to
> devices which are registered to mdev framework in /sys/class/mdev/.
> 
> 
> > # tree /sys/devices/mtty
> > /sys/devices/mtty
> > |-- mdev_supported_types
> > |   `-- mtty1
> > |       |-- available_instances (1)
> > |       |-- create
> > |       |-- devices
> > |       `-- name ("Dual-port-serial")
> > |-- mtty_dev
> > |   `-- sample_mtty_dev ("This is phy device")
> > |-- power
> > |   |-- async
> > |   |-- autosuspend_delay_ms
> > |   |-- control
> > |   |-- runtime_active_kids
> > |   |-- runtime_active_time
> > |   |-- runtime_enabled
> > |   |-- runtime_status
> > |   |-- runtime_suspended_time
> > |   `-- runtime_usage
> > `-- uevent
> > 
> > Ok, but that was boring, we really need to have at least 2 supported
> > types to validate the interface, so without changing the actual device
> > backing, I pretended to have a single port vs dual port:
> > 
> > /sys/devices/mtty
> > |-- mdev_supported_types
> > |   |-- mtty1
> > |   |   |-- available_instances (24)
> > |   |   |-- create
> > |   |   |-- devices
> > |   |   `-- name (Single-port-serial)
> > |   `-- mtty2
> > |       |-- available_instances (12)
> > |       |-- create
> > |       |-- devices
> > |       `-- name (Dual-port-serial)
> > [snip]
> > 
> > I arbitrarily decides I have 24 ports and each single port uses 1 port
> > and each dual port uses 2 ports.
> > 
> > Before I start creating devices, what are we going to key the libvirt
> > XML on?  Can we do anything to prevent vendors from colliding or do we
> > have any way to suggest meaningful and unique type_ids?   
> 
> Libvirt would have parent and type_id in XML. No two vendors can own
> same parent device. So I don't think vendors would collide even having
> same type_id, since <parent, type_id> pair would always be unique.


We have a goal of supporting migration with mdev devices, Intel has
already shown this is possible.  Tying the XML representation of an
mdev device to a parent device is directly contradictory to that goal.
libvirt needs a token which is unique across vendor to be able to
instantiate an mdev device.  <parent, type_id> is unacceptable.
 
>  Presumably if
> > we had a PCI device hosting this, we would be rooted at that parent
> > device, ex. /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0.  Maybe
> > the type_id should automatically be prefixed by the vendor module name,
> > ex. mtty-1, i915-foo, nvidia-bar.  There's something missing for
> > deterministically creating a "XYZ" device and knowing exactly what that
> > means and finding a parent device that supports it.
> >   
> 
> We can prefix type_id with module name, i.e using dev->driver->name, but
> <parent, type_id> pair is unique so I don't see much benefit in doing
> that.
> 
> 
> > Let's get to mdev creating...
> > 
> > # uuidgen > mdev_supported_types/mtty2/create
> > # tree /sys/devices/mtty
> > /sys/devices/mtty
> > |-- e68189be-700e-41f7-93a3-b5351e79c470
> > |   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
> > |   |-- iommu_group -> ../../../kernel/iommu_groups/63
> > |   |-- mtty2 -> ../mdev_supported_types/mtty2
> > |   |-- power
> > |   |   |-- async
> > |   |   |-- autosuspend_delay_ms
> > |   |   |-- control
> > |   |   |-- runtime_active_kids
> > |   |   |-- runtime_active_time
> > |   |   |-- runtime_enabled
> > |   |   |-- runtime_status
> > |   |   |-- runtime_suspended_time
> > |   |   `-- runtime_usage
> > |   |-- remove
> > |   |-- subsystem -> ../../../bus/mdev
> > |   |-- uevent
> > |   `-- vendor
> > |       `-- sample_mdev_dev ("This is MDEV e68189be-700e-41f7-93a3-b5351e79c470")
> > |-- mdev_supported_types
> > |   |-- mtty1
> > |   |   |-- available_instances (22)
> > |   |   |-- create
> > |   |   |-- devices
> > |   |   `-- name
> > |   `-- mtty2
> > |       |-- available_instances (11)
> > |       |-- create
> > |       |-- devices
> > |       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> ../../../e68189be-700e-41f7-93a3-b5351e79c470
> > |       `-- name
> > 
> > The mdev device was created directly under the parent, which seems like
> > it's going to get messy to me (ie. imagine dropping a bunch of uuids
> > into a PCI parent device's sysfs directory, how does a user know what
> > they are?).
> >   
> 
> That is the way devices are placed in sysfs. For example below devices:
> 
> 80:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
> Root Port 1a (rev 07)
> 80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
> Root Port 2a (rev 07)
> 80:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
> Root Port 3a in PCI Express Mode (rev 07)
> 80:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 0 (rev 07)
> 80:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 1 (rev 07)
> 80:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 2 (rev 07)
> 80:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 3 (rev 07)
> 80:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 4 (rev 07)
> 80:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 5 (rev 07)
> 80:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 6 (rev 07)
> 80:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 7 (rev 07)
> 80:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address
> Map, VTd_Misc, System Management (rev 07)
> 80:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control
> Status and Global Errors (rev 07)
> 80:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)
> 
> In sysfs, those are in same parent folder of its parent root port:
> 
> # ls /sys/devices/pci0000\:80/ -l
> total 0
> drwxr-xr-x 8 root root    0 Oct 13 12:08 0000:80:01.0
> drwxr-xr-x 7 root root    0 Oct 13 12:08 0000:80:02.0
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:03.0
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.0
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.1
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.2
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.3
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.4
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.5
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.6
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.7
> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.0
> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.2
> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.4
> lrwxrwxrwx 1 root root    0 Oct 13 13:25 firmware_node ->
> ../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:01
> drwxr-xr-x 3 root root    0 Oct 13 12:08 pci_bus
> drwxr-xr-x 2 root root    0 Oct 13 12:08 power
> -rw-r--r-- 1 root root 4096 Oct 13 13:25 uevent

I agree, that's also messy, but a PCI device has a standard PCI address
format, so just by looking at those you know it's a child PCI device.
We can write code that understands how to parse that and understands
what it is by the address.  On the other hand we're dropping uuids into
a directory.  We can write code that understands how to parse it, but
how do we know that it's a child device vs some other attribute for the
parent?

> > Under the device we have "mtty2", shouldn't that be
> > "mdev_supported_type", which then links to mtty2?  Otherwise a user
> > needs to decode from the link what this attribute is.
> >   
> 
> I thought it should show type, so that by looking at 'ls' output user
> should be able to find type_id.

The type_id should be shown by actually reading the link, not by the
link name itself, the same way that the iommu_group link for a device
isn't the group number, it links to the group number but uses a
standard link name.

> > Also here's an example of those vendor sysfs entries per device.  So
> > long as the vendor never expects a tool like libvirt to manipulate
> > attributes there, I can see how that could be pretty powerful.
> >   
> 
> Yes, it is good to have vendor specific entries, libvirt might not
> report/use it. That would be more useful for system admin to get extra
> information manually that libvirt doesn't report.
> 
> 
> > Moving down to the mdev_supported_types, I've updated mtty so that it
> > actually adjusts available instance, and we can now see a link under
> > the devices for mtty2.
> > 
> > Also worth noting is that a link for the device appears
> > in /sys/bus/mdev/devices.
> > 
> > BTW, specifying this device for QEMU vfio-pci is where the sysfsdev
> > option comes into play:
> > 
> > -device
> > vfio-pci,sysfsdev=/sys/devices/mtty/e68189be-700e-41f7-93a3-b5351e79c470
> > 
> > Which raises another question, we can tell through the vfio interfaces
> > that this is exposes as a PCI device, by creating a container
> > (open(/dev/vfio/vfio)), setting an iommu (ioctl(VFIO_SET_IOMMU)),
> > adding the group to the container (ioctl(VFIO_GROUP_SET_CONTAINER)),
> > getting the device (ioctl(VFIO_GROUP_GET_DEVICE_FD)), and finally
> > getting the device info (ioctl(VFIO_DEVICE_GET_INFO)) and checking the
> > flag bit that says the API is PCI.  That's a long path to go and has
> > stumbling blocks like the type of iommu that's available for the given
> > platform.  How do we make that manageable?   
> 
> Do you want device type to be expressed in sysfs? Then that should be
> done from vendor driver. vfio_mdev module is now a shim layer, so mdev
> core or vfio_mdev module don't know what device type flag vendor driver
> had set.

Right, the vendor driver would need to expose this, the mdev layers are
device agnostic, they don't know or care which device API is being
exposed.  The other question is whether it needs to be part of the
initial implementation or can we assume pci for now and add something
later.  I guess we already have our proof to the contrary with the
IBM ccw device that libvirt can't simply assume pci.  I see that many
devices in sysfs have a subsystem link, which seems rather appropriate,
but we're not creating a real pci device, so linking to /sys/bus/pci
or /sys/class/pci_bus both seem invalid.  Is that a dead end?  We could
always expose vfio_device_info.flags, but that seems pretty ugly as
well, plus the sysfs mdev interface is not vfio specific.  What if we
had a "device_api" attribute which the vendor driver would show as
"vfio-pci"?  Therefore the mdev interface is not tied to vfio, but we
clearly show that a given type_id clearly exports a vfio-pci
interface.  Thanks,

Alex
Paolo Bonzini Oct. 13, 2016, 4 p.m. UTC | #3
On 13/10/2016 16:36, Alex Williamson wrote:
>> > 
>> > Libvirt would have parent and type_id in XML. No two vendors can own
>> > same parent device. So I don't think vendors would collide even having
>> > same type_id, since <parent, type_id> pair would always be unique.
> 
> We have a goal of supporting migration with mdev devices, Intel has
> already shown this is possible.  Tying the XML representation of an
> mdev device to a parent device is directly contradictory to that goal.
> libvirt needs a token which is unique across vendor to be able to
> instantiate an mdev device.  <parent, type_id> is unacceptable.

Would the vGPU's PCI vendor and device ID be acceptable and unique?

Paolo
Alex Williamson Oct. 13, 2016, 4:30 p.m. UTC | #4
On Thu, 13 Oct 2016 18:00:07 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 13/10/2016 16:36, Alex Williamson wrote:
> >> > 
> >> > Libvirt would have parent and type_id in XML. No two vendors can own
> >> > same parent device. So I don't think vendors would collide even having
> >> > same type_id, since <parent, type_id> pair would always be unique.  
> > 
> > We have a goal of supporting migration with mdev devices, Intel has
> > already shown this is possible.  Tying the XML representation of an
> > mdev device to a parent device is directly contradictory to that goal.
> > libvirt needs a token which is unique across vendor to be able to
> > instantiate an mdev device.  <parent, type_id> is unacceptable.  
> 
> Would the vGPU's PCI vendor and device ID be acceptable and unique?

No, the PCI vendor:device ID doesn't fully describe the device, just
look at a given GeForce configuration on the market today, a single
NVIDIA PCI device ID can have different clocks, different memory sizes,
different cooling profiles, different output ports, etc. configurable
by the card vendor.  An mdev vGPU can be the same.  The type_id needs
to encompass the entire virtual hardware configuration of the device.

Personally I think we should create a type-id the pre-pends the vendor
driver name, giving each vendor a unique namespace, and then let the
vendor driver manage the rest.  For example, an "nvidia-xyz" type_id
should define a unique mdev configuration that may be implemented
across multiple physical hardware SKUs.  libvirt xml (with
managed='yes') would list an "nvidia-xyz" device as required for the VM
and search the host for mdev parent device capable of creating such a
device.  Based on utilization, locality, or whatever parameters it
determines, libvirt would create (or re-use, if a pool) an instance of
that type_id.  Specifying a specific parent might be something we want
to allow to give the user full placement control, but I don't think it
should be required based on the sysfs API we provide to the user.
Thanks,

Alex
Kirti Wankhede Oct. 14, 2016, 3:31 a.m. UTC | #5
On 10/13/2016 8:06 PM, Alex Williamson wrote:
> On Thu, 13 Oct 2016 14:52:09 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 10/13/2016 3:14 AM, Alex Williamson wrote:
>>> On Thu, 13 Oct 2016 00:32:48 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 10/12/2016 9:29 PM, Alex Williamson wrote:  
>>>>> On Wed, 12 Oct 2016 20:43:48 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>     
>>>>>> On 10/12/2016 7:22 AM, Tian, Kevin wrote:    
>>>>>>>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>>>>>>>> Sent: Wednesday, October 12, 2016 4:45 AM      
>>>>>>>>>> +* mdev_supported_types:
>>>>>>>>>> +    List of current supported mediated device types and its details are added
>>>>>>>>>> +in this directory in following format:
>>>>>>>>>> +
>>>>>>>>>> +|- <parent phy device>
>>>>>>>>>> +|--- Vendor-specific-attributes [optional]
>>>>>>>>>> +|--- mdev_supported_types
>>>>>>>>>> +|     |--- <type id>
>>>>>>>>>> +|     |   |--- create
>>>>>>>>>> +|     |   |--- name
>>>>>>>>>> +|     |   |--- available_instances
>>>>>>>>>> +|     |   |--- description /class
>>>>>>>>>> +|     |   |--- [devices]
>>>>>>>>>> +|     |--- <type id>
>>>>>>>>>> +|     |   |--- create
>>>>>>>>>> +|     |   |--- name
>>>>>>>>>> +|     |   |--- available_instances
>>>>>>>>>> +|     |   |--- description /class
>>>>>>>>>> +|     |   |--- [devices]
>>>>>>>>>> +|     |--- <type id>
>>>>>>>>>> +|          |--- create
>>>>>>>>>> +|          |--- name
>>>>>>>>>> +|          |--- available_instances
>>>>>>>>>> +|          |--- description /class
>>>>>>>>>> +|          |--- [devices]
>>>>>>>>>> +
>>>>>>>>>> +[TBD : description or class is yet to be decided. This will change.]      
>>>>>>>>>
>>>>>>>>> I thought that in previous discussions we had agreed to drop
>>>>>>>>> the <type id> concept and use the name as the unique identifier.
>>>>>>>>> When reporting these types in libvirt we won't want to report
>>>>>>>>> the type id values - we'll want the name strings to be unique.
>>>>>>>>>      
>>>>>>>>
>>>>>>>> The 'name' might not be unique but type_id will be. For example that Neo
>>>>>>>> pointed out in earlier discussion, virtual devices can come from two
>>>>>>>> different physical devices, end user would be presented with what they
>>>>>>>> had selected but there will be internal implementation differences. In
>>>>>>>> that case 'type_id' will be unique.
>>>>>>>>      
>>>>>>>
>>>>>>> Hi, Kirti, my understanding is that Neo agreed to use an unique type
>>>>>>> string (if you still called it <type id>), and then no need of additional
>>>>>>> 'name' field which can be put inside 'description' field. See below quote:
>>>>>>>       
>>>>>>
>>>>>> We had internal discussions about this within NVIDIA and found that
>>>>>> 'name' might not be unique where as 'type_id' would be unique. I'm
>>>>>> refering to Neo's mail after that, where Neo do pointed that out.
>>>>>>
>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html    
>>>>>
>>>>> Everyone not privy to those internal discussions, including me, seems to
>>>>> think we dropped type_id and that if a vendor does not have a stable
>>>>> name, they can compose some sort of stable type description based on the
>>>>> name+id, or even vendor+id, ex. NVIDIA-11.  So please share why we
>>>>> haven't managed to kill off type_id yet.  No matter what internal
>>>>> representation each vendor driver has of "type_id" it seems possible
>>>>> for it to come up with stable string to define a given configuration.    
>>>>
>>>>
>>>> The 'type_id' is unique and the 'name' are not, the name is just a
>>>> virtual device name/ human readable name. Because at this moment Intel
>>>> can't define a proper GPU class, we have to add a 'description' field
>>>> there as well to represent the features of this virtual device, once we
>>>> have all agreed with the GPU class and its mandatory attributes, the
>>>> 'description' field can be removed. Here is an example,
>>>> type_id/type_name = NVIDIA_11,
>>>> name=M60-M0Q,
>>>> description=2560x1600, 2 displays, 512MB"
>>>>
>>>> Neo's previous comment only applies to the situation where we will have
>>>> the GPU class or optional attributes defined and recognized by libvirt,
>>>> since that is not going to happen any time soon, we will have to have
>>>> the new 'description' field, and we don't want to have it mixed up with
>>>> 'name' field.
>>>>
>>>> We can definitely have something like name+id as Alex recommended to
>>>> remove the 'name' field, but it will just require libvirt to have more
>>>> logic to parse that string.  
>>>
>>> Let's use the mtty example driver provided in patch 5 so we can all
>>> more clearly see how the interfaces work.  I'll start from the
>>> beginning of my experience and work my way to the type/name thing.
>>>   
>>
>> Thanks for looking into it and getting feel of it. And I hope this helps
>> to understand that 'name' and 'type_id' are different.
>>
>>
>>> (please add a modules_install target to the Makefile)
>>>  
>>
>> This is an example and I feel it should not be installed in
>> /lib/modules/../build path. This should be used to understand the
>> interface and the flow of mdev device management life cycle. User can
>> use insmod to load driver:
>>
>> # insmod mtty.ko
> 
> It's not built by default, that's sufficient.  Providing a
> modules_install target makes it more accessible for testing and allows
> easier testing of module dependencies with modprobe.  insmod does not
> exercise the automatic module dependency loading.
> 
>>> # modprobe mtty
>>>
>>> Now what?  It seems like I need to have prior knowledge that this
>>> drivers supports mdev devices and I need to go hunt for them.  We need
>>> to create a class (ex. /sys/class/mdev/) where a user can find all the
>>> devices that participate in this mediated device infrastructure.  That
>>> would point me to /sys/devices/mtty.
>>>   
>>
>> You can find devices registered to mdev framework by searching for
>> 'mdev_supported_types' directory at the leaf nodes of devices in
>> /sys/devices directory. Yes, we can have 'mdev' class and links to
>> devices which are registered to mdev framework in /sys/class/mdev/.
>>
>>
>>> # tree /sys/devices/mtty
>>> /sys/devices/mtty
>>> |-- mdev_supported_types
>>> |   `-- mtty1
>>> |       |-- available_instances (1)
>>> |       |-- create
>>> |       |-- devices
>>> |       `-- name ("Dual-port-serial")
>>> |-- mtty_dev
>>> |   `-- sample_mtty_dev ("This is phy device")
>>> |-- power
>>> |   |-- async
>>> |   |-- autosuspend_delay_ms
>>> |   |-- control
>>> |   |-- runtime_active_kids
>>> |   |-- runtime_active_time
>>> |   |-- runtime_enabled
>>> |   |-- runtime_status
>>> |   |-- runtime_suspended_time
>>> |   `-- runtime_usage
>>> `-- uevent
>>>
>>> Ok, but that was boring, we really need to have at least 2 supported
>>> types to validate the interface, so without changing the actual device
>>> backing, I pretended to have a single port vs dual port:
>>>
>>> /sys/devices/mtty
>>> |-- mdev_supported_types
>>> |   |-- mtty1
>>> |   |   |-- available_instances (24)
>>> |   |   |-- create
>>> |   |   |-- devices
>>> |   |   `-- name (Single-port-serial)
>>> |   `-- mtty2
>>> |       |-- available_instances (12)
>>> |       |-- create
>>> |       |-- devices
>>> |       `-- name (Dual-port-serial)
>>> [snip]
>>>
>>> I arbitrarily decides I have 24 ports and each single port uses 1 port
>>> and each dual port uses 2 ports.
>>>
>>> Before I start creating devices, what are we going to key the libvirt
>>> XML on?  Can we do anything to prevent vendors from colliding or do we
>>> have any way to suggest meaningful and unique type_ids?   
>>
>> Libvirt would have parent and type_id in XML. No two vendors can own
>> same parent device. So I don't think vendors would collide even having
>> same type_id, since <parent, type_id> pair would always be unique.
> 
> 
> We have a goal of supporting migration with mdev devices, Intel has
> already shown this is possible.  Tying the XML representation of an
> mdev device to a parent device is directly contradictory to that goal.
> libvirt needs a token which is unique across vendor to be able to
> instantiate an mdev device.  <parent, type_id> is unacceptable.
>  

Ok, I'll have dev->driver->name prefix for type_id from mdev core
module. In vendor driver attribute functions, they should also use same
format to identify the type.

>>  Presumably if
>>> we had a PCI device hosting this, we would be rooted at that parent
>>> device, ex. /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0.  Maybe
>>> the type_id should automatically be prefixed by the vendor module name,
>>> ex. mtty-1, i915-foo, nvidia-bar.  There's something missing for
>>> deterministically creating a "XYZ" device and knowdev->driver->nameing exactly what that
>>> means and finding a parent device that supports it.
>>>   
>>
>> We can prefix type_id with module name, i.e using , but
>> <parent, type_id> pair is unique so I don't see much benefit in doing
>> that.
>>
>>
>>> Let's get to mdev creating...
>>>
>>> # uuidgen > mdev_supported_types/mtty2/create
>>> # tree /sys/devices/mtty
>>> /sys/devices/mtty
>>> |-- e68189be-700e-41f7-93a3-b5351e79c470
>>> |   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
>>> |   |-- iommu_group -> ../../../kernel/iommu_groups/63
>>> |   |-- mtty2 -> ../mdev_supported_types/mtty2
>>> |   |-- power
>>> |   |   |-- async
>>> |   |   |-- autosuspend_delay_ms
>>> |   |   |-- control
>>> |   |   |-- runtime_active_kids
>>> |   |   |-- runtime_active_time
>>> |   |   |-- runtime_enabled
>>> |   |   |-- runtime_status
>>> |   |   |-- runtime_suspended_time
>>> |   |   `-- runtime_usage
>>> |   |-- remove
>>> |   |-- subsystem -> ../../../bus/mdev
>>> |   |-- uevent
>>> |   `-- vendor
>>> |       `-- sample_mdev_dev ("This is MDEV e68189be-700e-41f7-93a3-b5351e79c470")
>>> |-- mdev_supported_types
>>> |   |-- mtty1
>>> |   |   |-- available_instances (22)
>>> |   |   |-- create
>>> |   |   |-- devices
>>> |   |   `-- name
>>> |   `-- mtty2
>>> |       |-- available_instances (11)
>>> |       |-- create
>>> |       |-- devices
>>> |       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> ../../../e68189be-700e-41f7-93a3-b5351e79c470
>>> |       `-- name
>>>
>>> The mdev device was created directly under the parent, which seems like
>>> it's going to get messy to me (ie. imagine dropping a bunch of uuids
>>> into a PCI parent device's sysfs directory, how does a user know what
>>> they are?).
>>>   
>>
>> That is the way devices are placed in sysfs. For example below devices:
>>
>> 80:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
>> Root Port 1a (rev 07)
>> 80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
>> Root Port 2a (rev 07)
>> 80:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
>> Root Port 3a in PCI Express Mode (rev 07)
>> 80:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 0 (rev 07)
>> 80:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 1 (rev 07)
>> 80:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 2 (rev 07)
>> 80:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 3 (rev 07)
>> 80:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 4 (rev 07)
>> 80:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 5 (rev 07)
>> 80:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 6 (rev 07)
>> 80:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 7 (rev 07)
>> 80:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address
>> Map, VTd_Misc, System Management (rev 07)
>> 80:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control
>> Status and Global Errors (rev 07)
>> 80:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)
>>
>> In sysfs, those are in same parent folder of its parent root port:
>>
>> # ls /sys/devices/pci0000\:80/ -l
>> total 0
>> drwxr-xr-x 8 root root    0 Oct 13 12:08 0000:80:01.0
>> drwxr-xr-x 7 root root    0 Oct 13 12:08 0000:80:02.0
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:03.0
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.0
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.1
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.2
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.3
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.4
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.5
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.6
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.7
>> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.0
>> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.2
>> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.4
>> lrwxrwxrwx 1 root root    0 Oct 13 13:25 firmware_node ->
>> ../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:01
>> drwxr-xr-x 3 root root    0 Oct 13 12:08 pci_bus
>> drwxr-xr-x 2 root root    0 Oct 13 12:08 power
>> -rw-r--r-- 1 root root 4096 Oct 13 13:25 uevent
> 
> I agree, that's also messy, but a PCI device has a standard PCI address
> format, so just by looking at those you know it's a child PCI device.
> We can write code that understands how to parse that and understands
> what it is by the address.  On the other hand we're dropping uuids into
> a directory.  We can write code that understands how to parse it, but
> how do we know that it's a child device vs some other attribute for the
> parent?
> 


mdev child will have 'mdev_supported_type' link in its directory. That
should be sufficient to identify it's child and not other attribute.


>>> Under the device we have "mtty2", shouldn't that be
>>> "mdev_supported_type", which then links to mtty2?  Otherwise a user
>>> needs to decode from the link what this attribute is.
>>>   
>>
>> I thought it should show type, so that by looking at 'ls' output user
>> should be able to find type_id.
> 
> The type_id should be shown by actually reading the link, not by the
> link name itself, the same way that the iommu_group link for a device
> isn't the group number, it links to the group number but uses a
> standard link name.
> 

Ok. I'll rename the link name to 'mdev_supported_type'

>>> Also here's an example of those vendor sysfs entries per device.  So
>>> long as the vendor never expects a tool like libvirt to manipulate
>>> attributes there, I can see how that could be pretty powerful.
>>>   
>>
>> Yes, it is good to have vendor specific entries, libvirt might not
>> report/use it. That would be more useful for system admin to get extra
>> information manually that libvirt doesn't report.
>>
>>
>>> Moving down to the mdev_supported_types, I've updated mtty so that it
>>> actually adjusts available instance, and we can now see a link under
>>> the devices for mtty2.
>>>
>>> Also worth noting is that a link for the device appears
>>> in /sys/bus/mdev/devices.
>>>
>>> BTW, specifying this device for QEMU vfio-pci is where the sysfsdev
>>> option comes into play:
>>>
>>> -device
>>> vfio-pci,sysfsdev=/sys/devices/mtty/e68189be-700e-41f7-93a3-b5351e79c470
>>>
>>> Which raises another question, we can tell through the vfio interfaces
>>> that this is exposes as a PCI device, by creating a container
>>> (open(/dev/vfio/vfio)), setting an iommu (ioctl(VFIO_SET_IOMMU)),
>>> adding the group to the container (ioctl(VFIO_GROUP_SET_CONTAINER)),
>>> getting the device (ioctl(VFIO_GROUP_GET_DEVICE_FD)), and finally
>>> getting the device info (ioctl(VFIO_DEVICE_GET_INFO)) and checking the
>>> flag bit that says the API is PCI.  That's a long path to go and has
>>> stumbling blocks like the type of iommu that's available for the given
>>> platform.  How do we make that manageable?   
>>
>> Do you want device type to be expressed in sysfs? Then that should be
>> done from vendor driver. vfio_mdev module is now a shim layer, so mdev
>> core or vfio_mdev module don't know what device type flag vendor driver
>> had set.
> 
> Right, the vendor driver would need to expose this, the mdev layers are
> device agnostic, they don't know or care which device API is being
> exposed.  The other question is whether it needs to be part of the
> initial implementation or can we assume pci for now and add something
> later.  I guess we already have our proof to the contrary with the
> IBM ccw device that libvirt can't simply assume pci.  I see that many
> devices in sysfs have a subsystem link, which seems rather appropriate,
> but we're not creating a real pci device, so linking to /sys/bus/pci
> or /sys/class/pci_bus both seem invalid.  Is that a dead end?  We could
> always expose vfio_device_info.flags, but that seems pretty ugly as
> well, plus the sysfs mdev interface is not vfio specific.  What if we
> had a "device_api" attribute which the vendor driver would show as
> "vfio-pci"?  Therefore the mdev interface is not tied to vfio, but we
> clearly show that a given type_id clearly exports a vfio-pci
> interface.  Thanks,
> 

We can't use subsystem for mdev device. Kernel's device core framework
adds subsystem link to mdev device folder as:
 subsystem -> ../../../../../../../bus/mdev

We will have an mandatory attribute in 'supported_type_groups' which
should show "vfio-pci" or "vfio-platform" based on the device flag
vendor driver is going to set for that type.

Thanks,
Kirti
Alex Williamson Oct. 14, 2016, 4:22 a.m. UTC | #6
On Fri, 14 Oct 2016 09:01:01 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/13/2016 8:06 PM, Alex Williamson wrote:
> > On Thu, 13 Oct 2016 14:52:09 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 10/13/2016 3:14 AM, Alex Williamson wrote:  
> >>> Under the device we have "mtty2", shouldn't that be
> >>> "mdev_supported_type", which then links to mtty2?  Otherwise a user
> >>> needs to decode from the link what this attribute is.
> >>>     
> >>
> >> I thought it should show type, so that by looking at 'ls' output user
> >> should be able to find type_id.  
> > 
> > The type_id should be shown by actually reading the link, not by the
> > link name itself, the same way that the iommu_group link for a device
> > isn't the group number, it links to the group number but uses a
> > standard link name.
> >   
> 
> Ok. I'll rename the link name to 'mdev_supported_type'

Hmm, if we have a device, then clearly it's a supported type, we can
probably reduce this to 'mdev_type'.  Sorry for not catching that.

BTW, please include the linux-kernel <linux-kernel@vger.kernel.org>
mailing list on the CC in your next posting.  Thanks,

Alex
diff mbox

Patch

diff --git a/Documentation/vfio-mdev/Makefile b/Documentation/vfio-mdev/Makefile
index ff6f8a3..721daf0 100644
--- a/Documentation/vfio-mdev/Makefile
+++ b/Documentation/vfio-mdev/Makefile
@@ -8,6 +8,9 @@  obj-m:=mtty.o
 default:
 	$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules
 
+modules_install:
+	$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules_install
+
 clean:
 	@rm -rf .*.cmd *.mod.c *.o *.ko .tmp*
 	@rm -rf Module.* Modules.* modules.* .tmp_versions
diff --git a/Documentation/vfio-mdev/mtty.c b/Documentation/vfio-mdev/mtty.c
index 497c90e..05ab40d 100644
--- a/Documentation/vfio-mdev/mtty.c
+++ b/Documentation/vfio-mdev/mtty.c
@@ -141,8 +141,11 @@  struct mdev_state {
 	struct serial_port s[2];
 	struct mutex rxtx_lock;
 	struct vfio_device_info dev_info;
+	int nr_ports;
 };
 
+#define MAX_MTTYS 24
+
 struct mutex mdev_list_lock;
 struct list_head mdev_devices_list;
 
@@ -723,6 +726,11 @@  int mtty_create(struct kobject *kobj, struct mdev_device *mdev)
 	if (mdev_state == NULL)
 		return -ENOMEM;
 
+	if (!strcmp(kobj->name, "mtty1"))
+		mdev_state->nr_ports = 1;
+	else if (!strcmp(kobj->name, "mtty2"))
+		mdev_state->nr_ports = 2;
+
 	mdev_state->irq_index = -1;
 	mdev_state->s[0].max_fifo_size = MAX_FIFO_SIZE;
 	mdev_state->s[1].max_fifo_size = MAX_FIFO_SIZE;
@@ -1224,7 +1232,12 @@  const struct attribute_group *mdev_dev_groups[] = {
 static ssize_t
 name_show(struct kobject *kobj, struct device *dev, char *buf)
 {
-	return sprintf(buf, "Dual-port-serial\n");
+	if (!strcmp(kobj->name, "mtty1"))
+		return sprintf(buf, "Single-port-serial\n");
+	if (!strcmp(kobj->name, "mtty2"))
+		return sprintf(buf, "Dual-port-serial\n");
+
+	return -EINVAL;
 }
 
 MDEV_TYPE_ATTR_RO(name);
@@ -1232,7 +1245,20 @@  MDEV_TYPE_ATTR_RO(name);
 static ssize_t
 available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
 {
-	return sprintf(buf, "1\n");
+	struct mdev_state *mds;
+	int ports, used = 0;
+
+	if (!strcmp(kobj->name, "mtty1"))
+		ports = 1;
+	else if (!strcmp(kobj->name, "mtty2"))
+		ports = 2;
+	else
+		return -EINVAL;
+
+	list_for_each_entry(mds, &mdev_devices_list, next) {
+		used += mds->nr_ports;
+	}
+	return sprintf(buf, "%d\n", (MAX_MTTYS - used)/ports);
 }
 
 MDEV_TYPE_ATTR_RO(available_instances);
@@ -1243,13 +1269,19 @@  static struct attribute *mdev_types_attrs[] = {
 	NULL,
 };
 
-static struct attribute_group mdev_type_group = {
+static struct attribute_group mdev_type_group1 = {
 	.name  = "mtty1",
 	.attrs = mdev_types_attrs,
 };
 
+static struct attribute_group mdev_type_group2 = {
+	.name  = "mtty2",
+	.attrs = mdev_types_attrs,
+};
+
 struct attribute_group *mdev_type_groups[] = {
-	&mdev_type_group,
+	&mdev_type_group1,
+	&mdev_type_group2,
 	NULL,
 };