Message ID | 20191107160448.20962-1-parav@mellanox.com (mailing list archive)
---|---
Series | Mellanox, mlx5 sub function support
On Thu, Nov 07, 2019 at 10:04:48AM -0600, Parav Pandit wrote:
> Hi Dave, Jiri, Alex,
>
> <...>
>
> - View netdevice and (optionally) RDMA device using iproute2 tools
>   $ ip link show
>   $ rdma dev show

You explained clearly how the Ethernet devices will be named, but what about
the RDMA devices? How will they be named? I feel that rdma-core needs to be
extended to support such mediated devices.

Thanks
Hi Leon,

> From: Leon Romanovsky <leon@kernel.org>
> Sent: Thursday, November 7, 2019 11:04 AM
> Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support
>
> On Thu, Nov 07, 2019 at 10:04:48AM -0600, Parav Pandit wrote:
> > Hi Dave, Jiri, Alex,
> >
> > <...>
> >
> > - View netdevice and (optionally) RDMA device using iproute2 tools
> >   $ ip link show
> >   $ rdma dev show
>
> You explained clearly how the Ethernet devices will be named, but what about
> the RDMA devices? How will they be named? I feel that rdma-core needs to be
> extended to support such mediated devices.
>
RDMA devices are named by default using mlx_X.
After your persistent naming patches, I thought we had a GUID-based naming
scheme which doesn't care about the underlying bus, so mdevs would be able to
use the GUID-based naming scheme we already have.

Additionally, if the user prefers the mdev alias, we can extend systemd/udev
to use mdev-alias-based names (like the PCI BDF). Such as:

rocem<alias1>
ibm<alias2>

The format is: <link_layer><m><alias>
'm' stands for mdev device (similar to 'p' for PCI).
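A minimal sketch of the <link_layer><m><alias> format proposed above, for
illustration only: the helper and program below are hypothetical and not part
of rdma-core, systemd or udev.

/*
 * Illustrative only: builds an RDMA device name in the
 * <link_layer><m><alias> format discussed above (e.g. "rocem1fc4",
 * "ibm0ab2").  The helper name and buffer size are made up.
 */
#include <stdio.h>
#include <string.h>

static int build_mdev_rdma_name(char *buf, size_t len,
				const char *link_layer, /* "roce" or "ib" */
				const char *mdev_alias) /* devlink/mdev alias */
{
	/* 'm' marks an mdev-based device, similar to 'p' for PCI names. */
	int n = snprintf(buf, len, "%sm%s", link_layer, mdev_alias);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

int main(void)
{
	char name[64];

	if (!build_mdev_rdma_name(name, sizeof(name), "roce", "1fc4"))
		printf("%s\n", name);	/* prints "rocem1fc4" */
	return 0;
}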
On Thu, 7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
> Mellanox sub function capability allows users to create several hundreds
> of networking and/or rdma devices without depending on PCI SR-IOV support.

You call the new port type "sub function" but the devlink port flavour
is mdev.

As I'm sure you remember you nacked my patches exposing NFP's PCI
sub functions which are just regions of the BAR without any mdev
capability. Am I in the clear to repost those now? Jiri?

> Overview:
> ---------
> Mellanox ConnectX sub functions are exposed to user as a mediated
> device (mdev) [2] as discussed in RFC [3] and further during
> netdevconf0x13 at [4].
>
> mlx5 mediated device (mdev) enables users to create multiple netdevices
> and/or RDMA devices from single PCI function.
>
> Each mdev maps to a mlx5 sub function.
> mlx5 sub function is similar to PCI VF. However it doesn't have its own
> PCI function and MSI-X vectors.
>
> mlx5 mdevs share common PCI resources such as PCI BAR region,
> MSI-X interrupts.
>
> Each mdev has its own window in the PCI BAR region, which is
> accessible only to that mdev and applications using it.
>
> Each mlx5 sub function has its own resource namespace for RDMA resources.
>
> mdevs are supported when eswitch mode of the devlink instance
> is in switchdev mode described in devlink documentation [5].

So presumably the mdevs don't spawn their own devlink instance today,
but once mapped via VIRTIO to a VM they will create one?

It could be useful to specify.

> Network side:
> - By default the netdevice and the rdma device of mlx5 mdev cannot send or
>   receive any packets over the network or to any other mlx5 mdev.

Does this mean the frames don't fall back to the repr by default?
Hi Jakub,

> From: Jakub Kicinski <jakub.kicinski@netronome.com>
> Sent: Thursday, November 7, 2019 2:33 PM
> Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support
>
> On Thu, 7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
> > Mellanox sub function capability allows users to create several
> > hundreds of networking and/or rdma devices without depending on PCI
> > SR-IOV support.
>
> You call the new port type "sub function" but the devlink port flavour is mdev.
>
Sub function is the internal driver structure. The abstract entity at the user
and stack level is mdev; hence the port flavour is mdev.

> As I'm sure you remember you nacked my patches exposing NFP's PCI sub
> functions which are just regions of the BAR without any mdev capability.
> Am I in the clear to repost those now? Jiri?
>
For sure I didn't nack it. :-)
What I remember discussing offline/on the mailing list is
(a) exposing mdev/sub functions as devlink sub-ports is not so good an abstraction,
(b) the user creating/deleting eswitch sub-ports would be hard to fit into the
whole usage model.

> > Overview:
> > ---------
<...>
> > mdevs are supported when eswitch mode of the devlink instance is in
> > switchdev mode described in devlink documentation [5].
>
> So presumably the mdevs don't spawn their own devlink instance today, but
> once mapped via VIRTIO to a VM they will create one?
>
mdev doesn't spawn the devlink instance today when the mdev is created by the
user, just like PCI: when the PCI bus driver enumerates and creates a PCI
device, there isn't a devlink instance for it.

But the mdev's devlink instance is created when the mlx5_core driver binds to
the mdev device (again similar to PCI, where the devlink instance is created
when mlx5_core binds to the PCI device).

I should have put the example in patch-15, which creates/deletes the devlink
instance of the mdev. I will revise the commit log of patch-15 to include that.
Good point.

> It could be useful to specify.
>
Yes, it's certainly useful. I missed putting the example in the commit log of
patch-15.

> > Network side:
> > - By default the netdevice and the rdma device of mlx5 mdev cannot
> >   send or receive any packets over the network or to any other mlx5 mdev.
>
> Does this mean the frames don't fall back to the repr by default?

Probably I wasn't clear. What I wanted to say is that frames transmitted by
the mdev's netdevice and rdma devices don't go to the network; these frames go
to the representor device. The user must configure the representor to
send/receive/steer traffic to the mdev.
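The point above, that the devlink instance appears when mlx5_core binds to the
mdev, could look roughly like the probe sketch below. The driver/ops names and
the mdev plumbing are hypothetical; devlink_alloc(), devlink_register(),
devlink_priv() and the teardown calls are used with the kernel API of that
era, and patch-15 of the series remains the authoritative version.

/*
 * Sketch: create a devlink instance when a driver binds to an mdev,
 * mirroring what is done when mlx5_core binds to a PCI device.
 */
#include <linux/device.h>
#include <net/devlink.h>

struct mlx5_sf_priv {
	struct device *dev;
	/* sub function state would live here */
};

static const struct devlink_ops mlx5_sf_devlink_ops = {
	/* eswitch/port ops for the sub function would go here */
};

static int mlx5_sf_probe(struct device *dev)
{
	struct devlink *devlink;
	struct mlx5_sf_priv *priv;
	int err;

	devlink = devlink_alloc(&mlx5_sf_devlink_ops, sizeof(*priv));
	if (!devlink)
		return -ENOMEM;

	priv = devlink_priv(devlink);
	priv->dev = dev;

	/* The mdev gets its own devlink instance, keyed by its device. */
	err = devlink_register(devlink, dev);
	if (err) {
		devlink_free(devlink);
		return err;
	}

	dev_set_drvdata(dev, devlink);
	return 0;
}

static void mlx5_sf_remove(struct device *dev)
{
	struct devlink *devlink = dev_get_drvdata(dev);

	devlink_unregister(devlink);
	devlink_free(devlink);
}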
From: Parav Pandit <parav@mellanox.com>
Date: Thu, 7 Nov 2019 10:04:48 -0600

> This series adds the support for mlx5 sub function devices using
> mediated device with eswitch switchdev mode.

I think at a minimum there needs to be deeper explanations in the commit
log messages and thus I expect a respin of this series.

Thanks.
On Thu, 7 Nov 2019 20:52:29 +0000, Parav Pandit wrote:
> > On Thu, 7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
> > > Mellanox sub function capability allows users to create several
> > > hundreds of networking and/or rdma devices without depending on PCI
> > > SR-IOV support.
> >
> > You call the new port type "sub function" but the devlink port flavour is mdev.
> >
> Sub function is the internal driver structure. The abstract entity at the
> user and stack level is mdev; hence the port flavour is mdev.

FWIW I agree mdev as the flavour seems like the right choice.

> > As I'm sure you remember you nacked my patches exposing NFP's PCI sub
> > functions which are just regions of the BAR without any mdev capability.
> > Am I in the clear to repost those now? Jiri?
> >
> For sure I didn't nack it. :-)

Well, maybe the word "nack" wasn't exactly used :)

> What I remember discussing offline/on the mailing list is
> (a) exposing mdev/sub functions as devlink sub-ports is not so good an abstraction,
> (b) the user creating/deleting eswitch sub-ports would be hard to fit into
>     the whole usage model.

Okay, so I can repost the "basic" sub functions?

> > > Overview:
> > > ---------
<...>
> > > mdevs are supported when eswitch mode of the devlink instance is in
> > > switchdev mode described in devlink documentation [5].
> >
> > So presumably the mdevs don't spawn their own devlink instance today, but
> > once mapped via VIRTIO to a VM they will create one?
> >
> mdev doesn't spawn the devlink instance today when the mdev is created by
> the user, just like PCI: when the PCI bus driver enumerates and creates a
> PCI device, there isn't a devlink instance for it.
>
> But the mdev's devlink instance is created when the mlx5_core driver binds
> to the mdev device (again similar to PCI, where the devlink instance is
> created when mlx5_core binds to the PCI device).
>
> I should have put the example in patch-15, which creates/deletes the devlink
> instance of the mdev. I will revise the commit log of patch-15 to include
> that. Good point.

Thanks.

> > It could be useful to specify.
> >
> Yes, it's certainly useful. I missed putting the example in the commit log
> of patch-15.
>
> > > Network side:
> > > - By default the netdevice and the rdma device of mlx5 mdev cannot
> > >   send or receive any packets over the network or to any other mlx5 mdev.
> >
> > Does this mean the frames don't fall back to the repr by default?
> Probably I wasn't clear. What I wanted to say is that frames transmitted by
> the mdev's netdevice and rdma devices don't go to the network; these frames
> go to the representor device. The user must configure the representor to
> send/receive/steer traffic to the mdev.
> From: Jakub Kicinski <jakub.kicinski@netronome.com>
> Sent: Thursday, November 7, 2019 7:16 PM
> Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support
>
> On Thu, 7 Nov 2019 20:52:29 +0000, Parav Pandit wrote:
> > > On Thu, 7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
<...>
> > Sub function is the internal driver structure. The abstract entity at the
> > user and stack level is mdev; hence the port flavour is mdev.
>
> FWIW I agree mdev as the flavour seems like the right choice.
>
Ok.

> > > As I'm sure you remember you nacked my patches exposing NFP's PCI
> > > sub functions which are just regions of the BAR without any mdev
> > > capability. Am I in the clear to repost those now? Jiri?
> > >
> > For sure I didn't nack it. :-)
>
> Well, maybe the word "nack" wasn't exactly used :)
>
> > What I remember discussing offline/on the mailing list is
> > (a) exposing mdev/sub functions as devlink sub-ports is not so good an abstraction,
> > (b) the user creating/deleting eswitch sub-ports would be hard to fit into
> >     the whole usage model.
>
> Okay, so I can repost the "basic" sub functions?
>
I think so. Would you like to post on top of this series, as the port flavour
etc. would come by default? Also there is a vfio/mdev dependency in this
series...

<...>
On Fri, 8 Nov 2019 01:49:09 +0000, Parav Pandit wrote:
> > > What I remember discussing offline/on the mailing list is
> > > (a) exposing mdev/sub functions as devlink sub-ports is not so good an abstraction,
> > > (b) the user creating/deleting eswitch sub-ports would be hard to fit
> > >     into the whole usage model.
> >
> > Okay, so I can repost the "basic" sub functions?
> >
> I think so. Would you like to post on top of this series, as the port
> flavour etc. would come by default? Also there is a vfio/mdev dependency in
> this series...

I don't mind the ordering.
On Thu, Nov 07, 2019 at 08:10:45PM +0000, Parav Pandit wrote:
> Hi Leon,
>
> > From: Leon Romanovsky <leon@kernel.org>
<...>
> > On Thu, Nov 07, 2019 at 10:04:48AM -0600, Parav Pandit wrote:
> > > Hi Dave, Jiri, Alex,
> > >
> > > <...>
> > >
> > > - View netdevice and (optionally) RDMA device using iproute2 tools
> > >   $ ip link show
> > >   $ rdma dev show
> >
> > You explained clearly how the Ethernet devices will be named, but what
> > about the RDMA devices? How will they be named? I feel that rdma-core
> > needs to be extended to support such mediated devices.
> >
> RDMA devices are named by default using mlx_X.
> After your persistent naming patches, I thought we had a GUID-based naming
> scheme which doesn't care about the underlying bus.

No, that is not how it is done. RDMA persistent naming is modeled exactly on
ETH naming; it means that we do care about the bus and we don't use the GUID
unless the user explicitly asked for it, exactly like MAC-based names in the
ETH world.

> So mdevs will be able to use the GUID-based naming scheme we already have.

Unfortunately, no.

> Additionally, if the user prefers the mdev alias, we can extend systemd/udev
> to use mdev-alias-based names (like the PCI BDF).

It is not "additionally", but "must".

> Such as:
> rocem<alias1>
> ibm<alias2>
> The format is: <link_layer><m><alias>
> 'm' stands for mdev device (similar to 'p' for PCI).
Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:
>On Thu, 7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
>> Mellanox sub function capability allows users to create several hundreds
>> of networking and/or rdma devices without depending on PCI SR-IOV support.
>
>You call the new port type "sub function" but the devlink port flavour
>is mdev.
>
>As I'm sure you remember you nacked my patches exposing NFP's PCI
>sub functions which are just regions of the BAR without any mdev
>capability. Am I in the clear to repost those now? Jiri?

Well, the question is whether it makes sense to have SFs without having them
as mdev. I mean, we discussed the modelling thoroughly and eventually we
realized that in order to model this correctly, we need SFs on "a bus".
Originally we were thinking about a custom bus, but mdev is already there
to handle this.

Our SFs are also just regions of the BAR, the same thing as you have.

Can't you do the same for nfp SFs?
Then the "mdev" flavour is enough for all.

>
>> Overview:
>> ---------
<...>
>> mdevs are supported when eswitch mode of the devlink instance
>> is in switchdev mode described in devlink documentation [5].
>
>So presumably the mdevs don't spawn their own devlink instance today,
>but once mapped via VIRTIO to a VM they will create one?

I don't think it is needed for anything. Maybe one day, if there is a need to
create a devlink instance for a VF or SF, we can add it. But currently I don't
see the need.

>
>It could be useful to specify.
>
>> Network side:
>> - By default the netdevice and the rdma device of mlx5 mdev cannot send or
>>   receive any packets over the network or to any other mlx5 mdev.
>
>Does this mean the frames don't fall back to the repr by default?

That would be the sane default. If I up the representor, I should see packets
coming in from the SF/VF and I should be able to send packets back.
On Fri, Nov 08, 2019 at 01:12:33PM +0100, Jiri Pirko wrote:
> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:
> >On Thu, 7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
> >> Mellanox sub function capability allows users to create several hundreds
> >> of networking and/or rdma devices without depending on PCI SR-IOV support.
> >
> >You call the new port type "sub function" but the devlink port flavour
> >is mdev.
> >
> >As I'm sure you remember you nacked my patches exposing NFP's PCI
> >sub functions which are just regions of the BAR without any mdev
> >capability. Am I in the clear to repost those now? Jiri?
>
> Well, the question is whether it makes sense to have SFs without having them
> as mdev. I mean, we discussed the modelling thoroughly and eventually we
> realized that in order to model this correctly, we need SFs on "a bus".
> Originally we were thinking about a custom bus, but mdev is already there
> to handle this.

Did anyone consult Greg on this?

The new Intel driver has been having a very similar discussion about how to
model their 'multi function device', i.e. to bind RDMA and other drivers to a
shared PCI function, and I think that discussion settled on adding a new bus?

Really these things are all very similar; it would be nice to have a clear
methodology on how to use the device core if a single PCI device is split by
software into multiple different functional units and attached to different
driver instances.

Currently there is a lot of hacking in this area.. And a consistent scheme
might resolve the ugliness with the dma_ops wrappers.

We already have the 'mfd' stuff to support splitting platform devices; maybe
we need to create a 'pci-mfd' to support splitting PCI devices?

I'm not really clear how mfd and mdev relate, I always thought mdev was
strongly linked to vfio.

At the very least, if it is agreed mdev should be the vehicle here, then it
should also be able to solve the netdev/rdma hookup problem too.

Jason
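For reference, the existing 'mfd' mechanism mentioned above splits a parent
device into child platform devices roughly as sketched below. The "foo-netdev"
and "foo-rdma" cell names are made up; mfd_add_devices() and struct mfd_cell
are the existing MFD core API.

/*
 * Sketch: a parent driver registers MFD cells; each cell becomes a child
 * platform device that a separate driver ("foo-netdev", "foo-rdma") binds to.
 */
#include <linux/mfd/core.h>
#include <linux/platform_device.h>

static const struct mfd_cell foo_cells[] = {
	{ .name = "foo-netdev" },	/* bound by a platform driver named "foo-netdev" */
	{ .name = "foo-rdma" },		/* bound by a platform driver named "foo-rdma" */
};

static int foo_split_into_functions(struct device *parent)
{
	/* PLATFORM_DEVID_AUTO lets the core pick unique ids for the children. */
	return mfd_add_devices(parent, PLATFORM_DEVID_AUTO,
			       foo_cells, ARRAY_SIZE(foo_cells),
			       NULL, 0, NULL);
}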
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Friday, November 8, 2019 12:20 AM
> Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support
>
> On Thu, Nov 07, 2019 at 08:10:45PM +0000, Parav Pandit wrote:
<...>
> > RDMA devices are named by default using mlx_X.
> > After your persistent naming patches, I thought we had a GUID-based naming
> > scheme which doesn't care about the underlying bus.
>
> No, that is not how it is done. RDMA persistent naming is modeled exactly on
> ETH naming; it means that we do care about the bus and we don't use the GUID
> unless the user explicitly asked for it, exactly like MAC-based names in the
> ETH world.
>
> > So mdevs will be able to use the GUID-based naming scheme we already have.
>
> Unfortunately, no.
>
> > Additionally, if the user prefers the mdev alias, we can extend
> > systemd/udev to use mdev-alias-based names (like the PCI BDF).
>
> It is not "additionally", but "must".
>
Ok. So I will post this series with a similar naming scheme and update the
cover letter to describe the rdma naming too.

> > Such as:
> > rocem<alias1>
> > ibm<alias2>
> > The format is: <link_layer><m><alias>
> > 'm' stands for mdev device (similar to 'p' for PCI).
Hi Jason,

+ Greg

> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Friday, November 8, 2019 8:41 AM
> Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support
>
> On Fri, Nov 08, 2019 at 01:12:33PM +0100, Jiri Pirko wrote:
> > Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:
<...>
> > Well, the question is whether it makes sense to have SFs without having
> > them as mdev. I mean, we discussed the modelling thoroughly and eventually
> > we realized that in order to model this correctly, we need SFs on "a bus".
> > Originally we were thinking about a custom bus, but mdev is already there
> > to handle this.
>
> Did anyone consult Greg on this?
>
Back when I started with the subdev bus in March, we consulted Greg and the
mdev maintainers. After that we settled on extending mdev for the wider use
case; more below. It is being extended for multiple users, for example for
virtio too, in addition to vfio and mlx5_core.

> The new Intel driver has been having a very similar discussion about how to
> model their 'multi function device', i.e. to bind RDMA and other drivers to a
> shared PCI function, and I think that discussion settled on adding a new bus?
>
> Really these things are all very similar; it would be nice to have a clear
> methodology on how to use the device core if a single PCI device is split by
> software into multiple different functional units and attached to different
> driver instances.
>
> Currently there is a lot of hacking in this area.. And a consistent scheme
> might resolve the ugliness with the dma_ops wrappers.
>
> We already have the 'mfd' stuff to support splitting platform devices; maybe
> we need to create a 'pci-mfd' to support splitting PCI devices?
>
> I'm not really clear how mfd and mdev relate, I always thought mdev was
> strongly linked to vfio.
>
Mdev at the beginning was strongly linked to vfio, but as I mentioned above it
is addressing more use cases.

I observed that discussion, but was not sure of extending mdev further.

One way for the Intel drivers to do it is after series [9]:
the PCI driver says MDEV_CLASS_ID_I40_FOO, and the RDMA driver calls
mdev_register_driver(), matches on it and does the probe().

> At the very least, if it is agreed mdev should be the vehicle here, then it
> should also be able to solve the netdev/rdma hookup problem too.
>
> Jason

[9] https://patchwork.ozlabs.org/patch/1190425
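A rough sketch of the matching flow described above, under the assumption that
the class-id/id_table matching proposed in series [9] is applied.
MDEV_CLASS_ID_I40_FOO, the id_table field and the struct layout come from that
proposal or are illustrative; only mdev_register_driver()/mdev_unregister_driver()
are the existing mdev API of that time.

/*
 * Sketch: an RDMA driver that binds to mdevs the PCI driver created with
 * class id MDEV_CLASS_ID_I40_FOO (names per the proposal in [9]).
 */
#include <linux/module.h>
#include <linux/mdev.h>

static int i40_foo_rdma_probe(struct device *dev)
{
	/* Set up the IB device on top of the mdev here. */
	return 0;
}

static void i40_foo_rdma_remove(struct device *dev)
{
	/* Tear down the IB device. */
}

static const struct mdev_class_id i40_foo_rdma_ids[] = {
	{ MDEV_CLASS_ID_I40_FOO },
	{},
};

static struct mdev_driver i40_foo_rdma_driver = {
	.name	  = "i40_foo_rdma",
	.probe	  = i40_foo_rdma_probe,
	.remove	  = i40_foo_rdma_remove,
	.id_table = i40_foo_rdma_ids,	/* added by series [9] */
};

static int __init i40_foo_rdma_init(void)
{
	return mdev_register_driver(&i40_foo_rdma_driver, THIS_MODULE);
}
module_init(i40_foo_rdma_init);

static void __exit i40_foo_rdma_exit(void)
{
	mdev_unregister_driver(&i40_foo_rdma_driver);
}
module_exit(i40_foo_rdma_exit);

MODULE_LICENSE("GPL");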
> From: Jiri Pirko <jiri@resnulli.us>
> Sent: Friday, November 8, 2019 6:13 AM
> Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support
>
> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:
<...>
> >> Network side:
> >> - By default the netdevice and the rdma device of mlx5 mdev cannot
> >>   send or receive any packets over the network or to any other mlx5 mdev.
> >
> >Does this mean the frames don't fall back to the repr by default?
>
> That would be the sane default. If I up the representor, I should see
> packets coming in from the SF/VF and I should be able to send packets back.

It is similar to a VF. I clarified this in the previous email. Will update the
cover letter to improve the description.
On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote:
> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:
> >On Thu, 7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
> >> Mellanox sub function capability allows users to create several hundreds
> >> of networking and/or rdma devices without depending on PCI SR-IOV support.
> >
> >You call the new port type "sub function" but the devlink port flavour
> >is mdev.
> >
> >As I'm sure you remember you nacked my patches exposing NFP's PCI
> >sub functions which are just regions of the BAR without any mdev
> >capability. Am I in the clear to repost those now? Jiri?
>
> Well, the question is whether it makes sense to have SFs without having them
> as mdev. I mean, we discussed the modelling thoroughly and eventually we
> realized that in order to model this correctly, we need SFs on "a bus".
> Originally we were thinking about a custom bus, but mdev is already there
> to handle this.

But the "main/real" port is not a mdev in your case. NFP is like mlx4.
It has one PCI PF for multiple ports.

> Our SFs are also just regions of the BAR, the same thing as you have.
>
> Can't you do the same for nfp SFs?
> Then the "mdev" flavour is enough for all.

Absolutely not.

Why not make the main device of mlx5 a mdev, too, if that's acceptable?
There's (a) long precedence for multiple ports on one PCI PF in networking
devices, and (b) plenty of deployed software which depends on the main
devices hanging off the PCI PF directly.

The point of mdevs is being able to assign them to VFs or run DPDK on them
(map to user space).

For normal devices the existing sysfs hierarchy, where one device has multiple
children of a certain class, without a bus and a separate driver, is perfectly
fine. Do you think we should also slice all serial chips into mdevs if they
have multiple lines?

Exactly as I predicted, much confusion about what's being achieved here,
heh :)
On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:
> > The new Intel driver has been having a very similar discussion about how
> > to model their 'multi function device', i.e. to bind RDMA and other
> > drivers to a shared PCI function, and I think that discussion settled on
> > adding a new bus?
> >
> > Really these things are all very similar; it would be nice to have a clear
> > methodology on how to use the device core if a single PCI device is split
> > by software into multiple different functional units and attached to
> > different driver instances.
> >
> > Currently there is a lot of hacking in this area.. And a consistent scheme
> > might resolve the ugliness with the dma_ops wrappers.
> >
> > We already have the 'mfd' stuff to support splitting platform devices;
> > maybe we need to create a 'pci-mfd' to support splitting PCI devices?
> >
> > I'm not really clear how mfd and mdev relate, I always thought mdev was
> > strongly linked to vfio.
> >
> Mdev at the beginning was strongly linked to vfio, but as I mentioned above
> it is addressing more use cases.
>
> I observed that discussion, but was not sure of extending mdev further.
>
> One way for the Intel drivers to do it is after series [9]:
> the PCI driver says MDEV_CLASS_ID_I40_FOO, and the RDMA driver calls
> mdev_register_driver(), matches on it and does the probe().

Yup. FWIW to me the benefit of reusing mdevs for the Intel case vs muddying
the purpose of mdevs is not a clear trade-off.

IMHO MFD should be of more natural use for Intel, since it's about providing
different functionality rather than virtual slices of the same device.

> > At the very least, if it is agreed mdev should be the vehicle here, then
> > it should also be able to solve the netdev/rdma hookup problem too.
> From: Jakub Kicinski <jakub.kicinski@netronome.com>
>
> On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote:
> > Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:
<...>
> > Well, the question is whether it makes sense to have SFs without having
> > them as mdev. I mean, we discussed the modelling thoroughly and eventually
> > we realized that in order to model this correctly, we need SFs on "a bus".
> > Originally we were thinking about a custom bus, but mdev is already there
> > to handle this.
>
> But the "main/real" port is not a mdev in your case. NFP is like mlx4.
> It has one PCI PF for multiple ports.
>
> > Our SFs are also just regions of the BAR, the same thing as you have.
> >
> > Can't you do the same for nfp SFs?
> > Then the "mdev" flavour is enough for all.
>
> Absolutely not.
>
Please explain what is missing in mdev.
And frankly it is too late for such a comment.
I sent out RFC patches back in March 2019,
further discussed it at netdevconf0x13,
further discussed it several times while we introduced devlink ports,
further discussed it in August with the mdev alias series and the
phys_port_name formation for mdev.
This series shouldn't be any surprise for you at all.
Anyways.

> Why not make the main device of mlx5 a mdev, too, if that's acceptable?
> There's (a) long precedence for multiple ports on one PCI PF in networking
> devices, and (b) plenty of deployed software which depends on the main
> devices hanging off the PCI PF directly.
>
> The point of mdevs is being able to assign them to VFs or run DPDK on them
> (map to user space).
>
That can be one use case. That is not the only use case.
I clearly explained the use-case scenarios in the motivation section of the
cover letter. Please go through it again.
This series is certainly not targeting the DPDK use case right now.

Also please read the design decisions section of the cover letter...

> For normal devices the existing sysfs hierarchy, where one device has
> multiple children of a certain class, without a bus and a separate driver,
> is perfectly fine. Do you think we should also slice all serial chips into
> mdevs if they have multiple lines?
>
> Exactly as I predicted, much confusion about what's being achieved here,
> heh :)

We don't see confusion here. Please be specific on the point that confuses
you. A PCI device is sliced into multiple devices to make use of it in
different ways. Serial chips are not a good example of it.
This is fitting in with
- the overall kernel bus-driver model,
- suitability for multiple types of devices (netdev, rdma and their different link layers),
- the devlink framework's bus/device notion for health reporters, ports, resources,
- mapping to the VM/bare-metal/container use case,
- and more.
I don't want to further repeat the cover letter here... let's talk about
specific points to improve upon.
Fri, Nov 08, 2019 at 08:06:40PM CET, jakub.kicinski@netronome.com wrote:
>On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote:
>> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:
<...>
>> Well, the question is whether it makes sense to have SFs without having
>> them as mdev. I mean, we discussed the modelling thoroughly and eventually
>> we realized that in order to model this correctly, we need SFs on "a bus".
>> Originally we were thinking about a custom bus, but mdev is already there
>> to handle this.
>
>But the "main/real" port is not a mdev in your case. NFP is like mlx4.
>It has one PCI PF for multiple ports.

I don't see how the number of PFs-vs-uplink_ports is relevant.

>
>> Our SFs are also just regions of the BAR, the same thing as you have.
>>
>> Can't you do the same for nfp SFs?
>> Then the "mdev" flavour is enough for all.
>
>Absolutely not.
>
>Why not make the main device of mlx5 a mdev, too, if that's acceptable?
>There's (a) long precedence for multiple ports on one PCI PF in networking
>devices, and (b) plenty of deployed software which depends on the main
>devices hanging off the PCI PF directly.
>
>The point of mdevs is being able to assign them to VFs or run DPDK on
>them (map to user space).
>
>For normal devices the existing sysfs hierarchy, where one device has
>multiple children of a certain class, without a bus and a separate
>driver, is perfectly fine. Do you think we should also slice all serial
>chips into mdevs if they have multiple lines?
>
>Exactly as I predicted, much confusion about what's being achieved here,
>heh :)

Please let me understand how your device is different.
Originally Parav didn't want to have mlx5 subfunctions as mdev. He wanted to
have them tied to the same PCI device as the PF, no different from what you
describe you want. However, while we thought about how to fit things in, how
to handle the phys_port_name, how to see things in sysfs, we came up with the
idea of a dedicated bus. We took it upstream and people suggested using the
mdev bus for this.

Parav, please correct me if I'm wrong, but I don't think there is a plan to
push SFs into a VM or to userspace as Jakub expects, right?
On Fri, 8 Nov 2019 19:34:07 +0000, Parav Pandit wrote:
> > From: Jakub Kicinski <jakub.kicinski@netronome.com>
> >
> > On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote:
<...>
> > > Can't you do the same for nfp SFs?
> > > Then the "mdev" flavour is enough for all.
> >
> > Absolutely not.
> >
> Please explain what is missing in mdev.
> And frankly it is too late for such a comment.
<...>
> I don't want to further repeat the cover letter here... let's talk about
> specific points to improve upon.

Parav, I think you completely missed the point of my email.

I was replying to Jiri saying that mdev does not fit the nfp/mlx4 multi-port
devices, and we still need a normal "PCI split" or something along those lines
to have multiple PF ports on one PF.

I'm not talking about your use of mdev at all.

The discussion on patch 12 also seems to indicate you don't read the emails
you reply to...
On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
> On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:
> > > The new Intel driver has been having a very similar discussion about how
> > > to model their 'multi function device', i.e. to bind RDMA and other
> > > drivers to a shared PCI function, and I think that discussion settled on
> > > adding a new bus?
> > >
<...>
> > > I'm not really clear how mfd and mdev relate, I always thought mdev was
> > > strongly linked to vfio.
> > >
> > Mdev at the beginning was strongly linked to vfio, but as I mentioned
> > above it is addressing more use cases.
> >
> > I observed that discussion, but was not sure of extending mdev further.
> >
> > One way for the Intel drivers to do it is after series [9]:
> > the PCI driver says MDEV_CLASS_ID_I40_FOO, and the RDMA driver calls
> > mdev_register_driver(), matches on it and does the probe().
>
> Yup. FWIW to me the benefit of reusing mdevs for the Intel case vs muddying
> the purpose of mdevs is not a clear trade-off.

IMHO, mdev has a mdev_parent_ops structure clearly intended to link it to
vfio, so using a mdev for something not related to vfio seems like a poor
choice.

I suppose this series is the start and we will eventually see the mlx5's
mdev_parent_ops filled in to support vfio - but *right now* this looks
identical to the problem most of the RDMA-capable net drivers have splitting
into a 'core' and a 'function'.

> IMHO MFD should be of more natural use for Intel, since it's about
> providing different functionality rather than virtual slices of the
> same device.

I don't think the 'different functionality' should matter much.

Generally these multi-function drivers are built on some common 'core'
language like queues, interrupts, BAR space, etc., and then these common
things can be specialized into netdev, rdma, scsi, etc. So we see a general
rough design with a core layer managing the raw HW and then drivers on top of
that (including netdev) using that API.

The actual layering doesn't come through in the driver model; generally people
put all the core stuff in with the netdev and then try and shuffle the netdev
around as the 'handle' for that core API.

These SFs are pretty similar in that the core physical driver continues to
provide some software API support to the SF children (at least for mlx it is a
small API).

For instance mdev has no generic way to learn the BAR struct resources, so
there is some extra API around the side that does this - in this series it is
done by hackily co-opting the drvdata to something owned by the struct device
instead of the device_driver and using that to access the API surface on
'struct mlx5_sf *', which includes the BAR info and so forth.

This is probably the main difference from MFD. At least the few drivers I
looked at did not try to expose a SW API from the 'core' to the 'part';
everything was the usual generic driver resource stuff.

Jason
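A minimal sketch of the drvdata hand-off pattern described above, with
hypothetical struct and function names; only dev_set_drvdata() and
dev_get_drvdata() are the real driver-core helpers, and this is not the actual
mlx5 layout.

/*
 * Sketch: the parent ("core") driver stashes a software API object in the
 * child device's drvdata before registering it, and the function driver
 * fetches it in probe() to reach the BAR window and other core services.
 */
#include <linux/device.h>

struct sf_core_api {
	void __iomem *bar_base;		/* the SF's window in the parent BAR */
	size_t bar_len;
	int (*enable)(struct sf_core_api *api);
};

/* Parent/core side: called before device_add() on the child device. */
static void core_attach_api(struct device *child, struct sf_core_api *api)
{
	/* Here drvdata is owned by the device creator, not the bound driver. */
	dev_set_drvdata(child, api);
}

/* Function-driver side: probe() recovers the API surface from drvdata. */
static int sf_function_probe(struct device *dev)
{
	struct sf_core_api *api = dev_get_drvdata(dev);

	if (!api || !api->bar_base)
		return -ENODEV;

	return api->enable(api);
}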
> From: Jason Gunthorpe <jgg@ziepe.ca>
>
> On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
> > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:
<...>
> > Yup. FWIW to me the benefit of reusing mdevs for the Intel case vs
> > muddying the purpose of mdevs is not a clear trade-off.
>
> IMHO, mdev has a mdev_parent_ops structure clearly intended to link it to
> vfio, so using a mdev for something not related to vfio seems like a poor
> choice.
>
Splitting mdev_parent_ops{} is already in the works for the larger use case in
series [1] for virtio.

[1] https://patchwork.kernel.org/patch/11233127/
On Fri, Nov 08, 2019 at 08:20:43PM +0000, Parav Pandit wrote:
> > From: Jason Gunthorpe <jgg@ziepe.ca>
> > On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
<...>
> > IMHO, mdev has a mdev_parent_ops structure clearly intended to link it to
> > vfio, so using a mdev for something not related to vfio seems like a poor
> > choice.
> >
> Splitting mdev_parent_ops{} is already in the works for the larger use case
> in series [1] for virtio.
>
> [1] https://patchwork.kernel.org/patch/11233127/

Weird. So what is mdev actually providing and what does it represent if the
entire driver-facing API surface is under a union?

This smells a lot like it is re-implementing a bus.. AFAIK a bus is supposed
to represent the in-kernel API the struct device presents to drivers.

Jason
On Fri, 8 Nov 2019 16:12:53 -0400
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
> > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:
<...>
> > Yup. FWIW to me the benefit of reusing mdevs for the Intel case vs
> > muddying the purpose of mdevs is not a clear trade-off.
>
> IMHO, mdev has a mdev_parent_ops structure clearly intended to link it
> to vfio, so using a mdev for something not related to vfio seems like
> a poor choice.

Unless there's some opposition, I intend to queue this for v5.5:

https://www.spinics.net/lists/kvm/msg199613.html

mdev started out as tied to vfio, but at its core it's just a device
life-cycle infrastructure with callbacks between bus drivers and vendor
devices. If virtio is on the wrong path with the above series, please speak
up. Thanks,

Alex

> I suppose this series is the start and we will eventually see the
> mlx5's mdev_parent_ops filled in to support vfio - but *right now*
> this looks identical to the problem most of the RDMA-capable net
> drivers have splitting into a 'core' and a 'function'.
>
<...>
> -----Original Message----- > From: Jiri Pirko <jiri@resnulli.us> > > Fri, Nov 08, 2019 at 08:06:40PM CET, jakub.kicinski@netronome.com wrote: > >On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote: > >> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com > wrote: > >> >On Thu, 7 Nov 2019 10:04:48 -0600, Parav Pandit wrote: > >> >> Mellanox sub function capability allows users to create several > >> >> hundreds of networking and/or rdma devices without depending on PCI > SR-IOV support. > >> > > >> >You call the new port type "sub function" but the devlink port > >> >flavour is mdev. > >> > > >> >As I'm sure you remember you nacked my patches exposing NFP's PCI > >> >sub functions which are just regions of the BAR without any mdev > >> >capability. Am I in the clear to repost those now? Jiri? > >> > >> Well question is, if it makes sense to have SFs without having them > >> as mdev? I mean, we discussed the modelling thoroughtly and > >> eventually we realized that in order to model this correctly, we need SFs on > "a bus". > >> Originally we were thinking about custom bus, but mdev is already > >> there to handle this. > > > >But the "main/real" port is not a mdev in your case. NFP is like mlx4. > >It has one PCI PF for multiple ports. > > I don't see how relevant the number of PFs-vs-uplink_ports is. > > > > > >> Our SFs are also just regions of the BAR, same thing as you have. > >> > >> Can't you do the same for nfp SFs? > >> Then the "mdev" flavour is enough for all. > > > >Absolutely not. > > > >Why not make the main device of mlx5 a mdev, too, if that's acceptable. > >There's (a) long precedence for multiple ports on one PCI PF in > >networking devices, (b) plenty deployed software which depend on the > >main devices hanging off the PCI PF directly. > > > >The point of mdevs is being able to sign them to VFs or run DPDK on > >them (map to user space). > > > >For normal devices existing sysfs hierarchy were one device has > >multiple children of a certain class, without a bus and a separate > >driver is perfectly fine. Do you think we should also slice all serial > >chips into mdevs if they have multiple lines. > > > >Exactly as I predicted much confusion about what's being achieved here, > >heh :) > > Please let me understand how your device is different. > Originally Parav didn't want to have mlx5 subfunctions as mdev. He wanted to > have them tight to the same pci device as the pf. No difference from what you > describe you want. > However while we thought about how to fit things in, how to > handle na phys_port_name, how to see things in sysfs we came up with an idea > of a dedicated bus. We took it upstream and people suggested to use mdev bus > for this. > You are right. We considered multiple ports approach, followed by subdevices and mfd. Around that time mdev was being proposed that can address current and future VM/userspace usecases using one way to lifecycle the devices. > Parav, please correct me if I'm wrong but I don't think where is a plan to push > SFs into VM or to userspace as Jakub expects, right? With this series - certainly not. In future, if mdev to be used by via vfio/VM framework, why should we prevent it (ofcourse after implementing necessary isolation method)?
On Fri, Nov 08, 2019 at 04:32:09PM -0400, Jason Gunthorpe wrote: > On Fri, Nov 08, 2019 at 08:20:43PM +0000, Parav Pandit wrote: > > > > > > > From: Jason Gunthorpe <jgg@ziepe.ca> > > > On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote: > > > > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote: > > > > > > The new intel driver has been having a very similar discussion > > > > > > about how to model their 'multi function device' ie to bind RDMA > > > > > > and other drivers to a shared PCI function, and I think that discussion > > > settled on adding a new bus? > > > > > > > > > > > > Really these things are all very similar, it would be nice to have > > > > > > a clear methodology on how to use the device core if a single PCI > > > > > > device is split by software into multiple different functional > > > > > > units and attached to different driver instances. > > > > > > > > > > > > Currently there is alot of hacking in this area.. And a consistent > > > > > > scheme might resolve the ugliness with the dma_ops wrappers. > > > > > > > > > > > > We already have the 'mfd' stuff to support splitting platform > > > > > > devices, maybe we need to create a 'pci-mfd' to support splitting PCI > > > devices? > > > > > > > > > > > > I'm not really clear how mfd and mdev relate, I always thought > > > > > > mdev was strongly linked to vfio. > > > > > > > > > > > > > > > > Mdev at beginning was strongly linked to vfio, but as I mentioned > > > > > above it is addressing more use case. > > > > > > > > > > I observed that discussion, but was not sure of extending mdev further. > > > > > > > > > > One way to do for Intel drivers to do is after series [9]. > > > > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO RDMA driver > > > > > mdev_register_driver(), matches on it and does the probe(). > > > > > > > > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs > > > > muddying the purpose of mdevs is not a clear trade off. > > > > > > IMHO, mdev has amdev_parent_ops structure clearly intended to link it to vfio, > > > so using a mdev for something not related to vfio seems like a poor choice. > > > > > Splitting mdev_parent_ops{} is already in works for larger use case in series [1] for virtio. > > > > [1] https://patchwork.kernel.org/patch/11233127/ > > Weird. So what is mdev actually providing and what does it represent > if the entire driver facing API surface is under a union? > > This smells a lot like it is re-implementing a bus.. AFAIK bus is > supposed to represent the in-kernel API the struct device presents to > drivers. Yes, yes yes yes... I'm getting tired of saying the same thing here, just use a bus, that's what it is there for. greg k-h
On Fri, Nov 08, 2019 at 01:34:35PM -0700, Alex Williamson wrote: > On Fri, 8 Nov 2019 16:12:53 -0400 > Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote: > > > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote: > > > > > The new intel driver has been having a very similar discussion about how to > > > > > model their 'multi function device' ie to bind RDMA and other drivers to a > > > > > shared PCI function, and I think that discussion settled on adding a new bus? > > > > > > > > > > Really these things are all very similar, it would be nice to have a clear > > > > > methodology on how to use the device core if a single PCI device is split by > > > > > software into multiple different functional units and attached to different > > > > > driver instances. > > > > > > > > > > Currently there is alot of hacking in this area.. And a consistent scheme > > > > > might resolve the ugliness with the dma_ops wrappers. > > > > > > > > > > We already have the 'mfd' stuff to support splitting platform devices, maybe > > > > > we need to create a 'pci-mfd' to support splitting PCI devices? > > > > > > > > > > I'm not really clear how mfd and mdev relate, I always thought mdev was > > > > > strongly linked to vfio. > > > > > > > > > > > > > Mdev at beginning was strongly linked to vfio, but as I mentioned > > > > above it is addressing more use case. > > > > > > > > I observed that discussion, but was not sure of extending mdev further. > > > > > > > > One way to do for Intel drivers to do is after series [9]. > > > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO > > > > RDMA driver mdev_register_driver(), matches on it and does the probe(). > > > > > > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs > > > muddying the purpose of mdevs is not a clear trade off. > > > > IMHO, mdev has amdev_parent_ops structure clearly intended to link it > > to vfio, so using a mdev for something not related to vfio seems like > > a poor choice. > > Unless there's some opposition, I'm intended to queue this for v5.5: > > https://www.spinics.net/lists/kvm/msg199613.html > > mdev has started out as tied to vfio, but at it's core, it's just a > device life cycle infrastructure with callbacks between bus drivers > and vendor devices. If virtio is on the wrong path with the above > series, please speak up. Thanks, Well, I think Greg just objected pretty strongly. IMHO it is wrong to turn mdev into some API multiplexor. That is what the driver core already does and AFAIK your bus type is supposed to represent your API contract to your drivers. Since the bus type is ABI, 'mdev' is really all about vfio I guess? Maybe mdev should grow by factoring the special GUID life cycle stuff into a helper library that can make it simpler to build proper API specific bus's using that lifecycle model? ie the virtio I saw proposed should probably be a mdev-virtio bus type providing this new virtio API contract using a 'struct mdev_virtio'? I only looked briefly but mdev seems like an unusual way to use the driver core. *generally* I would expect that if a driver wants to provide a foo_device (on a foo bus, providing the foo API contract) it looks very broadly like: struct foo_device { struct device dev; const struct foo_ops *ops; }; struct my_foo_device { struct foo_device fdev; }; foo_device_register(&mydev->fdev); Which means we can use normal container_of() patterns, while mdev seems to want to allocate all the structs internally.. 
I guess this is because of how the lifecycle stuff works? From a device core view it looks quite unnatural. Jason
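To make the pattern concrete, here is a minimal sketch of the container_of() approach described above. The foo_* names and foo_device_register() are illustrative placeholders (they are not an existing kernel API), and the vendor-side structure is purely hypothetical:

    #include <linux/device.h>
    #include <linux/slab.h>

    struct foo_device;

    struct foo_ops {
            int (*start)(struct foo_device *fdev);
    };

    /* Generic device type provided by the hypothetical foo bus. */
    struct foo_device {
            struct device dev;
            const struct foo_ops *ops;
    };

    /* Assumed to be exported by the hypothetical foo bus core. */
    int foo_device_register(struct foo_device *fdev);

    /* The vendor driver embeds foo_device and recovers its own structure
     * with container_of() instead of stashing pointers in drvdata.
     */
    struct my_foo_device {
            struct foo_device fdev;
            void __iomem *bar;      /* vendor-private state */
    };

    static int my_foo_start(struct foo_device *fdev)
    {
            struct my_foo_device *my =
                    container_of(fdev, struct my_foo_device, fdev);

            /* program my->bar ... */
            return 0;
    }

    static const struct foo_ops my_foo_ops = {
            .start = my_foo_start,
    };

    static int my_foo_create(struct device *parent)
    {
            struct my_foo_device *my;

            my = kzalloc(sizeof(*my), GFP_KERNEL);
            if (!my)
                    return -ENOMEM;

            my->fdev.dev.parent = parent;
            my->fdev.ops = &my_foo_ops;
            return foo_device_register(&my->fdev);
    }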
On Fri, Nov 08, 2019 at 05:05:45PM -0400, Jason Gunthorpe wrote: > On Fri, Nov 08, 2019 at 01:34:35PM -0700, Alex Williamson wrote: > > On Fri, 8 Nov 2019 16:12:53 -0400 > > Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote: > > > > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote: > > > > > > The new intel driver has been having a very similar discussion about how to > > > > > > model their 'multi function device' ie to bind RDMA and other drivers to a > > > > > > shared PCI function, and I think that discussion settled on adding a new bus? > > > > > > > > > > > > Really these things are all very similar, it would be nice to have a clear > > > > > > methodology on how to use the device core if a single PCI device is split by > > > > > > software into multiple different functional units and attached to different > > > > > > driver instances. > > > > > > > > > > > > Currently there is alot of hacking in this area.. And a consistent scheme > > > > > > might resolve the ugliness with the dma_ops wrappers. > > > > > > > > > > > > We already have the 'mfd' stuff to support splitting platform devices, maybe > > > > > > we need to create a 'pci-mfd' to support splitting PCI devices? > > > > > > > > > > > > I'm not really clear how mfd and mdev relate, I always thought mdev was > > > > > > strongly linked to vfio. > > > > > > > > > > > > > > > > Mdev at beginning was strongly linked to vfio, but as I mentioned > > > > > above it is addressing more use case. > > > > > > > > > > I observed that discussion, but was not sure of extending mdev further. > > > > > > > > > > One way to do for Intel drivers to do is after series [9]. > > > > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO > > > > > RDMA driver mdev_register_driver(), matches on it and does the probe(). > > > > > > > > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs > > > > muddying the purpose of mdevs is not a clear trade off. > > > > > > IMHO, mdev has amdev_parent_ops structure clearly intended to link it > > > to vfio, so using a mdev for something not related to vfio seems like > > > a poor choice. > > > > Unless there's some opposition, I'm intended to queue this for v5.5: > > > > https://www.spinics.net/lists/kvm/msg199613.html > > > > mdev has started out as tied to vfio, but at it's core, it's just a > > device life cycle infrastructure with callbacks between bus drivers > > and vendor devices. If virtio is on the wrong path with the above > > series, please speak up. Thanks, > > Well, I think Greg just objected pretty strongly. Yes I did. I keep saying this again and again, and so did you here: > IMHO it is wrong to turn mdev into some API multiplexor. That is what > the driver core already does and AFAIK your bus type is supposed to > represent your API contract to your drivers. That is exactly right. Don't re-create the driver api interface at another layer please. thanks, greg k-h
On Fri, 8 Nov 2019 20:41:18 +0100, Jiri Pirko wrote: > Fri, Nov 08, 2019 at 08:06:40PM CET, jakub.kicinski@netronome.com wrote: > >On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote: > >> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote: > >> >On Thu, 7 Nov 2019 10:04:48 -0600, Parav Pandit wrote: > >> >> Mellanox sub function capability allows users to create several hundreds > >> >> of networking and/or rdma devices without depending on PCI SR-IOV support. > >> > > >> >You call the new port type "sub function" but the devlink port flavour > >> >is mdev. > >> > > >> >As I'm sure you remember you nacked my patches exposing NFP's PCI > >> >sub functions which are just regions of the BAR without any mdev > >> >capability. Am I in the clear to repost those now? Jiri? > >> > >> Well question is, if it makes sense to have SFs without having them as > >> mdev? I mean, we discussed the modelling thoroughtly and eventually we > >> realized that in order to model this correctly, we need SFs on "a bus". > >> Originally we were thinking about custom bus, but mdev is already there > >> to handle this. > > > >But the "main/real" port is not a mdev in your case. NFP is like mlx4. > >It has one PCI PF for multiple ports. > > I don't see how relevant the number of PFs-vs-uplink_ports is. Well. We have a slice per external port, the association between the port and the slice becomes irrelevant once switchdev mode is enabled, but the queues are assigned statically so it'd be a waste of resources to not show all slices as netdevs. > >> Our SFs are also just regions of the BAR, same thing as you have. > >> > >> Can't you do the same for nfp SFs? > >> Then the "mdev" flavour is enough for all. > > > >Absolutely not. > > > >Why not make the main device of mlx5 a mdev, too, if that's acceptable. > >There's (a) long precedence for multiple ports on one PCI PF in > >networking devices, (b) plenty deployed software > >which depend on the main devices hanging off the PCI PF directly. > > > >The point of mdevs is being able to sign them to VFs or run DPDK on > >them (map to user space). > > > >For normal devices existing sysfs hierarchy were one device has > >multiple children of a certain class, without a bus and a separate > >driver is perfectly fine. Do you think we should also slice all serial > >chips into mdevs if they have multiple lines. > > > >Exactly as I predicted much confusion about what's being achieved here, > >heh :) > > Please let me understand how your device is different. > Originally Parav didn't want to have mlx5 subfunctions as mdev. He > wanted to have them tight to the same pci device as the pf. No > difference from what you describe you want. However while we thought > about how to fit things in, how to handle na phys_port_name, how to see > things in sysfs we came up with an idea of a dedicated bus. The difference is that there is naturally a main device and subslices with this new mlx5 code. In mlx4 or nfp all ports are equal and statically allocated when FW initializes based on port breakout. Maybe it's the fact I spent last night at an airport but I'm feeling like I'm arguing about this stronger than I actually care :) > We took it upstream and people suggested to use mdev bus for this. > > Parav, please correct me if I'm wrong but I don't think where is a plan > to push SFs into VM or to userspace as Jakub expects, right? 
There's definitely a plan to push them to VFs, I believe that was part of the original requirements, otherwise there'd be absolutely no need for a bus to begin with.
Fri, Nov 08, 2019 at 10:21:20PM CET, jakub.kicinski@netronome.com wrote: >On Fri, 8 Nov 2019 20:41:18 +0100, Jiri Pirko wrote: >> Fri, Nov 08, 2019 at 08:06:40PM CET, jakub.kicinski@netronome.com wrote: >> >On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote: >> >> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote: >> >> >On Thu, 7 Nov 2019 10:04:48 -0600, Parav Pandit wrote: >> >> >> Mellanox sub function capability allows users to create several hundreds >> >> >> of networking and/or rdma devices without depending on PCI SR-IOV support. >> >> > >> >> >You call the new port type "sub function" but the devlink port flavour >> >> >is mdev. >> >> > >> >> >As I'm sure you remember you nacked my patches exposing NFP's PCI >> >> >sub functions which are just regions of the BAR without any mdev >> >> >capability. Am I in the clear to repost those now? Jiri? >> >> >> >> Well question is, if it makes sense to have SFs without having them as >> >> mdev? I mean, we discussed the modelling thoroughtly and eventually we >> >> realized that in order to model this correctly, we need SFs on "a bus". >> >> Originally we were thinking about custom bus, but mdev is already there >> >> to handle this. >> > >> >But the "main/real" port is not a mdev in your case. NFP is like mlx4. >> >It has one PCI PF for multiple ports. >> >> I don't see how relevant the number of PFs-vs-uplink_ports is. > >Well. We have a slice per external port, the association between the >port and the slice becomes irrelevant once switchdev mode is enabled, >but the queues are assigned statically so it'd be a waste of resources >to not show all slices as netdevs. > >> >> Our SFs are also just regions of the BAR, same thing as you have. >> >> >> >> Can't you do the same for nfp SFs? >> >> Then the "mdev" flavour is enough for all. >> > >> >Absolutely not. >> > >> >Why not make the main device of mlx5 a mdev, too, if that's acceptable. >> >There's (a) long precedence for multiple ports on one PCI PF in >> >networking devices, (b) plenty deployed software >> >which depend on the main devices hanging off the PCI PF directly. >> > >> >The point of mdevs is being able to sign them to VFs or run DPDK on >> >them (map to user space). >> > >> >For normal devices existing sysfs hierarchy were one device has >> >multiple children of a certain class, without a bus and a separate >> >driver is perfectly fine. Do you think we should also slice all serial >> >chips into mdevs if they have multiple lines. >> > >> >Exactly as I predicted much confusion about what's being achieved here, >> >heh :) >> >> Please let me understand how your device is different. >> Originally Parav didn't want to have mlx5 subfunctions as mdev. He >> wanted to have them tight to the same pci device as the pf. No >> difference from what you describe you want. However while we thought >> about how to fit things in, how to handle na phys_port_name, how to see >> things in sysfs we came up with an idea of a dedicated bus. > >The difference is that there is naturally a main device and subslices >with this new mlx5 code. In mlx4 or nfp all ports are equal and >statically allocated when FW initializes based on port breakout. Ah, I see. I was missing the static part in nfp. Now I understand. It is just an another "pf", but not real pf in the pci terminology, right? 
> >Maybe it's the fact I spent last night at an airport but I'm feeling >like I'm arguing about this stronger than I actually care :) > >> We took it upstream and people suggested to use mdev bus for this. >> >> Parav, please correct me if I'm wrong but I don't think where is a plan >> to push SFs into VM or to userspace as Jakub expects, right? > >There's definitely a plan to push them to VFs, I believe that was part >of the original requirements, otherwise there'd be absolutely no need >for a bus to begin with.
On Fri, 8 Nov 2019 16:12:53 -0400, Jason Gunthorpe wrote: > On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote: > > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote: > > > Mdev at beginning was strongly linked to vfio, but as I mentioned > > > above it is addressing more use case. > > > > > > I observed that discussion, but was not sure of extending mdev further. > > > > > > One way to do for Intel drivers to do is after series [9]. > > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO > > > RDMA driver mdev_register_driver(), matches on it and does the probe(). > > > > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs > > muddying the purpose of mdevs is not a clear trade off. > > IMHO, mdev has amdev_parent_ops structure clearly intended to link it > to vfio, so using a mdev for something not related to vfio seems like > a poor choice. Yes, my suggestion to use mdev was entirely based on the premise that the purpose of this work is to get vfio working.. otherwise I'm unclear as to why we'd need a bus in the first place. If this is just for containers - we have macvlan offload for years now, with no need for a separate device. > I suppose this series is the start and we will eventually see the > mlx5's mdev_parent_ops filled in to support vfio - but *right now* > this looks identical to the problem most of the RDMA capable net > drivers have splitting into a 'core' and a 'function' On the RDMA/Intel front, would you mind explaining what the main motivation for the special buses is? I'm a little confurious. My understanding is MFD was created to help with cases where single device has multiple pieces of common IP in it. Do modern RDMA cards really share IP across generations? Is there a need to reload the drivers for the separate pieces (I wonder if the devlink reload doesn't belong to the device model :(). Or is it purely an abstraction and people like abstractions?
On Fri, 8 Nov 2019 22:39:52 +0100, Jiri Pirko wrote: > >> Please let me understand how your device is different. > >> Originally Parav didn't want to have mlx5 subfunctions as mdev. He > >> wanted to have them tight to the same pci device as the pf. No > >> difference from what you describe you want. However while we thought > >> about how to fit things in, how to handle na phys_port_name, how to see > >> things in sysfs we came up with an idea of a dedicated bus. > > > >The difference is that there is naturally a main device and subslices > >with this new mlx5 code. In mlx4 or nfp all ports are equal and > >statically allocated when FW initializes based on port breakout. > > Ah, I see. I was missing the static part in nfp. Now I understand. It is > just an another "pf", but not real pf in the pci terminology, right? Ack, due to (real and perceived) HW limitations what should have been separate PFs got squished into a single big one. Biggest NFP chip has an insane (for a NIC) number of Ethernet ports.
On Fri, 8 Nov 2019 17:05:45 -0400 Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Fri, Nov 08, 2019 at 01:34:35PM -0700, Alex Williamson wrote: > > On Fri, 8 Nov 2019 16:12:53 -0400 > > Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote: > > > > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote: > > > > > > The new intel driver has been having a very similar discussion about how to > > > > > > model their 'multi function device' ie to bind RDMA and other drivers to a > > > > > > shared PCI function, and I think that discussion settled on adding a new bus? > > > > > > > > > > > > Really these things are all very similar, it would be nice to have a clear > > > > > > methodology on how to use the device core if a single PCI device is split by > > > > > > software into multiple different functional units and attached to different > > > > > > driver instances. > > > > > > > > > > > > Currently there is alot of hacking in this area.. And a consistent scheme > > > > > > might resolve the ugliness with the dma_ops wrappers. > > > > > > > > > > > > We already have the 'mfd' stuff to support splitting platform devices, maybe > > > > > > we need to create a 'pci-mfd' to support splitting PCI devices? > > > > > > > > > > > > I'm not really clear how mfd and mdev relate, I always thought mdev was > > > > > > strongly linked to vfio. > > > > > > > > > > > > > > > > Mdev at beginning was strongly linked to vfio, but as I mentioned > > > > > above it is addressing more use case. > > > > > > > > > > I observed that discussion, but was not sure of extending mdev further. > > > > > > > > > > One way to do for Intel drivers to do is after series [9]. > > > > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO > > > > > RDMA driver mdev_register_driver(), matches on it and does the probe(). > > > > > > > > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs > > > > muddying the purpose of mdevs is not a clear trade off. > > > > > > IMHO, mdev has amdev_parent_ops structure clearly intended to link it > > > to vfio, so using a mdev for something not related to vfio seems like > > > a poor choice. > > > > Unless there's some opposition, I'm intended to queue this for v5.5: > > > > https://www.spinics.net/lists/kvm/msg199613.html > > > > mdev has started out as tied to vfio, but at it's core, it's just a > > device life cycle infrastructure with callbacks between bus drivers > > and vendor devices. If virtio is on the wrong path with the above > > series, please speak up. Thanks, > > Well, I think Greg just objected pretty strongly. > > IMHO it is wrong to turn mdev into some API multiplexor. That is what > the driver core already does and AFAIK your bus type is supposed to > represent your API contract to your drivers. > > Since the bus type is ABI, 'mdev' is really all about vfio I guess? > > Maybe mdev should grow by factoring the special GUID life cycle stuff > into a helper library that can make it simpler to build proper API > specific bus's using that lifecycle model? ie the virtio I saw > proposed should probably be a mdev-virtio bus type providing this new > virtio API contract using a 'struct mdev_virtio'? I see, the bus:API contract is more clear when we're talking about physical buses and physical devices following a hardware specification. But if we take PCI for example, each PCI device has it's own internal API that operates on the bus API. 
PCI bus drivers match devices based on vendor and device ID, which defines that internal API, not the bus API. The bus API is pretty thin when we're talking virtual devices and virtual buses though. The bus "API" is essentially that lifecycle management, so I'm having a bit of a hard time differentiating this from saying "hey, that PCI bus is nice, but we can't have drivers using their own API on the same bus, so can we move the config space, reset, hotplug, etc, stuff into helpers and come up with an (ex.) mlx5_bus instead?" Essentially for virtual devices, we're dictating a bus per device type, whereas it seemed like a reasonable idea at the time to create a common virtual device bus, but maybe it went into the weeds when trying to figure out how device drivers match to devices on that bus and actually interact with them. > I only looked briefly but mdev seems like an unusual way to use the > driver core. *generally* I would expect that if a driver wants to > provide a foo_device (on a foo bus, providing the foo API contract) it > looks very broadly like: > > struct foo_device { > struct device dev; > const struct foo_ops *ops; > }; > struct my_foo_device { > struct foo_device fdev; > }; > > foo_device_register(&mydev->fdev); > > Which means we can use normal container_of() patterns, while mdev > seems to want to allocate all the structs internally.. I guess this is > because of how the lifecycle stuff works? From a device core view it > looks quite unnatural. Right, there's an attempt in mdev to do the common bits of the device creation in the core and pass it to the vendor driver to fill in the private bits. I'm sure it could be cleaner, patches welcome :) Thanks, Alex
Fri, Nov 08, 2019 at 10:51:09PM CET, jakub.kicinski@netronome.com wrote: >On Fri, 8 Nov 2019 22:39:52 +0100, Jiri Pirko wrote: >> >> Please let me understand how your device is different. >> >> Originally Parav didn't want to have mlx5 subfunctions as mdev. He >> >> wanted to have them tight to the same pci device as the pf. No >> >> difference from what you describe you want. However while we thought >> >> about how to fit things in, how to handle na phys_port_name, how to see >> >> things in sysfs we came up with an idea of a dedicated bus. >> > >> >The difference is that there is naturally a main device and subslices >> >with this new mlx5 code. In mlx4 or nfp all ports are equal and >> >statically allocated when FW initializes based on port breakout. >> >> Ah, I see. I was missing the static part in nfp. Now I understand. It is >> just an another "pf", but not real pf in the pci terminology, right? > >Ack, due to (real and perceived) HW limitations what should have been >separate PFs got squished into a single big one. > >Biggest NFP chip has an insane (for a NIC) number of Ethernet ports. Okay. So we'll end up having flavour "mdev" for the SFs that are spawned on the fly by the user, and flavour "sf" for the fixed ones - that is your patchset, if I recall correctly.
Hi Greg, Jason, > -----Original Message----- > From: Alex Williamson <alex.williamson@redhat.com> > > On Fri, 8 Nov 2019 17:05:45 -0400 > Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > On Fri, Nov 08, 2019 at 01:34:35PM -0700, Alex Williamson wrote: > > > On Fri, 8 Nov 2019 16:12:53 -0400 > > > Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote: > > > > > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote: > > > > > > > The new intel driver has been having a very similar > > > > > > > discussion about how to model their 'multi function device' > > > > > > > ie to bind RDMA and other drivers to a shared PCI function, and I > think that discussion settled on adding a new bus? > > > > > > > > > > > > > > Really these things are all very similar, it would be nice > > > > > > > to have a clear methodology on how to use the device core if > > > > > > > a single PCI device is split by software into multiple > > > > > > > different functional units and attached to different driver instances. > > > > > > > > > > > > > > Currently there is alot of hacking in this area.. And a > > > > > > > consistent scheme might resolve the ugliness with the dma_ops > wrappers. > > > > > > > > > > > > > > We already have the 'mfd' stuff to support splitting > > > > > > > platform devices, maybe we need to create a 'pci-mfd' to support > splitting PCI devices? > > > > > > > > > > > > > > I'm not really clear how mfd and mdev relate, I always > > > > > > > thought mdev was strongly linked to vfio. > > > > > > > > > > > > > > > > > > > Mdev at beginning was strongly linked to vfio, but as I > > > > > > mentioned above it is addressing more use case. > > > > > > > > > > > > I observed that discussion, but was not sure of extending mdev > further. > > > > > > > > > > > > One way to do for Intel drivers to do is after series [9]. > > > > > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO > > > > > > RDMA driver mdev_register_driver(), matches on it and does the > probe(). > > > > > > > > > > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs > > > > > muddying the purpose of mdevs is not a clear trade off. > > > > > > > > IMHO, mdev has amdev_parent_ops structure clearly intended to link > > > > it to vfio, so using a mdev for something not related to vfio > > > > seems like a poor choice. > > > > > > Unless there's some opposition, I'm intended to queue this for v5.5: > > > > > > https://www.spinics.net/lists/kvm/msg199613.html > > > > > > mdev has started out as tied to vfio, but at it's core, it's just a > > > device life cycle infrastructure with callbacks between bus drivers > > > and vendor devices. If virtio is on the wrong path with the above > > > series, please speak up. Thanks, > > > > Well, I think Greg just objected pretty strongly. > > > > IMHO it is wrong to turn mdev into some API multiplexor. That is what > > the driver core already does and AFAIK your bus type is supposed to > > represent your API contract to your drivers. > > > > Since the bus type is ABI, 'mdev' is really all about vfio I guess? > > > > Maybe mdev should grow by factoring the special GUID life cycle stuff > > into a helper library that can make it simpler to build proper API > > specific bus's using that lifecycle model? ie the virtio I saw > > proposed should probably be a mdev-virtio bus type providing this new > > virtio API contract using a 'struct mdev_virtio'? 
> > I see, the bus:API contract is more clear when we're talking about physical > buses and physical devices following a hardware specification. > But if we take PCI for example, each PCI device has it's own internal API that > operates on the bus API. PCI bus drivers match devices based on vendor and > device ID, which defines that internal API, not the bus API. The bus API is pretty > thin when we're talking virtual devices and virtual buses though. The bus "API" > is essentially that lifecycle management, so I'm having a bit of a hard time > differentiating this from saying "hey, that PCI bus is nice, but we can't have > drivers using their own API on the same bus, so can we move the config space, > reset, hotplug, etc, stuff into helpers and come up with an (ex.) mlx5_bus > instead?" Essentially for virtual devices, we're dictating a bus per device type, > whereas it seemed like a reasonable idea at the time to create a common > virtual device bus, but maybe it went into the weeds when trying to figure out > how device drivers match to devices on that bus and actually interact with > them. > > > I only looked briefly but mdev seems like an unusual way to use the > > driver core. *generally* I would expect that if a driver wants to > > provide a foo_device (on a foo bus, providing the foo API contract) it > > looks very broadly like: > > > > struct foo_device { > > struct device dev; > > const struct foo_ops *ops; > > }; > > struct my_foo_device { > > struct foo_device fdev; > > }; > > > > foo_device_register(&mydev->fdev); > > If I understood Greg's direction on using bus and Jason's suggestion of 'mdev-virtio' example, User has one of the three use cases as I described in cover letter. i.e. create a sub device and configure it. once its configured, Based on the use case, map it to right bus driver. 1. mdev-vfio (no demux business) 2. virtio (new bus) 3. mlx5_bus (new bus) We should be creating 3 different buses, instead of mdev bus being de-multiplexer of that? Hence, depending the device flavour specified, create such device on right bus? For example, $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo subdev_id 1 $ devlink create subdev pci/0000:05:00.0 flavour mdev <uuid> subdev_id 2 $ devlink create subdev pci/0000:05:00.0 flavour mlx5 id 1 subdev_id 3 $ devlink subdev pci/0000:05:00.0/<subdev_id> config <params> $ echo <respective_device_id> <sysfs_path>/bind Implement power management callbacks also on all above 3 buses? Abstract out mlx5_bus into more generic virtual bus (vdev bus?) so that multiple vendors can reuse?
On Fri, Nov 08, 2019 at 02:52:10PM -0700, Alex Williamson wrote: > > > > > > Unless there's some opposition, I'm intended to queue this for v5.5: > > > > > > https://www.spinics.net/lists/kvm/msg199613.html > > > > > > mdev has started out as tied to vfio, but at it's core, it's just a > > > device life cycle infrastructure with callbacks between bus drivers > > > and vendor devices. If virtio is on the wrong path with the above > > > series, please speak up. Thanks, > > > > Well, I think Greg just objected pretty strongly. > > > > IMHO it is wrong to turn mdev into some API multiplexor. That is what > > the driver core already does and AFAIK your bus type is supposed to > > represent your API contract to your drivers. > > > > Since the bus type is ABI, 'mdev' is really all about vfio I guess? > > > > Maybe mdev should grow by factoring the special GUID life cycle stuff > > into a helper library that can make it simpler to build proper API > > specific bus's using that lifecycle model? ie the virtio I saw > > proposed should probably be a mdev-virtio bus type providing this new > > virtio API contract using a 'struct mdev_virtio'? > > I see, the bus:API contract is more clear when we're talking about > physical buses and physical devices following a hardware > specification. Well, I don't think it matters, this is a software contract inside the kernel between the 'struct foo_device' (as provided by the foo_bus) and the 'struct foo_driver' This contract is certainly easier to define when a HW specification dictates basically how it works. > But if we take PCI for example, each PCI device has it's own internal > API that operates on the bus API. PCI bus drivers match devices based > on vendor and device ID, which defines that internal API, not the bus > API. Yes, this matching is part of the API contract between the bus and device driver. But all of the pci_* functions that accept a 'struct pci_device *' are also part of this API contract toward the driver. > The bus API is pretty thin when we're talking virtual devices and > virtual buses though. The bus "API" is essentially that lifecycle > management, so I'm having a bit of a hard time differentiating this But Parav just pointed out to a virtio SW API that had something like 20 API entry points. > instead?" Essentially for virtual devices, we're dictating a bus per > device type, whereas it seemed like a reasonable idea at the time to Well, what does a driver binding to a virtual device need to know? The virtual device API should provide all of that information. I think things like vfio and virtio APIs are very reasonable bus types. virtio in particular has a published 'hw-like' specification with some good layers that can build a bus API. Not so sure about the very HW specific things like the Intel driver and these SFs. These will really only ever bind to one driver and seem to have no commonalities. For those we either create a bus per driver-specific proprietary API (feels kind of wrong) or we have a generic bus essentially for managed multi-function hardware that uses a simple 'void *hw_data' as the driver API and some matching logic to support that. > create a common virtual device bus, but maybe it went into the weeds > when trying to figure out how device drivers match to devices on that > bus and actually interact with them. I think it is important to focus on the the SW API the 'struct foo_device' is supposed to provide toward the driver that binds to it. It should be a sensible API covering some well defined area.. Jason
On Fri, Nov 08, 2019 at 01:45:59PM -0800, Jakub Kicinski wrote: > > IMHO, mdev has amdev_parent_ops structure clearly intended to link it > > to vfio, so using a mdev for something not related to vfio seems like > > a poor choice. > > Yes, my suggestion to use mdev was entirely based on the premise that > the purpose of this work is to get vfio working.. otherwise I'm unclear > as to why we'd need a bus in the first place. If this is just for > containers - we have macvlan offload for years now, with no need for a > separate device. This SF thing is a full fledged VF function, it is not at all like macvlan. This is perhaps less important for the netdev part of the world, but the difference is very big for the RDMA side, and should enable VFIO too.. > On the RDMA/Intel front, would you mind explaining what the main > motivation for the special buses is? I'm a little confurious. Well, the issue is driver binding. For years we have had these multi-function netdev drivers that have a single PCI device which must bind into multiple subsystems, ie mlx5 does netdev and RDMA, the cxgb drivers do netdev, RDMA, SCSI initiator, SCSI target, etc. [And I expect when NVMe over TCP rolls out we will have drivers like cxgb4 binding to 6 subsytems in total!] Today most of this is a big hack where the PCI device binds to the netdev driver and then the other drivers in different subsystems 'discover' that an appropriate netdev is plugged in using various unique, hacky and ugly means. For instance cxgb4 duplicates a chunk of the device core, see cxgb4_register_uld() for example. Other drivers try to use netdev notifiers, and various other wild things. So, the general concept is to use the driver model to manage driver binding. A multi-subsystem driver would have several parts: - A pci_driver which binds to the pci_device (the core) It creates, on a bus, struct ??_device's for the other subsystems that this HW supports. ie if the chip supports netdev then a ??_device that binds to the netdev driver is created, same for RDMA - A ??_driver in netdev binds to the device and accesses the core API - A ??_driver in RDMA binds to the device and accesses the core API - A ??_driver in SCSI binds to the device and accesses the core API Now the driver model directly handles all binding, autoloading, discovery, etc, and 'netdev' is just another consumer of 'core' functionality. For something like mlx5 the 'core' is the stuff in drivers/net/ethernet/mellanox/mlx5/core/*.c, give or take. It is broadly generic stuff like send commands, create queues, manage HW resources, etc. There has been some lack of clarity on what the ?? should be. People have proposed platform and MFD, and those seem to be no-goes. So, it looks like ?? will be a mlx5_driver on a mlx5_bus, and Intel will use an ice_driver on a ice_bus, ditto for cxgb4, if I understand Greg's guidance. Though I'm wondering if we should have a 'multi_subsystem_device' that was really just about passing a 'void *core_handle' from the 'core' (ie the bus) to the driver (ie RDMA, netdev, etc). It seems weakly defined, but also exactly what every driver doing this needs.. It is basically what this series is abusing mdev to accomplish. > My understanding is MFD was created to help with cases where single > device has multiple pieces of common IP in it. MFD really seems to be good at splitting a device when the HW is orthogonal at the register level. Ie you might have regs 100-200 for ethernet and 200-300 for RDMA. 
But this is not how modern HW works, the functional division is more subtle and more software based. ie on most devices a netdev and rdma queue are nearly the same, just a few settings make them function differently. So what is needed isn't a split of register set like MFD specializes in, but a unique per-driver API between the 'core' and 'subsystem' parts of the multi-subsystem device. > Do modern RDMA cards really share IP across generations? What is a generation? Mellanox has had a stable RDMA driver across many sillicon generations. Intel looks like their new driver will support at least the last two or more sillicon generations.. RDMA drivers are monstrous complex things, there is a big incentive to not respin them every time a new chip comes out. > Is there a need to reload the drivers for the separate pieces (I > wonder if the devlink reload doesn't belong to the device model :(). Yes, it is already done, but without driver model support the only way to reload the rdma driver is to unload the entire module as there is no 'unbind' Jason
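As a rough sketch of the split described above (hypothetical foo_* names, not the actual mlx5/ice/cxgb4 code): the PCI 'core' driver registers one function device per supported subsystem on a vendor bus and hands it the core handle, and each subsystem driver then binds to that device through the normal driver model:

    #include <linux/device.h>
    #include <linux/pci.h>
    #include <linux/slab.h>

    struct foo_core;        /* the 'core' API handle, owned by the PCI driver */

    /* Device the PCI 'core' driver places on its bus, one per subsystem
     * it supports (netdev, RDMA, SCSI, ...).
     */
    struct foo_func_device {
            struct device dev;
            struct foo_core *core;  /* per-driver SW API toward the core */
    };

    static struct bus_type foo_bus_type = {
            .name = "foo",
            /* .match omitted for brevity */
    };

    static void foo_func_release(struct device *dev)
    {
            kfree(container_of(dev, struct foo_func_device, dev));
    }

    /* Called from the pci_driver's probe() once the core is initialized. */
    static struct foo_func_device *foo_add_function(struct pci_dev *pdev,
                                                    struct foo_core *core,
                                                    const char *kind)
    {
            struct foo_func_device *fdev;

            fdev = kzalloc(sizeof(*fdev), GFP_KERNEL);
            if (!fdev)
                    return NULL;

            fdev->core = core;
            fdev->dev.parent = &pdev->dev;
            fdev->dev.bus = &foo_bus_type;
            fdev->dev.release = foo_func_release;
            dev_set_name(&fdev->dev, "%s.%s", pci_name(pdev), kind);

            if (device_register(&fdev->dev)) {
                    put_device(&fdev->dev);
                    return NULL;
            }
            return fdev;
    }

    /* The RDMA-side driver binds here and talks to the core only through
     * fdev->core; the netdev and SCSI drivers would look the same.
     */
    static int foo_rdma_probe(struct device *dev)
    {
            struct foo_func_device *fdev =
                    container_of(dev, struct foo_func_device, dev);

            /* create queues, issue commands, etc. via fdev->core */
            return 0;
    }

    static struct device_driver foo_rdma_driver = {
            .name = "foo_rdma",
            .bus = &foo_bus_type,
            .probe = foo_rdma_probe,
            /* driver_register(&foo_rdma_driver) from the RDMA module */
    };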
Hi Jason, > -----Original Message----- > From: Jason Gunthorpe <jgg@ziepe.ca> > > On Fri, Nov 08, 2019 at 02:52:10PM -0700, Alex Williamson wrote: > > > > > > > > Unless there's some opposition, I'm intended to queue this for v5.5: > > > > > > > > https://www.spinics.net/lists/kvm/msg199613.html > > > > > > > > mdev has started out as tied to vfio, but at it's core, it's just > > > > a device life cycle infrastructure with callbacks between bus > > > > drivers and vendor devices. If virtio is on the wrong path with > > > > the above series, please speak up. Thanks, > > > > > > Well, I think Greg just objected pretty strongly. > > > > > > IMHO it is wrong to turn mdev into some API multiplexor. That is > > > what the driver core already does and AFAIK your bus type is > > > supposed to represent your API contract to your drivers. > > > > > > Since the bus type is ABI, 'mdev' is really all about vfio I guess? > > > > > > Maybe mdev should grow by factoring the special GUID life cycle > > > stuff into a helper library that can make it simpler to build proper > > > API specific bus's using that lifecycle model? ie the virtio I saw > > > proposed should probably be a mdev-virtio bus type providing this > > > new virtio API contract using a 'struct mdev_virtio'? > > > > I see, the bus:API contract is more clear when we're talking about > > physical buses and physical devices following a hardware > > specification. > > Well, I don't think it matters, this is a software contract inside the kernel > between the 'struct foo_device' (as provided by the foo_bus) and the 'struct > foo_driver' > > This contract is certainly easier to define when a HW specification dictates > basically how it works. > > > But if we take PCI for example, each PCI device has it's own internal > > API that operates on the bus API. PCI bus drivers match devices based > > on vendor and device ID, which defines that internal API, not the bus > > API. > > Yes, this matching is part of the API contract between the bus and device > driver. > > But all of the pci_* functions that accept a 'struct pci_device *' are also part of > this API contract toward the driver. > > > The bus API is pretty thin when we're talking virtual devices and > > virtual buses though. The bus "API" is essentially that lifecycle > > management, so I'm having a bit of a hard time differentiating this > > But Parav just pointed out to a virtio SW API that had something like > 20 API entry points. > > > instead?" Essentially for virtual devices, we're dictating a bus per > > device type, whereas it seemed like a reasonable idea at the time to > > Well, what does a driver binding to a virtual device need to know? > > The virtual device API should provide all of that information. > > I think things like vfio and virtio APIs are very reasonable bus types. virtio in > particular has a published 'hw-like' specification with some good layers that > can build a bus API. > > Not so sure about the very HW specific things like the Intel driver and these SFs. > These will really only ever bind to one driver and seem to have no > commonalities. > > For those we either create a bus per driver-specific proprietary API (feels kind > of wrong) or we have a generic bus essentially for managed multi-function > hardware that uses a simple 'void *hw_data' as the driver API and some > matching logic to support that. 
> It's certainly important to use a generic bus approach overall at the kernel level, so that every vendor doesn't define their own devlink flavor, id scheme, udev naming method, PM and so on (it is not just the bus definition). Coming to the hw_data part: even if this subdev (vendor) bus is created, it can still follow your foo_device example exactly. In fact my first published RFC [1] and its specific patch [2] were doing that. The probe() routine in series [1] didn't have a PCI-like struct subdev *, because I wanted to use the core's generic probe(). However it is still correct, because probe() can reach the foo_device using container_of(), and hence *hw_data is also resolved. So the struct looks like, struct subdev { struct device device; /* resource range */ /* num of irq vectors */ const char *hw_addr; [..] }; struct mlx5_subdev { struct subdev device; [..]; }; I request to reconsider RFC [1] for multi-function SF use, with an extension of the device flavour as 'virtio', 'mlx5' etc. in [3]. [1] https://lkml.org/lkml/2019/3/1/19 [2] https://lore.kernel.org/patchwork/patch/1046997/#1238851 [3] https://lkml.org/lkml/2019/3/1/25 > > create a common virtual device bus, but maybe it went into the weeds > > when trying to figure out how device drivers match to devices on that > > bus and actually interact with them. > > I think it is important to focus on the the SW API the 'struct foo_device' is > supposed to provide toward the driver that binds to it. > > It should be a sensible API covering some well defined area.. > > Jason
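A small sketch of the container_of() chain referred to above; only the subdev/mlx5_subdev layout quoted in the mail is taken from the RFC, the remaining field and function names are illustrative:

    #include <linux/device.h>

    struct subdev {
            struct device device;
            const char *hw_addr;    /* plus resource range, irq vectors, ... */
    };

    struct mlx5_subdev {
            struct subdev device;
            void *hw_data;          /* vendor-private data, no drvdata needed */
    };

    static int mlx5_subdev_probe(struct device *dev)
    {
            struct subdev *sdev = container_of(dev, struct subdev, device);
            struct mlx5_subdev *msdev =
                    container_of(sdev, struct mlx5_subdev, device);

            /* both sdev->hw_addr and msdev->hw_data are reachable here */
            return 0;
    }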
On Fri, Nov 08, 2019 at 10:48:31PM +0000, Parav Pandit wrote: > We should be creating 3 different buses, instead of mdev bus being de-multiplexer of that? > > Hence, depending the device flavour specified, create such device on right bus? > > For example, > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo subdev_id 1 > $ devlink create subdev pci/0000:05:00.0 flavour mdev <uuid> subdev_id 2 > $ devlink create subdev pci/0000:05:00.0 flavour mlx5 id 1 subdev_id 3 I like the idea of specifying what kind of interface you want at sub device creation time. It fits the driver model pretty well and doesn't require abusing the vfio mdev for binding to a netdev driver. > $ devlink subdev pci/0000:05:00.0/<subdev_id> config <params> > $ echo <respective_device_id> <sysfs_path>/bind Is explicit binding really needed? If you specify a vfio flavour why shouldn't the vfio driver autoload and bind to it right away? That is kind of the point of the driver model... (kind of related, but I don't get while all that GUID and lifecycle stuff in mdev should apply for something like a SF) > Implement power management callbacks also on all above 3 buses? > Abstract out mlx5_bus into more generic virtual bus (vdev bus?) so > that multiple vendors can reuse? In this specific case, why does the SF in mlx5 mode even need a bus? Is it only because of devlink? That would be unfortunate Jason
On Fri, Nov 08, 2019 at 08:44:26PM -0400, Jason Gunthorpe wrote: > There has been some lack of clarity on what the ?? should be. People > have proposed platform and MFD, and those seem to be no-goes. So, it > looks like ?? will be a mlx5_driver on a mlx5_bus, and Intel will use > an ice_driver on a ice_bus, ditto for cxgb4, if I understand Greg's > guidance. Yes, that is the only way it can work because you really are just sharing a single PCI device in a vendor-specific way, and they all need to get along with each one properly for that vendor-specific way. So each vendor needs its own "bus" to be able to work out things properly, I doubt you can make this more generic than that easily. > Though I'm wondering if we should have a 'multi_subsystem_device' that > was really just about passing a 'void *core_handle' from the 'core' > (ie the bus) to the driver (ie RDMA, netdev, etc). Ick, no. > It seems weakly defined, but also exactly what every driver doing this > needs.. It is basically what this series is abusing mdev to accomplish. What is so hard about writing a bus? Last I tried it was just a few hundred lines of code, if that. I know it's not the easiest in places, but we have loads of examples to crib from. If you have problems/questions, just ask! Or, worst case, you just do what I asked in this thread somewhere, and write a "virtual bus" where you just create devices and bind them to the driver before registering and away you go. No auto-loading needed (or possible), but then you have a generic layer that everyone can use if they want to (but you loose some functionality at the expense of generic code.) Are these constant long email threads a way that people are just trying to get me to do this work for them? Because if it is, it's working... thanks, greg k-h
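For reference, a bare-bones bus along these lines might look as follows; the xbus name is a placeholder, and a real bus would also want uevent/modalias support for module autoloading, PM callbacks and proper error handling:

    #include <linux/device.h>
    #include <linux/module.h>
    #include <linux/string.h>

    static int xbus_match(struct device *dev, struct device_driver *drv)
    {
            /* simplest possible matching: by name */
            return strcmp(dev_name(dev), drv->name) == 0;
    }

    static struct bus_type xbus_type = {
            .name = "xbus",
            .match = xbus_match,
    };

    static int __init xbus_init(void)
    {
            return bus_register(&xbus_type);
    }

    static void __exit xbus_exit(void)
    {
            bus_unregister(&xbus_type);
    }

    module_init(xbus_init);
    module_exit(xbus_exit);
    MODULE_LICENSE("GPL");

    /* A parent PCI driver would then device_register() its sub-devices with
     * dev->bus = &xbus_type, and each subsystem driver would driver_register()
     * a struct device_driver with .bus = &xbus_type.
     */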
Sat, Nov 09, 2019 at 09:46:59AM CET, gregkh@linuxfoundation.org wrote: >On Fri, Nov 08, 2019 at 08:44:26PM -0400, Jason Gunthorpe wrote: >> There has been some lack of clarity on what the ?? should be. People >> have proposed platform and MFD, and those seem to be no-goes. So, it >> looks like ?? will be a mlx5_driver on a mlx5_bus, and Intel will use >> an ice_driver on a ice_bus, ditto for cxgb4, if I understand Greg's >> guidance. > >Yes, that is the only way it can work because you really are just >sharing a single PCI device in a vendor-specific way, and they all need >to get along with each one properly for that vendor-specific way. So >each vendor needs its own "bus" to be able to work out things properly, >I doubt you can make this more generic than that easily. > >> Though I'm wondering if we should have a 'multi_subsystem_device' that >> was really just about passing a 'void *core_handle' from the 'core' >> (ie the bus) to the driver (ie RDMA, netdev, etc). > >Ick, no. > >> It seems weakly defined, but also exactly what every driver doing this >> needs.. It is basically what this series is abusing mdev to accomplish. > >What is so hard about writing a bus? Last I tried it was just a few >hundred lines of code, if that. I know it's not the easiest in places, >but we have loads of examples to crib from. If you have >problems/questions, just ask! > >Or, worst case, you just do what I asked in this thread somewhere, and >write a "virtual bus" where you just create devices and bind them to the >driver before registering and away you go. No auto-loading needed (or >possible), but then you have a generic layer that everyone can use if >they want to (but you loose some functionality at the expense of >generic code.) Pardon my ignorance, just to be clear: You suggest to have one-virtual-bus-per-driver or rather some common "xbus" to serve this purpose for all of them, right? If so, isn't that a bit ugly to have a bus in every driver? I wonder if there can be some abstraction found. > >Are these constant long email threads a way that people are just trying >to get me to do this work for them? Because if it is, it's working... Maybe they are just confused, like I am :) > >thanks, > >greg k-h
On Fri, 8 Nov 2019 20:44:26 -0400, Jason Gunthorpe wrote: > On Fri, Nov 08, 2019 at 01:45:59PM -0800, Jakub Kicinski wrote: > > Yes, my suggestion to use mdev was entirely based on the premise that > > the purpose of this work is to get vfio working.. otherwise I'm unclear > > as to why we'd need a bus in the first place. If this is just for > > containers - we have macvlan offload for years now, with no need for a > > separate device. > > This SF thing is a full fledged VF function, it is not at all like > macvlan. This is perhaps less important for the netdev part of the > world, but the difference is very big for the RDMA side, and should > enable VFIO too.. Well, macvlan used VMDq so it was pretty much a "legacy SR-IOV" VF. I'd perhaps need to learn more about RDMA to appreciate the difference. > > On the RDMA/Intel front, would you mind explaining what the main > > motivation for the special buses is? I'm a little confurious. > > Well, the issue is driver binding. For years we have had these > multi-function netdev drivers that have a single PCI device which must > bind into multiple subsystems, ie mlx5 does netdev and RDMA, the cxgb > drivers do netdev, RDMA, SCSI initiator, SCSI target, etc. [And I > expect when NVMe over TCP rolls out we will have drivers like cxgb4 > binding to 6 subsytems in total!] What I'm missing is why is it so bad to have a driver register to multiple subsystems. I've seen no end of hacks caused people trying to split their driver too deeply by functionality. Separate sub-drivers, buses and modules. The nfp driver was split up before I upstreamed it, I merged it into one monolithic driver/module. Code is still split up cleanly internally, the architecture doesn't change in any major way. Sure 5% of developers were upset they can't do some partial reloads they were used to, but they got used to the new ways, and 100% of users were happy about the simplicity. For the nfp I think the _real_ reason to have a bus was that it was expected to have some out-of-tree modules bind to it. Something I would not encourage :) Maybe RDMA and storage have some requirements where the reload of the part of the driver is important, IDK.. > > My understanding is MFD was created to help with cases where single > > device has multiple pieces of common IP in it. > > MFD really seems to be good at splitting a device when the HW is > orthogonal at the register level. Ie you might have regs 100-200 for > ethernet and 200-300 for RDMA. > > But this is not how modern HW works, the functional division is more > subtle and more software based. ie on most devices a netdev and rdma > queue are nearly the same, just a few settings make them function > differently. > > So what is needed isn't a split of register set like MFD specializes > in, but a unique per-driver API between the 'core' and 'subsystem' > parts of the multi-subsystem device. Exactly, because the device is one. For my simplistic brain one device means one driver, which can register to as many subsystems as it wants. > > Do modern RDMA cards really share IP across generations? > > What is a generation? Mellanox has had a stable RDMA driver across > many sillicon generations. Intel looks like their new driver will > support at least the last two or more sillicon generations.. > > RDMA drivers are monstrous complex things, there is a big incentive to > not respin them every time a new chip comes out. Ack, but then again none of the drivers gets rewritten from scratch, right? 
It's not that some "sub-drivers" get reused and some not, no?

> > Is there a need to reload the drivers for the separate pieces (I
> > wonder if the devlink reload doesn't belong to the device model :().
>
> Yes, it is already done, but without driver model support the only way
> to reload the rdma driver is to unload the entire module as there is
> no 'unbind'

The reload is the only thing that I can think of (other than out-of-tree
code), but with devlink, no, I believe it can be solved differently.

Thanks a lot for the explanation Jason, much appreciated! The practicality
of this is still a little elusive to me, but since Greg seems on board
I guess it's just me :)
On Sat, 9 Nov 2019 12:18:09 +0100, Jiri Pirko wrote: > >Are these constant long email threads a way that people are just trying > >to get me to do this work for them? Because if it is, it's working... > > Maybe they are just confused, like I am :) +1 :)
On Fri, 8 Nov 2019 20:57:08 -0400, Jason Gunthorpe wrote: > On Fri, Nov 08, 2019 at 10:48:31PM +0000, Parav Pandit wrote: > > We should be creating 3 different buses, instead of mdev bus being de-multiplexer of that? > > > > Hence, depending the device flavour specified, create such device on right bus? > > > > For example, > > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo subdev_id 1 > > $ devlink create subdev pci/0000:05:00.0 flavour mdev <uuid> subdev_id 2 > > $ devlink create subdev pci/0000:05:00.0 flavour mlx5 id 1 subdev_id 3 > > I like the idea of specifying what kind of interface you want at sub > device creation time. It fits the driver model pretty well and doesn't > require abusing the vfio mdev for binding to a netdev driver. Aren't the HW resources spun out in all three cases exactly identical? IMHO creation of sub device should only define which HW resources are provisioned/relegated. Specifying a driver when recreating a device seems a little backwards.
On Sat, Nov 09, 2019 at 12:18:09PM +0100, Jiri Pirko wrote: > Sat, Nov 09, 2019 at 09:46:59AM CET, gregkh@linuxfoundation.org wrote: > >On Fri, Nov 08, 2019 at 08:44:26PM -0400, Jason Gunthorpe wrote: > >> There has been some lack of clarity on what the ?? should be. People > >> have proposed platform and MFD, and those seem to be no-goes. So, it > >> looks like ?? will be a mlx5_driver on a mlx5_bus, and Intel will use > >> an ice_driver on a ice_bus, ditto for cxgb4, if I understand Greg's > >> guidance. > > > >Yes, that is the only way it can work because you really are just > >sharing a single PCI device in a vendor-specific way, and they all need > >to get along with each one properly for that vendor-specific way. So > >each vendor needs its own "bus" to be able to work out things properly, > >I doubt you can make this more generic than that easily. > > > >> Though I'm wondering if we should have a 'multi_subsystem_device' that > >> was really just about passing a 'void *core_handle' from the 'core' > >> (ie the bus) to the driver (ie RDMA, netdev, etc). > > > >Ick, no. > > > >> It seems weakly defined, but also exactly what every driver doing this > >> needs.. It is basically what this series is abusing mdev to accomplish. > > > >What is so hard about writing a bus? Last I tried it was just a few > >hundred lines of code, if that. I know it's not the easiest in places, > >but we have loads of examples to crib from. If you have > >problems/questions, just ask! > > > >Or, worst case, you just do what I asked in this thread somewhere, and > >write a "virtual bus" where you just create devices and bind them to the > >driver before registering and away you go. No auto-loading needed (or > >possible), but then you have a generic layer that everyone can use if > >they want to (but you loose some functionality at the expense of > >generic code.) > > Pardon my ignorance, just to be clear: You suggest to have > one-virtual-bus-per-driver or rather some common "xbus" to serve this > purpose for all of them, right? Yes. > If so, isn't that a bit ugly to have a bus in every driver? No, not if that's what you want to have for that specific type of device. I.e. you want to have multiple drivers all attached to a single PCI device and somehow "share" the physical resources properly in a sane way. > I wonder if there can be some abstraction found. The abstraction is just that, the bus one. It's not all that complex, is it? thanks, greg k-h
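For a sense of scale of the "just write a bus" suggestion above, a minimal per-driver virtual bus really is only a few dozen lines. A rough sketch, assuming a vendor picks its own naming (the mlx5_sf_* identifiers below are placeholders, not code from this series):

    #include <linux/device.h>
    #include <linux/module.h>

    /* Minimal match: any driver registered on this bus may bind any device
     * on it.  A real implementation would match on an id table instead. */
    static int mlx5_sf_bus_match(struct device *dev, struct device_driver *drv)
    {
            return 1;
    }

    static struct bus_type mlx5_sf_bus = {
            .name  = "mlx5_sf",
            .match = mlx5_sf_bus_match,
    };

    static int __init mlx5_sf_bus_init(void)
    {
            return bus_register(&mlx5_sf_bus);
    }

    static void __exit mlx5_sf_bus_exit(void)
    {
            bus_unregister(&mlx5_sf_bus);
    }

    module_init(mlx5_sf_bus_init);
    module_exit(mlx5_sf_bus_exit);
    MODULE_LICENSE("GPL");

Sub-function devices would then be registered with device_register() with dev->bus pointing at &mlx5_sf_bus, and the netdev/RDMA sub-drivers would bind to them via driver_register() on the same bus.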
On Sat, Nov 09, 2019 at 09:27:47AM -0800, Jakub Kicinski wrote: > On Fri, 8 Nov 2019 20:44:26 -0400, Jason Gunthorpe wrote: > > On Fri, Nov 08, 2019 at 01:45:59PM -0800, Jakub Kicinski wrote: > > > Yes, my suggestion to use mdev was entirely based on the premise that > > > the purpose of this work is to get vfio working.. otherwise I'm unclear > > > as to why we'd need a bus in the first place. If this is just for > > > containers - we have macvlan offload for years now, with no need for a > > > separate device. > > > > This SF thing is a full fledged VF function, it is not at all like > > macvlan. This is perhaps less important for the netdev part of the > > world, but the difference is very big for the RDMA side, and should > > enable VFIO too.. > > Well, macvlan used VMDq so it was pretty much a "legacy SR-IOV" VF. > I'd perhaps need to learn more about RDMA to appreciate the difference. > > > > On the RDMA/Intel front, would you mind explaining what the main > > > motivation for the special buses is? I'm a little confurious. > > > > Well, the issue is driver binding. For years we have had these > > multi-function netdev drivers that have a single PCI device which must > > bind into multiple subsystems, ie mlx5 does netdev and RDMA, the cxgb > > drivers do netdev, RDMA, SCSI initiator, SCSI target, etc. [And I > > expect when NVMe over TCP rolls out we will have drivers like cxgb4 > > binding to 6 subsytems in total!] > > What I'm missing is why is it so bad to have a driver register to > multiple subsystems. Because these PCI devices seem to do "different" things all in one PCI resource set. Blame the hardware designers :) > I've seen no end of hacks caused people trying to split their driver > too deeply by functionality. Separate sub-drivers, buses and modules. > > The nfp driver was split up before I upstreamed it, I merged it into > one monolithic driver/module. Code is still split up cleanly internally, > the architecture doesn't change in any major way. Sure 5% of developers > were upset they can't do some partial reloads they were used to, but > they got used to the new ways, and 100% of users were happy about the > simplicity. I agree, you should stick with the "one device/driver" thing where ever possible, like you did. > For the nfp I think the _real_ reason to have a bus was that it > was expected to have some out-of-tree modules bind to it. Something > I would not encourage :) That's not ok, and I agree with you. But there seems to be some more complex PCI devices that do lots of different things all at once. Kind of like a PCI device that wants to be both a keyboard and a storage device at the same time (i.e. a button on a disk drive...) thanks, greg k-h
On Sat, Nov 09, 2019 at 09:41:03AM -0800, Jakub Kicinski wrote: > On Fri, 8 Nov 2019 20:57:08 -0400, Jason Gunthorpe wrote:
> > On Fri, Nov 08, 2019 at 10:48:31PM +0000, Parav Pandit wrote:
> > > We should be creating 3 different buses, instead of mdev bus being de-multiplexer of that?
> > >
> > > Hence, depending the device flavour specified, create such device on right bus?
> > >
> > > For example,
> > > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo subdev_id 1
> > > $ devlink create subdev pci/0000:05:00.0 flavour mdev <uuid> subdev_id 2
> > > $ devlink create subdev pci/0000:05:00.0 flavour mlx5 id 1 subdev_id 3
> >
> > I like the idea of specifying what kind of interface you want at sub
> > device creation time. It fits the driver model pretty well and doesn't
> > require abusing the vfio mdev for binding to a netdev driver.
>
> Aren't the HW resources spun out in all three cases exactly identical?

Exactly? No, not really. The only constant is that some chunk of the
BAR is dedicated to this subdev.

The BAR is flexible, so a BAR chunk configured for virtio is not going
to support mlx5 mode. Aside from that, there are other differences,
i.e. mlx5 does not need a dedicated set of MSI-X's while other modes do.
There are fewer MSI-X's than SF's, so managing this is important for
the admin.

Even in modes which are very similar, like mlx5 vs mdev-vfio, the HW
still has to be configured to provide global DMA isolation on the NIC
for vfio as the IOMMU cannot be involved. This is extra overhead and
should not be activated unless vfio is being used.

.. and finally the driver core does not support a 'multiple-inheritance'
like idea, so we can't have a 'foo_device' that is three different things.

So somehow the 'flavour' of the 'struct device' has to be exposed to
userspace, and it is best if this is done at device creation time so the
BAR region and HW can be set up once and we don't have to have complex
reconfiguration flows.

Jason
On Sat, Nov 09, 2019 at 09:27:47AM -0800, Jakub Kicinski wrote: > On Fri, 8 Nov 2019 20:44:26 -0400, Jason Gunthorpe wrote: > > On Fri, Nov 08, 2019 at 01:45:59PM -0800, Jakub Kicinski wrote: > > > Yes, my suggestion to use mdev was entirely based on the premise that > > > the purpose of this work is to get vfio working.. otherwise I'm unclear > > > as to why we'd need a bus in the first place. If this is just for > > > containers - we have macvlan offload for years now, with no need for a > > > separate device. > > > > This SF thing is a full fledged VF function, it is not at all like > > macvlan. This is perhaps less important for the netdev part of the > > world, but the difference is very big for the RDMA side, and should > > enable VFIO too.. > > Well, macvlan used VMDq so it was pretty much a "legacy SR-IOV" VF. > I'd perhaps need to learn more about RDMA to appreciate the difference. It has a lot to do with the how the RDMA functionality works in the HW.. At least for mlx the RDMA is 'below' all the netdev stuff, so even though netdev has some offloaded vlan RDMA sees, essentially, the union of all the vlan's on the system. Which at least breaks the security model of a macvlan device for net-namespaces. Maybe with new HW something could be done, but today, the HW is limited. > > > On the RDMA/Intel front, would you mind explaining what the main > > > motivation for the special buses is? I'm a little confurious. > > > > Well, the issue is driver binding. For years we have had these > > multi-function netdev drivers that have a single PCI device which must > > bind into multiple subsystems, ie mlx5 does netdev and RDMA, the cxgb > > drivers do netdev, RDMA, SCSI initiator, SCSI target, etc. [And I > > expect when NVMe over TCP rolls out we will have drivers like cxgb4 > > binding to 6 subsytems in total!] > > What I'm missing is why is it so bad to have a driver register to > multiple subsystems. Well, for example, if you proposed to have a RDMA driver in drivers/net/ethernet/foo/, I would NAK it, and I hope Dave would too. Same for SCSI and nvme. This Linux process is that driver code for a subsystem lives in the subsystem and should be in a subsystem specific module. While it is technically possible to have a giant driver, it distorts our process in a way I don't think is good. So, we have software layers between the large Linux subsystems just to make the development side manageable and practical. .. once the code lives in another subsystem, it is in a new module. A new module requires some way to connect them all together, the driver core is the logical way to do this connection. I don't think a driver should be split beyond that. Even my suggestion of a 'core' may in practice just be the netdev driver as most of the other modules can't function without netdev. ie you can't do iSCSI without an IP stack. > > What is a generation? Mellanox has had a stable RDMA driver across > > many sillicon generations. Intel looks like their new driver will > > support at least the last two or more sillicon generations.. > > > > RDMA drivers are monstrous complex things, there is a big incentive to > > not respin them every time a new chip comes out. > > Ack, but then again none of the drivers gets rewritten from scratch, > right? It's not that some "sub-drivers" get reused and some not, no? Remarkably Intel is saying their new RDMA 'sub-driver' will be compatible with their ICE and pre-ICE (sorry, forget the names) netdev core drivers. 
netdev will get a different driver for each, but RDMA will use the same driver. Jason
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Friday, November 8, 2019 6:57 PM
> > We should be creating 3 different buses, instead of mdev bus being de-multiplexer of that?
> >
> > Hence, depending the device flavour specified, create such device on right bus?
> >
> > For example,
> > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo subdev_id 1
> > $ devlink create subdev pci/0000:05:00.0 flavour mdev <uuid> subdev_id 2
> > $ devlink create subdev pci/0000:05:00.0 flavour mlx5 id 1 subdev_id 3
>
> I like the idea of specifying what kind of interface you want at sub device
> creation time. It fits the driver model pretty well and doesn't require abusing
> the vfio mdev for binding to a netdev driver.
>
> > $ devlink subdev pci/0000:05:00.0/<subdev_id> config <params>
> > $ echo <respective_device_id> <sysfs_path>/bind
>
> Is explicit binding really needed?
No.

> If you specify a vfio flavour why shouldn't
> the vfio driver autoload and bind to it right away? That is kind of the point
> of the driver model...
>
If some configuration is needed that cannot be passed at device creation time, explicit bind later can be used.

> (kind of related, but I don't get while all that GUID and lifecycle stuff in mdev
> should apply for something like a SF)
>
GUID is just the name of the device.
But let's park this aside for a moment.

> > Implement power management callbacks also on all above 3 buses?
> > Abstract out mlx5_bus into more generic virtual bus (vdev bus?) so
> > that multiple vendors can reuse?
>
> In this specific case, why does the SF in mlx5 mode even need a bus?
> Is it only because of devlink? That would be unfortunate
>
Devlink is one part, due to identifying using bus/dev.
How do we refer to the devlink instance of an SF without a bus/device?
Can we extend devlink_register() to optionally accept an sf_id?

If we don't have a bus, creating a sub function (a device) without a 'struct device' which will have BAR, resources, etc. is odd.

Now if we cannot see a 'struct device' in sysfs, how do we persistently name them?
Are we ok to add
/sys/class/net/sf_netdev/subdev_id
and
/sys/class/infiniband/<rdma_dev>/subdev_id
so that systemd/udev can rename them as en<X?><subdev_id> and roce<X><subdev_id>?
If so, what will be X without a bus type?

This route without a bus is certainly helpful to overcome the IOMMU limitation where the IOMMU only listens to the pci bus type for DMAR setup,
dmar_register_bus_notifier(), and in
intel_iommu_init() -> bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
and other IOMMUs doing similar PCI/AMBA binding.
This is currently overcome using a dma_ops WA.
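As a rough sketch of the subdev_id attribute proposed above (the mlx5_sf structure and its subdev_id field are hypothetical here; DEVICE_ATTR_RO(), to_net_dev() and device_create_file() are the existing kernel interfaces being leaned on):

    #include <linux/device.h>
    #include <linux/netdevice.h>

    static ssize_t subdev_id_show(struct device *dev,
                                  struct device_attribute *attr, char *buf)
    {
            struct net_device *ndev = to_net_dev(dev);
            struct mlx5_sf *sf = netdev_priv(ndev);   /* hypothetical priv layout */

            return sprintf(buf, "%u\n", sf->subdev_id);
    }
    static DEVICE_ATTR_RO(subdev_id);

    /* Call after register_netdev(); creates
     * /sys/class/net/<ifname>/subdev_id for systemd/udev to consume. */
    static int mlx5_sf_sysfs_init(struct net_device *ndev)
    {
            return device_create_file(&ndev->dev, &dev_attr_subdev_id);
    }

A udev rule could then rename the interface to the en<X><subdev_id> style scheme mentioned above; the open question of what X should be without a bus type remains.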
On 2019/11/9 上午5:05, Jason Gunthorpe wrote: > On Fri, Nov 08, 2019 at 01:34:35PM -0700, Alex Williamson wrote: >> On Fri, 8 Nov 2019 16:12:53 -0400 >> Jason Gunthorpe <jgg@ziepe.ca> wrote: >> >>> On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote: >>>> On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote: >>>>>> The new intel driver has been having a very similar discussion about how to >>>>>> model their 'multi function device' ie to bind RDMA and other drivers to a >>>>>> shared PCI function, and I think that discussion settled on adding a new bus? >>>>>> >>>>>> Really these things are all very similar, it would be nice to have a clear >>>>>> methodology on how to use the device core if a single PCI device is split by >>>>>> software into multiple different functional units and attached to different >>>>>> driver instances. >>>>>> >>>>>> Currently there is alot of hacking in this area.. And a consistent scheme >>>>>> might resolve the ugliness with the dma_ops wrappers. >>>>>> >>>>>> We already have the 'mfd' stuff to support splitting platform devices, maybe >>>>>> we need to create a 'pci-mfd' to support splitting PCI devices? >>>>>> >>>>>> I'm not really clear how mfd and mdev relate, I always thought mdev was >>>>>> strongly linked to vfio. >>>>>> >>>>> Mdev at beginning was strongly linked to vfio, but as I mentioned >>>>> above it is addressing more use case. >>>>> >>>>> I observed that discussion, but was not sure of extending mdev further. >>>>> >>>>> One way to do for Intel drivers to do is after series [9]. >>>>> Where PCI driver says, MDEV_CLASS_ID_I40_FOO >>>>> RDMA driver mdev_register_driver(), matches on it and does the probe(). >>>> Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs >>>> muddying the purpose of mdevs is not a clear trade off. >>> IMHO, mdev has amdev_parent_ops structure clearly intended to link it >>> to vfio, so using a mdev for something not related to vfio seems like >>> a poor choice. >> Unless there's some opposition, I'm intended to queue this for v5.5: >> >> https://www.spinics.net/lists/kvm/msg199613.html >> >> mdev has started out as tied to vfio, but at it's core, it's just a >> device life cycle infrastructure with callbacks between bus drivers >> and vendor devices. If virtio is on the wrong path with the above >> series, please speak up. Thanks, > Well, I think Greg just objected pretty strongly. > > IMHO it is wrong to turn mdev into some API multiplexor. That is what > the driver core already does and AFAIK your bus type is supposed to > represent your API contract to your drivers. > > Since the bus type is ABI, 'mdev' is really all about vfio I guess? > > Maybe mdev should grow by factoring the special GUID life cycle stuff > into a helper library that can make it simpler to build proper API > specific bus's using that lifecycle model? ie the virtio I saw > proposed should probably be a mdev-virtio bus type providing this new > virtio API contract using a 'struct mdev_virtio'? Yes, and probably just decouple the vfio a little bit more from mdev, and allow mdev to register multiple types of buses. Vfio-mdev still go for mdev bus, but for virtio and other they will go their own. Thanks > > I only looked briefly but mdev seems like an unusual way to use the > driver core. 
> *generally* I would expect that if a driver wants to
> provide a foo_device (on a foo bus, providing the foo API contract) it
> looks very broadly like:
>
>  struct foo_device {
>       struct device dev;
>       const struct foo_ops *ops;
>  };
>  struct my_foo_device {
>       struct foo_device fdev;
>  };
>
>  foo_device_register(&mydev->fdev);
>
> Which means we can use normal container_of() patterns, while mdev
> seems to want to allocate all the structs internally.. I guess this is
> because of how the lifecycle stuff works? From a device core view it
> looks quite unnatural.
>
> Jason
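To fill out the foo bus pattern quoted above with the usual bus plumbing (the foo_* names come from that sketch; the probe wiring below is an assumption of how such a bus would typically be wired up, not code from any posted series):

    struct foo_driver {
            struct device_driver driver;
            int (*probe)(struct foo_device *fdev);
            void (*remove)(struct foo_device *fdev);
    };

    static inline struct foo_device *to_foo_device(struct device *dev)
    {
            return container_of(dev, struct foo_device, dev);
    }

    static inline struct foo_driver *to_foo_driver(struct device_driver *drv)
    {
            return container_of(drv, struct foo_driver, driver);
    }

    /* Bus-level probe: recover the wrapper types and hand a foo_device to
     * the sub-driver, so e.g. an RDMA module only ever sees struct foo_device. */
    static int foo_bus_probe(struct device *dev)
    {
            return to_foo_driver(dev->driver)->probe(to_foo_device(dev));
    }

Because my_foo_device embeds foo_device by value, the sub-driver's probe can in turn container_of() back to its own private structure, which is the "normal container_of() patterns" point being made.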
On Sun, 10 Nov 2019 10:18:55 +0100, gregkh@linuxfoundation.org wrote: > > What I'm missing is why is it so bad to have a driver register to > > multiple subsystems. > > Because these PCI devices seem to do "different" things all in one PCI > resource set. Blame the hardware designers :) See below, I don't think you can blame the HW designers in this particular case :) > > For the nfp I think the _real_ reason to have a bus was that it > > was expected to have some out-of-tree modules bind to it. Something > > I would not encourage :) > > That's not ok, and I agree with you. > > But there seems to be some more complex PCI devices that do lots of > different things all at once. Kind of like a PCI device that wants to > be both a keyboard and a storage device at the same time (i.e. a button > on a disk drive...) The keyboard which is also a storage device may be a clear cut case where multiple devices were integrated into one bus endpoint. The case with these advanced networking adapters is a little different in that they are one HW device which has oodles of FW implementing clients or acceleration for various networking protocols. The nice thing about having a fake bus is you can load out-of-tree drivers to operate extra protocols quite cleanly. I'm not saying that's what the code in question is doing, I'm saying I'd personally like to understand the motivation more clearly before every networking driver out there starts spawning buses. The only argument I've heard so far for the separate devices is reloading subset of the drivers, which I'd rate as moderately convincing.
On Sun, 10 Nov 2019 15:37:59 -0400, Jason Gunthorpe wrote: > On Sat, Nov 09, 2019 at 09:27:47AM -0800, Jakub Kicinski wrote: > > On Fri, 8 Nov 2019 20:44:26 -0400, Jason Gunthorpe wrote: > > > On Fri, Nov 08, 2019 at 01:45:59PM -0800, Jakub Kicinski wrote: > > > > Yes, my suggestion to use mdev was entirely based on the premise that > > > > the purpose of this work is to get vfio working.. otherwise I'm unclear > > > > as to why we'd need a bus in the first place. If this is just for > > > > containers - we have macvlan offload for years now, with no need for a > > > > separate device. > > > > > > This SF thing is a full fledged VF function, it is not at all like > > > macvlan. This is perhaps less important for the netdev part of the > > > world, but the difference is very big for the RDMA side, and should > > > enable VFIO too.. > > > > Well, macvlan used VMDq so it was pretty much a "legacy SR-IOV" VF. > > I'd perhaps need to learn more about RDMA to appreciate the difference. > > It has a lot to do with the how the RDMA functionality works in the > HW.. At least for mlx the RDMA is 'below' all the netdev stuff, so > even though netdev has some offloaded vlan RDMA sees, essentially, the > union of all the vlan's on the system. > > Which at least breaks the security model of a macvlan device for > net-namespaces. > > Maybe with new HW something could be done, but today, the HW is > limited. Oh, I think we sort of talked past each other there. I was just pointing to the fact that Intel's macvlan offload did well without any fake bus or devices. I'm not saying anything about the particulars of the virtualization from the networking perspective. > > > > On the RDMA/Intel front, would you mind explaining what the main > > > > motivation for the special buses is? I'm a little confurious. > > > > > > Well, the issue is driver binding. For years we have had these > > > multi-function netdev drivers that have a single PCI device which must > > > bind into multiple subsystems, ie mlx5 does netdev and RDMA, the cxgb > > > drivers do netdev, RDMA, SCSI initiator, SCSI target, etc. [And I > > > expect when NVMe over TCP rolls out we will have drivers like cxgb4 > > > binding to 6 subsytems in total!] > > > > What I'm missing is why is it so bad to have a driver register to > > multiple subsystems. > > Well, for example, if you proposed to have a RDMA driver in > drivers/net/ethernet/foo/, I would NAK it, and I hope Dave would > too. Same for SCSI and nvme. > > This Linux process is that driver code for a subsystem lives in the > subsystem and should be in a subsystem specific module. While it is > technically possible to have a giant driver, it distorts our process > in a way I don't think is good. > > So, we have software layers between the large Linux subsystems just to > make the development side manageable and practical. > > .. once the code lives in another subsystem, it is in a new module. A > new module requires some way to connect them all together, the driver > core is the logical way to do this connection. > > I don't think a driver should be split beyond that. Even my suggestion > of a 'core' may in practice just be the netdev driver as most of the > other modules can't function without netdev. ie you can't do iSCSI > without an IP stack. Okay, yes, that's what I was expecting you'd say. I'm not 100% convinced a bus is necessary, we lived long enough with drivers split across the tree... > > > What is a generation? 
Mellanox has had a stable RDMA driver across > > > many sillicon generations. Intel looks like their new driver will > > > support at least the last two or more sillicon generations.. > > > > > > RDMA drivers are monstrous complex things, there is a big incentive to > > > not respin them every time a new chip comes out. > > > > Ack, but then again none of the drivers gets rewritten from scratch, > > right? It's not that some "sub-drivers" get reused and some not, no? > > Remarkably Intel is saying their new RDMA 'sub-driver' will be compatible > with their ICE and pre-ICE (sorry, forget the names) netdev core > drivers. > > netdev will get a different driver for each, but RDMA will use the > same driver. I see :)
> -----Original Message-----
> From: kvm-owner@vger.kernel.org <kvm-owner@vger.kernel.org> On Behalf
> Of Jakub Kicinski
> Sent: Sunday, November 10, 2019 9:46 PM
> On Sun, 10 Nov 2019 10:18:55 +0100, gregkh@linuxfoundation.org wrote:
[..]
> The nice thing about having a fake bus is you can load out-of-tree drivers to
> operate extra protocols quite cleanly.
>
This series does NOT intend to do any out-of-tree driver. Please do not think in that direction for this series.

> I'm not saying that's what the code in question is doing, I'm saying I'd
> personally like to understand the motivation more clearly before every
> networking driver out there starts spawning buses. The only argument I've
> heard so far for the separate devices is reloading subset of the drivers, which
> I'd rate as moderately convincing.

The primary objectives behind using a bus in this series are:
1. get the same level of device view for an SF as for a PF/VF, via a devlink instance
2. not re-invent the already mature pm (suspend/resume) handling in devlink and/or the vendor driver
3. ability to bind a sub-function to different drivers depending on the use case, based on an 'in-kernel' defined class-id (mdev/virtio/kernel) - just like vfio-pci and the regular PF driver, by following the standard driver model (of course, it can be done using 3 or more buses, as one virtual mdev bus appears to be an abuse)
4. create->configure->bind process of a sub function (just like a VF)
5. persistent naming of the SF's netdev and rdmadev (again, like PF and VF)

I will wait for Jason's and Jiri's view on the alternative proposal I sent a few hours back to omit the bus for in-kernel use of the SF, and see how far we can run without a bus :-)
Mon, Nov 11, 2019 at 04:46:01AM CET, jakub.kicinski@netronome.com wrote:
>On Sun, 10 Nov 2019 10:18:55 +0100, gregkh@linuxfoundation.org wrote:
>> > What I'm missing is why is it so bad to have a driver register to
>> > multiple subsystems.
>>
>> Because these PCI devices seem to do "different" things all in one PCI
>> resource set. Blame the hardware designers :)
>
>See below, I don't think you can blame the HW designers in this
>particular case :)
>
>> > For the nfp I think the _real_ reason to have a bus was that it
>> > was expected to have some out-of-tree modules bind to it. Something
>> > I would not encourage :)
>>
>> That's not ok, and I agree with you.
>>
>> But there seems to be some more complex PCI devices that do lots of
>> different things all at once. Kind of like a PCI device that wants to
>> be both a keyboard and a storage device at the same time (i.e. a button
>> on a disk drive...)
>
>The keyboard which is also a storage device may be a clear cut case
>where multiple devices were integrated into one bus endpoint.

Also, I think a very important differentiator between the keyboard/button
and a NIC is that the keyboard/button is fixed. You have a driver bus with
2 devices at constant addresses.

However, in the case of NIC subfunctions, you have 0 at the beginning and
the user instructs to create more (maybe hundreds). Now important questions
appear:

1) How to create devices (what API) - mdev has this figured out
2) How to do the addressing of the devices. Needs to be
   predictable/defined by the user - mdev has this figured out
3) Udev names of netdevices - udev names them according to the
   bus/address. That is straightforward with mdev.
   I can't really see how to figure this one out in particular with
   per-driver busses :/

>
>The case with these advanced networking adapters is a little different
>in that they are one HW device which has oodles of FW implementing
>clients or acceleration for various networking protocols.
>
>The nice thing about having a fake bus is you can load out-of-tree
>drivers to operate extra protocols quite cleanly.
>
>I'm not saying that's what the code in question is doing, I'm saying
>I'd personally like to understand the motivation more clearly before
>every networking driver out there starts spawning buses. The only
>argument I've heard so far for the separate devices is reloading subset
>of the drivers, which I'd rate as moderately convincing.
On Mon, Nov 11, 2019 at 02:30:26PM +0100, Jiri Pirko wrote: > Mon, Nov 11, 2019 at 04:46:01AM CET, jakub.kicinski@netronome.com wrote: > >On Sun, 10 Nov 2019 10:18:55 +0100, gregkh@linuxfoundation.org wrote: > >> > What I'm missing is why is it so bad to have a driver register to > >> > multiple subsystems. > >> > >> Because these PCI devices seem to do "different" things all in one PCI > >> resource set. Blame the hardware designers :) > > > >See below, I don't think you can blame the HW designers in this > >particular case :) > > > >> > For the nfp I think the _real_ reason to have a bus was that it > >> > was expected to have some out-of-tree modules bind to it. Something > >> > I would not encourage :) > >> > >> That's not ok, and I agree with you. > >> > >> But there seems to be some more complex PCI devices that do lots of > >> different things all at once. Kind of like a PCI device that wants to > >> be both a keyboard and a storage device at the same time (i.e. a button > >> on a disk drive...) > > > >The keyboard which is also a storage device may be a clear cut case > >where multiple devices were integrated into one bus endpoint. > > Also, I think that very important differentiator between keyboard/button > and NIC is that keyboard/button is fixed. You have driver bus with 2 > devices on constant addresses. > > However in case of NIC subfunctions. You have 0 at he beginning and user > instructs to create more (maybe hundreds). Now important questions > appear: > > 1) How to create devices (what API) - mdev has this figured out > 2) How to to do the addressing of the devices. Needs to be > predictable/defined by the user - mdev has this figured out > 3) Udev names of netdevices - udev names that according to the > bus/address. That is straightforeward with mdev. > I can't really see how to figure this one in particular with > per-driver busses :/ Are network devices somehow only allowed to be on mdev busses? No, don't be silly, userspace handles this just fine today on any type of bus, it's not an issue. You don't have to like individual "driver busses", but you had better not be using a fake platform device to use mdev. That's my main objection... thanks, greg k-h
Sun, Nov 10, 2019 at 08:48:31PM CET, parav@mellanox.com wrote: > >> From: Jason Gunthorpe <jgg@ziepe.ca> >> Sent: Friday, November 8, 2019 6:57 PM >> > We should be creating 3 different buses, instead of mdev bus being de- >> multiplexer of that? >> > >> > Hence, depending the device flavour specified, create such device on right >> bus? >> > >> > For example, >> > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo >> > subdev_id 1 $ devlink create subdev pci/0000:05:00.0 flavour mdev >> > <uuid> subdev_id 2 $ devlink create subdev pci/0000:05:00.0 flavour >> > mlx5 id 1 subdev_id 3 >> >> I like the idea of specifying what kind of interface you want at sub device >> creation time. It fits the driver model pretty well and doesn't require abusing >> the vfio mdev for binding to a netdev driver. >> >> > $ devlink subdev pci/0000:05:00.0/<subdev_id> config <params> $ echo >> > <respective_device_id> <sysfs_path>/bind >> >> Is explicit binding really needed? >No. > >> If you specify a vfio flavour why shouldn't >> the vfio driver autoload and bind to it right away? That is kind of the point >> of the driver model... >> >It some configuration is needed that cannot be passed at device creation time, explicit bind later can be used. > >> (kind of related, but I don't get while all that GUID and lifecycle stuff in mdev >> should apply for something like a SF) >> >GUID is just the name of the device. >But lets park this aside for a moment. > >> > Implement power management callbacks also on all above 3 buses? >> > Abstract out mlx5_bus into more generic virtual bus (vdev bus?) so >> > that multiple vendors can reuse? >> >> In this specific case, why does the SF in mlx5 mode even need a bus? >> Is it only because of devlink? That would be unfortunate >> >Devlink is one part due to identifying using bus/dev. >How do we refer to its devlink instance of SF without bus/device? Question is, why to have devlink instance for SF itself. Same as VF, you don't need devlink instance. You only need devlink_port (or devlink_subdev) instance on the PF devlink parent for it. >Can we extend devlink_register() to accept optionally have sf_id? > >If we don't have a bus, creating sub function (a device), without a 'struct device' which will have BAR, resources, etc is odd. > >Now if we cannot see 'struct device' in sysfs, how do we persistently name them? >Are we ok to add /sys/class/net/sf_netdev/subdev_id >And >/sys/class/infiniband/<rdma_dev>/subdev_id > >So that systemd/udev can rename them as en<X?><subdev_id> and roce<X><subdev_id> >If so, what will be X without a bus type? > >This route without a bus is certainly helpful to overcome the IOMMU limitation where IOMMU only listens to pci bus type for DMAR setup, >dmar_register_bus_notifier(), and in >intel_iommu_init()-> bus_set_iommu(&pci_bus_type, &intel_iommu_ops); >and other IOMMU doing similar PCI/AMBA binding. >This is currently overcome using WA dma_ops.
Mon, Nov 11, 2019 at 03:14:30PM CET, gregkh@linuxfoundation.org wrote: >On Mon, Nov 11, 2019 at 02:30:26PM +0100, Jiri Pirko wrote: >> Mon, Nov 11, 2019 at 04:46:01AM CET, jakub.kicinski@netronome.com wrote: >> >On Sun, 10 Nov 2019 10:18:55 +0100, gregkh@linuxfoundation.org wrote: >> >> > What I'm missing is why is it so bad to have a driver register to >> >> > multiple subsystems. >> >> >> >> Because these PCI devices seem to do "different" things all in one PCI >> >> resource set. Blame the hardware designers :) >> > >> >See below, I don't think you can blame the HW designers in this >> >particular case :) >> > >> >> > For the nfp I think the _real_ reason to have a bus was that it >> >> > was expected to have some out-of-tree modules bind to it. Something >> >> > I would not encourage :) >> >> >> >> That's not ok, and I agree with you. >> >> >> >> But there seems to be some more complex PCI devices that do lots of >> >> different things all at once. Kind of like a PCI device that wants to >> >> be both a keyboard and a storage device at the same time (i.e. a button >> >> on a disk drive...) >> > >> >The keyboard which is also a storage device may be a clear cut case >> >where multiple devices were integrated into one bus endpoint. >> >> Also, I think that very important differentiator between keyboard/button >> and NIC is that keyboard/button is fixed. You have driver bus with 2 >> devices on constant addresses. >> >> However in case of NIC subfunctions. You have 0 at he beginning and user >> instructs to create more (maybe hundreds). Now important questions >> appear: >> >> 1) How to create devices (what API) - mdev has this figured out >> 2) How to to do the addressing of the devices. Needs to be >> predictable/defined by the user - mdev has this figured out >> 3) Udev names of netdevices - udev names that according to the >> bus/address. That is straightforeward with mdev. >> I can't really see how to figure this one in particular with >> per-driver busses :/ > >Are network devices somehow only allowed to be on mdev busses? Of course not. But there is a difference if we are talking about: a) "the usual" network devices, like PF, VF. - They are well defined and they have well defined lifecycle (pci probe, sriov sysfs for number of VFs, etc). I this world all works fine. Even if a device has 100 static subdevices (bus or no bus). b) dynamically created sub-bar-devices or subfunctions. Could be created by user. This is not handled now in kernel, we have to find correct iface. I don't really care it it is fakebus, driverbus, etc. I'm just concerned about how to handle 1), 2), 3) above. > >No, don't be silly, userspace handles this just fine today on any type >of bus, it's not an issue. > >You don't have to like individual "driver busses", but you had better >not be using a fake platform device to use mdev. That's my main >objection... Okay, I understand your objection. Do you have suggestion how to handle 1) 2) 3) from above? > >thanks, > >greg k-h
> -----Original Message----- > From: Jiri Pirko <jiri@resnulli.us> > Sent: Monday, November 11, 2019 8:18 AM > Sun, Nov 10, 2019 at 08:48:31PM CET, parav@mellanox.com wrote: > > > >> From: Jason Gunthorpe <jgg@ziepe.ca> > >> Sent: Friday, November 8, 2019 6:57 PM > >> > We should be creating 3 different buses, instead of mdev bus being > >> > de- > >> multiplexer of that? > >> > > >> > Hence, depending the device flavour specified, create such device > >> > on right > >> bus? > >> > > >> > For example, > >> > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo > >> > subdev_id 1 $ devlink create subdev pci/0000:05:00.0 flavour mdev > >> > <uuid> subdev_id 2 $ devlink create subdev pci/0000:05:00.0 flavour > >> > mlx5 id 1 subdev_id 3 > >> > >> I like the idea of specifying what kind of interface you want at sub > >> device creation time. It fits the driver model pretty well and > >> doesn't require abusing the vfio mdev for binding to a netdev driver. > >> > >> > $ devlink subdev pci/0000:05:00.0/<subdev_id> config <params> $ > >> > echo <respective_device_id> <sysfs_path>/bind > >> > >> Is explicit binding really needed? > >No. > > > >> If you specify a vfio flavour why shouldn't the vfio driver autoload > >> and bind to it right away? That is kind of the point of the driver > >> model... > >> > >It some configuration is needed that cannot be passed at device creation > time, explicit bind later can be used. > > > >> (kind of related, but I don't get while all that GUID and lifecycle > >> stuff in mdev should apply for something like a SF) > >> > >GUID is just the name of the device. > >But lets park this aside for a moment. > > > >> > Implement power management callbacks also on all above 3 buses? > >> > Abstract out mlx5_bus into more generic virtual bus (vdev bus?) so > >> > that multiple vendors can reuse? > >> > >> In this specific case, why does the SF in mlx5 mode even need a bus? > >> Is it only because of devlink? That would be unfortunate > >> > >Devlink is one part due to identifying using bus/dev. > >How do we refer to its devlink instance of SF without bus/device? > > Question is, why to have devlink instance for SF itself. Same as VF, you don't mlx5_core has devlink instance for PF and VF for long time now. Health report, txq/rxq dumps etc all anchored to this devlink instance even for VF. (similar to PF). And so, SF same framework should work for SF. > need devlink instance. You only need devlink_port (or > devlink_subdev) instance on the PF devlink parent for it. > Devlink_port or devlink_subdev are still on eswitch or mgmt side. They are not present on the side where devlink instance exist on side where txq/rxq/eq etc exist.
Mon, Nov 11, 2019 at 03:58:18PM CET, parav@mellanox.com wrote: > > >> -----Original Message----- >> From: Jiri Pirko <jiri@resnulli.us> >> Sent: Monday, November 11, 2019 8:18 AM >> Sun, Nov 10, 2019 at 08:48:31PM CET, parav@mellanox.com wrote: >> > >> >> From: Jason Gunthorpe <jgg@ziepe.ca> >> >> Sent: Friday, November 8, 2019 6:57 PM >> >> > We should be creating 3 different buses, instead of mdev bus being >> >> > de- >> >> multiplexer of that? >> >> > >> >> > Hence, depending the device flavour specified, create such device >> >> > on right >> >> bus? >> >> > >> >> > For example, >> >> > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo >> >> > subdev_id 1 $ devlink create subdev pci/0000:05:00.0 flavour mdev >> >> > <uuid> subdev_id 2 $ devlink create subdev pci/0000:05:00.0 flavour >> >> > mlx5 id 1 subdev_id 3 >> >> >> >> I like the idea of specifying what kind of interface you want at sub >> >> device creation time. It fits the driver model pretty well and >> >> doesn't require abusing the vfio mdev for binding to a netdev driver. >> >> >> >> > $ devlink subdev pci/0000:05:00.0/<subdev_id> config <params> $ >> >> > echo <respective_device_id> <sysfs_path>/bind >> >> >> >> Is explicit binding really needed? >> >No. >> > >> >> If you specify a vfio flavour why shouldn't the vfio driver autoload >> >> and bind to it right away? That is kind of the point of the driver >> >> model... >> >> >> >It some configuration is needed that cannot be passed at device creation >> time, explicit bind later can be used. >> > >> >> (kind of related, but I don't get while all that GUID and lifecycle >> >> stuff in mdev should apply for something like a SF) >> >> >> >GUID is just the name of the device. >> >But lets park this aside for a moment. >> > >> >> > Implement power management callbacks also on all above 3 buses? >> >> > Abstract out mlx5_bus into more generic virtual bus (vdev bus?) so >> >> > that multiple vendors can reuse? >> >> >> >> In this specific case, why does the SF in mlx5 mode even need a bus? >> >> Is it only because of devlink? That would be unfortunate >> >> >> >Devlink is one part due to identifying using bus/dev. >> >How do we refer to its devlink instance of SF without bus/device? >> >> Question is, why to have devlink instance for SF itself. Same as VF, you don't >mlx5_core has devlink instance for PF and VF for long time now. >Health report, txq/rxq dumps etc all anchored to this devlink instance even for VF. (similar to PF). >And so, SF same framework should work for SF. Right, for health it makes sense. > >> need devlink instance. You only need devlink_port (or >> devlink_subdev) instance on the PF devlink parent for it. >> >Devlink_port or devlink_subdev are still on eswitch or mgmt side. >They are not present on the side where devlink instance exist on side where txq/rxq/eq etc exist. > Got it.
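A rough sketch of the eswitch-side part just agreed on above, i.e. a devlink_port for the SF registered on the parent PF's devlink instance (the helper and index handling are hypothetical; devlink_port_register() and devlink_port_type_eth_set() are the existing devlink APIs, and no SF-specific port flavour existed at the time of this discussion):

    #include <net/devlink.h>

    /* Hypothetical helper: expose one sub function as a devlink_port
     * hanging off the parent PF's devlink instance. */
    static int sf_devlink_port_register(struct devlink *pf_devlink,
                                        struct devlink_port *dl_port,
                                        unsigned int sf_index)
    {
            int err;

            err = devlink_port_register(pf_devlink, dl_port, sf_index);
            if (err)
                    return err;

            /* Once the SF netdev exists, tie it to the port so that
             * "devlink port show" reports the netdev name:
             * devlink_port_type_eth_set(dl_port, sf_netdev);
             */
            return 0;
    }

The SF side would additionally keep its own devlink instance, as noted above, so that health reporters and queue dumps keep working the same way they do for a PF/VF.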
Hi All, > From: Jiri Pirko <jiri@resnulli.us> > Sent: Monday, November 11, 2019 9:06 AM > > Mon, Nov 11, 2019 at 03:58:18PM CET, parav@mellanox.com wrote: > > > > > >> -----Original Message----- > >> From: Jiri Pirko <jiri@resnulli.us> > >> Sent: Monday, November 11, 2019 8:18 AM Sun, Nov 10, 2019 at > >> 08:48:31PM CET, parav@mellanox.com wrote: > >> > > >> >> From: Jason Gunthorpe <jgg@ziepe.ca> > >> >> Sent: Friday, November 8, 2019 6:57 PM > >> >> > We should be creating 3 different buses, instead of mdev bus > >> >> > being > >> >> > de- > >> >> multiplexer of that? > >> >> > > >> >> > Hence, depending the device flavour specified, create such > >> >> > device on right > >> >> bus? > >> >> > > >> >> > For example, > >> >> > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo > >> >> > subdev_id 1 $ devlink create subdev pci/0000:05:00.0 flavour > >> >> > mdev <uuid> subdev_id 2 $ devlink create subdev pci/0000:05:00.0 > >> >> > flavour > >> >> > mlx5 id 1 subdev_id 3 > >> >> > >> >> I like the idea of specifying what kind of interface you want at > >> >> sub device creation time. It fits the driver model pretty well and > >> >> doesn't require abusing the vfio mdev for binding to a netdev driver. > >> >> > >> >> > $ devlink subdev pci/0000:05:00.0/<subdev_id> config <params> $ > >> >> > echo <respective_device_id> <sysfs_path>/bind > >> >> > >> >> Is explicit binding really needed? > >> >No. > >> > > >> >> If you specify a vfio flavour why shouldn't the vfio driver > >> >> autoload and bind to it right away? That is kind of the point of > >> >> the driver model... > >> >> > >> >It some configuration is needed that cannot be passed at device > >> >creation > >> time, explicit bind later can be used. > >> > > >> >> (kind of related, but I don't get while all that GUID and > >> >> lifecycle stuff in mdev should apply for something like a SF) > >> >> > >> >GUID is just the name of the device. > >> >But lets park this aside for a moment. > >> > > >> >> > Implement power management callbacks also on all above 3 buses? > >> >> > Abstract out mlx5_bus into more generic virtual bus (vdev bus?) > >> >> > so that multiple vendors can reuse? > >> >> > >> >> In this specific case, why does the SF in mlx5 mode even need a bus? > >> >> Is it only because of devlink? That would be unfortunate > >> >> > >> >Devlink is one part due to identifying using bus/dev. > >> >How do we refer to its devlink instance of SF without bus/device? > >> > >> Question is, why to have devlink instance for SF itself. Same as VF, > >> you don't > >mlx5_core has devlink instance for PF and VF for long time now. > >Health report, txq/rxq dumps etc all anchored to this devlink instance even for > VF. (similar to PF). > >And so, SF same framework should work for SF. > > Right, for health it makes sense. > > > > >> need devlink instance. You only need devlink_port (or > >> devlink_subdev) instance on the PF devlink parent for it. > >> > >Devlink_port or devlink_subdev are still on eswitch or mgmt side. > >They are not present on the side where devlink instance exist on side where > txq/rxq/eq etc exist. > > > > Got it. I am working on the revised v1 version of these series to address below concerns to achieve all the requirements captured in this cover letter. 1. Avoid mdev bus abuse (Jason's and Greg's input) 2. Avoid dma_ops overload (Christoph's comment) 3. Update cover letter for devlink examples (Jakub's comment) 4. Update cover letter to describe rdma persistent naming scheme (Leon's comment) 5. 
Jiri's few comments on code restructure