
[net-next,v4,00/15] Add mlx5 subfunction support

Message ID 20201214214352.198172-1-saeed@kernel.org (mailing list archive)
Series: Add mlx5 subfunction support

Message

Saeed Mahameed Dec. 14, 2020, 9:43 p.m. UTC
Hi Dave, Jakub, Jason,

This series from Parav was the theme of this mlx5 release cycle. We've
been waiting anxiously for the auxbus infrastructure to make it into the
kernel, and now that auxbus is in and all the stars are aligned, I can
finally submit this v4 of the devlink and mlx5 subfunction support.

Subfunctions came to solve the scaling issues of virtualization
and switchdev environments, where SRIOV failed to deliver and users ran
out of VFs very quickly, since SRIOV demands a huge amount of physical
resources in both the servers and the NIC.

Subfunctions provide the same functionality as SRIOV but in a much more
lightweight manner; please see the thorough and detailed
documentation from Parav below, in the commit messages, and in the
networking documentation patches at the end of this series.

Sending v4 as a continuation of v1, which was sent last month [0]:
[0] https://lore.kernel.org/linux-rdma/20201112192424.2742-1-parav@nvidia.com/

---
Changelog:
v3->v4:
 - Fix 32bit compilation issue

v2->v3:
 - added header file sf/priv.h to cmd.c to avoid missing prototype warning
 - made mlx5_sf_table_disable a static function as it is used only in one file

v1->v2:
 - added documentation for subfunction and its mlx5 implementation
 - added MLX5_SF config option documentation
 - rebased
 - dropped devlink global lock improvement patch as mlx5 doesn't support
   reload while SFs are allocated
 - dropped devlink reload lock patch as mlx5 doesn't support reload
   when SFs are allocated
 - used the updated vhca event from the device to add/remove the auxiliary device
 - split sf devlink port allocation and sf hardware context allocation

Parav Pandit Says:
=================

This patchset introduces support for mlx5 subfunction (SF).

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. An mlx5 subfunction has its own function capabilities
and its own resources. This means a subfunction has its own dedicated
queues (txq, rxq, cq, eq). These queues are neither shared with nor stolen
from the parent PCI function.

When a subfunction is RDMA capable, it has its own QP1, GID table and RDMA
resources, neither shared with nor stolen from the parent PCI function.

A subfunction has a dedicated window in the PCI BAR space that is not shared
with the other subfunctions or the parent PCI function. This ensures that all
class devices of the subfunction access only their assigned PCI BAR space.

A subfunction supports eswitch representation, through which it supports tc
offloads. The user must configure the eswitch to send/receive packets from/to
the subfunction port.

Subfunctions share PCI-level resources such as PCI MSI-X IRQs with
other subfunctions and/or with their parent PCI function.

Patch summary:
--------------
Patches 1 to 4 prepare devlink
Patches 5 to 7 add mlx5 SF device support
Patches 8 to 11 add mlx5 SF devlink port support
Patches 12 to 14 add documentation

Patch-1 prepares code to handle multiple port function attributes
Patch-2 introduces devlink pcisf port flavour similar to pcipf and pcivf
Patch-3 adds port add and delete driver callbacks
Patch-4 adds port function state get and set callbacks
Patch-5 adds mlx5 vhca event notifier support to distribute subfunction
        state change notifications
Patch-6 adds SF auxiliary device
Patch-7 adds SF auxiliary driver
Patch-8 prepares eswitch to handle SF vport
Patch-9 adds eswitch helpers to add/remove SF vport
Patch-10 implements devlink port add/del callbacks
Patch-11 implements devlink port function get/set callbacks
Patches 12 to 14 add documentation
Patch-12 adds devlink port documentation
Patch-13 extends devlink port documentation for subfunctions
Patch-14 adds mlx5 subfunction port documentation

Subfunction support is discussed in detail in RFC [1] and [2].
RFC [1] and its extension [2] describe the requirements, design and proposed
plumbing using devlink, auxiliary bus and sysfs for systemd/udev
support. The functionality of this patchset is best explained using the real
examples further below.

overview:
--------
A subfunction can be created and deleted by a user using the devlink port
add/delete interface.

A subfunction can be configured using the devlink port function attributes
before it is activated.

When a subfunction is activated, it results in an auxiliary device on
the host PCI device where it is deployed. A driver binds to the
auxiliary device and further creates the supported class devices.
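
For illustration, an activated subfunction appears on the auxiliary bus in
sysfs; the instance name below is just an example and will differ per system:

$ ls /sys/bus/auxiliary/devices/
mlx5_core.sf.4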

example subfunction usage sequence:
-----------------------------------
Change device to switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

Add a devlink port of subfunction flaovur:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Configure mac address of the port function:
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88

Now activate the function:
$ devlink port function set ens2f0npf0sf88 state active

Now use the auxiliary device and class devices:
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4

$ ip link show
127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0np0
129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff

$ rdma dev show
43: rdmap6s0f0: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
44: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112

After use, inactivate the function:
$ devlink port function set ens2f0npf0sf88 state inactive

Now delete the subfunction port:
$ devlink port del ens2f0npf0sf88
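
For reference, the port and its function attributes can be inspected at any
point with devlink port show (output below is illustrative and abridged; the
port index is assigned by the kernel):

$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf pfnum 0 sfnum 88
  function:
    hw_addr 00:00:00:00:88:88 state active opstate attached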

[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
[2] https://marc.info/?l=linux-netdev&m=158555928517777&w=2

=================

Parav Pandit (14):
  net/mlx5: Fix compilation warning for 32-bit platform
  devlink: Prepare code to fill multiple port function attributes
  devlink: Introduce PCI SF port flavour and port attribute
  devlink: Support add and delete devlink port
  devlink: Support get and set state of port function
  net/mlx5: Introduce vhca state event notifier
  net/mlx5: SF, Add auxiliary device support
  net/mlx5: SF, Add auxiliary device driver
  net/mlx5: E-switch, Add eswitch helpers for SF vport
  net/mlx5: SF, Add port add delete functionality
  net/mlx5: SF, Port function state change support
  devlink: Add devlink port documentation
  devlink: Extend devlink port documentation for subfunctions
  net/mlx5: Add devlink subfunction port documentation

Vu Pham (1):
  net/mlx5: E-switch, Prepare eswitch to handle SF vport

 Documentation/driver-api/auxiliary_bus.rst    |   2 +
 .../device_drivers/ethernet/mellanox/mlx5.rst | 209 +++++++
 .../networking/devlink/devlink-port.rst       | 199 +++++++
 Documentation/networking/devlink/index.rst    |   1 +
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |  19 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   8 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |  19 +
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |   5 +-
 .../mellanox/mlx5/core/esw/acl/egress_ofld.c  |   2 +-
 .../mellanox/mlx5/core/esw/devlink_port.c     |  41 ++
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  48 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  78 +++
 .../mellanox/mlx5/core/eswitch_offloads.c     |  47 +-
 .../net/ethernet/mellanox/mlx5/core/events.c  |   7 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  60 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  12 +
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  20 +
 .../net/ethernet/mellanox/mlx5/core/sf/cmd.c  |  49 ++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.c  | 271 +++++++++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.h  |  55 ++
 .../mellanox/mlx5/core/sf/dev/driver.c        | 101 ++++
 .../ethernet/mellanox/mlx5/core/sf/devlink.c  | 552 ++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/sf/hw_table.c | 233 ++++++++
 .../mlx5/core/sf/mlx5_ifc_vhca_event.h        |  82 +++
 .../net/ethernet/mellanox/mlx5/core/sf/priv.h |  21 +
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   |  92 +++
 .../mellanox/mlx5/core/sf/vhca_event.c        | 189 ++++++
 .../mellanox/mlx5/core/sf/vhca_event.h        |  57 ++
 .../net/ethernet/mellanox/mlx5/core/vport.c   |   3 +-
 include/linux/mlx5/driver.h                   |  16 +-
 include/linux/mlx5/mlx5_ifc.h                 |   6 +-
 include/net/devlink.h                         |  79 +++
 include/uapi/linux/devlink.h                  |  26 +
 net/core/devlink.c                            | 266 ++++++++-
 35 files changed, 2834 insertions(+), 50 deletions(-)
 create mode 100644 Documentation/networking/devlink/devlink-port.rst
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h

Comments

Alexander Duyck Dec. 15, 2020, 1:53 a.m. UTC | #1
On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org> wrote:
>
> Hi Dave, Jakub, Jason,
>
> This series form Parav was the theme of this mlx5 release cycle,
> we've been waiting anxiously for the auxbus infrastructure to make it into
> the kernel, and now as the auxbus is in and all the stars are aligned, I
> can finally submit this V2 of the devlink and mlx5 subfunction support.
>
> Subfunctions came to solve the scaling issue of virtualization
> and switchdev environments, where SRIOV failed to deliver and users ran
> out of VFs very quickly as SRIOV demands huge amount of physical resources
> in both of the servers and the NIC.
>
> Subfunction provide the same functionality as SRIOV but in a very
> lightweight manner, please see the thorough and detailed
> documentation from Parav below, in the commit messages and the
> Networking documentation patches at the end of this series.
>

Just to clarify a few things for myself. You mention virtualization
and SR-IOV in your patch description, but you cannot support direct
assignment with this, correct? The idea here is simply logical
partitioning of an existing network interface, correct? So this isn't
so much a solution for virtualization, but may work better for
containers. I view this as an important distinction to make, as the
first thing that came to mind when I read this was mediated devices,
which are similar but focused only on the virtualization case:
https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html

> Parav Pandit Says:
> =================
>
> This patchset introduces support for mlx5 subfunction (SF).
>
> A subfunction is a lightweight function that has a parent PCI function on
> which it is deployed. mlx5 subfunction has its own function capabilities
> and its own resources. This means a subfunction has its own dedicated
> queues(txq, rxq, cq, eq). These queues are neither shared nor stealed from
> the parent PCI function.

Rather than calling this a subfunction, would it make more sense to
call it something such as a queue set? It seems like this is exposing
some of the same functionality we did in the Intel drivers such as
ixgbe and i40e via the macvlan offload interface. However the
ixgbe/i40e hardware was somewhat limited in that we were only able to
expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and even
with that we have seen some limitations to the interface. It sounds
like you are able to break out RDMA capable devices this way as well.
So in terms of ways to go I would argue this is likely better. However,
one downside is that we are going to end up seeing each subfunction
being different from driver to driver and vendor to vendor, which I
would argue was also one of the problems with SR-IOV, as you end up
with a bit of vendor lock-in as a result of this feature since each
vendor will be providing a different interface.

> When subfunction is RDMA capable, it has its own QP1, GID table and rdma
> resources neither shared nor stealed from the parent PCI function.
>
> A subfunction has dedicated window in PCI BAR space that is not shared
> with ther other subfunctions or parent PCI function. This ensures that all
> class devices of the subfunction accesses only assigned PCI BAR space.
>
> A Subfunction supports eswitch representation through which it supports tc
> offloads. User must configure eswitch to send/receive packets from/to
> subfunction port.
>
> Subfunctions share PCI level resources such as PCI MSI-X IRQs with
> their other subfunctions and/or with its parent PCI function.

This piece to the architecture for this has me somewhat concerned. If
all your resources are shared and you are allowing devices to be
created incrementally you either have to pre-partition the entire
function which usually results in limited resources for your base
setup, or free resources from existing interfaces and redistribute
them as things change. I would be curious which approach you are
taking here? So for example if you hit a certain threshold will you
need to reset the port and rebalance the IRQs between the various
functions?

> Patch summary:
> --------------
> Patch 1 to 4 prepares devlink
> patch 5 to 7 mlx5 adds SF device support
> Patch 8 to 11 mlx5 adds SF devlink port support
> Patch 12 and 14 adds documentation
>
> Patch-1 prepares code to handle multiple port function attributes
> Patch-2 introduces devlink pcisf port flavour similar to pcipf and pcivf
> Patch-3 adds port add and delete driver callbacks
> Patch-4 adds port function state get and set callbacks
> Patch-5 mlx5 vhca event notifier support to distribute subfunction
>         state change notification
> Patch-6 adds SF auxiliary device
> Patch-7 adds SF auxiliary driver
> Patch-8 prepares eswitch to handler SF vport
> Patch-9 adds eswitch helpers to add/remove SF vport
> Patch-10 implements devlink port add/del callbacks
> Patch-11 implements devlink port function get/set callbacks
> Patch-12 to 14 adds documentation
> Patch-12 added mlx5 port function documentation
> Patch-13 adds subfunction documentation
> Patch-14 adds mlx5 subfunction documentation
>
> Subfunction support is discussed in detail in RFC [1] and [2].
> RFC [1] and extension [2] describes requirements, design and proposed
> plumbing using devlink, auxiliary bus and sysfs for systemd/udev
> support. Functionality of this patchset is best explained using real
> examples further below.
>
> overview:
> --------
> A subfunction can be created and deleted by a user using devlink port
> add/delete interface.
>
> A subfunction can be configured using devlink port function attribute
> before its activated.
>
> When a subfunction is activated, it results in an auxiliary device on
> the host PCI device where it is deployed. A driver binds to the
> auxiliary device that further creates supported class devices.
>
> example subfunction usage sequence:
> -----------------------------------
> Change device to switchdev mode:
> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
>
> Add a devlink port of subfunction flaovur:
> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Typo in your description. Also I don't know if you want to stick with
"flavour" or just shorten it to the U.S. spelling which is "flavor".

> Configure mac address of the port function:
> $ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88
>
> Now activate the function:
> $ devlink port function set ens2f0npf0sf88 state active
>
> Now use the auxiliary device and class devices:
> $ devlink dev show
> pci/0000:06:00.0
> auxiliary/mlx5_core.sf.4
>
> $ ip link show
> 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
>     altname enp6s0f0np0
> 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff>

I assume that p0sf88 is supposed to be the newly created subfunction.
However I thought the naming was supposed to be the same as what you
are referring to in the devlink, or did I miss something?

> $ rdma dev show
> 43: rdmap6s0f0: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
> 44: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
>
> After use inactivate the function:
> $ devlink port function set ens2f0npf0sf88 state inactive
>
> Now delete the subfunction port:
> $ devlink port del ens2f0npf0sf88

This seems wrong to me as it breaks the symmetry with the port add
command and assumes you have ownership of the interface in the host. I
would much prefer to see the same arguments that were passed to the
add command being used to do the teardown, as that would allow the
parent function to create the object, assign it to a container
namespace, and not need to pull it back in order to destroy it.
David Ahern Dec. 15, 2020, 2:44 a.m. UTC | #2
On 12/14/20 6:53 PM, Alexander Duyck wrote:
>> example subfunction usage sequence:
>> -----------------------------------
>> Change device to switchdev mode:
>> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
>>
>> Add a devlink port of subfunction flaovur:
>> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> 
> Typo in your description. Also I don't know if you want to stick with
> "flavour" or just shorten it to the U.S. spelling which is "flavor".

The term exists in devlink today (since 2018). When support was added to
iproute2 I decided there was no reason to require the US spelling over
the British spelling, so I accepted the patch.
Parav Pandit Dec. 15, 2020, 5:48 a.m. UTC | #3
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Sent: Tuesday, December 15, 2020 7:24 AM
> 
> On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> wrote:
> >
> > Hi Dave, Jakub, Jason,
> >
> 
> Just to clarify a few things for myself. You mention virtualization and SR-IOV
> in your patch description but you cannot support direct assignment with this
> correct? 
Correct. It cannot be directly assigned.

> The idea here is simply logical partitioning of an existing network
> interface, correct? 
No. The idea is to spawn multiple functions from a single PCI device.
These functions are not born in the PCI device or in the OS until they are created by the user.
Jason and Saeed explained this in great detail a few weeks back in the v0 version of the patchset at [1], [2] and [3].
I'd better not repeat all of it here; please go through it.
If you want to read the precursor to it, the RFC from Jiri at [4] also explains this in great detail.

> So this isn't so much a solution for virtualization, but may
> work better for containers. I view this as an important distinction to make as
> the first thing that came to mind when I read this was mediated devices
> which is similar, but focused only on the virtualization case:
> https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-
> device.html
>
Managing subfunctions using mediated devices was already ruled out last year at [5], as it is an abuse of the mdev bus for this purpose and has severe limitations in managing the subfunction device.
We are not going back to it.
It would duplicate a lot of the plumbing which already exists in devlink, netlink, the auxiliary bus and more.
 
> Rather than calling this a subfunction, would it make more sense to call it
> something such as a queue set? 
No, a queue is just one way to send and receive data/packets.
Jason and Saeed explained and discussed this with you and others during v0 a few weeks back at [1], [2], [3].
Please take a look.

> So in terms of ways to go I would argue this is likely better. However one
> downside is that we are going to end up seeing each subfunction being
> different from driver to driver and vendor to vendor which I would argue
> was also one of the problems with SR-IOV as you end up with a bit of vendor
> lock-in as a result of this feature since each vendor will be providing a
> different interface.
>
Each of several vendors provided a unified interface for managing VFs, i.e.
(a) enable/disable was via vendor-neutral sysfs
(b) the SRIOV capability is exposed via the standard PCI capability and sysfs
(c) SRIOV VF config (mac, vlan, rss, tx rate, spoof check, trust) uses vendor-agnostic netlink
even though the drivers' internal implementations largely differ in how trust, spoof, mac, vlan, rate etc. are enforced.
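
For reference, these vendor-neutral flows look roughly as follows (illustrative only; the PF netdev name, PCI address and MAC are placeholders borrowed from the example above):

$ echo 4 > /sys/class/net/ens2f0np0/device/sriov_numvfs       # (a) enable VFs via vendor-neutral sysfs
$ lspci -s 06:00.0 -vv | grep -A3 SR-IOV                      # (b) standard PCI SR-IOV capability
$ ip link set ens2f0np0 vf 0 mac 00:11:22:33:44:55            # (c) vendor-agnostic VF config via netlink
$ ip link set ens2f0np0 vf 0 vlan 100 spoofchk on trust off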

So the subfunction feature/attribute/functionality will be implemented differently internally in the driver matching the vendor's device, for the reasonably abstract concept of a 'subfunction'.

> > A Subfunction supports eswitch representation through which it
> > supports tc offloads. User must configure eswitch to send/receive
> > packets from/to subfunction port.
> >
> > Subfunctions share PCI level resources such as PCI MSI-X IRQs with
> > their other subfunctions and/or with its parent PCI function.
> 
> This piece to the architecture for this has me somewhat concerned. If all your
> resources are shared and 
Not all resources are shared.

> you are allowing devices to be created
> incrementally you either have to pre-partition the entire function which
> usually results in limited resources for your base setup, or free resources
> from existing interfaces and redistribute them as things change. I would be
> curious which approach you are taking here? So for example if you hit a
> certain threshold will you need to reset the port and rebalance the IRQs
> between the various functions?
No. It works a bit differently for the mlx5 device.
When the base function is started, it starts as if it doesn't have any subfunctions.
When a subfunction is instantiated, it spawns new resources in the device (hw, fw, memory) depending on how much the function wants.

For example, the PCI PF uses BAR 0, while subfunctions use BAR 2.
For IRQs, a subfunction instance shares the IRQs with its parent/hosting PCI PF.
In the future, yes, dedicated IRQs per SF are likely desired.
Sridhar also talked about limiting the number of queues of a subfunction.
I believe there will be resources/attributes of the function to be controlled.
devlink already provides a rich interface to achieve that using devlink resources [8].
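
For illustration, the generic devlink-resource flow looks roughly like the following (the resource path is hypothetical, shown only to sketch the plumbing; mlx5 may expose different resources, if any):

$ devlink resource show pci/0000:06:00.0
$ devlink resource set pci/0000:06:00.0 path /some_resource size 32
$ devlink dev reload pci/0000:06:00.0    # resource changes generally take effect on reload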

[..]

> > $ ip link show
> > 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state
> DOWN mode DEFAULT group default qlen 1000
> >     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
> >     altname enp6s0f0np0
> > 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
> mode DEFAULT group default qlen 1000
> >     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff>
> 
> I assume that p0sf88 is supposed to be the newly created subfunction.
> However I thought the naming was supposed to be the same as what you are
> referring to in the devlink, or did I miss something?
>
I believe you are confusing the representor netdevice of the subfunction with the devices of the subfunction (netdev, rdma, vdpa etc.).
I suggest referring to the diagram in patch 15 in [7] to see the stack, modules and objects.
Hope the description below clarifies a bit.
There are two netdevices.
(a) the representor netdevice, attached to the devlink port of the eswitch
(b) the netdevice of the SF used by the end application (in your example, this is assigned to the container).

Both netdevices obviously follow different naming schemes.
The representor netdevice follows a naming scheme well defined in the kernel + systemd/udev v245 and higher.
It is based on the phys_port_name sysfs attribute.
This is the same scheme used by the existing representors, which have existed for a year+ now, and it is further used by subfunction representors.

For the subfunction netdevice (p0sf88), systemd/udev will be extended. I put the example based on my few lines of a udev rule that reads
phys_port_name and the user-supplied sfnum, so that the user knows exactly which interface to assign to the container.
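
As an illustration of the attribute both schemes key on, phys_port_name can be read directly from sysfs (output is illustrative and follows the pfnum/sfnum of the port):

$ cat /sys/class/net/ens2f0npf0sf88/phys_port_name
pf0sf88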

> > After use inactivate the function:
> > $ devlink port function set ens2f0npf0sf88 state inactive
> >
> > Now delete the subfunction port:
> > $ devlink port del ens2f0npf0sf88
> 
> This seems wrong to me as it breaks the symmetry with the port add
> command and
The example with the representor device is only to make life easier for the user.
The devlink port del command works based on the devlink port index, just like the existing devlink port commands (get, set, split, unsplit).
I explained this in a thread with Sridhar at [6].
In short, a devlink port del <bus/device_name/port_index> command is just fine.
The port index is a unique handle within the devlink instance that the user refers to in order to delete, get and set port and port function attributes after its creation.
I chose the representor netdev example because it is more intuitive to relate to, but the port index is equally fine and supported.
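
For illustration, deletion by port index looks like the following (the index 32768 and the abridged output are placeholders; the real index is visible in devlink port show):

$ devlink port show
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf pfnum 0 sfnum 88
$ devlink port del pci/0000:06:00.0/32768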

> assumes you have ownership of the interface in the host. I
> would much prefer to to see the same arguments that were passed to the
> add command being used to do the teardown as that would allow for the
> parent function to create the object, assign it to a container namespace, and
> not need to pull it back in order to destroy it.
The parent function will not have the same netdevice name as the representor netdevice, because both devices exist in a single system for a large part of the use cases.
So the port delete command works on the port index.
The host doesn't need to pull it back to destroy it; it is destroyed via the port del command.

[1] https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/
[2] https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
[3] https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/
[4] https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/
[5] https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/
[6] https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/
[7] https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u
[8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html
Saeed Mahameed Dec. 15, 2020, 6:15 a.m. UTC | #4
On Mon, 2020-12-14 at 17:53 -0800, Alexander Duyck wrote:
> On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> wrote:
> > Hi Dave, Jakub, Jason,
> > 
> > This series form Parav was the theme of this mlx5 release cycle,
> > we've been waiting anxiously for the auxbus infrastructure to make
> > it into
> > the kernel, and now as the auxbus is in and all the stars are
> > aligned, I
> > can finally submit this V2 of the devlink and mlx5 subfunction
> > support.
> > 
> > Subfunctions came to solve the scaling issue of virtualization
> > and switchdev environments, where SRIOV failed to deliver and users
> > ran
> > out of VFs very quickly as SRIOV demands huge amount of physical
> > resources
> > in both of the servers and the NIC.
> > 
> > Subfunction provide the same functionality as SRIOV but in a very
> > lightweight manner, please see the thorough and detailed
> > documentation from Parav below, in the commit messages and the
> > Networking documentation patches at the end of this series.
> > 
> 
> Just to clarify a few things for myself. You mention virtualization
> and SR-IOV in your patch description but you cannot support direct
> assignment with this correct? The idea here is simply logical
> partitioning of an existing network interface, correct? So this isn't
> so much a solution for virtualization, but may work better for
> containers. I view this as an important distinction to make as the

In the current state, yes, but the SF solution can be extended to
support direct assignment, which is why I think the SF solution can do
better and eventually replace SRIOV.
Also, many customers are currently using SRIOV with containers to get
the performance and isolation features, since there were no other
options.

> first thing that came to mind when I read this was mediated devices
> which is similar, but focused only on the virtualization case:
> https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html
> 
> > Parav Pandit Says:
> > =================
> > 
> > This patchset introduces support for mlx5 subfunction (SF).
> > 
> > A subfunction is a lightweight function that has a parent PCI
> > function on
> > which it is deployed. mlx5 subfunction has its own function
> > capabilities
> > and its own resources. This means a subfunction has its own
> > dedicated
> > queues(txq, rxq, cq, eq). These queues are neither shared nor
> > stealed from
> > the parent PCI function.
> 
> Rather than calling this a subfunction, would it make more sense to
> call it something such as a queue set? It seems like this is exposing
> some of the same functionality we did in the Intel drivers such as
> ixgbe and i40e via the macvlan offload interface. However the
> ixgbe/i40e hardware was somewhat limited in that we were only able to
> expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and
> even
> with that we have seen some limitations to the interface. It sounds
> like you are able to break out RDMA capable devices this way as well.
> So in terms of ways to go I would argue this is likely better. 

We've discussed this thoroughly on V0; the SF solution is closer to a
VF than to VMDQ, this is not just a set of queues.

https://lore.kernel.org/linux-rdma/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/

> However
> one downside is that we are going to end up seeing each subfunction
> being different from driver to driver and vendor to vendor which I
> would argue was also one of the problems with SR-IOV as you end up
> with a bit of vendor lock-in as a result of this feature since each
> vendor will be providing a different interface.
> 

I disagree. SFs are tightly coupled with the switchdev model and devlink
function ports; they are backed by a well-defined model. I can say the
same about SRIOV with switchdev mode; this sort of vendor lock-in issue
is eliminated when you migrate to switchdev mode.

> > When subfunction is RDMA capable, it has its own QP1, GID table and
> > rdma
> > resources neither shared nor stealed from the parent PCI function.
> > 
> > A subfunction has dedicated window in PCI BAR space that is not
> > shared
> > with ther other subfunctions or parent PCI function. This ensures
> > that all
> > class devices of the subfunction accesses only assigned PCI BAR
> > space.
> > 
> > A Subfunction supports eswitch representation through which it
> > supports tc
> > offloads. User must configure eswitch to send/receive packets
> > from/to
> > subfunction port.
> > 
> > Subfunctions share PCI level resources such as PCI MSI-X IRQs with
> > their other subfunctions and/or with its parent PCI function.
> 
> This piece to the architecture for this has me somewhat concerned. If
> all your resources are shared and you are allowing devices to be

Not all, only PCI MSI-X, for now.

> created incrementally you either have to pre-partition the entire
> function which usually results in limited resources for your base
> setup, or free resources from existing interfaces and redistribute
> them as things change. I would be curious which approach you are
> taking here? So for example if you hit a certain threshold will you
> need to reset the port and rebalance the IRQs between the various
> functions?
> 

Currently SFs will use whatever IRQs the PF has pre-allocated for
itself, so there is no IRQ limit issue at the moment. We are
considering a dynamic IRQ pool with dynamic balancing, or even better,
using the IMS approach, which perfectly fits the SF architecture.
https://patchwork.kernel.org/project/linux-pci/cover/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/

As for internal resources, they are fully isolated (not shared) and
internally managed by FW, exactly like a VF's internal resources.
Alexander Duyck Dec. 15, 2020, 4:16 p.m. UTC | #5
On Mon, Dec 14, 2020 at 6:44 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 12/14/20 6:53 PM, Alexander Duyck wrote:
> >> example subfunction usage sequence:
> >> -----------------------------------
> >> Change device to switchdev mode:
> >> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
> >>
> >> Add a devlink port of subfunction flaovur:
> >> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> >
> > Typo in your description. Also I don't know if you want to stick with
> > "flavour" or just shorten it to the U.S. spelling which is "flavor".
>
> The term exists in devlink today (since 2018). When support was added to
> iproute2 I decided there was no reason to require the US spelling over
> the British spelling, so I accepted the patch.

Okay. The only reason I noticed is that "flaovur" is definitely a
wrong spelling. If it is already in the interface then there is no
need to change it.
Parav Pandit Dec. 15, 2020, 4:59 p.m. UTC | #6
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Sent: Tuesday, December 15, 2020 9:47 PM
> 
> On Mon, Dec 14, 2020 at 6:44 PM David Ahern <dsahern@gmail.com> wrote:
> >
> > On 12/14/20 6:53 PM, Alexander Duyck wrote:
> > >> example subfunction usage sequence:
> > >> -----------------------------------
> > >> Change device to switchdev mode:
> > >> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
> > >>
> > >> Add a devlink port of subfunction flaovur:
> > >> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> > >
> > > Typo in your description. Also I don't know if you want to stick
> > > with "flavour" or just shorten it to the U.S. spelling which is "flavor".
> >
> > The term exists in devlink today (since 2018). When support was added
> > to
> > iproute2 I decided there was no reason to require the US spelling over
> > the British spelling, so I accepted the patch.
> 
> Okay. The only reason why I noticed is because "flaovur" is definitely a wrong
> spelling. If it is already in the interface then no need to change it.
I am used to writing "flavor" and I realized that I should probably say "flavour" because in devlink it is that way.
So I added a 'u' but the typo put it in the wrong location. :-)
Thanks for catching it.
Saeed sent the v5 fixing this along with a few more English corrections.
Alexander Duyck Dec. 15, 2020, 6:47 p.m. UTC | #7
On Mon, Dec 14, 2020 at 9:48 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Alexander Duyck <alexander.duyck@gmail.com>
> > Sent: Tuesday, December 15, 2020 7:24 AM
> >
> > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> > wrote:
> > >
> > > Hi Dave, Jakub, Jason,
> > >
> >
> > Just to clarify a few things for myself. You mention virtualization and SR-IOV
> > in your patch description but you cannot support direct assignment with this
> > correct?
> Correct. it cannot be directly assigned.
>
> > The idea here is simply logical partitioning of an existing network
> > interface, correct?
> No. Idea is to spawn multiple functions from a single PCI device.
> These functions are not born in PCI device and in OS until they are created by user.

That is the definition of logical partitioning. You are essentially
taking one physical PCIe function and splitting up the resources over
multiple logical devices. With something like an MFD driver you would
partition the device as soon as the driver loads, but with this you
are peeling out resources and defining the devices on demand.

> Jason and Saeed explained this in great detail few weeks back in v0 version of the patchset at [1], [2] and [3].
> I better not repeat all of it here again. Please go through it.
> If you may want to read precursor to it, RFC from Jiri at [4] is also explains this in great detail.

I think I have a pretty good idea of how the feature works. My concern
is more the use of marketing speak versus actual functionality. The
way this is being set up, it sounds like it is useful for virtualization
and it is not, at least in its current state. It may be at some point
in the future but I worry that it is really going to muddy the waters
as we end up with yet another way to partition devices.

> > So this isn't so much a solution for virtualization, but may
> > work better for containers. I view this as an important distinction to make as
> > the first thing that came to mind when I read this was mediated devices
> > which is similar, but focused only on the virtualization case:
> > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-
> > device.html
> >
> Managing subfunction using medicated device is already ruled out last year at [5] as it is the abuse of the mdev bus for this purpose + has severe limitations of managing the subfunction device.

I agree with you on that. My thought was more the fact that the two
can be easily confused. If we are going to do this we need to define
that, for networking devices, using the mdev interface would perhaps
be deprecated and we would need to go through devlink. However,
before we do that we need to make sure we have this completely
standardized.

> We are not going back to it anymore.
> It will be duplicating lot of the plumbing which exists in devlink, netlink, auxiliary bus and more.

That is kind of my point. It is already in the kernel. What you are
adding is the stuff that is duplicating it. I'm assuming that in order
to be able to virtualize your interfaces in the future you are going
to have to make use of the same vfio plumbing that the mediated
devices do.

> > Rather than calling this a subfunction, would it make more sense to call it
> > something such as a queue set?
> No, queue is just one way to send and receive data/packets.
> Jason and Saeed explained and discussed  this piece to you and others during v0 few weeks back at [1], [2], [3].
> Please take a look.

Yeah, I recall that. However I feel like it is being oversold. It
isn't "SR-IOV done right"; it seems more like "VMDq done better". The
fact that interrupts are shared between the subfunctions is telling.
That is exactly how things work for Intel parts when they do VMDq as
well. The queues are split up into pools and a block of queues belongs
to a specific VM. From what I can tell the only difference is that
there is isolation of the pool into specific pages in the BAR,
which is essentially a requirement for mediated devices so that they
can be directly assigned.

> > So in terms of ways to go I would argue this is likely better. However one
> > downside is that we are going to end up seeing each subfunction being
> > different from driver to driver and vendor to vendor which I would argue
> > was also one of the problems with SR-IOV as you end up with a bit of vendor
> > lock-in as a result of this feature since each vendor will be providing a
> > different interface.
> >
> Each and several vendors provided unified interface for managing VFs. i.e.
> (a) enable/disable was via vendor neutral sysfs
> (b) sriov capability exposed via standard pci capability and sysfs
> (c) sriov vf config (mac, vlan, rss, tx rate, spoof check trust) are using vendor agnostic netlink
> Even though the driver's internal implementation largely differs on how trust, spoof, mac, vlan rate etc are enforced.
>
> So subfunction feature/attribute/functionality will be implemented differently internally in the driver matching vendor's device, for reasonably abstract concept of 'subfunction'.

I think you are missing the point. The biggest issue with SR-IOV
adoption was the fact that the drivers all created different VF
interfaces and those interfaces didn't support migration. Those were
the two biggest drawbacks of SR-IOV. I don't really see this approach
resolving either of those, and so that is one of the reasons why I say
this is closer to "VMDq done better" rather than "SR-IOV done right".
Assuming at some point one of the flavours is a virtio-net style
interface you could eventually get to the point of something similar
to what seems to have been the goal of mdev which was meant to address
these two points.

> > > A Subfunction supports eswitch representation through which it
> > > supports tc offloads. User must configure eswitch to send/receive
> > > packets from/to subfunction port.
> > >
> > > Subfunctions share PCI level resources such as PCI MSI-X IRQs with
> > > their other subfunctions and/or with its parent PCI function.
> >
> > This piece to the architecture for this has me somewhat concerned. If all your
> > resources are shared and
> All resources are not shared.

Just to clarify, when I say "shared" I mean that they are all coming
from the same function. So if for example I were to direct-assign the
PF then all the resources for the subfunctions would go with it.

> > you are allowing devices to be created
> > incrementally you either have to pre-partition the entire function which
> > usually results in limited resources for your base setup, or free resources
> > from existing interfaces and redistribute them as things change. I would be
> > curious which approach you are taking here? So for example if you hit a
> > certain threshold will you need to reset the port and rebalance the IRQs
> > between the various functions?
> No. Its works bit differently for mlx5 device.
> When base function is started, it started as if it doesn't have any subfunctions.
> When subfunction is instantiated, it spawns new resources in device (hw, fw, memory) depending on how much a function wants.
>
> For example, PCI PF uses BAR 0, while subfunctions uses BAR 2.

In the grand scheme BAR doesn't really matter much. The assumption
here is that resources are page aligned so that you can map the pages
into a guest eventually.

> For IRQs, subfunction instance shares the IRQ with its parent/hosting PCI PF.
> In future, yes, a dedicated IRQs per SF is likely desired.
> Sridhar also talked about limiting number of queues to a subfunction.
> I believe there will be resources/attributes of the function to be controlled.
> devlink already provides rich interface to achieve that using devlink resources [8].
>
> [..]

So it sounds like the device firmware is pre-partitioning the
resources and splitting them up between the PCI PF and your
subfunctions. Although in your case it sounds like there are
significantly more resources than you might find in an ixgbe interface
for instance. :)

The point is that we should probably define some sort of standard
and/or expectations on what should happen when you spawn a new
interface. Would it be acceptable for the PF and existing subfunctions
to have to reset if you need to rebalance the IRQ distribution, or
should they not be disrupted when you spawn a new interface?

One of the things I prefer about the mediated device setup is the fact
that it has essentially pre-partitioned things beforehand and you know
how many of each type of interface you can spawn. Is there any
information like that provided by this interface?

> > > $ ip link show
> > > 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state
> > DOWN mode DEFAULT group default qlen 1000
> > >     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
> > >     altname enp6s0f0np0
> > > 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
> > mode DEFAULT group default qlen 1000
> > >     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff>
> >
> > I assume that p0sf88 is supposed to be the newly created subfunction.
> > However I thought the naming was supposed to be the same as what you are
> > referring to in the devlink, or did I miss something?
> >
> I believe you are confused with the representor netdevice of subfuction with devices of subfunction. (netdev, rdma, vdpa etc).
> I suggest that please refer to the diagram in patch_15 in [7] to see the stack, modules, objects.
> Hope below description clarifies a bit.
> There are two netdevices.
> (a) representor netdevice, attached to the devlink port of the eswitch
> (b) netdevice of the SF used by the end application (in your example, this is assigned to container).

Sorry, that wasn't clear from your example. So in this case you
started in a namespace and the new device you created via devlink was
spawned in the root namespace?

> Both netdevice follow obviously a different naming scheme.
> Representor netdevice follows naming scheme well defined in kernel + systemd/udev v245 and higher.
> It is based on phys_port_name sysfs attribute.
> This is same for existing PF and SF representors exist for year+ now. Further used by subfunction.
>
> For subfunction netdevice (p0s88), system/udev will be extended. I put example based on my few lines of udev rule that reads
> phys_port_name and user supplied sfnum, so that user exactly knows which interface to assign to container.

Admittedly I have been out of the loop for the last couple years since
I had switched over to memory management work for a while. It would be
useful to include something that shows your created network interface
in the example in addition to your switchdev port.

> > > After use inactivate the function:
> > > $ devlink port function set ens2f0npf0sf88 state inactive
> > >
> > > Now delete the subfunction port:
> > > $ devlink port del ens2f0npf0sf88
> >
> > This seems wrong to me as it breaks the symmetry with the port add
> > command and
> Example of the representor device is only to make life easier for the user.
> Devlink port del command works based on the devlink port index, just like existing devlink port commands (get,set,split,unsplit).
> I explained this in a thread with Sridhar at [6].
> In short devlink port del <bus/device_name/port_index command is just fine.
> Port index is unique handle for the devlink instance that user refers to delete, get, set port and port function attributes post its creation.
> I choose the representor netdev example because it is more intuitive to related to, but port index is equally fine and supported.

Okay then, that addresses my concern. I just wanted to make sure we
weren't in some situation where you had to have the interface in order
to remove it.

> > assumes you have ownership of the interface in the host. I
> > would much prefer to to see the same arguments that were passed to the
> > add command being used to do the teardown as that would allow for the
> > parent function to create the object, assign it to a container namespace, and
> > not need to pull it back in order to destroy it.
> Parent function will not have same netdevice name as that of representor netdevice, because both devices exist in single system for large part of the use cases.
> So port delete command works on the port index.
> Host doesn't need to pull it back to destroy it. It is destroyed via port del command.
>
> [1] https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/
> [2] https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
> [3] https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/
> [4] https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/
> [5] https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/
> [6] https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/
> [7] https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u
> [8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html
>
Alexander Duyck Dec. 15, 2020, 7:12 p.m. UTC | #8
On Mon, Dec 14, 2020 at 10:15 PM Saeed Mahameed <saeed@kernel.org> wrote:
>
> On Mon, 2020-12-14 at 17:53 -0800, Alexander Duyck wrote:
> > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> > wrote:
> > > Hi Dave, Jakub, Jason,
> > >
> > > This series form Parav was the theme of this mlx5 release cycle,
> > > we've been waiting anxiously for the auxbus infrastructure to make
> > > it into
> > > the kernel, and now as the auxbus is in and all the stars are
> > > aligned, I
> > > can finally submit this V2 of the devlink and mlx5 subfunction
> > > support.
> > >
> > > Subfunctions came to solve the scaling issue of virtualization
> > > and switchdev environments, where SRIOV failed to deliver and users
> > > ran
> > > out of VFs very quickly as SRIOV demands huge amount of physical
> > > resources
> > > in both of the servers and the NIC.
> > >
> > > Subfunction provide the same functionality as SRIOV but in a very
> > > lightweight manner, please see the thorough and detailed
> > > documentation from Parav below, in the commit messages and the
> > > Networking documentation patches at the end of this series.
> > >
> >
> > Just to clarify a few things for myself. You mention virtualization
> > and SR-IOV in your patch description but you cannot support direct
> > assignment with this correct? The idea here is simply logical
> > partitioning of an existing network interface, correct? So this isn't
> > so much a solution for virtualization, but may work better for
> > containers. I view this as an important distinction to make as the
>
> at the current state yes, but the SF solution can be extended to
> support direct assignment, so this is why i think SF solution can do
> better and eventually replace SRIOV.

My only real concern is that this and mediated devices are essentially
the same thing. When you start making this work for direct-assignment
the only real difference becomes the switchdev and devlink interfaces.
Basically this is netdev specific mdev versus the PCIe specific mdev.

> also many customers are currently using SRIOV with containers to get
> the performance and isolation features since there was no other
> options.

There were, but you hadn't implemented them. The fact is the approach
Intel had taken for that was offloaded macvlan.

I think the big thing we really should do if we are going to go this
route is to look at standardizing what the flavours are that get
created by the parent netdevice. Otherwise we are just creating the
same mess we had with SRIOV all over again and muddying the waters of
mediated devices.

> > first thing that came to mind when I read this was mediated devices
> > which is similar, but focused only on the virtualization case:
> > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html
> >
> > > Parav Pandit Says:
> > > =================
> > >
> > > This patchset introduces support for mlx5 subfunction (SF).
> > >
> > > A subfunction is a lightweight function that has a parent PCI
> > > function on
> > > which it is deployed. mlx5 subfunction has its own function
> > > capabilities
> > > and its own resources. This means a subfunction has its own
> > > dedicated
> > > queues(txq, rxq, cq, eq). These queues are neither shared nor
> > > stealed from
> > > the parent PCI function.
> >
> > Rather than calling this a subfunction, would it make more sense to
> > call it something such as a queue set? It seems like this is exposing
> > some of the same functionality we did in the Intel drivers such as
> > ixgbe and i40e via the macvlan offload interface. However the
> > ixgbe/i40e hardware was somewhat limited in that we were only able to
> > expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and
> > even
> > with that we have seen some limitations to the interface. It sounds
> > like you are able to break out RDMA capable devices this way as well.
> > So in terms of ways to go I would argue this is likely better.
>
> We've discussed this thoroughly on V0, the SF solutions is closer to a
> VF than a VMDQ, this is not just a set of queues.
>
> https://lore.kernel.org/linux-rdma/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/

VMDq is more than just a set of queues. The fact is it is a pool of
resources that get created to handle the requests for a specific VM.
The extra bits that are added here are essentially stuff that was
required to support mediated devices.

> > However
> > one downside is that we are going to end up seeing each subfunction
> > being different from driver to driver and vendor to vendor which I
> > would argue was also one of the problems with SR-IOV as you end up
> > with a bit of vendor lock-in as a result of this feature since each
> > vendor will be providing a different interface.
> >
>
> I disagree, SFs are tightly coupled with switchdev model and devlink
> functions port, they are backed with the a well defined model, i can
> say the same about sriov with switchdev mode, this sort of vendor lock-
> in issues is eliminated when you migrate to switchdev mode.

What you are talking about is the backend. I am talking about what is
exposed to the user. The user is going to see a Mellanox device having
to be placed into their container in order to support this. One of the
advantages of the Intel approach was that the macvlan interface was
generic so you could have an offloaded interface or not and the user
wouldn't necessarily know. The offload could be disabled and the user
would be none the wiser as it is moved from one interface to another.
I see that as a big thing that is missing in this solution.

> > > When subfunction is RDMA capable, it has its own QP1, GID table and
> > > rdma
> > > resources neither shared nor stealed from the parent PCI function.
> > >
> > > A subfunction has dedicated window in PCI BAR space that is not
> > > shared
> > > with ther other subfunctions or parent PCI function. This ensures
> > > that all
> > > class devices of the subfunction accesses only assigned PCI BAR
> > > space.
> > >
> > > A Subfunction supports eswitch representation through which it
> > > supports tc
> > > offloads. User must configure eswitch to send/receive packets
> > > from/to
> > > subfunction port.
> > >
> > > Subfunctions share PCI level resources such as PCI MSI-X IRQs with
> > > their other subfunctions and/or with its parent PCI function.
> >
> > This piece to the architecture for this has me somewhat concerned. If
> > all your resources are shared and you are allowing devices to be
>
> not all, only PCI MSIX, for now..

They aren't shared after you partition them but they are coming from
the same device. Basically you are subdividing the BAR2 in order to
generate the subfunctions. BAR2 is a shared resource in my point of
view.

> > created incrementally you either have to pre-partition the entire
> > function which usually results in limited resources for your base
> > setup, or free resources from existing interfaces and redistribute
> > them as things change. I would be curious which approach you are
> > taking here? So for example if you hit a certain threshold will you
> > need to reset the port and rebalance the IRQs between the various
> > functions?
> >
>
> Currently SFs will use whatever IRQs the PF has pre-allocated for
> itself, so there is no IRQ limit issue at the moment, we are
> considering a dynamic IRQ pool with dynamic balancing, or even better
> us the IMS approach, which perfectly fits the SF architecture.
> https://patchwork.kernel.org/project/linux-pci/cover/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/

When you say you are using the PF's interrupts are you just using that
as a pool of resources or having the interrupt process interrupts for
both the PF and SFs? Without IMS you are limited to 2048 interrupts.
Moving over to that would make sense since SF is similar to mdev in
the way it partitions up the device and resources.

> for internal resources the are fully isolated (not shared) and
> they are internally managed by FW exactly like a VF internal resources.

I assume by isolated you mean they are contained within page aligned
blocks like what was required for mdev?
Saeed Mahameed Dec. 15, 2020, 8:05 p.m. UTC | #9
On Tue, 2020-12-15 at 10:47 -0800, Alexander Duyck wrote:
> On Mon, Dec 14, 2020 at 9:48 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > 
> > > From: Alexander Duyck <alexander.duyck@gmail.com>
> > > Sent: Tuesday, December 15, 2020 7:24 AM
> > > 
> > > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> > > wrote:
> > > > Hi Dave, Jakub, Jason,
> > > > 
> > > 
> > > Just to clarify a few things for myself. You mention
> > > virtualization and SR-IOV
> > > in your patch description but you cannot support direct
> > > assignment with this
> > > correct?
> > Correct. it cannot be directly assigned.
> > 
> > > The idea here is simply logical partitioning of an existing
> > > network
> > > interface, correct?
> > No. Idea is to spawn multiple functions from a single PCI device.
> > These functions are not born in PCI device and in OS until they are
> > created by user.
> 
> That is the definition of logical partitioning. You are essentially
> taking one physical PCIe function and splitting up the resources over
> multiple logical devices. With something like an MFD driver you would
> partition the device as soon as the driver loads, but with this you
> are peeling our resources and defining the devices on demand.
> 

Sure, the same is true for SRIOV and for VMDq: they are all logical
partitioning and they all share the same resources of the system. Our
point here is that the SF mechanism is more similar to SRIOV than to
VMDq; other than sharing the MSI-X vectors and slicing up the BAR,
everything else works exactly like SRIOV.

> > Jason and Saeed explained this in great detail few weeks back in v0
> > version of the patchset at [1], [2] and [3].
> > I better not repeat all of it here again. Please go through it.
> > If you may want to read precursor to it, RFC from Jiri at [4] is
> > also explains this in great detail.
> 
> I think I have a pretty good idea of how the feature works. My
> concern
> is more the use of marketing speak versus actual functionality. The
> way this is being setup it sounds like it is useful for
> virtualization
> and it is not, at least in its current state. It may be at some point
> in the future but I worry that it is really going to muddy the waters
> as we end up with yet another way to partition devices.
> 

Ok, maybe we have different views on the feature and use cases. This
is useful for virtualization for various reasons, take vdpa for
instance. But anyway, we can improve the documentation to address your
concerns, and at some point in the future when we add direct
assignment we can add the necessary documentation.


> > > So this isn't so much a solution for virtualization, but may
> > > work better for containers. I view this as an important
> > > distinction to make as
> > > the first thing that came to mind when I read this was mediated
> > > devices
> > > which is similar, but focused only on the virtualization case:
> > > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-
> > > device.html
> > > 
> > Managing subfunction using mediated device is already ruled out
> > last year at [5] as it is the abuse of the mdev bus for this
> > purpose + has severe limitations of managing the subfunction
> > device.
> 
> I agree with you on that. My thought was more the fact that the two
> can be easily confused. If we are going to do this we need to define
> that for networking devices perhaps that using the mdev interface
> would be deprecated and we would need to go through devlink. However
> before we do that we need to make sure we have this completely
> standardized.
> 

Then let's keep this discussion for later, when we add the direct
assignment support :)

> > We are not going back to it anymore.
> > It will be duplicating lot of the plumbing which exists in devlink,
> > netlink, auxiliary bus and more.
> 
> That is kind of my point. It is already in the kernel. What you are
> adding is the stuff that is duplicating it. I'm assuming that in
> order
> to be able to virtualize your interfaces in the future you are going
> to have to make use of the same vfio plumbing that the mediated
> devices do.
> 
> > > Rather than calling this a subfunction, would it make more sense
> > > to call it
> > > something such as a queue set?
> > No, queue is just one way to send and receive data/packets.
> > Jason and Saeed explained and discussed  this piece to you and
> > others during v0 few weeks back at [1], [2], [3].
> > Please take a look.
> 
> Yeah, I recall that. However I feel like it is being oversold. It
> isn't "SR-IOV done right" it seems more like "VMDq done better". The

Ok, then we will improve the documentation so as not to oversell this
as "SRIOV done right".

> fact that interrupts are shared between the subfunctions is telling.
> That is exactly how things work for Intel parts when they do VMDq as
> well. The queues are split up into pools and a block of queues
> belongs
> to a specific queue. From what I can can tell the only difference is
> that there is isolation of the pool into specific pages in the BAR.
> Which is essentially a requirement for mediated devices so that they
> can be direct assigned.
> 

I disagree, this is very dissimilar to VMDq.
An SF is a VF without a unique PCI function and BAR: the parent BAR is
split up to give the SF access to its own partition in the HW, and
from there it accesses the HW exactly like a VF would. There are no
"pools" involved here and no PF resource/queue management; from our HW
architecture perspective there is no difference between an SF and a
VF.. really.

> > > So in terms of ways to go I would argue this is likely better.
> > > However one
> > > downside is that we are going to end up seeing each subfunction
> > > being
> > > different from driver to driver and vendor to vendor which I
> > > would argue
> > > was also one of the problems with SR-IOV as you end up with a bit
> > > of vendor
> > > lock-in as a result of this feature since each vendor will be
> > > providing a
> > > different interface.
> > > 
> > Each and several vendors provided unified interface for managing
> > VFs. i.e.
> > (a) enable/disable was via vendor neutral sysfs
> > (b) sriov capability exposed via standard pci capability and sysfs
> > (c) sriov vf config (mac, vlan, rss, tx rate, spoof check trust)
> > are using vendor agnostic netlink
> > Even though the driver's internal implementation largely differs on
> > how trust, spoof, mac, vlan rate etc are enforced.
> > 
> > So subfunction feature/attribute/functionality will be implemented
> > differently internally in the driver matching vendor's device, for
> > reasonably abstract concept of 'subfunction'.
> 
> I think you are missing the point. The biggest issue with SR-IOV
> adoption was the fact that the drivers all created different VF
> interfaces and those interfaces didn't support migration. That was
> the
> two biggest drawbacks for SR-IOV. I don't really see this approach
> resolving either of those and so that is one of the reasons why I say
> this is closer to "VMDq done better"  rather than "SR-IOV done
> right".
> Assuming at some point one of the flavours is a virtio-net style
> interface you could eventually get to the point of something similar
> to what seems to have been the goal of mdev which was meant to
> address
> these two points.
> 

Would improving the documentation address these concerns?
But to be clear, it is much easier to solve SRIOV live migration when
SFs are involved.

> > > > A Subfunction supports eswitch representation through which it
> > > > supports tc offloads. User must configure eswitch to
> > > > send/receive
> > > > packets from/to subfunction port.
> > > > 
> > > > Subfunctions share PCI level resources such as PCI MSI-X IRQs
> > > > with
> > > > their other subfunctions and/or with its parent PCI function.
> > > 
> > > This piece to the architecture for this has me somewhat
> > > concerned. If all your
> > > resources are shared and
> > All resources are not shared.
> 
> Just to clarify, when I say "shared" I mean that they are all coming
> from the same function. So if for example I were to direct-assign the
> PF then all the resources for the subfunctions would go with it.
> 
> > > you are allowing devices to be created
> > > incrementally you either have to pre-partition the entire
> > > function which
> > > usually results in limited resources for your base setup, or free
> > > resources
> > > from existing interfaces and redistribute them as things change.
> > > I would be
> > > curious which approach you are taking here? So for example if you
> > > hit a
> > > certain threshold will you need to reset the port and rebalance
> > > the IRQs
> > > between the various functions?
> > No. Its works bit differently for mlx5 device.
> > When base function is started, it started as if it doesn't have any
> > subfunctions.
> > When subfunction is instantiated, it spawns new resources in device
> > (hw, fw, memory) depending on how much a function wants.
> > 
> > For example, PCI PF uses BAR 0, while subfunctions uses BAR 2.
> 
> In the grand scheme BAR doesn't really matter much. The assumption
> here is that resources are page aligned so that you can map the pages
> into a guest eventually.
> 
> > For IRQs, subfunction instance shares the IRQ with its
> > parent/hosting PCI PF.
> > In future, yes, a dedicated IRQs per SF is likely desired.
> > Sridhar also talked about limiting number of queues to a
> > subfunction.
> > I believe there will be resources/attributes of the function to be
> > controlled.
> > devlink already provides rich interface to achieve that using
> > devlink resources [8].
> > 
> > [..]
> 
> So it sounds like the device firmware is pre-partitioining the
> resources and splitting them up between the PCI PF and your
> subfunctions. Although in your case it sounds like there are
> significantly more resources than you might find in an ixgbe
> interface
> for instance. :)
> 
> The point is that we should probably define some sort of standard
> and/or expectations on what should happen when you spawn a new
> interface. Would it be acceptable for the PF and existing
> subfunctions
> to have to reset if you need to rebalance the IRQ distribution, or
> should they not be disrupted when you spawn a new interface?
> 
> One of the things I prefer about the mediated device setup is the
> fact
> that it has essentially pre-partitioned things beforehand and you
> know
> how many of each type of interface you can spawn. Is there any
> information like that provided by this interface?
> 
> > > > $ ip link show
> > > > 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state
> > > DOWN mode DEFAULT group default qlen 1000
> > > >     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
> > > >     altname enp6s0f0np0
> > > > 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state
> > > > DOWN
> > > mode DEFAULT group default qlen 1000
> > > >     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff>
> > > 
> > > I assume that p0sf88 is supposed to be the newly created
> > > subfunction.
> > > However I thought the naming was supposed to be the same as what
> > > you are
> > > referring to in the devlink, or did I miss something?
> > > 
> > I believe you are confused with the representor netdevice of
> > subfuction with devices of subfunction. (netdev, rdma, vdpa etc).
> > I suggest that please refer to the diagram in patch_15 in [7] to
> > see the stack, modules, objects.
> > Hope below description clarifies a bit.
> > There are two netdevices.
> > (a) representor netdevice, attached to the devlink port of the
> > eswitch
> > (b) netdevice of the SF used by the end application (in your
> > example, this is assigned to container).
> 
> Sorry, that wasn't clear from your example. So in this case you
> started in a namespace and the new device you created via devlink was
> spawned in the root namespace?
> 
> > Both netdevice follow obviously a different naming scheme.
> > Representor netdevice follows naming scheme well defined in kernel
> > + systemd/udev v245 and higher.
> > It is based on phys_port_name sysfs attribute.
> > This is same for existing PF and SF representors exist for year+
> > now. Further used by subfunction.
> > 
> > For subfunction netdevice (p0s88), system/udev will be extended. I
> > put example based on my few lines of udev rule that reads
> > phys_port_name and user supplied sfnum, so that user exactly knows
> > which interface to assign to container.
> 
> Admittedly I have been out of the loop for the last couple years
> since
> I had switched over to memory management work for a while. It would
> be
> useful to include something that shows your created network interface
> in the example in addition to your switchdev port.
> 
> > > > After use inactivate the function:
> > > > $ devlink port function set ens2f0npf0sf88 state inactive
> > > > 
> > > > Now delete the subfunction port:
> > > > $ devlink port del ens2f0npf0sf88
> > > 
> > > This seems wrong to me as it breaks the symmetry with the port
> > > add
> > > command and
> > Example of the representor device is only to make life easier for
> > the user.
> > Devlink port del command works based on the devlink port index,
> > just like existing devlink port commands (get,set,split,unsplit).
> > I explained this in a thread with Sridhar at [6].
> > In short devlink port del <bus/device_name/port_index command is
> > just fine.
> > Port index is unique handle for the devlink instance that user
> > refers to delete, get, set port and port function attributes post
> > its creation.
> > I choose the representor netdev example because it is more
> > intuitive to related to, but port index is equally fine and
> > supported.
> 
> Okay then, that addresses my concern. I just wanted to make sure we
> weren't in some situation where you had to have the interface in
> order
> to remove it.
> 
> > > assumes you have ownership of the interface in the host. I
> > > would much prefer to to see the same arguments that were passed
> > > to the
> > > add command being used to do the teardown as that would allow for
> > > the
> > > parent function to create the object, assign it to a container
> > > namespace, and
> > > not need to pull it back in order to destroy it.
> > Parent function will not have same netdevice name as that of
> > representor netdevice, because both devices exist in single system
> > for large part of the use cases.
> > So port delete command works on the port index.
> > Host doesn't need to pull it back to destroy it. It is destroyed
> > via port del command.
> > 
> > [1] 
> > https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/
> > [2] 
> > https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
> > [3] 
> > https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/
> > [4] 
> > https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/
> > [5] 
> > https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/
> > [6] 
> > https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/
> > [7] 
> > https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u
> > [8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html
> >
Saeed Mahameed Dec. 15, 2020, 8:35 p.m. UTC | #10
On Tue, 2020-12-15 at 11:12 -0800, Alexander Duyck wrote:
> On Mon, Dec 14, 2020 at 10:15 PM Saeed Mahameed <saeed@kernel.org>
> wrote:
> > On Mon, 2020-12-14 at 17:53 -0800, Alexander Duyck wrote:
> > > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> > > wrote:
> > > > Hi Dave, Jakub, Jason,
> > > > 
> > > > This series form Parav was the theme of this mlx5 release
> > > > cycle,
> > > > we've been waiting anxiously for the auxbus infrastructure to
> > > > make
> > > > it into
> > > > the kernel, and now as the auxbus is in and all the stars are
> > > > aligned, I
> > > > can finally submit this V2 of the devlink and mlx5 subfunction
> > > > support.
> > > > 
> > > > Subfunctions came to solve the scaling issue of virtualization
> > > > and switchdev environments, where SRIOV failed to deliver and
> > > > users
> > > > ran
> > > > out of VFs very quickly as SRIOV demands huge amount of
> > > > physical
> > > > resources
> > > > in both of the servers and the NIC.
> > > > 
> > > > Subfunction provide the same functionality as SRIOV but in a
> > > > very
> > > > lightweight manner, please see the thorough and detailed
> > > > documentation from Parav below, in the commit messages and the
> > > > Networking documentation patches at the end of this series.
> > > > 
> > > 
> > > Just to clarify a few things for myself. You mention
> > > virtualization
> > > and SR-IOV in your patch description but you cannot support
> > > direct
> > > assignment with this correct? The idea here is simply logical
> > > partitioning of an existing network interface, correct? So this
> > > isn't
> > > so much a solution for virtualization, but may work better for
> > > containers. I view this as an important distinction to make as
> > > the
> > 
> > at the current state yes, but the SF solution can be extended to
> > support direct assignment, so this is why i think SF solution can
> > do
> > better and eventually replace SRIOV.
> 
> My only real concern is that this and mediated devices are
> essentially
> the same thing. When you start making this work for direct-assignment
> the only real difference becomes the switchdev and devlink
> interfaces.

Not just devlink and switchdev; auxbus was also introduced to
standardize some of the interfaces.
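
Concretely, with this series an activated SF just shows up as an
auxiliary device that mlx5_core binds to; roughly (the instance number
below is only a placeholder):

$ ls /sys/bus/auxiliary/devices/
mlx5_core.sf.1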

> Basically this is netdev specific mdev versus the PCIe specific mdev.
> 

SF is not a netdev-specific mdev .. :/

> > also many customers are currently using SRIOV with containers to
> > get
> > the performance and isolation features since there was no other
> > options.
> 
> There were, but you hadn't implemented them. The fact is the approach
> Intel had taken for that was offloaded macvlan.
> 

Offloaded macvlan is just a macvlan with checksum/TSO and GRO.

macvlan can't provide RDMA, TC offloads, ethtool steering, PTP, vdpa
...
Our SF provides the same set of features a VF can provide.

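For reference, the Intel offloaded-macvlan flow being contrasted here
is roughly the following, assuming a NIC that exposes the
l2-fwd-offload feature (device names are placeholders):

$ ethtool -K eth0 l2-fwd-offload on
$ ip link add link eth0 name macvlan0 type macvlan mode bridge

The macvlan0 interface stays generic, so the offload underneath can be
enabled or disabled without the consumer of macvlan0 noticing.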

> I think the big thing we really should do if we are going to go this
> route is to look at standardizing what the flavours are that get
> created by the parent netdevice. Otherwise we are just creating the
> same mess we had with SRIOV all over again and muddying the waters of
> mediated devices.
> 

Yes, in the near future we will be working on auxbus interfaces for
auto-probing and user flavor selection; this is a must-have feature
for us.

> > > first thing that came to mind when I read this was mediated
> > > devices
> > > which is similar, but focused only on the virtualization case:
> > > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html
> > > 
> > > > Parav Pandit Says:
> > > > =================
> > > > 
> > > > This patchset introduces support for mlx5 subfunction (SF).
> > > > 
> > > > A subfunction is a lightweight function that has a parent PCI
> > > > function on
> > > > which it is deployed. mlx5 subfunction has its own function
> > > > capabilities
> > > > and its own resources. This means a subfunction has its own
> > > > dedicated
> > > > queues(txq, rxq, cq, eq). These queues are neither shared nor
> > > > stealed from
> > > > the parent PCI function.
> > > 
> > > Rather than calling this a subfunction, would it make more sense
> > > to
> > > call it something such as a queue set? It seems like this is
> > > exposing
> > > some of the same functionality we did in the Intel drivers such
> > > as
> > > ixgbe and i40e via the macvlan offload interface. However the
> > > ixgbe/i40e hardware was somewhat limited in that we were only
> > > able to
> > > expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and
> > > even
> > > with that we have seen some limitations to the interface. It
> > > sounds
> > > like you are able to break out RDMA capable devices this way as
> > > well.
> > > So in terms of ways to go I would argue this is likely better.
> > 
> > We've discussed this thoroughly on V0, the SF solutions is closer
> > to a
> > VF than a VMDQ, this is not just a set of queues.
> > 
> > https://lore.kernel.org/linux-rdma/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
> 
> VMDq is more than just a set of queues. The fact is it is a pool of
> resources that get created to handle the requests for a specific VM.
> The extra bits that are added here are essentially stuff that was
> required to support mediated devices.
> 

VMDq pools are managed by the driver and are only logically isolated
in the kernel. SFs have no shared pool for transport resources
(queues); SFs have their own isolated steering domains, processing
engines, and HW objects, exactly like a VF.

> > > However
> > > one downside is that we are going to end up seeing each
> > > subfunction
> > > being different from driver to driver and vendor to vendor which
> > > I
> > > would argue was also one of the problems with SR-IOV as you end
> > > up
> > > with a bit of vendor lock-in as a result of this feature since
> > > each
> > > vendor will be providing a different interface.
> > > 
> > 
> > I disagree, SFs are tightly coupled with switchdev model and
> > devlink
> > functions port, they are backed with the a well defined model, i
> > can
> > say the same about sriov with switchdev mode, this sort of vendor
> > lock-
> > in issues is eliminated when you migrate to switchdev mode.
> 
> What you are talking about is the backend. I am talking about what is
> exposed to the user. The user is going to see a Mellanox device
> having
> to be placed into their container in order to support this. One of
> the
> advantages of the Intel approach was that the macvlan interface was
> generic so you could have an offloaded interface or not and the user
> wouldn't necessarily know. The offload could be disabled and the user
> would be none the wiser as it is moved from one interface to another.
> I see that as a big thing that is missing in this solution.
> 

You are talking about basic netdev users. Sure, there are users who
would want a more generic netdev, so yes. But most of my customers are
not like that: they want vdpa/rdma and heavy netdev offloads such as
encap/decap/crypto and driver XDP in their containers, and the SF
approach will make more sense to them than SRIOV or VMDq.
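
Handing the SF netdev to the container is then just the usual netns
move, roughly (interface and netns names are placeholders):

$ ip netns add container1
$ ip link set dev p0sf88 netns container1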

> > > > When subfunction is RDMA capable, it has its own QP1, GID table
> > > > and
> > > > rdma
> > > > resources neither shared nor stealed from the parent PCI
> > > > function.
> > > > 
> > > > A subfunction has dedicated window in PCI BAR space that is not
> > > > shared
> > > > with ther other subfunctions or parent PCI function. This
> > > > ensures
> > > > that all
> > > > class devices of the subfunction accesses only assigned PCI BAR
> > > > space.
> > > > 
> > > > A Subfunction supports eswitch representation through which it
> > > > supports tc
> > > > offloads. User must configure eswitch to send/receive packets
> > > > from/to
> > > > subfunction port.
> > > > 
> > > > Subfunctions share PCI level resources such as PCI MSI-X IRQs
> > > > with
> > > > their other subfunctions and/or with its parent PCI function.
> > > 
> > > This piece to the architecture for this has me somewhat
> > > concerned. If
> > > all your resources are shared and you are allowing devices to be
> > 
> > not all, only PCI MSIX, for now..
> 
> They aren't shared after you partition them but they are coming from
> the same device. Basically you are subdividing the BAR2 in order to
> generate the subfunctions. BAR2 is a shared resource in my point of
> view.
> 

Sure, but it doesn't host any actual resources, only the
communication channel with the HW partition. So other than the BAR and
the MSI-X, the actual HW resources, steering pipelines, offloads and
queues are totally isolated and separated.


> > > created incrementally you either have to pre-partition the entire
> > > function which usually results in limited resources for your base
> > > setup, or free resources from existing interfaces and
> > > redistribute
> > > them as things change. I would be curious which approach you are
> > > taking here? So for example if you hit a certain threshold will
> > > you
> > > need to reset the port and rebalance the IRQs between the various
> > > functions?
> > > 
> > 
> > Currently SFs will use whatever IRQs the PF has pre-allocated for
> > itself, so there is no IRQ limit issue at the moment, we are
> > considering a dynamic IRQ pool with dynamic balancing, or even
> > better
> > us the IMS approach, which perfectly fits the SF architecture.
> > https://patchwork.kernel.org/project/linux-pci/cover/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/
> 
> When you say you are using the PF's interrupts are you just using
> that
> as a pool of resources or having the interrupt process interrupts for
> both the PF and SFs? Without IMS you are limited to 2048 interrupts.
> Moving over to that would make sense since SF is similar to mdev in
> the way it partitions up the device and resources.
> 

Yes, moving to IMS is at the top of our priorities.

> > for internal resources the are fully isolated (not shared) and
> > they are internally managed by FW exactly like a VF internal
> > resources.
> 
> I assume by isolated you mean they are contained within page aligned
> blocks like what was required for mdev?

I mean they are isolated and abstracted in the FW; we don't really
expose any resource directly in the BAR. The BAR is only used for
communicating with the device, so a VF and an SF work exactly the
same. The only difference is where they get their BAR and offsets
from; everything else is just similar.
David Ahern Dec. 15, 2020, 8:59 p.m. UTC | #11
On 12/14/20 10:48 PM, Parav Pandit wrote:
> 
>> From: Alexander Duyck <alexander.duyck@gmail.com>
>> Sent: Tuesday, December 15, 2020 7:24 AM
>>
>> On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
>> wrote:
>>>
>>> Hi Dave, Jakub, Jason,
>>>
>>
>> Just to clarify a few things for myself. You mention virtualization and SR-IOV
>> in your patch description but you cannot support direct assignment with this
>> correct? 
> Correct. it cannot be directly assigned.
> 
>> The idea here is simply logical partitioning of an existing network
>> interface, correct? 
> No. Idea is to spawn multiple functions from a single PCI device.
> These functions are not born in PCI device and in OS until they are created by user.
> Jason and Saeed explained this in great detail few weeks back in v0 version of the patchset at [1], [2] and [3].
> I better not repeat all of it here again. Please go through it.
> If you may want to read precursor to it, RFC from Jiri at [4] is also explains this in great detail.
> 
>> So this isn't so much a solution for virtualization, but may
>> work better for containers. I view this as an important distinction to make as
>> the first thing that came to mind when I read this was mediated devices
>> which is similar, but focused only on the virtualization case:
>> https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-
>> device.html
>>
> Managing subfunction using mediated device is already ruled out last year at [5] as it is the abuse of the mdev bus for this purpose + has severe limitations of managing the subfunction device.
> We are not going back to it anymore.
> It will be duplicating lot of the plumbing which exists in devlink, netlink, auxiliary bus and more.
>  
>> Rather than calling this a subfunction, would it make more sense to call it
>> something such as a queue set? 
> No, queue is just one way to send and receive data/packets.
> Jason and Saeed explained and discussed  this piece to you and others during v0 few weeks back at [1], [2], [3].
> Please take a look.
> 
>> So in terms of ways to go I would argue this is likely better. However one
>> downside is that we are going to end up seeing each subfunction being
>> different from driver to driver and vendor to vendor which I would argue
>> was also one of the problems with SR-IOV as you end up with a bit of vendor
>> lock-in as a result of this feature since each vendor will be providing a
>> different interface.
>>
> Each and several vendors provided unified interface for managing VFs. i.e.
> (a) enable/disable was via vendor neutral sysfs
> (b) sriov capability exposed via standard pci capability and sysfs
> (c) sriov vf config (mac, vlan, rss, tx rate, spoof check trust) are using vendor agnostic netlink
> Even though the driver's internal implementation largely differs on how trust, spoof, mac, vlan rate etc are enforced.
> 
> So subfunction feature/attribute/functionality will be implemented differently internally in the driver matching vendor's device, for reasonably abstract concept of 'subfunction'.
> 
>>> A Subfunction supports eswitch representation through which it
>>> supports tc offloads. User must configure eswitch to send/receive
>>> packets from/to subfunction port.
>>>
>>> Subfunctions share PCI level resources such as PCI MSI-X IRQs with
>>> their other subfunctions and/or with its parent PCI function.
>>
>> This piece to the architecture for this has me somewhat concerned. If all your
>> resources are shared and 
> All resources are not shared.
> 
>> you are allowing devices to be created
>> incrementally you either have to pre-partition the entire function which
>> usually results in limited resources for your base setup, or free resources
>> from existing interfaces and redistribute them as things change. I would be
>> curious which approach you are taking here? So for example if you hit a
>> certain threshold will you need to reset the port and rebalance the IRQs
>> between the various functions?
> No. Its works bit differently for mlx5 device.
> When base function is started, it started as if it doesn't have any subfunctions.
> When subfunction is instantiated, it spawns new resources in device (hw, fw, memory) depending on how much a function wants.
> 
> For example, PCI PF uses BAR 0, while subfunctions uses BAR 2.
> For IRQs, subfunction instance shares the IRQ with its parent/hosting PCI PF.
> In future, yes, a dedicated IRQs per SF is likely desired.
> Sridhar also talked about limiting number of queues to a subfunction.
> I believe there will be resources/attributes of the function to be controlled.
> devlink already provides rich interface to achieve that using devlink resources [8].
> 
> [..]
> 
>>> $ ip link show
>>> 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state
>> DOWN mode DEFAULT group default qlen 1000
>>>     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
>>>     altname enp6s0f0np0
>>> 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
>> mode DEFAULT group default qlen 1000
>>>     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff>
>>
>> I assume that p0sf88 is supposed to be the newly created subfunction.
>> However I thought the naming was supposed to be the same as what you are
>> referring to in the devlink, or did I miss something?
>>
> I believe you are confused with the representor netdevice of subfuction with devices of subfunction. (netdev, rdma, vdpa etc).
> I suggest that please refer to the diagram in patch_15 in [7] to see the stack, modules, objects.
> Hope below description clarifies a bit.
> There are two netdevices.
> (a) representor netdevice, attached to the devlink port of the eswitch
> (b) netdevice of the SF used by the end application (in your example, this is assigned to container).
>  
> Both netdevice follow obviously a different naming scheme.
> Representor netdevice follows naming scheme well defined in kernel + systemd/udev v245 and higher.
> It is based on phys_port_name sysfs attribute.
> This is same for existing PF and SF representors exist for year+ now. Further used by subfunction.
> 
> For subfunction netdevice (p0s88), system/udev will be extended. I put example based on my few lines of udev rule that reads
> phys_port_name and user supplied sfnum, so that user exactly knows which interface to assign to container.
> 
>>> After use inactivate the function:
>>> $ devlink port function set ens2f0npf0sf88 state inactive
>>>
>>> Now delete the subfunction port:
>>> $ devlink port del ens2f0npf0sf88
>>
>> This seems wrong to me as it breaks the symmetry with the port add
>> command and
> Example of the representor device is only to make life easier for the user.
> Devlink port del command works based on the devlink port index, just like existing devlink port commands (get,set,split,unsplit).
> I explained this in a thread with Sridhar at [6].
> In short devlink port del <bus/device_name/port_index command is just fine.
> Port index is unique handle for the devlink instance that user refers to delete, get, set port and port function attributes post its creation.
> I choose the representor netdev example because it is more intuitive to related to, but port index is equally fine and supported.
> 
>> assumes you have ownership of the interface in the host. I
>> would much prefer to to see the same arguments that were passed to the
>> add command being used to do the teardown as that would allow for the
>> parent function to create the object, assign it to a container namespace, and
>> not need to pull it back in order to destroy it.
> Parent function will not have same netdevice name as that of representor netdevice, because both devices exist in single system for large part of the use cases.
> So port delete command works on the port index.
> Host doesn't need to pull it back to destroy it. It is destroyed via port del command.
> 
> [1] https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/
> [2] https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
> [3] https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/
> [4] https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/
> [5] https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/
> [6] https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/
> [7] https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u
> [8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html
> 

This seems to be a repeated line of questions. You might want to add
these FAQs, responses and references to the subfunction document once
this set gets merged.
Jason Gunthorpe Dec. 15, 2020, 9:03 p.m. UTC | #12
On Tue, Dec 15, 2020 at 10:47:36AM -0800, Alexander Duyck wrote:

> > Jason and Saeed explained this in great detail few weeks back in v0 version of the patchset at [1], [2] and [3].
> > I better not repeat all of it here again. Please go through it.
> > If you may want to read precursor to it, RFC from Jiri at [4] is also explains this in great detail.
> 
> I think I have a pretty good idea of how the feature works. My concern
> is more the use of marketing speak versus actual functionality. The
> way this is being setup it sounds like it is useful for virtualization
> and it is not, at least in its current state. It may be at some point
> in the future but I worry that it is really going to muddy the waters
> as we end up with yet another way to partition devices.

If we do a virtualization version then it will take a SF and instead
of loading a mlx5_core on the SF aux device, we will load some
vfio_mdev_mlx5 driver which will convert the SF aux device into a
/dev/vfio/*

This is essentially the same as how you'd take a PCI VF and replace
mlx5_core with vfio-pci to get /dev/vfio/*. It has to be a special
mdev driver because it sits on the SF aux device, not on the VF PCI
device.
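
(For reference, the existing VF flow referred to here is roughly the
following, with 0000:06:00.2 as a placeholder VF address:

$ modprobe vfio-pci
$ echo 0000:06:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
$ echo vfio-pci > /sys/bus/pci/devices/0000:06:00.2/driver_override
$ echo 0000:06:00.2 > /sys/bus/pci/drivers_probe

after which the VF is reachable as /dev/vfio/<iommu group>.)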

The vfio_mdev_mlx5 driver will create what Intel calls an SIOV ADI
from the SF, in other words the SF is already a superset of what a
SIOV ADI should be.

This matches very nicely the driver model in Linux, and I don't think
it becomes more muddied as we go along. If anything it is becoming
more clear and sane as things progress.

> I agree with you on that. My thought was more the fact that the two
> can be easily confused. If we are going to do this we need to define
> that for networking devices perhaps that using the mdev interface
> would be deprecated and we would need to go through devlink. However
> before we do that we need to make sure we have this completely
> standardized.

mdev is for creating /dev/vfio/* interfaces in userspace. Using it for
anything else is a bad abuse of the driver model.

We had this debate endlessly already.

AFAIK, there is nothing to deprecate, there are no mdev_drivers in
drivers/net, and none should ever be added. The only mdev_driver that
should ever exists is in vfio_mdev.c

If someone is using a mdev_driver in drivers/net out of tree then they
will need to convert to an aux driver for in-tree.

> Yeah, I recall that. However I feel like it is being oversold. It
> isn't "SR-IOV done right" it seems more like "VMDq done better". The
> fact that interrupts are shared between the subfunctions is telling.

The interrupt sharing is a consequence of having an ADI-like model
without relying on IMS. Once IMS works, shared interrupts won't be
necessary; until then there is no choice but to share the MSI-X
table of the function.

> That is exactly how things work for Intel parts when they do VMDq as
> well. The queues are split up into pools and a block of queues belongs
> to a specific queue. From what I can can tell the only difference is
> that there is isolation of the pool into specific pages in the BAR.
> Which is essentially a requirement for mediated devices so that they
> can be direct assigned.

No, as I said to Jakub, mlx5 SFs have very little to do with
queues. There is no 'queue' HW element that needs partitioning.

The SF is a hardware security boundary that wraps every operation a
mlx5 device can do. This is why it is an ADI. It is not a crappy ADI
that relies on hypervisor emulation, it is the real thing, just like a
SRIOV VF. You stick it in the VM and the guest can directly talk to
the HW. The HW provides the security.

I can't put focus on this enough: A mlx5 SF can run a *full RDMA
stack*. This means the driver can create all the RDMA HW objects and
resources under the SF. This is *not* just steering some ethernet
traffic to a few different ethernet queues like VMDq is.

The Intel analog to a SF is a *full virtual function* on one of the
Intel iWarp capable NICs, not VMDq.

> Assuming at some point one of the flavours is a virtio-net style
> interface you could eventually get to the point of something similar
> to what seems to have been the goal of mdev which was meant to address
> these two points.

mlx5 already supports VDPA virtio-net on PF/VF and with this series SF
too.

i.e. you can take an SF, bind the vdpa_mlx5 driver, and get a fully HW
accelerated "ADI" that does virtio-net. This can be assigned to a
guest and shows up as a PCI virtio-net netdev. With VT-d, guest packet
tx/rx on this netdev never uses the hypervisor CPU.
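
Roughly, with the iproute2 'vdpa' tool (the SF management device name
below is a placeholder, and the exact tooling may still change):

$ vdpa mgmtdev show
$ vdpa dev add name vdpa0 mgmtdev auxiliary/mlx5_core.sf.1

The resulting vdpa device is then exposed to the guest through
vhost-vdpa or virtio-vdpa.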

> The point is that we should probably define some sort of standard
> and/or expectations on what should happen when you spawn a new
> interface. Would it be acceptable for the PF and existing subfunctions
> to have to reset if you need to rebalance the IRQ distribution, or
> should they not be disrupted when you spawn a new interface?

It is best to think of the SF as an ADI, so if you change something in
the PF and that causes the driver attached to the ADI in a VM to
reset, is that OK? I'd say no.

Jason
Jakub Kicinski Dec. 15, 2020, 9:28 p.m. UTC | #13
On Tue, 15 Dec 2020 12:35:20 -0800 Saeed Mahameed wrote:
> > I think the big thing we really should do if we are going to go this
> > route is to look at standardizing what the flavours are that get
> > created by the parent netdevice. Otherwise we are just creating the
> > same mess we had with SRIOV all over again and muddying the waters of
> > mediated devices.
> 
> yes in the near future we will be working on auxbus interfaces for
> auto-probing and user flavor selection, this is a must have feature for
> us.

Can you elaborate? I thought config would be via devlink.
Alexander Duyck Dec. 15, 2020, 9:41 p.m. UTC | #14
On Tue, Dec 15, 2020 at 12:35 PM Saeed Mahameed <saeed@kernel.org> wrote:
>
> On Tue, 2020-12-15 at 11:12 -0800, Alexander Duyck wrote:
> > On Mon, Dec 14, 2020 at 10:15 PM Saeed Mahameed <saeed@kernel.org>
> > wrote:
> > > On Mon, 2020-12-14 at 17:53 -0800, Alexander Duyck wrote:
> > > > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> > > > wrote:
> > > > > Hi Dave, Jakub, Jason,
> > > > >
> > > > > This series form Parav was the theme of this mlx5 release
> > > > > cycle,
> > > > > we've been waiting anxiously for the auxbus infrastructure to
> > > > > make
> > > > > it into
> > > > > the kernel, and now as the auxbus is in and all the stars are
> > > > > aligned, I
> > > > > can finally submit this V2 of the devlink and mlx5 subfunction
> > > > > support.
> > > > >
> > > > > Subfunctions came to solve the scaling issue of virtualization
> > > > > and switchdev environments, where SRIOV failed to deliver and
> > > > > users
> > > > > ran
> > > > > out of VFs very quickly as SRIOV demands huge amount of
> > > > > physical
> > > > > resources
> > > > > in both of the servers and the NIC.
> > > > >
> > > > > Subfunction provide the same functionality as SRIOV but in a
> > > > > very
> > > > > lightweight manner, please see the thorough and detailed
> > > > > documentation from Parav below, in the commit messages and the
> > > > > Networking documentation patches at the end of this series.
> > > > >
> > > >
> > > > Just to clarify a few things for myself. You mention
> > > > virtualization
> > > > and SR-IOV in your patch description but you cannot support
> > > > direct
> > > > assignment with this correct? The idea here is simply logical
> > > > partitioning of an existing network interface, correct? So this
> > > > isn't
> > > > so much a solution for virtualization, but may work better for
> > > > containers. I view this as an important distinction to make as
> > > > the
> > >
> > > at the current state yes, but the SF solution can be extended to
> > > support direct assignment, so this is why i think SF solution can
> > > do
> > > better and eventually replace SRIOV.
> >
> > My only real concern is that this and mediated devices are
> > essentially
> > the same thing. When you start making this work for direct-assignment
> > the only real difference becomes the switchdev and devlink
> > interfaces.
>
> not just devlink and switchdev, auxbus was also introduced to
> standardize some of the interfaces.

The auxbus is just there to make up for the fact that there isn't
another bus type for this though. I would imagine otherwise this would
be on some sort of platform bus.

> > Basically this is netdev specific mdev versus the PCIe specific mdev.
> >
>
> SF is not netdev specific mdev .. :/

I agree it is not. However there are just a few extensions to it. What
I would really like to see is a solid standardization of what this is.
Otherwise the comparison is going to be made, especially since a year
ago Mellanox was pushing this as an mdev-type interface. There is more
here than just mdev; however, my concern is that we may also be losing
some of the advantages of mdev.

It would be much easier for me to go along with this if we had more
than one vendor pushing it. My concern is that this is becoming
something that may end up being vendor specific.

> > > also many customers are currently using SRIOV with containers to
> > > get
> > > the performance and isolation features since there was no other
> > > options.
> >
> > There were, but you hadn't implemented them. The fact is the approach
> > Intel had taken for that was offloaded macvlan.
> >
>
> offloaded macvlan is just a macvlan with checksum/tso and gro.
>
> macvlan can't provide RDMA, TC offloads, ethtool steering, PTP, vdpa

Agreed. I have already acknowledged that macvlan couldn't meet the
needs for all use cases. However at the same time it provides a
consistent interface regardless of vendors.

If we decide to go with the vendor specific drivers for subfunctions
that is fine, however I see that going down the same path as SR-IOV
and ultimately ending in obscurity since I don't see many being
willing to adopt it.

> ...
> our SF provides the same set of features a VF can provide

That is all well and good. However, if we agree that SR-IOV wasn't done
right, saying that you are spinning up something that works just like
SR-IOV isn't all that appealing, is it?

> > I think the big thing we really should do if we are going to go this
> > route is to look at standardizing what the flavours are that get
> > created by the parent netdevice. Otherwise we are just creating the
> > same mess we had with SRIOV all over again and muddying the waters of
> > mediated devices.
> >
>
> yes in the near future we will be working on auxbus interfaces for
> auto-probing and user flavor selection, this is a must have feature for
> us.

I would take this one step further. If we are going to have flavours,
maybe we should have pre-defined, vendor-agnostic interfaces that can
represent the possible flavours: basically an Ethernet interface for
that case, an RDMA interface for that case, and so on.
It limits what functionality can be exposed, however it frees up the
containers and/or guests to work on whatever NIC you want as long as
it supports that interface.

> > > > first thing that came to mind when I read this was mediated
> > > > devices
> > > > which is similar, but focused only on the virtualization case:
> > > > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html
> > > >
> > > > > Parav Pandit Says:
> > > > > =================
> > > > >
> > > > > This patchset introduces support for mlx5 subfunction (SF).
> > > > >
> > > > > A subfunction is a lightweight function that has a parent PCI
> > > > > function on
> > > > > which it is deployed. mlx5 subfunction has its own function
> > > > > capabilities
> > > > > and its own resources. This means a subfunction has its own
> > > > > dedicated
> > > > > queues(txq, rxq, cq, eq). These queues are neither shared nor
> > > > > stealed from
> > > > > the parent PCI function.
> > > >
> > > > Rather than calling this a subfunction, would it make more sense
> > > > to
> > > > call it something such as a queue set? It seems like this is
> > > > exposing
> > > > some of the same functionality we did in the Intel drivers such
> > > > as
> > > > ixgbe and i40e via the macvlan offload interface. However the
> > > > ixgbe/i40e hardware was somewhat limited in that we were only
> > > > able to
> > > > expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and
> > > > even
> > > > with that we have seen some limitations to the interface. It
> > > > sounds
> > > > like you are able to break out RDMA capable devices this way as
> > > > well.
> > > > So in terms of ways to go I would argue this is likely better.
> > >
> > > We've discussed this thoroughly on V0, the SF solutions is closer
> > > to a
> > > VF than a VMDQ, this is not just a set of queues.
> > >
> > > https://lore.kernel.org/linux-rdma/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
> >
> > VMDq is more than just a set of queues. The fact is it is a pool of
> > resources that get created to handle the requests for a specific VM.
> > The extra bits that are added here are essentially stuff that was
> > required to support mediated devices.
> >
>
> VMDq pools are managed by the driver and only logically isolated in the
> kernel, SFs has no shared pool for transport resources (queues), SFs
> have their own isolated steering domains, processing engines, and HW
> objects, exactly like a VF.

You are describing your specific implementation. That may not apply
to others. What you are defining as the differences between VMDq and
SR-IOV is not the same for other vendors.

You are essentially arguing implementation semantics, if it is
configured by the driver or firmware it doesn't really make any
difference. Being fully isolated versus only logically isolated only
really matters in terms of direct assignment. In the grand scheme of
things the only real difference between SR-IOV and VMDq is the
spawning of the PCIe device with its own BAR to access the resources.
Isolating the queues to their own 4K bounded subset of a BAR is pretty
straightforward and I assume that and the firmware is what is giving
you most of your isolation in this case.

> > > > However
> > > > one downside is that we are going to end up seeing each
> > > > subfunction
> > > > being different from driver to driver and vendor to vendor which
> > > > I
> > > > would argue was also one of the problems with SR-IOV as you end
> > > > up
> > > > with a bit of vendor lock-in as a result of this feature since
> > > > each
> > > > vendor will be providing a different interface.
> > > >
> > >
> > > I disagree, SFs are tightly coupled with switchdev model and
> > > devlink
> > > functions port, they are backed with the a well defined model, i
> > > can
> > > say the same about sriov with switchdev mode, this sort of vendor
> > > lock-
> > > in issues is eliminated when you migrate to switchdev mode.
> >
> > What you are talking about is the backend. I am talking about what is
> > exposed to the user. The user is going to see a Mellanox device
> > having
> > to be placed into their container in order to support this. One of
> > the
> > advantages of the Intel approach was that the macvlan interface was
> > generic so you could have an offloaded interface or not and the user
> > wouldn't necessarily know. The offload could be disabled and the user
> > would be none the wiser as it is moved from one interface to another.
> > I see that as a big thing that is missing in this solution.
> >
>
> You are talking about the basic netdev users, Sure there are users who
> would want a more generic netdev, so yes. but most of my customers are
> not like that, they want vdpa/rdma and heavy netdev offload such as
> encap/decap/crypto and driver xdp in their containers, the SF approach
> will make more sense to them than sriov and VMDq.

I am talking about my perspective. From what I have seen, one-off
features that are only available from specific vendors are a pain to
deal with and difficult to enable when you have to support multiple
vendors within your ecosystem. What you end up going for is usually
the lowest common denominator because you ideally want to be able to
configure all your devices the same and have one recipe for setup.

I'm not saying you cannot enable those features. However at the same
time I am saying it would be nice to have a vendor neutral way of
dealing with those if we are going to support SF, ideally with some
sort of software fallback that may not perform as well but will at
least get us the same functionality.

I'm trying to remember which netdev conference it was. I referred to
this as a veth switchdev offload when something like this was first
brought up. The more I think about it now, the more it would make
sense to have something like that as a flavor. The way I view it, we
have a few different use cases floating around which will have
different needs. My thought is having a standardized interface that
could address those needs would be a good way to go for this as it
would force everyone to come together and define a standardized
feature set that all of the vendors would want to expose.

> > > > > When subfunction is RDMA capable, it has its own QP1, GID table
> > > > > and
> > > > > rdma
> > > > > resources neither shared nor stealed from the parent PCI
> > > > > function.
> > > > >
> > > > > A subfunction has dedicated window in PCI BAR space that is not
> > > > > shared
> > > > > with ther other subfunctions or parent PCI function. This
> > > > > ensures
> > > > > that all
> > > > > class devices of the subfunction accesses only assigned PCI BAR
> > > > > space.
> > > > >
> > > > > A Subfunction supports eswitch representation through which it
> > > > > supports tc
> > > > > offloads. User must configure eswitch to send/receive packets
> > > > > from/to
> > > > > subfunction port.
> > > > >
> > > > > Subfunctions share PCI level resources such as PCI MSI-X IRQs
> > > > > with
> > > > > their other subfunctions and/or with its parent PCI function.
> > > >
> > > > This piece to the architecture for this has me somewhat
> > > > concerned. If
> > > > all your resources are shared and you are allowing devices to be
> > >
> > > not all, only PCI MSIX, for now..
> >
> > They aren't shared after you partition them but they are coming from
> > the same device. Basically you are subdividing the BAR2 in order to
> > generate the subfunctions. BAR2 is a shared resource in my point of
> > view.
> >
>
> Sure, but it doesn't host any actual resources, only the communication
> channel with the HW partition, so other than the BAR and the msix the
> actual HW resources, steering pipelines, offloads and queues are totlay
> isolated and separated.

I understand what you are trying to say, however this is semantics
specific to the implementation. Ultimately you are having to share the
function.

> > > > created incrementally you either have to pre-partition the entire
> > > > function which usually results in limited resources for your base
> > > > setup, or free resources from existing interfaces and
> > > > redistribute
> > > > them as things change. I would be curious which approach you are
> > > > taking here? So for example if you hit a certain threshold will
> > > > you
> > > > need to reset the port and rebalance the IRQs between the various
> > > > functions?
> > > >
> > >
> > > Currently SFs will use whatever IRQs the PF has pre-allocated for
> > > itself, so there is no IRQ limit issue at the moment, we are
> > > considering a dynamic IRQ pool with dynamic balancing, or even
> > > better
> > > us the IMS approach, which perfectly fits the SF architecture.
> > > https://patchwork.kernel.org/project/linux-pci/cover/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/
> >
> > When you say you are using the PF's interrupts are you just using
> > that
> > as a pool of resources or having the interrupt process interrupts for
> > both the PF and SFs? Without IMS you are limited to 2048 interrupts.
> > Moving over to that would make sense since SF is similar to mdev in
> > the way it partitions up the device and resources.
> >
>
> Yes moving to IMS is on the top of our priorities.
>
> > > for internal resources the are fully isolated (not shared) and
> > > they are internally managed by FW exactly like a VF internal
> > > resources.
> >
> > I assume by isolated you mean they are contained within page aligned
> > blocks like what was required for mdev?
>
> I mean they are isolated and abstracted in the FW, we don't really
> expose any resource directly to the BAR. the BAR is only used for
> communicating with the device, so VF and SF will work exactly the same
> the only difference is where they get their BAR  and offsets from,
> everything else is just similar.

I think where you and I differ is our use of the term "resource". I
would consider the address space a "resource", while you argue that
the resources are hidden behind the BAR.

I agree with you that the firmware should be managing most of the
resources in the device. So it isn't surprising that it would be
splitting them up and then doling out pieces as needed to put
together a subfunction.
Jason Gunthorpe Dec. 16, 2020, 12:19 a.m. UTC | #15
On Tue, Dec 15, 2020 at 01:41:04PM -0800, Alexander Duyck wrote:

> > not just devlink and switchdev, auxbus was also introduced to
> > standardize some of the interfaces.
> 
> The auxbus is just there to make up for the fact that there isn't
> another bus type for this though. I would imagine otherwise this would
> be on some sort of platform bus.

Please lets not start this again. This was gone over with Greg for
literally a year and a half and he explicitly NAK'd platform bus for
this purpose.

Aux bus exists to connect different kernel subsystems that touch the
same HW block together. Here we have the mlx5_core subsystem, vdpa,
rdma, and netdev all being linked together using auxbus.

It is kind of like what MFD does, but again, using MFD for this was
also NAK'd by Greg.

At the very worst we might some day find enough common
stuff between ADIs that we get an ADI bus, but I'm not
optimistic. So far it looks like there is no commonality.

Aux bus has at least 4 users already in various stages of submission,
and many other target areas that should be replaced by it.

> I would really like to see is a solid standardization of what this is.
> Otherwise the comparison is going to be made. Especially since a year
> ago Mellanox was pushing this as an mdev type interface. 

mdev was NAK'd too.

mdev is only for creating /dev/vfio/*.

> That is all well and good. However if we agree that SR-IOV wasn't done
> right saying that you are spinning up something that works just like
> SR-IOV isn't all that appealing, is it?

Fitting into some universal least-common-denominator was never a goal
for SR-IOV, so I wouldn't agree it was done wrong. 

> I am talking about my perspective. From what I have seen, one-off
> features that are only available from specific vendors are a pain to
> deal with and difficult to enable when you have to support multiple
> vendors within your ecosystem. What you end up going for is usually
> the lowest common denominator because you ideally want to be able to
> configure all your devices the same and have one recipe for setup.

So encourage other vendors to support the switchdev model for managing
VFs and ADIs!

> I'm not saying you cannot enable those features. However at the same
> time I am saying it would be nice to have a vendor neutral way of
> dealing with those if we are going to support SF, ideally with some
> sort of software fallback that may not perform as well but will at
> least get us the same functionality.

Is it really true there is no way to create a software device on a
switchdev today? I looked for a while and couldn't find
anything. openvswitch can do this, so it does seem like a gap, but
this has nothing to do with this series.

A software switchdev path should still end up with the representor and
user facing netdev, and the behavior of the two netdevs should be
identical to the VF switchdev flow we already have today.

SF doesn't change any of this, it just shines a light that, yes,
people actually have been using VFs with netdevs in containers and
switchdev, as part of their operations.

FWIW, I view this as a positive because it shows the switchdev model
is working very well and seeing adoption beyond the original idea of
controlling VMs with SRIOV.

> I'm trying to remember which netdev conference it was. I referred to
> this as a veth switchdev offload when something like this was first
> brought up. 

Sure, though I think the way you'd create such a thing might be
different. These APIs are really about creating an ADI that might be
assigned to a VM and never have a netdev.

It would be nonsense to create a veth-switchdev thing without a
netdev, and there have been various past attempts already NAK'd to
transform a netdev into an ADI.

Anyhow, if such a thing exists someday it could make sense to
automatically substitute the HW version using a SF, if available.

> could address those needs would be a good way to go for this as it
> would force everyone to come together and define a standardized
> feature set that all of the vendors would want to expose.

I would say switchdev is already the standard feature set.
 
Jason
Edwin Peer Dec. 16, 2020, 1:12 a.m. UTC | #16
On Tue, Dec 15, 2020 at 10:49 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:

> It isn't "SR-IOV done right" it seems more like "VMDq done better".

I don't think I agree with that assertion. The fact that VMDq can talk
to a common driver still makes VMDq preferable in some respects. Thus,
subfunctions do appear to be more of a better SR-IOV than a better
VMDq, but I'm similarly not sold on whether a better SR-IOV is
sufficient benefit to warrant the additional complexity this
introduces. If I understand correctly, subfunctions buy two things:

1) More than 256 SFs are possible: Maybe it's about time PCI-SIG
addresses this limit for VFs? If that were the only problem with VFs,
then fixing it once there would be cleaner. The devlink interface for
configuring a SF is certainly more sexy than legacy SR-IOV, but it
shouldn't be fundamentally impossible to zhuzh up VFs either. One can
also imagine possibilities around remapping multiple PFs (and their
VFs) in a clever way to get around the limited number of PCI resources
exposed to the host.

2) More flexible division of resources: It's not clear that device
> firmware can't perform smarter allocation than N/<num VFs>, but
subfunctions also appear to allow sharing of certain resources by the
PF driver, if desirable. To the extent that resources are shared, how
are workloads isolated from each other?

I'm not sure I like the idea of having to support another resource
allocation model in our driver just to support this, at least not
without a clearer understanding of what is being gained.

Like you, I would also prefer a more common infrastructure for
exposing something based on VirtIO/VMDq as the container/VM facing
netdevs. Is the lowest common denominator that a VMDq based interface
would constrain things to really unsuitable for container use cases?
Is the added complexity and confusion around VF vs SF vs VMDq really
warranted? I also don't see how this tackles container/VF portability,
migration of workloads, kernel network stack bypass, or any of the
other legacy limitations regarding SR-IOV VFs when we have vendor
specific aux bus drivers talking directly to vendor specific backend
hardware resources. In this regard, don't subfunctions, by definition,
have most of the same limitations as SR-IOV VFs?

Regards,
Edwin Peer
Alexander Duyck Dec. 16, 2020, 2:19 a.m. UTC | #17
On Tue, Dec 15, 2020 at 4:20 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Dec 15, 2020 at 01:41:04PM -0800, Alexander Duyck wrote:
>
> > > not just devlink and switchdev, auxbus was also introduced to
> > > standardize some of the interfaces.
> >
> > The auxbus is just there to make up for the fact that there isn't
> > another bus type for this though. I would imagine otherwise this would
> > be on some sort of platform bus.
>
> Please let's not start this again. This was gone over with Greg for
> literally a year and a half and he explicitly NAK'd platform bus for
> this purpose.
>
> Aux bus exists to connect different kernel subsystems that touch the
> same HW block together. Here we have the mlx5_core subsystem, vdpa,
> rdma, and netdev all being linked together using auxbus.
>
> It is kind of like what MFD does, but again, using MFD for this was
> also NAK'd by Greg.

Sorry, I wasn't saying we needed to use MFD or platform bus. I'm aware
of the discussions; I was just saying there are some parallels, not
that we needed to go through and use that.

> At the very worst we might sometime find out there is some common
> stuff between ADIs that we might get an ADI bus, but I'm not
> optimistic. So far it looks like there is no commonality.
>
> Aux bus has at least 4 users already in various stages of submission,
> and many other target areas that should be replaced by it.

Aux bus is fine and I am happy with it. I was just saying that auxbus
isn't something that should really be used to say that this is
significantly different from mdev as they both rely on a bus topology.

> > I would really like to see a solid standardization of what this is.
> > Otherwise the comparison is going to be made. Especially since a year
> > ago Mellanox was pushing this as an mdev type interface.
>
> mdev was NAK'd too.
>
> mdev is only for creating /dev/vfio/*.

Agreed. However my worry is that as we start looking to make this
support virtualization it will still end up swinging more toward mdev.
I would much prefer to make certain that any mdev is a flavor and
doesn't end up coloring the entire interface.

> > That is all well and good. However if we agree that SR-IOV wasn't done
> > right saying that you are spinning up something that works just like
> > SR-IOV isn't all that appealing, is it?
>
> Fitting into some universal least-common-denominator was never a goal
> for SR-IOV, so I wouldn't agree it was done wrong.

It isn't so much about right or wrong but the use cases. My experience
has been that SR-IOV ends up being used for very niche use cases where
you are direct assigning it into either DPDK or some NFV VM and you
are essentially building the application around the NIC. It is all
well and good, but for general virtualization it never really caught
on.

My thought is that if we are going to do this we need to do this in a
way that improves the usability, otherwise we are looking at more
niche use cases.

> > I am talking about my perspective. From what I have seen, one-off
> > features that are only available from specific vendors are a pain to
> > deal with and difficult to enable when you have to support multiple
> > vendors within your ecosystem. What you end up going for is usually
> > the lowest common denominator because you ideally want to be able to
> > configure all your devices the same and have one recipe for setup.
>
> So encourage other vendors to support the switchdev model for managing
> VFs and ADIs!

Ugh, don't get me started on switchdev. The biggest issue as I see it
with switchdev is that you have to have a true switch in order to
really be able to use it. As such dumbed down hardware like the ixgbe
for instance cannot use it since it defaults to outputting anything
that doesn't have an existing rule to the external port. If we could
tweak the design to allow for more dumbed down hardware it would
probably be much easier to get wider adoption.

Honestly, the switchdev interface isn't what I was talking about. I
was talking about the SF interface, not the switchdev side of it. In
my mind you can place your complexity into the switchdev side of the
interface, but keep the SF interface simple. Then you can back it with
whatever you want, but without having to have a vendor specific
version of the interface being plugged into the guest or container.

One of the reasons why virtio-net is being pushed as a common
interface for vendors is for this reason. It is an interface that can
be emulated by software or hardware and it allows the guest to run on
any arbitrary hardware.

> > I'm not saying you cannot enable those features. However at the same
> > time I am saying it would be nice to have a vendor neutral way of
> > dealing with those if we are going to support SF, ideally with some
> > sort of software fallback that may not perform as well but will at
> > least get us the same functionality.
>
> Is it really true there is no way to create a software device on a
> switchdev today? I looked for a while and couldn't find
> anything. openvswitch can do this, so it does seem like a gap, but
> this has nothing to do with this series.

It has plenty to do with this series. This topic has been under
discussion since something like 2017 when Mellanox first brought it up
at Netdev 2.1. At the time I told them they should implement this as a
veth offload. Then it becomes obvious what the fallback becomes as you
can place packets into one end of a veth and it comes out the other,
just like a switchdev representor and the SF in this case. It would
make much more sense to do it this way rather than setting up yet
another vendor proprietary interface pair.

> A software switchdev path should still end up with the representor and
> user facing netdev, and the behavior of the two netdevs should be
> identical to the VF switchdev flow we already have today.
>
> SF doesn't change any of this, it just shines a light that, yes,
> people actually have been using VFs with netdevs in containers and
> switchdev, as part of their operations.
>
> FWIW, I view this as a positive because it shows the switchdev model
> is working very well and seeing adoption beyond the original idea of
> controlling VMs with SRIOV.

PF/VF isolation is a given. So the existing switchdev behavior is fine
in that regard. I wouldn't expect that to change. The fact is we
actually support something similar in the macvlan approach we put in
the Intel drivers, since the macvlan itself provided an option for
isolation or to hairpin the traffic if you wanted to allow the VMDq
instances to be bypassed. That was another thing I view as a huge
feature: broadcast/multicast traffic can really get ugly when the
devices are all separate pipelines, and being able to switch that off
and just do software replication and hairpinning can be extremely
useful.

> > I'm trying to remember which netdev conference it was. I referred to
> > this as a veth switchdev offload when something like this was first
> > brought up.
>
> Sure, though I think the way you'd create such a thing might be
> different. These APIs are really about creating an ADI that might be
> assigned to a VM and never have a netdev.
>
> It would be nonsense to create a veth-switchdev thing without a
> netdev, and there have been various past attempts already NAK'd to
> transform a netdev into an ADI.
>
> Anyhow, if such a thing exists someday it could make sense to
> automatically substitute the HW version using a SF, if available.

The main problem as I see it is the fact that the SF interface is
bound too tightly to the hardware. The direct pass-thru with no
hairpin is always an option, but if we are going to have an interface
where both ends are in the same system, there are many cases where
always pushing all the packets off to the hardware doesn't necessarily
make sense.

> > could address those needs would be a good way to go for this as it
> > would force everyone to come together and define a standardized
> > feature set that all of the vendors would want to expose.
>
> I would say switchdev is already the standard feature set.

Yes, it is a standard feature set for the control plane. However for
the data-path it is somewhat limited as I feel it only describes what
goes through the switch. Not the interfaces that are exposed as the
endpoints. It is the problem of that last bit and how it is handled
that can make things ugly. For example the multicast/broadcast
replication problem that just occurred to me while writing this up.
The fact is for east-west traffic there has always been a problem with
the switchdev model as it limits everything to PCIe/DMA so there are
cases where software switches can outperform the hardware ones.
Jason Gunthorpe Dec. 16, 2020, 2:39 a.m. UTC | #18
On Tue, Dec 15, 2020 at 05:12:33PM -0800, Edwin Peer wrote:

> 1) More than 256 SFs are possible: Maybe it's about time PCI-SIG
> addresses this limit for VFs? 

They can't, the Bus/Device/Function is limited by protocol and
changing that would upend the entire PCI world.

Instead PCI-SIG said PASID is the way forward.

> If that were the only problem with VFs, then fixing it once there
> would be cleaner. 

Maybe, but half the problem with VFs is how HW expensive they are. The
mlx5 SF version is not such a good example, but Intel has shown in
other recent patches, like for their idxd, that the HW side of an ADI
can be very simple and hypervisor emulation can build a simple HW
capability into a full ADI for assignment to a guest.

A lot of the trappings that PCI-SIG requires to be implemented in HW
for a VF, like PCI config space, MSI tables, BAR space, etc., are all
just dead weight when scaling up to 1000's of VFs.

The ADI scheme is not that bad, the very simplest HW is just a queue
that can have all DMA contained by a PASID and can trigger an
addr/data interrupt message. Much less HW costly than a SRIOV VF.

Regardless, Intel kicked this path off years ago when they published
their SIOV cookbook and everyone started integrating PASID support
into their IOMMUs and working on ADIs. The mlx5 SFs are kind of early
because the HW is flexible enough to avoid the parts of SIOV that are
not ready or widely deployed yet, like IMS and PASID.

> Like you, I would also prefer a more common infrastructure for
> exposing something based on VirtIO/VMDq as the container/VM facing
> netdevs. 

A major point is to get switchdev.

> I also don't see how this tackles container/VF portability,
> migration of workloads, kernel network stack bypass, or any of the
> other legacy limitations regarding SR-IOV VFs

It isn't meant to. SF/ADI are just a way to have more VFs than PCI-SIG
can support.

Jason
Jason Gunthorpe Dec. 16, 2020, 3:03 a.m. UTC | #19
On Tue, Dec 15, 2020 at 06:19:18PM -0800, Alexander Duyck wrote:

> > > I would really like to see a solid standardization of what this is.
> > > Otherwise the comparison is going to be made. Especially since a year
> > > ago Mellanox was pushing this as an mdev type interface.
> >
> > mdev was NAK'd too.
> >
> > mdev is only for creating /dev/vfio/*.
> 
> Agreed. However my worry is that as we start looking to make this
> support virtualization it will still end up swinging more toward
> mdev.

Of course. mdev is also the only way to create a /dev/vfio/* :)

So all paths that want to use vfio must end up creating a mdev.

Here we would choose to create the mdev on top of the SF aux device.
There isn't really anything mlx5 specific about that decision. 

The SF models the vendor specific ADI in the driver model.

> It isn't so much about right or wrong but the use cases. My experience
> has been that SR-IOV ends up being used for very niche use cases where
> you are direct assigning it into either DPDK or some NFV VM and you
> are essentially building the application around the NIC. It is all
> well and good, but for general virtualization it never really caught
> on.

Sure
 
> > So encourage other vendors to support the switchdev model for managing
> > VFs and ADIs!
> 
> Ugh, don't get me started on switchdev. The biggest issue as I see it
> > with switchdev is that you have to have a true switch in order to
> really be able to use it. 

That cuts both ways, suggesting HW with a true switch model itself
with VMDq is equally problematic.

> As such dumbed down hardware like the ixgbe for instance cannot use
> it since it defaults to outputting anything that doesn't have an
> existing rule to the external port. If we could tweak the design to
> allow for more dumbed down hardware it would probably be much easier
> to get wider adoption.

I'd agree with this

> interface, but keep the SF interface simple. Then you can back it with
> whatever you want, but without having to have a vendor specific
> version of the interface being plugged into the guest or container.

The entire point *is* to create the vendor version because that serves
the niche cases where SRIOV assignment is already being used.

Having a general solution that can't do vendor SRIOV is useful for
other applications, but doesn't eliminate the need for the SRIOV case.

> One of the reasons why virtio-net is being pushed as a common
> interface for vendors is for this reason. It is an interface that can
> be emulated by software or hardware and it allows the guest to run on
> any arbitrary hardware.

Yes, and there is mlx5_vdpa to support this use case, and it binds to
the SF. Of course all of that is vendor specific too; the driver to
convert HW specific register programming into a virtio-net ADI has to
live *somewhere*

> It has plenty to do with this series. This topic has been under
> discussion since something like 2017 when Mellanox first brought it up
> at Netdev 2.1. At the time I told them they should implement this as a
> veth offload. 

veth doesn't give an ADI, it is useless for these niche cases.

veth offload might be interesting for some container case, but feels
like writing an enormous amount of code to accomplish nothing new...

> Then it becomes obvious what the fallback becomes as you can place
> packets into one end of a veth and it comes out the other, just like
> a switchdev representor and the SF in this case. It would make much
> more sense to do it this way rather than setting up yet another
> vendor proprietary interface pair.

I agree it makes sense to have an all SW veth-like option, but I
wouldn't try to make that the entry point for all the HW
acceleration or to serve the niche SRIOV use cases, or to represent an
ADI.

It just can't do that and it would make a huge mess if you tried to
force it. Didn't Intel already try this once with trying to use the
macvlan netdev and its queue offload to build an ADI?

> > Anyhow, if such a thing exists someday it could make sense to
> > automatically substitute the HW version using a SF, if available.
> 
> The main problem as I see it is the fact that the SF interface is
> bound too tightly to the hardware. 

That is the goal here. This is not about creating just a netdev, this is
about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev.

The SF has to support all of that completely. Focusing only on the
one use case of netdevs in containers misses the bigger picture. 

Yes, lots of this stuff is niche, but niche stuff needs to be
supported too.

> Yes, it is a standard feature set for the control plane. However for
> the data-path it is somewhat limited as I feel it only describes what
> goes through the switch.

Sure, I think that is its main point.

> Not the interfaces that are exposed as the endpoints. 

It came from modeling physical HW, so the endpoints are 'physical'
things like actual HW switch ports, or SRIOV VFs, ADI, etc.

> It is the problem of that last bit and how it is handled that can
> make things ugly. For example the multicast/broadcast replication
> problem that just occurred to me while writing this up.  The fact is
> for east-west traffic there has always been a problem with the
> switchdev model as it limits everything to PCIe/DMA so there are
> cases where software switches can outperform the hardware ones.

Yes, but, mixing CPU and DMA in the same packet delivery scheme is
very complicated :)

Jason
Alexander Duyck Dec. 16, 2020, 3:12 a.m. UTC | #20
On Tue, Dec 15, 2020 at 5:13 PM Edwin Peer <edwin.peer@broadcom.com> wrote:
>
> On Tue, Dec 15, 2020 at 10:49 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>
> > It isn't "SR-IOV done right" it seems more like "VMDq done better".
>
> I don't think I agree with that assertion. The fact that VMDq can talk
> to a common driver still makes VMDq preferable in some respects. Thus,
> subfunctions do appear to be more of a better SR-IOV than a better
> VMDq, but I'm similarly not sold on whether a better SR-IOV is
> sufficient benefit to warrant the additional complexity this
> introduces. If I understand correctly, subfunctions buy two things:
>
> 1) More than 256 SFs are possible: Maybe it's about time PCI-SIG
> addresses this limit for VFs? If that were the only problem with VFs,
> then fixing it once there would be cleaner. The devlink interface for
> configuring a SF is certainly more sexy than legacy SR-IOV, but it
> shouldn't be fundamentally impossible to zhuzh up VFs either. One can
> also imagine possibilities around remapping multiple PFs (and their
> VFs) in a clever way to get around the limited number of PCI resources
> exposed to the host.

The fact is SR-IOV just wasn't designed to scale well. I think we are
probably going to see most vendors move away from it.

The fact is what we are talking about now is the future of all this
and how to implement Scalable I/O Virtualization
(https://software.intel.com/content/www/us/en/develop/download/intel-scalable-io-virtualization-technical-specification.html).
The document is a good primer to many of the features we are
discussing as it describes how to compose a device.

The problem, as it was with SR-IOV, is that the S-IOV specification is
very PCIe centric and doesn't do a good job explaining how to deal
with the network as it relates to all this. Then, to complicate things,
S-IOV expected this to be used with direct-assigned devices for
guests/applications, and instead we are talking about using the devices
in the host, which makes things a bit messier.

> 2) More flexible division of resources: It's not clear that device
> firmware can't perform smarter allocation than N/<num VFs>, but
> subfunctions also appear to allow sharing of certain resources by the
> PF driver, if desirable. To the extent that resources are shared, how
> are workloads isolated from each other?
>
> I'm not sure I like the idea of having to support another resource
> allocation model in our driver just to support this, at least not
> without a clearer understanding of what is being gained.

I view this as the future alternative to SR-IOV. It is just a matter
of how we define it. Eventually we probably would be dropping the
SR-IOV implementation and moving over to S-IOV instead. As such, if
this is done right, I don't see this as a thing where we need to
support both. Really we should be able to drop support for one if we
have the other.
Alexander Duyck Dec. 16, 2020, 4:13 a.m. UTC | #21
On Tue, Dec 15, 2020 at 7:04 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Dec 15, 2020 at 06:19:18PM -0800, Alexander Duyck wrote:
>
> > > > I would really like to see a solid standardization of what this is.
> > > > Otherwise the comparison is going to be made. Especially since a year
> > > > ago Mellanox was pushing this as an mdev type interface.
> > >
> > > mdev was NAK'd too.
> > >
> > > mdev is only for creating /dev/vfio/*.
> >
> > Agreed. However my worry is that as we start looking to make this
> > support virtualization it will still end up swinging more toward
> > mdev.
>
> Of course. mdev is also the only way to create a /dev/vfio/* :)
>
> So all paths that want to use vfio must end up creating a mdev.
>
> Here we would choose to create the mdev on top of the SF aux device.
> There isn't really anything mlx5 specific about that decision.
>
> The SF models the vendor specific ADI in the driver model.
>
> > It isn't so much about right or wrong but the use cases. My experience
> > has been that SR-IOV ends up being used for very niche use cases where
> > you are direct assigning it into either DPDK or some NFV VM and you
> > are essentially building the application around the NIC. It is all
> > well and good, but for general virtualization it never really caught
> > on.
>
> Sure
>
> > > So encourage other vendors to support the switchdev model for managing
> > > VFs and ADIs!
> >
> > Ugh, don't get me started on switchdev. The biggest issue as I see it
> > with switchdev is that you have to have a true switch in order to
> > really be able to use it.
>
> That cuts both ways, suggesting HW with a true switch model itself
> with VMDq is equally problematic.

Yes and no. For example the macvlan offload I had setup could be
configured both ways and it made use of VMDq. I'm not necessarily
arguing that we need to do VMDq here, however at the same time saying
that this is only meant to replace SR-IOV becomes problematic since we
already have SR-IOV so why replace it with something that has many of
the same limitations?

> > As such dumbed down hardware like the ixgbe for instance cannot use
> > it since it defaults to outputting anything that doesn't have an
> > existing rule to the external port. If we could tweak the design to
> > allow for more dumbed down hardware it would probably be much easier
> > to get wider adoption.
>
> I'd agree with this
>
> > interface, but keep the SF interface simple. Then you can back it with
> > whatever you want, but without having to have a vendor specific
> > version of the interface being plugged into the guest or container.
>
> The entire point *is* to create the vendor version because that serves
> the niche cases where SRIOV assignment is already being used.
>
> Having a general solution that can't do vendor SRIOV is useful for
> other applications, but doesn't eliminate the need for the SRIOV case.

So part of the problem here is we already have SR-IOV. So we don't
need to repeat the mistakes. Rather, we need to have a solution to the
existing problems and then we can look at eliminating it.

That said I understand your argument, however I view the elimination
of SR-IOV to be something we do after we get this interface right and
can justify doing so. I don't have a problem necessarily with vendor
specific instances, unless we are only able to get vendor specific
instances. Thus I would prefer that we have a solution in place before
we allow the switch over.

> > One of the reasons why virtio-net is being pushed as a common
> > interface for vendors is for this reason. It is an interface that can
> > be emulated by software or hardware and it allows the guest to run on
> > any arbitrary hardware.
>
> Yes, and there is mlx5_vdpa to support this use case, and it binds to
> the SF. Of course all of that is vendor specific too; the driver to
> convert HW specific register programming into a virtio-net ADI has to
> live *somewhere*

Right, but this is more the model I am in favor of. The backend is
hidden from the guest and lives somewhere on the host.

Also it might be useful to call out the flavours and planned flavours
in the cover page. Admittedly the description is somewhat lacking in
that regard.

> > It has plenty to do with this series. This topic has been under
> > discussion since something like 2017 when Mellanox first brought it up
> > at Netdev 2.1. At the time I told them they should implement this as a
> > veth offload.
>
> veth doesn't give an ADI, it is useless for these niche cases.
>
> veth offload might be interesting for some container case, but feels
> like writing an enormous amount of code to accomplish nothing new...

My concern is if we are going to start partitioning up a PF on the
host we might as well make the best use of it. I would argue that it
would make more sense to have some standardized mechanism in place for
the PF to communicate and interact with the SFs. I would argue that is
one of the reasons why this keeps being compared to either VMDq or VMQ
as it is something that SR-IOV has yet to fully replace and has many
features that would be useful in an interface that is a subpartition
of an existing interface.

> > Then it becomes obvious what the fallback becomes as you can place
> > packets into one end of a veth and it comes out the other, just like
> > a switchdev representor and the SF in this case. It would make much
> > more sense to do it this way rather than setting up yet another
> > vendor proprietary interface pair.
>
> I agree it makes sense to have an all SW veth-like option, but I
> wouldn't try to make that the entry point for all the HW
> acceleration or to serve the niche SRIOV use cases, or to represent an
> ADI.
>
> It just can't do that and it would make a huge mess if you tried to
> force it. Didn't Intel already try this once with trying to use the
> macvlan netdev and its queue offload to build an ADI?

The Intel drivers still have the macvlan as the assignable ADI and
make use of VMDq to enable it. Actually I would consider it an example
of the kind of thing I am talking about. It is capable of doing
software switching between interfaces, broadcast/multicast replication
in software, and makes use of the hardware interfaces to allow for
receiving directly from the driver into the macvlan interface.
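
For reference, the in-tree hook behind that is the L2 forwarding
offload: the macvlan core hands the lower device its netdev and gets
back an opaque accelerator handle (a VMDq queue set in the Intel case).
A rough sketch of the lower-driver side, with made-up demo_* names and
the actual queue/filter programming left as comments:

#include <linux/err.h>
#include <linux/netdevice.h>
#include <linux/slab.h>

/* Per-macvlan accelerator state a real driver would allocate, e.g. a
 * dedicated VMDq queue pair bound to the macvlan's MAC address.
 */
struct demo_fwd_station {
	struct net_device *macvlan;
	int queue_index;
};

static void *demo_dfwd_add_station(struct net_device *lower,
				   struct net_device *macvlan)
{
	struct demo_fwd_station *st;

	st = kzalloc(sizeof(*st), GFP_KERNEL);
	if (!st)
		return ERR_PTR(-ENOMEM);

	st->macvlan = macvlan;
	st->queue_index = 0;	/* a real driver carves out a VMDq queue set
				 * and programs a MAC filter for
				 * macvlan->dev_addr to steer into it
				 */
	return st;
}

static void demo_dfwd_del_station(struct net_device *lower, void *priv)
{
	kfree(priv);	/* plus releasing the queues/filters */
}

static const struct net_device_ops demo_netdev_ops = {
	/* ndo_open, ndo_start_xmit, etc. elided */
	.ndo_dfwd_add_station	= demo_dfwd_add_station,
	.ndo_dfwd_del_station	= demo_dfwd_del_station,
};
/* The lower netdev also advertises NETIF_F_HW_L2FW_DOFFLOAD so that the
 * macvlan core offloads onto it.
 */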

The limitation as I see it is that the macvlan interface doesn't allow
for much in the way of custom offloads and the Intel hardware doesn't
support switchdev. As such it is good for a basic interface, but
doesn't really do well in terms of supporting advanced vendor-specific
features.

> > > Anyhow, if such a thing exists someday it could make sense to
> > > automatically substitute the HW version using a SF, if available.
> >
> > The main problem as I see it is the fact that the SF interface is
> > bound too tightly to the hardware.
>
> That is the goal here. This is not about creating just a netdev, this is
> about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev.

One issue is right now we are only seeing the rdma and netdev. It is
kind of backwards as it is using the ADIs on the host when this was
really meant to be used for things like mdev.

> The SF has to support all of that completely. Focusing only on the
> one use case of netdevs in containers misses the bigger picture.
>
> Yes, lots of this stuff is niche, but niche stuff needs to be
> supported too.

I have no problem with niche stuff, however we need to address the
basics before we move on to the niche stuff.

> > Yes, it is a standard feature set for the control plane. However for
> > the data-path it is somewhat limited as I feel it only describes what
> > goes through the switch.
>
> Sure, I think that is its main point.
>
> > Not the interfaces that are exposed as the endpoints.
>
> It came from modeling physical HW, so the endpoints are 'physical'
> things like actual HW switch ports, or SRIOV VFs, ADI, etc.

The problem is the "physical things" such as the SRIOV VFs and ADI
aren't really defined in the specification and are left up to the
implementers interpretation. These specs have always been fuzzy since
they are essentially PCI specifications and don't explain anything
about how the network on such a device should be configured or
expected to work. The swtichdev API puts some restrictions in place
but there still ends up being parts without any definition.

> > It is the problem of that last bit and how it is handled that can
> > make things ugly. For example the multicast/broadcast replication
> > problem that just occurred to me while writing this up.  The fact is
> > for east-west traffic there has always been a problem with the
> > switchdev model as it limits everything to PCIe/DMA so there are
> > cases where software switches can outperform the hardware ones.
>
> Yes, but, mixing CPU and DMA in the same packet delivery scheme is
> very complicated :)

I'm not necessarily saying we need to mix the two. However there are
cases such as multicast/broadcast where it would make much more sense
to avoid the duplication of packets and instead simply send one copy
and have it replicated by the software.

What would probably make sense for now would be to look at splitting
the netdev into two pieces: a frontend, which would provide the
netdev and be a common driver for subfunction netdevs in this case,
and a backend, which would be a common point for all the subfunctions
that are being used directly on the host. This is essentially what we
have with the macvlan model. The idea is that if we wanted to do
software switching or duplication of traffic we could, but if not then
we wouldn't.
Parav Pandit Dec. 16, 2020, 4:45 a.m. UTC | #22
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Sent: Wednesday, December 16, 2020 9:43 AM
> 
> >
> > That is the goal here. This is not about creating just a netdev, this is
> > about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev.
> 
> One issue is right now we are only seeing the rdma and netdev. It is kind of
> backwards as it is using the ADIs on the host when this was really meant to
> be used for things like mdev.
>
mdev is just yet another _use_ of a subfunction.
There are users of subfunctions who want to use them without mapping the subfunction as an mdev to a guest VM,
i.e. as netdev, rdma, vdpa.
In the future, direct assignment can be done via an mdev on top of a subfunction, just like how the rest of the above devices are done.

Creating a subfunction for a non-vfio purpose via a vfio mdev is just not right.
If I understand you correctly, I hope you are not suggesting that.
Leon Romanovsky Dec. 16, 2020, 6:50 a.m. UTC | #23
On Tue, Dec 15, 2020 at 01:28:05PM -0800, Jakub Kicinski wrote:
> On Tue, 15 Dec 2020 12:35:20 -0800 Saeed Mahameed wrote:
> > > I think the big thing we really should do if we are going to go this
> > > route is to look at standardizing what the flavours are that get
> > > created by the parent netdevice. Otherwise we are just creating the
> > > same mess we had with SRIOV all over again and muddying the waters of
> > > mediated devices.
> >
> > yes in the near future we will be working on auxbus interfaces for
> > auto-probing and user flavor selection, this is a must have feature for
> > us.
>
> Can you elaborate? I thought config would be via devlink.

Yes, everything continues to be done through devlink.

One of the immediate features is an ability to disable/enable creation
of specific SF types.

For example, if the user doesn't want RDMA, the SF RDMA won't be created.

Thanks
Jason Gunthorpe Dec. 16, 2020, 1:33 p.m. UTC | #24
On Tue, Dec 15, 2020 at 08:13:21PM -0800, Alexander Duyck wrote:

> > > Ugh, don't get me started on switchdev. The biggest issue as I see it
> > > with switchdev is that you have to have a true switch in order to
> > > really be able to use it.
> >
> > That cuts both ways, suggesting HW with a true switch model itself
> > with VMDq is equally problematic.
> 
> Yes and no. For example the macvlan offload I had setup could be
> configured both ways and it made use of VMDq. I'm not necessarily
> arguing that we need to do VMDq here, however at the same time saying
> that this is only meant to replace SR-IOV becomes problematic since we
> already have SR-IOV so why replace it with something that has many of
> the same limitations?

Why? Because SR-IOV is the *only* option for many use cases. Still. I
said this already, something more generic does not magically eliminate
SR-IOV.

The SIOV ADI model is a small refinement to the existing VF scheme, it
is completely parallel to making more generic things.

It is not "repeating mistakes" it is accepting the limitations of
SR-IOV because benefits exist and applications need those benefits.
 
> That said I understand your argument, however I view the elimination
> of SR-IOV to be something we do after we get this interface right and
> can justify doing so. 

Elimination of SR-IOV isn't even a goal here!

> Also it might be useful to call out the flavours and planned flavours
> in the cover page. Admittedly the description is somewhat lacking in
> that regard.

This is more of a general switchdev remark though. In the switchdev
model you have the switch and a switch port. Each port has a
switchdev representor on the switch side and a "user port" of some
kind.

It can be a physical thing:
 - SFP
 - QSFP
 - WiFi Antennae

It could be a semi-physical thing outside the view of the kernel:
 - SmartNIC VF/SF attached to another CPU

It can be a semi-physical thing in view of this kernel:
 - SRIOV VF (struct pci device)
 - SF (struct aux device)

It could be a SW construct in this kernel:
 - netdev (struct net device)

*all* of these different port types are needed. Probably more down the
road!

Notice I don't have VDPA, VF/SF netdev, or virtio-mdev as a "user
port" type here. Instead creating the user port pci or aux device
allows the user to use the Linux driver model to control what happens
to the pci/aux device next.
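
All of these "user ports" are modeled as devlink ports on the switch
side. A rough sketch of that plumbing for an SF, using the long-standing
devlink port API; the demo_* wrapper is illustrative, and setting the
new PCI_SF flavour/attributes (this series adds a helper for that) is
elided:

#include <net/devlink.h>

static int demo_sf_port_add(struct devlink *devlink,
			    struct devlink_port *dl_port,
			    unsigned int port_index,
			    struct net_device *sf_netdev)
{
	int err;

	/* Flavour/attribute setup (PCI_SF, controller/pf/sf numbers) is
	 * done with the helper this series introduces and is left out
	 * here.
	 */
	err = devlink_port_register(devlink, dl_port, port_index);
	if (err)
		return err;

	/* Once the SF's netdev exists, attach it so the port shows up
	 * with type "eth" in 'devlink port show'.
	 */
	devlink_port_type_eth_set(dl_port, sf_netdev);
	return 0;
}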

> I would argue that is one of the reasons why this keeps being
> compared to either VMDq or VMQ as it is something that SR-IOV has
> yet to fully replace and has many features that would be useful in
> an interface that is a subpartition of an existing interface.

In what sense do switchdev and a VF not fully replace macvlan VMDq?

> The Intel drivers still have the macvlan as the assignable ADI and
> make use of VMDq to enable it.

Is this in-tree or only in the proprietary driver? AFAIK there is no
in-tree way to extract the DMA queue from the macvlan netdev into
userspace..

Remember all this VF/SF/VDPA stuff results in a HW dataplane, not a SW
one. It doesn't really make sense to compare a SW dataplane to a HW
one. HW dataplanes come with limitations and require special driver
code.

> The limitation as I see it is that the macvlan interface doesn't allow
> for much in the way of custom offloads and the Intel hardware doesn't
> support switchdev. As such it is good for a basic interface, but
> doesn't really do well in terms of supporting advanced vendor-specific
> features.

I don't know what it is that prevents Intel from modeling their
selector HW in switchdev, but I think it is on them to work with the
switchdev folks to figure something out.

I'm a bit surprised HW that can do macvlan can't be modeled with
switchdev? What is missing?

> > > That is the goal here. This is not about creating just a netdev, this is
> > about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev.
> 
> One issue is right now we are only seeing the rdma and netdev. It is
> kind of backwards as it is using the ADIs on the host when this was
> really meant to be used for things like mdev.

This is the second 15-patch series on this path already. It is not
possible to pack every single thing into this series. This is the
micro step of introducing the SF idea and using SF==VF to show how the
driver stack works. The minimal changes to the existing drivers
imply this can support an ADI as well.

Further, this does already show an ADI! vdpa_mlx5 will run on the
VF/SF and eventually causes qemu to build a virtio-net ADI that
directly passes HW DMA rings into the guest.

Isn't this exactly the kind of generic SRIOV replacement option you
have been asking for? Doesn't this completely supersede stuff built on
macvlan?

> expected to work. The switchdev API puts some restrictions in place
> but there still ends up being parts without any definition.

I'm curious what you see as needing definition here? 

The SRIOV model is that the HW register programming API is device
specific.

The switchdev model is: no matter what HW register programming is done
on the VF/SF all the packets tx/rx'd will flow through the switchdev.

The purpose of switchdev/SRIOV/SIOV has never been to define a single
"one register set to rule them all".

That is the area that VDPA virtio-net and others are covering.

Jason
Alexander Duyck Dec. 16, 2020, 4:31 p.m. UTC | #25
On Wed, Dec 16, 2020 at 5:33 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Dec 15, 2020 at 08:13:21PM -0800, Alexander Duyck wrote:
>
> > > > Ugh, don't get me started on switchdev. The biggest issue as I see it
> > > > with switchdev is that you have to have a true switch in order to
> > > > really be able to use it.
> > >
> > > That cuts both ways, suggesting HW with a true switch model itself
> > > with VMDq is equally problematic.
> >
> > Yes and no. For example the macvlan offload I had setup could be
> > configured both ways and it made use of VMDq. I'm not necessarily
> > arguing that we need to do VMDq here, however at the same time saying
> > that this is only meant to replace SR-IOV becomes problematic since we
> > already have SR-IOV so why replace it with something that has many of
> > the same limitations?
>
> Why? Because SR-IOV is the *only* option for many use cases. Still. I
> said this already, something more generic does not magically eliminate
> SR-IOV.
>
> The SIOV ADI model is a small refinement to the existing VF scheme, it
> is completely parallel to making more generic things.
>
> It is not "repeating mistakes" it is accepting the limitations of
> SR-IOV because benefits exist and applications need those benefits.

If we have two interfaces, both with pretty much the same limitations
then many would view it as "repeating mistakes". The fact is we
already have SR-IOV. Why introduce yet another interface that has the
same functionality?

You say this will scale better but I am not even sure about that. The
fact is SR-IOV could scale to 256 VFs, but for networking I kind of
doubt the limitation would have been the bus number and would more
likely be issues with packet replication and PCIe throughput,
especially when you start dealing with east-west traffic within the
same system.

> > That said I understand your argument, however I view the elimination
> > of SR-IOV to be something we do after we get this interface right and
> > can justify doing so.
>
> Elimination of SR-IOV isn't even a goal here!

Sorry you used the word "replace", and my assumption here was that the
goal is to get something in place that can take the place of SR-IOV so
that you wouldn't be maintaining the two systems at the same time.
That is my concern as I don't want us having SR-IOV, and then several
flavors of SIOV. We need to decide on one thing that will be the way
forward.

> > Also it might be useful to call out the flavours and planned flavours
> > in the cover page. Admittedly the description is somewhat lacking in
> > that regard.
>
> This is more of a general switchdev remark though. In the switchdev
> model you have the switch and a switch port. Each port has a
> switchdev representor on the switch side and a "user port" of some
> kind.
>
> It can be a physical thing:
>  - SFP
>  - QSFP
>  - WiFi Antennae
>
> It could be a semi-physical thing outside the view of the kernel:
>  - SmartNIC VF/SF attached to another CPU
>
> It can be a semi-physical thing in view of this kernel:
>  - SRIOV VF (struct pci device)
>  - SF (struct aux device)
>
> It could be a SW construct in this kernel:
>  - netdev (struct net device)
>
> *all* of these different port types are needed. Probably more down the
> road!
>
> Notice I don't have VDPA, VF/SF netdev, or virtio-mdev as a "user
> port" type here. Instead creating the user port pci or aux device
> allows the user to use the Linux driver model to control what happens
> to the pci/aux device next.

I get that. That is why I said switchdev isn't a standard for the
endpoint. One of the biggest issues with SR-IOV that I have seen is
the fact that the last piece isn't really defined. We never did a good
job of defining how the ADI should look to the guest and as a result
it kind of stalled in adoption.

> > I would argue that is one of the reasons why this keeps being
> > compared to either VMDq or VMQ as it is something that SR-IOV has
> > yet to fully replace and has many features that would be useful in
> > an interface that is a subpartition of an existing interface.
>
> In what sense do switchdev and a VF not fully replace macvlan VMDq?

One of the biggest is east-west traffic. You quickly run up against
the PCIe bandwidth bottleneck and then the performance tanks. I have
seen a number of cases where peer-to-peer on the same host swamps the
network interface.

> > The Intel drivers still have the macvlan as the assignable ADI and
> > make use of VMDq to enable it.
>
> Is this in-tree or only in the proprietary driver? AFAIK there is no
> in-tree way to extract the DMA queue from the macvlan netdev into
> userspace..
>
> Remember all this VF/SF/VDPA stuff results in a HW dataplane, not a SW
> one. It doesn't really make sense to compare a SW dataplane to a HW
> one. HW dataplanes come with limitations and require special driver
> code.

I get that. At the same time we can mask some of those limitations by
allowing for the backend to be somewhat abstract so you have the
possibility of augmenting the hardware dataplane with a software one
if needed.

> > The limitation as I see it is that the macvlan interface doesn't allow
> > for much in the way of custom offloads and the Intel hardware doesn't
> > support switchdev. As such it is good for a basic interface, but
> > doesn't really do well in terms of supporting advanced vendor-specific
> > features.
>
> I don't know what it is that prevents Intel from modeling their
> selector HW in switchdev, but I think it is on them to work with the
> switchdev folks to figure something out.

They tried for the ixgbe and i40e. The problem is the hardware
couldn't conform to what was asked for, if I recall. It has been a few
years since I worked in the Ethernet group at Intel so I don't recall
the exact details.

> I'm a bit surprised HW that can do macvlan can't be modeled with
> switchdev? What is missing?

If I recall it was the fact that the hardware defaults to transmitting
everything that doesn't match an existing rule to the external port
unless it comes from the external port.

> > > > That is the goal here. This is not about creating just a netdev, this is
> > > about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev.
> >
> > One issue is right now we are only seeing the rdma and netdev. It is
> > kind of backwards as it is using the ADIs on the host when this was
> > really meant to be used for things like mdev.
>
> This is the second 15-patch series on this path already. It is not
> possible to pack every single thing into this series. This is the
> micro step of introducing the SF idea and using SF==VF to show how the
> driver stack works. The minimal changes to the existing drivers
> imply this can support an ADI as well.
>
> Further, this does already show an ADI! vdpa_mlx5 will run on the
> VF/SF and eventually causes qemu to build a virtio-net ADI that
> directly passes HW DMA rings into the guest.
>
> Isn't this exactly the kind of generic SRIOV replacement option you
> have been asking for? Doesn't this completely supersede stuff built on
> macvlan?

Something like the vdpa model is more like what I had in mind. However,
vdpa only works for the userspace networking case.

Basically the idea is to have an assignable device interface that
isn't directly tied to the hardware. Instead it is making use of a
slice of it and referencing the PF as the parent, leaving the PF as the
owner of the slice. Then at some point in the future we could make
changes to allow for software to step in and do some switching if
needed. The key bit is the abstraction of the assignable interface so
that it is vendor agnostic and could be switched over to pure software
backing if needed.

> > expected to work. The switchdev API puts some restrictions in place
> > but there still ends up being parts without any definition.
>
> I'm curious what you see as needing definition here?
>
> The SRIOV model is that the HW register programming API is device
> specific.
>
> The switchdev model is: no matter what HW register programming is done
> on the VF/SF all the packets tx/rx'd will flow through the switchdev.
>
> The purpose of switchdev/SRIOV/SIOV has never been to define a single
> "one register set to rule them all".
>
> That is the area that VDPA virtio-net and others are covering.

That is fine and that covers it for direct assigned devices. However
that doesn't cover the container case. My thought is if we are going
to partition a PF into multiple netdevices we should have some generic
interface that can be provided to represent the netdevs so that if
they are pushed into containers you don't have to rip them out if for
some reason you need to change the network configuration. For the
Intel NICs we did that with macvlan in the VMDq case. I see no reason
why you couldn't do something like that here with the subfunction
case.
Jason Gunthorpe Dec. 16, 2020, 5:51 p.m. UTC | #26
On Wed, Dec 16, 2020 at 08:31:44AM -0800, Alexander Duyck wrote:

> You say this will scale better but I am not even sure about that. The
> fact is SR-IOV could scale to 256 VFs, but for networking I kind of
> doubt the limitation would have been the bus number and would more
> likely be issues with packet replication and PCIe throughput,
> especially when you start dealing with east-west traffic within the
> same system.

We have been seeing deployments already hitting the 256 limit. This is
not a "theoretical use" patch set. There are already VM and container
farms with SW networking that support much more than 256 VM/containers
per server.

The optimization here is to reduce the hypervisor workload and free up
CPU cycles for the VMs/containers to consume. This means less handling
of packets in the CPU, especially for VM cases w/ SRIOV or VDPA.

Even the extra DMA on the NIC is not really a big deal. These are 400G
NICs with big fast PCI. If you top them out you are already doing an
aggregate of 400G of network traffic. That is a big number for a
single server, it is OK.

Someone might feel differently if they did this on a 10/40G NIC, in
which case this is not the solution for their application.

> Sorry you used the word "replace", and my assumption here was that the
> goal is to get something in place that can take the place of SR-IOV so
> that you wouldn't be maintaining the two systems at the same time.
> That is my concern as I don't want us having SR-IOV, and then several
> flavors of SIOV. We need to decide on one thing that will be the way
> forward.

SRIOV has to continue until the PASID and IMS platform features are
widely available and mature. It will probably be 10 years before we
see most people able to use SIOV for everything they want.

I think we will see lots of SIOV variants, I know Intel is already
pushing SIOV parts outside netdev.

> I get that. That is why I said switchdev isn't a standard for the
> endpoint. One of the biggest issues with SR-IOV that I have seen is
> the fact that the last piece isn't really defined. We never did a good
> job of defining how the ADI should look to the guest and as a result
> it kind of stalled in adoption.

The ADI is supposed to present the HW programming API that is
desired. It is always up to the implementation.

SIOV was never a project to standardize HW programming models like
virtio-net, NVMe, etc.

> > I'm a bit surprised HW that can do macvlan can't be modeled with
> > switchdev? What is missing?
> 
> If I recall it was the fact that the hardware defaults to transmitting
> everything that doesn't match an existing rule to the external port
> unless it comes from the external port.

That seems small enough it should be resolvable, IMHO. eg some new
switch rule that matches that specific HW behavior?

> Something like the vdpa model is more like what I had in mind. However,
> vdpa only works for the userspace networking case.

That's because making a driver that converts the native HW to VDPA and
then running a generic netdev on the resulting virtio-net is a pretty
wild thing to do. I can't really think of an actual use case.

> Basically the idea is to have an assignable device interface that
> isn't directly tied to the hardware. 

The switchdev model is to create a switch port. As I explained, in
Linux we see "pci device" and "aux device" as being some "user port"
options to access that switch.

If you want a "generic device" that is fine, but what exactly is that
programming interface in Linux? Sketch out an API, where does the idea
go?  What does the driver that implement it look like? What consumes
it?

Should this be a great idea, then a mlx5 version of this will still be
to create an SF aux device, bind mlx5_core, then bind "generic device"
on top of that. This is simply a reflection of how the mlx5 HW/SW
layering works. Squashing all of this into a single layer is work with
a bad ROI.

> they are pushed into containers you don't have to rip them out if for
> some reason you need to change the network configuration. 

Why would you need to rip them out?

Jason
Saeed Mahameed Dec. 16, 2020, 5:59 p.m. UTC | #27
On Wed, 2020-12-16 at 08:50 +0200, Leon Romanovsky wrote:
> On Tue, Dec 15, 2020 at 01:28:05PM -0800, Jakub Kicinski wrote:
> > On Tue, 15 Dec 2020 12:35:20 -0800 Saeed Mahameed wrote:
> > > > I think the big thing we really should do if we are going to go
> > > > this
> > > > route is to look at standardizing what the flavours are that
> > > > get
> > > > created by the parent netdevice. Otherwise we are just creating
> > > > the
> > > > same mess we had with SRIOV all over again and muddying the
> > > > waters of
> > > > mediated devices.
> > > 
> > > yes in the near future we will be working on auxbus interfaces
> > > for
> > > auto-probing and user flavor selection, this is a must have
> > > feature for
> > > us.
> > 
> > Can you elaborate? I thought config would be via devlink.
> 
> Yes, everything continues to be done through devlink.
> 
> One of the immediate features is an ability to disable/enable
> creation
> of specific SF types.
> 
> For example, if the user doesn't want RDMA, the SF RDMA won't be created.
> 

Devlink is an option too; we still don't have our minds set on a
specific API. We are considering both as valuable solutions, since
devlink makes sense as the go-to interface for everything SF, but on
the other hand, auto-probing and device instantiation are done at the
auxbus level, so it also makes sense to have some sort of "device type"
user selection API in the auxbus. Anyway, this discussion is for a
future patch.
Alexander Duyck Dec. 16, 2020, 7:27 p.m. UTC | #28
On Wed, Dec 16, 2020 at 9:51 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Dec 16, 2020 at 08:31:44AM -0800, Alexander Duyck wrote:
>
> > You say this will scale better but I am not even sure about that. The
> > fact is SR-IOV could scale to 256 VFs, but for networking I kind of
> > doubt the limitation would have been the bus number and would more
> > likely be issues with packet replication and PCIe throughput,
> > especially when you start dealing with east-west traffic within the
> > same system.
>
> We have been seeing deployments already hitting the 256 limit. This is
> not a "theoretical use" patch set. There are already VM and container
> farms with SW networking that support much more than 256 VM/containers
> per server.

That has been the case for a long time. However it had been my
experience that SR-IOV never scaled well to meet those needs and so it
hadn't been used in such deployments.

> The optimization here is to reduce the hypervisor workload and free up
> CPU cycles for the VMs/containers to consume. This means less handling
> of packets in the CPU, especially for VM cases w/ SRIOV or VDPA.
>
> Even the extra DMA on the NIC is not really a big deal. These are 400G
> NICs with big fast PCI. If you top them out you are already doing an
> aggregate of 400G of network traffic. That is a big number for a
> single sever, it is OK.

Yes, but at a certain point you start bumping up against memory
throughput limitations as well. Doubling up the memory footprint by
having the device have to write to new pages instead of being able to
do something like pinning and zero-copy would be expensive.

For something like an NFV use case it might make sense, but in the
general high server count case it seems like it is a setup that would
be detrimental.

> Someone might feel differently if they did this on a 10/40G NIC, in
> which case this is not the solution for their application.

My past experience was with 10/40G NICs with tens of VFs. When we start
talking about hundreds I would imagine the overhead becomes orders of
magnitude worse, as the problem becomes more of an n^2 issue since you
will have n times more systems sending to n times more systems
receiving. As such, things like broadcast traffic would end up
consuming a fair bit of bandwidth.

> > Sorry you used the word "replace", and my assumption here was that the
> > goal is to get something in place that can take the place of SR-IOV so
> > that you wouldn't be maintaining the two systems at the same time.
> > That is my concern as I don't want us having SR-IOV, and then several
> > flavors of SIOV. We need to decide on one thing that will be the way
> > forward.
>
> SRIOV has to continue until the PASID and IMS platform features are
> widely available and mature. It will probably be 10 years before we
> see most people able to use SIOV for everything they want.
>
> I think we will see lots of SIOV varients, I know Intel is already
> pushing SIOV parts outside netdev.

The key bit here is outside of netdev. Like I said, SIOV and SR-IOV
tend to be PCIe specific specifications. What we are defining here is
how the network interfaces presented by such devices will work.

> > I get that. That is why I said switchdev isn't a standard for the
> > endpoint. One of the biggest issues with SR-IOV that I have seen is
> > the fact that the last piece isn't really defined. We never did a good
> > job of defining how the ADI should look to the guest and as a result
> > it kind of stalled in adoption.
>
> The ADI is supposed to present the HW programming API that is
> desired. It is always up to the implementation.
>
> SIOV was never a project to standardize HW programming models like
> virtio-net, NVMe, etc.

I agree. Just like SR-IOV never spelled out that network devices
should be using switchdev. That is something we decided on as a
community. What I am explaining here is that we should be thinking
about the implications of how the network interface is exposed in the
host in the case of subfunctions that are associated with a given
switchdev device.

> > > I'm a bit surprised HW that can do macvlan can't be modeled with
> > > switchdev? What is missing?
> >
> > If I recall it was the fact that the hardware defaults to transmitting
> > everything that doesn't match an existing rule to the external port
> > unless it comes from the external port.
>
> That seems small enough it should be resolvable, IMHO. eg some new
> switch rule that matches that specific HW behavior?

I would have to go digging to find the conversation. It was about 3 or
4 years ago. I seem to recall mentioning the idea of having some
static rules but it was a no-go at the time. If we wanted to spin off
this conversation and pull in some Intel folks I would be up for us
revisiting it. However I'm not with Intel anymore so it would mostly
be something I would be working on as a hobby project instead of
anything serious.

> > Something like the vdpa model is more like what I had in mind. Only
> > vdpa only works for the userspace networking case.
>
> That's because making a driver that converts the native HW to VDPA and
> then running a generic netdev on the resulting virtio-net is a pretty
> wild thing to do. I can't really think of an actual use case.

I'm not talking about us drastically changing existing models. I would
still expect the mlx5 driver to be running on top of the aux device.
However it may be that the aux device is associated with something
like the switchdev port as a parent and the output from the traffic is
then going to the subfunction netdev.

> > Basically the idea is to have an assignable device interface that
> > isn't directly tied to the hardware.
>
> The switchdev model is to create a switch port. As I explained in
> Linux we see "pci device" and "aux device" as being some "user port"
> options to access to that switch.
>
> If you want a "generic device" that is fine, but what exactly is that
> programming interface in Linux? Sketch out an API, where does the idea
> go?  What does the driver that implement it look like? What consumes
> it?
>
> Should this be a great idea, then a mlx5 version of this will still be
> to create an SF aux device, bind mlx5_core, then bind "generic device"
> on top of that. This is simply a reflection of how the mlx5 HW/SW
> layering works. Squashing all of this into a single layer is work with
> no bad ROI.

In my mind the mlx5_core is still binding to the SF aux device so I am
fine with that. The way I view this working is that it would work sort
of like the macvlan approach that was taken with the Intel parts.
Although instead of binding to the PF it might make more sense to have
it bound to the switchdev port associated with the subfunction and
looking like a veth pair from the host perspective in terms of
behavior. The basic idea is most of the control is still in the
switchdev port, but it becomes more visible that the switchdev port
and the subfunction netdev are linked with the switchdev port as the
parent.

> > they are pushed into containers you don't have to rip them out if for
> > some reason you need to change the network configuration.
>
> Why would you need to rip them out?

Because they are specifically tied to the mlx5 device. So if for
example I need to hotplug out the mlx5 and replace it, it would be
useful to allow the interface in the containers to stay in place and
fail over to some other software backing interface in the switchdev
namespace. One of the issues with VFs is that we have always had to
push some sort of bond on top, or switch over to a virtio-net
interface in order to support fail-over. If we can resolve that in the
host case that would go a long way toward solving one of the main
issues of SR-IOV.
Jason Gunthorpe Dec. 16, 2020, 8:35 p.m. UTC | #29
On Wed, Dec 16, 2020 at 11:27:32AM -0800, Alexander Duyck wrote:

> That has been the case for a long time. However it had been my
> experience that SR-IOV never scaled well to meet those needs and so it
> hadn't been used in such deployments.

Seems to be going quite well here, perhaps the applications are
different.

> > The optimization here is to reduce the hypervisor workload and free up
> > CPU cycles for the VMs/containers to consume. This means less handling
> > of packets in the CPU, especially for VM cases w/ SRIOV or VDPA.
> >
> > Even the extra DMA on the NIC is not really a big deal. These are 400G
> > NICs with big fast PCI. If you top them out you are already doing an
> > aggregate of 400G of network traffic. That is a big number for a
> > single sever, it is OK.
> 
> Yes, but at a certain point you start bumping up against memory
> throughput limitations as well. Doubling up the memory footprint by
> having the device have to write to new pages instead of being able to
> do something like pinning and zero-copy would be expensive.

You can't zero-copy when using VMs.

And when using containers every skb still has to go through all the
switching and encapsulation logic, which is not free in SW.

At a certain point the gains of avoiding the DMA copy are lost by the
costs of all the extra CPU work. The factor being optimized here is
CPU capacity.

> > Someone might feel differently if they did this on a 10/40G NIC, in
> > which case this is not the solution for their application.
> 
> My past experience was with 10/40G NIC with tens of VFs. When we start
> talking about hundreds I would imagine the overhead becomes orders of
> magnitudes worse as the problem becomes more of an n^2 issue since you
> will have n times more systems sending to n times more systems

The traffic demand is application dependent. If an application has an
n^2 traffic pattern then it needs a network to sustain that cross
sectional bandwidth regardless of how the VMs are packed.

It just becomes a design factor of the network and now the network
includes that switching component on the PCIe NIC as part of the
capacity for cross sectional BW.

There is some balance where a VM can only generate so much traffic
based on the CPU it has available, and you can design the entire
infrastructure to balance the CPU with the NIC with the switches and
come to some packing factor of VMs. 

As CPU constrains VM performance, removing CPU overheads from the
system will improve packing density. A HW network data path in the VMs
is one such case that can turn into a net win if the CPU bottleneck is
bigger than the network bottleneck.

It is really oversimplifying to just say PCIe DMA copies are bad.

> receiving. As such things like broadcast traffic would end up
> consuming a fair bit of traffic.

I think you have a lot bigger network problems if your broadcast
traffic is so high that you start to worry about DMA copy performance
in a 400G NIC.

> The key bit here is outside of netdev. Like I said, SIOV and SR-IOV
> tend to be PCIe specific specifications. What we are defining here is
> how the network interfaces presented by such devices will work.

I think we've achieved this..

> > That seems small enough it should be resolvable, IMHO. eg some new
> > switch rule that matches that specific HW behavior?
> 
> I would have to go digging to find the conversation. It was about 3 or
> 4 years ago. I seem to recall mentioning the idea of having some
> static rules but it was a no-go at the time. If we wanted to spin off
> this conversation and pull in some Intel folks I would be up for us
> revisiting it. However I'm not with Intel anymore so it would mostly
> be something I would be working on as a hobby project instead of
> anything serious.

Personally I welcome getting more drivers to implement the switchdev
model; I think it is only good for the netdev community as a whole
to understand and standardize on this.

> > > Something like the vdpa model is more like what I had in mind. Only
> > > vdpa only works for the userspace networking case.
> >
> > That's because making a driver that converts the native HW to VDPA and
> > then running a generic netdev on the resulting virtio-net is a pretty
> > wild thing to do. I can't really think of an actual use case.
> 
> I'm not talking about us drastically changing existing models. I would
> still expect the mlx5 driver to be running on top of the aux device.
> However it may be that the aux device is associated with something
> like the switchdev port as a parent 
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That is exactly how this works. The switchdev representor and the aux
device are paired and form the analog of the veth tunnel. IIRC this
relationship with the aux device is shown in the devlink output for
the switchdev ports.

I still can't understand what you think should be changed here.

We can't get rid of the aux device, it is integral to the software
layering and essential to support any device assignment flow.

We can't add a mandatory netdev, because that is just pure waste for
any device assignment flow.

> The basic idea is most of the control is still in the switchdev
> port, but it becomes more visible that the switchdev port and the
> subfunction netdev are linked with the switchdev port as the parent.

If a user netdev is created, say for a container, then it should have
as a parent the aux device.

If you inspect the port configuration on the switchdev side it will
show the aux device associated with the port.

Are you just asking that the userspace tools give a little help and
show that switchdev port XX is visible as netdev YYY by cross-matching
the auxbus?

> > > they are pushed into containers you don't have to rip them out if for
> > > some reason you need to change the network configuration.
> >
> > Why would you need to rip them out?
> 
> Because they are specifically tied to the mlx5 device. So if for
> example I need to hotplug out the mlx5 and replace it, 

Uhh, I think we are very very far away from being able to hot unplug a
switchdev driver, keep the switchdev running, and drop in a different
driver.

That isn't even on the radar, AFAIK.

> namespace. One of the issues with VFs is that we have always had to
> push some sort of bond on top, or switch over to a virtio-net
> interface in order to support fail-over. If we can resolve that in the
> host case that would go a long way toward solving one of the main
> issues of SR-IOV.

This is all solved already, virtio-net is the answer.

qemu can swap back ends under the virtio-net ADI it created on the
fly. This means it can go from processing a virtio-net queue in mlx5
HW, to full SW, to some other HW on another machine. All hitlessly and
transparently to the guest VM.

Direct HW processing of a queue inside a VM without any downsides for
VM migration. Check.

I have the feeling this stuff you are asking for is already done..

Jason
Alexander Duyck Dec. 16, 2020, 10:53 p.m. UTC | #30
On Wed, Dec 16, 2020 at 12:35 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Dec 16, 2020 at 11:27:32AM -0800, Alexander Duyck wrote:
>
> > That has been the case for a long time. However it had been my
> > experience that SR-IOV never scaled well to meet those needs and so it
> > hadn't been used in such deployments.
>
> Seems to be going quite well here, perhaps the applications are
> different.
>
> > > The optimization here is to reduce the hypervisor workload and free up
> > > CPU cycles for the VMs/containers to consume. This means less handling
> > > of packets in the CPU, especially for VM cases w/ SRIOV or VDPA.
> > >
> > > Even the extra DMA on the NIC is not really a big deal. These are 400G
> > > NICs with big fast PCI. If you top them out you are already doing an
> > > aggregate of 400G of network traffic. That is a big number for a
> > > single sever, it is OK.
> >
> > Yes, but at a certain point you start bumping up against memory
> > throughput limitations as well. Doubling up the memory footprint by
> > having the device have to write to new pages instead of being able to
> > do something like pinning and zero-copy would be expensive.
>
> You can't zero-copy when using VMs.
>
> And when using containers every skb still has to go through all the
> switching and encapsulation logic, which is not free in SW.
>
> At a certain point the gains of avoiding the DMA copy are lost by the
> costs of all the extra CPU work. The factor being optimized here is
> CPU capacity.
>
> > > Someone might feel differently if they did this on a 10/40G NIC, in
> > > which case this is not the solution for their application.
> >
> > My past experience was with 10/40G NIC with tens of VFs. When we start
> > talking about hundreds I would imagine the overhead becomes orders of
> > magnitudes worse as the problem becomes more of an n^2 issue since you
> > will have n times more systems sending to n times more systems
>
> The traffic demand is application dependent. If an application has an
> n^2 traffic pattern then it needs a network to sustain that cross
> sectional bandwidth regardless of how the VMs are packed.
>
> It just becomes a design factor of the network and now the network
> includes that switching component on the PCIe NIC as part of the
> capacity for cross sectional BW.
>
> There is some balance where a VM can only generate so much traffic
> based on the CPU it has available, and you can design the entire
> infrastructure to balance the CPU with the NIC with the switches and
> come to some packing factor of VMs.
>
> As CPU constrains VM performance, removing CPU overheads from the
> system will improve packing density. A HW network data path in the VMs
> is one such case that can turn to a net win if the CPU bottleneck is
> bigger than the network bottleneck.
>
> It is really over simplifiying to just say PCIe DMA copies are bad.

I'm not saying the copies are bad. However they can be limiting. As
you said it all depends on the use case. If you have nothing but
functions that are performing bump-in-the-wire type operations odds
are the PCIe bandwidth won't be the problem. It all depends on the use
case and that is why I would prefer the interface to be more flexible
rather than just repeating what has been done with SR-IOV.

The problem in my case was based on a past experience where east-west
traffic became a problem and it was easily shown that bypassing the
NIC for traffic was significantly faster.

> > receiving. As such things like broadcast traffic would end up
> > consuming a fair bit of traffic.
>
> I think you have a lot bigger network problems if your broadcast
> traffic is so high that you start to worry about DMA copy performance
> in a 400G NIC.

Usually the problems were more with multicast than broadcast, but
yeah, this typically isn't an issue. Still, at 256 VFs you would be
talking about a replication rate such that at 1-2Gbps one VF could
saturate the entire 400G device.

> > The key bit here is outside of netdev. Like I said, SIOV and SR-IOV
> > tend to be PCIe specific specifications. What we are defining here is
> > how the network interfaces presented by such devices will work.
>
> I think we've achieved this..

Somewhat. We have it explained for the control plane. What we are
defining now is how it will appear in the guest/container.

> > > That seems small enough it should be resolvable, IMHO. eg some new
> > > switch rule that matches that specific HW behavior?
> >
> > I would have to go digging to find the conversation. It was about 3 or
> > 4 years ago. I seem to recall mentioning the idea of having some
> > static rules but it was a no-go at the time. If we wanted to spin off
> > this conversation and pull in some Intel folks I would be up for us
> > revisiting it. However I'm not with Intel anymore so it would mostly
> > be something I would be working on as a hobby project instead of
> > anything serious.
>
> Personally I welcome getting more drivers to implement the switchdev
> model, I think it is only good for the netdev community as as a whole
> to understand and standardize on this.

Agreed.

> > > > Something like the vdpa model is more like what I had in mind. Only
> > > > vdpa only works for the userspace networking case.
> > >
> > > That's because making a driver that converts the native HW to VDPA and
> > > then running a generic netdev on the resulting virtio-net is a pretty
> > > wild thing to do. I can't really think of an actual use case.
> >
> > I'm not talking about us drastically changing existing models. I would
> > still expect the mlx5 driver to be running on top of the aux device.
> > However it may be that the aux device is associated with something
> > like the switchdev port as a parent
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> That is exactly how this works. The switchdev representor and the aux
> device are paired and form the analog of the veth tunnel. IIRC this
> relationship with the aux device is shown in the devlink output for
> the switchdev ports.
>
> I still can't understand what you think should be changed here.
>
> We can't get rid of the aux device, it is integral to the software
> layering and essential to support any device assignment flow.
>
> We can't add a mandatory netdev, because that is just pure waste for
> any device assignment flow.

I'm not saying to get rid of the netdev. I'm saying the netdev created
should be a generic interface that could be reused by other vendors.
The whole idea behind using something like macvlan is to hide the
underlying device. It becomes the namespace assignable interface. What
I would like to see is us get rid of that step and instead just have a
generic interface spawned in the first place and push the driver
specific bits back to the switchdev port.

The problem right now is that the switchdev netdev and the subfunction
netdev are treated as peers. In my mind it should work somewhere in
between macvlan and veth. Basically you have the switchdev port
handling both ends of the traffic and handling the aux device
directly, and the subfunction device is floating on top of it,
sometimes acting like a macvlan when traffic is going to/from the aux
device, and acting like a veth pair in some cases, perhaps such as
broadcast/multicast, allowing the switchdev to take care of that
locally.

> > The basic idea is most of the control is still in the switchdev
> > port, but it becomes more visible that the switchdev port and the
> > subfunction netdev are linked with the switchdev port as the parent.
>
> If a user netdev is created, say for a container, then it should have
> as a parent the aux device.
>
> If you inspect the port configuration on the switchdev side it will
> show the aux device associated with the port.
>
> Are you just asking that the userspace tools give a little help and
> show that switchdev port XX is visable as netdev YYY by cross matching
> the auxbus?

It isn't about the association, it is about who is handling the
traffic. Going back to the macvlan model what we did is we had a group
of rings on the device that would automatically forward unicast
packets to the macvlan interface and would be reserved for
transmitting packets from the macvlan interface. We took care of
multicast and broadcast replication in software.

In my mind it should be possible to do something similar for the
switchdev case. Basically the packets would be routed from the
subfunction netdev, to the switchdev netdev, and could then be
transmitted through the aux device. In order to make this work in the
case of ixgbe I had come up with the concept of a subordinate device,
"sb_dev", which was a pointer used in the Tx case to identify the Tx
rings and qdiscs that had been given to the macvlan interface.
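
For illustration, a rough sketch of what that Tx-side plumbing looks
like (the example_* structures and lookup are invented for this sketch,
not taken from ixgbe; only ndo_select_queue(), netdev_pick_tx() and the
sb_dev argument are the real kernel pieces):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical per-upper-device state describing the ring block handed
 * to an offloaded upper (macvlan then, a subfunction netdev here). */
struct example_fwd {
        struct net_device *upper;       /* the sb_dev */
        u16 tx_base;                    /* first reserved Tx queue */
        u16 num_tx;                     /* number of reserved Tx queues */
};

/* Driver-private lookup from sb_dev to its ring block, e.g. walking a
 * list populated by .ndo_dfwd_add_station(); stubbed for the sketch. */
static struct example_fwd *example_find_fwd(struct net_device *dev,
                                            struct net_device *sb_dev)
{
        return NULL;
}

static u16 example_select_queue(struct net_device *dev, struct sk_buff *skb,
                                struct net_device *sb_dev)
{
        struct example_fwd *fwd = sb_dev ? example_find_fwd(dev, sb_dev) : NULL;

        /* Traffic from the subordinate device goes to its reserved rings. */
        if (fwd)
                return fwd->tx_base + (skb_get_hash(skb) % fwd->num_tx);

        /* Everything else uses the normal queue selection. */
        return netdev_pick_tx(dev, skb, NULL);
}

The subordinate device would be registered through
.ndo_dfwd_add_station() and handed back as sb_dev on transmit;
presumably a switchdev-flavoured version of this would key off the
representor/SF pairing instead of a macvlan upper.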

> > > > they are pushed into containers you don't have to rip them out if for
> > > > some reason you need to change the network configuration.
> > >
> > > Why would you need to rip them out?
> >
> > Because they are specifically tied to the mlx5 device. So if for
> > example I need to hotplug out the mlx5 and replace it,
>
> Uhh, I think we are very very far away from being able to hot unplug a
> switchdev driver, keep the switchdev running, and drop in a different
> driver.
>
> That isn't even on the radar, AFAIK.

That might be a bad example; I was thinking of the issues we have had
with VFs and direct assignment to Qemu-based guests in the past.
Essentially what I am getting at is that the setup in the container
should be vendor agnostic. The interface exposed shouldn't be specific
to any one vendor. So if I want to fire up a container on Mellanox,
Broadcom, or some other vendor's hardware it shouldn't matter or be
visible to the user. They should just see a vendor agnostic subfunction
netdevice.

Something like that is doable already using something like a macvlan
on top of a subfunction interface, but I feel like that is an
unnecessary step and creates unnecessary netdevices.

> > namespace. One of the issues with VFs is that we have always had to
> > push some sort of bond on top, or switch over to a virtio-net
> > interface in order to support fail-over. If we can resolve that in the
> > host case that would go a long way toward solving one of the main
> > issues of SR-IOV.
>
> This is all solved already, virtio-net is the answer.
>
> qemu can swap back ends under the virtio-net ADI it created on the
> fly. This means it can go from processing a virtio-net queue in mlx5
> HW, to full SW, to some other HW on another machine. All hitlessly and
> transparently to the guest VM.
>
> Direct HW processing of a queue inside a VM without any downsides for
> VM migration. Check.
>
> I have the feeling this stuff you are asking for is already done..

The case you are describing has essentially solved it for Qemu
virtualization and direct assignment. It still doesn't necessarily
solve it for the container case though.
Jason Gunthorpe Dec. 17, 2020, 12:38 a.m. UTC | #31
On Wed, Dec 16, 2020 at 02:53:07PM -0800, Alexander Duyck wrote:
 
> It isn't about the association, it is about who is handling the
> traffic. Going back to the macvlan model what we did is we had a group
> of rings on the device that would automatically forward unicast
> packets to the macvlan interface and would be reserved for
> transmitting packets from the macvlan interface. We took care of
> multicast and broadcast replication in software.

Okay, maybe I'm starting to see where you are coming from.

First, some clarity here: as I see it, the devlink
infrastructure is all about creating the auxdevice for a switchdev
port.

What goes into that auxdevice is *completely* up to the driver. mlx5
is doing a SF which == VF, but that is not a requirement of the design
at all.

If an Intel driver wants to put a queue block into the aux device and
that is != VF, it is just fine.

The Intel netdev that binds to the auxdevice can transform the queue
block and specific switchdev config into a netdev identical to
accelerated macvlan. Nothing about that breaks the switchdev model.
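
To make the binding model concrete, here is a minimal (purely
illustrative) auxiliary driver skeleton; the "example_vendor.sf" id and
the example_* names are made up, only the auxiliary bus API itself is
real:

#include <linux/module.h>
#include <linux/auxiliary_bus.h>

static const struct auxiliary_device_id example_sf_id_table[] = {
        { .name = "example_vendor.sf" },
        {}
};
MODULE_DEVICE_TABLE(auxiliary, example_sf_id_table);

static int example_sf_probe(struct auxiliary_device *adev,
                            const struct auxiliary_device_id *id)
{
        /* Whatever the parent PCI driver packed into the aux device (a
         * full SF function, a queue block, ...) gets turned into the
         * user-facing object here: a netdev, an RDMA device, a VDPA
         * device, and so on. */
        dev_info(&adev->dev, "bound to %s\n", id->name);
        return 0;
}

static void example_sf_remove(struct auxiliary_device *adev)
{
        /* Tear down whatever probe() registered. */
}

static struct auxiliary_driver example_sf_driver = {
        .name = "sf",
        .probe = example_sf_probe,
        .remove = example_sf_remove,
        .id_table = example_sf_id_table,
};
module_auxiliary_driver(example_sf_driver);

MODULE_LICENSE("GPL");

IIRC the mlx5 SF devices show up on the bus as mlx5_core.sf.<N>, so
which driver ends up claiming them is purely a question of what id
table it advertises.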

Essentially think of it as generalizing the acceleration plugin for a
netdev. Instead of making something specific to limited macvlan, the
driver gets to provide exactly the structure that matches its HW to
provide the netdev as the user side of the switchdev port. I see no
limitation here so long as the switchdev model for controlling traffic
is followed.

Let me segue into a short story from RDMA.. We've had a netdev called
IPoIB for a long time. It is actually kind of similar to this general
thing you are talking about, in that there is a programming layer
under the IPoIB netdev called RDMA verbs that generalizes the actual
HW. Over the years this became more complicated because every new
netdev offload needed mirroring into the general RDMA verbs
API. TSO, GSO, checksum offload, endlessly onwards. It became quite
dumb in the end. We gave up and said the HW driver should directly
implement netdev. Implementing a middle API layer makes zero sense
when netdev is already perfectly suited to implement on top of
HW. Removing SW layers caused performance to go up something like
2x.

The hard-earned lesson I take from that is: don't put software layers
between a struct net_device and the actual HW. The closest coupling is
really the best thing. Provide library code in the kernel to help
drivers implement common patterns when making their netdevs; do not
provide wrapper netdevs around drivers.

IMHO the approach of macvlan acceleration made some sense in 2013, but
today I would say it is mashing unrelated layers together and
polluting what should be a pure SW implementation with HW hooks.

I see from the mailing list comments this was done because creating a
device specific netdev via 'ip link add' was rightly rejected. However
here we *can* create a device specific vmdq *auxdevice*.  This is OK
because the netdev is controlling and containing the aux device via
switchdev.

So, Intel can get the "VMDQ link type" that was originally desired more
or less directly, so long as the associated switchdev port controls
the MAC filter process, not "ip link add".

And if you want to make the vmdq auxdevice into an ADI by user DMA to
queues, then sure, that model is completely sane too (vs hacking up
macvlan to expose user queues) - so long as the kernel controls the
selection of traffic into those queues and follows the switchdev
model. I would recommend creating a simple RDMA raw ethernet queue
driver over the aux device for something like this :)

> That might be a bad example, I was thinking of the issues we have had
> with VFs and direct assignment to Qemu based guests in the past.

As described, this is solved by VDPA.

> Essentially what I am getting at is that the setup in the container
> should be vendor agnostic. The interface exposed shouldn't be specific
> to any one vendor. So if I want to fire up a container or Mellanox,
> Broadcom, or some other vendor it shouldn't matter or be visible to
> the user. They should just see a vendor agnostic subfunction
> netdevice.

Agree. The agnostic container user interface here is 'struct
net_device'.

> > I have the feeling this stuff you are asking for is already done..
> 
> The case you are describing has essentially solved it for Qemu
> virtualization and direct assignment. It still doesn't necessarily
> solve it for the container case though.

The container case doesn't need solving.

Any scheme I've heard for container live migration, like CRIU,
essentially hot plugs the entire kernel in/out of a user process. We
rely on the kernel providing low leakage of the implementation details
of the struct net_device as part of its uAPI contract. When CRIU
swaps the kernel the new kernel can have any implementation of the
container netdev it wants.

I've never heard of a use case to hot swap the implementation *under* a
netdev from a container. macvlan can't do this today. If you have a
use case here, it really has nothing to do with this series.

Jason
Alexander Duyck Dec. 17, 2020, 6:48 p.m. UTC | #32
On Wed, Dec 16, 2020 at 4:38 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Dec 16, 2020 at 02:53:07PM -0800, Alexander Duyck wrote:
>
> > It isn't about the association, it is about who is handling the
> > traffic. Going back to the macvlan model what we did is we had a group
> > of rings on the device that would automatically forward unicast
> > packets to the macvlan interface and would be reserved for
> > transmitting packets from the macvlan interface. We took care of
> > multicast and broadcast replication in software.
>
> Okay, maybe I'm starting to see where you are coming from.
>
> First, I think some clarity here, as I see it the devlink
> infrastructure is all about creating the auxdevice for a switchdev
> port.
>
> What goes into that auxdevice is *completely* up to the driver. mlx5
> is doing a SF which == VF, but that is not a requirement of the design
> at all.
>
> If an Intel driver wants to put a queue block into the aux device and
> that is != VF, it is just fine.
>
> The Intel netdev that binds to the auxdevice can transform the queue
> block and specific switchdev config into a netdev identical to
> accelerated macvlan. Nothing about the breaks the switchdev model.

Just to clarify I am not with Intel, nor do I plan to work on any
Intel drivers related to this.

My concern has more to do with how this is being plumbed and the fact
that the basic architecture is somewhat limiting.

> Essentially think of it as generalizing the acceleration plugin for a
> netdev. Instead of making something specific to limited macvlan, the
> driver gets to provide exactly the structure that matches its HW to
> provide the netdev as the user side of the switchdev port. I see no
> limitation here so long as the switchdev model for controlling traffic
> is followed.

I see plenty. The problem is it just sets up more vendor lock-in and
features that have to be thrown away when you have to settle for the
least common denominator in order to maintain functionality across
vendors.

> Let me segue into a short story from RDMA.. We've had a netdev called
> IPoIB for a long time. It is actually kind of similar to this general
> thing you are talking about, in that there is a programming layer
> under the IPOIB netdev called RDMA verbs that generalizes the actual
> HW. Over the years this became more complicated because every new
> netdev offloaded needed mirroring into the RDMA verbs general
> API. TSO, GSO, checksum offload, endlessly onwards. It became quite
> dumb in the end. We gave up and said the HW driver should directly
> implement netdev. Implementing a middle API layer makes zero sense
> when netdev is already perfectly suited to implement ontop of
> HW. Removing SW layers caused performance to go up something like
> 2x.
>
> The hard earned lesson I take from that is don't put software layers
> between a struct net_device and the actual HW. The closest coupling is
> really the best thing. Provide libary code in the kernel to help
> drivers implement common patterns when making their netdevs, do not
> provide wrapper netdevs around drivers.
>
> IMHO the approach of macvlan accleration made some sense in 2013, but
> today I would say it is mashing unrelated layers together and
> polluting what should be a pure SW implementation with HW hooks.

I disagree here. In my mind a design where two interfaces, which both
exist in the kernel, have to go to hardware in order to communicate is
very limiting. The main thing I am wanting to see is the option of
being able to pass traffic directly between the switchdev and the SF
without the need to touch the hardware.

An easy example of such traffic that would likely benefit from this is
multicast/broadcast traffic. Instead of having to process each and
every broadcast packet in hardware, you could process it at
the switchdev and directly hand it off from the switchdev to the
SF, rather than sending it to hardware for each
switchdev instance.

> I see from the mailing list comments this was done because creating a
> device specific netdev via 'ip link add' was rightly rejected. However
> here we *can* create a device specific vmdq *auxdevice*.  This is OK
> because the netdev is controlling and containing the aux device via
> switchdev.
>
> So, Intel can get the "VMDQ link type" that was originally desired more
> or less directly, so long as the associated switchdev port controls
> the MAC filter process, not "ip link add".
>
> And if you want to make the vmdq auxdevice into an ADI by user DMA to
> queues, then sure, that model is completely sane too (vs hacking up
> macvlan to expose user queues) - so long as the kernel controls the
> selection of traffic into those queues and follows the switchdev
> model. I would recommend creating a simple RDMA raw ethernet queue
> driver over the aux device for something like this :)

You lost me here, I'm not seeing how RDMA and macvlan are connected.

> > That might be a bad example, I was thinking of the issues we have had
> > with VFs and direct assignment to Qemu based guests in the past.
>
> As described, this is solved by VDPA.
>
> > Essentially what I am getting at is that the setup in the container
> > should be vendor agnostic. The interface exposed shouldn't be specific
> > to any one vendor. So if I want to fire up a container or Mellanox,
> > Broadcom, or some other vendor it shouldn't matter or be visible to
> > the user. They should just see a vendor agnostic subfunction
> > netdevice.
>
> Agree. The agnostic container user interface here is 'struct
> net_device'.

I disagree here. The fact is a Mellanox netdev, versus a Broadcom
netdev, versus an Intel netdev all have a very different look and feel,
as the netdev is essentially just the base device you are building
around.

In addition it still doesn't address my concern as called out above
which is the east-west traffic problem.

> > > I have the feeling this stuff you are asking for is already done..
> >
> > The case you are describing has essentially solved it for Qemu
> > virtualization and direct assignment. It still doesn't necessarily
> > solve it for the container case though.
>
> The container case doesn't need solving.

I disagree, and that is at the heart of where you and I have different
views. I see two advantages to having the container case
solved:
1. A standardized set of features that can be provided regardless of vendor
2. Allowing for the case where east-west traffic can avoid having to
touch hardware

> Any scheme I've heard for container live migration, like CRIU,
> essentially hot plugs the entire kernel in/out of a user process. We
> rely on the kernel providing low leakage of the implementation details
> of the struct net_device as part of it's uAPI contract. When CRIU
> swaps the kernel the new kernel can have any implementation of the
> container netdev it wants.

I'm not thinking about migration. I am thinking more about the user
experience. In my mind if I set up a container I shouldn't need to
know which vendor provided the network interface when I set it up. The
problem is most NICs need so many one-off proprietary tweaks that it
gets annoying. That is why in my mind it would make much more
sense to have a simple vendor agnostic interface. That is why I would
prefer to avoid the VF model.

> I've never heard of a use case to hot swap the implemention *under* a
> netdev from a container. macvlan can't do this today. If you have a
> use case here, it really has nothing to do with with this series.

Again, the hot-swap isn't necessarily what I am talking about. I am
talking about setting up a config for a set of containers in a
datacenter. What I don't want to do is have to have one set of configs
for an mlx5 SF, another for a broadcom SF, and yet another set for any
other vendors out there. I would much rather have all of that dealt
with within the namespace that is handling the switchdev setup.

In addition, the east-west traffic is the other bit I would like to
see addressed. I am okay excusing this in the case of direct
assignment since the resources for the SF will not be available to the
host. However if the SF will be operating in the same kernel as the
PF/switchdev it would make much more sense to enable an east/west
channel which would allow for hardware bypass under certain
circumstances without having to ever leave the kernel.
Jason Gunthorpe Dec. 17, 2020, 7:40 p.m. UTC | #33
On Thu, Dec 17, 2020 at 10:48:48AM -0800, Alexander Duyck wrote:

> Just to clarify I am not with Intel, nor do I plan to work on any
> Intel drivers related to this.

Sure
 
> I disagree here. In my mind a design where two interfaces, which both
> exist in the kernel, have to go to hardware in order to communicate is
> very limiting. The main thing I am wanting to see is the option of
> being able to pass traffic directly between the switchdev and the SF
> without the need to touch the hardware.

I view the SW bypass path you are talking about similarly to
GSO/etc. It should be accessed by the HW driver as an optional service
provided by the core netdev, not implemented as some wrapper netdev
around a HW implementation.

If you feel strongly it is needed then there is nothing standing in
the way to implement it in the switchdev auxdevice model.

It is simple enough, the HW driver's tx path would somehow detect
east/west and queue it differently, and the rx path would somehow be
able to mux in skbs from a SW queue. Not seeing any blockers here.

> > model. I would recommend creating a simple RDMA raw ethernet queue
> > driver over the aux device for something like this :)
> 
> You lost me here, I'm not seeing how RDMA and macvlan are connected.

RDMA is the standard uAPI to get a userspace HW DMA queue for ethernet
packets.
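
For example, a minimal userspace sketch of asking verbs for a raw
ethernet queue pair (error handling trimmed, the first device is picked
blindly, and raw packet QPs need CAP_NET_RAW plus device support):

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
        struct ibv_device **devs = ibv_get_device_list(NULL);
        struct ibv_context *ctx = devs && devs[0] ? ibv_open_device(devs[0]) : NULL;
        struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
        struct ibv_cq *cq = ctx ? ibv_create_cq(ctx, 64, NULL, NULL, 0) : NULL;

        struct ibv_qp_init_attr attr = {
                .send_cq = cq,
                .recv_cq = cq,
                .cap = {
                        .max_send_wr = 64, .max_recv_wr = 64,
                        .max_send_sge = 1, .max_recv_sge = 1,
                },
                .qp_type = IBV_QPT_RAW_PACKET, /* plain ethernet queues */
        };
        struct ibv_qp *qp = (pd && cq) ? ibv_create_qp(pd, &attr) : NULL;

        printf("raw packet QP: %s\n", qp ? "created" : "not available");

        /* Real code would ibv_reg_mr() buffers and post them with
         * ibv_post_recv()/ibv_post_send(). */
        return 0;
}

Link with -libverbs; which traffic actually lands in those queues is
decided by steering (e.g. ibv_create_flow()), which is exactly the part
the switchdev side keeps control of.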

> > > Essentially what I am getting at is that the setup in the container
> > > should be vendor agnostic. The interface exposed shouldn't be specific
> > > to any one vendor. So if I want to fire up a container or Mellanox,
> > > Broadcom, or some other vendor it shouldn't matter or be visible to
> > > the user. They should just see a vendor agnostic subfunction
> > > netdevice.
> >
> > Agree. The agnostic container user interface here is 'struct
> > net_device'.
> 
> I disagree here. The fact is a mellanox netdev, versus a broadcom
> netdev, versus an intel netdev all have a very different look at feel
> as the netdev is essentially just the base device you are building
> around.

Then fix the lack of standardization of netdev implementations!

Adding more abstraction layers isn't going to fix that fundamental
problem.

Frankly it seems a bit absurd to complain that the very basic element
of the common kernel uAPI - struct net_device - is so horribly
fragmented and vendor polluted that we can't rely on it as a stable
interface for containers.

Even if that is true, I don't believe for a second that adding a
different HW abstraction layer is going to somehow undo the mistakes
of the last 20 years.

> Again, the hot-swap isn't necessarily what I am talking about. I am
> talking about setting up a config for a set of containers in a
> datacenter. What I don't want to do is have to have one set of configs
> for an mlx5 SF, another for a broadcom SF, and yet another set for any
> other vendors out there. I would much rather have all of that dealt
> with within the namespace that is handling the switchdev setup.

If there is real problems here then I very much encourage you to start
an effort to push all the vendors to implement a consistent user
experience for the HW netdevs.

I don't know what your issues are, but it sounds like it would be a
very interesting conference presentation.

But it has nothing to do with this series.

Jason
Alexander Duyck Dec. 17, 2020, 9:05 p.m. UTC | #34
On Thu, Dec 17, 2020 at 11:40 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Dec 17, 2020 at 10:48:48AM -0800, Alexander Duyck wrote:
>
> > Just to clarify I am not with Intel, nor do I plan to work on any
> > Intel drivers related to this.
>
> Sure
>
> > I disagree here. In my mind a design where two interfaces, which both
> > exist in the kernel, have to go to hardware in order to communicate is
> > very limiting. The main thing I am wanting to see is the option of
> > being able to pass traffic directly between the switchdev and the SF
> > without the need to touch the hardware.
>
> I view the SW bypass path you are talking about similarly to
> GSO/etc. It should be accessed by the HW driver as an optional service
> provided by the core netdev, not implemented as some wrapper netdev
> around a HW implementation.

I view it as being something that would be a part of the switchdev API
itself. Basically the switchdev and endpoint would need to be able to
control something like this because if XDP were enabled on one end or
the other you would need to be able to switch it off so that all of
the packets followed the same flow and could be scanned by the XDP
program.

> If you feel strongly it is needed then there is nothing standing in
> the way to implement it in the switchdev auxdevice model.
>
> It is simple enough, the HW driver's tx path would somehow detect
> east/west and queue it differently, and the rx path would somehow be
> able to mux in skbs from a SW queue. Not seeing any blockers here.

In my mind the simple proof of concept for this would be to check for
the multicast bit being set in the destination MAC address for packets
coming from the subfunction. If it is then shunt to this bypass route,
and if not then you transmit to the hardware queues. In the case of
packets coming from the switchdev port it would probably depend. The
part I am not sure about is if any packets need to be actually
transmitted to the hardware in the standard case for packets going
from the switchdev port to the subfunction. If there is no XDP or
anything like that present in the subfunction it probably wouldn't
matter and you could just shunt it straight across and bypass the
hardware, however if XDP is present you would need to get the packets
into the ring which would force the bypass to be turned off.
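
As a very rough sketch of that proof of concept (everything named
example_* is hypothetical; is_multicast_ether_addr() and
dev_forward_skb() are the real helpers):

#include <linux/etherdevice.h>
#include <linux/netdevice.h>

/* Hypothetical SF netdev private data: a pointer to the paired
 * switchdev/representor netdev and a flag cleared when e.g. XDP is
 * attached on either side. */
struct example_sf_priv {
        struct net_device *repr;
        bool bypass_allowed;
};

/* Placeholder for the driver's real descriptor/doorbell Tx path. */
static netdev_tx_t example_hw_xmit(struct sk_buff *skb, struct net_device *dev)
{
        dev_kfree_skb_any(skb);
        return NETDEV_TX_OK;
}

static netdev_tx_t example_sf_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct example_sf_priv *priv = netdev_priv(dev);

        /* Multicast/broadcast from the subfunction: shunt it straight to
         * the paired switchdev netdev in software, skipping the NIC. */
        if (priv->bypass_allowed &&
            is_multicast_ether_addr(eth_hdr(skb)->h_dest)) {
                dev_forward_skb(priv->repr, skb); /* consumes the skb */
                return NETDEV_TX_OK;
        }

        /* Unicast (or bypass disabled): normal hardware queues. */
        return example_hw_xmit(skb, dev);
}

The Rx-side mux would be the mirror image, with the switchdev netdev
injecting those skbs locally instead of waiting for them to come back
up from the NIC; whether that is worth the extra checks in the fast
path is exactly the open question.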

> > > model. I would recommend creating a simple RDMA raw ethernet queue
> > > driver over the aux device for something like this :)
> >
> > You lost me here, I'm not seeing how RDMA and macvlan are connected.
>
> RDMA is the standard uAPI to get a userspace HW DMA queue for ethernet
> packets.

Ah, I think you are talking about device assignment. In my mind I was
just talking about the interface assigned to the container which as
you have stated is basically just a netdev.

> > > > Essentially what I am getting at is that the setup in the container
> > > > should be vendor agnostic. The interface exposed shouldn't be specific
> > > > to any one vendor. So if I want to fire up a container or Mellanox,
> > > > Broadcom, or some other vendor it shouldn't matter or be visible to
> > > > the user. They should just see a vendor agnostic subfunction
> > > > netdevice.
> > >
> > > Agree. The agnostic container user interface here is 'struct
> > > net_device'.
> >
> > I disagree here. The fact is a mellanox netdev, versus a broadcom
> > netdev, versus an intel netdev all have a very different look at feel
> > as the netdev is essentially just the base device you are building
> > around.
>
> Then fix the lack of standardization of netdev implementations!

We're trying to work on that, but trying to fix it after the fact is
like herding cats.

> Adding more abstraction layers isn't going to fix that fundamental
> problem.
>
> Frankly it seems a bit absurd to complain that the very basic element
> of the common kernel uAPI - struct net_device - is so horribly
> fragmented and vendor polluted that we can't rely on it as a stable
> interface for containers.

The problem isn't necessarily the net_device, it is more the
net_device_ops and the fact that there are so many different ways to
get things done. Arguably the flexibility of the net_device is great
for allowing vendors to expose their features. However at the same
time it allows for features to be left out, so what you end up with is
a wide variety of things that are net_devices.

> Even if that is true, I don't belive for a second that adding a
> different HW abstraction layer is going to somehow undo the mistakes
> of the last 20 years.

It depends on how it is done. The general idea is to address the
biggest limitation that has occurred, which is the fact that in many
cases we don't have software offloads to take care of things when the
hardware offloads provided by a certain piece of hardware are not
present. It would basically allow us to reset the feature set. If
something cannot be offloaded in software in a reasonable way, it is
not allowed to be present in the interface provided to a container.
That way instead of having to do all the custom configuration in the
container recipe it can be centralized to one container handling all
of the switching and hardware configuration.

> > Again, the hot-swap isn't necessarily what I am talking about. I am
> > talking about setting up a config for a set of containers in a
> > datacenter. What I don't want to do is have to have one set of configs
> > for an mlx5 SF, another for a broadcom SF, and yet another set for any
> > other vendors out there. I would much rather have all of that dealt
> > with within the namespace that is handling the switchdev setup.
>
> If there is real problems here then I very much encourage you to start
> an effort to push all the vendors to implement a consistent user
> experience for the HW netdevs.

To some extent that has been going on for some time. It is one of the
reasons why there are supposed to be software offloads for any datapath
features that get added to the hardware, such as GSO to offset TSO.
However there always ends up being the occasional thing that gets
past, and that is where the frustration comes in.

> I don't know what your issues are, but it sounds like it would be a
> very interesting conference presentation.
>
> But it has nothing to do with this series.
>
> Jason

There I disagree. Now I can agree that most of the series is about
presenting the aux device, and that part I am fine with. However, when
the aux device is a netdev and that netdev is being loaded into the
same kernel as the switchdev port, that is where the red flags start
flying, especially when we start talking about how it is the same as a
VF.

In my mind we are talking about how the switchdev will behave and it
makes sense to see about defining whether an east-west bypass makes sense
and how it could be implemented, rather than saying we won't bother
for now and potentially locking in the subfunction to virtual function
equality. In my mind we need more than just the increased count to
justify going to subfunctions, and I think being able to solve the
east-west problem at least in terms of containers would be such a
thing.
Jason Gunthorpe Dec. 18, 2020, 12:08 a.m. UTC | #35
On Thu, Dec 17, 2020 at 01:05:03PM -0800, Alexander Duyck wrote:

> > I view the SW bypass path you are talking about similarly to
> > GSO/etc. It should be accessed by the HW driver as an optional service
> > provided by the core netdev, not implemented as some wrapper netdev
> > around a HW implementation.
> 
> I view it as being something that would be a part of the switchdev API
> itself. Basically the switchev and endpoint would need to be able to
> control something like this because if XDP were enabled on one end or
> the other you would need to be able to switch it off so that all of
> the packets followed the same flow and could be scanned by the XDP
> program.

To me that still all comes down to being something like an optional
offload that the HW driver can trigger if the conditions are met.

> > It is simple enough, the HW driver's tx path would somehow detect
> > east/west and queue it differently, and the rx path would somehow be
> > able to mux in skbs from a SW queue. Not seeing any blockers here.
> 
> In my mind the simple proof of concept for this would be to check for
> the multicast bit being set in the destination MAC address for packets
> coming from the subfunction. If it is then shunt to this bypass route,
> and if not then you transmit to the hardware queues. 

Sure. I am not sure a multicast optimization like this isn't incredibly
niche too, but it would be an interesting path to explore.

But again, there is nothing fundamental about the model here that
precludes this optional optimization.

> > Even if that is true, I don't belive for a second that adding a
> > different HW abstraction layer is going to somehow undo the mistakes
> > of the last 20 years.
> 
> It depends on how it is done. The general idea is to address the
> biggest limitation that has occured, which is the fact that in many
> cases we don't have software offloads to take care of things when the
> hardware offloads provided by a certain piece of hardware are not
> present. 

This is really disappointing to hear. Admittedly I don't follow all
the twists and turns on the mailing list, but I thought having a SW
version of everything was one of the fundamental tenets of netdev
that truly distinguished it from something like RDMA.

> It would basically allow us to reset the feature set. If something
> cannot be offloaded in software in a reasonable way, it is not
> allowed to be present in the interface provided to a container.
> That way instead of having to do all the custom configuration in the
> container recipe it can be centralized to one container handling all
> of the switching and hardware configuration.

Well, you could start by blocking stuff without a SW fallback..

> There I disagree. Now I can agree that most of the series is about
> presenting the aux device and that part I am fine with. However when
> the aux device is a netdev and that netdev is being loaded into the
> same kernel as the switchdev port is where the red flags start flying,
> especially when we start talking about how it is the same as a VF.

Well, it happens for the same reason a VF can create a netdev,
stopping it would actually be more patches. As I said before, people
are already doing this model with VFs.

I can agree with some of your points, but this is not the series to
argue them. What you want is to start some new thread on optimizing
switchdev for the container user case.

> In my mind we are talking about how the switchdev will behave and it
> makes sense to see about defining if a east-west bypass makes sense
> and how it could be implemented, rather than saying we won't bother
> for now and potentially locking in the subfunction to virtual function
> equality.

At least for mlx5 SF == VF, that is a consequence of the HW. Any SW
bypass would need to be specially built in the mlx5 netdev running on
a VF/SF attached to a switchdev port.

I don't see anything about this part of the model that precludes ever
doing that, and I also don't see this optimization as being valuable
enough to block things "just to be sure"

> In my mind we need more than just the increased count to justify
> going to subfunctions, and I think being able to solve the east-west
> problem at least in terms of containers would be such a thing.

Increased count is pretty important for users with SRIOV.

Jason
David Ahern Dec. 18, 2020, 1:30 a.m. UTC | #36
On 12/16/20 3:53 PM, Alexander Duyck wrote:
> The problem in my case was based on a past experience where east-west
> traffic became a problem and it was easily shown that bypassing the
> NIC for traffic was significantly faster.

If a deployment expects a lot of east-west traffic *within a host*, why
is it using hardware-based isolation like a VF? That is a side effect of
a design choice that is remedied by other options.
Alexander Duyck Dec. 18, 2020, 3:11 a.m. UTC | #37
On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 12/16/20 3:53 PM, Alexander Duyck wrote:
> > The problem in my case was based on a past experience where east-west
> > traffic became a problem and it was easily shown that bypassing the
> > NIC for traffic was significantly faster.
>
> If a deployment expects a lot of east-west traffic *within a host* why
> is it using hardware based isolation like a VF. That is a side effect of
> a design choice that is remedied by other options.

I am mostly talking about this from past experience, as I had seen a
few instances when I was at Intel where it became an issue. Sales and
marketing people aren't exactly happy when you tell them "don't sell
that" in response to them trying to sell a feature into an area where
it doesn't belong. Generally they want a solution. The macvlan offload
addressed these issues as the replication and local switching can be
handled in software.

The problem is PCIe DMA wasn't designed to function as a network
switch fabric and when we start talking about a 400Gb NIC trying to
handle over 256 subfunctions it will quickly reduce the
receive/transmit throughput to gigabit or less speeds when
encountering hardware multicast/broadcast replication. With 256
subfunctions a simple 60B ARP could consume more than 19KB of PCIe
bandwidth due to the packet having to be duplicated so many times. In
my mind it should be simpler to simply clone a single skb 256 times,
forward that to the switchdev ports, and have them perform a bypass
(if available) to deliver it to the subfunctions. That's why I was
thinking it might be a good time to look at addressing it.
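
For what it's worth, the software replication half of that could look
roughly like the sketch below; the eswitch/representor structures are
invented here, only skb_clone() and dev_forward_skb() are the real
APIs:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/list.h>

/* Hypothetical bookkeeping in the switchdev driver. */
struct example_rep {
        struct list_head list;
        struct net_device *sf_netdev;   /* local SF netdev, NULL if assigned away */
};

struct example_eswitch {
        struct list_head reps;
};

/* Replicate one broadcast/multicast frame to every locally hosted SF in
 * software: one skb clone per SF instead of one PCIe DMA per SF. */
static void example_flood_to_local_sfs(struct example_eswitch *esw,
                                       struct sk_buff *skb)
{
        struct example_rep *rep;

        list_for_each_entry(rep, &esw->reps, list) {
                struct sk_buff *clone;

                if (!rep->sf_netdev)
                        continue;       /* SF not visible in this kernel */

                clone = skb_clone(skb, GFP_ATOMIC);
                if (!clone)
                        break;
                dev_forward_skb(rep->sf_netdev, clone);
        }
        consume_skb(skb);
}

dev_forward_skb() here just stands in for whatever bypass the paired
switchdev/SF netdevs would expose; SFs assigned out of the host would
still need the hardware replication path.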
David Ahern Dec. 18, 2020, 3:55 a.m. UTC | #38
On 12/17/20 8:11 PM, Alexander Duyck wrote:
> On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote:
>>
>> On 12/16/20 3:53 PM, Alexander Duyck wrote:
>>> The problem in my case was based on a past experience where east-west
>>> traffic became a problem and it was easily shown that bypassing the
>>> NIC for traffic was significantly faster.
>>
>> If a deployment expects a lot of east-west traffic *within a host* why
>> is it using hardware based isolation like a VF. That is a side effect of
>> a design choice that is remedied by other options.
> 
> I am mostly talking about this from past experience as I had seen a
> few instances when I was at Intel when it became an issue. Sales and
> marketing people aren't exactly happy when you tell them "don't sell
> that" in response to them trying to sell a feature into an area where

that's a problem engineers can never solve...

> it doesn't belong. Generally they want a solution. The macvlan offload
> addressed these issues as the replication and local switching can be
> handled in software.

well, I guess almost never. :-)

> 
> The problem is PCIe DMA wasn't designed to function as a network
> switch fabric and when we start talking about a 400Gb NIC trying to
> handle over 256 subfunctions it will quickly reduce the
> receive/transmit throughput to gigabit or less speeds when
> encountering hardware multicast/broadcast replication. With 256
> subfunctions a simple 60B ARP could consume more than 19KB of PCIe
> bandwidth due to the packet having to be duplicated so many times. In
> my mind it should be simpler to simply clone a single skb 256 times,
> forward that to the switchdev ports, and have them perform a bypass
> (if available) to deliver it to the subfunctions. That's why I was
> thinking it might be a good time to look at addressing it.
> 

east-west traffic within a host is more than likely the same tenant in
which case a proper VPC is a better solution than the s/w stack trying
to detect and guess that a bypass is needed. Guesses cost cycles in the
fast path which is a net loss - and even more so as speeds increase.
Parav Pandit Dec. 18, 2020, 5:20 a.m. UTC | #39
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Sent: Friday, December 18, 2020 8:41 AM
> 
> On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote:
> >
> > On 12/16/20 3:53 PM, Alexander Duyck wrote:
> The problem is PCIe DMA wasn't designed to function as a network switch
> fabric and when we start talking about a 400Gb NIC trying to handle over 256
> subfunctions it will quickly reduce the receive/transmit throughput to gigabit
> or less speeds when encountering hardware multicast/broadcast replication.
> With 256 subfunctions a simple 60B ARP could consume more than 19KB of
> PCIe bandwidth due to the packet having to be duplicated so many times. In
> my mind it should be simpler to simply clone a single skb 256 times, forward
> that to the switchdev ports, and have them perform a bypass (if available) to
> deliver it to the subfunctions. That's why I was thinking it might be a good
> time to look at addressing it.
The Linux tc framework is rich enough to address this and has already been used by openvswitch for years now.
Today arp broadcasts are not offloaded. They go through software patch and replicated in the L2 domain.
It has been a solved problem for many years now.
Parav Pandit Dec. 18, 2020, 5:36 a.m. UTC | #40
> From: Parav Pandit <parav@nvidia.com>
> Sent: Friday, December 18, 2020 10:51 AM
> 
> > From: Alexander Duyck <alexander.duyck@gmail.com>
> > Sent: Friday, December 18, 2020 8:41 AM
> >
> > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote:
> > >
> > > On 12/16/20 3:53 PM, Alexander Duyck wrote:
> > The problem is PCIe DMA wasn't designed to function as a network
> > switch fabric and when we start talking about a 400Gb NIC trying to
> > handle over 256 subfunctions it will quickly reduce the
> > receive/transmit throughput to gigabit or less speeds when encountering
> hardware multicast/broadcast replication.
> > With 256 subfunctions a simple 60B ARP could consume more than 19KB of
> > PCIe bandwidth due to the packet having to be duplicated so many
> > times. In my mind it should be simpler to simply clone a single skb
> > 256 times, forward that to the switchdev ports, and have them perform
> > a bypass (if available) to deliver it to the subfunctions. That's why
> > I was thinking it might be a good time to look at addressing it.
> Linux tc framework is rich to address this and already used by openvswich for
> years now.
> Today arp broadcasts are not offloaded. They go through software patch and
s/patch/path

> replicated in the L2 domain.
> It is a solved problem for many years now.
Alexander Duyck Dec. 18, 2020, 3:54 p.m. UTC | #41
On Thu, Dec 17, 2020 at 7:55 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 12/17/20 8:11 PM, Alexander Duyck wrote:
> > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote:
> >>
> >> On 12/16/20 3:53 PM, Alexander Duyck wrote:
> >>> The problem in my case was based on a past experience where east-west
> >>> traffic became a problem and it was easily shown that bypassing the
> >>> NIC for traffic was significantly faster.
> >>
> >> If a deployment expects a lot of east-west traffic *within a host* why
> >> is it using hardware based isolation like a VF. That is a side effect of
> >> a design choice that is remedied by other options.
> >
> > I am mostly talking about this from past experience as I had seen a
> > few instances when I was at Intel when it became an issue. Sales and
> > marketing people aren't exactly happy when you tell them "don't sell
> > that" in response to them trying to sell a feature into an area where
>
> that's a problem engineers can never solve...
>
> > it doesn't belong. Generally they want a solution. The macvlan offload
> > addressed these issues as the replication and local switching can be
> > handled in software.
>
> well, I guess almost never. :-)
>
> >
> > The problem is PCIe DMA wasn't designed to function as a network
> > switch fabric and when we start talking about a 400Gb NIC trying to
> > handle over 256 subfunctions it will quickly reduce the
> > receive/transmit throughput to gigabit or less speeds when
> > encountering hardware multicast/broadcast replication. With 256
> > subfunctions a simple 60B ARP could consume more than 19KB of PCIe
> > bandwidth due to the packet having to be duplicated so many times. In
> > my mind it should be simpler to simply clone a single skb 256 times,
> > forward that to the switchdev ports, and have them perform a bypass
> > (if available) to deliver it to the subfunctions. That's why I was
> > thinking it might be a good time to look at addressing it.
> >
>
> east-west traffic within a host is more than likely the same tenant in
> which case a proper VPC is a better solution than the s/w stack trying
> to detect and guess that a bypass is needed. Guesses cost cycles in the
> fast path which is a net loss - and even more so as speeds increase.

Yes, but then the hardware limitations end up deciding the layout of
the network. I lean towards more flexibility and more configuration
options being a good thing, rather than us having to dictate how a
network is constructed based on the limitations of the hardware and
software.

For broadcast/multicast it isn't so much a guess; it would be a single
bit test. My understanding is that the switchdev setup already makes
special cases for things like broadcast/multicast due to the extra
overhead incurred. I mentioned ARP because in many cases it has to be
offloaded specifically due to these sorts of issues.
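
As a minimal sketch of that single-bit test (needs_host_replication()
is a made-up name here, but is_multicast_ether_addr() is the existing
helper that checks the group bit of the destination MAC and therefore
also covers broadcast):

#include <linux/etherdevice.h>
#include <linux/if_ether.h>
#include <linux/skbuff.h>

/* Sketch only: the replicate-in-software decision reduces to the I/G
 * bit of the destination MAC address. */
static bool needs_host_replication(const struct sk_buff *skb)
{
        return is_multicast_ether_addr(eth_hdr(skb)->h_dest);
}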
Alexander Duyck Dec. 18, 2020, 4:01 p.m. UTC | #42
On Thu, Dec 17, 2020 at 9:20 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Alexander Duyck <alexander.duyck@gmail.com>
> > Sent: Friday, December 18, 2020 8:41 AM
> >
> > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote:
> > >
> > > On 12/16/20 3:53 PM, Alexander Duyck wrote:
> > The problem is PCIe DMA wasn't designed to function as a network switch
> > fabric and when we start talking about a 400Gb NIC trying to handle over 256
> > subfunctions it will quickly reduce the receive/transmit throughput to gigabit
> > or less speeds when encountering hardware multicast/broadcast replication.
> > With 256 subfunctions a simple 60B ARP could consume more than 19KB of
> > PCIe bandwidth due to the packet having to be duplicated so many times. In
> > my mind it should be simpler to simply clone a single skb 256 times, forward
> > that to the switchdev ports, and have them perform a bypass (if available) to
> > deliver it to the subfunctions. That's why I was thinking it might be a good
> > time to look at addressing it.
> Linux tc framework is rich to address this and already used by openvswich for years now.
> Today arp broadcasts are not offloaded. They go through software path and replicated in the L2 domain.
> It is a solved problem for many years now.

When you say they are replicated in the L2 domain I assume you are
talking about the software switch connected to the switchdev ports. My
question is what are you doing with them after you have replicated
them? I'm assuming they are being sent to the other switchdev ports
which will require a DMA to transmit them, and another to receive them
on the VF/SF, or are you saying something else is going on here?

My argument is that this cuts into both the transmit and receive DMA
bandwidth of the NIC, and could easily be avoided in the case where SF
exists in the same kernel as the switchdev port by identifying the
multicast bit being set and simply bypassing the device.
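
Roughly, the bypass being described would amount to something like the
sketch below; sf_local_deliver() is a hypothetical hook standing in
for whatever mechanism would hand the clone to a subfunction that
lives in the same kernel, rather than DMA-ing it out through the
representor and back in through the SF:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical hook: deliver a clone straight to the local SF/VF
 * behind this representor, skipping the Tx/Rx DMA round trip. */
void sf_local_deliver(struct net_device *rep, struct sk_buff *skb);

static void flood_bypass(struct sk_buff *skb,
                         struct net_device **reps, int n_reps)
{
        int i;

        for (i = 0; i < n_reps; i++) {
                struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);

                if (!clone)
                        continue;
                sf_local_deliver(reps[i], clone);
        }
}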
Parav Pandit Dec. 18, 2020, 6:01 p.m. UTC | #43
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Sent: Friday, December 18, 2020 9:31 PM
> 
> On Thu, Dec 17, 2020 at 9:20 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Alexander Duyck <alexander.duyck@gmail.com>
> > > Sent: Friday, December 18, 2020 8:41 AM
> > >
> > > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com>
> wrote:
> > > >
> > > > On 12/16/20 3:53 PM, Alexander Duyck wrote:
> > > The problem is PCIe DMA wasn't designed to function as a network
> > > switch fabric and when we start talking about a 400Gb NIC trying to
> > > handle over 256 subfunctions it will quickly reduce the
> > > receive/transmit throughput to gigabit or less speeds when encountering
> hardware multicast/broadcast replication.
> > > With 256 subfunctions a simple 60B ARP could consume more than 19KB
> > > of PCIe bandwidth due to the packet having to be duplicated so many
> > > times. In my mind it should be simpler to simply clone a single skb
> > > 256 times, forward that to the switchdev ports, and have them
> > > perform a bypass (if available) to deliver it to the subfunctions.
> > > That's why I was thinking it might be a good time to look at addressing it.
> > Linux tc framework is rich to address this and already used by openvswich
> for years now.
> > Today arp broadcasts are not offloaded. They go through software path
> and replicated in the L2 domain.
> > It is a solved problem for many years now.
> 
> When you say they are replicated in the L2 domain I assume you are talking
> about the software switch connected to the switchdev ports. 
Yes.

> My question is
> what are you doing with them after you have replicated them? I'm assuming
> they are being sent to the other switchdev ports which will require a DMA to
> transmit them, and another to receive them on the VF/SF, or are you saying
> something else is going on here?
> 
Yes, that is correct.

> My argument is that this cuts into both the transmit and receive DMA
> bandwidth of the NIC, and could easily be avoided in the case where SF
> exists in the same kernel as the switchdev port by identifying the multicast
> bit being set and simply bypassing the device.
It probably can be avoided, but it's probably not worth it for the occasional ARP packets on a neighbor cache miss.
If I am not mistaken, even some recent HW can forward such ARP packets to multiple switchdev ports with commit 7ee3f6d2486e, without following the above-described DMA path.
Alexander Duyck Dec. 18, 2020, 7:22 p.m. UTC | #44
On Fri, Dec 18, 2020 at 10:01 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Alexander Duyck <alexander.duyck@gmail.com>
> > Sent: Friday, December 18, 2020 9:31 PM
> >
> > On Thu, Dec 17, 2020 at 9:20 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Alexander Duyck <alexander.duyck@gmail.com>
> > > > Sent: Friday, December 18, 2020 8:41 AM
> > > >
> > > > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com>
> > wrote:
> > > > >
> > > > > On 12/16/20 3:53 PM, Alexander Duyck wrote:
> > > > The problem is PCIe DMA wasn't designed to function as a network
> > > > switch fabric and when we start talking about a 400Gb NIC trying to
> > > > handle over 256 subfunctions it will quickly reduce the
> > > > receive/transmit throughput to gigabit or less speeds when encountering
> > hardware multicast/broadcast replication.
> > > > With 256 subfunctions a simple 60B ARP could consume more than 19KB
> > > > of PCIe bandwidth due to the packet having to be duplicated so many
> > > > times. In my mind it should be simpler to simply clone a single skb
> > > > 256 times, forward that to the switchdev ports, and have them
> > > > perform a bypass (if available) to deliver it to the subfunctions.
> > > > That's why I was thinking it might be a good time to look at addressing it.
> > > Linux tc framework is rich to address this and already used by openvswich
> > for years now.
> > > Today arp broadcasts are not offloaded. They go through software path
> > and replicated in the L2 domain.
> > > It is a solved problem for many years now.
> >
> > When you say they are replicated in the L2 domain I assume you are talking
> > about the software switch connected to the switchdev ports.
> Yes.
>
> > My question is
> > what are you doing with them after you have replicated them? I'm assuming
> > they are being sent to the other switchdev ports which will require a DMA to
> > transmit them, and another to receive them on the VF/SF, or are you saying
> > something else is going on here?
> >
> Yes, that is correct.
>
> > My argument is that this cuts into both the transmit and receive DMA
> > bandwidth of the NIC, and could easily be avoided in the case where SF
> > exists in the same kernel as the switchdev port by identifying the multicast
> > bit being set and simply bypassing the device.
> It probably can be avoided but its probably not worth for occasional ARP packets on neighbor cache miss.
> If I am not mistaken, even some recent HW can forward such ARP packets to multiple switchdev ports with commit 7ee3f6d2486e without following the above described DMA path.

Even with that it sounds like it will have to DMA the packet to
multiple Rx destinations even if it is only performing the Tx DMA
once. The Intel NICs did all this replication in hardware as well so
that is what I was thinking of when I was talking about the
replication behavior seen with SR-IOV.

Basically what I am getting at is that this could be used as an
architectural feature for switchdev to avoid the increased DMA
overhead for broadcast, multicast and unknown-unicast traffic. I'm not
saying this is anything mandatory, and I would be perfectly okay with
something like this being optional and defaulting to off. In my mind
the interfaces would then only be handling traffic to single-point
destinations, so at most you are looking at a 2x bump in PCIe
bandwidth for those cases where the packet ends up needing to go out
the physical port. It would essentially be a software offload to avoid
saturating the PCIe bus.

This setup would only work if both interfaces are present in the same
kernel, though, which is why I chose to bring it up now; historically
SR-IOV hasn't normally been associated with containers due to the
limited number of interfaces that could be created.

Also as far as the patch count complaints I have seen in a few threads
I would be fine with splitting things up so that the devlink and aux
device creation get handled in one set, and then we work out the
details of mlx5 attaching to the devices and spawning of the SF
netdevs in another since that seems to be where the debate is.
Jason Gunthorpe Dec. 18, 2020, 8:18 p.m. UTC | #45
On Fri, Dec 18, 2020 at 11:22:12AM -0800, Alexander Duyck wrote:

> Also as far as the patch count complaints I have seen in a few threads
> I would be fine with splitting things up so that the devlink and aux
> device creation get handled in one set, and then we work out the
> details of mlx5 attaching to the devices and spawning of the SF
> netdevs in another since that seems to be where the debate is.

It doesn't work like that. The aux device creates a mlx5_core and
every mlx5_core can run mlx5_en.

This really isn't the series to raise this feature request. Adding an
optional short cut path to VF/SF is something that can be done later
if up to date benchmarks show it has value. There is no blocker in
this model to doing that.

Jason
Alexander Duyck Dec. 19, 2020, 12:03 a.m. UTC | #46
On Fri, Dec 18, 2020 at 12:18 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Dec 18, 2020 at 11:22:12AM -0800, Alexander Duyck wrote:
>
> > Also as far as the patch count complaints I have seen in a few threads
> > I would be fine with splitting things up so that the devlink and aux
> > device creation get handled in one set, and then we work out the
> > details of mlx5 attaching to the devices and spawning of the SF
> > netdevs in another since that seems to be where the debate is.
>
> It doesn't work like that. The aux device creates a mlx5_core and
> every mlx5_core can run mlx5_en.

The aux device is still just a device, isn't it? Until you register a
driver that latches on to "MLX5_SF_DEV_ID_NAME", the device is there
and should function like an unbound/idle VF.
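
For reference, "latching on" via the auxiliary bus looks roughly like
the sketch below; the probe/remove bodies are elided and the
"mlx5_core.sf" match string is an assumption about what
MLX5_SF_DEV_ID_NAME expands to in this series:

#include <linux/auxiliary_bus.h>
#include <linux/module.h>

static int sf_probe(struct auxiliary_device *adev,
                    const struct auxiliary_device_id *id)
{
        /* Set up the mlx5_core instance for this SF here. */
        return 0;
}

static void sf_remove(struct auxiliary_device *adev)
{
        /* Tear the SF instance back down. */
}

static const struct auxiliary_device_id sf_id_table[] = {
        { .name = "mlx5_core.sf" },
        {}
};
MODULE_DEVICE_TABLE(auxiliary, sf_id_table);

static struct auxiliary_driver sf_driver = {
        .name = "sf",
        .probe = sf_probe,
        .remove = sf_remove,
        .id_table = sf_id_table,
};
module_auxiliary_driver(sf_driver);

MODULE_LICENSE("GPL");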

> This really isn't the series to raise this feature request. Adding an
> optional short cut path to VF/SF is something that can be done later
> if up to date benchmarks show it has value. There is no blocker in
> this model to doing that.

That is the point of contention that we probably won't solve. I feel
like if we are going to put out something that is an alternative to
the macvlan and SR-IOV approaches it should address the issues
resolved in both. Right now this just throws up yet another
virtualization solution that ends up reintroducing many of the
existing problems we already had with SR-IOV, and possibly
exacerbating them by allowing for an even greater number of
subfunctions.

Anyway, I have made my opinion known, and I am not in a position to
block the patches since I am not the maintainer.