[pull,request,net-next,V10,00/14] Add mlx5 subfunction support

Message ID 20210122193658.282884-1-saeed@kernel.org (mailing list archive)
State Not Applicable

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2021-01-13

Message

Saeed Mahameed Jan. 22, 2021, 7:36 p.m. UTC
From: Saeed Mahameed <saeedm@nvidia.com>

Hi Dave, Jakub, Jason,

This series from Parav was the theme of this mlx5 release cycle;
we've been waiting anxiously for the auxbus infrastructure to make it into
the kernel, and now that auxbus is in and all the stars are aligned, I
can finally submit this patchset of the devlink and mlx5 subfunction support.

For more detailed information about subfunctions, please see the detailed tag
log below.

Please pull and let me know if there's any problem.

Thanks,
Saeed.

---
Changelog:
v9->v10:
 - Remove redundant statement from patch #4 commit message
 - Minor grammar improvement in SF documentation patch

v8->v9:
 - Use proper function docs in patches #3, #4

v7->v8:
 - Address documentation-related comments from Jakub that were missed on v5

v6->v7:
 - Resolve new kdoc warning

v5->v6:
 - updated docs and corrected spelling and typos per the previous review
 - used shortened macro names
 - used the updated callback to return the port index
 - updated the commit message example for the add command return fields
 - corrected the driver name suffix from 'mlx5_core' to 'sf'
 - used the MLX5_ADEV_NAME prefix to match other mlx5 auxiliary devices
 - fixed the sf allocated condition
 - used 80 character alignment
 - shortened the enum type names and enum values from
   PORT_FUNCTION to PORT_FN
 - returned the port attributes of the newly created port
 - moved the port add and delete callback pointer checks before preparing
   attributes for the driver
 - added a comment to clarify the desired port index during the add
   callback
 - placed the SF number attribute only when the port flavour is SF
 - packed the sf attribute structure
 - removed the external flag for sf from the initial patchset

v4->v5:
 - Fix some typos in the documentation
 
v3->v4:
 - Fix 32-bit compilation issue

v2->v3:
 - added header file sf/priv.h to cmd.c to avoid missing prototype warning
 - made mlx5_sf_table_disable a static function as it's used only in one file

v1->v2:
 - added documentation for subfunction and its mlx5 implementation
 - add MLX5_SF config option documentation
 - rebased
 - dropped devlink global lock improvement patch as mlx5 doesn't support
   reload while SFs are allocated
 - dropped devlink reload lock patch as mlx5 doesn't support reload
   when SFs are allocated
 - using the updated vhca event from the device to add/remove the auxiliary device
 - split sf devlink port allocation and sf hardware context allocation


Thanks,
Saeed.

---

The following changes since commit 7b8fc0103bb51d1d3e1fb5fd67958612e709f883:

  bonding: add a vlan+srcmac tx hashing option (2021-01-19 19:30:32 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2021-01-13

for you to fetch changes up to 142d93d12dc187f6a32aae2048da0c8230636b86:

  net/mlx5: Add devlink subfunction port documentation (2021-01-22 11:32:12 -0800)

----------------------------------------------------------------
mlx5 subfunction support

Parav Pandit Says:
=================

This patchset introduces support for mlx5 subfunction (SF).

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. An mlx5 subfunction has its own function capabilities
and its own resources. This means a subfunction has its own dedicated
queues (txq, rxq, cq, eq). These queues are neither shared with nor stolen
from the parent PCI function.

When a subfunction is RDMA capable, it has its own QP1, GID table and RDMA
resources, neither shared with nor stolen from the parent PCI function.

A subfunction has a dedicated window in the PCI BAR space that is not shared
with the other subfunctions or the parent PCI function. This ensures that all
class devices of the subfunction access only their assigned PCI BAR space.

A subfunction supports eswitch representation through which it supports tc
offloads. The user must configure the eswitch to send/receive packets
from/to the subfunction port.
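
For illustration only, a minimal sketch of steering traffic between the
uplink and a subfunction with a tc flower rule on the subfunction's
representor (the netdev names are placeholders taken from the example
further below):

$ tc qdisc add dev ens2f0npf0sf88 ingress
$ tc filter add dev ens2f0npf0sf88 ingress protocol ip flower skip_sw \
      action mirred egress redirect dev ens2f0np0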

Subfunctions share PCI-level resources such as PCI MSI-X IRQs with
other subfunctions and/or with their parent PCI function.

Patch summary:
--------------
Patches 1 to 4 prepare devlink
Patches 5 to 7 add mlx5 SF device support
Patches 8 to 11 add mlx5 SF devlink port support
Patches 12 to 14 add documentation

Patch-1 prepares code to handle multiple port function attributes
Patch-2 introduces devlink pcisf port flavour similar to pcipf and pcivf
Patch-3 adds port add and delete driver callbacks
Patch-4 adds port function state get and set callbacks
Patch-5 adds mlx5 vhca event notifier support to distribute subfunction
        state change notifications
Patch-6 adds SF auxiliary device
Patch-7 adds SF auxiliary driver
Patch-8 prepares eswitch to handle SF vport
Patch-9 adds eswitch helpers to add/remove SF vport
Patch-10 implements devlink port add/del callbacks
Patch-11 implements devlink port function get/set callbacks
Patch-12 to 14 add documentation
Patch-12 adds devlink port documentation
Patch-13 extends devlink port documentation for subfunctions
Patch-14 adds mlx5 devlink subfunction port documentation

Subfunction support is discussed in detail in RFC [1] and [2].
RFC [1] and its extension [2] describe the requirements, design and proposed
plumbing using devlink, auxiliary bus and sysfs for systemd/udev
support. The functionality of this patchset is best explained using the real
examples further below.

overview:
--------
A subfunction can be created and deleted by a user using the devlink port
add/delete interface.

A subfunction can be configured using devlink port function attributes
before it is activated.

When a subfunction is activated, it results in an auxiliary device on
the host PCI device where it is deployed. A driver binds to the
auxiliary device, which then creates the supported class devices.
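
For illustration, a hedged sketch of how the activated subfunction's
auxiliary device shows up on the auxiliary bus in sysfs (the device name
is taken from the example below):

$ ls /sys/bus/auxiliary/devices/
mlx5_core.sf.4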

example subfunction usage sequence:
-----------------------------------
Change device to switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

Add a devlink port of subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Configure the MAC address of the port function:
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88

Now activate the function:
$ devlink port function set ens2f0npf0sf88 state active

Now use the auxiliary device and class devices:
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4

$ ip link show
127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0np0
129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff

$ rdma dev show
43: rdmap6s0f0: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
44: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112

After use, deactivate the function:
$ devlink port function set ens2f0npf0sf88 state inactive

Now delete the subfunction port:
$ devlink port del ens2f0npf0sf88
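
For illustration, the port and its function attributes can be inspected
at any point using the same port handle as above (output omitted here):

$ devlink port show ens2f0npf0sf88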

[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
[2] https://marc.info/?l=linux-netdev&m=158555928517777&w=2

=================

----------------------------------------------------------------
Parav Pandit (13):
      devlink: Prepare code to fill multiple port function attributes
      devlink: Introduce PCI SF port flavour and port attribute
      devlink: Support add and delete devlink port
      devlink: Support get and set state of port function
      net/mlx5: Introduce vhca state event notifier
      net/mlx5: SF, Add auxiliary device support
      net/mlx5: SF, Add auxiliary device driver
      net/mlx5: E-switch, Add eswitch helpers for SF vport
      net/mlx5: SF, Add port add delete functionality
      net/mlx5: SF, Port function state change support
      devlink: Add devlink port documentation
      devlink: Extend devlink port documentation for subfunctions
      net/mlx5: Add devlink subfunction port documentation

Vu Pham (1):
      net/mlx5: E-switch, Prepare eswitch to handle SF vport

 Documentation/driver-api/auxiliary_bus.rst         |   2 +
 .../device_drivers/ethernet/mellanox/mlx5.rst      | 215 ++++++++
 Documentation/networking/devlink/devlink-port.rst  | 199 ++++++++
 Documentation/networking/devlink/index.rst         |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |  19 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c      |   8 +
 drivers/net/ethernet/mellanox/mlx5/core/devlink.c  |  19 +
 drivers/net/ethernet/mellanox/mlx5/core/eq.c       |   5 +-
 .../mellanox/mlx5/core/esw/acl/egress_ofld.c       |   2 +-
 .../ethernet/mellanox/mlx5/core/esw/devlink_port.c |  41 ++
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |  48 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |  78 +++
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |  47 +-
 drivers/net/ethernet/mellanox/mlx5/core/events.c   |   7 +
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |  60 ++-
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h    |  12 +
 drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c  |  20 +
 drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c   |  49 ++
 .../net/ethernet/mellanox/mlx5/core/sf/dev/dev.c   | 275 ++++++++++
 .../net/ethernet/mellanox/mlx5/core/sf/dev/dev.h   |  55 ++
 .../ethernet/mellanox/mlx5/core/sf/dev/driver.c    | 101 ++++
 .../net/ethernet/mellanox/mlx5/core/sf/devlink.c   | 556 +++++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/sf/hw_table.c  | 233 +++++++++
 .../mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h    |  82 +++
 drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h  |  21 +
 drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h    | 100 ++++
 .../ethernet/mellanox/mlx5/core/sf/vhca_event.c    | 189 +++++++
 .../ethernet/mellanox/mlx5/core/sf/vhca_event.h    |  57 +++
 drivers/net/ethernet/mellanox/mlx5/core/vport.c    |   3 +-
 include/linux/mlx5/driver.h                        |  16 +-
 include/net/devlink.h                              | 100 ++++
 include/uapi/linux/devlink.h                       |  25 +
 net/core/devlink.c                                 | 310 ++++++++++--
 34 files changed, 2917 insertions(+), 47 deletions(-)
 create mode 100644 Documentation/networking/devlink/devlink-port.rst
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h

Comments

Edwin Peer Jan. 24, 2021, 8:47 p.m. UTC | #1
On Fri, Jan 22, 2021 at 11:37 AM Saeed Mahameed <saeed@kernel.org> wrote:

> This series from Parav was the theme of this mlx5 release cycle,
> we've been waiting anxiously for the auxbus infrastructure to make it into
> the kernel, and now as the auxbus is in and all the stars are aligned, I
> can finally submit this patchset of the devlink and mlx5 subfunction support.
>
> For more detailed information about subfunctions please see detailed tag
> log below.

Apologies for the tardy question out of left field, but I've been
thinking about this some more. If I recall, the primary motivation for
this was a means to effectively address more VFs? But, why can't the
device simply expose more bus numbers?

From the PCI spec:

"SR-IOV Devices may consume more than one Bus Number. A VF can be
associated with any Bus Number within
the Device’s Bus Number range - the captured Bus Number plus any
additional Bus Numbers configured by
software. See Section 9.2.1.2 for details.

- Use of multiple Bus Numbers enables a device to support a very large
number of VFs - up to the size
of the Routing ID space minus the bits used to identify intervening busses"

Regards,
Edwin Peer
Parav Pandit Jan. 25, 2021, 10:57 a.m. UTC | #2
Hi Edwin,

> From: Edwin Peer <edwin.peer@broadcom.com>
> Sent: Monday, January 25, 2021 2:17 AM
> 
> On Fri, Jan 22, 2021 at 11:37 AM Saeed Mahameed <saeed@kernel.org>
> wrote:
> 
> > For more detailed information about subfunctions please see detailed tag
> > log below.
> 
> Apologies for the tardy question out of left field, but I've been
> thinking about this some more. If I recall, the primary motivation for
> this was a means to effectively address more VFs? But, why can't the
> device simply expose more bus numbers?

Several weeks back, Jason already answered this VF scaling question from you at discussion [1].

[1] https://lore.kernel.org/netdev/20201216023928.GG552508@nvidia.com/
Jason Gunthorpe Jan. 25, 2021, 1:22 p.m. UTC | #3
On Mon, Jan 25, 2021 at 10:57:14AM +0000, Parav Pandit wrote:
> Hi Edwin,
> 
> > From: Edwin Peer <edwin.peer@broadcom.com>
> > Sent: Monday, January 25, 2021 2:17 AM
> > 
> > On Fri, Jan 22, 2021 at 11:37 AM Saeed Mahameed <saeed@kernel.org>
> > wrote:
> > 
> > > For more detailed information about subfunctions please see detailed tag
> > > log below.
> > 
> > Apologies for the tardy question out of left field, but I've been
> > thinking about this some more. If I recall, the primary motivation for
> > this was a means to effectively address more VFs? But, why can't the
> > device simply expose more bus numbers?
> 
> Several weeks back, Jason already answered this VF scaling question
> from you at discussion [1].

To add a little more colour, the PCI spec design requires a CAM (i.e. a
search) to figure out which function an incoming address is connected
to, because there are no restrictions on how the BARs of each function
have to be laid out.

SRIOV and SF's require a simple linear lookup to learn the "function"
because the BAR space is required to be linear.

Scaling a CAM to high sizes is physically infeasible, so all approaches
to scaling PCI functions go this road of having a single large BAR
space.

Jason
Edwin Peer Jan. 25, 2021, 6:35 p.m. UTC | #4
On Mon, Jan 25, 2021 at 2:57 AM Parav Pandit <parav@nvidia.com> wrote:

> > Apologies for the tardy question out of left field, but I've been
> > thinking about this some more. If I recall, the primary motivation for
> > this was a means to effectively address more VFs? But, why can't the
> > device simply expose more bus numbers?
>
> Several weeks back, Jason already answered this VF scaling question from you at discussion [1].
>
> [1] https://lore.kernel.org/netdev/20201216023928.GG552508@nvidia.com/

True, although I didn't really consider the full cost argument at the
time because the core answer was "They can't". However, the fact is,
PCI can.

Regards,
Edwin Peer
Edwin Peer Jan. 25, 2021, 7:23 p.m. UTC | #5
On Mon, Jan 25, 2021 at 5:22 AM Jason Gunthorpe <jgg@nvidia.com> wrote:

> SRIOV and SF's require a simple linear lookup to learn the "function"
> because the BAR space is required to be linear.

Isn't this still true even for NumVF's > 256? Wouldn't there still be
a contiguous VF BAR space? Don't the routing IDs simply carry on
incrementing by stride, with each being assigned the next slice of the
shared BAR space?

> Scaling a CAM to high sizes is physically infeasible, so all approaches
> to scaling PCI functions go this road of having a single large BAR
> space.

If the above is true, is there really a need to scale up CAM?

Regards,
Edwin Peer
Edwin Peer Jan. 25, 2021, 7:34 p.m. UTC | #6
On Mon, Jan 25, 2021 at 10:35 AM Edwin Peer <edwin.peer@broadcom.com> wrote:

> > Several weeks back, Jason already answered this VF scaling question from you at discussion [1].
> >
> > [1] https://lore.kernel.org/netdev/20201216023928.GG552508@nvidia.com/

Regarding these costs:

> A lot of the trappings that PCI-SIG requires to be implemented in HW
> for a VF, like PCI config space, MSI tables, BAR space, etc. is all
> just dead weight when scaling up to 1000's of VFs.

What do these amount to in practice? Presumably config space is backed
by normal memory controlled by firmware. Do VF's need to expose ECAM?
Also, don't MSI tables come out of the BAR budget? Is the required BAR
space necessarily more than any other addressable unit that can be
delegated to a SF?

Whatever the costs, presumably they need to be weighed against the
complexity costs of the alternative?

Regards,
Edwin Peer
Jason Gunthorpe Jan. 25, 2021, 7:49 p.m. UTC | #7
On Mon, Jan 25, 2021 at 11:23:56AM -0800, Edwin Peer wrote:
> On Mon, Jan 25, 2021 at 5:22 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > SRIOV and SF's require a simple linear lookup to learn the "function"
> > because the BAR space is required to be linear.
> 
> Isn't this still true even for NumVF's > 256? Wouldn't there still be
> a contiguous VF BAR space? Don't the routing IDs simply carry on
> incrementing by stride, with each being assigned the next slice of the
> shared BAR space?

I've never seen someone implement a NumVF > 256 by co-opting the bus
number.

Can Linux even assign more bus numbers to a port without firmware
help? Bus numbers are something that requires the root complex to be
aware of to setup routability.

Jason
Jason Gunthorpe Jan. 25, 2021, 7:59 p.m. UTC | #8
On Mon, Jan 25, 2021 at 11:34:49AM -0800, Edwin Peer wrote:

> What do these amount to in practice? Presumably config space is backed
> by normal memory controlled by firmware. Do VF's need to expose ECAM?
> Also, don't MSI tables come out of the BAR budget? Is the required BAR
> space necessarily more than any other addressable unit that can be
> delegated to a SF?

Every writable data mandated by the PCI spec requires very expensive
on-die SRAM to store it.

We've seen Intel drivers that show their SIOV ADIs don't even have a
register file and the only PCI presence is just a write-only doorbell
page in the BAR.

It is hard to argue a write-only register in a BAR page vs all the
SRIOV trappings when it comes to HW cost.

Jason
Edwin Peer Jan. 25, 2021, 8:05 p.m. UTC | #9
On Mon, Jan 25, 2021 at 11:49 AM Jason Gunthorpe <jgg@nvidia.com> wrote:

> I've never seen someone implement a NumVF > 256 by co-opting the bus
> number.

Usually the VF offset already places the VF routing IDs into a
different bus number range from the PF. That much at least works
today.

> Can Linux even assign more bus numbers to a port without firmware
> help? Bus numbers are something that requires the root complex to be
> aware of to setup routability.

I'm not sure, presumably something already infers this for the first
additional bus number based on the SR-IOV config capability?

Regards,
Edwin Peer
Edwin Peer Jan. 25, 2021, 8:22 p.m. UTC | #10
On Mon, Jan 25, 2021 at 11:59 AM Jason Gunthorpe <jgg@nvidia.com> wrote:

> Every writable data mandated by the PCI spec requires very expensive
> on-die SRAM to store it.

That's an implementation decision. Nothing mandates that the state has
to physically exist in the same structure, only that reads and writes
are appropriately responded to. Parts that are read only could be
generated on the fly and writes can be stored more efficiently.

> We've seen Intel drivers that show their SIOV ADIs don't even have a
> register file and the only PCI presence is just a write-only doorbell
> page in the BAR.

Right, but presumably it still needs to be at least a page. And,
nothing says your device's VF BAR protocol can't be equally simple.

> It is hard to argue a write-only register in a BAR page vs all the
> SRIOV trappings when it comes to HW cost.

Say it's double the cost? I don't know that it is, but does that
warrant the additional complexity of SFs? We should try to quantify.

Regards,
Edwin Peer
Michael Chan Jan. 25, 2021, 8:22 p.m. UTC | #11
On Mon, Jan 25, 2021 at 12:09 PM Edwin Peer <edwin.peer@broadcom.com> wrote:
>
> On Mon, Jan 25, 2021 at 11:49 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > I've never seen someone implement a NumVF > 256 by co-opting the bus
> > number.
>
> Usually the VF offset already places the VF routing IDs into a
> different bus number range from the PF. That much at least works
> today.
>
> > Can Linux even assign more bus numbers to a port without firmware
> > help? Bus numbers are something that requires the root complex to be
> > aware of to setup routability.
>
> I'm not sure, presumably something already infers this for the first
> additional bus number based on the SR-IOV config capability?
>

Yes, this should work.  During enumeration, the PCI core sees that a device
capable of SRIOV needs more than one bus number and will assign the
proper primary and secondary bus numbers to the upstream bridge.
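
For illustration, a hedged sketch of inspecting this on a given device:
the VF offset and stride in the SR-IOV capability show where the VF
routing IDs land (PF address reused from the cover letter; output
omitted):

$ lspci -s 0000:06:00.0 -vvv | grep -i -A 10 'SR-IOV'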
Parav Pandit Jan. 25, 2021, 8:26 p.m. UTC | #12
> From: Edwin Peer <edwin.peer@broadcom.com>
> Sent: Tuesday, January 26, 2021 1:36 AM
> 
> On Mon, Jan 25, 2021 at 11:49 AM Jason Gunthorpe <jgg@nvidia.com>
> wrote:
> 
> > I've never seen someone implement a NumVF > 256 by co-opting the bus
> > number.
> 
> Usually the VF offset already places the VF routing IDs into a
> different bus number range from the PF. That much at least works
> today.
> 
> > Can Linux even assign more bus numbers to a port without firmware
> > help? Bus numbers are something that requires the root complex to be
> > aware of to setup routability.
> 
> I'm not sure, presumably something already infers this for the first
> additional bus number based on the SR-IOV config capability?
> 
It is not inferred.
The Linux PCI core programs the additional registers for the subordinate and secondary bus numbers.
Though this comes with its own extra hw cost.

Keep in mind that 1000 VFs are enabled and disabled in one go at the PCI spec level, and so at the OS level, as opposed to a unit of one here.
PCI also comes with heavy bus-level reset requirements, apart from AER and more.
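
For illustration, a minimal sketch contrasting the two flows (PF address
and SF commands reused from the cover letter; the VF count is arbitrary):

# SR-IOV: all VFs of a PF are enabled and disabled in one shot
$ echo 1000 > /sys/bus/pci/devices/0000:06:00.0/sriov_numvfs
$ echo 0 > /sys/bus/pci/devices/0000:06:00.0/sriov_numvfs

# Subfunctions: each port is added, configured and removed individually
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
$ devlink port del ens2f0npf0sf88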
Jason Gunthorpe Jan. 25, 2021, 8:41 p.m. UTC | #13
On Mon, Jan 25, 2021 at 12:22:13PM -0800, Edwin Peer wrote:
> On Mon, Jan 25, 2021 at 11:59 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > Every writable data mandated by the PCI spec requires very expensive
> > on-die SRAM to store it.
> 
> That's an implementation decision. Nothing mandates that the state has
> to physically exist in the same structure, only that reads and writes
> are appropriately responded to.

Yes, PCI does mandate this, you can't store the data on the other side
of the PCI link, and if you can't cross the PCI link that only leaves
on die/package memory resources.

> > We've seen Intel drivers that show their SIOV ADIs don't even have a
> > register file and the only PCI presence is just a write-only doorbell
> > page in the BAR.
> 
> Right, but presumably it still needs to be at least a page. And,
> nothing says your device's VF BAR protocol can't be equally simple.

Having VFs that are not self-contained would require significant
changing of current infrastructure, if we are going to change things
then let's fix everything instead of some half measure.

SRIOV really doesn't bring many benefits; it has lots of odd little
restrictions and strange lifecycle rules for what modern devices want
to do.

> > It is hard to argue a write-only register in a BAR page vs all the
> > SRIOV trappings when it comes to HW cost.
> 
> Say it's double the cost? I don't know that it is, but does that
> warrant the additional complexity of SFs? We should try to quantify.

The actual complexity inside the kernel is small and the user
experience to manage them through devlink is dramatically better than
SRIOV. I think it is a win even if there isn't any HW savings.

Jason
Edwin Peer Jan. 25, 2021, 9:23 p.m. UTC | #14
On Mon, Jan 25, 2021 at 12:41 PM Jason Gunthorpe <jgg@nvidia.com> wrote:

> > That's an implementation decision. Nothing mandates that the state has
> > to physically exist in the same structure, only that reads and writes
> > are appropriately responded to.
>
> Yes, PCI does mandate this, you can't store the data on the other side
> of the PCI link, and if you can't cross the PCI link that only leaves
> on die/package memory resources.

Going off device was not what I was suggesting at all. I meant the
data doesn't necessarily need to be stored in the same physical
layout.

Take the config space for example. Many fields are read-only,
constant, always zero (for non-legacy) or reserved. These could be
generated by firmware in response to requests without ever being
backed by physical memory. Similarly, writes that imply a state change
can simply make that state change in whatever internal representation
is convenient for the device. One need only make sure the read back of
that state is appropriately reverse translated from your internal
representation. Similarly, if you're not exposing a bunch of optional
capabilities (the SFs don't), then you don't need the full config
space either; simply render zeroes in response to the reads where
you have nothing to say.

That's not to say all implementations would be capable of this, only
that it is an implementation choice.

> > Right, but presumably it still needs to be at least a page. And,
> > nothing says your device's VF BAR protocol can't be equally simple.
>
> Having VFs that are not self-contained would require significant
> changing of current infrastructure, if we are going to change things
> then let's fix everything instead of some half measure.

I don't understand what you mean by self-contained. If your device
only needs a doorbell write to trigger a DMA, no reason your VF BAR
needs to expose more. In practice, there will be some kind of
configuration channel too, but this doesn't necessarily need a lot of
room either (you don't have to represent configuration as a bulky
register file exposing every conceivable option, it could be a mailbox
with a command protocol).

> The actual complexity inside the kernel is small and the user
> experience to manage them through devlink is dramatically better than
> SRIOV. I think it is a win even if there isn't any HW savings.

I'm not sure I agree with respect to user experience. Users are
familiar with SR-IOV. Now you impose a complementary model for
accomplishing the same goal (without solving all the problems, as per
the previous discussion, so we'll need to reinvent it again later).
Adds to confusion.

It's not easier for vendors either. Now we need to get users onto new
drivers to exploit it, with all the distribution lags that entails
(where existing drivers would work for SR-IOV). Some vendors will
support it, some won't, further adding to user confusion.

Regards,
Edwin Peer
Jason Gunthorpe Jan. 25, 2021, 11:13 p.m. UTC | #15
On Mon, Jan 25, 2021 at 01:23:04PM -0800, Edwin Peer wrote:
> On Mon, Jan 25, 2021 at 12:41 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > > That's an implementation decision. Nothing mandates that the state has
> > > to physically exist in the same structure, only that reads and writes
> > > are appropriately responded to.
> >
> > Yes, PCI does mandate this, you can't store the data on the other side
> > of the PCI link, and if you can't cross the PCI link that only leaves
> > on die/package memory resources.
> 
> Going off device was not what I was suggesting at all. I meant the
> data doesn't necessarily need to be stored in the same physical
> layout.

It doesn't change anything: every writable bit must still be stored in
on-die SRAM. You can compute the minimum by summing all writable and
read-reporting bits in the standard SRIOV config space.

Every bit used for SRIOV is a bit that couldn't be used to improve
device performance.

> > > Right, but presumably it still needs to be at least a page. And,
> > > nothing says your device's VF BAR protocol can't be equally simple.
> >
> > Having VFs that are not self-contained would require significant
> > changing of current infrastructure, if we are going to change things
> > then let's fix everything instead of some half measure.
> 
> I don't understand what you mean by self-contained. 

Self-contained means you can pass the VF to a VM with vfio and run a
driver on it. A VF that only has a write-only doorbell page probably
cannot be self-contained.

> In practice, there will be some kind of configuration channel too,
> but this doesn't necessarily need a lot of room either 

I don't know of any device that can run without configuration, even in
a VF case.

So this all costs SRAM too.

> > The actual complexity inside the kernel is small and the user
> > experience to manage them through devlink is dramatically better than
> > SRIOV. I think it is a win even if there isn't any HW savings.
> 
> I'm not sure I agree with respect to user experience. Users are
> familiar with SR-IOV.

Sort of, SRIOV is a very bad fit for these sophisticated devices, and
no, users are not familiar with the weird intricate details of SR-IOV
in the context of very sophisticated reconfigurable HW like we are
seeing now.

Look at the other series about MSI-X reconfiguration for some colour
on where SRIOV runs into limits due to its specific design.

> Now you impose a complementary model for accomplishing the same goal
> (without solving all the problems, as per the previous discussion,
> so we'll need to reinvent it again later).  

I'm not sure what you are referring to.

> It's not easier for vendors either. Now we need to get users onto new
> drivers to exploit it, with all the distribution lags that entails
> (where existing drivers would work for SR-IOV). 

Compatibility with existing drivers in a VM is a vendor
choice. Drivers can do a lot in a scalable way in hypervisor SW to
present whatever programming interface makes sense to the VM. Intel is
showing this approach in their IDXD SIOV ADI driver.

> Some vendors will support it, some won't, further adding to user
> confusion.

Such is the nature of all things; some vendors supported SRIOV and
others didn't.

Jason
Jakub Kicinski Jan. 27, 2021, 1:34 a.m. UTC | #16
On Fri, 22 Jan 2021 11:36:44 -0800 Saeed Mahameed wrote:
> This series from Parav was the theme of this mlx5 release cycle,
> we've been waiting anxiously for the auxbus infrastructure to make it into
> the kernel, and now as the auxbus is in and all the stars are aligned, I
> can finally submit this patchset of the devlink and mlx5 subfunction support.
> 
> For more detailed information about subfunctions please see detailed tag
> log below.

Are there any further comments, objections or actions that need to be
taken on this series, anyone?

Looks like the discussion has ended. Not knowing any users who would
need this I'd like to at least make sure we have reasonable consensus
among vendors.
Saeed Mahameed Jan. 29, 2021, 12:03 a.m. UTC | #17
On Tue, 2021-01-26 at 17:34 -0800, Jakub Kicinski wrote:
> On Fri, 22 Jan 2021 11:36:44 -0800 Saeed Mahameed wrote:
> > This series from Parav was the theme of this mlx5 release cycle,
> > we've been waiting anxiously for the auxbus infrastructure to make
> > it into
> > the kernel, and now as the auxbus is in and all the stars are
> > aligned, I
> > can finally submit this patchset of the devlink and mlx5
> > subfunction support.
> > 
> > For more detailed information about subfunctions please see
> > detailed tag
> > log below.
> 
> Are there any further comments, objections or actions that need to be
> taken on this series, anyone?
> 
> Looks like the discussion has ended. Not knowing any users who would
> need this I'd like to at least make sure we have reasonable consensus
> among vendors.

Hey Jakub, sorry to nag, but I need to make some progress, can we move
on please ? my submission queue is about to explode :) !

Thanks,
Saeed.
Jakub Kicinski Jan. 29, 2021, 12:11 a.m. UTC | #18
On Thu, 28 Jan 2021 16:03:02 -0800 Saeed Mahameed wrote:
> On Tue, 2021-01-26 at 17:34 -0800, Jakub Kicinski wrote:
> > On Fri, 22 Jan 2021 11:36:44 -0800 Saeed Mahameed wrote:  
> > > This series from Parav was the theme of this mlx5 release cycle,
> > > we've been waiting anxiously for the auxbus infrastructure to make
> > > it into
> > > the kernel, and now as the auxbus is in and all the stars are
> > > aligned, I
> > > can finally submit this patchset of the devlink and mlx5
> > > subfunction support.
> > > 
> > > For more detailed information about subfunctions please see
> > > detailed tag
> > > log below.  
> > 
> > Are there any further comments, objections or actions that need to be
> > taken on this series, anyone?
> > 
> > Looks like the discussion has ended. Not knowing any users who would
> > need this I'd like to at least make sure we have reasonable consensus
> > among vendors.  
> 
> Hey Jakub, sorry to nag, but I need to make some progress, can we move
> on please ? my submission queue is about to explode :) !

I'll pull it in by the end of the day, just need to do some backports
and then it'll be on top of my list.