diff mbox series

[RFC,v3,net-next] Documentation: devlink: Add devlink-sd

Message ID 20240125223617.7298-1-witu@nvidia.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series [RFC,v3,net-next] Documentation: devlink: Add devlink-sd | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8 this patch: 8
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 0 of 0 maintainers
netdev/build_clang success Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch fail ERROR: Remove Gerrit Change-Id's before submitting upstream WARNING: Possible repeated word: 'devlink' WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

William Tu Jan. 25, 2024, 10:36 p.m. UTC
Add devlink-sd (shared descriptor) documentation. The devlink-sd
mechanism is targeted at configuration of the shared rx descriptors
that serve as a descriptor pool for Ethernet representors (reps)
to better utilize memory. The following operations are provided
(sketched below):
 * Add/delete a shared descriptor pool
 * Configure the pool's properties
 * Bind/unbind a representor's rx channel to a descriptor pool
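
A minimal command sketch of these operations, assuming a hypothetical
"devlink sd" command group with illustrative object and attribute names
(pool index, count, channel); the authoritative syntax is the one defined
in devlink-sd.rst:

    # add a shared descriptor pool on the eswitch device
    $ devlink sd pool add pci/0000:08:00.0 pool 0

    # configure the pool's properties, e.g. its descriptor count
    $ devlink sd pool set pci/0000:08:00.0 pool 0 count 1024

    # bind rx channel 0 of representor pf0vf0 to the pool
    $ devlink sd bind pci/0000:08:00.0 port pf0vf0 channel 0 pool 0

    # unbind and delete when no longer needed
    $ devlink sd unbind pci/0000:08:00.0 port pf0vf0 channel 0
    $ devlink sd pool del pci/0000:08:00.0 pool 0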

Propose new devlink objects because the existing solutions below do
not fit our use cases:
1) devlink params: would need many new params to support
   the shared descriptor pool, which doesn't seem to be a good idea.
2) devlink-sb (shared buffer): very similar to the API proposed in
   this patch, but devlink-sb is used for the ASIC hardware switch
   buffer and switch ports. Here the use case is switchdev mode with
   representor ports and their rx queues.

Signed-off-by: William Tu <witu@nvidia.com>
Change-Id: I1de0d9544ff8371955c6976b2d301b1630023100
---
v3: re-read the document and explain the NAPI context and descriptor pool
v2: address Jiri's feedback
- use more consistent device names, p0, pf0vf0, etc.
- fix several grammar and spelling errors
- several changes to devlink sd api
  - remove hex, remove sd show, make output 1:1 mapping, use
  count instead of size, use "add" instead of "create"
  - remove the use of "we"
- remove the "default" and introduce "shared-descs" in switchdev mode
- make description more consistent with definitions in ethtool,
such as ring, channel, queue.
---
 .../networking/devlink/devlink-sd.rst         | 296 ++++++++++++++++++
 1 file changed, 296 insertions(+)
 create mode 100644 Documentation/networking/devlink/devlink-sd.rst

Comments

Simon Horman Jan. 29, 2024, 10:56 a.m. UTC | #1
On Thu, Jan 25, 2024 at 02:36:17PM -0800, William Tu wrote:
> Add devlink-sd, shared descriptor, documentation. The devlink-sd
> mechanism is targeted for configuration of the shared rx descriptors
> that server as a descriptor pool for ethernet reprsentors (reps)
> to better utilize memory. Following operations are provided:
>  * add/delete a shared descriptor pool
>  * Configure the pool's properties
>  * Bind/unbind a representor's rx channel to a descriptor pool
> 
> Propose new devlink objects because existing solutions below do
> not fit our use cases:
> 1) devlink params: Need to add many new params to support
>    the shared descriptor pool. It doesn't seem to be a good idea.
> 2) devlink-sb (shared buffer): very similar to the API proposed in
>    this patch, but devlink-sb is used in ASIC hardware switch buffer
>    and switch's port. Here the use case is switchdev mode with
>    reprensentor ports and its rx queues.
> 
> Signed-off-by: William Tu <witu@nvidia.com>
> Change-Id: I1de0d9544ff8371955c6976b2d301b1630023100
> ---
> v3: read again myself and explain NAPI context and descriptor pool
> v2: work on Jiri's feedback
> - use more consistent device name, p0, pf0vf0, etc
> - several grammar and spelling errors
> - several changes to devlink sd api
>   - remove hex, remove sd show, make output 1:1 mapping, use
>   count instead of size, use "add" instead of "create"
>   - remove the use of "we"
> - remove the "default" and introduce "shared-descs" in switchdev mode
> - make description more consistent with definitions in ethtool,
> such as ring, channel, queue.
> ---
>  .../networking/devlink/devlink-sd.rst         | 296 ++++++++++++++++++
>  1 file changed, 296 insertions(+)
>  create mode 100644 Documentation/networking/devlink/devlink-sd.rst

Hi William,

a minor nit from my side:
I think that devlink-sd should be added to the toc in index.rst.

...
William Tu Jan. 29, 2024, 10:23 p.m. UTC | #2
On 1/29/24 2:56 AM, Simon Horman wrote:
> On Thu, Jan 25, 2024 at 02:36:17PM -0800, William Tu wrote:
>> Add devlink-sd, shared descriptor, documentation. The devlink-sd
>> mechanism is targeted for configuration of the shared rx descriptors
>> that server as a descriptor pool for ethernet reprsentors (reps)
>> to better utilize memory. Following operations are provided:
>>   * add/delete a shared descriptor pool
>>   * Configure the pool's properties
>>   * Bind/unbind a representor's rx channel to a descriptor pool
>>
>> Propose new devlink objects because existing solutions below do
>> not fit our use cases:
>> 1) devlink params: Need to add many new params to support
>>     the shared descriptor pool. It doesn't seem to be a good idea.
>> 2) devlink-sb (shared buffer): very similar to the API proposed in
>>     this patch, but devlink-sb is used in ASIC hardware switch buffer
>>     and switch's port. Here the use case is switchdev mode with
>>     reprensentor ports and its rx queues.
>>
>> Signed-off-by: William Tu <witu@nvidia.com>
>> Change-Id: I1de0d9544ff8371955c6976b2d301b1630023100
>> ---
>> v3: read again myself and explain NAPI context and descriptor pool
>> v2: work on Jiri's feedback
>> - use more consistent device name, p0, pf0vf0, etc
>> - several grammar and spelling errors
>> - several changes to devlink sd api
>>    - remove hex, remove sd show, make output 1:1 mapping, use
>>    count instead of size, use "add" instead of "create"
>>    - remove the use of "we"
>> - remove the "default" and introduce "shared-descs" in switchdev mode
>> - make description more consistent with definitions in ethtool,
>> such as ring, channel, queue.
>> ---
>>   .../networking/devlink/devlink-sd.rst         | 296 ++++++++++++++++++
>>   1 file changed, 296 insertions(+)
>>   create mode 100644 Documentation/networking/devlink/devlink-sd.rst
> Hi William,
>
> a minor nit from my side:
> I think that devlink-sd should be added to the toc in index.rst.

Hi Simon,

Thanks! will add to index.rst in next version

William
Jakub Kicinski Jan. 31, 2024, 1:07 a.m. UTC | #3
On Thu, 25 Jan 2024 14:36:17 -0800 William Tu wrote:
> Add devlink-sd, shared descriptor, documentation. The devlink-sd
> mechanism is targeted for configuration of the shared rx descriptors
> that server as a descriptor pool for ethernet reprsentors (reps)
> to better utilize memory. 

Do you only need this API to configure representors?
William Tu Jan. 31, 2024, 6:47 p.m. UTC | #4
On 1/30/24 5:07 PM, Jakub Kicinski wrote:
> On Thu, 25 Jan 2024 14:36:17 -0800 William Tu wrote:
>> Add devlink-sd, shared descriptor, documentation. The devlink-sd
>> mechanism is targeted for configuration of the shared rx descriptors
>> that server as a descriptor pool for ethernet reprsentors (reps)
>> to better utilize memory.
> Do you only need this API to configure representors?
Hi Jakub,

Thanks for taking a look. Yes, for our use case we only need this API.

And there is one additional extension to devlink eswitch, mentioned in the RFC:

+ * devlink dev eswitch set DEV mode switchdev - enable or disable default
+   port-pool mapping scheme
+
+    - DEV: the devlink device that supports shared descriptor pool.
+    - shared-descs { enable | disable }: enable/disable default port-pool
+      mapping scheme. See details below.

Use case:

+    # Enable switchdev mode with additional *shared-descs* option
+    * devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
+      shared-descs enable

William
Jakub Kicinski Jan. 31, 2024, 7:06 p.m. UTC | #5
On Wed, 31 Jan 2024 10:47:29 -0800 William Tu wrote:
> > Do you only need this API to configure representors?  
> 
> Thanks for taking a look. Yes, for our use case we only need this API.

I'm not sure how to interpret that, I think you answered a different
question :) To avoid any misunderstandings here - let me rephrase a
bit: are you only going to use this API to configure representors? 
Is any other netdev functionality going to use shared pools (i.e. other
than RDMA)?
William Tu Jan. 31, 2024, 7:16 p.m. UTC | #6
On 1/31/24 11:06 AM, Jakub Kicinski wrote:
> On Wed, 31 Jan 2024 10:47:29 -0800 William Tu wrote:
>>> Do you only need this API to configure representors?
>> Thanks for taking a look. Yes, for our use case we only need this API.
> I'm not sure how to interpret that, I think you answered a different
> question :) To avoid any misunderstandings here - let me rephrase a
> bit: are you only going to use this API to configure representors?
> Is any other netdev functionality going to use shared pools (i.e. other
> than RDMA)?

oh, now I understand your question.

Yes, this API is only to configure representors in switchdev mode.

No other netdev functionality will use this API.

William
Jakub Kicinski Jan. 31, 2024, 8:45 p.m. UTC | #7
On Wed, 31 Jan 2024 11:16:58 -0800 William Tu wrote:
> On 1/31/24 11:06 AM, Jakub Kicinski wrote:
> >> Thanks for taking a look. Yes, for our use case we only need this API.  
> > I'm not sure how to interpret that, I think you answered a different
> > question :) To avoid any misunderstandings here - let me rephrase a
> > bit: are you only going to use this API to configure representors?
> > Is any other netdev functionality going to use shared pools (i.e. other
> > than RDMA)?  
> 
> oh, now I understand your question.
> 
> Yes, this API is only to configure representors in switchdev mode.
> 
> No other netdev functionality will use this API.

Hm, interesting. I was asking because I recently heard some academic
talk where the guy mentioned something along the lines of new nVidia
NICs having this amazing feature of reusing free buffer pools while
keeping completions separate. So I was worried this will scope creep 
from something we don't care about all that much (fallback traffic)
to something we care about very much (e.g. container interfaces).
Since you promise this is for representors only, it's a simpler
conversation!

Still, I feel like shared buffer pools / shared queues is how the majority
of drivers implement representors. Did you have a look around?
William Tu Jan. 31, 2024, 9:37 p.m. UTC | #8
On 1/31/24 12:45 PM, Jakub Kicinski wrote:
> On Wed, 31 Jan 2024 11:16:58 -0800 William Tu wrote:
>> On 1/31/24 11:06 AM, Jakub Kicinski wrote:
>>>> Thanks for taking a look. Yes, for our use case we only need this API.
>>> I'm not sure how to interpret that, I think you answered a different
>>> question :) To avoid any misunderstandings here - let me rephrase a
>>> bit: are you only going to use this API to configure representors?
>>> Is any other netdev functionality going to use shared pools (i.e. other
>>> than RDMA)?
>> oh, now I understand your question.
>>
>> Yes, this API is only to configure representors in switchdev mode.
>>
>> No other netdev functionality will use this API.
> Hm, interesting. I was asking because I recently heard some academic
> talk where the guy mentioned something along the lines of new nVidia
> NICs having this amazing feature of reusing free buffer pools while
> keeping completions separate. So I was worried this will scope creep
> from something we don't care about all that much (fallback traffic)
> to something we care about very much (e.g. container interfaces).
> Since you promise this is for representors only, it's a simpler
> conversation!

For the academic talk, is it this one?

https://www.usenix.org/conference/osdi23/presentation/pismenny

>
> Still, I feel like shared buffer pools / shared queues is how majority
> of drivers implement representors. Did you had a look around?

Yes, I looked at the Intel ice driver, which also has representors. (Added to CC.)

IIUC, it still uses dedicated buffers for each rep, so this new API might help.

William
Jacob Keller Jan. 31, 2024, 9:41 p.m. UTC | #9
On 1/31/2024 1:37 PM, William Tu wrote:
> 
> On 1/31/24 12:45 PM, Jakub Kicinski wrote:
>> On Wed, 31 Jan 2024 11:16:58 -0800 William Tu wrote:
>>> On 1/31/24 11:06 AM, Jakub Kicinski wrote:
>>>>> Thanks for taking a look. Yes, for our use case we only need this API.
>>>> I'm not sure how to interpret that, I think you answered a different
>>>> question :) To avoid any misunderstandings here - let me rephrase a
>>>> bit: are you only going to use this API to configure representors?
>>>> Is any other netdev functionality going to use shared pools (i.e. other
>>>> than RDMA)?
>>> oh, now I understand your question.
>>>
>>> Yes, this API is only to configure representors in switchdev mode.
>>>
>>> No other netdev functionality will use this API.
>> Hm, interesting. I was asking because I recently heard some academic
>> talk where the guy mentioned something along the lines of new nVidia
>> NICs having this amazing feature of reusing free buffer pools while
>> keeping completions separate. So I was worried this will scope creep
>> from something we don't care about all that much (fallback traffic)
>> to something we care about very much (e.g. container interfaces).
>> Since you promise this is for representors only, it's a simpler
>> conversation!
> 
> For the academic talk, is it this one?
> 
> https://www.usenix.org/conference/osdi23/presentation/pismenny
> 
>>
>> Still, I feel like shared buffer pools / shared queues is how majority
>> of drivers implement representors. Did you had a look around?
> 
> Yes, I look at Intel ICE driver, which also has representors. (Add to CC)
> 
> IIUC, it's still dedicated buffer for each reps, so this new API might help.
> 
> William
> 

Yea, I am pretty sure the ice implementation uses dedicated buffers for
representors right now.
Jakub Kicinski Jan. 31, 2024, 10:30 p.m. UTC | #10
On Wed, 31 Jan 2024 13:41:07 -0800 Jacob Keller wrote:
> >> Still, I feel like shared buffer pools / shared queues is how majority
> >> of drivers implement representors. Did you had a look around?  
> > 
> > Yes, I look at Intel ICE driver, which also has representors. (Add to CC)
> > 
> > IIUC, it's still dedicated buffer for each reps, so this new API might help.
> 
> Yea, I am pretty sure the ice implementation uses dedicated buffers for
> representors right now.

I just did a grep on METADATA_HW_PORT_MUX and assumed bnxt, ice and nfp
all do buffer sharing. You're saying you mux Tx queues but not Rx
queues? Or I need to actually read the code instead of grepping? :)
William Tu Jan. 31, 2024, 11:02 p.m. UTC | #11
On 1/31/24 2:30 PM, Jakub Kicinski wrote:
> On Wed, 31 Jan 2024 13:41:07 -0800 Jacob Keller wrote:
>>>> Still, I feel like shared buffer pools / shared queues is how majority
>>>> of drivers implement representors. Did you had a look around?
>>> Yes, I look at Intel ICE driver, which also has representors. (Add to CC)
>>>
>>> IIUC, it's still dedicated buffer for each reps, so this new API might help.
>> Yea, I am pretty sure the ice implementation uses dedicated buffers for
>> representors right now.
> I just did a grep on METADATA_HW_PORT_MUX and assumed bnxt, ice and nfp
> all do buffer sharing. You're saying you mux Tx queues but not Rx
> queues? Or I need to actually read the code instead of grepping? :)
>
I guess bnxt, ice, and nfp are doing tx buffer sharing?

This devlink-sd is for RX queues, not TX queues.

devlink-sd creates a pool of shared descriptors only for RX queues.

The TX queues / TX path remain unchanged.

William
Jakub Kicinski Jan. 31, 2024, 11:17 p.m. UTC | #12
On Wed, 31 Jan 2024 15:02:58 -0800 William Tu wrote:
> > I just did a grep on METADATA_HW_PORT_MUX and assumed bnxt, ice and nfp
> > all do buffer sharing. You're saying you mux Tx queues but not Rx
> > queues? Or I need to actually read the code instead of grepping? :)
> 
> I guess bnxt, ice, nfp are doing tx buffer sharing?

I'm not familiar with ice. I'm 90% sure bnxt shares both Rx and Tx.
I'm 99.9% sure nfp does.

It'd be great if you could do the due diligence rather than guessing
given that you're proposing uAPI extension :(

> This devlink sd is for RX queues not TX queues.
> 
> And devlink-sd creates a pool of shared descriptors only for RX queue.
> 
> The TX queues/ TX path remain unchanged.
Samudrala, Sridhar Feb. 1, 2024, 2:23 a.m. UTC | #13
On 1/31/2024 5:17 PM, Jakub Kicinski wrote:
> On Wed, 31 Jan 2024 15:02:58 -0800 William Tu wrote:
>>> I just did a grep on METADATA_HW_PORT_MUX and assumed bnxt, ice and nfp
>>> all do buffer sharing. You're saying you mux Tx queues but not Rx
>>> queues? Or I need to actually read the code instead of grepping? :)
>>
>> I guess bnxt, ice, nfp are doing tx buffer sharing?
> 
> I'm not familiar with ice. I'm 90% sure bnxt shares both Rx and Tx.
> I'm 99.9% sure nfp does.

In ice, all the VF representor netdevs share a VSI (TX/RX queues). The UL/PF
netdev has its own VSI and TX/RX queues. But there is a patch from Michal
under review that is going to simplify the design with a single VSI, and
all the VF representor netdevs and the UL/PF netdev will be sharing the
TX/RX queues in switchdev mode.

Does mlx5 have separate TX/RX queues for each of its representor netdevs?

> 
> It'd be great if you could do the due diligence rather than guessing
> given that you're proposing uAPI extension :(
> 
>> This devlink sd is for RX queues not TX queues.
>>
>> And devlink-sd creates a pool of shared descriptors only for RX queue.
>>
>> The TX queues/ TX path remain unchanged.
>
Jiri Pirko Feb. 1, 2024, 10:13 a.m. UTC | #14
Thu, Feb 01, 2024 at 12:17:26AM CET, kuba@kernel.org wrote:
>On Wed, 31 Jan 2024 15:02:58 -0800 William Tu wrote:
>> > I just did a grep on METADATA_HW_PORT_MUX and assumed bnxt, ice and nfp
>> > all do buffer sharing. You're saying you mux Tx queues but not Rx
>> > queues? Or I need to actually read the code instead of grepping? :)
>> 
>> I guess bnxt, ice, nfp are doing tx buffer sharing?
>
>I'm not familiar with ice. I'm 90% sure bnxt shares both Rx and Tx.
>I'm 99.9% sure nfp does.

Wait a sec. You refer to using the lower device (like the PF) to actually
send and receive traffic of representors. That means you share the
entire queues. Or maybe a better term is not "share" but "use PF queues".

The infra William is proposing is about something else. In that
scenario, each representor has a separate independent set of queues,
and so does the PF. Currently in mlx5, all representor queues have
descriptors only used for the individual representor. So there is
a huge waste of memory for that, as often there is only very low traffic
there and the probability of hitting a traffic burst on many representors
at the same time is very low.

Say you have 1 queue for a rep. 1 queue has 1k descriptors. For 1k
representors you end up with:
1k x 1k = 1m descriptors

With this API, user can configure sharing of the descriptors.
So there would be a pool (or multiple pools) of descriptors and the
descriptors could be used by many queues/representors.

So in the example above, for 1k representors you have only 1k
descriptors.

The infra allows great flexibility in terms of configuring multiple
pools of different sizes and assigning queues from representors to
different pools. So you can have multiple "classes" of representors.
For example, the ones on which you expect heavy traffic could have a
separate pool, and the rest can share another pool together, etc.
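
A rough sketch of such a setup, reusing the hypothetical "devlink sd"
syntax sketched in the commit message (pool indices, counts and channel
numbers are illustrative only):

    # a deep pool for the few representors expected to see heavy traffic
    $ devlink sd pool add pci/0000:08:00.0 pool 0 count 4096
    $ devlink sd bind pci/0000:08:00.0 port pf0vf0 channel 0 pool 0

    # a small pool shared by the remaining, mostly idle representors
    $ devlink sd pool add pci/0000:08:00.0 pool 1 count 512
    $ devlink sd bind pci/0000:08:00.0 port pf0vf1 channel 0 pool 1
    $ devlink sd bind pci/0000:08:00.0 port pf0vf2 channel 0 pool 1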


>
>It'd be great if you could do the due diligence rather than guessing
>given that you're proposing uAPI extension :(
>
>> This devlink sd is for RX queues not TX queues.
>> 
>> And devlink-sd creates a pool of shared descriptors only for RX queue.
>> 
>> The TX queues/ TX path remain unchanged.
>
William Tu Feb. 1, 2024, 2 p.m. UTC | #15
On 1/31/24 6:23 PM, Samudrala, Sridhar wrote:
> On 1/31/2024 5:17 PM, Jakub Kicinski wrote:
>> On Wed, 31 Jan 2024 15:02:58 -0800 William Tu wrote:
>>>> I just did a grep on METADATA_HW_PORT_MUX and assumed bnxt, ice and 
>>>> nfp
>>>> all do buffer sharing. You're saying you mux Tx queues but not Rx
>>>> queues? Or I need to actually read the code instead of grepping? :)
>>>
>>> I guess bnxt, ice, nfp are doing tx buffer sharing?
>>
>> I'm not familiar with ice. I'm 90% sure bnxt shares both Rx and Tx.
>> I'm 99.9% sure nfp does.
>
> In ice, all the VF representor netdevs share a VSI(TX/RX queues). UL/PF
> netdev has its own VSI and TX/RX queues. But there is patch from Michal
> under review that is going to simplify the design with a single VSI and
> all the VF representor netdevs and UL/PF netdev will be sharing the
> TX/RX queues in switchdev mode.
>
Thank you!

Reading the ice code, ice_eswitch_remap_rings_to_vectors() is setting up
tx/rx rings for each rep:

"Each port representor will have dedicated 1 Tx/Rx ring pair, so number
of rings pair is equal to number of VFs."

So after Michal's patch, representors will share the TX/RX queues of the uplink PF?


> Does mlx5 has separate TX/RX queues for each of its representor netdevs?
>
Yes, mlx5e_rep_open calls mlx5e_open_locked, which will create TX/RX
queues like a typical mlx5 device.

Each representor can set its TX/RX queue count by using ethtool -L.

>>
>> It'd be great if you could do the due diligence rather than guessing
>> given that you're proposing uAPI extension :(
>>
Working on it!

Thanks

William
William Tu Feb. 1, 2024, 7:16 p.m. UTC | #16
On 1/31/24 3:17 PM, Jakub Kicinski wrote:
> On Wed, 31 Jan 2024 15:02:58 -0800 William Tu wrote:
>>> I just did a grep on METADATA_HW_PORT_MUX and assumed bnxt, ice and nfp
>>> all do buffer sharing. You're saying you mux Tx queues but not Rx
>>> queues? Or I need to actually read the code instead of grepping? :)
>> I guess bnxt, ice, nfp are doing tx buffer sharing?
> I'm not familiar with ice. I'm 90% sure bnxt shares both Rx and Tx.
> I'm 99.9% sure nfp does.
>
> It'd be great if you could do the due diligence rather than guessing
> given that you're proposing uAPI extension :(
>

(sorry again, html is detected in previous email)

due diligence here:


Summary
=======

1. VF-reps that simply "use" the PF-rep's queues for rx and tx:

    sfc, ice (under review), bnxt, and nfp

2. VF-reps that have their own rx/tx queues:

    ice (1 rx/tx ring pair per rep), mlx5 (multiple rx/tx queues)

Case 1: no way to prioritize an important VF-rep's traffic.
Case 2: scaling to 1k reprs wastes lots of memory.


Details
=======

Based on reading the code around open_repr, napi_poll for reprs, and
searching commit messages.

ICE
---

Has a dedicated rx/tx ring pair for each VF-rep. Patches from Michal are
under review for VF-reps to share the PF-rep's queues.

See:

ice_eswitch_remap_rings_to_vectors sets up tx rings and rx rings for
each rep: "Each port representor will have dedicated 1 Tx/Rx ring pair,
so number of rings pair is equal to number of VFs."

Later on it sets up 1 NAPI for each rep, see ice_eswitch_setup_repr.


BNXT
----

No dedicated rx/tx ring per rep; VF-rep rx/tx share the PF-rep's rx/tx rings.

See:

commit ee5c7fb34047: bnxt_en: add vf-rep RX/TX and netdev implementation

"This patch introduces the RX/TX and a simple netdev implementation for
VF-reps. The VF-reps use the RX/TX rings of the PF."


NFP
---

VF-rep uses the PF-rep's rx/tx queues.

See:

https://lore.kernel.org/netdev/20170621095933.GC6691@vergenet.net/T/

“The PF netdev acts as a lower-device which sends and receives packets to
and from the firmware. The representors act as upper-devices. For TX
representors attach a metadata dst to the skb which is used by the PF
netdev to prepend metadata to the packet before forwarding the firmware. On
RX the PF netdev looks up the representor based on the prepended metadata”


and:

nfp_flower_spawn_vnic_reprs
nfp_abm_spawn_repr -> nfp_repr_alloc_mqs
nfp_repr_open -> nfp_app_repr -> nfp_flower_repr_netdev_open


SFC
---

VF-rep uses the PF-rep's queues, i.e. the PF-rep receives packets for the VF-rep.

See:

efx_ef100_rep_poll
efx_ef100_rep_open -> efx_ef100_rep_poll

“commit 9fe00c8 sfc: ef100 representor RX top half”

Representor RX uses a NAPI context driven by a 'fake interrupt': when
the parent PF receives a packet destined for the representor, it adds
it to an SKB list (efv->rx_list), and schedules NAPI if the 'fake
interrupt' is primed.


MLX5
----

VF-reps have their own dedicated rx/tx queues.

All representor queues have descriptors only used for the individual
representor.

See mlx5e_rep_open -> mlx5e_open_locked -> mlx5e_open_channels -> queues.

Jakub Kicinski Feb. 2, 2024, 3:30 a.m. UTC | #17
On Thu, 1 Feb 2024 11:16:22 -0800 William Tu wrote:
> >> I guess bnxt, ice, nfp are doing tx buffer sharing?  
> > I'm not familiar with ice. I'm 90% sure bnxt shares both Rx and Tx.
> > I'm 99.9% sure nfp does.
> >
> > It'd be great if you could do the due diligence rather than guessing
> > given that you're proposing uAPI extension :(
> >  
> *
> 
> (sorry again, html is detected in previous email)
> 
> due diligence here:

Thanks for collecting the deets!

If I'm reading this right we have 3 drivers which straight up share
queues. I don't remember anyone complaining about the queue sharing
in my time at Netro. Meaning we can probably find reasonable defaults
and not start with the full API? Just have the

  devlink dev eswitch set DEV mode switchdev shared-descs enable

extension to begin with?
Jakub Kicinski Feb. 2, 2024, 4 a.m. UTC | #18
On Thu, 1 Feb 2024 11:13:57 +0100 Jiri Pirko wrote:
> Thu, Feb 01, 2024 at 12:17:26AM CET, kuba@kernel.org wrote:
> >> I guess bnxt, ice, nfp are doing tx buffer sharing?  
> >
> >I'm not familiar with ice. I'm 90% sure bnxt shares both Rx and Tx.
> >I'm 99.9% sure nfp does.  
> 
> Wait a sec.

No, you wait a sec ;) Why do you think this belongs to devlink?
Two months ago you were complaining bitterly when people were
considering using devlink rate to control per-queue shapers.
And now it's fine to add queues as a concept to devlink?

> You refer to using the lower device (like PF) to actually
> send and receive trafic of representors. That means, you share the
> entire queues. Or maybe better term is not "share" but "use PF queues".
> 
> The infra William is proposing is about something else. In that
> scenario, each representor has a separate independent set of queues,
> as well as the PF has. Currently in mlx5, all representor queues have
> descriptors only used for the individual representor. So there is
> a huge waste of memory for that, as often there is only very low traffic
> there and probability of hitting trafic burst on many representors at
> the same time is very low.
> 
> Say you have 1 queue for a rep. 1 queue has 1k descriptors. For 1k
> representors you end up with:
> 1k x 1k = 1m descriptors

I understand the memory waste problem:
https://people.kernel.org/kuba/nic-memory-reserve

> With this API, user can configure sharing of the descriptors.
> So there would be a pool (or multiple pools) of descriptors and the
> descriptors could be used by many queues/representors.
> 
> So in the example above, for 1k representors you have only 1k
> descriptors.
> 
> The infra allows great flexibility in terms of configuring multiple
> pools of different sizes and assigning queues from representors to
> different pools. So you can have multiple "classes" of representors.
> For example the ones you expect heavy trafic could have a separate pool,
> the rest can share another pool together, etc.

Well, it does not extend naturally to the design described in that blog
post. There I only care about a netdev level pool, but every queue can
bind multiple pools.

It also does not cater naturally to a very interesting application
of such tech to lightweight container interfaces, macvlan-offload style.
As I said at the beginning, why is the pool a devlink thing if the only
objects that connect to it are netdevs?

Another netdev thing where this will be awkward is page pool
integration. It lives in netdev genl, are we going to add devlink pool
reference to indicate which pool a pp is feeding?

When memory providers finally materialize that will be another
netdev thing that needs to somehow connect here.
William Tu Feb. 2, 2024, 4:26 a.m. UTC | #19
On 2/1/24 7:30 PM, Jakub Kicinski wrote:
> On Thu, 1 Feb 2024 11:16:22 -0800 William Tu wrote:
>>>> I guess bnxt, ice, nfp are doing tx buffer sharing?
>>> I'm not familiar with ice. I'm 90% sure bnxt shares both Rx and Tx.
>>> I'm 99.9% sure nfp does.
>>>
>>> It'd be great if you could do the due diligence rather than guessing
>>> given that you're proposing uAPI extension :(
>>>
>> *
>>
>> (sorry again, html is detected in previous email)
>>
>> due diligence here:
> Thanks for collecting the deets!
>
> If I'm reading this right we have 3 drivers which straight up share
> queues. I don't remember anyone complaining about the queue sharing
> in my time at Netro. Meaning we can probably find reasonable defaults
> and not start with the full API? Just have the
>
>    devlink dev eswitch set DEV mode switchdev shared-descs enable
>
> extension to begin with?
Yes, that's what I began with months ago, by just adding new devlink params.
We need at least two new parameters: enable/disable, and the size of the pool.
What I have right now is something like:

$ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
  shared-descs-counts 1024
$ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
  shared-descs enable

Once enabled, the default sharing scheme, described in the RFC, applies
to all REP queues, and there is no way to customize the queue-to-pool mapping.

Then I discussed with a couple of people internally and came up with this
more flexible and generic API.

William
Jiri Pirko Feb. 2, 2024, 7:46 a.m. UTC | #20
Fri, Feb 02, 2024 at 05:00:41AM CET, kuba@kernel.org wrote:
>On Thu, 1 Feb 2024 11:13:57 +0100 Jiri Pirko wrote:
>> Thu, Feb 01, 2024 at 12:17:26AM CET, kuba@kernel.org wrote:
>> >> I guess bnxt, ice, nfp are doing tx buffer sharing?  
>> >
>> >I'm not familiar with ice. I'm 90% sure bnxt shares both Rx and Tx.
>> >I'm 99.9% sure nfp does.  
>> 
>> Wait a sec.
>
>No, you wait a sec ;) Why do you think this belongs to devlink?
>Two months ago you were complaining bitterly when people were
>considering using devlink rate to control per-queue shapers.
>And now it's fine to add queues as a concept to devlink?

Do you have a better suggestion for how to model a common pool object for
multiple netdevices? This is the reason why devlink was introduced to
provide a platform for common/shared things for a device that contains
multiple netdevs/ports/whatever. But I may be missing something here,
for sure.


>
>> You refer to using the lower device (like PF) to actually
>> send and receive trafic of representors. That means, you share the
>> entire queues. Or maybe better term is not "share" but "use PF queues".
>> 
>> The infra William is proposing is about something else. In that
>> scenario, each representor has a separate independent set of queues,
>> as well as the PF has. Currently in mlx5, all representor queues have
>> descriptors only used for the individual representor. So there is
>> a huge waste of memory for that, as often there is only very low traffic
>> there and probability of hitting trafic burst on many representors at
>> the same time is very low.
>> 
>> Say you have 1 queue for a rep. 1 queue has 1k descriptors. For 1k
>> representors you end up with:
>> 1k x 1k = 1m descriptors
>
>I understand the memory waste problem:
>https://people.kernel.org/kuba/nic-memory-reserve
>
>> With this API, user can configure sharing of the descriptors.
>> So there would be a pool (or multiple pools) of descriptors and the
>> descriptors could be used by many queues/representors.
>> 
>> So in the example above, for 1k representors you have only 1k
>> descriptors.
>> 
>> The infra allows great flexibility in terms of configuring multiple
>> pools of different sizes and assigning queues from representors to
>> different pools. So you can have multiple "classes" of representors.
>> For example the ones you expect heavy trafic could have a separate pool,
>> the rest can share another pool together, etc.
>
>Well, it does not extend naturally to the design described in that blog
>post. There I only care about a netdev level pool, but every queue can
>bind multiple pools.
>
>It also does not cater naturally to a very interesting application
>of such tech to lightweight container interfaces, macvlan-offload style.
>As I said at the beginning, why is the pool a devlink thing if the only
>objects that connect to it are netdevs?

Okay. Let's model it differently, no problem. I find the devlink device
a good fit for an object to contain shared things like pools.
But perhaps there could be something else. Something new?


>
>Another netdev thing where this will be awkward is page pool
>integration. It lives in netdev genl, are we going to add devlink pool
>reference to indicate which pool a pp is feeding?

Page pool is per-netdev, isn't it? It could be extended to be bound per
devlink-pool as you suggest. It is a bit awkward, I agree.

So instead of devlink, should we add the descriptor-pool object into
netdev genl and make it possible for multiple netdevs to use it there?
I would still miss the namespace of the pool, as it naturally aligns
with the devlink device. IDK :/


>
>When memory providers finally materialize that will be another
>netdev thing that needs to somehow connect here.
Michal Swiatkowski Feb. 2, 2024, 8:48 a.m. UTC | #21
On Thu, Feb 01, 2024 at 06:00:54AM -0800, William Tu wrote:
> 
> On 1/31/24 6:23 PM, Samudrala, Sridhar wrote:
> > On 1/31/2024 5:17 PM, Jakub Kicinski wrote:
> > > On Wed, 31 Jan 2024 15:02:58 -0800 William Tu wrote:
> > > > > I just did a grep on METADATA_HW_PORT_MUX and assumed bnxt,
> > > > > ice and nfp
> > > > > all do buffer sharing. You're saying you mux Tx queues but not Rx
> > > > > queues? Or I need to actually read the code instead of grepping? :)
> > > > 
> > > > I guess bnxt, ice, nfp are doing tx buffer sharing?
> > > 
> > > I'm not familiar with ice. I'm 90% sure bnxt shares both Rx and Tx.
> > > I'm 99.9% sure nfp does.
> > 
> > In ice, all the VF representor netdevs share a VSI(TX/RX queues). UL/PF
> > netdev has its own VSI and TX/RX queues. But there is patch from Michal
> > under review that is going to simplify the design with a single VSI and
> > all the VF representor netdevs and UL/PF netdev will be sharing the
> > TX/RX queues in switchdev mode.
> > 
> Thank you!
> 
> Reading the ice code, ice_eswitch_remap_rings_to_vectors(), it is setting up
> tx/rx rings for each reps.
> 
> "Each port representor will have dedicated 1 Tx/Rx ring pair, so number of
> rings pair is equal to
>  number of VFs."
> 
> So after Michal's patch, representors will share TX/RX queues of uplink-pf?
> 
> 

Yeah, right, we thought about a solution like in mlx5, but we can easily
run into a queue shortage in ice. We need to allow representors to share
queues. The easiest solution was to move to sharing queues with the PF,
like (I think) nfp and a few other vendors do.

> > Does mlx5 has separate TX/RX queues for each of its representor netdevs?
> > 
> Yes, in mlx5e_rep_open calls mlx5e_open_locked, which will create TX/RX
> queues like typical mlx5 device.
> 
> Each representor can set it TX/RX queues by using ethtool -L
>

I am a little out of context here. Do you also allow sharing queues
between representors? An API for sharing descriptors sounds great, but in
ice we also have a queue shortage, which is why we want to use PF
queues instead.

> > > It'd be great if you could do the due diligence rather than guessing
> > > given that you're proposing uAPI extension :(
> > > 
> Working on it!
> 
> Thanks
> 
> William
> 
>

Thanks,
Michal
William Tu Feb. 2, 2024, 3:27 p.m. UTC | #22
On 2/2/24 12:48 AM, Michal Swiatkowski wrote:
> On Thu, Feb 01, 2024 at 06:00:54AM -0800, William Tu wrote:
>> On 1/31/24 6:23 PM, Samudrala, Sridhar wrote:
>>> On 1/31/2024 5:17 PM, Jakub Kicinski wrote:
>>>> On Wed, 31 Jan 2024 15:02:58 -0800 William Tu wrote:
>>>>>> I just did a grep on METADATA_HW_PORT_MUX and assumed bnxt,
>>>>>> ice and nfp
>>>>>> all do buffer sharing. You're saying you mux Tx queues but not Rx
>>>>>> queues? Or I need to actually read the code instead of grepping? :)
>>>>> I guess bnxt, ice, nfp are doing tx buffer sharing?
>>>> I'm not familiar with ice. I'm 90% sure bnxt shares both Rx and Tx.
>>>> I'm 99.9% sure nfp does.
>>> In ice, all the VF representor netdevs share a VSI(TX/RX queues). UL/PF
>>> netdev has its own VSI and TX/RX queues. But there is patch from Michal
>>> under review that is going to simplify the design with a single VSI and
>>> all the VF representor netdevs and UL/PF netdev will be sharing the
>>> TX/RX queues in switchdev mode.
>>>
>> Thank you!
>>
>> Reading the ice code, ice_eswitch_remap_rings_to_vectors(), it is setting up
>> tx/rx rings for each reps.
>>
>> "Each port representor will have dedicated 1 Tx/Rx ring pair, so number of
>> rings pair is equal to
>>   number of VFs."
>>
>> So after Michal's patch, representors will share TX/RX queues of uplink-pf?
>>
>>
> Yeah, right, we though about solution like in mlx5, but we can easily
> get queues shortage in ice. We need to allow representor to share the
> queues. The easiest solution was to move to sharing queues with PF like
> (I think so) nfp and few other vendors do.
>
>>> Does mlx5 has separate TX/RX queues for each of its representor netdevs?
>>>
>> Yes, in mlx5e_rep_open calls mlx5e_open_locked, which will create TX/RX
>> queues like typical mlx5 device.
>>
>> Each representor can set it TX/RX queues by using ethtool -L
>>
> I am a little out of context here. Do you allow also sharing queues
> between representors? API for sharing descriptors sounds great, but in
> ice we also have queues shortage, because of that we want to use PF
> queues instead.
>
Thanks.
Yes, in mlx5, each representor still has its rx queue, but only the
metadata/control part of the queue. The actual descriptors come from the
shared descriptor pool, which is used by all the other representors.

For example, see the diagram below.
From the mlx5 fw/hw point of view, it creates only 1 shared queue with
WQEs, and that shared queue is used by multiple representors.

+     +--------+            +--------+
+     │ pf0vf1 │            │ pf0vf2 │
+     │   RQ   │            │   RQ   │
+     +----┬---+            +----┬---+
+          │                     │
+          +---------+  +--------+
+                    │  │
+              +-----┴--┴-------+
+              │     shared     │
+              │ rx descriptors │
+              │ and buffers    │
+              +----------------+
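
A corresponding hypothetical binding, again using the illustrative
"devlink sd" syntax from the commit message sketch (pool index and
channel numbers are assumptions):

    $ devlink sd pool add pci/0000:08:00.0 pool 0 count 1024
    $ devlink sd bind pci/0000:08:00.0 port pf0vf1 channel 0 pool 0
    $ devlink sd bind pci/0000:08:00.0 port pf0vf2 channel 0 pool 0

Each representor keeps its own RQ for the metadata/control part, while
the WQEs/buffers come from the single shared pool.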
William
Jakub Kicinski Feb. 9, 2024, 1:26 a.m. UTC | #23
On Fri, 2 Feb 2024 08:46:56 +0100 Jiri Pirko wrote:
> Fri, Feb 02, 2024 at 05:00:41AM CET, kuba@kernel.org wrote:
> >On Thu, 1 Feb 2024 11:13:57 +0100 Jiri Pirko wrote:  
> >> Wait a sec.  
> >
> >No, you wait a sec ;) Why do you think this belongs to devlink?
> >Two months ago you were complaining bitterly when people were
> >considering using devlink rate to control per-queue shapers.
> >And now it's fine to add queues as a concept to devlink?  
> 
> Do you have a better suggestion how to model common pool object for
> multiple netdevices? This is the reason why devlink was introduced to
> provide a platform for common/shared things for a device that contains
> multiple netdevs/ports/whatever. But I may be missing something here,
> for sure.

devlink just seems like the lowest common denominator, but the moment
we start talking about multi-PF devices it also gets wobbly :(
I think it's better to focus on the object, without scoping it to some
ancestor which may not be sufficient tomorrow (meaning its own family
or a new object in netdev like page pool).

> >> With this API, user can configure sharing of the descriptors.
> >> So there would be a pool (or multiple pools) of descriptors and the
> >> descriptors could be used by many queues/representors.
> >> 
> >> So in the example above, for 1k representors you have only 1k
> >> descriptors.
> >> 
> >> The infra allows great flexibility in terms of configuring multiple
> >> pools of different sizes and assigning queues from representors to
> >> different pools. So you can have multiple "classes" of representors.
> >> For example the ones you expect heavy trafic could have a separate pool,
> >> the rest can share another pool together, etc.  
> >
> >Well, it does not extend naturally to the design described in that blog
> >post. There I only care about a netdev level pool, but every queue can
> >bind multiple pools.
> >
> >It also does not cater naturally to a very interesting application
> >of such tech to lightweight container interfaces, macvlan-offload style.
> >As I said at the beginning, why is the pool a devlink thing if the only
> >objects that connect to it are netdevs?  
> 
> Okay. Let's model it differently, no problem. I find devlink device
> as a good fit for object to contain shared things like pools.
> But perhaps there could be something else. Something new?

We need something new for more advanced memory providers, anyway.
The huge page example I posted a year ago needs something to get
a huge page from CMA and slice it up for the page pools to draw from.
That's very similar, also not really bound to a netdev. I don't think
the cross-netdev aspect is the most important aspect of this problem.

> >Another netdev thing where this will be awkward is page pool
> >integration. It lives in netdev genl, are we going to add devlink pool
> >reference to indicate which pool a pp is feeding?  
> 
> Page pool is per-netdev, isn't it? It could be extended to be bound per
> devlink-pool as you suggest. It is a bit awkward, I agree.
> 
> So instead of devlink, should be add the descriptor-pool object into
> netdev genl and make possible for multiple netdevs to use it there?
> I would still miss the namespace of the pool, as it naturally aligns
> with devlink device. IDK :/

Maybe the first thing to iron out is the life cycle. Right now we
throw all configuration requests at the driver which ends really badly
for those of us who deal with heterogeneous environments. Applications
which try to do advanced stuff like pinning and XDP break because of
all the behavior differences between drivers. So I don't think we
should expose configuration of unstable objects (those which user
doesn't create explicitly - queues, irqs, page pools etc) to the driver.
The driver should get or read the config from the core when the object
is created.

This gets back to the proposed descriptor pool because there's a
chicken and an egg problem between creating the representors and
creating the descriptor pool, right? Either:
 - create reprs first with individual queues, reconfigure them to bind
   them to a pool
 - create pool first bind the reprs which don't exist to them,
   assuming the driver somehow maintains the mapping, pretty weird
   to configure objects which don't exist
 - create pool first, add an extra knob elsewhere (*cough* "shared-descs
   enable") which produces somewhat loosely defined reasonable behavior

Because this is a general problem (again, any queue config needs it)
I think we'll need to create some sort of a rule engine in netdev :(
Instead of configuring a page pool you'd add a configuration rule
which can match on netdev and queue id and gives any related page pool
some parameters. NAPI is another example of something user can't
reasonably configure directly. And if we create such a rule engine 
it should probably be shared...
Jiri Pirko Feb. 15, 2024, 1:19 p.m. UTC | #24
Fri, Feb 09, 2024 at 02:26:33AM CET, kuba@kernel.org wrote:
>On Fri, 2 Feb 2024 08:46:56 +0100 Jiri Pirko wrote:
>> Fri, Feb 02, 2024 at 05:00:41AM CET, kuba@kernel.org wrote:
>> >On Thu, 1 Feb 2024 11:13:57 +0100 Jiri Pirko wrote:  
>> >> Wait a sec.  
>> >
>> >No, you wait a sec ;) Why do you think this belongs to devlink?
>> >Two months ago you were complaining bitterly when people were
>> >considering using devlink rate to control per-queue shapers.
>> >And now it's fine to add queues as a concept to devlink?  
>> 
>> Do you have a better suggestion how to model common pool object for
>> multiple netdevices? This is the reason why devlink was introduced to
>> provide a platform for common/shared things for a device that contains
>> multiple netdevs/ports/whatever. But I may be missing something here,
>> for sure.
>
>devlink just seems like the lowest common denominator, but the moment
>we start talking about multi-PF devices it also gets wobbly :(

You mean you see a real case for a multi-PF device that allows sharing the
pools between the PFs? If, in theory, that exists, could this just be a
limitation perhaps?


>I think it's better to focus on the object, without scoping it to some
>ancestor which may not be sufficient tomorrow (meaning its own family
>or a new object in netdev like page pool).

Ok.


>
>> >> With this API, user can configure sharing of the descriptors.
>> >> So there would be a pool (or multiple pools) of descriptors and the
>> >> descriptors could be used by many queues/representors.
>> >> 
>> >> So in the example above, for 1k representors you have only 1k
>> >> descriptors.
>> >> 
>> >> The infra allows great flexibility in terms of configuring multiple
>> >> pools of different sizes and assigning queues from representors to
>> >> different pools. So you can have multiple "classes" of representors.
>> >> For example the ones you expect heavy trafic could have a separate pool,
>> >> the rest can share another pool together, etc.  
>> >
>> >Well, it does not extend naturally to the design described in that blog
>> >post. There I only care about a netdev level pool, but every queue can
>> >bind multiple pools.
>> >
>> >It also does not cater naturally to a very interesting application
>> >of such tech to lightweight container interfaces, macvlan-offload style.
>> >As I said at the beginning, why is the pool a devlink thing if the only
>> >objects that connect to it are netdevs?  
>> 
>> Okay. Let's model it differently, no problem. I find devlink device
>> as a good fit for object to contain shared things like pools.
>> But perhaps there could be something else. Something new?
>
>We need something new for more advanced memory providers, anyway.
>The huge page example I posted a year ago needs something to get
>a huge page from CMA and slice it up for the page pools to draw from.
>That's very similar, also not really bound to a netdev. I don't think
>the cross-netdev aspect is the most important aspect of this problem.

Well, in our case, the shared entity is not floating, it is bound to a
device related to netdev.


>
>> >Another netdev thing where this will be awkward is page pool
>> >integration. It lives in netdev genl, are we going to add devlink pool
>> >reference to indicate which pool a pp is feeding?  
>> 
>> Page pool is per-netdev, isn't it? It could be extended to be bound per
>> devlink-pool as you suggest. It is a bit awkward, I agree.
>> 
>> So instead of devlink, should be add the descriptor-pool object into
>> netdev genl and make possible for multiple netdevs to use it there?
>> I would still miss the namespace of the pool, as it naturally aligns
>> with devlink device. IDK :/
>
>Maybe the first thing to iron out is the life cycle. Right now we
>throw all configuration requests at the driver which ends really badly
>for those of us who deal with heterogeneous environments. Applications
>which try to do advanced stuff like pinning and XDP break because of
>all the behavior differences between drivers. So I don't think we
>should expose configuration of unstable objects (those which user
>doesn't create explicitly - queues, irqs, page pools etc) to the driver.
>The driver should get or read the config from the core when the object
>is created.

I see. But again, for global objects, I understand. But this is a
device-specific object and configuration. How do you tie it all together?


>
>This gets back to the proposed descriptor pool because there's a
>chicken and an egg problem between creating the representors and
>creating the descriptor pool, right? Either:
> - create reprs first with individual queues, reconfigure them to bind
>   them to a pool
> - create pool first bind the reprs which don't exist to them,
>   assuming the driver somehow maintains the mapping, pretty weird
>   to configure objects which don't exist
> - create pool first, add an extra knob elsewhere (*cough* "shared-descs
>   enable") which produces somewhat loosely defined reasonable behavior
>
>Because this is a general problem (again, any queue config needs it)
>I think we'll need to create some sort of a rule engine in netdev :(
>Instead of configuring a page pool you'd add a configuration rule
>which can match on netdev and queue id and gives any related page pool
>some parameters. NAPI is another example of something user can't
>reasonably configure directly. And if we create such a rule engine 
>it should probably be shared...
Jacob Keller Feb. 15, 2024, 5:41 p.m. UTC | #25
On 2/15/2024 5:19 AM, Jiri Pirko wrote:
> Fri, Feb 09, 2024 at 02:26:33AM CET, kuba@kernel.org wrote:
>> On Fri, 2 Feb 2024 08:46:56 +0100 Jiri Pirko wrote:
>>> Fri, Feb 02, 2024 at 05:00:41AM CET, kuba@kernel.org wrote:
>>>> On Thu, 1 Feb 2024 11:13:57 +0100 Jiri Pirko wrote:  
>>>>> Wait a sec.  
>>>>
>>>> No, you wait a sec ;) Why do you think this belongs to devlink?
>>>> Two months ago you were complaining bitterly when people were
>>>> considering using devlink rate to control per-queue shapers.
>>>> And now it's fine to add queues as a concept to devlink?  
>>>
>>> Do you have a better suggestion how to model common pool object for
>>> multiple netdevices? This is the reason why devlink was introduced to
>>> provide a platform for common/shared things for a device that contains
>>> multiple netdevs/ports/whatever. But I may be missing something here,
>>> for sure.
>>
>> devlink just seems like the lowest common denominator, but the moment
>> we start talking about multi-PF devices it also gets wobbly :(
> 
> You mean you see real to have a multi-PF device that allows to share the
> pools between the PFs? If, in theory, that exists, could this just be a
> limitation perhaps?
> 
> 

I don't know offhand if we have a device which can share pools
specifically, but we do have multi-PF devices which have a lot of shared
resources. However, due to the multi-PF PCIe design, I looked into ways
to get a single devlink across the devices, but ultimately got stymied
and gave up.

This left us with accepting the limitation that each PF gets its own
devlink and can't really communicate with other PFs.

The existing solution has just been to partition the shared resources
evenly across PFs, typically via firmware. No flexibility.

I do think the best solution here would be to figure out a generic way
to tie multiple functions into a single devlink representing the device.
Then each function gets the set of devlink_port objects associated with
it. I'm not entirely sure how that would work. We could hack something
together with auxbus, but that's pretty ugly. Some sort of orchestration
in the PCI layer that could identify when a device wants to have some
sort of "parent" driver which loads once and has ties to each of the
function drivers would be ideal.

Then this parent driver could register devlink, and each function driver
could connect to it and allocate ports and function-specific resources.

Alternatively, a design which loads a single driver that maintains
references to each function could work, but that requires a significant
change to the entire driver design and is unlikely to be done for
existing drivers...

This is obviously a bit more of a tangential problem to this series.
Jakub Kicinski Feb. 16, 2024, 1:58 a.m. UTC | #26
On Thu, 15 Feb 2024 14:19:40 +0100 Jiri Pirko wrote:
> >Maybe the first thing to iron out is the life cycle. Right now we
> >throw all configuration requests at the driver which ends really badly
> >for those of us who deal with heterogeneous environments. Applications
> >which try to do advanced stuff like pinning and XDP break because of
> >all the behavior differences between drivers. So I don't think we
> >should expose configuration of unstable objects (those which user
> >doesn't create explicitly - queues, irqs, page pools etc) to the driver.
> >The driver should get or read the config from the core when the object
> >is created.  
> 
> I see. But again, for global objects, I understand. But this is
> device-specific object and configuration. How do you tie it up together?

We disagree how things should be modeled, sort of in principle.
I think it dates all the way back to your work on altnames.
We had the same conversation on DPLL :(

I prefer to give objects unique IDs and a bunch of optional identifying
attributes, rather than trying to build some well organized hierarchy.
The hierarchy often becomes an unnecessary constraint.
Jakub Kicinski Feb. 16, 2024, 2:07 a.m. UTC | #27
On Thu, 15 Feb 2024 09:41:31 -0800 Jacob Keller wrote:
> I don't know offhand if we have a device which can share pools
> specifically, but we do have multi-PF devices which have a lot of shared
> resources. However, due to the multi-PF PCIe design. I looked into ways
> to get a single devlink across the devices.. but ultimately got stymied
> and gave up.
> 
> This left us with accepting the limitation that each PF gets its own
> devlink and can't really communicate with other PFs.
> 
> The existing solution has just been to partition the shared resources
> evenly across PFs, typically via firmware. No flexibility.
> 
> I do think the best solution here would be to figure out a generic way
> to tie multiple functions into a single devlink representing the device.
> Then each function gets the set of devlink_port objects associated to
> it. I'm not entirely sure how that would work. We could hack something
> together with auxbus.. but thats pretty ugly. Some sort of orchestration
> in the PCI layer that could identify when a device wants to have some
> sort of "parent" driver which loads once and has ties to each of the
> function drivers would be ideal.
> 
> Then this parent driver could register devlink, and each function driver
> could connect to it and allocate ports and function-specific resources.
> 
> Alternatively a design which loads a single driver that maintains
> references to each function could work but that requires a significant
> change to the entire driver design and is unlikely to be done for
> existing drivers...

I think the complexity mostly stems from having to answer what the
"right behavior" is. At least that's what I concluded when thinking
about it back at Netronome :)  If you do a strict hierarchy where
one PF is preassigned the role of the leader, and just fail if anything
unexpected happens - it should be doable. We already kinda have the
model where devlink is the "first layer of probing" and "reload_up()"
is the second.

Have you had a chance to take a closer look at mlx5 "socket direct"
(rename pending) implementation?

BTW Jiri, weren't you expecting that to use component drivers or some
such?
Jiri Pirko Feb. 16, 2024, 8:06 a.m. UTC | #28
Fri, Feb 16, 2024 at 02:58:36AM CET, kuba@kernel.org wrote:
>On Thu, 15 Feb 2024 14:19:40 +0100 Jiri Pirko wrote:
>> >Maybe the first thing to iron out is the life cycle. Right now we
>> >throw all configuration requests at the driver which ends really badly
>> >for those of us who deal with heterogeneous environments. Applications
>> >which try to do advanced stuff like pinning and XDP break because of
>> >all the behavior differences between drivers. So I don't think we
>> >should expose configuration of unstable objects (those which user
>> >doesn't create explicitly - queues, irqs, page pools etc) to the driver.
>> >The driver should get or read the config from the core when the object
>> >is created.  
>> 
>> I see. But again, for global objects, I understand. But this is
>> device-specific object and configuration. How do you tie it up together?
>
>We disagree how things should be modeled, sort of in principle.
>I think it dates all the way back to your work on altnames.
>We had the same conversation on DPLL :(
>
>I prefer to give objects unique IDs and a bunch of optional identifying
>attributes, rather than trying to build some well organized hierarchy.
>The hierarchy often becomes an unnecessary constraint.

Sure, no problem on having floating objects with ids and attributes.
But in case they relate to HW configuration, you need to somehow glue
them to a device eventually. This is what I'm missing how you envision
it. The lifetime of object and glue/unglue operations.
Jiri Pirko Feb. 16, 2024, 8:10 a.m. UTC | #29
Thu, Feb 15, 2024 at 06:41:31PM CET, jacob.e.keller@intel.com wrote:
>
>
>On 2/15/2024 5:19 AM, Jiri Pirko wrote:
>> Fri, Feb 09, 2024 at 02:26:33AM CET, kuba@kernel.org wrote:
>>> On Fri, 2 Feb 2024 08:46:56 +0100 Jiri Pirko wrote:
>>>> Fri, Feb 02, 2024 at 05:00:41AM CET, kuba@kernel.org wrote:
>>>>> On Thu, 1 Feb 2024 11:13:57 +0100 Jiri Pirko wrote:  
>>>>>> Wait a sec.  
>>>>>
>>>>> No, you wait a sec ;) Why do you think this belongs to devlink?
>>>>> Two months ago you were complaining bitterly when people were
>>>>> considering using devlink rate to control per-queue shapers.
>>>>> And now it's fine to add queues as a concept to devlink?  
>>>>
>>>> Do you have a better suggestion how to model common pool object for
>>>> multiple netdevices? This is the reason why devlink was introduced to
>>>> provide a platform for common/shared things for a device that contains
>>>> multiple netdevs/ports/whatever. But I may be missing something here,
>>>> for sure.
>>>
>>> devlink just seems like the lowest common denominator, but the moment
>>> we start talking about multi-PF devices it also gets wobbly :(
>> 
>> You mean you see real to have a multi-PF device that allows to share the
>> pools between the PFs? If, in theory, that exists, could this just be a
>> limitation perhaps?
>> 
>> 
>
>I don't know offhand if we have a device which can share pools
>specifically, but we do have multi-PF devices which have a lot of shared
>resources. However, due to the multi-PF PCIe design. I looked into ways
>to get a single devlink across the devices.. but ultimately got stymied
>and gave up.
>
>This left us with accepting the limitation that each PF gets its own
>devlink and can't really communicate with other PFs.
>
>The existing solution has just been to partition the shared resources
>evenly across PFs, typically via firmware. No flexibility.
>
>I do think the best solution here would be to figure out a generic way
>to tie multiple functions into a single devlink representing the device.
>Then each function gets the set of devlink_port objects associated to
>it. I'm not entirely sure how that would work. We could hack something
>together with auxbus.. but thats pretty ugly. Some sort of orchestration
>in the PCI layer that could identify when a device wants to have some
>sort of "parent" driver which loads once and has ties to each of the
>function drivers would be ideal.

Hmm, dpll does this. You basically share dpll device instance in between
multiple pci functions. The same concept could be done for devlink. We
have to figure out the backward compatibility. User now expects the
instances per-pf.


>
>Then this parent driver could register devlink, and each function driver
>could connect to it and allocate ports and function-specific resources.
>
>Alternatively a design which loads a single driver that maintains
>references to each function could work but that requires a significant
>change to the entire driver design and is unlikely to be done for
>existing drivers...
>
>This is obviously a bit more of a tangential problem to this series.
Jiri Pirko Feb. 16, 2024, 8:15 a.m. UTC | #30
Fri, Feb 16, 2024 at 03:07:29AM CET, kuba@kernel.org wrote:
>On Thu, 15 Feb 2024 09:41:31 -0800 Jacob Keller wrote:
>> I don't know offhand if we have a device which can share pools
>> specifically, but we do have multi-PF devices which have a lot of shared
>> resources. However, due to the multi-PF PCIe design. I looked into ways
>> to get a single devlink across the devices.. but ultimately got stymied
>> and gave up.
>> 
>> This left us with accepting the limitation that each PF gets its own
>> devlink and can't really communicate with other PFs.
>> 
>> The existing solution has just been to partition the shared resources
>> evenly across PFs, typically via firmware. No flexibility.
>> 
>> I do think the best solution here would be to figure out a generic way
>> to tie multiple functions into a single devlink representing the device.
>> Then each function gets the set of devlink_port objects associated to
>> it. I'm not entirely sure how that would work. We could hack something
>> together with auxbus.. but thats pretty ugly. Some sort of orchestration
>> in the PCI layer that could identify when a device wants to have some
>> sort of "parent" driver which loads once and has ties to each of the
>> function drivers would be ideal.
>> 
>> Then this parent driver could register devlink, and each function driver
>> could connect to it and allocate ports and function-specific resources.
>> 
>> Alternatively a design which loads a single driver that maintains
>> references to each function could work but that requires a significant
>> change to the entire driver design and is unlikely to be done for
>> existing drivers...
>
>I think the complexity mostly stems from having to answer what the
>"right behavior" is. At least that's what I concluded when thinking
>about it back at Netronome :)  If you do a strict hierarchy where
>one PF is preassigned the role of the leader, and just fail if anything
>unexpected happens - it should be doable. We already kinda have the
>model where devlink is the "first layer of probing" and "reload_up()"
>is the second.
>
>Have you had a chance to take a closer look at mlx5 "socket direct"
>(rename pending) implementation?
>
>BTW Jiri, weren't you expecting that to use component drivers or some
>such?

IIRC, turned out that was not suitable for this case by my colleagues.
You have to ask them why, I don't recall.

But socket direct is a different kind of story. There 2/n PFs are just
separate NUMA PCI channels to a single FW entity.
Jacob Keller Feb. 16, 2024, 9:42 p.m. UTC | #31
On 2/15/2024 6:07 PM, Jakub Kicinski wrote:
> On Thu, 15 Feb 2024 09:41:31 -0800 Jacob Keller wrote:
>> I don't know offhand if we have a device which can share pools
>> specifically, but we do have multi-PF devices which have a lot of shared
>> resources. However, due to the multi-PF PCIe design. I looked into ways
>> to get a single devlink across the devices.. but ultimately got stymied
>> and gave up.
>>
>> This left us with accepting the limitation that each PF gets its own
>> devlink and can't really communicate with other PFs.
>>
>> The existing solution has just been to partition the shared resources
>> evenly across PFs, typically via firmware. No flexibility.
>>
>> I do think the best solution here would be to figure out a generic way
>> to tie multiple functions into a single devlink representing the device.
>> Then each function gets the set of devlink_port objects associated to
>> it. I'm not entirely sure how that would work. We could hack something
>> together with auxbus.. but thats pretty ugly. Some sort of orchestration
>> in the PCI layer that could identify when a device wants to have some
>> sort of "parent" driver which loads once and has ties to each of the
>> function drivers would be ideal.
>>
>> Then this parent driver could register devlink, and each function driver
>> could connect to it and allocate ports and function-specific resources.
>>
>> Alternatively a design which loads a single driver that maintains
>> references to each function could work but that requires a significant
>> change to the entire driver design and is unlikely to be done for
>> existing drivers...
> 
> I think the complexity mostly stems from having to answer what the
> "right behavior" is. At least that's what I concluded when thinking
> about it back at Netronome :)  If you do a strict hierarchy where
> one PF is preassigned the role of the leader, and just fail if anything
> unexpected happens - it should be doable. We already kinda have the
> model where devlink is the "first layer of probing" and "reload_up()"
> is the second.

Well, a lot comes from the choices made at hardware design to have a
multi-function device but then still have pieces shared across PFs, with
no clear plan for how to actually orchestrate that sharing...

> 
> Have you had a chance to take a closer look at mlx5 "socket direct"
> (rename pending) implementation?
> 

Hm. No I hadn't seen that yet.

> BTW Jiri, weren't you expecting that to use component drivers or some
> such?
Jacob Keller Feb. 16, 2024, 9:44 p.m. UTC | #32
On 2/16/2024 12:10 AM, Jiri Pirko wrote:
> Thu, Feb 15, 2024 at 06:41:31PM CET, jacob.e.keller@intel.com wrote:
>>
>>
>> On 2/15/2024 5:19 AM, Jiri Pirko wrote:
>>> Fri, Feb 09, 2024 at 02:26:33AM CET, kuba@kernel.org wrote:
>>>> On Fri, 2 Feb 2024 08:46:56 +0100 Jiri Pirko wrote:
>>>>> Fri, Feb 02, 2024 at 05:00:41AM CET, kuba@kernel.org wrote:
>>>>>> On Thu, 1 Feb 2024 11:13:57 +0100 Jiri Pirko wrote:  
>>>>>>> Wait a sec.  
>>>>>>
>>>>>> No, you wait a sec ;) Why do you think this belongs to devlink?
>>>>>> Two months ago you were complaining bitterly when people were
>>>>>> considering using devlink rate to control per-queue shapers.
>>>>>> And now it's fine to add queues as a concept to devlink?  
>>>>>
>>>>> Do you have a better suggestion how to model common pool object for
>>>>> multiple netdevices? This is the reason why devlink was introduced to
>>>>> provide a platform for common/shared things for a device that contains
>>>>> multiple netdevs/ports/whatever. But I may be missing something here,
>>>>> for sure.
>>>>
>>>> devlink just seems like the lowest common denominator, but the moment
>>>> we start talking about multi-PF devices it also gets wobbly :(
>>>
>>> You mean you see real to have a multi-PF device that allows to share the
>>> pools between the PFs? If, in theory, that exists, could this just be a
>>> limitation perhaps?
>>>
>>>
>>
>> I don't know offhand if we have a device which can share pools
>> specifically, but we do have multi-PF devices which have a lot of shared
>> resources. However, due to the multi-PF PCIe design. I looked into ways
>> to get a single devlink across the devices.. but ultimately got stymied
>> and gave up.
>>
>> This left us with accepting the limitation that each PF gets its own
>> devlink and can't really communicate with other PFs.
>>
>> The existing solution has just been to partition the shared resources
>> evenly across PFs, typically via firmware. No flexibility.
>>
>> I do think the best solution here would be to figure out a generic way
>> to tie multiple functions into a single devlink representing the device.
>> Then each function gets the set of devlink_port objects associated to
>> it. I'm not entirely sure how that would work. We could hack something
>> together with auxbus.. but thats pretty ugly. Some sort of orchestration
>> in the PCI layer that could identify when a device wants to have some
>> sort of "parent" driver which loads once and has ties to each of the
>> function drivers would be ideal.
> 
> Hmm, dpll does this. You basically share dpll device instance in between
> multiple pci functions. The same concept could be done for devlink. We
> have to figure out the backward compatibility. User now expects the
> instances per-pf.
> 
> 

Ya, ice started doing this over auxbus for PTP as well, because we had a
bunch of issues with E822 devices that couldn't be solved without some
communication across PF.

I'm not sure its actually worth trying to change how devlink works now,
its pretty ingrained at this point.

But I think it is a possibility that could have happened, and having a
centralized "this is the device owner" would have simplified a lot of
concepts for managing these sorts of shared resources which are not
owned by a single PF.
Jacob Keller Feb. 16, 2024, 9:47 p.m. UTC | #33
On 2/15/2024 6:07 PM, Jakub Kicinski wrote:
> On Thu, 15 Feb 2024 09:41:31 -0800 Jacob Keller wrote:
>> I don't know offhand if we have a device which can share pools
>> specifically, but we do have multi-PF devices which have a lot of shared
>> resources. However, due to the multi-PF PCIe design. I looked into ways
>> to get a single devlink across the devices.. but ultimately got stymied
>> and gave up.
>>
>> This left us with accepting the limitation that each PF gets its own
>> devlink and can't really communicate with other PFs.
>>
>> The existing solution has just been to partition the shared resources
>> evenly across PFs, typically via firmware. No flexibility.
>>
>> I do think the best solution here would be to figure out a generic way
>> to tie multiple functions into a single devlink representing the device.
>> Then each function gets the set of devlink_port objects associated to
>> it. I'm not entirely sure how that would work. We could hack something
>> together with auxbus.. but thats pretty ugly. Some sort of orchestration
>> in the PCI layer that could identify when a device wants to have some
>> sort of "parent" driver which loads once and has ties to each of the
>> function drivers would be ideal.
>>
>> Then this parent driver could register devlink, and each function driver
>> could connect to it and allocate ports and function-specific resources.
>>
>> Alternatively a design which loads a single driver that maintains
>> references to each function could work but that requires a significant
>> change to the entire driver design and is unlikely to be done for
>> existing drivers...
> 
> I think the complexity mostly stems from having to answer what the
> "right behavior" is. At least that's what I concluded when thinking
> about it back at Netronome :)  If you do a strict hierarchy where
> one PF is preassigned the role of the leader, and just fail if anything
> unexpected happens - it should be doable. We already kinda have the
> model where devlink is the "first layer of probing" and "reload_up()"
> is the second.
> 

You can of course just assign it such that one PF "owns" things, but
that seems a bit confusing if there isn't a clear mechanism for users to
understand which PF is the owner. I guess they can check
devlink/netlink/whatever and see the resources there. It also still
doesn't provide a communication mechanism to actually pass sub-ownership
across the PFs, unless your device firmware can do that for you.

The other option commonly used is partitioning so you just pre-determine
how to slice the resources up per PF. This isn't flexible, but it is simple.
Jakub Kicinski Feb. 17, 2024, 2:43 a.m. UTC | #34
On Fri, 16 Feb 2024 09:06:37 +0100 Jiri Pirko wrote:
> >We disagree how things should be modeled, sort of in principle.
> >I think it dates all the way back to your work on altnames.
> >We had the same conversation on DPLL :(
> >
> >I prefer to give objects unique IDs and a bunch of optional identifying
> >attributes, rather than trying to build some well organized hierarchy.
> >The hierarchy often becomes an unnecessary constraint.  
> 
> Sure, no problem on having floating objects with ids and attributes.
> But in case they relate to HW configuration, you need to somehow glue
> them to a device eventually. This is what I'm missing how you envision
> it. The lifetime of object and glue/unglue operations.

My desired lifetime is that the object (shared pool) gets created when
the first consumer (netdev) appears, and destroyed when the last one
disappears. Just like you can configure huge rings on a NIC while its
closed and that won't consume any memory, share pool shouldn't exist if
all its consumers are closed.

The tricky part is to come up with some way of saying that we want
multiple netdevs to not only use a shared pool, but which ones should
be sharing which pool, when the pools themselves don't get explicitly
created. Right?

Gluing to the device is easier, IIUC, once the pool is create we can
give it whatever attributes we want. devlink ID, bus/device, netdev,
IOMMU domain, anything.
Jiri Pirko Feb. 19, 2024, 8:59 a.m. UTC | #35
Fri, Feb 16, 2024 at 10:47:50PM CET, jacob.e.keller@intel.com wrote:
>
>
>On 2/15/2024 6:07 PM, Jakub Kicinski wrote:
>> On Thu, 15 Feb 2024 09:41:31 -0800 Jacob Keller wrote:
>>> I don't know offhand if we have a device which can share pools
>>> specifically, but we do have multi-PF devices which have a lot of shared
>>> resources. However, due to the multi-PF PCIe design. I looked into ways
>>> to get a single devlink across the devices.. but ultimately got stymied
>>> and gave up.
>>>
>>> This left us with accepting the limitation that each PF gets its own
>>> devlink and can't really communicate with other PFs.
>>>
>>> The existing solution has just been to partition the shared resources
>>> evenly across PFs, typically via firmware. No flexibility.
>>>
>>> I do think the best solution here would be to figure out a generic way
>>> to tie multiple functions into a single devlink representing the device.
>>> Then each function gets the set of devlink_port objects associated to
>>> it. I'm not entirely sure how that would work. We could hack something
>>> together with auxbus.. but thats pretty ugly. Some sort of orchestration
>>> in the PCI layer that could identify when a device wants to have some
>>> sort of "parent" driver which loads once and has ties to each of the
>>> function drivers would be ideal.
>>>
>>> Then this parent driver could register devlink, and each function driver
>>> could connect to it and allocate ports and function-specific resources.
>>>
>>> Alternatively a design which loads a single driver that maintains
>>> references to each function could work but that requires a significant
>>> change to the entire driver design and is unlikely to be done for
>>> existing drivers...
>> 
>> I think the complexity mostly stems from having to answer what the
>> "right behavior" is. At least that's what I concluded when thinking
>> about it back at Netronome :)  If you do a strict hierarchy where
>> one PF is preassigned the role of the leader, and just fail if anything
>> unexpected happens - it should be doable. We already kinda have the
>> model where devlink is the "first layer of probing" and "reload_up()"
>> is the second.
>> 
>
>You can of course just assign it such that one PF "owns" things, but
>that seems a bit confusing if there isn't a clear mechanism for users to
>understand which PF is the owner. I guess they can check
>devlink/netlink/whatever and see the resources there. It also still
>doesn't provide a communication mechanism to actually pass sub-ownership
>across the PFs, unless your device firmware can do that for you.
>
>The other option commonly used is partitioning so you just pre-determine
>how to slice the resources up per PF. This isn't flexible, but it is simple.

I will cook up a rfc for the devlink instance to represent the parent of
the PFs. I think we have everything we need in place already. Will send
that soon.
Jiri Pirko Feb. 19, 2024, 9:06 a.m. UTC | #36
Sat, Feb 17, 2024 at 03:43:32AM CET, kuba@kernel.org wrote:
>On Fri, 16 Feb 2024 09:06:37 +0100 Jiri Pirko wrote:
>> >We disagree how things should be modeled, sort of in principle.
>> >I think it dates all the way back to your work on altnames.
>> >We had the same conversation on DPLL :(
>> >
>> >I prefer to give objects unique IDs and a bunch of optional identifying
>> >attributes, rather than trying to build some well organized hierarchy.
>> >The hierarchy often becomes an unnecessary constraint.  
>> 
>> Sure, no problem on having floating objects with ids and attributes.
>> But in case they relate to HW configuration, you need to somehow glue
>> them to a device eventually. This is what I'm missing how you envision
>> it. The lifetime of object and glue/unglue operations.
>
>My desired lifetime is that the object (shared pool) gets created when
>the first consumer (netdev) appears, and destroyed when the last one
>disappears. Just like you can configure huge rings on a NIC while its

***


>closed and that won't consume any memory, share pool shouldn't exist if
>all its consumers are closed.
>
>The tricky part is to come up with some way of saying that we want
>multiple netdevs to not only use a shared pool, but which ones should
>be sharing which pool, when the pools themselves don't get explicitly
>created. Right?

Yeah, also, there is limitation of who can share with who.

>
>Gluing to the device is easier, IIUC, once the pool is create we can
>give it whatever attributes we want. devlink ID, bus/device, netdev,
>IOMMU domain, anything.

I'm confused. Above (***) you say that the shared pool is created upon
first netdev creation. Now you indicate the user creates it and then
"binds" it to some object (like devlink).

So, do you think there should be 2 types of pools:
1) implicit upon netdev creation
2) explicit defined by the user?
Jakub Kicinski Feb. 20, 2024, 10:17 p.m. UTC | #37
On Mon, 19 Feb 2024 10:06:17 +0100 Jiri Pirko wrote:
> >Gluing to the device is easier, IIUC, once the pool is create we can
> >give it whatever attributes we want. devlink ID, bus/device, netdev,
> >IOMMU domain, anything.  
> 
> I'm confused. Above (***) you say that the shared pool is created upon
> first netdev creation. Now you indicate the user creates it and then
> "binds" it to some object (like devlink).
> 
> So, do you think there should be 2 types of pools:
> 1) implicit upon netdev creation
> 2) explicit defined by the user?

Implicitly, "once the pool is create[d]" does not say that user created
the pool. The flow would be more like:
 - user changes configuration "rules"
 - if device is running && config changed:
   - core instantiates the new pools
   - core issues "reset" request to the device to rebuild queues
   - core shuts down old pools
diff mbox series

Patch

diff --git a/Documentation/networking/devlink/devlink-sd.rst b/Documentation/networking/devlink/devlink-sd.rst
new file mode 100644
index 000000000000..e73587de9c50
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-sd.rst
@@ -0,0 +1,296 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+Devlink Shared Descriptors
+==========================
+
+Glossary
+========
+* REP port: Representor port
+* RQ: Receive Queue or RX Queue
+* WQE: Work Queue Entry
+* IRQ: Interrupt Request
+* Channel: A channel is an IRQ and the set of queues that can trigger
+  that IRQ.  ``devlink-sd`` assumes one rx queue per channel.
+* Descriptor: The data structure that describes a network packet.
+  An RQ consists of multiple descriptors.
+* NAPI context: New API context, associated with a device channel.
+* Device Naming:
+
+  - Uplink representors: p<port_number>, ex: p0.
+  - PF representors: pf<port_number>hpf, ex: pf0hpf.
+  - VF representors: pf<port_number>vf<function_number>, ex: pf0vf1.
+
+Background
+==========
+The ``devlink-sd`` mechanism targets the configuration of the shared rx
+descriptors that serve as a descriptor pool for ethernet representors
+(reps), to better utilize memory. The following operations are
+provided:
+
+* Add/delete a shared descriptor pool
+* Configure the pool's properties
+* Bind/unbind a representor's rx queue to a descriptor pool
+
+In switchdev mode, representors are slow-path ports that handle the
+miss traffic, i.e., traffic not being forwarded by the hardware.
+Representor ports are regular ethernet devices, with multiple channels
+consuming DMA memory. Memory consumption of the representor
+ports' rx buffers can grow to several GB when scaling to 1k VF reps.
+For example, in the mlx5 driver, each RQ, with a typical ring of 1K
+descriptors, consumes 3MB of DMA memory for the packet buffers of its
+descriptors. With four channels per representor and 1024 representors,
+that adds up to 4 * 3MB * 1024 = 12GB of memory. Since rep ports are
+for slow-path traffic, most of this rx DMA memory is idle when flows
+are forwarded directly in hardware to VFs.
+
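+A rough model of the arithmetic above is sketched below. The constants
+are only the example figures quoted in this section; actual values
+depend on the driver, ring size and MTU:
+
+.. code:: c
+
+    /* Back-of-the-envelope model: ~3MB of packet buffers per RQ
+     * (1K descriptors), four channels per representor and 1024
+     * representors.  Illustrative only, not tied to any driver.
+     */
+    #include <stdio.h>
+
+    int main(void)
+    {
+        unsigned long long per_rq_bytes = 3ULL << 20; /* ~3MB per RQ */
+        unsigned long long channels = 4;              /* RQs per rep */
+        unsigned long long reps = 1024;               /* 1k VF reps  */
+        unsigned long long total = per_rq_bytes * channels * reps;
+
+        printf("%llu GiB of rx buffer memory\n", total >> 30); /* 12 */
+        return 0;
+    }
+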
+A network device driver consists of several channels, and each channel
+represents a NAPI context, an IRQ, and the set of queues that can
+trigger that IRQ. devlink-sd considers only the *regular* RX queue in
+each channel; for example, mlx5's non-regular RQs such as the XSK RQ
+and the drop RQ are not applicable here. A device driver receives
+packets by setting up RQs, and each RQ receives packets by
+pre-allocating a dedicated set of rx ring descriptors, with each
+descriptor pointing to a memory buffer. The ``shared descriptor pool``
+is a descriptor and buffer sharing mechanism. It allows multiple RQs to
+use the rx ring descriptors from the shared descriptor pool. In other
+words, an RQ no longer has its own dedicated rx ring descriptors, which
+might sit idle when there is no traffic; instead it consumes
+descriptors from the pool only when packets arrive.
+
+The shared descriptor pool contains rx descriptors and their memory
+buffers. When multiple representors' RQs share the same pool of rx
+descriptors, they share the same set of memory buffers. As a result,
+heavy-traffic representors can use all the descriptors in the pool,
+while idle, no-traffic representors consume no memory. All the
+descriptors in the pool can be used by all the RQs. This makes the
+descriptor memory usage more efficient.
+
+The first diagram below shows two representors, each with its own
+regular RQ and a dedicated rx descriptor ring and buffers, without a
+shared descriptor pool::
+
+      +--------+            +--------+
+      │ pf0vf1 │            │ pf0vf2 │
+      │   RQ   │            │   RQ   │
+      +--------+            +--------+
+           │                     │
+     +-----┴----------+    +-----┴----------+
+     │ rx descriptors │    │ rx descriptors │
+     │ and buffers    │    │ and buffers    │
+     +----------------+    +----------------+
+
+With shared descriptors, the diagram below shows that two representors
+can share the same descriptor pool::
+
+     +--------+            +--------+
+     │ pf0vf1 │            │ pf0vf2 │
+     │   RQ   │            │   RQ   │
+     +----┬---+            +----┬---+
+          │                     │
+          +---------+  +--------+
+                    │  │
+              +-----┴--┴-------+
+              │     shared     │
+              │ rx descriptors │
+              │ and buffers    │
+              +----------------+
+
+Packets arriving for both pf0vf1 and pf0vf2 consume the descriptors
+and buffers in the pool. Once packets are passed to the upper Linux
+network stack, the driver refills the rx descriptors with new buffers,
+e.g., using the page_pool API. Typically, a NAPI context is associated
+with each channel of a device, and packet reception and refilling
+operations happen in a NAPI context. The Linux kernel guarantees that
+only one CPU at any time can call NAPI poll for each napi struct. With
+shared rx descriptors, a race condition can occur when two NAPI
+contexts, scheduled to run on two CPUs, fetch or refill descriptors
+from/to the same shared descriptor pool. Thus, the shared descriptor
+pool should either be protected by a lock or, in a better design, have
+a 1:1 mapping between descriptor pool and NAPI context (see the
+examples below).
+
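+The sketch below is a minimal user-space model of these two options;
+all types and helpers are invented for illustration and do not
+correspond to any kernel or driver API:
+
+.. code:: c
+
+    /* Option 1 serializes access with a lock so that any NAPI context
+     * may fetch from the pool; option 2 gives each NAPI context its own
+     * pool (1:1 mapping), relying on the rule that a given NAPI poll
+     * runs on only one CPU at a time.
+     */
+    #include <pthread.h>
+    #include <stddef.h>
+
+    struct sd_desc {
+        void *buf;                 /* packet buffer for this descriptor */
+    };
+
+    struct sd_pool {
+        struct sd_desc *descs;     /* ring of shared rx descriptors */
+        size_t count;              /* set via "devlink sd pool add" */
+        size_t head, tail;
+        pthread_mutex_t lock;      /* used only by option 1 */
+    };
+
+    /* Option 1: several NAPI contexts share one pool, so fetching (and
+     * refilling, not shown) must hold the pool lock. */
+    static struct sd_desc *sd_pool_get_locked(struct sd_pool *pool)
+    {
+        struct sd_desc *desc = NULL;
+
+        pthread_mutex_lock(&pool->lock);
+        if (pool->head != pool->tail)
+            desc = &pool->descs[pool->head++ % pool->count];
+        pthread_mutex_unlock(&pool->lock);
+        return desc;
+    }
+
+    /* Option 2: 1:1 mapping between pool and NAPI context, so only one
+     * CPU ever touches this pool and no lock is needed. */
+    struct napi_ctx {
+        struct sd_pool *pool;      /* the only pool this context polls */
+    };
+
+    static struct sd_desc *sd_pool_get(struct napi_ctx *napi)
+    {
+        struct sd_pool *pool = napi->pool;
+
+        if (pool->head == pool->tail)
+            return NULL;
+        return &pool->descs[pool->head++ % pool->count];
+    }
+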
+API Overview
+============
+* Name:
+   - devlink-sd : devlink shared descriptor configuration
+* Synopsis:
+   - devlink sd pool show [DEV]
+   - devlink sd pool add DEV count DESCRIPTOR_NUM
+   - devlink sd pool delete pool POOL_ID
+   - devlink sd port pool show [DEV]
+   - devlink sd port pool bind DEV queue QUEUE_ID pool POOL_ID
+   - devlink sd port pool unbind DEV queue QUEUE_ID
+
+Description
+===========
+ * devlink sd pool show - show shared descriptor pools and their
+   attributes
+
+    - DEV - the devlink device that supports shared descriptor pools. If
+      this argument is omitted, all available shared descriptor devices
+      are listed.
+
+ * devlink sd pool add - add a shared descriptor pool; the driver
+   allocates and returns the descriptor pool id.
+
+    - DEV: the devlink device that supports shared descriptor pool.
+    - count DESCRIPTOR_NUM: the number of descriptors in the pool.
+
+ * devlink sd pool delete - delete shared descriptor pool
+
+    - pool POOL_ID: the id of the shared descriptor pool to be deleted.
+      Make sure no RX queue of any port is using the pool before deleting it.
+
+ * devlink sd port pool show - display port-pool mappings
+
+    - DEV: the devlink device that supports shared descriptor pool.
+
+ * devlink sd port pool bind - set the port-pool mapping
+
+    - DEV: the devlink device that supports shared descriptor pool.
+    - queue QUEUE_ID: the index of the channel. Note that a representor
+      might have multiple RX queues/channels; specify which queue id to
+      map to the pool.
+    - pool POOL_ID: the id of the shared descriptor pool to be mapped.
+
+ * devlink sd port pool unbind - unbind the port-pool mapping
+
+    - DEV: the devlink device that supports shared descriptor pool.
+    - queue QUEUE_ID: the index of the RX queue/channel.
+
+ * devlink dev eswitch set DEV mode switchdev shared-descs { enable |
+   disable } - enable or disable the default port-pool mapping scheme
+
+    - DEV: the devlink device that supports shared descriptor pool.
+    - shared-descs { enable | disable }: enable/disable default port-pool
+      mapping scheme. See details below.
+
+
+Example usage
+=============
+
+.. code:: shell
+
+    # Enable switchdev mode for the device
+    $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev
+
+    # Show devlink device
+    $ devlink dev show
+        pci/0000:08:00.0
+        pci/0000:08:00.1
+
+    # show existing descriptor pools
+    $ devlink sd pool show pci/0000:08:00.0
+        pci/0000:08:00.0: pool 11 count 2048
+
+    # Create a shared descriptor pool with 1024 descriptors; the driver
+    # allocates and returns pool id 12
+    $ devlink sd pool add pci/0000:08:00.0 count 1024
+        pci/0000:08:00.0: pool 12 count 1024
+
+    # Now check the pool again
+    $ devlink sd pool show pci/0000:08:00.0
+        pci/0000:08:00.0: pool 11 count 2048
+        pci/0000:08:00.0: pool 12 count 1024
+
+    # Bind a representor port, pf0vf1, queue 0 to the shared descriptor pool
+    $ devlink sd port pool bind pf0vf1 queue 0 pool 12
+
+    # Bind a representor port, pf0vf2, queue 0 to the shared descriptor pool
+    $ devlink sd port pool bind pf0vf2 queue 0 pool 12
+
+    # Show the rep port-pool mapping of pf0vf1
+    $ devlink sd port pool show pci/0000:08:00.0/11
+        pci/0000:08:00.0/11: queue 0 pool 12
+
+    # Show the rep port-pool mapping of pf0vf2
+    $ devlink sd port pool show pf0vf2
+    # or use the devlink port handle
+    $ devlink sd port pool show pci/0000:08:00.0/22
+        pci/0000:08:00.0/22: queue 0 pool 12
+
+    # To dump all ports mapping for a device
+    $ devlink sd port pool show pci/0000:08:00.0
+        pci/0000:08:00.0/11: queue 0 pool 12
+        pci/0000:08:00.0/22: queue 0 pool 12
+
+    # Unbind a representor port, pf0vf1, queue 0 from the shared descriptor pool
+    $ devlink sd port pool unbind pf0vf1 queue 0
+    $ devlink sd port pool show pci/0000:08:00.0
+        pci/0000:08:00.0/22: queue 0 pool 12
+
+Default Mapping Scheme
+======================
+``devlink-sd`` tries to be generic and fine-grained: it allows users to
+create shared descriptor pools and bind them to representor ports in
+any mapping scheme they want. However, users typically don't want to do
+this by themselves. For convenience, ``devlink-sd`` provides a default
+mapping scheme as follows:
+
+.. code:: shell
+
+   # A shared descriptor pool is created for each rx queue of the
+   # uplink representor; assume it has two queues:
+   $ devlink sd pool show p0
+       pci/0000:08:00.0: pool 8 count 1024 # reserved for queue 0
+       pci/0000:08:00.0: pool 9 count 1024 # reserved for queue 1
+
+   # Each representor port's rx queues are bound to the corresponding
+   # pools, for example:
+   $ devlink sd port pool show pf0vf1
+        pci/0000:08:00.0/11: queue 0 pool 8
+        pci/0000:08:00.0/11: queue 1 pool 9
+
+   $ devlink sd port pool show pf0vf2
+        pci/0000:08:00.0/22: queue 0 pool 8
+        pci/0000:08:00.0/22: queue 1 pool 9
+
+The diagram shows the default mapping with two representors, each with
+two RX queues::
+
+     +--------+            +--------+     +--------+
+     │   p0   │            │ pf0vf1 │     │ pf0vf2 │
+     │RQ0  RQ1│-------+    │RQ0  RQ1│     │RQ0  RQ1│
+     +-+------+       |    +-+----+-+     +-+----+-+
+       |              |      |    |         |    |
+       |  +------------------+    |     to  |    |
+       |  |           |           |      POOL-8  |
+   +---v--v-+         |     +-----v--+           |
+   │ POOL-8 │         |---> │ POOL-9 │<----------+
+   +--------+               +--------+
+    NAPI-0                    NAPI-1
+
+The benefit of this default mapping is that it allows p0, the uplink
+representor, to receive packets that are destined for pf0vf1 and
+pf0vf2, simply by polling the shared descriptor pools. In the above
+case, p0 has two NAPI contexts: NAPI-0 polls RQ0 and NAPI-1 polls RQ1.
+Since NAPI-0 receives packets by checking all the descriptors in
+POOL-8, and POOL-8 also contains packets for pf0vf1 and pf0vf2, polling
+POOL-8 and POOL-9 receives all the packets. As a result, the uplink
+representor becomes the single device that receives packets for the
+other representors. This makes managing pools and rx queues easier,
+and since only one NAPI context polls a given pool, no lock is required
+to avoid contention.
+
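+The sketch below illustrates this polling model. It assumes each
+completed descriptor records which representor the packet belongs to;
+how a device exposes that is driver specific and outside the scope of
+``devlink-sd``, and all types and helpers here are invented for
+illustration:
+
+.. code:: c
+
+    /* p0's NAPI context draining one shared pool in the default mapping
+     * scheme.  Only this NAPI context polls the pool (1:1 mapping), so
+     * no locking is needed.
+     */
+    #include <stddef.h>
+
+    struct sd_completion {
+        int rep_id;                /* representor owning this packet */
+        void *buf;                 /* received packet buffer */
+    };
+
+    struct sd_pool {
+        struct sd_completion *completed; /* completed rx descriptors */
+        size_t n_completed;              /* entries in completed[] */
+        size_t next;                     /* next entry to deliver */
+    };
+
+    static size_t uplink_poll(struct sd_pool *pool, size_t budget,
+                              void (*deliver)(int rep_id, void *buf))
+    {
+        size_t done = 0;
+
+        while (done < budget && pool->next < pool->n_completed) {
+            struct sd_completion *c = &pool->completed[pool->next++];
+
+            /* Hand the packet to the netdev of the owning representor
+             * (p0, pf0vf1 or pf0vf2); the buffer can then be recycled
+             * back into the pool, e.g. via the page_pool API. */
+            deliver(c->rep_id, c->buf);
+            done++;
+        }
+        return done;
+    }
+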
+Example usage (Default)
+=======================
+
+.. code:: shell
+
+    # Enable switchdev mode with the additional *shared-descs* option
+    $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
+      shared-descs enable
+
+    # Assume two rx queues and one uplink device p0, and two reps pf0vf1 and pf0vf2
+    $ devlink sd port pool show pci/0000:08:00.0
+        pci/0000:08:00.0: queue 0 pool 8
+        pci/0000:08:00.0: queue 1 pool 9
+        pci/0000:08:00.0/11: queue 0 pool 8
+        pci/0000:08:00.0/11: queue 1 pool 9
+        pci/0000:08:00.0/22: queue 0 pool 8
+        pci/0000:08:00.0/22: queue 1 pool 9
+
+    # Disabling the *shared-descs* option falls back to non-sharing
+    $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
+      shared-descs disable
+
+    # pool and port-pool mappings are cleared
+    $ devlink sd port pool show pci/0000:08:00.0
+