[RFC,net-next,0/5] Add ETS and TBF Qdisc offload for Airoha EN7581 SoC

Message ID cover.1733930558.git.lorenzo@kernel.org (mailing list archive)

Message

Lorenzo Bianconi Dec. 11, 2024, 3:31 p.m. UTC
Introduce support for the ETS and TBF qdisc offload available in the Airoha
EN7581 ethernet controller.
Some DSA hw switches do not support Qdisc offloading, or the mac chip offers
more fine-grained QoS capabilities than the hw switch (e.g. the Airoha EN7581
mac chip has more hw QoS and buffering capabilities than the mt7530 switch).
Introduce the ndo_setup_tc_conduit callback to allow tc to offload Qdisc
policies for a given DSA user port by configuring the mac chip attached to
the hw switch cpu port.
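
As a rough sketch of the idea (the function and helper names below are purely
illustrative and not the actual patch contents), the conduit driver would
dispatch the per-user-port requests along these lines:

  static int
  airoha_dev_setup_tc_conduit(struct net_device *conduit, int port,
                              enum tc_setup_type type, void *type_data)
  {
          switch (type) {
          case TC_SETUP_QDISC_ETS:
                  /* program the per-channel ETS scheduler for this user port */
                  return airoha_qdma_set_chan_ets(conduit, port, type_data);
          case TC_SETUP_QDISC_TBF:
                  /* program the per-channel TBF shaper for this user port */
                  return airoha_qdma_set_chan_tbf(conduit, port, type_data);
          default:
                  return -EOPNOTSUPP;
          }
  }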

Lorenzo Bianconi (5):
  net: airoha: Enable Tx drop capability for each Tx DMA ring
  net: airoha: Introduce ndo_select_queue callback
  net: dsa: Introduce ndo_setup_tc_conduit callback
  net: airoha: Add sched ETS offload support
  net: airoha: Add sched TBF offload support

 drivers/net/ethernet/mediatek/airoha_eth.c | 372 ++++++++++++++++++++-
 include/linux/netdevice.h                  |  12 +
 net/dsa/user.c                             |  47 ++-
 3 files changed, 422 insertions(+), 9 deletions(-)

Comments

Vladimir Oltean Dec. 11, 2024, 3:41 p.m. UTC | #1
Hi Lorenzo,

On Wed, Dec 11, 2024 at 04:31:48PM +0100, Lorenzo Bianconi wrote:
> Introduce support for ETS and TBF qdisc offload available in the Airoha
> EN7581 ethernet controller.
> Some DSA hw switches do not support Qdisc offloading or the mac chip
> has more fine grained QoS capabilities with respect to the hw switch
> (e.g. Airoha EN7581 mac chip has more hw QoS and buffering capabilities
> with respect to the mt7530 switch). 
> Introduce ndo_setup_tc_conduit callback in order to allow tc to offload
> Qdisc policies for the specified DSA user port configuring the hw switch
> cpu port (mac chip).

Can you please make a detailed diagram explaining how the conduit is
involved in the packet data path for QoS? Offloaded tc on a DSA user
port is supposed to affect autonomously forwarded traffic too (like the
Linux bridge).
Lorenzo Bianconi Dec. 12, 2024, 9:19 a.m. UTC | #2
> Hi Lorenzo,

Hi Vladimir,

> 
> On Wed, Dec 11, 2024 at 04:31:48PM +0100, Lorenzo Bianconi wrote:
> > Introduce support for ETS and TBF qdisc offload available in the Airoha
> > EN7581 ethernet controller.
> > Some DSA hw switches do not support Qdisc offloading or the mac chip
> > has more fine grained QoS capabilities with respect to the hw switch
> > (e.g. Airoha EN7581 mac chip has more hw QoS and buffering capabilities
> > with respect to the mt7530 switch). 
> > Introduce ndo_setup_tc_conduit callback in order to allow tc to offload
> > Qdisc policies for the specified DSA user port configuring the hw switch
> > cpu port (mac chip).
> 
> Can you please make a detailed diagram explaining how is the conduit
> involved in the packet data path for QoS? Offloaded tc on a DSA user
> port is supposed to affect autonomously forwarded traffic too (like the
> Linux bridge).

I guess a typical use case would be the one below, where traffic from the
WAN port is forwarded to a DSA LAN port (e.g. lan0) via netfilter flowtable
offload.

            ┌─────────────────────────────────┐             
            │               BR0               │             
            └───┬────────┬────────┬────────┬──┘             
┌───────────────┼────────┼────────┼────────┼───┐            
│DSA            │        │        │        │   │            
│               │        │        │        │   │            
│ ┌───┐      ┌──▼─┐   ┌──▼─┐   ┌──▼─┐   ┌──▼─┐ │       ┌───┐
│ │CPU│      │LAN0│   │LAN1│   │LAN2│   │LAN3│ │       │WAN│
│ └───┘      └────┘   └────┘   └────┘   └────┘ │       └───┘
└──────────────────────────────────────────────┘            

In this case the mac chip forwards (in hw) the WAN traffic to the DSA switch
via the CPU port. In [0] we have the EN7581 mac chip architecture where we
can assume GDM1 is the CPU port and GDM2 is the WAN port.
The goal of this RFC series is to offload a Qdisc rule (e.g. ETS) on a given
LAN port using the mac chip QoS capabilities instead of creating the QoS
discipline directly in the DSA hw switch:

$tc qdisc replace dev lan0 root handle 1: ets bands 8 strict 2 quanta 1514 1514 1514 3528 1514 1514

As described above, the reason for this approach is to rely on the more
fine-grained QoS capabilities available on the mac chip compared to the hw
switch, or to cover the case where the DSA switch does not support QoS
offloading at all.

Regards,
Lorenzo

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23020f04932701d5c8363e60756f12b43b8ed752
Vladimir Oltean Dec. 12, 2024, 3:06 p.m. UTC | #3
On Thu, Dec 12, 2024 at 10:19:41AM +0100, Lorenzo Bianconi wrote:
> > Hi Lorenzo,
> 
> Hi Vladimir,
> 
> > 
> > On Wed, Dec 11, 2024 at 04:31:48PM +0100, Lorenzo Bianconi wrote:
> > > Introduce support for ETS and TBF qdisc offload available in the Airoha
> > > EN7581 ethernet controller.
> > > Some DSA hw switches do not support Qdisc offloading or the mac chip
> > > has more fine grained QoS capabilities with respect to the hw switch
> > > (e.g. Airoha EN7581 mac chip has more hw QoS and buffering capabilities
> > > with respect to the mt7530 switch). 
> > > Introduce ndo_setup_tc_conduit callback in order to allow tc to offload
> > > Qdisc policies for the specified DSA user port configuring the hw switch
> > > cpu port (mac chip).
> > 
> > Can you please make a detailed diagram explaining how is the conduit
> > involved in the packet data path for QoS? Offloaded tc on a DSA user
> > port is supposed to affect autonomously forwarded traffic too (like the
> > Linux bridge).
> 
> I guess a typical use case would be the one below where the traffic from the
> WAN port is forwarded to a DSA LAN one (e.g. lan0) via netfilter flowtable
> offload.
> 
>             ┌─────────────────────────────────┐             
>             │               BR0               │             
>             └───┬────────┬────────┬────────┬──┘             
> ┌───────────────┼────────┼────────┼────────┼───┐            
> │DSA            │        │        │        │   │            
> │               │        │        │        │   │            
> │ ┌───┐      ┌──▼─┐   ┌──▼─┐   ┌──▼─┐   ┌──▼─┐ │       ┌───┐
> │ │CPU│      │LAN0│   │LAN1│   │LAN2│   │LAN3│ │       │WAN│
> │ └───┘      └────┘   └────┘   └────┘   └────┘ │       └───┘
> └──────────────────────────────────────────────┘            
> 
> In this case the mac chip forwards (in hw) the WAN traffic to the DSA switch
> via the CPU port. In [0] we have the EN7581 mac chip architecture where we
> can assume GDM1 is the CPU port and GDM2 is the WAN port.
> The goal of this RFC series is to offload a Qdisc rule (e.g. ETS) on a given
> LAN port using the mac chip QoS capabilities instead of creating the QoS
> discipline directly in the DSA hw switch:
> 
> $tc qdisc replace dev lan0 root handle 1: ets bands 8 strict 2 quanta 1514 1514 1514 3528 1514 1514
> 
> As described above the reason for this approach would be to rely on the more
> fine grained QoS capabilities available on the mac chip with respect to the
> hw switch or because the DSA switch does not support QoS offloading.
> 
> Regards,
> Lorenzo
> 
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23020f04932701d5c8363e60756f12b43b8ed752

Explain "the mac chip forwards (in hw) the WAN traffic to the DSA switch
via the CPU port". How many packets does airoha_dev_select_queue() see?
All of them, or only the first of a flow? What operations does the
offload consist of?
Lorenzo Bianconi Dec. 12, 2024, 5:03 p.m. UTC | #4
On Dec 12, Vladimir Oltean wrote:
> On Thu, Dec 12, 2024 at 10:19:41AM +0100, Lorenzo Bianconi wrote:
> > > Hi Lorenzo,
> > 
> > Hi Vladimir,
> > 
> > > 
> > > On Wed, Dec 11, 2024 at 04:31:48PM +0100, Lorenzo Bianconi wrote:
> > > > Introduce support for ETS and TBF qdisc offload available in the Airoha
> > > > EN7581 ethernet controller.
> > > > Some DSA hw switches do not support Qdisc offloading or the mac chip
> > > > has more fine grained QoS capabilities with respect to the hw switch
> > > > (e.g. Airoha EN7581 mac chip has more hw QoS and buffering capabilities
> > > > with respect to the mt7530 switch). 
> > > > Introduce ndo_setup_tc_conduit callback in order to allow tc to offload
> > > > Qdisc policies for the specified DSA user port configuring the hw switch
> > > > cpu port (mac chip).
> > > 
> > > Can you please make a detailed diagram explaining how is the conduit
> > > involved in the packet data path for QoS? Offloaded tc on a DSA user
> > > port is supposed to affect autonomously forwarded traffic too (like the
> > > Linux bridge).
> > 
> > I guess a typical use case would be the one below where the traffic from the
> > WAN port is forwarded to a DSA LAN one (e.g. lan0) via netfilter flowtable
> > offload.
> > 
> >             ┌─────────────────────────────────┐             
> >             │               BR0               │             
> >             └───┬────────┬────────┬────────┬──┘             
> > ┌───────────────┼────────┼────────┼────────┼───┐            
> > │DSA            │        │        │        │   │            
> > │               │        │        │        │   │            
> > │ ┌───┐      ┌──▼─┐   ┌──▼─┐   ┌──▼─┐   ┌──▼─┐ │       ┌───┐
> > │ │CPU│      │LAN0│   │LAN1│   │LAN2│   │LAN3│ │       │WAN│
> > │ └───┘      └────┘   └────┘   └────┘   └────┘ │       └───┘
> > └──────────────────────────────────────────────┘            
> > 
> > In this case the mac chip forwards (in hw) the WAN traffic to the DSA switch
> > via the CPU port. In [0] we have the EN7581 mac chip architecture where we
> > can assume GDM1 is the CPU port and GDM2 is the WAN port.
> > The goal of this RFC series is to offload a Qdisc rule (e.g. ETS) on a given
> > LAN port using the mac chip QoS capabilities instead of creating the QoS
> > discipline directly in the DSA hw switch:
> > 
> > $tc qdisc replace dev lan0 root handle 1: ets bands 8 strict 2 quanta 1514 1514 1514 3528 1514 1514
> > 
> > As described above the reason for this approach would be to rely on the more
> > fine grained QoS capabilities available on the mac chip with respect to the
> > hw switch or because the DSA switch does not support QoS offloading.
> > 
> > Regards,
> > Lorenzo
> > 
> > [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23020f04932701d5c8363e60756f12b43b8ed752
> 
> Explain "the mac chip forwards (in hw) the WAN traffic to the DSA switch
> via the CPU port". How many packets does airoha_dev_select_queue() see?
> All of them, or only the first of a flow? What operations does the
> offload consist of?

I am referring to the netfilter flowtable offload, where the kernel receives
just the 3-way handshake of a TCP connection and then the traffic is fully
offloaded (the hw receives a flower rule to route the traffic between the
interfaces, applying NAT mangling if requested). Re-thinking about it,
I guess it is better to post the flowtable support first and then continue
the discussion about QoS offloading. What do you think?

Regards,
Lorenzo
Vladimir Oltean Dec. 12, 2024, 6:46 p.m. UTC | #5
On Thu, Dec 12, 2024 at 06:03:08PM +0100, Lorenzo Bianconi wrote:
> > Explain "the mac chip forwards (in hw) the WAN traffic to the DSA switch
> > via the CPU port". How many packets does airoha_dev_select_queue() see?
> > All of them, or only the first of a flow? What operations does the
> > offload consist of?
> 
> I am referring to the netfilter flowtable offload where the kernel receives
> just the 3-way handshake of a TCP connection and then the traffic is fully
> offloaded (the hw receives a flower rule to route the traffic between
> interfaces applying NAT mangling if requested).

And how do the follow-up packets know to go to the same conduit queue as
the initial packets of the flow?

As mentioned, my trouble with your current proposal is that I don't
think it reacts adequately to the user space request. Given your command,
packets forwarded from lan1 to lan0 should also go through lan0's ETS
scheduler, but my understanding is that they won't, because they bypass
the conduit. I don't encourage adding new net_device_ops infrastructure
to implement unexpected behavior.

I'm trying to look at the big picture and abstract away the flowtable a
bit. I don't think the tc rule should be on the user port. Can the
redirection of packets destined towards a particular switch port be
accomplished with a tc u32 filter on the conduit interface instead?
If the tc primitives for either the filter or the action don't exist,
maybe those could be added instead? Like DSA keys in "flower" which gain
introspection into the encapsulated packet headers?

> Re-thinking about it, I guess it is better to post flowtable support
> first and then continue the discussion about QoS offloading, what do
> you think?

I don't know about Andrew, but I'm really not familiar with the
netfilter flowtable (and there's another series from Eric Woudstra
waiting for me to know everything about it).

Though, I don't think this can continue for long, we need to find a
common starting place for discussions, since the development for chips
with flowtable offload is starting to put pressure on DSA. What to read
as a starting point for a basic understanding?
Lorenzo Bianconi Dec. 16, 2024, 12:09 p.m. UTC | #6
> On Thu, Dec 12, 2024 at 06:03:08PM +0100, Lorenzo Bianconi wrote:
> > > Explain "the mac chip forwards (in hw) the WAN traffic to the DSA switch
> > > via the CPU port". How many packets does airoha_dev_select_queue() see?
> > > All of them, or only the first of a flow? What operations does the
> > > offload consist of?
> > 
> > I am referring to the netfilter flowtable offload where the kernel receives
> > just the 3-way handshake of a TCP connection and then the traffic is fully
> > offloaded (the hw receives a flower rule to route the traffic between
> > interfaces applying NAT mangling if requested).

Hi Vladimir,

Sorry for the late reply.

> 
> And how do the follow-up packets know to go to the same conduit queue as
> the initial packets of the flow?
> 
> As mentioned, my trouble with your current proposal is that I don't
> think it reacts adequately to the user space request. Given your command,
> packets forwarded from lan1 to lan0 should also go through lan0's ETS
> scheduler, but my understanding is that they won't, because they bypass
> the conduit. I don't encourage adding new net_device_ops infrastructure
> to implement unexpected behavior.

I guess what I did not make clear here is that we are discussing 'routed'
traffic (sorry for that). The traffic is received from the WAN interface and
routed to a DSA port (or the other way around).
In this scenario the 3-way handshake will be received by the CPU via the
WAN port (or the conduit port), while the subsequent packets will be hw
forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic
will be received by the system from GDM2 (WAN) and the PSE/PPE blocks
will forward it to the GDM1 port, which is connected to the DSA cpu port.

The proposed series is about adding the control path to apply a given Qdisc
(ETS or TBF for EN7581) to the traffic that is following the described path
without creating it directly on the DSA switch port (for the reasons described
before). E.g. the user would want to apply an ETS Qdisc just for traffic
egressing via lan0.

This series is not strictly related to the airoha_eth flowtable offload
implementation, but the latter is required to have a full picture of the
possible use case (this is why I was saying it is better to post it first).

> 
> I'm trying to look at the big picture and abstract away the flowtable a
> bit. I don't think the tc rule should be on the user port. Can the
> redirection of packets destined towards a particular switch port be
> accomplished with a tc u32 filter on the conduit interface instead?
> If the tc primitives for either the filter or the action don't exist,
> maybe those could be added instead? Like DSA keys in "flower" which gain
> introspection into the encapsulated packet headers?

The issue with the current DSA infrastructure is that there is no way to use
the conduit port to offload a Qdisc policy for a given lan port, since the
APIs are missing the information about which user port we are interested in
(this is why I added the new netdev callback).
Please consider that here we are discussing Qdisc policies and not flower
rules to mangle the traffic. The hw needs to be configured in advance to
apply the requested policy (e.g. TBF for traffic shaping).
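
Just to give an idea of what I mean (this is only a sketch, the
airoha_qdma_set_chan_rate() helper below is made up for illustration): when
the user installs a TBF Qdisc, the conduit driver receives a standard
struct tc_tbf_qopt_offload and programs the per-channel shaper in advance:

  static int airoha_tbf_offload(struct net_device *dev, int channel,
                                struct tc_tbf_qopt_offload *opt)
  {
          switch (opt->command) {
          case TC_TBF_REPLACE:
                  /* program the per-channel shaper before any packet egresses */
                  return airoha_qdma_set_chan_rate(dev, channel,
                                                   opt->replace_params.rate.rate_bytes_ps,
                                                   opt->replace_params.max_size);
          case TC_TBF_DESTROY:
                  /* disable the shaper for this channel */
                  return airoha_qdma_set_chan_rate(dev, channel, 0, 0);
          default:
                  return -EOPNOTSUPP;
          }
  }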

> 
> > Re-thinking about it, I guess it is better to post flowtable support
> > first and then continue the discussion about QoS offloading, what do
> > you think?
> 
> I don't know about Andrew, but I'm really not familiar with the
> netfilter flowtable (and there's another series from Eric Woudstra
> waiting for me to know everything about it).
> 
> Though, I don't think this can continue for long, we need to find a
> common starting place for discussions, since the development for chips
> with flowtable offload is starting to put pressure on DSA. What to read
> as a starting point for a basic understanding?

I do not think there is much documentation about it (except the source code).
I guess you can take a look at [1],[2].

Regards,
Lorenzo

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23020f04932701d5c8363e60756f12b43b8ed752
[1] https://docs.kernel.org/networking/nf_flowtable.html
[2] https://thermalcircle.de/doku.php?id=blog:linux:flowtables_1_a_netfilter_nftables_fastpath
Vladimir Oltean Dec. 16, 2024, 3:49 p.m. UTC | #7
On Mon, Dec 16, 2024 at 01:09:01PM +0100, Lorenzo Bianconi wrote:
> I guess what I did not make clear here is that we are discussing about
> 'routed' traffic (sorry for that). The traffic is received from the WAN
> interface and routed to a DSA port (or the other way around).
> In this scenario the 3-way handshake will be received by the CPU via the
> WAN port (or the conduit port) while the subsequent packets will be hw
> forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic
> will be received by the system from GDM2 (WAN) and the PSE/PPE blocks
> will forward it to the GDM1 port that is connected to the DSA cpu port.
> 
> The proposed series is about adding the control path to apply a given Qdisc
> (ETS or TBF for EN7581) to the traffic that is following the described path
> without creating it directly on the DSA switch port (for the reasons described
> before). E.g. the user would want to apply an ETS Qdisc just for traffic
> egressing via lan0.
> 
> This series is not strictly related to the airoha_eth flowtable offload
> implementation but the latter is required to have a full pictures of the
> possible use case (this is why I was saying it is better to post it first).

It's good to know this does not depend on flowtable.

When you add an offloaded Qdisc to the egress of a net device, you don't
affect just the traffic L3 routed to that device, but all traffic (also
includes the packets sent to it using L2 forwarding). As such, I simply
don't believe that the way in which the UAPI is interpreted here (root
egress qdisc matches only routed traffic) is proper.

Ack?

> > I'm trying to look at the big picture and abstract away the flowtable a
> > bit. I don't think the tc rule should be on the user port. Can the
> > redirection of packets destined towards a particular switch port be
> > accomplished with a tc u32 filter on the conduit interface instead?
> > If the tc primitives for either the filter or the action don't exist,
> > maybe those could be added instead? Like DSA keys in "flower" which gain
> > introspection into the encapsulated packet headers?
> 
> The issue with the current DSA infrastructure is there is no way to use
> the conduit port to offload a Qdisc policy to a given lan port since we
> are missing in the APIs the information about what user port we are
> interested in (this is why I added the new netdev callback).

How does the introduction of ndo_setup_tc_conduit() help, since the problem
is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly.
It is simply not comparable to the way in which it is offloaded by
drivers/net/dsa/microchip/ksz_common.c, even though the user space
syntax is the same. Unless you're suggesting that for ksz it is not
offloaded correctly?

Oleksij, am I missing something?

> Please consider here we are discussing about Qdisc policies and not flower
> rules to mangle the traffic.

What's a Qdisc policy?

Also, flower is a classifier, not an action. It doesn't mangle packets
by the very definition of what a classifier is.

> The hw needs to be configured in advance to apply the requested policy
> (e.g TBF for traffic shaping).

What are you missing exactly to make DSA packets go to a particular
channel on the conduit?

For Qdisc offloading you want to configure the NIC in advance, of course.

Can't you do something like this to guide packets to the correct channels?

tc qdisc add dev eth0 clsact
tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ...
tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \
	flowid 1:1
Oleksij Rempel Dec. 16, 2024, 6:14 p.m. UTC | #8
On Mon, Dec 16, 2024 at 05:49:47PM +0200, Vladimir Oltean wrote:
> On Mon, Dec 16, 2024 at 01:09:01PM +0100, Lorenzo Bianconi wrote:
> > I guess what I did not make clear here is that we are discussing about
> > 'routed' traffic (sorry for that). The traffic is received from the WAN
> > interface and routed to a DSA port (or the other way around).
> > In this scenario the 3-way handshake will be received by the CPU via the
> > WAN port (or the conduit port) while the subsequent packets will be hw
> > forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic
> > will be received by the system from GDM2 (WAN) and the PSE/PPE blocks
> > will forward it to the GDM1 port that is connected to the DSA cpu port.
> > 
> > The proposed series is about adding the control path to apply a given Qdisc
> > (ETS or TBF for EN7581) to the traffic that is following the described path
> > without creating it directly on the DSA switch port (for the reasons described
> > before). E.g. the user would want to apply an ETS Qdisc just for traffic
> > egressing via lan0.
> > 
> > This series is not strictly related to the airoha_eth flowtable offload
> > implementation but the latter is required to have a full pictures of the
> > possible use case (this is why I was saying it is better to post it first).
> 
> It's good to know this does not depend on flowtable.
> 
> When you add an offloaded Qdisc to the egress of a net device, you don't
> affect just the traffic L3 routed to that device, but all traffic (also
> includes the packets sent to it using L2 forwarding). As such, I simply
> don't believe that the way in which the UAPI is interpreted here (root
> egress qdisc matches only routed traffic) is proper.
> 
> Ack?
> 
> > > I'm trying to look at the big picture and abstract away the flowtable a
> > > bit. I don't think the tc rule should be on the user port. Can the
> > > redirection of packets destined towards a particular switch port be
> > > accomplished with a tc u32 filter on the conduit interface instead?
> > > If the tc primitives for either the filter or the action don't exist,
> > > maybe those could be added instead? Like DSA keys in "flower" which gain
> > > introspection into the encapsulated packet headers?
> > 
> > The issue with the current DSA infrastructure is there is no way to use
> > the conduit port to offload a Qdisc policy to a given lan port since we
> > are missing in the APIs the information about what user port we are
> > interested in (this is why I added the new netdev callback).
> 
> How does the introduction of ndo_setup_tc_conduit() help, since the problem
> is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly.
> It is simply not comparable to the way in which it is offloaded by
> drivers/net/dsa/microchip/ksz_common.c, even though the user space
> syntax is the same. Unless you're suggesting that for ksz it is not
> offloaded correctly?
> 
> Oleksij, am I missing something?

Ahm.. let me try to understand the problem. First I would like to interpret
the diagram provided by Lorenzo; since I have no idea about the EN7581, I
will use the KSZ* model and transfer it to this one:

                               CPU
                                |
              +-----------------|---------------------+
    EN7581    |      GDM1  <-- fabric? ---->    GDM2  | like NXP Layerscape?
              +-------|--------------------------|----+
                      |                          |
              +-------|---------+               WAN
    MT7530    |      CPU_port   |
              |     fabric      |
              | lan1 lan2 ....  |
	      +--|----|----|||--+

In case of lanN to lanN traffic, the switch-internal QoS should be used.
This is where the "tc qdisc add lanN root ets" rules apply. At least, this
is how it is implemented for KSZ switches.

In case of traffic flowing from WAN to lanN, we deal with two separate sets
of queues and schedulers: the GDM1 egress queues and scheduler (and probably
ingress queues on the CPU_port), and the lanN queues and scheduler. Combining
both parts into one rule is not good, especially if the rules of each lanN
port may be different.

> > Please consider here we are discussing about Qdisc policies and not flower
> > rules to mangle the traffic.
> 
> What's a Qdisc policy?
> 
> Also, flower is a classifier, not an action. It doesn't mangle packets
> by the very definition of what a classifier is.
> 
> > The hw needs to be configured in advance to apply the requested policy
> > (e.g TBF for traffic shaping).
> 
> What are you missing exactly to make DSA packets go to a particular
> channel on the conduit?
> 
> For Qdisc offloading you want to configure the NIC in advance, of course.
> 
> Can't you do something like this to guide packets to the correct channels?
> 
> tc qdisc add dev eth0 clsact
> tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ...
> tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \
> 	flowid 1:1

ACK, this would be my expectation as well.
Lorenzo Bianconi Dec. 16, 2024, 7:01 p.m. UTC | #9
> On Mon, Dec 16, 2024 at 01:09:01PM +0100, Lorenzo Bianconi wrote:
> > I guess what I did not make clear here is that we are discussing about
> > 'routed' traffic (sorry for that). The traffic is received from the WAN
> > interface and routed to a DSA port (or the other way around).
> > In this scenario the 3-way handshake will be received by the CPU via the
> > WAN port (or the conduit port) while the subsequent packets will be hw
> > forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic
> > will be received by the system from GDM2 (WAN) and the PSE/PPE blocks
> > will forward it to the GDM1 port that is connected to the DSA cpu port.
> > 
> > The proposed series is about adding the control path to apply a given Qdisc
> > (ETS or TBF for EN7581) to the traffic that is following the described path
> > without creating it directly on the DSA switch port (for the reasons described
> > before). E.g. the user would want to apply an ETS Qdisc just for traffic
> > egressing via lan0.
> > 
> > This series is not strictly related to the airoha_eth flowtable offload
> > implementation but the latter is required to have a full pictures of the
> > possible use case (this is why I was saying it is better to post it first).
> 
> It's good to know this does not depend on flowtable.
> 
> When you add an offloaded Qdisc to the egress of a net device, you don't
> affect just the traffic L3 routed to that device, but all traffic (also
> includes the packets sent to it using L2 forwarding). As such, I simply
> don't believe that the way in which the UAPI is interpreted here (root
> egress qdisc matches only routed traffic) is proper.
> 
> Ack?

Considering patch [0], we are still offloading the Qdisc on the provided
DSA switch port (e.g. LANx) via the port_setup_tc() callback available in
dsa_user_setup_qdisc(), but we are also introducing the ndo_setup_tc_conduit()
callback in order to use the hw Qdisc capabilities available on the mac chip
(e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply
the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree?

> 
> > > I'm trying to look at the big picture and abstract away the flowtable a
> > > bit. I don't think the tc rule should be on the user port. Can the
> > > redirection of packets destined towards a particular switch port be
> > > accomplished with a tc u32 filter on the conduit interface instead?
> > > If the tc primitives for either the filter or the action don't exist,
> > > maybe those could be added instead? Like DSA keys in "flower" which gain
> > > introspection into the encapsulated packet headers?
> > 
> > The issue with the current DSA infrastructure is there is no way to use
> > the conduit port to offload a Qdisc policy to a given lan port since we
> > are missing in the APIs the information about what user port we are
> > interested in (this is why I added the new netdev callback).
> 
> How does the introduction of ndo_setup_tc_conduit() help, since the problem
> is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly.
> It is simply not comparable to the way in which it is offloaded by
> drivers/net/dsa/microchip/ksz_common.c, even though the user space
> syntax is the same. Unless you're suggesting that for ksz it is not
> offloaded correctly?

nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just
does not allow exploiting all the hw capabilities available on the EN7581 when
the traffic is routed from the WAN port to a given DSA switch port. If we do:

$tc qdisc add dev lan0 root handle 1: ets strict 8 priomap ...

in the current upstream implementation we do:
  dsa_user_setup_tc():
     ...
       -> ds->ops->port_setup_tc(ds, dp->index, type, type_data)
          (it applies the Qdisc on lan0 configuring the hw switch)

adding the ndo_setup_tc_conduit() callback we have:

  dsa_user_setup_qdisc()
    ...
      -> conduit->netdev_ops->ndo_setup_tc_conduit(conduit, dp->index, type, type_data)
         (it applies the Qdisc on the EN7581 mac chip for the routed traffic destined to lan0)

      -> ds->ops->port_setup_tc(ds, dp->index, type, type_data)
         (it applies the Qdisc on lan0 configuring the hw switch)
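
Expressed as C (purely illustrative; dsa_user_to_port()/dsa_port_to_conduit()
are the usual DSA helpers and the error handling is intentionally simplified),
the ordering I have in mind is:

  static int dsa_user_setup_qdisc(struct net_device *dev,
                                  enum tc_setup_type type, void *type_data)
  {
          struct dsa_port *dp = dsa_user_to_port(dev);
          struct net_device *conduit = dsa_port_to_conduit(dp);
          struct dsa_switch *ds = dp->ds;

          /* 1) let the conduit (mac chip) program its per-channel
           *    scheduler/shaper for this user port, if it supports it
           */
          if (conduit->netdev_ops->ndo_setup_tc_conduit)
                  conduit->netdev_ops->ndo_setup_tc_conduit(conduit, dp->index,
                                                            type, type_data);

          /* 2) offload the same Qdisc on the switch port itself */
          if (ds->ops->port_setup_tc)
                  return ds->ops->port_setup_tc(ds, dp->index, type, type_data);

          return 0;
  }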

> 
> Oleksij, am I missing something?
> 
> > Please consider here we are discussing about Qdisc policies and not flower
> > rules to mangle the traffic.
> 
> What's a Qdisc policy?

I mean a queue scheduler algorithm (e.g. TBF, ETS, HTB, ...)

> 
> Also, flower is a classifier, not an action. It doesn't mangle packets
> by the very definition of what a classifier is.

yes, but the goal of the series is the queue scheduler offloading, not
classifiers/actions. Agree?

> 
> > The hw needs to be configured in advance to apply the requested policy
> > (e.g TBF for traffic shaping).
> 
> What are you missing exactly to make DSA packets go to a particular
> channel on the conduit?
> 
> For Qdisc offloading you want to configure the NIC in advance, of course.
> 
> Can't you do something like this to guide packets to the correct channels?
> 
> tc qdisc add dev eth0 clsact
> tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ...
> tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \
> 	flowid 1:1

If we apply the Qdisc directly on the conduit port (eth0), we can only apply
the queue scheduler to all the traffic egressing via the DSA switch, while I
would like to apply it on a per-DSA-port basis (but using the mac chip hw
capabilities). Got my point?

Regards,
Lorenzo

[0] https://patchwork.kernel.org/project/netdevbpf/patch/8e57ae3c4b064403ca843ffa45a5eb4e4198cf80.1733930558.git.lorenzo@kernel.org/
Oleksij Rempel Dec. 16, 2024, 7:23 p.m. UTC | #10
On Mon, Dec 16, 2024 at 08:01:33PM +0100, Lorenzo Bianconi wrote:
> > On Mon, Dec 16, 2024 at 01:09:01PM +0100, Lorenzo Bianconi wrote:
> > > I guess what I did not make clear here is that we are discussing about
> > > 'routed' traffic (sorry for that). The traffic is received from the WAN
> > > interface and routed to a DSA port (or the other way around).
> > > In this scenario the 3-way handshake will be received by the CPU via the
> > > WAN port (or the conduit port) while the subsequent packets will be hw
> > > forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic
> > > will be received by the system from GDM2 (WAN) and the PSE/PPE blocks
> > > will forward it to the GDM1 port that is connected to the DSA cpu port.
> > > 
> > > The proposed series is about adding the control path to apply a given Qdisc
> > > (ETS or TBF for EN7581) to the traffic that is following the described path
> > > without creating it directly on the DSA switch port (for the reasons described
> > > before). E.g. the user would want to apply an ETS Qdisc just for traffic
> > > egressing via lan0.
> > > 
> > > This series is not strictly related to the airoha_eth flowtable offload
> > > implementation but the latter is required to have a full pictures of the
> > > possible use case (this is why I was saying it is better to post it first).
> > 
> > It's good to know this does not depend on flowtable.
> > 
> > When you add an offloaded Qdisc to the egress of a net device, you don't
> > affect just the traffic L3 routed to that device, but all traffic (also
> > includes the packets sent to it using L2 forwarding). As such, I simply
> > don't believe that the way in which the UAPI is interpreted here (root
> > egress qdisc matches only routed traffic) is proper.
> > 
> > Ack?
> 
> Considering patch [0], we are still offloading the Qdisc on the provided
> DSA switch port (e.g. LANx) via the port_setup_tc() callback available in
> dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit()
> callback in order to use the hw Qdisc capabilities available on the mac chip
> (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply
> the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree?
> 
> > 
> > > > I'm trying to look at the big picture and abstract away the flowtable a
> > > > bit. I don't think the tc rule should be on the user port. Can the
> > > > redirection of packets destined towards a particular switch port be
> > > > accomplished with a tc u32 filter on the conduit interface instead?
> > > > If the tc primitives for either the filter or the action don't exist,
> > > > maybe those could be added instead? Like DSA keys in "flower" which gain
> > > > introspection into the encapsulated packet headers?
> > > 
> > > The issue with the current DSA infrastructure is there is no way to use
> > > the conduit port to offload a Qdisc policy to a given lan port since we
> > > are missing in the APIs the information about what user port we are
> > > interested in (this is why I added the new netdev callback).
> > 
> > How does the introduction of ndo_setup_tc_conduit() help, since the problem
> > is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly.
> > It is simply not comparable to the way in which it is offloaded by
> > drivers/net/dsa/microchip/ksz_common.c, even though the user space
> > syntax is the same. Unless you're suggesting that for ksz it is not
> > offloaded correctly?
> 
> nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just
> does not allow to exploit all hw capabilities available on EN7581 when the
> traffic is routed from the WAN port to a given DSA switch port. If we do:
> 
> $tc qdisc add dev lan0 root handle 1: ets strict 8 priomap ...
> 
> in the current upstream implementation we do:
>   dsa_user_setup_tc():
>      ...
>        -> ds->ops->port_setup_tc(ds, dp->index, type, type_data)
>           (it applies the Qdisc on lan0 configuring the hw switch)
> 
> adding the ndo_setup_tc_conduit() callback we have:
> 
>   dsa_user_setup_qdisc()
>     ...
>       -> conduit->netdev_ops->ndo_setup_tc_conduit(conduit, dp->index, type, type_data)
>          (it applies the Qdisc on EN7581 mac chip for the routed traffic destinated to lan0)
> 
>       -> ds->ops->port_setup_tc(ds, dp->index, type, type_data)
>          (it applies the Qdisc on lan0 configuring the hw switch)
> 
> > 
> > Oleksij, am I missing something?
> > 
> > > Please consider here we are discussing about Qdisc policies and not flower
> > > rules to mangle the traffic.
> > 
> > What's a Qdisc policy?
> 
> I mean a queue scheduler algorithm (e.g. TBF, ETS, HTB, ...)
> 
> > 
> > Also, flower is a classifier, not an action. It doesn't mangle packets
> > by the very definition of what a classifier is.
> 
> yes, but goal of the series is the Queue scheduler offloading, not
> classifier/action. Agree?
> 
> > 
> > > The hw needs to be configured in advance to apply the requested policy
> > > (e.g TBF for traffic shaping).
> > 
> > What are you missing exactly to make DSA packets go to a particular
> > channel on the conduit?
> > 
> > For Qdisc offloading you want to configure the NIC in advance, of course.
> > 
> > Can't you do something like this to guide packets to the correct channels?
> > 
> > tc qdisc add dev eth0 clsact
> > tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ...
> > tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \
> > 	flowid 1:1
> 
> If we apply the Qdisc directly on the conduit port (eth0) we can just apply the
> queue scheduler on all the traffic egressing via the DSA switch while I would
> like to apply it on per DSA port basis (but using the mac chip hw capabilities),
> got my point?

Hm, I guess I have a similar use case in one of my projects. In my case, the
CPU interface is 1Gbit and the switch ports are 100Mbit each. It is still
possible to keep the CPU interface busy by sending a 1Gbit UDP stream, so
900Mbit is dropped by the switch. I would like to have a traffic limiter per
virtual DSA port on the SoC side to reduce the load on the DSA conduit.
Currently this is not possible.
Vladimir Oltean Dec. 16, 2024, 7:46 p.m. UTC | #11
On Mon, Dec 16, 2024 at 08:01:33PM +0100, Lorenzo Bianconi wrote:
> Considering patch [0], we are still offloading the Qdisc on the provided
> DSA switch port (e.g. LANx) via the port_setup_tc() callback available in
> dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit()
> callback in order to use the hw Qdisc capabilities available on the mac chip
> (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply
> the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree?

Not quite, no.

ndo_setup_tc_conduit() does not have the same instruments to offload
what port_setup_tc() can offload. It is not involved in all data paths
that port_setup_tc() has to handle. Please ack this. So if port_setup_tc()
returns -EOPNOTSUPP, the entire dsa_user_setup_qdisc() should return
-EOPNOTSUPP, UNLESS you install packet traps on all other offloaded data
paths in the switch, such that all packets that egress the DSA user port
are handled by ndo_setup_tc_conduit()'s instruments.

> > How does the introduction of ndo_setup_tc_conduit() help, since the problem
> > is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly.
> > It is simply not comparable to the way in which it is offloaded by
> > drivers/net/dsa/microchip/ksz_common.c, even though the user space
> > syntax is the same. Unless you're suggesting that for ksz it is not
> > offloaded correctly?
> 
> nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just
> does not allow to exploit all hw capabilities available on EN7581 when the
> traffic is routed from the WAN port to a given DSA switch port.

And I don't believe it should, in this way.

> If we do:
> 
> $tc qdisc add dev lan0 root handle 1: ets strict 8 priomap ...
> 
> in the current upstream implementation we do:
>   dsa_user_setup_tc():
>      ...
>        -> ds->ops->port_setup_tc(ds, dp->index, type, type_data)
>           (it applies the Qdisc on lan0 configuring the hw switch)
> 
> adding the ndo_setup_tc_conduit() callback we have:
> 
>   dsa_user_setup_qdisc()
>     ...
>       -> conduit->netdev_ops->ndo_setup_tc_conduit(conduit, dp->index, type, type_data)
>          (it applies the Qdisc on EN7581 mac chip for the routed traffic destinated to lan0)
> 
>       -> ds->ops->port_setup_tc(ds, dp->index, type, type_data)
>          (it applies the Qdisc on lan0 configuring the hw switch)
> 
> > Also, flower is a classifier, not an action. It doesn't mangle packets
> > by the very definition of what a classifier is.
> 
> yes, but goal of the series is the Queue scheduler offloading, not
> classifier/action. Agree?

Classifiers + flowid are an instrument to direct packets to classes of a
classful egress Qdisc. They seem perfectly relevant to the discussion,
given the information I currently have.

> > Can't you do something like this to guide packets to the correct channels?
> > 
> > tc qdisc add dev eth0 clsact
> > tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ...
> > tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \
> > 	flowid 1:1
> 
> If we apply the Qdisc directly on the conduit port (eth0) we can just apply the
> queue scheduler on all the traffic egressing via the DSA switch while I would
> like to apply it on per DSA port basis (but using the mac chip hw capabilities),
> got my point?

We need something as the root Qdisc of the conduit which exposes its
hardware capabilities. I just assumed that would be a simple (and single)
ETS, you can correct me if I am wrong.

On conduit egress, what is the arbitration scheme between the traffic
destined towards each DSA user port (channel, as the driver calls them)?
How can this be best represented?

IIUC, in your patch set, you expose the conduit hardware QoS capabilities
as if they can be perfectly virtualized among DSA user ports, and as if
each DSA user port can have its own ETS root Qdisc, completely independent
of each other, as if the packets do not serialize on the conduit <-> CPU
port link, and as if that is not a bottleneck. Is that really the case?
If so (but please explain how), maybe you really need your own root Qdisc
driver, with one class per DSA user port, and those classes have ETS
attached to them.
Lorenzo Bianconi Dec. 16, 2024, 9:44 p.m. UTC | #12
> On Mon, Dec 16, 2024 at 08:01:33PM +0100, Lorenzo Bianconi wrote:
> > > On Mon, Dec 16, 2024 at 01:09:01PM +0100, Lorenzo Bianconi wrote:
> > > > I guess what I did not make clear here is that we are discussing about
> > > > 'routed' traffic (sorry for that). The traffic is received from the WAN
> > > > interface and routed to a DSA port (or the other way around).
> > > > In this scenario the 3-way handshake will be received by the CPU via the
> > > > WAN port (or the conduit port) while the subsequent packets will be hw
> > > > forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic
> > > > will be received by the system from GDM2 (WAN) and the PSE/PPE blocks
> > > > will forward it to the GDM1 port that is connected to the DSA cpu port.
> > > > 
> > > > The proposed series is about adding the control path to apply a given Qdisc
> > > > (ETS or TBF for EN7581) to the traffic that is following the described path
> > > > without creating it directly on the DSA switch port (for the reasons described
> > > > before). E.g. the user would want to apply an ETS Qdisc just for traffic
> > > > egressing via lan0.
> > > > 
> > > > This series is not strictly related to the airoha_eth flowtable offload
> > > > implementation but the latter is required to have a full pictures of the
> > > > possible use case (this is why I was saying it is better to post it first).
> > > 
> > > It's good to know this does not depend on flowtable.
> > > 
> > > When you add an offloaded Qdisc to the egress of a net device, you don't
> > > affect just the traffic L3 routed to that device, but all traffic (also
> > > includes the packets sent to it using L2 forwarding). As such, I simply
> > > don't believe that the way in which the UAPI is interpreted here (root
> > > egress qdisc matches only routed traffic) is proper.
> > > 
> > > Ack?
> > 
> > Considering patch [0], we are still offloading the Qdisc on the provided
> > DSA switch port (e.g. LANx) via the port_setup_tc() callback available in
> > dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit()
> > callback in order to use the hw Qdisc capabilities available on the mac chip
> > (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply
> > the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree?
> > 
> > > 
> > > > > I'm trying to look at the big picture and abstract away the flowtable a
> > > > > bit. I don't think the tc rule should be on the user port. Can the
> > > > > redirection of packets destined towards a particular switch port be
> > > > > accomplished with a tc u32 filter on the conduit interface instead?
> > > > > If the tc primitives for either the filter or the action don't exist,
> > > > > maybe those could be added instead? Like DSA keys in "flower" which gain
> > > > > introspection into the encapsulated packet headers?
> > > > 
> > > > The issue with the current DSA infrastructure is there is no way to use
> > > > the conduit port to offload a Qdisc policy to a given lan port since we
> > > > are missing in the APIs the information about what user port we are
> > > > interested in (this is why I added the new netdev callback).
> > > 
> > > How does the introduction of ndo_setup_tc_conduit() help, since the problem
> > > is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly.
> > > It is simply not comparable to the way in which it is offloaded by
> > > drivers/net/dsa/microchip/ksz_common.c, even though the user space
> > > syntax is the same. Unless you're suggesting that for ksz it is not
> > > offloaded correctly?
> > 
> > nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just
> > does not allow to exploit all hw capabilities available on EN7581 when the
> > traffic is routed from the WAN port to a given DSA switch port. If we do:
> > 
> > $tc qdisc add dev lan0 root handle 1: ets strict 8 priomap ...
> > 
> > in the current upstream implementation we do:
> >   dsa_user_setup_tc():
> >      ...
> >        -> ds->ops->port_setup_tc(ds, dp->index, type, type_data)
> >           (it applies the Qdisc on lan0 configuring the hw switch)
> > 
> > adding the ndo_setup_tc_conduit() callback we have:
> > 
> >   dsa_user_setup_qdisc()
> >     ...
> >       -> conduit->netdev_ops->ndo_setup_tc_conduit(conduit, dp->index, type, type_data)
> >          (it applies the Qdisc on EN7581 mac chip for the routed traffic destinated to lan0)
> > 
> >       -> ds->ops->port_setup_tc(ds, dp->index, type, type_data)
> >          (it applies the Qdisc on lan0 configuring the hw switch)
> > 
> > > 
> > > Oleksij, am I missing something?
> > > 
> > > > Please consider here we are discussing about Qdisc policies and not flower
> > > > rules to mangle the traffic.
> > > 
> > > What's a Qdisc policy?
> > 
> > I mean a queue scheduler algorithm (e.g. TBF, ETS, HTB, ...)
> > 
> > > 
> > > Also, flower is a classifier, not an action. It doesn't mangle packets
> > > by the very definition of what a classifier is.
> > 
> > yes, but goal of the series is the Queue scheduler offloading, not
> > classifier/action. Agree?
> > 
> > > 
> > > > The hw needs to be configured in advance to apply the requested policy
> > > > (e.g TBF for traffic shaping).
> > > 
> > > What are you missing exactly to make DSA packets go to a particular
> > > channel on the conduit?
> > > 
> > > For Qdisc offloading you want to configure the NIC in advance, of course.
> > > 
> > > Can't you do something like this to guide packets to the correct channels?
> > > 
> > > tc qdisc add dev eth0 clsact
> > > tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ...
> > > tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \
> > > 	flowid 1:1
> > 
> > If we apply the Qdisc directly on the conduit port (eth0) we can just apply the
> > queue scheduler on all the traffic egressing via the DSA switch while I would
> > like to apply it on per DSA port basis (but using the mac chip hw capabilities),
> > got my point?
> 
> Hm, I guess I have similar use case in one of my projects. In my case, the CPU
> interface is 1Gbit the switch ports are 100Mbit each. It is still
> possible to keep the CPU interface busy by sending 1Gbit UDP stream, so
> 900Mbit is dropped by the switch. I would like to have traffic limiter
> per virtual DSA port on the SoC site to reduce the load on DSA conduit.
> Currently it was not possible.

Does the mac chip in your setup support TX shaping (e.g. via HTB or TBF)?
If so, I guess you could do it via the ndo_setup_tc_conduit() callback.

Regards,
Lorenzo

> 
> -- 
> Pengutronix e.K.                           |                             |
> Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
> 31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
Lorenzo Bianconi Dec. 16, 2024, 10:28 p.m. UTC | #13
> On Mon, Dec 16, 2024 at 08:01:33PM +0100, Lorenzo Bianconi wrote:
> > Considering patch [0], we are still offloading the Qdisc on the provided
> > DSA switch port (e.g. LANx) via the port_setup_tc() callback available in
> > dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit()
> > callback in order to use the hw Qdisc capabilities available on the mac chip
> > (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply
> > the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree?
> 
> Not quite, no.
> 
> ndo_setup_tc_conduit() does not have the same instruments to offload
> what port_setup_tc() can offload. It is not involved in all data paths
> that port_setup_tc() has to handle. Please ack this. So if port_setup_tc()

Can you please elaborate on this? Both ->ndo_setup_tc_conduit() and
->port_setup_tc() refer to the same DSA user port (please take a look at
the callback signatures).

> returns -EOPNOTSUPP, the entire dsa_user_setup_qdisc() should return
> -EOPNOTSUPP, UNLESS you install packet traps on all other offloaded data
> paths in the switch, such that all packets that egress the DSA user port
> are handled by ndo_setup_tc_conduit()'s instruments.

Uhm, do you mean we are changing the user-expected result in this way?
It seems to me the only case we are actually changing is when the
port_setup_tc() callback is not supported by the DSA switch driver while the
ndo_setup_tc_conduit() one is supported by the mac chip. In this case the
previous implementation returns -EOPNOTSUPP while the proposed one does not
report any error. Do we really care about this case? If so, I guess we can
rework dsa_user_setup_qdisc().

> 
> > > How does the introduction of ndo_setup_tc_conduit() help, since the problem
> > > is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly.
> > > It is simply not comparable to the way in which it is offloaded by
> > > drivers/net/dsa/microchip/ksz_common.c, even though the user space
> > > syntax is the same. Unless you're suggesting that for ksz it is not
> > > offloaded correctly?
> > 
> > nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just
> > does not allow to exploit all hw capabilities available on EN7581 when the
> > traffic is routed from the WAN port to a given DSA switch port.
> 
> And I don't believe it should, in this way.

Can you please elaborate on this? IIUC it seems even Oleksij has a use-case
for this.

> 
> > If we do:
> > 
> > $tc qdisc add dev lan0 root handle 1: ets strict 8 priomap ...
> > 
> > in the current upstream implementation we do:
> >   dsa_user_setup_tc():
> >      ...
> >        -> ds->ops->port_setup_tc(ds, dp->index, type, type_data)
> >           (it applies the Qdisc on lan0 configuring the hw switch)
> > 
> > adding the ndo_setup_tc_conduit() callback we have:
> > 
> >   dsa_user_setup_qdisc()
> >     ...
> >       -> conduit->netdev_ops->ndo_setup_tc_conduit(conduit, dp->index, type, type_data)
> >          (it applies the Qdisc on EN7581 mac chip for the routed traffic destinated to lan0)
> > 
> >       -> ds->ops->port_setup_tc(ds, dp->index, type, type_data)
> >          (it applies the Qdisc on lan0 configuring the hw switch)
> > 
> > > Also, flower is a classifier, not an action. It doesn't mangle packets
> > > by the very definition of what a classifier is.
> > 
> > yes, but goal of the series is the Queue scheduler offloading, not
> > classifier/action. Agree?
> 
> Classifiers + flowid are an instrument to direct packets to classes of a
> classful egress Qdisc. They seem perfectly relevant to the discussion,
> given the information I currently have.

yep, sure. We will need a tc classifier to set the flow-id (I used flower
during development). What I mean is that the series takes care just of the
Qdisc offloading.

> 
> > > Can't you do something like this to guide packets to the correct channels?
> > > 
> > > tc qdisc add dev eth0 clsact
> > > tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ...
> > > tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \
> > > 	flowid 1:1
> > 
> > If we apply the Qdisc directly on the conduit port (eth0) we can just apply the
> > queue scheduler on all the traffic egressing via the DSA switch while I would
> > like to apply it on per DSA port basis (but using the mac chip hw capabilities),
> > got my point?
> 
> We need something as the root Qdisc of the conduit which exposes its
> hardware capabilities. I just assumed that would be a simple (and single)
> ETS, you can correct me if I am wrong.
> 
> On conduit egress, what is the arbitration scheme between the traffic
> destined towards each DSA user port (channel, as the driver calls them)?
> How can this be best represented?

The EN7581 supports up to 32 different 'channels' (each of them supports 8
different hw queues). You can define an ETS and/or TBF Qdisc for each channel.
My idea is to associate a channel with each DSA switch port, so the user can
define independent QoS policies for each DSA port (e.g. shape lan0 at 100Mbps,
apply ETS on lan1, ...) by configuring the mac chip instead of the hw switch.
The kernel (if the traffic is not offloaded) or the PPE block (if the traffic
is offloaded) updates the channel and queue information in the DMA descriptor
(please take a look at [0] for the first case).
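
Just to illustrate the mapping (the constant and helper names below are made
up, assuming 8 hw queues per channel as said above):

  #define AIROHA_NUM_QUEUES_PER_CHANNEL   8

  static u32 airoha_qdma_queue_id(int channel, struct sk_buff *skb)
  {
          u32 queue = skb->priority;

          if (queue >= AIROHA_NUM_QUEUES_PER_CHANNEL)
                  queue = 0;

          /* for offloaded traffic the PPE block fills the same channel and
           * queue fields directly in the DMA descriptor
           */
          return channel * AIROHA_NUM_QUEUES_PER_CHANNEL + queue;
  }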

> 
> IIUC, in your patch set, you expose the conduit hardware QoS capabilities
> as if they can be perfectly virtualized among DSA user ports, and as if
> each DSA user port can have its own ETS root Qdisc, completely independent
> of each other, as if the packets do not serialize on the conduit <-> CPU
> port link, and as if that is not a bottleneck. Is that really the case?

correct

> If so (but please explain how), maybe you really need your own root Qdisc
> driver, with one class per DSA user port, and those classes have ETS
> attached to them.

Can you please clarify what you mean by 'root Qdisc driver'?

Regards,
Lorenzo

[0] https://patchwork.kernel.org/project/netdevbpf/patch/a7d8ec3d70d7a0e2208909189e46a63e769f8f9d.1733930558.git.lorenzo@kernel.org/
Vladimir Oltean Dec. 16, 2024, 11:13 p.m. UTC | #14
On Mon, Dec 16, 2024 at 11:28:18PM +0100, Lorenzo Bianconi wrote:
> > ndo_setup_tc_conduit() does not have the same instruments to offload
> > what port_setup_tc() can offload. It is not involved in all data paths
> > that port_setup_tc() has to handle. Please ack this. So if port_setup_tc()
> 
> Can you please elaborate on this? Both (->ndo_setup_tc_conduit() and
> ->port_setup_tc()) refer to the same DSA user port (please take a look the
> callback signature).

I'd be just repeating what I've said a few times before. Your proposed
ndo_setup_tc_conduit() appears to be configuring conduit resources
(QDMA, GDM) for mt7530 user port tc offload, as if it is in complete and
exclusive control of the user port data path. But as long as there are
packets in the user port data path that bypass those conduit QoS resources
(like, for example, when the mt7530 switch forwards a packet from one port to another,
bypassing the GDM1 in your drawing[1]), it isn't a good model. Forget
about ndo_setup_tc_conduit(), it isn't a good tc command to run in the
first place. The tc command you're trying to make do what you want is
supposed to _also_ include the mt7530 packets forwarded from one port to
another in its QoS mix. It applies at the _egress_ of the mt7530 port.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23020f04932701d5c8363e60756f12b43b8ed752

Let me try to add some squiggles based on your diagram, to clarify what
is my understanding and complaint.

               ┌───────┐                                   ┌───────┐
               │ QDMA2 │                                   │ QDMA1 │
               └───┬───┘                                   └───┬───┘
                   │                                           │
           ┌───────▼───────────────────────────────────────────▼────────┐
           │                                                            │
           │       P5                                          P0       │
           │                                                            │
           │                                                            │
           │                                                            │    ┌──────┐
           │                                                         P3 ├────► GDM3 │
           │                                                            │
┌─────┐    │                                                            │
│ PPE ◄────┤ P4                        PSE                              │
└─────┘    │                                                            │
           │                                                            │    ┌──────┐
           │                                                         P9 ├────► GDM4 │
           │                                                            │    └──────┘
           │                                                            │
           │        P2                                         P1       │
           └─────────┬─────────────────────────────────────────┬────────┘
                     │                                         │
                 ┌───▼──┐                                   ┌──▼───┐
                 │ GDM2 │                                   │ GDM1 │
                 └──────┘                                   └──┬───┘
                                                               │
                                                ┌──────────────▼───────────────┐
                                                │            CPU port          │
                                                │   ┌─────────┘                │
                                                │   │         MT7530           │
                                                │   │                          │
                                                │   ▼         x                │
                                                │   ┌─────┐ ┌─┘                │
                                                │  lan1  lan2  lan3  lan4      │
                                                └───│──────────────────────────┘
                                                    ▼

When you add an offloaded Qdisc to the egress of lan1, the expectation
is that packets from lan2 obey it too (offloaded tc goes hand in hand
with offloaded bridge). Whereas, by using GDM1/QDMA resources, you are
breaking that expectation, because packets from lan2 bridged by MT7530
don't go to GDM1 (the "x").

> > returns -EOPNOTSUPP, the entire dsa_user_setup_qdisc() should return
> > -EOPNOTSUPP, UNLESS you install packet traps on all other offloaded data
> > paths in the switch, such that all packets that egress the DSA user port
> > are handled by ndo_setup_tc_conduit()'s instruments.
> 
> Uhm, do you mean we are changing the user expected result in this way?
> It seems to me the only case we are actually changing is if port_setup_tc()
> callback is not supported by the DSA switch driver while ndo_setup_tc_conduit()
> one is supported by the mac chip. In this case the previous implementation
> returns -EOPNOTSUPP while the proposed one does not report any error.
> Do we really care about this case? If so, I guess we can rework
> dsa_user_setup_qdisc().

See above, there's nothing to rework.

> > > nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just
> > > does not allow to exploit all hw capabilities available on EN7581 when the
> > > traffic is routed from the WAN port to a given DSA switch port.
> > 
> > And I don't believe it should, in this way.
> 
> Can you please elaborate on this? IIUC it seems even Oleksij has a use-case
> for this.

See above, I'm also waiting for Oleksij's answer but I don't expect you
2 to be talking about the same thing. If there's some common infrastructure
to be shared, my understanding is it has nothing to do with ndo_setup_tc_conduit().

> > We need something as the root Qdisc of the conduit which exposes its
> > hardware capabilities. I just assumed that would be a simple (and single)
> > ETS, you can correct me if I am wrong.
> > 
> > On conduit egress, what is the arbitration scheme between the traffic
> > destined towards each DSA user port (channel, as the driver calls them)?
> > How can this be best represented?
> 
> The EN7581 supports up to 32 different 'channels' (each of them support 8
> different hw queues). You can define an ETS and/or TBF Qdisc for each channel.
> My idea is to associate a channel to each DSA switch port, so the user can
> define independent QoS policies for each DSA ports (e.g. shape at 100Mbps lan0,
> apply ETS on lan1, ...) configuring the mac chip instead of the hw switch.
> The kernel (if the traffic is not offloaded) or the PPE block (if the traffic
> is offloaded) updates the channel and queue information in the DMA descriptor
> (please take a look to [0] for the first case).

But you call it a MAC chip because between the GDM1 and the MT7530 there's
an in-chip Ethernet MAC (GMII netlist), with a fixed packet rate, right?
I'm asking again: are the channels completely independent of one another,
or are they consuming shared bandwidth in a way that is simply not visible
with your proposal? If there is a GMII between the GDM1 and the MT7530,
how come the bandwidth between the channels is not shared in any way?
And if there is no GMII or similar MAC interface, we need to take 100
steps back and discuss why was the DSA model chosen for this switch, and
not a freeform switchdev driver where the conduit is not discrete?

I'm not sure what to associate these channels with. Would it be wrong to
think of each channel as a separate DSA conduit? Because, for example, there
is an API to customize the user port <-> conduit assignment.

> > IIUC, in your patch set, you expose the conduit hardware QoS capabilities
> > as if they can be perfectly virtualized among DSA user ports, and as if
> > each DSA user port can have its own ETS root Qdisc, completely independent
> > of each other, as if the packets do not serialize on the conduit <-> CPU
> > port link, and as if that is not a bottleneck. Is that really the case?
> 
> correct

Very interesting but will need more than one word for an explanation :)

> > If so (but please explain how), maybe you really need your own root Qdisc
> > driver, with one class per DSA user port, and those classes have ETS
> > attached to them.
> 
> Can you please clarify what do you mean with 'root Qdisc driver'?

Quite literally, an implementer of struct Qdisc_ops whose parent can
only be TC_H_ROOT. I was implying you'd have to create an abstract
software model of the QoS capabilities of the QDMA and the GDM port such
that we all understand them, a netlink attribute scheme for configuring
those QoS parameters, and then a QoS offload mechanism through which
they are communicated to compatible hardware. But let's leave that aside
until it becomes more clear what you have.
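
To make sure we mean the same thing, here is a minimal sketch of such a
root-only Qdisc (the "conduit_qos" name is invented, and enqueue/dequeue plus
the class ops that would actually model the per-channel arbitration are
omitted; this is just the shape, not an implementation):

#include <linux/module.h>
#include <linux/netlink.h>
#include <net/pkt_sched.h>
#include <net/sch_generic.h>

/* sketch: refuse to be attached anywhere other than as the root egress
 * Qdisc, similar to what mq or taprio do
 */
static int conduit_qos_init(struct Qdisc *sch, struct nlattr *opt,
			    struct netlink_ext_ack *extack)
{
	if (sch->parent != TC_H_ROOT) {
		NL_SET_ERR_MSG(extack, "Can only be attached as root qdisc");
		return -EOPNOTSUPP;
	}

	/* here: create one class per conduit channel / DSA user port and
	 * pass the channel arbitration parameters down via ndo_setup_tc()
	 */
	return 0;
}

static struct Qdisc_ops conduit_qos_qdisc_ops __read_mostly = {
	.id	= "conduit_qos",
	.init	= conduit_qos_init,
	.owner	= THIS_MODULE,
};

The ETS/TBF Qdiscs from your series would then be grafted onto the classes of
something like this, rather than being the root Qdisc of the DSA user ports.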
Andrew Lunn Dec. 16, 2024, 11:24 p.m. UTC | #15
> Considering patch [0], we are still offloading the Qdisc on the provided
> DSA switch port (e.g. LANx) via the port_setup_tc() callback available in
> dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit()
> callback in order to use the hw Qdisc capabilities available on the mac chip
> (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply
> the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree?

I've not read all the details, so I could be getting something
wrong. But let me point out the basics. Offloading is used to
accelerate what Linux already supports in software. So forget about
your hardware. How would I configure a bunch of e1000e cards connected
to a software bridge to do what you want?

There is no conduit interface in this, so I would not expect to
explicitly configure a conduit interface. Maybe the offloading needs
to implicitly configure the conduit, but that should be all hidden
away from the user. But given that the software bridge has no concept of a
conduit, I doubt it.

It could well be that our model does not map to the hardware too well,
leaving some bits unusable, but there is not much you can do about
that; the Linux model is to accelerate what Linux supports in
software.

	Andrew
Oleksij Rempel Dec. 17, 2024, 8:46 a.m. UTC | #16
On Mon, Dec 16, 2024 at 10:44:54PM +0100, Lorenzo Bianconi wrote:
> > On Mon, Dec 16, 2024 at 08:01:33PM +0100, Lorenzo Bianconi wrote:
> > > > On Mon, Dec 16, 2024 at 01:09:01PM +0100, Lorenzo Bianconi wrote:
> > > > > I guess what I did not make clear here is that we are discussing about
> > > > > 'routed' traffic (sorry for that). The traffic is received from the WAN
> > > > > interface and routed to a DSA port (or the other way around).
> > > > > In this scenario the 3-way handshake will be received by the CPU via the
> > > > > WAN port (or the conduit port) while the subsequent packets will be hw
> > > > > forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic
> > > > > will be received by the system from GDM2 (WAN) and the PSE/PPE blocks
> > > > > will forward it to the GDM1 port that is connected to the DSA cpu port.
> > > > > 
> > > > > The proposed series is about adding the control path to apply a given Qdisc
> > > > > (ETS or TBF for EN7581) to the traffic that is following the described path
> > > > > without creating it directly on the DSA switch port (for the reasons described
> > > > > before). E.g. the user would want to apply an ETS Qdisc just for traffic
> > > > > egressing via lan0.
> > > > > 
> > > > > This series is not strictly related to the airoha_eth flowtable offload
> > > > > implementation but the latter is required to have a full pictures of the
> > > > > possible use case (this is why I was saying it is better to post it first).
> > > > 
> > > > It's good to know this does not depend on flowtable.
> > > > 
> > > > When you add an offloaded Qdisc to the egress of a net device, you don't
> > > > affect just the traffic L3 routed to that device, but all traffic (also
> > > > includes the packets sent to it using L2 forwarding). As such, I simply
> > > > don't believe that the way in which the UAPI is interpreted here (root
> > > > egress qdisc matches only routed traffic) is proper.
> > > > 
> > > > Ack?
> > > 
> > > Considering patch [0], we are still offloading the Qdisc on the provided
> > > DSA switch port (e.g. LANx) via the port_setup_tc() callback available in
> > > dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit()
> > > callback in order to use the hw Qdisc capabilities available on the mac chip
> > > (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply
> > > the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree?
> > > 
> > > > 
> > > > > > I'm trying to look at the big picture and abstract away the flowtable a
> > > > > > bit. I don't think the tc rule should be on the user port. Can the
> > > > > > redirection of packets destined towards a particular switch port be
> > > > > > accomplished with a tc u32 filter on the conduit interface instead?
> > > > > > If the tc primitives for either the filter or the action don't exist,
> > > > > > maybe those could be added instead? Like DSA keys in "flower" which gain
> > > > > > introspection into the encapsulated packet headers?
> > > > > 
> > > > > The issue with the current DSA infrastructure is there is no way to use
> > > > > the conduit port to offload a Qdisc policy to a given lan port since we
> > > > > are missing in the APIs the information about what user port we are
> > > > > interested in (this is why I added the new netdev callback).
> > > > 
> > > > How does the introduction of ndo_setup_tc_conduit() help, since the problem
> > > > is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly.
> > > > It is simply not comparable to the way in which it is offloaded by
> > > > drivers/net/dsa/microchip/ksz_common.c, even though the user space
> > > > syntax is the same. Unless you're suggesting that for ksz it is not
> > > > offloaded correctly?
> > > 
> > > nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just
> > > does not allow to exploit all hw capabilities available on EN7581 when the
> > > traffic is routed from the WAN port to a given DSA switch port. If we do:
> > > 
> > > $tc qdisc add dev lan0 root handle 1: ets strict 8 priomap ...
> > > 
> > > in the current upstream implementation we do:
> > >   dsa_user_setup_tc():
> > >      ...
> > >        -> ds->ops->port_setup_tc(ds, dp->index, type, type_data)
> > >           (it applies the Qdisc on lan0 configuring the hw switch)
> > > 
> > > adding the ndo_setup_tc_conduit() callback we have:
> > > 
> > >   dsa_user_setup_qdisc()
> > >     ...
> > >       -> conduit->netdev_ops->ndo_setup_tc_conduit(conduit, dp->index, type, type_data)
> > >          (it applies the Qdisc on EN7581 mac chip for the routed traffic destinated to lan0)
> > > 
> > >       -> ds->ops->port_setup_tc(ds, dp->index, type, type_data)
> > >          (it applies the Qdisc on lan0 configuring the hw switch)
> > > 
> > > > 
> > > > Oleksij, am I missing something?
> > > > 
> > > > > Please consider here we are discussing about Qdisc policies and not flower
> > > > > rules to mangle the traffic.
> > > > 
> > > > What's a Qdisc policy?
> > > 
> > > I mean a queue scheduler algorithm (e.g. TBF, ETS, HTB, ...)
> > > 
> > > > 
> > > > Also, flower is a classifier, not an action. It doesn't mangle packets
> > > > by the very definition of what a classifier is.
> > > 
> > > yes, but goal of the series is the Queue scheduler offloading, not
> > > classifier/action. Agree?
> > > 
> > > > 
> > > > > The hw needs to be configured in advance to apply the requested policy
> > > > > (e.g TBF for traffic shaping).
> > > > 
> > > > What are you missing exactly to make DSA packets go to a particular
> > > > channel on the conduit?
> > > > 
> > > > For Qdisc offloading you want to configure the NIC in advance, of course.
> > > > 
> > > > Can't you do something like this to guide packets to the correct channels?
> > > > 
> > > > tc qdisc add dev eth0 clsact
> > > > tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ...
> > > > tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \
> > > > 	flowid 1:1
> > > 
> > > If we apply the Qdisc directly on the conduit port (eth0) we can just apply the
> > > queue scheduler on all the traffic egressing via the DSA switch while I would
> > > like to apply it on per DSA port basis (but using the mac chip hw capabilities),
> > > got my point?
> > 
> > Hm, I guess I have a similar use case in one of my projects. In my case, the CPU
> > interface is 1Gbit and the switch ports are 100Mbit each. It is still
> > possible to keep the CPU interface busy by sending a 1Gbit UDP stream, so
> > 900Mbit is dropped by the switch. I would like to have a traffic limiter
> > per virtual DSA port on the SoC side to reduce the load on the DSA conduit.
> > Currently this is not possible.
> 
> Does the mac chip in your setup support TX shaping (e.g. via HTB or TBF)?
> If so, I guess you could do it via the ndo_setup_tc_conduit() callback.

HW offloading is an optional optimization. As a first step, it should be able
to work without HW offloading.
Lorenzo Bianconi Dec. 17, 2024, 9:11 a.m. UTC | #17
> On Mon, Dec 16, 2024 at 11:28:18PM +0100, Lorenzo Bianconi wrote:
> > > ndo_setup_tc_conduit() does not have the same instruments to offload
> > > what port_setup_tc() can offload. It is not involved in all data paths
> > > that port_setup_tc() has to handle. Please ack this. So if port_setup_tc()
> > 
> > Can you please elaborate on this? Both (->ndo_setup_tc_conduit() and
> > ->port_setup_tc()) refer to the same DSA user port (please take a look the
> > callback signature).
> 
> I'd be just repeating what I've said a few times before. Your proposed
> ndo_setup_tc_conduit() appears to be configuring conduit resources
> (QDMA, GDM) for mt7530 user port tc offload, as if it is in complete and
> exclusive control of the user port data path. But as long as there are
> packets in the user port data path that bypass those conduit QoS resources
> (like for example mt7530 switch forwards packet from one port to another,
> bypassing the GDM1 in your drawing[1]), it isn't a good model. Forget
> about ndo_setup_tc_conduit(), it isn't a good tc command to run in the
> first place. The tc command you're trying to make to do what you want is
> supposed to _also_ include the mt7530 packets forwarded from one port to
> another in its QoS mix. It applies at the _egress_ of the mt7530 port.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23020f04932701d5c8363e60756f12b43b8ed752
> 
> Let me try to add some squiggles based on your diagram, to clarify what
> is my understanding and complaint.
> 
>                ┌───────┐                                   ┌───────┐
>                │ QDMA2 │                                   │ QDMA1 │
>                └───┬───┘                                   └───┬───┘
>                    │                                           │
>            ┌───────▼───────────────────────────────────────────▼────────┐
>            │                                                            │
>            │       P5                                          P0       │
>            │                                                            │
>            │                                                            │
>            │                                                            │    ┌──────┐
>            │                                                         P3 ├────► GDM3 │
>            │                                                            │
> ┌─────┐    │                                                            │
> │ PPE ◄────┤ P4                        PSE                              │
> └─────┘    │                                                            │
>            │                                                            │    ┌──────┐
>            │                                                         P9 ├────► GDM4 │
>            │                                                            │    └──────┘
>            │                                                            │
>            │        P2                                         P1       │
>            └─────────┬─────────────────────────────────────────┬────────┘
>                      │                                         │
>                  ┌───▼──┐                                   ┌──▼───┐
>                  │ GDM2 │                                   │ GDM1 │
>                  └──────┘                                   └──┬───┘
>                                                                │
>                                                 ┌──────────────▼───────────────┐
>                                                 │            CPU port          │
>                                                 │   ┌─────────┘                │
>                                                 │   │         MT7530           │
>                                                 │   │                          │
>                                                 │   ▼         x                │
>                                                 │   ┌─────┐ ┌─┘                │
>                                                 │  lan1  lan2  lan3  lan4      │
>                                                 └───│──────────────────────────┘
>                                                     ▼
> 
> When you add an offloaded Qdisc to the egress of lan1, the expectation
> is that packets from lan2 obey it too (offloaded tc goes hand in hand
> with offloaded bridge). Whereas, by using GDM1/QDMA resources, you are
> breaking that expectation, because packets from lan2 bridged by MT7530
> don't go to GDM1 (the "x").

ack, I got your point. I was assuming we would cover this case (traffic from lan2 to
lan1) by keeping the port_setup_tc() callback in dsa_user_setup_qdisc() (this
traffic is not managed by the ndo_setup_tc_conduit() callback). If this approach is
not ok, I guess we will need to revisit it.

> 
> > > returns -EOPNOTSUPP, the entire dsa_user_setup_qdisc() should return
> > > -EOPNOTSUPP, UNLESS you install packet traps on all other offloaded data
> > > paths in the switch, such that all packets that egress the DSA user port
> > > are handled by ndo_setup_tc_conduit()'s instruments.
> > 
> > Uhm, do you mean we are changing the user expected result in this way?
> > It seems to me the only case we are actually changing is if port_setup_tc()
> > callback is not supported by the DSA switch driver while ndo_setup_tc_conduit()
> > one is supported by the mac chip. In this case the previous implementation
> > returns -EOPNOTSUPP while the proposed one does not report any error.
> > Do we really care about this case? If so, I guess we can rework
> > dsa_user_setup_qdisc().
> 
> See above, there's nothing to rework.
> 
> > > > nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just
> > > > does not allow to exploit all hw capabilities available on EN7581 when the
> > > > traffic is routed from the WAN port to a given DSA switch port.
> > > 
> > > And I don't believe it should, in this way.
> > 
> > Can you please elaborate on this? IIUC it seems even Oleksij has a use-case
> > for this.
> 
> See above, I'm also waiting for Oleksij's answer but I don't expect you
> 2 to be talking about the same thing. If there's some common infrastructure
> to be shared, my understanding is it has nothing to do with ndo_setup_tc_conduit().

ack

> 
> > > We need something as the root Qdisc of the conduit which exposes its
> > > hardware capabilities. I just assumed that would be a simple (and single)
> > > ETS, you can correct me if I am wrong.
> > > 
> > > On conduit egress, what is the arbitration scheme between the traffic
> > > destined towards each DSA user port (channel, as the driver calls them)?
> > > How can this be best represented?
> > 
> > The EN7581 supports up to 32 different 'channels' (each of them support 8
> > different hw queues). You can define an ETS and/or TBF Qdisc for each channel.
> > My idea is to associate a channel to each DSA switch port, so the user can
> > define independent QoS policies for each DSA ports (e.g. shape at 100Mbps lan0,
> > apply ETS on lan1, ...) configuring the mac chip instead of the hw switch.
> > The kernel (if the traffic is not offloaded) or the PPE block (if the traffic
> > is offloaded) updates the channel and queue information in the DMA descriptor
> > (please take a look to [0] for the first case).
> 
> But you call it a MAC chip because between the GDM1 and the MT7530 there's
> an in-chip Ethernet MAC (GMII netlist), with a fixed packet rate, right?

With "mac chip" I mean the set of PSE/PPE and QDMA blocks in the diagram
(what is managed by airoha_eth driver). There is no other chip in between
of GDM1 and MT7530 switch (sorry for the confusion).

> I'm asking again, are the channels completely independent of one another,
> or are they consuming shared bandwidth in a way that with your proposal
> is just not visible? If there is a GMII between the GDM1 and the MT7530,
> how come the bandwidth between the channels is not shared in any way?

Channels are logically independent.
GDM1 is connected to the MT7530 switch via a fixed-speed link (10Gbps, similar
to what we have on other MediaTek chipsets, like MT7988 [0]). The fixed link speed
is higher than the sum of the DSA port link speeds (on my development boards I have
4 DSA ports @ 1Gbps).

> And if there is no GMII or similar MAC interface, we need to take 100
> steps back and discuss why was the DSA model chosen for this switch, and
> not a freeform switchdev driver where the conduit is not discrete?
> 
> I'm not sure what to associate these channels with. Would it be wrong to
> think of each channel as a separate DSA conduit? Because for example there
> is API to customize the user port <-> conduit assignment.
> 
> > > IIUC, in your patch set, you expose the conduit hardware QoS capabilities
> > > as if they can be perfectly virtualized among DSA user ports, and as if
> > > each DSA user port can have its own ETS root Qdisc, completely independent
> > > of each other, as if the packets do not serialize on the conduit <-> CPU
> > > port link, and as if that is not a bottleneck. Is that really the case?
> > 
> > correct
> 
> Very interesting but will need more than one word for an explanation :)

I mean that your paragraph above describes the hw architecture well.

> 
> > > If so (but please explain how), maybe you really need your own root Qdisc
> > > driver, with one class per DSA user port, and those classes have ETS
> > > attached to them.
> > 
> > Can you please clarify what do you mean with 'root Qdisc driver'?
> 
> Quite literally, an implementer of struct Qdisc_ops whose parent can
> only be TC_H_ROOT. I was implying you'd have to create an abstract
> software model of the QoS capabilities of the QDMA and the GDM port such
> that we all understand them, a netlink attribute scheme for configuring
> those QoS parameters, and then a QoS offload mechanism through which
> they are communicated to compatible hardware. But let's leave that aside
> until it becomes more clear what you have.
> 

ack, fine.

Regards,
Lorenzo

[0] https://github.com/openwrt/openwrt/blob/main/target/linux/mediatek/files-6.6/arch/arm64/boot/dts/mediatek/mt7988a.dtsi#L1531
Vladimir Oltean Dec. 17, 2024, 9:30 a.m. UTC | #18
On Tue, Dec 17, 2024 at 10:11:46AM +0100, Lorenzo Bianconi wrote:
> > When you add an offloaded Qdisc to the egress of lan1, the expectation
> > is that packets from lan2 obey it too (offloaded tc goes hand in hand
> > with offloaded bridge). Whereas, by using GDM1/QDMA resources, you are
> > breaking that expectation, because packets from lan2 bridged by MT7530
> > don't go to GDM1 (the "x").
> 
> ack, I got your point. I was assuming to cover this case (traffic from lan2 to
> lan1) maintaining the port_setup_tc() callback in dsa_user_setup_qdisc() (this
> traffic is not managed by ndo_setup_tc_conduit() callback). If this approach is
> not ok, I guess we will need to revisit the approach.

To offload QoS on the egress of a DSA user port:

port_setup_tc() is:
(a) necessary
(b) sufficient

ndo_setup_tc_conduit() is:
(a) unnecessary
(b) insufficient

> > But you call it a MAC chip because between the GDM1 and the MT7530 there's
> > an in-chip Ethernet MAC (GMII netlist), with a fixed packet rate, right?
> 
> With "mac chip" I mean the set of PSE/PPE and QDMA blocks in the diagram
> (what is managed by airoha_eth driver). There is no other chip in between
> of GDM1 and MT7530 switch (sorry for the confusion).

The MT7530 is also on the same chip as the GDM1, correct?

> > I'm asking again, are the channels completely independent of one another,
> > or are they consuming shared bandwidth in a way that with your proposal
> > is just not visible? If there is a GMII between the GDM1 and the MT7530,
> > how come the bandwidth between the channels is not shared in any way?
> 
> Channels are logically independent.

"Logically independent" does not mean "does not share resources", which
is what I asked.

> GDM1 is connected to the MT7530 switch via a fixed speed link (10Gbps, similar
> to what we have for other MediaTek chipset, like MT7988 [0]). The fixed link speed
> is higher than the sum of DSA port link speeds (on my development boards I have
> 4 DSA ports @ 1Gbps);

And this fixed connection is a pair of internal Ethernet MACs, correct?
I see on MT7988 we do have the "pause" property, which would suggest so,
since flow control is a MAC level feature. I assume 10 Gbps in the
device tree means it is an XGMII really limited at that speed, and that
speed is not just for phylink compliance, right?

What if we push your example to the extreme, and say that the DSA user
ports also have 10 Gbps links? How independent are the QDMA channels in
this case? What arbitration algorithm will be used for QoS among user
ports, when the combined bandwidth exceeds the CPU port capacity?
Oleksij Rempel Dec. 17, 2024, 9:38 a.m. UTC | #19
On Tue, Dec 17, 2024 at 12:24:13AM +0100, Andrew Lunn wrote:
> > Considering patch [0], we are still offloading the Qdisc on the provided
> > DSA switch port (e.g. LANx) via the port_setup_tc() callback available in
> > dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit()
> > callback in order to use the hw Qdisc capabilities available on the mac chip
> > (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply
> > the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree?
> 
> I've not read all the details, so i could be getting something
> wrong. But let me point out the basics. Offloading is used to
> accelerate what Linux already supports in software. So forget about
> your hardware. How would i configure a bunch of e1000e cards connected
> to a software bridge to do what you want?
> 
> There is no conduit interface in this, so i would not expect to
> explicitly configure a conduit interface. Maybe the offloading needs
> to implicitly configure the conduit, but that should be all hidden
> away from the user. But given the software bridge has no concept of a
> conduit, i doubt it.
> 
> It could well be our model does not map to the hardware too well,
> leaving some bits unusable, but there is not much you can do about
> that, that is the Linux model, accelerate what Linux supports in
> software.

Hi,

You are absolutely correct that offloading should accelerate what Linux already
supports in software, and we need to respect this model. However, I’d like to
step back for a moment to clarify the underlying problem before focusing too
much on solutions.

### The Core Problem: Flow Control Limitations

1. **QoS and Flow Control:** 

   At the heart of proper QoS implementation lies flow control. Flow control
   mechanisms exist at various levels:

   - MAC-level signaling (e.g., pause frames)

   - Queue management (e.g., stopping queues when the hardware is congested)

   The typical Linux driver uses flow control signaling from the MAC (e.g.,
   stopping queues) to coordinate traffic, and depending on the Qdisc, this
   flow control can propagate up to user space applications.

2. **Challenges with DSA:**
   In DSA, we lose direct **flow control communication** between:

   - The host MAC

   - The MAC of a DSA user port.

   While internal flow control within the switch may still work, it does not
   extend to the host. Specifically:

   - Pause frames often affect **all priorities** and are not granular enough
     for low-latency applications.

   - The signaling from the MAC of the DSA user port to the host is either
     **not supported** or is **disabled** (often through device tree
     configuration).

### Why This Matters for QoS

For traffic flowing **from the host** to DSA user ports:

- Without proper flow control, congestion cannot be communicated back to the
  host, leading to buffer overruns and degraded QoS.  

- To address this, we need to compensate for the lack of flow control signaling
  by applying traffic limits (or shaping).

### Approach: Applying Limits on the Conduit Interface

One way to solve this is by applying traffic shaping or limits directly on the
**conduit MAC**. However, this approach has significant complexity:

1. **Hardware-Specific Details:**

   We would need deep hardware knowledge to set up traffic filters or dissectors
   at the conduit level. This includes:

   - Parsing **CPU tags** specific to the switch in use.  

   - Applying port-specific rules, some of which depend on **user port link
     speed**.

2. **Admin Burden:**

   Forcing network administrators to configure conduit-specific filters
   manually increases complexity and goes against the existing DSA abstractions,
   which are already well-integrated into the kernel.


### How Things Can Be Implemented

To address QoS for host-to-user port traffic in DSA, I see two possible
approaches:

#### 1. Apply Rules on the Conduit Port (Using `dst_port`)

In this approach, rules are applied to the **conduit interface**, and specific
user ports are matched using **port indices**.

# Conduit interface  
tc qdisc add dev conduit0 clsact  

# Match traffic for user port 1 (e.g., lan0)  
tc filter add dev conduit0 egress flower dst_port 1 \  
    action police rate 50mbit burst 5k drop  

# Match traffic for user port 2 (e.g., lan1)  
tc filter add dev conduit0 egress flower dst_port 2 \  
    action police rate 30mbit burst 3k drop  

#### 2. Apply Rules Directly on the User Ports (With Conduit Marker)

In this approach, rules are applied **directly to the user-facing DSA ports**
(e.g., `lan0`, `lan1`) with a **conduit-specific marker**. The kernel resolves
the mapping internally.

# Apply rules with conduit marker for user ports  
tc qdisc add dev lan0 root tbf rate 50mbit burst 5k conduit-only  
tc qdisc add dev lan1 root tbf rate 30mbit burst 3k conduit-only  

Here:  
- **`conduit-only`**: A marker (flag) indicating that the rule applies
specifically to **host-to-port traffic** and not to L2-forwarded traffic within
the switch.  

### Recommendation

The second approach (**user port-based with `conduit-only` marker**) is cleaner
and more intuitive. It avoids exposing hardware details like port indices while
letting the kernel handle conduit-specific behavior transparently.

Best regards,  
Oleksij
Lorenzo Bianconi Dec. 17, 2024, 10:01 a.m. UTC | #20
> On Tue, Dec 17, 2024 at 10:11:46AM +0100, Lorenzo Bianconi wrote:
> > > When you add an offloaded Qdisc to the egress of lan1, the expectation
> > > is that packets from lan2 obey it too (offloaded tc goes hand in hand
> > > with offloaded bridge). Whereas, by using GDM1/QDMA resources, you are
> > > breaking that expectation, because packets from lan2 bridged by MT7530
> > > don't go to GDM1 (the "x").
> > 
> > ack, I got your point. I was assuming to cover this case (traffic from lan2 to
> > lan1) maintaining the port_setup_tc() callback in dsa_user_setup_qdisc() (this
> > traffic is not managed by ndo_setup_tc_conduit() callback). If this approach is
> > not ok, I guess we will need to revisit the approach.
> 
> To offload QoS on the egress of a DSA user port:
> 
> port_setup_tc() is:
> (a) necessary
> (b) sufficient
> 
> ndo_setup_tc_conduit() is:
> (a) unnecessary

I agree it is unnecessary, but w/o it we must rely on the limited QoS capabilities
of the hw DSA switch. The goal of the series is just to exploit the enhanced QoS
capabilities available on the EN7581 SoC.

> (b) insufficient
> 
> > > But you call it a MAC chip because between the GDM1 and the MT7530 there's
> > > an in-chip Ethernet MAC (GMII netlist), with a fixed packet rate, right?
> > 
> > With "mac chip" I mean the set of PSE/PPE and QDMA blocks in the diagram
> > (what is managed by airoha_eth driver). There is no other chip in between
> > of GDM1 and MT7530 switch (sorry for the confusion).
> 
> The MT7530 is also on the same chip as the GDM1, correct?

I think so, but I am not sure.

> 
> > > I'm asking again, are the channels completely independent of one another,
> > > or are they consuming shared bandwidth in a way that with your proposal
> > > is just not visible? If there is a GMII between the GDM1 and the MT7530,
> > > how come the bandwidth between the channels is not shared in any way?
> > 
> > Channels are logically independent.
> 
> "Logically independent" does not mean "does not share resources", which
> is what I asked.
> 
> > GDM1 is connected to the MT7530 switch via a fixed speed link (10Gbps, similar
> > to what we have for other MediaTek chipset, like MT7988 [0]). The fixed link speed
> > is higher than the sum of DSA port link speeds (on my development boards I have
> > 4 DSA ports @ 1Gbps);
> 
> And this fixed connection is a pair of internal Ethernet MACs, correct?

yes

> I see on MT7988 we do have the "pause" property, which would suggest so,
> since flow control is a MAC level feature. I assume 10 Gbps in the
> device tree means it is an XGMII really limited at that speed, and that
> speed is not just for phylink compliance, right?

I think so

> 
> What if we push your example to the extreme, and say that the DSA user
> ports also have 10 Gbps links? How independent are the QDMA channels in
> this case? What arbitration algorithm will be used for QoS among user
> ports, when the combined bandwidth exceeds the CPU port capacity?

AFAIR there is even the possibility to configure inter-channel QoS on the chip,
like a round-robin scheduler or strict priority between channels.

Regards,
Lorenzo
Vladimir Oltean Dec. 17, 2024, 10:17 a.m. UTC | #21
On Tue, Dec 17, 2024 at 11:01:28AM +0100, Lorenzo Bianconi wrote:
> I agree it is unnecessary, but w/o it we must rely on limited QoS capabilities
> of the hw dsa switch. The goal of the series is just exploit enhanced QoS
> capabilities available on the EN7581 SoC.
(...)
> AFIR there is even the possibility to configure inter-channel QoS on the chip,
> like a Round Robin scheduler or Strict-Priority between channels.

So the conclusion of both these statements is that it would be good to understand
the QoS algorithm among channels of the MAC chip, see if there is any classful
root Qdisc which can describe that algorithm, then offload it with ndo_setup_tc(),
then assign traffic to user ports probably using a software classifier + flowid.
The ETS Qdiscs which you are trying to add here should not be attached as the root
egress Qdisc of DSA user ports, but to the classes of this unknown conduit root
Qdisc. Agree?
Oleksij Rempel Dec. 17, 2024, 10:23 a.m. UTC | #22
On Tue, Dec 17, 2024 at 12:17:24PM +0200, Vladimir Oltean wrote:
> On Tue, Dec 17, 2024 at 11:01:28AM +0100, Lorenzo Bianconi wrote:
> > I agree it is unnecessary, but w/o it we must rely on limited QoS capabilities
> > of the hw dsa switch. The goal of the series is just exploit enhanced QoS
> > capabilities available on the EN7581 SoC.
> (...)
> > AFIR there is even the possibility to configure inter-channel QoS on the chip,
> > like a Round Robin scheduler or Strict-Priority between channels.
> 
> So the conclusion of both these statements is that it would be good to understand
> the QoS algorithm among channels of the MAC chip, see if there is any classful
> root Qdisc which can describe that algorithm, then offload it with ndo_setup_tc(),
> then assign traffic to user ports probably using a software classifier + flowid.
> The ETS Qdiscs which you are trying to add here should not be attached to the root
> egress Qdisc of DSA user ports, but to the classes of this unknown conduit root
> Qdisc. Agree?

I would love to have some kind of easy-to-use DSA user port classifier.
Vladimir Oltean Dec. 17, 2024, 11:54 a.m. UTC | #23
On Tue, Dec 17, 2024 at 10:38:21AM +0100, Oleksij Rempel wrote:
> Hi,
> 
> You are absolutely correct that offloading should accelerate what Linux already
> supports in software, and we need to respect this model. However, I’d like to
> step back for a moment to clarify the underlying problem before focusing too
> much on solutions.
> 
> ### The Core Problem: Flow Control Limitations
> 
> 1. **QoS and Flow Control:** 
> 
>    At the heart of proper QoS implementation lies flow control. Flow control
>    mechanisms exist at various levels:
> 
>    - MAC-level signaling (e.g., pause frames)
> 
>    - Queue management (e.g., stopping queues when the hardware is congested)
> 
>    The typical Linux driver uses flow control signaling from the MAC (e.g.,
>    stopping queues) to coordinate traffic, and depending on the Qdisc, this
>    flow control can propagate up to user space applications.

I read this section as "The Core Problem: Ethernet".

> 2. **Challenges with DSA:**
>    In DSA, we lose direct **flow control communication** between:
> 
>    - The host MAC
> 
>    - The MAC of a DSA user port.
> 
>    While internal flow control within the switch may still work, it does not
>    extend to the host. Specifically:
> 
>    - Pause frames often affect **all priorities** and are not granular enough
>      for low-latency applications.
> 
>    - The signaling from the MAC of the DSA user port to the host is either
>      **not supported** or is **disabled** (often through device tree
>      configuration).

And this as: "Challenges with DSA: uses Ethernet". I think we can all
agree that standard Ethernet, with all the flexibility it gives to pair
any discrete DSA switch to any host NIC, does not give us sufficient
instruments for independent flow control of each user port.

Food for thought: strongly coupled MAC + integrated DSA switch systems,
like for example Broadcom, have custom methods of pacing transmission to
user ports by selectively stopping conduit TX queues associated with
those user ports on congestion:
https://lore.kernel.org/netdev/7510c29a-b60f-e0d7-4129-cb90fe376c74@gmail.com/

> ### Why This Matters for QoS
> 
> For traffic flowing **from the host** to DSA user ports:
> 
> - Without proper flow control, congestion cannot be communicated back to the
>   host, leading to buffer overruns and degraded QoS.  

There are multiple, and sometimes conflicting, goals to QoS and strategies on
congestion. Generally speaking, it is good to clarify that deterministic latency,
high throughput and zero loss cannot all be achieved at the same time. It is
also good to highlight the fact that you are focusing on zero loss and that
this is not necessarily the full picture. Some AVB/TSN switches, like SJA1105,
do not support pause frames at all, not even on user ports, because as you say,
it's like the nuclear solution which stops the entire port regardless of
packet priorities. And even if they did support it, for deterministic latency
applications it is best to turn it off. If you make a port enter congestion by
bombarding it with TC0 traffic, you'll incur latency to TC7 traffic until you
exit the congestion condition. These switches just expect to have reservations
very carefully configured by the system administrator. What exceeds reservations
and cannot consume shared resources (because they are temporarily depleted) is dropped.

> - To address this, we need to compensate for the lack of flow control signaling
>   by applying traffic limits (or shaping).

A splendid idea in theory. In practice, the traffic rate at the egress
of a user port is the sum of locally injected traffic plus autonomously
forwarded traffic. The port can enter congestion even with shaping of
CPU-injected traffic at a certain rate.

            Conduit
               |
               v
  +-------------------------+
  |         CPU port        |
  |            |            |
  |   +--------+            |
  |   |                     |
  |   +<---+                |
  |   |    |                |
  |   v    |                |
  | lan0  lan1  lan2  lan3  |
  +-------------------------+
      |
      v Just 1Gbps.

You _could_ apply this technique to achieve a different purpose than
net zero packet loss: selective transmission guarantees for CPU-injected
traffic. But you also need to ensure that injected packets have a higher
strict priority than the rest, and that the switch resources are
configured through devlink-sb to have enough reserved space to keep
these high priority packets on congestion and drop something else instead.

It's a tool to have for sure, but you need to be extremely specific and
realistic about your goals.

> ### Approach: Applying Limits on the Conduit Interface
> 
> One way to solve this is by applying traffic shaping or limits directly on the
> **conduit MAC**. However, this approach has significant complexity:
> 
> 1. **Hardware-Specific Details:**
> 
>    We would need deep hardware knowledge to set up traffic filters or disectors
>    at the conduit level. This includes:
> 
>    - Parsing **CPU tags** specific to the switch in use.  
> 
>    - Applying port-specific rules, some of which depend on **user port link
>      speed**.
> 
> 2. **Admin Burden:**
> 
>    Forcing network administrators to configure conduit-specific filters
>    manually increases complexity and goes against the existing DSA abstractions,
>    which are already well-integrated into the kernel.

Agree that there is high complexity. Just need to see a proposal which
acknowledges that it's not for nothing.

> ### How Things Can Be Implemented
> 
> To address QoS for host-to-user port traffic in DSA, I see two possible
> approaches:
> 
> #### 1. Apply Rules on the Conduit Port (Using `dst_port`)
> 
> In this approach, rules are applied to the **conduit interface**, and specific
> user ports are matched using **port indices**.
> 
> # Conduit interface  
> tc qdisc add dev conduit0 clsact  
> 
> # Match traffic for user port 1 (e.g., lan0)  
> tc filter add dev conduit0 egress flower dst_port 1 \  
>     action police rate 50mbit burst 5k drop  
> 
> # Match traffic for user port 2 (e.g., lan1)  
> tc filter add dev conduit0 egress flower dst_port 2 \  
>     action police rate 30mbit burst 3k drop  

Ok, so you propose an abstract key set for DSA in the flower classifier
with mappings to concrete packet fields happening in the backend,
probably done by the tagging protocol in use. The abstract key set
represents the superset of all known DSA fields, united by a common
interpretation, and each tagger rejects keys it cannot map to the
physical DSA tag.

I can immediately think of a challenge here, that we can dynamically
change the tagging protocol while tc rules are present, and this can
affect which flower keys can be mapped and which cannot. For example,
the ocelot tagging protocol could map a virtual DSA key "TX timestamp
type" to the REW_OP field, but the ocelot-8021q tagger cannot. Plus, you
could add tc filters to a block shared by multiple devices. You can't
always infer the physical tagging protocol from the device that the
filters are attached to.
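
Just to have a concrete strawman of what such an abstract key could look like
(purely hypothetical, nothing like this exists today, and the field set would
need a lot more thought):

/* hypothetical flower key for DSA-tagged traffic on the conduit */
struct flow_dissector_key_dsa {
	u8 dst_port;	/* destination user port behind the conduit, mapped to
			 * the physical tag field by the tagging protocol in use
			 */
};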

> #### 2. Apply Rules Directly on the User Ports (With Conduit Marker)
> 
> In this approach, rules are applied **directly to the user-facing DSA ports**
> (e.g., `lan0`, `lan1`) with a **conduit-specific marker**. The kernel resolves
> the mapping internally.
> 
> # Apply rules with conduit marker for user ports  
> tc qdisc add dev lan0 root tbf rate 50mbit burst 5k conduit-only  
> tc qdisc add dev lan1 root tbf rate 30mbit burst 3k conduit-only  
> 
> Here:  
> - **`conduit-only`**: A marker (flag) indicating that the rule applies
> specifically to **host-to-port traffic** and not to L2-forwarded traffic within
> the switch.  
> 
> ### Recommendation
> 
> The second approach (**user port-based with `conduit-only` marker**) is cleaner
> and more intuitive. It avoids exposing hardware details like port indices while
> letting the kernel handle conduit-specific behavior transparently.
> 
> Best regards,  
> Oleksij

The second approach that you recommend suffers from the same problem as Lorenzo's
revised proposal, which is that it treats the conduit interface as a collection of
independent pipes of infinite capacity to each user port, with no arbitration concerns
of its own. The model is again great in theory, but maps really poorly onto real life.
Your proposal actively encourages users to look away from the scheduling algorithm
of the conduit, and just look at user ports in isolation from each other. I strongly
disagree with it.
Oleksij Rempel Dec. 17, 2024, 12:22 p.m. UTC | #24
On Tue, Dec 17, 2024 at 01:54:48PM +0200, Vladimir Oltean wrote:
> On Tue, Dec 17, 2024 at 10:38:21AM +0100, Oleksij Rempel wrote:
> > Hi,
> > 
> > You are absolutely correct that offloading should accelerate what Linux already
> > supports in software, and we need to respect this model. However, I’d like to
> > step back for a moment to clarify the underlying problem before focusing too
> > much on solutions.
> > 
> > ### The Core Problem: Flow Control Limitations
> > 
> > 1. **QoS and Flow Control:** 
> > 
> >    At the heart of proper QoS implementation lies flow control. Flow control
> >    mechanisms exist at various levels:
> > 
> >    - MAC-level signaling (e.g., pause frames)
> > 
> >    - Queue management (e.g., stopping queues when the hardware is congested)
> > 
> >    The typical Linux driver uses flow control signaling from the MAC (e.g.,
> >    stopping queues) to coordinate traffic, and depending on the Qdisc, this
> >    flow control can propagate up to user space applications.
> 
> I read this section as "The Core Problem: Ethernet".

ack.

> > ### Why This Matters for QoS
> > 
> > For traffic flowing **from the host** to DSA user ports:
> > 
> > - Without proper flow control, congestion cannot be communicated back to the
> >   host, leading to buffer overruns and degraded QoS.  
> 
> There are multiple, and sometimes conflicting, goals to QoS and strategies on
> congestion. Generally speaking, it is good to clarify that deterministic latency,
> high throughput and zero loss cannot be all achieved at the same time. It is
> also good to highlight the fact that you are focusing on zero loss and that
> this is not necessarily the full picture. Some AVB/TSN switches, like SJA1105,
> do not support pause frames at all, not even on user ports, because as you say,
> it's like the nuclear solution which stops the entire port regardless of
> packet priorities. And even if they did support it, for deterministic latency
> applications it is best to turn it off. If you make a port enter congestion by
> bombarding it with TC0 traffic, you'll incur latency to TC7 traffic until you
> exit the congestion condition. These switches just expect to have reservations
> very carefully configured by the system administrator. What exceeds reservations
> and cannot consume shared resources (because they are temporarily depleted) is dropped.

> > - To address this, we need to compensate for the lack of flow control signaling
> >   by applying traffic limits (or shaping).
> 
> A splendid idea in theory. In practice, the traffic rate at the egress
> of a user port is the sum of locally injected traffic plus autonomously
> forwarded traffic. The port can enter congestion even with shaping of
> CPU-injected traffic at a certain rate.
> 
>             Conduit
>                |
>                v
>   +-------------------------+
>   |         CPU port        |
>   |            |            |
>   |   +--------+            |
>   |   |                     |
>   |   +<---+                |
>   |   |    |                |
>   |   v    |                |
>   | lan0  lan1  lan2  lan3  |
>   +-------------------------+
>       |
>       v Just 1Gbps.
> 
> You _could_ apply this technique to achieve a different purpose than
> net zero packet loss: selective transmission guarantees for CPU-injected
> traffic. But you also need to ensure that injected packets have a higher
> strict priority than the rest, and that the switch resources are
> configured through devlink-sb to have enough reserved space to keep
> these high priority packets on congestion and drop something else instead.
> 
> It's a tool to have for sure, but you need to be extremely specific and
> realistic about your goals.

Yes, you are right. For my specific use case the switch is used mostly as a port
multiplier.

> > #### 2. Apply Rules Directly on the User Ports (With Conduit Marker)
> > 
> > In this approach, rules are applied **directly to the user-facing DSA ports**
> > (e.g., `lan0`, `lan1`) with a **conduit-specific marker**. The kernel resolves
> > the mapping internally.
> > 
> > # Apply rules with conduit marker for user ports  
> > tc qdisc add dev lan0 root tbf rate 50mbit burst 5k conduit-only  
> > tc qdisc add dev lan1 root tbf rate 30mbit burst 3k conduit-only  
> > 
> > Here:  
> > - **`conduit-only`**: A marker (flag) indicating that the rule applies
> > specifically to **host-to-port traffic** and not to L2-forwarded traffic within
> > the switch.  
> > 
> > ### Recommendation
> > 
> > The second approach (**user port-based with `conduit-only` marker**) is cleaner
> > and more intuitive. It avoids exposing hardware details like port indices while
> > letting the kernel handle conduit-specific behavior transparently.
> > 
> > Best regards,  
> > Oleksij
> 
> The second approach that you recommend suffers from the same problem as Lorenzo's
> revised proposal, which is that it treats the conduit interface as a collection of
> independent pipes of infinite capacity to each user port, with no arbitration concerns
> of its own. The model is again great in theory, but maps really poorly on real life.
> Your proposal actively encourages users to look away from the scheduling algorithm
> of the conduit, and just look at user ports in isolation of each other. I strongly
> disagree with it.

I'm still thinking about the best way to classify DSA user port traffic.
Will it be enough to set a classid on the user port?

tc filter add dev lan0 protocol all flower skip_hw \
    classid 1:1
tc filter add dev lan1 protocol all flower skip_hw \
    classid 1:2

And then process it on the conduit port:
# Add HTB Qdisc on the conduit interface
tc qdisc add dev conduit0 root handle 1: htb default 1

# Define rate-limiting classes
tc class add dev conduit0 parent 1: classid 1:1 htb rate 100mbit burst 5k
tc class add dev conduit0 parent 1: classid 1:2 htb rate 100mbit burst 5k

Or will the classid not be transferred between devices, so that I'll need to
use something like skbedit?
Vladimir Oltean Dec. 17, 2024, 1:28 p.m. UTC | #25
On Tue, Dec 17, 2024 at 01:22:45PM +0100, Oleksij Rempel wrote:
> I'm still thinking about best way to classify DSA user port traffic.
> Will it be enough to set classid on user port?
> 
> tc filter add dev lan0 protocol all flower skip_hw \
>     action classid 1:1
> tc filter add dev lan1 protocol all flower skip_hw \
>     action classid 1:2
> 
> And then process it on the conduit port:
> # Add HTB Qdisc on the conduit interface
> tc qdisc add dev conduit0 root handle 1: htb default 1
> 
> # Define rate-limiting classes
> tc class add dev conduit0 parent 1: classid 1:1 htb rate 100mbit burst 5k
> tc class add dev conduit0 parent 1: classid 1:2 htb rate 100mbit burst 5k
> 
> Or the classid will not be transferred between devices and i'll need to
> use something like skbedit?

I don't know; if you find out, let us know.