| Message ID | cover.1733930558.git.lorenzo@kernel.org (mailing list archive) |
|---|---|
| Series | Add ETS and TBF Qdisc offload for Airoha EN7581 SoC |
Hi Lorenzo, On Wed, Dec 11, 2024 at 04:31:48PM +0100, Lorenzo Bianconi wrote: > Introduce support for ETS and TBF qdisc offload available in the Airoha > EN7581 ethernet controller. > Some DSA hw switches do not support Qdisc offloading or the mac chip > has more fine grained QoS capabilities with respect to the hw switch > (e.g. Airoha EN7581 mac chip has more hw QoS and buffering capabilities > with respect to the mt7530 switch). > Introduce ndo_setup_tc_conduit callback in order to allow tc to offload > Qdisc policies for the specified DSA user port configuring the hw switch > cpu port (mac chip). Can you please make a detailed diagram explaining how is the conduit involved in the packet data path for QoS? Offloaded tc on a DSA user port is supposed to affect autonomously forwarded traffic too (like the Linux bridge).
> Hi Lorenzo, Hi Vladimir, > > On Wed, Dec 11, 2024 at 04:31:48PM +0100, Lorenzo Bianconi wrote: > > Introduce support for ETS and TBF qdisc offload available in the Airoha > > EN7581 ethernet controller. > > Some DSA hw switches do not support Qdisc offloading or the mac chip > > has more fine grained QoS capabilities with respect to the hw switch > > (e.g. Airoha EN7581 mac chip has more hw QoS and buffering capabilities > > with respect to the mt7530 switch). > > Introduce ndo_setup_tc_conduit callback in order to allow tc to offload > > Qdisc policies for the specified DSA user port configuring the hw switch > > cpu port (mac chip). > > Can you please make a detailed diagram explaining how is the conduit > involved in the packet data path for QoS? Offloaded tc on a DSA user > port is supposed to affect autonomously forwarded traffic too (like the > Linux bridge). I guess a typical use case would be the one below where the traffic from the WAN port is forwarded to a DSA LAN one (e.g. lan0) via netfilter flowtable offload. ┌─────────────────────────────────┐ │ BR0 │ └───┬────────┬────────┬────────┬──┘ ┌───────────────┼────────┼────────┼────────┼───┐ │DSA │ │ │ │ │ │ │ │ │ │ │ │ ┌───┐ ┌──▼─┐ ┌──▼─┐ ┌──▼─┐ ┌──▼─┐ │ ┌───┐ │ │CPU│ │LAN0│ │LAN1│ │LAN2│ │LAN3│ │ │WAN│ │ └───┘ └────┘ └────┘ └────┘ └────┘ │ └───┘ └──────────────────────────────────────────────┘ In this case the mac chip forwards (in hw) the WAN traffic to the DSA switch via the CPU port. In [0] we have the EN7581 mac chip architecture where we can assume GDM1 is the CPU port and GDM2 is the WAN port. The goal of this RFC series is to offload a Qdisc rule (e.g. ETS) on a given LAN port using the mac chip QoS capabilities instead of creating the QoS discipline directly in the DSA hw switch: $tc qdisc replace dev lan0 root handle 1: ets bands 8 strict 2 quanta 1514 1514 1514 3528 1514 1514 As described above the reason for this approach would be to rely on the more fine grained QoS capabilities available on the mac chip with respect to the hw switch or because the DSA switch does not support QoS offloading. Regards, Lorenzo [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23020f04932701d5c8363e60756f12b43b8ed752
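For reference, the bridged-LAN topology in the diagram corresponds to a setup along these lines (a minimal sketch using the interface names from the diagram, built with standard iproute2 commands rather than anything from the series itself), on top of which the per-port ETS command quoted above is then applied:

$ip link add name br0 type bridge
$ip link set dev lan0 master br0
$ip link set dev lan1 master br0
$ip link set dev lan2 master br0
$ip link set dev lan3 master br0
$ip link set dev br0 up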
On Thu, Dec 12, 2024 at 10:19:41AM +0100, Lorenzo Bianconi wrote: > > Hi Lorenzo, > > Hi Vladimir, > > > > > On Wed, Dec 11, 2024 at 04:31:48PM +0100, Lorenzo Bianconi wrote: > > > Introduce support for ETS and TBF qdisc offload available in the Airoha > > > EN7581 ethernet controller. > > > Some DSA hw switches do not support Qdisc offloading or the mac chip > > > has more fine grained QoS capabilities with respect to the hw switch > > > (e.g. Airoha EN7581 mac chip has more hw QoS and buffering capabilities > > > with respect to the mt7530 switch). > > > Introduce ndo_setup_tc_conduit callback in order to allow tc to offload > > > Qdisc policies for the specified DSA user port configuring the hw switch > > > cpu port (mac chip). > > > > Can you please make a detailed diagram explaining how is the conduit > > involved in the packet data path for QoS? Offloaded tc on a DSA user > > port is supposed to affect autonomously forwarded traffic too (like the > > Linux bridge). > > I guess a typical use case would be the one below where the traffic from the > WAN port is forwarded to a DSA LAN one (e.g. lan0) via netfilter flowtable > offload. > > ┌─────────────────────────────────┐ > │ BR0 │ > └───┬────────┬────────┬────────┬──┘ > ┌───────────────┼────────┼────────┼────────┼───┐ > │DSA │ │ │ │ │ > │ │ │ │ │ │ > │ ┌───┐ ┌──▼─┐ ┌──▼─┐ ┌──▼─┐ ┌──▼─┐ │ ┌───┐ > │ │CPU│ │LAN0│ │LAN1│ │LAN2│ │LAN3│ │ │WAN│ > │ └───┘ └────┘ └────┘ └────┘ └────┘ │ └───┘ > └──────────────────────────────────────────────┘ > > In this case the mac chip forwards (in hw) the WAN traffic to the DSA switch > via the CPU port. In [0] we have the EN7581 mac chip architecture where we > can assume GDM1 is the CPU port and GDM2 is the WAN port. > The goal of this RFC series is to offload a Qdisc rule (e.g. ETS) on a given > LAN port using the mac chip QoS capabilities instead of creating the QoS > discipline directly in the DSA hw switch: > > $tc qdisc replace dev lan0 root handle 1: ets bands 8 strict 2 quanta 1514 1514 1514 3528 1514 1514 > > As described above the reason for this approach would be to rely on the more > fine grained QoS capabilities available on the mac chip with respect to the > hw switch or because the DSA switch does not support QoS offloading. > > Regards, > Lorenzo > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23020f04932701d5c8363e60756f12b43b8ed752 Explain "the mac chip forwards (in hw) the WAN traffic to the DSA switch via the CPU port". How many packets does airoha_dev_select_queue() see? All of them, or only the first of a flow? What operations does the offload consist of?
On Dec 12, Vladimir Oltean wrote: > On Thu, Dec 12, 2024 at 10:19:41AM +0100, Lorenzo Bianconi wrote: > > > Hi Lorenzo, > > > > Hi Vladimir, > > > > > > > > On Wed, Dec 11, 2024 at 04:31:48PM +0100, Lorenzo Bianconi wrote: > > > > Introduce support for ETS and TBF qdisc offload available in the Airoha > > > > EN7581 ethernet controller. > > > > Some DSA hw switches do not support Qdisc offloading or the mac chip > > > > has more fine grained QoS capabilities with respect to the hw switch > > > > (e.g. Airoha EN7581 mac chip has more hw QoS and buffering capabilities > > > > with respect to the mt7530 switch). > > > > Introduce ndo_setup_tc_conduit callback in order to allow tc to offload > > > > Qdisc policies for the specified DSA user port configuring the hw switch > > > > cpu port (mac chip). > > > > > > Can you please make a detailed diagram explaining how is the conduit > > > involved in the packet data path for QoS? Offloaded tc on a DSA user > > > port is supposed to affect autonomously forwarded traffic too (like the > > > Linux bridge). > > > > I guess a typical use case would be the one below where the traffic from the > > WAN port is forwarded to a DSA LAN one (e.g. lan0) via netfilter flowtable > > offload. > > > > ┌─────────────────────────────────┐ > > │ BR0 │ > > └───┬────────┬────────┬────────┬──┘ > > ┌───────────────┼────────┼────────┼────────┼───┐ > > │DSA │ │ │ │ │ > > │ │ │ │ │ │ > > │ ┌───┐ ┌──▼─┐ ┌──▼─┐ ┌──▼─┐ ┌──▼─┐ │ ┌───┐ > > │ │CPU│ │LAN0│ │LAN1│ │LAN2│ │LAN3│ │ │WAN│ > > │ └───┘ └────┘ └────┘ └────┘ └────┘ │ └───┘ > > └──────────────────────────────────────────────┘ > > > > In this case the mac chip forwards (in hw) the WAN traffic to the DSA switch > > via the CPU port. In [0] we have the EN7581 mac chip architecture where we > > can assume GDM1 is the CPU port and GDM2 is the WAN port. > > The goal of this RFC series is to offload a Qdisc rule (e.g. ETS) on a given > > LAN port using the mac chip QoS capabilities instead of creating the QoS > > discipline directly in the DSA hw switch: > > > > $tc qdisc replace dev lan0 root handle 1: ets bands 8 strict 2 quanta 1514 1514 1514 3528 1514 1514 > > > > As described above the reason for this approach would be to rely on the more > > fine grained QoS capabilities available on the mac chip with respect to the > > hw switch or because the DSA switch does not support QoS offloading. > > > > Regards, > > Lorenzo > > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23020f04932701d5c8363e60756f12b43b8ed752 > > Explain "the mac chip forwards (in hw) the WAN traffic to the DSA switch > via the CPU port". How many packets does airoha_dev_select_queue() see? > All of them, or only the first of a flow? What operations does the > offload consist of? I am referring to the netfilter flowtable offload where the kernel receives just the 3-way handshake of a TCP connection and then the traffic is fully offloaded (the hw receives a flower rule to route the traffic between interfaces applying NAT mangling if requested). Re-thinking about it, I guess it is better to post flowtable support first and then continue the discussion about QoS offloading, what do you think? Regards, Lorenzo
On Thu, Dec 12, 2024 at 06:03:08PM +0100, Lorenzo Bianconi wrote: > > Explain "the mac chip forwards (in hw) the WAN traffic to the DSA switch > > via the CPU port". How many packets does airoha_dev_select_queue() see? > > All of them, or only the first of a flow? What operations does the > > offload consist of? > > I am referring to the netfilter flowtable offload where the kernel receives > just the 3-way handshake of a TCP connection and then the traffic is fully > offloaded (the hw receives a flower rule to route the traffic between > interfaces applying NAT mangling if requested). And how do the follow-up packets know to go to the same conduit queue as the initial packets of the flow? As mentioned, my trouble with your current proposal is that I don't think it reacts adequately to the user space request. Given your command, packets forwarded from lan1 to lan0 should also go through lan0's ETS scheduler, but my understanding is that they won't, because they bypass the conduit. I don't encourage adding new net_device_ops infrastructure to implement unexpected behavior. I'm trying to look at the big picture and abstract away the flowtable a bit. I don't think the tc rule should be on the user port. Can the redirection of packets destined towards a particular switch port be accomplished with a tc u32 filter on the conduit interface instead? If the tc primitives for either the filter or the action don't exist, maybe those could be added instead? Like DSA keys in "flower" which gain introspection into the encapsulated packet headers? > Re-thinking about it, I guess it is better to post flowtable support > first and then continue the discussion about QoS offloading, what do > you think? I don't know about Andrew, but I'm really not familiar with the netfilter flowtable (and there's another series from Eric Woudstra waiting for me to know everything about it). Though, I don't think this can continue for long, we need to find a common starting place for discussions, since the development for chips with flowtable offload is starting to put pressure on DSA. What to read as a starting point for a basic understanding?
> On Thu, Dec 12, 2024 at 06:03:08PM +0100, Lorenzo Bianconi wrote: > > > Explain "the mac chip forwards (in hw) the WAN traffic to the DSA switch > > > via the CPU port". How many packets does airoha_dev_select_queue() see? > > > All of them, or only the first of a flow? What operations does the > > > offload consist of? > > > > I am referring to the netfilter flowtable offload where the kernel receives > > just the 3-way handshake of a TCP connection and then the traffic is fully > > offloaded (the hw receives a flower rule to route the traffic between > > interfaces applying NAT mangling if requested). Hi Vladimir, Sorry for the late reply. > > And how do the follow-up packets know to go to the same conduit queue as > the initial packets of the flow? > > As mentioned, my trouble with your current proposal is that I don't > think it reacts adequately to the user space request. Given your command, > packets forwarded from lan1 to lan0 should also go through lan0's ETS > scheduler, but my understanding is that they won't, because they bypass > the conduit. I don't encourage adding new net_device_ops infrastructure > to implement unexpected behavior. I guess what I did not make clear here is that we are discussing about 'routed' traffic (sorry for that). The traffic is received from the WAN interface and routed to a DSA port (or the other way around). In this scenario the 3-way handshake will be received by the CPU via the WAN port (or the conduit port) while the subsequent packets will be hw forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic will be received by the system from GDM2 (WAN) and the PSE/PPE blocks will forward it to the GDM1 port that is connected to the DSA cpu port. The proposed series is about adding the control path to apply a given Qdisc (ETS or TBF for EN7581) to the traffic that is following the described path without creating it directly on the DSA switch port (for the reasons described before). E.g. the user would want to apply an ETS Qdisc just for traffic egressing via lan0. This series is not strictly related to the airoha_eth flowtable offload implementation but the latter is required to have a full pictures of the possible use case (this is why I was saying it is better to post it first). > > I'm trying to look at the big picture and abstract away the flowtable a > bit. I don't think the tc rule should be on the user port. Can the > redirection of packets destined towards a particular switch port be > accomplished with a tc u32 filter on the conduit interface instead? > If the tc primitives for either the filter or the action don't exist, > maybe those could be added instead? Like DSA keys in "flower" which gain > introspection into the encapsulated packet headers? The issue with the current DSA infrastructure is there is no way to use the conduit port to offload a Qdisc policy to a given lan port since we are missing in the APIs the information about what user port we are interested in (this is why I added the new netdev callback). Please consider here we are discussing about Qdisc policies and not flower rules to mangle the traffic. The hw needs to be configured in advance to apply the requested policy (e.g TBF for traffic shaping). > > > Re-thinking about it, I guess it is better to post flowtable support > > first and then continue the discussion about QoS offloading, what do > > you think? 
> > I don't know about Andrew, but I'm really not familiar with the > netfilter flowtable (and there's another series from Eric Woudstra > waiting for me to know everything about it). > > Though, I don't think this can continue for long, we need to find a > common starting place for discussions, since the development for chips > with flowtable offload is starting to put pressure on DSA. What to read > as a starting point for a basic understanding? I do not think there is much documentation about it (except the source code). I guess you can take a look to [1],[2]. Regards, Lorenzo [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23020f04932701d5c8363e60756f12b43b8ed752 [1] https://docs.kernel.org/networking/nf_flowtable.html [2] https://thermalcircle.de/doku.php?id=blog:linux:flowtables_1_a_netfilter_nftables_fastpath
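For a concrete starting point, the basic ruleset described in [1] boils down to roughly the following (a hedged sketch adapted from that document; 'wan' and 'lan0' are placeholder interface names, and 'flags offload' additionally requires driver support for hardware flowtable offload):

table inet filter {
        # fastpath for established flows between the listed devices
        flowtable ft {
                hook ingress priority 0; devices = { wan, lan0 };
                flags offload;
        }
        chain forward {
                type filter hook forward priority 0; policy accept;
                # once established, add TCP connections to the flowtable
                ip protocol tcp flow add @ft
        }
}

Loaded with 'nft -f', only the initial packets of a connection (e.g. the TCP 3-way handshake) traverse the regular kernel forwarding path; follow-up packets take the flowtable fastpath, which is the behaviour described above.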
On Mon, Dec 16, 2024 at 01:09:01PM +0100, Lorenzo Bianconi wrote: > I guess what I did not make clear here is that we are discussing about > 'routed' traffic (sorry for that). The traffic is received from the WAN > interface and routed to a DSA port (or the other way around). > In this scenario the 3-way handshake will be received by the CPU via the > WAN port (or the conduit port) while the subsequent packets will be hw > forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic > will be received by the system from GDM2 (WAN) and the PSE/PPE blocks > will forward it to the GDM1 port that is connected to the DSA cpu port. > > The proposed series is about adding the control path to apply a given Qdisc > (ETS or TBF for EN7581) to the traffic that is following the described path > without creating it directly on the DSA switch port (for the reasons described > before). E.g. the user would want to apply an ETS Qdisc just for traffic > egressing via lan0. > > This series is not strictly related to the airoha_eth flowtable offload > implementation but the latter is required to have a full pictures of the > possible use case (this is why I was saying it is better to post it first). It's good to know this does not depend on flowtable. When you add an offloaded Qdisc to the egress of a net device, you don't affect just the traffic L3 routed to that device, but all traffic (also includes the packets sent to it using L2 forwarding). As such, I simply don't believe that the way in which the UAPI is interpreted here (root egress qdisc matches only routed traffic) is proper. Ack? > > I'm trying to look at the big picture and abstract away the flowtable a > > bit. I don't think the tc rule should be on the user port. Can the > > redirection of packets destined towards a particular switch port be > > accomplished with a tc u32 filter on the conduit interface instead? > > If the tc primitives for either the filter or the action don't exist, > > maybe those could be added instead? Like DSA keys in "flower" which gain > > introspection into the encapsulated packet headers? > > The issue with the current DSA infrastructure is there is no way to use > the conduit port to offload a Qdisc policy to a given lan port since we > are missing in the APIs the information about what user port we are > interested in (this is why I added the new netdev callback). How does the introduction of ndo_setup_tc_conduit() help, since the problem is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly. It is simply not comparable to the way in which it is offloaded by drivers/net/dsa/microchip/ksz_common.c, even though the user space syntax is the same. Unless you're suggesting that for ksz it is not offloaded correctly? Oleksij, am I missing something? > Please consider here we are discussing about Qdisc policies and not flower > rules to mangle the traffic. What's a Qdisc policy? Also, flower is a classifier, not an action. It doesn't mangle packets by the very definition of what a classifier is. > The hw needs to be configured in advance to apply the requested policy > (e.g TBF for traffic shaping). What are you missing exactly to make DSA packets go to a particular channel on the conduit? For Qdisc offloading you want to configure the NIC in advance, of course. Can't you do something like this to guide packets to the correct channels? tc qdisc add dev eth0 clsact tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ... 
tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \ flowid 1:1
On Mon, Dec 16, 2024 at 05:49:47PM +0200, Vladimir Oltean wrote: > On Mon, Dec 16, 2024 at 01:09:01PM +0100, Lorenzo Bianconi wrote: > > I guess what I did not make clear here is that we are discussing about > > 'routed' traffic (sorry for that). The traffic is received from the WAN > > interface and routed to a DSA port (or the other way around). > > In this scenario the 3-way handshake will be received by the CPU via the > > WAN port (or the conduit port) while the subsequent packets will be hw > > forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic > > will be received by the system from GDM2 (WAN) and the PSE/PPE blocks > > will forward it to the GDM1 port that is connected to the DSA cpu port. > > > > The proposed series is about adding the control path to apply a given Qdisc > > (ETS or TBF for EN7581) to the traffic that is following the described path > > without creating it directly on the DSA switch port (for the reasons described > > before). E.g. the user would want to apply an ETS Qdisc just for traffic > > egressing via lan0. > > > > This series is not strictly related to the airoha_eth flowtable offload > > implementation but the latter is required to have a full pictures of the > > possible use case (this is why I was saying it is better to post it first). > > It's good to know this does not depend on flowtable. > > When you add an offloaded Qdisc to the egress of a net device, you don't > affect just the traffic L3 routed to that device, but all traffic (also > includes the packets sent to it using L2 forwarding). As such, I simply > don't believe that the way in which the UAPI is interpreted here (root > egress qdisc matches only routed traffic) is proper. > > Ack? > > > > I'm trying to look at the big picture and abstract away the flowtable a > > > bit. I don't think the tc rule should be on the user port. Can the > > > redirection of packets destined towards a particular switch port be > > > accomplished with a tc u32 filter on the conduit interface instead? > > > If the tc primitives for either the filter or the action don't exist, > > > maybe those could be added instead? Like DSA keys in "flower" which gain > > > introspection into the encapsulated packet headers? > > > > The issue with the current DSA infrastructure is there is no way to use > > the conduit port to offload a Qdisc policy to a given lan port since we > > are missing in the APIs the information about what user port we are > > interested in (this is why I added the new netdev callback). > > How does the introduction of ndo_setup_tc_conduit() help, since the problem > is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly. > It is simply not comparable to the way in which it is offloaded by > drivers/net/dsa/microchip/ksz_common.c, even though the user space > syntax is the same. Unless you're suggesting that for ksz it is not > offloaded correctly? > > Oleksij, am I missing something? Ahm.. let me try to understand the problem. First I would like to interpret the diagram provide by Lorenzo, since I have no idea about EN7581, I use KSZ* model and transfer it to this one: CPU | +-----------------|---------------------+ EN7581 | GDM1 <-- fabric? ----> GDM2 | like NXP Layerscape? +-------|--------------------------|----+ | | +-------|---------+ WAN MT7530 | CPU_port | | fabric | | lan1 lan2 .... | +--|----|----|||--+ In case of lanN to lanN traffic, switch internal QoS should be used. This is where "tc qdisc add lanN root ets" rules apply. 
At least, this is how it is implemented for KSZ switches. In case of traffic flow from WAN to lanN, we deal with two separate sets of queues and schedulers: GDM1 egress queues and scheduler, (probably ingress queues on CPU_port), lanN queues and scheduler. Combining both parts in to one rule, is not good, especially if rules of each lanN port may be different. > > Please consider here we are discussing about Qdisc policies and not flower > > rules to mangle the traffic. > > What's a Qdisc policy? > > Also, flower is a classifier, not an action. It doesn't mangle packets > by the very definition of what a classifier is. > > > The hw needs to be configured in advance to apply the requested policy > > (e.g TBF for traffic shaping). > > What are you missing exactly to make DSA packets go to a particular > channel on the conduit? > > For Qdisc offloading you want to configure the NIC in advance, of course. > > Can't you do something like this to guide packets to the correct channels? > > tc qdisc add dev eth0 clsact > tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ... > tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \ > flowid 1:1 ACK, this would be my expectation as well.
> On Mon, Dec 16, 2024 at 01:09:01PM +0100, Lorenzo Bianconi wrote: > > I guess what I did not make clear here is that we are discussing about > > 'routed' traffic (sorry for that). The traffic is received from the WAN > > interface and routed to a DSA port (or the other way around). > > In this scenario the 3-way handshake will be received by the CPU via the > > WAN port (or the conduit port) while the subsequent packets will be hw > > forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic > > will be received by the system from GDM2 (WAN) and the PSE/PPE blocks > > will forward it to the GDM1 port that is connected to the DSA cpu port. > > > > The proposed series is about adding the control path to apply a given Qdisc > > (ETS or TBF for EN7581) to the traffic that is following the described path > > without creating it directly on the DSA switch port (for the reasons described > > before). E.g. the user would want to apply an ETS Qdisc just for traffic > > egressing via lan0. > > > > This series is not strictly related to the airoha_eth flowtable offload > > implementation but the latter is required to have a full pictures of the > > possible use case (this is why I was saying it is better to post it first). > > It's good to know this does not depend on flowtable. > > When you add an offloaded Qdisc to the egress of a net device, you don't > affect just the traffic L3 routed to that device, but all traffic (also > includes the packets sent to it using L2 forwarding). As such, I simply > don't believe that the way in which the UAPI is interpreted here (root > egress qdisc matches only routed traffic) is proper. > > Ack? Considering patch [0], we are still offloading the Qdisc on the provided DSA switch port (e.g. LANx) via the port_setup_tc() callback available in dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit() callback in order to use the hw Qdisc capabilities available on the mac chip (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree? > > > > I'm trying to look at the big picture and abstract away the flowtable a > > > bit. I don't think the tc rule should be on the user port. Can the > > > redirection of packets destined towards a particular switch port be > > > accomplished with a tc u32 filter on the conduit interface instead? > > > If the tc primitives for either the filter or the action don't exist, > > > maybe those could be added instead? Like DSA keys in "flower" which gain > > > introspection into the encapsulated packet headers? > > > > The issue with the current DSA infrastructure is there is no way to use > > the conduit port to offload a Qdisc policy to a given lan port since we > > are missing in the APIs the information about what user port we are > > interested in (this is why I added the new netdev callback). > > How does the introduction of ndo_setup_tc_conduit() help, since the problem > is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly. > It is simply not comparable to the way in which it is offloaded by > drivers/net/dsa/microchip/ksz_common.c, even though the user space > syntax is the same. Unless you're suggesting that for ksz it is not > offloaded correctly? nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just does not allow to exploit all hw capabilities available on EN7581 when the traffic is routed from the WAN port to a given DSA switch port. 
If we do: $tc qdisc add dev lan0 root handle 1: ets strict 8 priomap ... in the current upstream implementation we do: dsa_user_setup_tc(): ... -> ds->ops->port_setup_tc(ds, dp->index, type, type_data) (it applies the Qdisc on lan0 configuring the hw switch) adding the ndo_setup_tc_conduit() callback we have: dsa_user_setup_qdisc() ... -> conduit->netdev_ops->ndo_setup_tc_conduit(conduit, dp->index, type, type_data) (it applies the Qdisc on EN7581 mac chip for the routed traffic destinated to lan0) -> ds->ops->port_setup_tc(ds, dp->index, type, type_data) (it applies the Qdisc on lan0 configuring the hw switch) > > Oleksij, am I missing something? > > > Please consider here we are discussing about Qdisc policies and not flower > > rules to mangle the traffic. > > What's a Qdisc policy? I mean a queue scheduler algorithm (e.g. TBF, ETS, HTB, ...) > > Also, flower is a classifier, not an action. It doesn't mangle packets > by the very definition of what a classifier is. yes, but goal of the series is the Queue scheduler offloading, not classifier/action. Agree? > > > The hw needs to be configured in advance to apply the requested policy > > (e.g TBF for traffic shaping). > > What are you missing exactly to make DSA packets go to a particular > channel on the conduit? > > For Qdisc offloading you want to configure the NIC in advance, of course. > > Can't you do something like this to guide packets to the correct channels? > > tc qdisc add dev eth0 clsact > tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ... > tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \ > flowid 1:1 If we apply the Qdisc directly on the conduit port (eth0) we can just apply the queue scheduler on all the traffic egressing via the DSA switch while I would like to apply it on per DSA port basis (but using the mac chip hw capabilities), got my point? Regards, Lorenzo [0] https://patchwork.kernel.org/project/netdevbpf/patch/8e57ae3c4b064403ca843ffa45a5eb4e4198cf80.1733930558.git.lorenzo@kernel.org/
On Mon, Dec 16, 2024 at 08:01:33PM +0100, Lorenzo Bianconi wrote: > > On Mon, Dec 16, 2024 at 01:09:01PM +0100, Lorenzo Bianconi wrote: > > > I guess what I did not make clear here is that we are discussing about > > > 'routed' traffic (sorry for that). The traffic is received from the WAN > > > interface and routed to a DSA port (or the other way around). > > > In this scenario the 3-way handshake will be received by the CPU via the > > > WAN port (or the conduit port) while the subsequent packets will be hw > > > forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic > > > will be received by the system from GDM2 (WAN) and the PSE/PPE blocks > > > will forward it to the GDM1 port that is connected to the DSA cpu port. > > > > > > The proposed series is about adding the control path to apply a given Qdisc > > > (ETS or TBF for EN7581) to the traffic that is following the described path > > > without creating it directly on the DSA switch port (for the reasons described > > > before). E.g. the user would want to apply an ETS Qdisc just for traffic > > > egressing via lan0. > > > > > > This series is not strictly related to the airoha_eth flowtable offload > > > implementation but the latter is required to have a full pictures of the > > > possible use case (this is why I was saying it is better to post it first). > > > > It's good to know this does not depend on flowtable. > > > > When you add an offloaded Qdisc to the egress of a net device, you don't > > affect just the traffic L3 routed to that device, but all traffic (also > > includes the packets sent to it using L2 forwarding). As such, I simply > > don't believe that the way in which the UAPI is interpreted here (root > > egress qdisc matches only routed traffic) is proper. > > > > Ack? > > Considering patch [0], we are still offloading the Qdisc on the provided > DSA switch port (e.g. LANx) via the port_setup_tc() callback available in > dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit() > callback in order to use the hw Qdisc capabilities available on the mac chip > (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply > the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree? > > > > > > > I'm trying to look at the big picture and abstract away the flowtable a > > > > bit. I don't think the tc rule should be on the user port. Can the > > > > redirection of packets destined towards a particular switch port be > > > > accomplished with a tc u32 filter on the conduit interface instead? > > > > If the tc primitives for either the filter or the action don't exist, > > > > maybe those could be added instead? Like DSA keys in "flower" which gain > > > > introspection into the encapsulated packet headers? > > > > > > The issue with the current DSA infrastructure is there is no way to use > > > the conduit port to offload a Qdisc policy to a given lan port since we > > > are missing in the APIs the information about what user port we are > > > interested in (this is why I added the new netdev callback). > > > > How does the introduction of ndo_setup_tc_conduit() help, since the problem > > is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly. > > It is simply not comparable to the way in which it is offloaded by > > drivers/net/dsa/microchip/ksz_common.c, even though the user space > > syntax is the same. Unless you're suggesting that for ksz it is not > > offloaded correctly? 
> > nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just > does not allow to exploit all hw capabilities available on EN7581 when the > traffic is routed from the WAN port to a given DSA switch port. If we do: > > $tc qdisc add dev lan0 root handle 1: ets strict 8 priomap ... > > in the current upstream implementation we do: > dsa_user_setup_tc(): > ... > -> ds->ops->port_setup_tc(ds, dp->index, type, type_data) > (it applies the Qdisc on lan0 configuring the hw switch) > > adding the ndo_setup_tc_conduit() callback we have: > > dsa_user_setup_qdisc() > ... > -> conduit->netdev_ops->ndo_setup_tc_conduit(conduit, dp->index, type, type_data) > (it applies the Qdisc on EN7581 mac chip for the routed traffic destinated to lan0) > > -> ds->ops->port_setup_tc(ds, dp->index, type, type_data) > (it applies the Qdisc on lan0 configuring the hw switch) > > > > > Oleksij, am I missing something? > > > > > Please consider here we are discussing about Qdisc policies and not flower > > > rules to mangle the traffic. > > > > What's a Qdisc policy? > > I mean a queue scheduler algorithm (e.g. TBF, ETS, HTB, ...) > > > > > Also, flower is a classifier, not an action. It doesn't mangle packets > > by the very definition of what a classifier is. > > yes, but goal of the series is the Queue scheduler offloading, not > classifier/action. Agree? > > > > > > The hw needs to be configured in advance to apply the requested policy > > > (e.g TBF for traffic shaping). > > > > What are you missing exactly to make DSA packets go to a particular > > channel on the conduit? > > > > For Qdisc offloading you want to configure the NIC in advance, of course. > > > > Can't you do something like this to guide packets to the correct channels? > > > > tc qdisc add dev eth0 clsact > > tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ... > > tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \ > > flowid 1:1 > > If we apply the Qdisc directly on the conduit port (eth0) we can just apply the > queue scheduler on all the traffic egressing via the DSA switch while I would > like to apply it on per DSA port basis (but using the mac chip hw capabilities), > got my point? Hm, I guess I have similar use case in one of my projects. In my case, the CPU interface is 1Gbit the switch ports are 100Mbit each. It is still possible to keep the CPU interface busy by sending 1Gbit UDP stream, so 900Mbit is dropped by the switch. I would like to have traffic limiter per virtual DSA port on the SoC site to reduce the load on DSA conduit. Currently it was not possible.
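Expressed with the existing tc syntax, that per-port limiter would presumably be a TBF root qdisc on each user port, e.g. (illustrative numbers for 100Mbit ports behind a 1Gbit conduit, assuming the SoC-side offload discussed in this thread existed):

$tc qdisc replace dev lan1 root handle 1: tbf rate 95mbit burst 32kb latency 50ms
$tc qdisc replace dev lan2 root handle 1: tbf rate 95mbit burst 32kb latency 50ms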
On Mon, Dec 16, 2024 at 08:01:33PM +0100, Lorenzo Bianconi wrote: > Considering patch [0], we are still offloading the Qdisc on the provided > DSA switch port (e.g. LANx) via the port_setup_tc() callback available in > dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit() > callback in order to use the hw Qdisc capabilities available on the mac chip > (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply > the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree? Not quite, no. ndo_setup_tc_conduit() does not have the same instruments to offload what port_setup_tc() can offload. It is not involved in all data paths that port_setup_tc() has to handle. Please ack this. So if port_setup_tc() returns -EOPNOTSUPP, the entire dsa_user_setup_qdisc() should return -EOPNOTSUPP, UNLESS you install packet traps on all other offloaded data paths in the switch, such that all packets that egress the DSA user port are handled by ndo_setup_tc_conduit()'s instruments. > > How does the introduction of ndo_setup_tc_conduit() help, since the problem > > is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly. > > It is simply not comparable to the way in which it is offloaded by > > drivers/net/dsa/microchip/ksz_common.c, even though the user space > > syntax is the same. Unless you're suggesting that for ksz it is not > > offloaded correctly? > > nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just > does not allow to exploit all hw capabilities available on EN7581 when the > traffic is routed from the WAN port to a given DSA switch port. And I don't believe it should, in this way. > If we do: > > $tc qdisc add dev lan0 root handle 1: ets strict 8 priomap ... > > in the current upstream implementation we do: > dsa_user_setup_tc(): > ... > -> ds->ops->port_setup_tc(ds, dp->index, type, type_data) > (it applies the Qdisc on lan0 configuring the hw switch) > > adding the ndo_setup_tc_conduit() callback we have: > > dsa_user_setup_qdisc() > ... > -> conduit->netdev_ops->ndo_setup_tc_conduit(conduit, dp->index, type, type_data) > (it applies the Qdisc on EN7581 mac chip for the routed traffic destinated to lan0) > > -> ds->ops->port_setup_tc(ds, dp->index, type, type_data) > (it applies the Qdisc on lan0 configuring the hw switch) > > > Also, flower is a classifier, not an action. It doesn't mangle packets > > by the very definition of what a classifier is. > > yes, but goal of the series is the Queue scheduler offloading, not > classifier/action. Agree? Classifiers + flowid are an instrument to direct packets to classes of a classful egress Qdisc. They seem perfectly relevant to the discussion, given the information I currently have. > > Can't you do something like this to guide packets to the correct channels? > > > > tc qdisc add dev eth0 clsact > > tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ... > > tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \ > > flowid 1:1 > > If we apply the Qdisc directly on the conduit port (eth0) we can just apply the > queue scheduler on all the traffic egressing via the DSA switch while I would > like to apply it on per DSA port basis (but using the mac chip hw capabilities), > got my point? We need something as the root Qdisc of the conduit which exposes its hardware capabilities. I just assumed that would be a simple (and single) ETS, you can correct me if I am wrong. 
On conduit egress, what is the arbitration scheme between the traffic destined towards each DSA user port (channel, as the driver calls them)? How can this be best represented? IIUC, in your patch set, you expose the conduit hardware QoS capabilities as if they can be perfectly virtualized among DSA user ports, and as if each DSA user port can have its own ETS root Qdisc, completely independent of each other, as if the packets do not serialize on the conduit <-> CPU port link, and as if that is not a bottleneck. Is that really the case? If so (but please explain how), maybe you really need your own root Qdisc driver, with one class per DSA user port, and those classes have ETS attached to them.
> On Mon, Dec 16, 2024 at 08:01:33PM +0100, Lorenzo Bianconi wrote: > > > On Mon, Dec 16, 2024 at 01:09:01PM +0100, Lorenzo Bianconi wrote: > > > > I guess what I did not make clear here is that we are discussing about > > > > 'routed' traffic (sorry for that). The traffic is received from the WAN > > > > interface and routed to a DSA port (or the other way around). > > > > In this scenario the 3-way handshake will be received by the CPU via the > > > > WAN port (or the conduit port) while the subsequent packets will be hw > > > > forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic > > > > will be received by the system from GDM2 (WAN) and the PSE/PPE blocks > > > > will forward it to the GDM1 port that is connected to the DSA cpu port. > > > > > > > > The proposed series is about adding the control path to apply a given Qdisc > > > > (ETS or TBF for EN7581) to the traffic that is following the described path > > > > without creating it directly on the DSA switch port (for the reasons described > > > > before). E.g. the user would want to apply an ETS Qdisc just for traffic > > > > egressing via lan0. > > > > > > > > This series is not strictly related to the airoha_eth flowtable offload > > > > implementation but the latter is required to have a full pictures of the > > > > possible use case (this is why I was saying it is better to post it first). > > > > > > It's good to know this does not depend on flowtable. > > > > > > When you add an offloaded Qdisc to the egress of a net device, you don't > > > affect just the traffic L3 routed to that device, but all traffic (also > > > includes the packets sent to it using L2 forwarding). As such, I simply > > > don't believe that the way in which the UAPI is interpreted here (root > > > egress qdisc matches only routed traffic) is proper. > > > > > > Ack? > > > > Considering patch [0], we are still offloading the Qdisc on the provided > > DSA switch port (e.g. LANx) via the port_setup_tc() callback available in > > dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit() > > callback in order to use the hw Qdisc capabilities available on the mac chip > > (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply > > the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree? > > > > > > > > > > I'm trying to look at the big picture and abstract away the flowtable a > > > > > bit. I don't think the tc rule should be on the user port. Can the > > > > > redirection of packets destined towards a particular switch port be > > > > > accomplished with a tc u32 filter on the conduit interface instead? > > > > > If the tc primitives for either the filter or the action don't exist, > > > > > maybe those could be added instead? Like DSA keys in "flower" which gain > > > > > introspection into the encapsulated packet headers? > > > > > > > > The issue with the current DSA infrastructure is there is no way to use > > > > the conduit port to offload a Qdisc policy to a given lan port since we > > > > are missing in the APIs the information about what user port we are > > > > interested in (this is why I added the new netdev callback). > > > > > > How does the introduction of ndo_setup_tc_conduit() help, since the problem > > > is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly. > > > It is simply not comparable to the way in which it is offloaded by > > > drivers/net/dsa/microchip/ksz_common.c, even though the user space > > > syntax is the same. 
Unless you're suggesting that for ksz it is not > > > offloaded correctly? > > > > nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just > > does not allow to exploit all hw capabilities available on EN7581 when the > > traffic is routed from the WAN port to a given DSA switch port. If we do: > > > > $tc qdisc add dev lan0 root handle 1: ets strict 8 priomap ... > > > > in the current upstream implementation we do: > > dsa_user_setup_tc(): > > ... > > -> ds->ops->port_setup_tc(ds, dp->index, type, type_data) > > (it applies the Qdisc on lan0 configuring the hw switch) > > > > adding the ndo_setup_tc_conduit() callback we have: > > > > dsa_user_setup_qdisc() > > ... > > -> conduit->netdev_ops->ndo_setup_tc_conduit(conduit, dp->index, type, type_data) > > (it applies the Qdisc on EN7581 mac chip for the routed traffic destinated to lan0) > > > > -> ds->ops->port_setup_tc(ds, dp->index, type, type_data) > > (it applies the Qdisc on lan0 configuring the hw switch) > > > > > > > > Oleksij, am I missing something? > > > > > > > Please consider here we are discussing about Qdisc policies and not flower > > > > rules to mangle the traffic. > > > > > > What's a Qdisc policy? > > > > I mean a queue scheduler algorithm (e.g. TBF, ETS, HTB, ...) > > > > > > > > Also, flower is a classifier, not an action. It doesn't mangle packets > > > by the very definition of what a classifier is. > > > > yes, but goal of the series is the Queue scheduler offloading, not > > classifier/action. Agree? > > > > > > > > > The hw needs to be configured in advance to apply the requested policy > > > > (e.g TBF for traffic shaping). > > > > > > What are you missing exactly to make DSA packets go to a particular > > > channel on the conduit? > > > > > > For Qdisc offloading you want to configure the NIC in advance, of course. > > > > > > Can't you do something like this to guide packets to the correct channels? > > > > > > tc qdisc add dev eth0 clsact > > > tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ... > > > tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \ > > > flowid 1:1 > > > > If we apply the Qdisc directly on the conduit port (eth0) we can just apply the > > queue scheduler on all the traffic egressing via the DSA switch while I would > > like to apply it on per DSA port basis (but using the mac chip hw capabilities), > > got my point? > > Hm, I guess I have similar use case in one of my projects. In my case, the CPU > interface is 1Gbit the switch ports are 100Mbit each. It is still > possible to keep the CPU interface busy by sending 1Gbit UDP stream, so > 900Mbit is dropped by the switch. I would like to have traffic limiter > per virtual DSA port on the SoC site to reduce the load on DSA conduit. > Currently it was not possible. Does the mac chip in your setup support TX shaping (e.g. via HTB or TBF)? If so, I guess you could do it via the ndo_setup_tc_conduit() callback. Regards, Lorenzo > > -- > Pengutronix e.K. | | > Steuerwalder Str. 21 | http://www.pengutronix.de/ | > 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 | > Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
> On Mon, Dec 16, 2024 at 08:01:33PM +0100, Lorenzo Bianconi wrote: > > Considering patch [0], we are still offloading the Qdisc on the provided > > DSA switch port (e.g. LANx) via the port_setup_tc() callback available in > > dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit() > > callback in order to use the hw Qdisc capabilities available on the mac chip > > (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply > > the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree? > > Not quite, no. > > ndo_setup_tc_conduit() does not have the same instruments to offload > what port_setup_tc() can offload. It is not involved in all data paths > that port_setup_tc() has to handle. Please ack this. So if port_setup_tc() Can you please elaborate on this? Both (->ndo_setup_tc_conduit() and ->port_setup_tc()) refer to the same DSA user port (please take a look the callback signature). > returns -EOPNOTSUPP, the entire dsa_user_setup_qdisc() should return > -EOPNOTSUPP, UNLESS you install packet traps on all other offloaded data > paths in the switch, such that all packets that egress the DSA user port > are handled by ndo_setup_tc_conduit()'s instruments. Uhm, do you mean we are changing the user expected result in this way? It seems to me the only case we are actually changing is if port_setup_tc() callback is not supported by the DSA switch driver while ndo_setup_tc_conduit() one is supported by the mac chip. In this case the previous implementation returns -EOPNOTSUPP while the proposed one does not report any error. Do we really care about this case? If so, I guess we can rework dsa_user_setup_qdisc(). > > > > How does the introduction of ndo_setup_tc_conduit() help, since the problem > > > is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly. > > > It is simply not comparable to the way in which it is offloaded by > > > drivers/net/dsa/microchip/ksz_common.c, even though the user space > > > syntax is the same. Unless you're suggesting that for ksz it is not > > > offloaded correctly? > > > > nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just > > does not allow to exploit all hw capabilities available on EN7581 when the > > traffic is routed from the WAN port to a given DSA switch port. > > And I don't believe it should, in this way. Can you please elaborate on this? IIUC it seems even Oleksij has a use-case for this. > > > If we do: > > > > $tc qdisc add dev lan0 root handle 1: ets strict 8 priomap ... > > > > in the current upstream implementation we do: > > dsa_user_setup_tc(): > > ... > > -> ds->ops->port_setup_tc(ds, dp->index, type, type_data) > > (it applies the Qdisc on lan0 configuring the hw switch) > > > > adding the ndo_setup_tc_conduit() callback we have: > > > > dsa_user_setup_qdisc() > > ... > > -> conduit->netdev_ops->ndo_setup_tc_conduit(conduit, dp->index, type, type_data) > > (it applies the Qdisc on EN7581 mac chip for the routed traffic destinated to lan0) > > > > -> ds->ops->port_setup_tc(ds, dp->index, type, type_data) > > (it applies the Qdisc on lan0 configuring the hw switch) > > > > > Also, flower is a classifier, not an action. It doesn't mangle packets > > > by the very definition of what a classifier is. > > > > yes, but goal of the series is the Queue scheduler offloading, not > > classifier/action. Agree? > > Classifiers + flowid are an instrument to direct packets to classes of a > classful egress Qdisc. 
They seem perfectly relevant to the discussion, > given the information I currently have. yep, sure. We will need a tc classifier to set the flow-id (I used flower during development). What I mean is the series is taking care just of Qdisc offloading. > > > > Can't you do something like this to guide packets to the correct channels? > > > > > > tc qdisc add dev eth0 clsact > > > tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ... > > > tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \ > > > flowid 1:1 > > > > If we apply the Qdisc directly on the conduit port (eth0) we can just apply the > > queue scheduler on all the traffic egressing via the DSA switch while I would > > like to apply it on per DSA port basis (but using the mac chip hw capabilities), > > got my point? > > We need something as the root Qdisc of the conduit which exposes its > hardware capabilities. I just assumed that would be a simple (and single) > ETS, you can correct me if I am wrong. > > On conduit egress, what is the arbitration scheme between the traffic > destined towards each DSA user port (channel, as the driver calls them)? > How can this be best represented? The EN7581 supports up to 32 different 'channels' (each of them support 8 different hw queues). You can define an ETS and/or TBF Qdisc for each channel. My idea is to associate a channel to each DSA switch port, so the user can define independent QoS policies for each DSA ports (e.g. shape at 100Mbps lan0, apply ETS on lan1, ...) configuring the mac chip instead of the hw switch. The kernel (if the traffic is not offloaded) or the PPE block (if the traffic is offloaded) updates the channel and queue information in the DMA descriptor (please take a look to [0] for the first case). > > IIUC, in your patch set, you expose the conduit hardware QoS capabilities > as if they can be perfectly virtualized among DSA user ports, and as if > each DSA user port can have its own ETS root Qdisc, completely independent > of each other, as if the packets do not serialize on the conduit <-> CPU > port link, and as if that is not a bottleneck. Is that really the case? correct > If so (but please explain how), maybe you really need your own root Qdisc > driver, with one class per DSA user port, and those classes have ETS > attached to them. Can you please clarify what do you mean with 'root Qdisc driver'? Regards, Lorenzo [0] https://patchwork.kernel.org/project/netdevbpf/patch/a7d8ec3d70d7a0e2208909189e46a63e769f8f9d.1733930558.git.lorenzo@kernel.org/
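In terms of user-visible commands, the per-channel model described above ("shape at 100Mbps lan0, apply ETS on lan1") would amount to ordinary, independent per-port qdiscs, e.g. (illustrative values):

$tc qdisc replace dev lan0 root handle 1: tbf rate 100mbit burst 32kb latency 50ms
$tc qdisc replace dev lan1 root handle 1: ets bands 8 strict 2 quanta 1514 1514 1514 1514 1514 1514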
On Mon, Dec 16, 2024 at 11:28:18PM +0100, Lorenzo Bianconi wrote: > > ndo_setup_tc_conduit() does not have the same instruments to offload > > what port_setup_tc() can offload. It is not involved in all data paths > > that port_setup_tc() has to handle. Please ack this. So if port_setup_tc() > > Can you please elaborate on this? Both (->ndo_setup_tc_conduit() and > ->port_setup_tc()) refer to the same DSA user port (please take a look the > callback signature). I'd be just repeating what I've said a few times before. Your proposed ndo_setup_tc_conduit() appears to be configuring conduit resources (QDMA, GDM) for mt7530 user port tc offload, as if it is in complete and exclusive control of the user port data path. But as long as there are packets in the user port data path that bypass those conduit QoS resources (like for example mt7530 switch forwards packet from one port to another, bypassing the GDM1 in your drawing[1]), it isn't a good model. Forget about ndo_setup_tc_conduit(), it isn't a good tc command to run in the first place. The tc command you're trying to make to do what you want is supposed to _also_ include the mt7530 packets forwarded from one port to another in its QoS mix. It applies at the _egress_ of the mt7530 port. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23020f04932701d5c8363e60756f12b43b8ed752 Let me try to add some squiggles based on your diagram, to clarify what is my understanding and complaint. ┌───────┐ ┌───────┐ │ QDMA2 │ │ QDMA1 │ └───┬───┘ └───┬───┘ │ │ ┌───────▼───────────────────────────────────────────▼────────┐ │ │ │ P5 P0 │ │ │ │ │ │ │ ┌──────┐ │ P3 ├────► GDM3 │ │ │ ┌─────┐ │ │ │ PPE ◄────┤ P4 PSE │ └─────┘ │ │ │ │ ┌──────┐ │ P9 ├────► GDM4 │ │ │ └──────┘ │ │ │ P2 P1 │ └─────────┬─────────────────────────────────────────┬────────┘ │ │ ┌───▼──┐ ┌──▼───┐ │ GDM2 │ │ GDM1 │ └──────┘ └──┬───┘ │ ┌──────────────▼───────────────┐ │ CPU port │ │ ┌─────────┘ │ │ │ MT7530 │ │ │ │ │ ▼ x │ │ ┌─────┐ ┌─┘ │ │ lan1 lan2 lan3 lan4 │ └───│──────────────────────────┘ ▼ When you add an offloaded Qdisc to the egress of lan1, the expectation is that packets from lan2 obey it too (offloaded tc goes hand in hand with offloaded bridge). Whereas, by using GDM1/QDMA resources, you are breaking that expectation, because packets from lan2 bridged by MT7530 don't go to GDM1 (the "x"). > > returns -EOPNOTSUPP, the entire dsa_user_setup_qdisc() should return > > -EOPNOTSUPP, UNLESS you install packet traps on all other offloaded data > > paths in the switch, such that all packets that egress the DSA user port > > are handled by ndo_setup_tc_conduit()'s instruments. > > Uhm, do you mean we are changing the user expected result in this way? > It seems to me the only case we are actually changing is if port_setup_tc() > callback is not supported by the DSA switch driver while ndo_setup_tc_conduit() > one is supported by the mac chip. In this case the previous implementation > returns -EOPNOTSUPP while the proposed one does not report any error. > Do we really care about this case? If so, I guess we can rework > dsa_user_setup_qdisc(). See above, there's nothing to rework. > > > nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just > > > does not allow to exploit all hw capabilities available on EN7581 when the > > > traffic is routed from the WAN port to a given DSA switch port. > > > > And I don't believe it should, in this way. > > Can you please elaborate on this? IIUC it seems even Oleksij has a use-case > for this. 
See above, I'm also waiting for Oleksij's answer but I don't expect you 2 to be talking about the same thing. If there's some common infrastructure to be shared, my understanding is it has nothing to do with ndo_setup_tc_conduit(). > > We need something as the root Qdisc of the conduit which exposes its > > hardware capabilities. I just assumed that would be a simple (and single) > > ETS, you can correct me if I am wrong. > > > > On conduit egress, what is the arbitration scheme between the traffic > > destined towards each DSA user port (channel, as the driver calls them)? > > How can this be best represented? > > The EN7581 supports up to 32 different 'channels' (each of them support 8 > different hw queues). You can define an ETS and/or TBF Qdisc for each channel. > My idea is to associate a channel to each DSA switch port, so the user can > define independent QoS policies for each DSA ports (e.g. shape at 100Mbps lan0, > apply ETS on lan1, ...) configuring the mac chip instead of the hw switch. > The kernel (if the traffic is not offloaded) or the PPE block (if the traffic > is offloaded) updates the channel and queue information in the DMA descriptor > (please take a look to [0] for the first case). But you call it a MAC chip because between the GDM1 and the MT7530 there's an in-chip Ethernet MAC (GMII netlist), with a fixed packet rate, right? I'm asking again, are the channels completely independent of one another, or are they consuming shared bandwidth in a way that with your proposal is just not visible? If there is a GMII between the GDM1 and the MT7530, how come the bandwidth between the channels is not shared in any way? And if there is no GMII or similar MAC interface, we need to take 100 steps back and discuss why was the DSA model chosen for this switch, and not a freeform switchdev driver where the conduit is not discrete? I'm not sure what to associate these channels with. Would it be wrong to think of each channel as a separate DSA conduit? Because for example there is API to customize the user port <-> conduit assignment. > > IIUC, in your patch set, you expose the conduit hardware QoS capabilities > > as if they can be perfectly virtualized among DSA user ports, and as if > > each DSA user port can have its own ETS root Qdisc, completely independent > > of each other, as if the packets do not serialize on the conduit <-> CPU > > port link, and as if that is not a bottleneck. Is that really the case? > > correct Very interesting but will need more than one word for an explanation :) > > If so (but please explain how), maybe you really need your own root Qdisc > > driver, with one class per DSA user port, and those classes have ETS > > attached to them. > > Can you please clarify what do you mean with 'root Qdisc driver'? Quite literally, an implementer of struct Qdisc_ops whose parent can only be TC_H_ROOT. I was implying you'd have to create an abstract software model of the QoS capabilities of the QDMA and the GDM port such that we all understand them, a netlink attribute scheme for configuring those QoS parameters, and then a QoS offload mechanism through which they are communicated to compatible hardware. But let's leave that aside until it becomes more clear what you have.
> Considering patch [0], we are still offloading the Qdisc on the provided
> DSA switch port (e.g. LANx) via the port_setup_tc() callback available in
> dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit()
> callback in order to use the hw Qdisc capabilities available on the mac chip
> (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply
> the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree?

I've not read all the details, so I could be getting something wrong. But let
me point out the basics. Offloading is used to accelerate what Linux already
supports in software. So forget about your hardware. How would I configure a
bunch of e1000e cards connected to a software bridge to do what you want?

There is no conduit interface in this, so I would not expect to explicitly
configure a conduit interface. Maybe the offloading needs to implicitly
configure the conduit, but that should all be hidden away from the user. But
given the software bridge has no concept of a conduit, I doubt it.

It could well be that our model does not map to the hardware too well, leaving
some bits unusable, but there is not much you can do about that; that is the
Linux model: accelerate what Linux supports in software.

	Andrew
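P.S. To make the "forget about your hardware" exercise concrete, here is
roughly what I would expect the pure software setup to look like, with a
couple of e1000e cards in a software bridge (eth1/eth2 are just placeholder
names and the numbers are arbitrary):

  ip link add name br0 type bridge
  ip link set dev br0 up
  ip link set dev eth1 master br0
  ip link set dev eth2 master br0

  # per-port egress QoS, entirely in software
  tc qdisc replace dev eth1 root handle 1: ets bands 8 strict 2 quanta 1514 1514 1514 1514 1514 1514
  tc qdisc replace dev eth2 root tbf rate 100mbit burst 16kb latency 50ms

Whatever the hardware offload ends up doing, it should hide behind the same
commands, and it should keep applying to bridged traffic as well.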
On Mon, Dec 16, 2024 at 10:44:54PM +0100, Lorenzo Bianconi wrote: > > On Mon, Dec 16, 2024 at 08:01:33PM +0100, Lorenzo Bianconi wrote: > > > > On Mon, Dec 16, 2024 at 01:09:01PM +0100, Lorenzo Bianconi wrote: > > > > > I guess what I did not make clear here is that we are discussing about > > > > > 'routed' traffic (sorry for that). The traffic is received from the WAN > > > > > interface and routed to a DSA port (or the other way around). > > > > > In this scenario the 3-way handshake will be received by the CPU via the > > > > > WAN port (or the conduit port) while the subsequent packets will be hw > > > > > forwarded from WAN to LAN (or LAN to WAN). For EN7581 [0], the traffic > > > > > will be received by the system from GDM2 (WAN) and the PSE/PPE blocks > > > > > will forward it to the GDM1 port that is connected to the DSA cpu port. > > > > > > > > > > The proposed series is about adding the control path to apply a given Qdisc > > > > > (ETS or TBF for EN7581) to the traffic that is following the described path > > > > > without creating it directly on the DSA switch port (for the reasons described > > > > > before). E.g. the user would want to apply an ETS Qdisc just for traffic > > > > > egressing via lan0. > > > > > > > > > > This series is not strictly related to the airoha_eth flowtable offload > > > > > implementation but the latter is required to have a full pictures of the > > > > > possible use case (this is why I was saying it is better to post it first). > > > > > > > > It's good to know this does not depend on flowtable. > > > > > > > > When you add an offloaded Qdisc to the egress of a net device, you don't > > > > affect just the traffic L3 routed to that device, but all traffic (also > > > > includes the packets sent to it using L2 forwarding). As such, I simply > > > > don't believe that the way in which the UAPI is interpreted here (root > > > > egress qdisc matches only routed traffic) is proper. > > > > > > > > Ack? > > > > > > Considering patch [0], we are still offloading the Qdisc on the provided > > > DSA switch port (e.g. LANx) via the port_setup_tc() callback available in > > > dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit() > > > callback in order to use the hw Qdisc capabilities available on the mac chip > > > (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply > > > the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree? > > > > > > > > > > > > > I'm trying to look at the big picture and abstract away the flowtable a > > > > > > bit. I don't think the tc rule should be on the user port. Can the > > > > > > redirection of packets destined towards a particular switch port be > > > > > > accomplished with a tc u32 filter on the conduit interface instead? > > > > > > If the tc primitives for either the filter or the action don't exist, > > > > > > maybe those could be added instead? Like DSA keys in "flower" which gain > > > > > > introspection into the encapsulated packet headers? > > > > > > > > > > The issue with the current DSA infrastructure is there is no way to use > > > > > the conduit port to offload a Qdisc policy to a given lan port since we > > > > > are missing in the APIs the information about what user port we are > > > > > interested in (this is why I added the new netdev callback). > > > > > > > > How does the introduction of ndo_setup_tc_conduit() help, since the problem > > > > is elsewhere? You are not making "tc qdisc add lanN root ets" work correctly. 
> > > > It is simply not comparable to the way in which it is offloaded by > > > > drivers/net/dsa/microchip/ksz_common.c, even though the user space > > > > syntax is the same. Unless you're suggesting that for ksz it is not > > > > offloaded correctly? > > > > > > nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just > > > does not allow to exploit all hw capabilities available on EN7581 when the > > > traffic is routed from the WAN port to a given DSA switch port. If we do: > > > > > > $tc qdisc add dev lan0 root handle 1: ets strict 8 priomap ... > > > > > > in the current upstream implementation we do: > > > dsa_user_setup_tc(): > > > ... > > > -> ds->ops->port_setup_tc(ds, dp->index, type, type_data) > > > (it applies the Qdisc on lan0 configuring the hw switch) > > > > > > adding the ndo_setup_tc_conduit() callback we have: > > > > > > dsa_user_setup_qdisc() > > > ... > > > -> conduit->netdev_ops->ndo_setup_tc_conduit(conduit, dp->index, type, type_data) > > > (it applies the Qdisc on EN7581 mac chip for the routed traffic destinated to lan0) > > > > > > -> ds->ops->port_setup_tc(ds, dp->index, type, type_data) > > > (it applies the Qdisc on lan0 configuring the hw switch) > > > > > > > > > > > Oleksij, am I missing something? > > > > > > > > > Please consider here we are discussing about Qdisc policies and not flower > > > > > rules to mangle the traffic. > > > > > > > > What's a Qdisc policy? > > > > > > I mean a queue scheduler algorithm (e.g. TBF, ETS, HTB, ...) > > > > > > > > > > > Also, flower is a classifier, not an action. It doesn't mangle packets > > > > by the very definition of what a classifier is. > > > > > > yes, but goal of the series is the Queue scheduler offloading, not > > > classifier/action. Agree? > > > > > > > > > > > > The hw needs to be configured in advance to apply the requested policy > > > > > (e.g TBF for traffic shaping). > > > > > > > > What are you missing exactly to make DSA packets go to a particular > > > > channel on the conduit? > > > > > > > > For Qdisc offloading you want to configure the NIC in advance, of course. > > > > > > > > Can't you do something like this to guide packets to the correct channels? > > > > > > > > tc qdisc add dev eth0 clsact > > > > tc qdisc add dev eth0 root handle 1: ets strict 8 priomap ... > > > > tc filter add dev eth0 egress ${u32 or flower filter to match on DSA tagged packets} \ > > > > flowid 1:1 > > > > > > If we apply the Qdisc directly on the conduit port (eth0) we can just apply the > > > queue scheduler on all the traffic egressing via the DSA switch while I would > > > like to apply it on per DSA port basis (but using the mac chip hw capabilities), > > > got my point? > > > > Hm, I guess I have similar use case in one of my projects. In my case, the CPU > > interface is 1Gbit the switch ports are 100Mbit each. It is still > > possible to keep the CPU interface busy by sending 1Gbit UDP stream, so > > 900Mbit is dropped by the switch. I would like to have traffic limiter > > per virtual DSA port on the SoC site to reduce the load on DSA conduit. > > Currently it was not possible. > > Does the mac chip in your setup support TX shaping (e.g. via HTB or TBF)? > If so, I guess you could do it via the ndo_setup_tc_conduit() callback. HW offloading is optional optimization. At first step, it should be able to work without HW offloading.
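To be concrete about the software-only first step (the numbers below are
arbitrary, 95mbit is just something under the 100Mbit line rate of my user
ports), I would expect nothing more exotic than plain per-port shapers:

  tc qdisc replace dev lan0 root tbf rate 95mbit burst 16kb latency 50ms
  tc qdisc replace dev lan1 root tbf rate 95mbit burst 16kb latency 50ms

That should at least keep host-originated traffic towards each user port in
check; a HW offload would then only make the same commands cheaper.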
> On Mon, Dec 16, 2024 at 11:28:18PM +0100, Lorenzo Bianconi wrote: > > > ndo_setup_tc_conduit() does not have the same instruments to offload > > > what port_setup_tc() can offload. It is not involved in all data paths > > > that port_setup_tc() has to handle. Please ack this. So if port_setup_tc() > > > > Can you please elaborate on this? Both (->ndo_setup_tc_conduit() and > > ->port_setup_tc()) refer to the same DSA user port (please take a look the > > callback signature). > > I'd be just repeating what I've said a few times before. Your proposed > ndo_setup_tc_conduit() appears to be configuring conduit resources > (QDMA, GDM) for mt7530 user port tc offload, as if it is in complete and > exclusive control of the user port data path. But as long as there are > packets in the user port data path that bypass those conduit QoS resources > (like for example mt7530 switch forwards packet from one port to another, > bypassing the GDM1 in your drawing[1]), it isn't a good model. Forget > about ndo_setup_tc_conduit(), it isn't a good tc command to run in the > first place. The tc command you're trying to make to do what you want is > supposed to _also_ include the mt7530 packets forwarded from one port to > another in its QoS mix. It applies at the _egress_ of the mt7530 port. > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23020f04932701d5c8363e60756f12b43b8ed752 > > Let me try to add some squiggles based on your diagram, to clarify what > is my understanding and complaint. > > ┌───────┐ ┌───────┐ > │ QDMA2 │ │ QDMA1 │ > └───┬───┘ └───┬───┘ > │ │ > ┌───────▼───────────────────────────────────────────▼────────┐ > │ │ > │ P5 P0 │ > │ │ > │ │ > │ │ ┌──────┐ > │ P3 ├────► GDM3 │ > │ │ > ┌─────┐ │ │ > │ PPE ◄────┤ P4 PSE │ > └─────┘ │ │ > │ │ ┌──────┐ > │ P9 ├────► GDM4 │ > │ │ └──────┘ > │ │ > │ P2 P1 │ > └─────────┬─────────────────────────────────────────┬────────┘ > │ │ > ┌───▼──┐ ┌──▼───┐ > │ GDM2 │ │ GDM1 │ > └──────┘ └──┬───┘ > │ > ┌──────────────▼───────────────┐ > │ CPU port │ > │ ┌─────────┘ │ > │ │ MT7530 │ > │ │ │ > │ ▼ x │ > │ ┌─────┐ ┌─┘ │ > │ lan1 lan2 lan3 lan4 │ > └───│──────────────────────────┘ > ▼ > > When you add an offloaded Qdisc to the egress of lan1, the expectation > is that packets from lan2 obey it too (offloaded tc goes hand in hand > with offloaded bridge). Whereas, by using GDM1/QDMA resources, you are > breaking that expectation, because packets from lan2 bridged by MT7530 > don't go to GDM1 (the "x"). ack, I got your point. I was assuming to cover this case (traffic from lan2 to lan1) maintaining the port_setup_tc() callback in dsa_user_setup_qdisc() (this traffic is not managed by ndo_setup_tc_conduit() callback). If this approach is not ok, I guess we will need to revisit the approach. > > > > returns -EOPNOTSUPP, the entire dsa_user_setup_qdisc() should return > > > -EOPNOTSUPP, UNLESS you install packet traps on all other offloaded data > > > paths in the switch, such that all packets that egress the DSA user port > > > are handled by ndo_setup_tc_conduit()'s instruments. > > > > Uhm, do you mean we are changing the user expected result in this way? > > It seems to me the only case we are actually changing is if port_setup_tc() > > callback is not supported by the DSA switch driver while ndo_setup_tc_conduit() > > one is supported by the mac chip. In this case the previous implementation > > returns -EOPNOTSUPP while the proposed one does not report any error. > > Do we really care about this case? 
If so, I guess we can rework > > dsa_user_setup_qdisc(). > > See above, there's nothing to rework. > > > > > nope, I am not saying the current Qdisc DSA infrastructure is wrong, it just > > > > does not allow to exploit all hw capabilities available on EN7581 when the > > > > traffic is routed from the WAN port to a given DSA switch port. > > > > > > And I don't believe it should, in this way. > > > > Can you please elaborate on this? IIUC it seems even Oleksij has a use-case > > for this. > > See above, I'm also waiting for Oleksij's answer but I don't expect you > 2 to be talking about the same thing. If there's some common infrastructure > to be shared, my understanding is it has nothing to do with ndo_setup_tc_conduit(). ack > > > > We need something as the root Qdisc of the conduit which exposes its > > > hardware capabilities. I just assumed that would be a simple (and single) > > > ETS, you can correct me if I am wrong. > > > > > > On conduit egress, what is the arbitration scheme between the traffic > > > destined towards each DSA user port (channel, as the driver calls them)? > > > How can this be best represented? > > > > The EN7581 supports up to 32 different 'channels' (each of them support 8 > > different hw queues). You can define an ETS and/or TBF Qdisc for each channel. > > My idea is to associate a channel to each DSA switch port, so the user can > > define independent QoS policies for each DSA ports (e.g. shape at 100Mbps lan0, > > apply ETS on lan1, ...) configuring the mac chip instead of the hw switch. > > The kernel (if the traffic is not offloaded) or the PPE block (if the traffic > > is offloaded) updates the channel and queue information in the DMA descriptor > > (please take a look to [0] for the first case). > > But you call it a MAC chip because between the GDM1 and the MT7530 there's > an in-chip Ethernet MAC (GMII netlist), with a fixed packet rate, right? With "mac chip" I mean the set of PSE/PPE and QDMA blocks in the diagram (what is managed by airoha_eth driver). There is no other chip in between of GDM1 and MT7530 switch (sorry for the confusion). > I'm asking again, are the channels completely independent of one another, > or are they consuming shared bandwidth in a way that with your proposal > is just not visible? If there is a GMII between the GDM1 and the MT7530, > how come the bandwidth between the channels is not shared in any way? Channels are logically independent. GDM1 is connected to the MT7530 switch via a fixed speed link (10Gbps, similar to what we have for other MediaTek chipset, like MT7988 [0]). The fixed link speed is higher than the sum of DSA port link speeds (on my development boards I have 4 DSA ports @ 1Gbps); > And if there is no GMII or similar MAC interface, we need to take 100 > steps back and discuss why was the DSA model chosen for this switch, and > not a freeform switchdev driver where the conduit is not discrete? > > I'm not sure what to associate these channels with. Would it be wrong to > think of each channel as a separate DSA conduit? Because for example there > is API to customize the user port <-> conduit assignment. > > > > IIUC, in your patch set, you expose the conduit hardware QoS capabilities > > > as if they can be perfectly virtualized among DSA user ports, and as if > > > each DSA user port can have its own ETS root Qdisc, completely independent > > > of each other, as if the packets do not serialize on the conduit <-> CPU > > > port link, and as if that is not a bottleneck. Is that really the case? 
> >
> > correct
>
> Very interesting but will need more than one word for an explanation :)

I mean that your paragraph above describes the hw architecture well.

> > > If so (but please explain how), maybe you really need your own root Qdisc
> > > driver, with one class per DSA user port, and those classes have ETS
> > > attached to them.
> >
> > Can you please clarify what do you mean with 'root Qdisc driver'?
>
> Quite literally, an implementer of struct Qdisc_ops whose parent can
> only be TC_H_ROOT. I was implying you'd have to create an abstract
> software model of the QoS capabilities of the QDMA and the GDM port such
> that we all understand them, a netlink attribute scheme for configuring
> those QoS parameters, and then a QoS offload mechanism through which
> they are communicated to compatible hardware. But let's leave that aside
> until it becomes more clear what you have.

ack, fine.

Regards,
Lorenzo

[0] https://github.com/openwrt/openwrt/blob/main/target/linux/mediatek/files-6.6/arch/arm64/boot/dts/mediatek/mt7988a.dtsi#L1531
On Tue, Dec 17, 2024 at 10:11:46AM +0100, Lorenzo Bianconi wrote:
> > When you add an offloaded Qdisc to the egress of lan1, the expectation
> > is that packets from lan2 obey it too (offloaded tc goes hand in hand
> > with offloaded bridge). Whereas, by using GDM1/QDMA resources, you are
> > breaking that expectation, because packets from lan2 bridged by MT7530
> > don't go to GDM1 (the "x").
>
> ack, I got your point. I was assuming to cover this case (traffic from lan2 to
> lan1) maintaining the port_setup_tc() callback in dsa_user_setup_qdisc() (this
> traffic is not managed by ndo_setup_tc_conduit() callback). If this approach is
> not ok, I guess we will need to revisit the approach.

To offload QoS on the egress of a DSA user port:

port_setup_tc() is:
(a) necessary
(b) sufficient

ndo_setup_tc_conduit() is:
(a) unnecessary
(b) insufficient

> > But you call it a MAC chip because between the GDM1 and the MT7530 there's
> > an in-chip Ethernet MAC (GMII netlist), with a fixed packet rate, right?
>
> With "mac chip" I mean the set of PSE/PPE and QDMA blocks in the diagram
> (what is managed by airoha_eth driver). There is no other chip in between
> of GDM1 and MT7530 switch (sorry for the confusion).

The MT7530 is also on the same chip as the GDM1, correct?

> > I'm asking again, are the channels completely independent of one another,
> > or are they consuming shared bandwidth in a way that with your proposal
> > is just not visible? If there is a GMII between the GDM1 and the MT7530,
> > how come the bandwidth between the channels is not shared in any way?
>
> Channels are logically independent.

"Logically independent" does not mean "does not share resources", which is
what I asked.

> GDM1 is connected to the MT7530 switch via a fixed speed link (10Gbps, similar
> to what we have for other MediaTek chipset, like MT7988 [0]). The fixed link speed
> is higher than the sum of DSA port link speeds (on my development boards I have
> 4 DSA ports @ 1Gbps);

And this fixed connection is a pair of internal Ethernet MACs, correct?
I see on MT7988 we do have the "pause" property, which would suggest so,
since flow control is a MAC level feature. I assume 10 Gbps in the device
tree means it is an XGMII really limited at that speed, and that speed is
not just for phylink compliance, right?

What if we push your example to the extreme, and say that the DSA user
ports also have 10 Gbps links? How independent are the QDMA channels in
this case? What arbitration algorithm will be used for QoS among user
ports, when the combined bandwidth exceeds the CPU port capacity?
On Tue, Dec 17, 2024 at 12:24:13AM +0100, Andrew Lunn wrote: > > Considering patch [0], we are still offloading the Qdisc on the provided > > DSA switch port (e.g. LANx) via the port_setup_tc() callback available in > > dsa_user_setup_qdisc(), but we are introducing even the ndo_setup_tc_conduit() > > callback in order to use the hw Qdisc capabilities available on the mac chip > > (e.g. EN7581) for the routed traffic from WAN to LANx. We will still apply > > the Qdisc defined on LANx for L2 traffic from LANy to LANx. Agree? > > I've not read all the details, so i could be getting something > wrong. But let me point out the basics. Offloading is used to > accelerate what Linux already supports in software. So forget about > your hardware. How would i configure a bunch of e1000e cards connected > to a software bridge to do what you want? > > There is no conduit interface in this, so i would not expect to > explicitly configure a conduit interface. Maybe the offloading needs > to implicitly configure the conduit, but that should be all hidden > away from the user. But given the software bridge has no concept of a > conduit, i doubt it. > > It could well be our model does not map to the hardware too well, > leaving some bits unusable, but there is not much you can do about > that, that is the Linux model, accelerate what Linux supports in > software. Hi, You are absolutely correct that offloading should accelerate what Linux already supports in software, and we need to respect this model. However, I’d like to step back for a moment to clarify the underlying problem before focusing too much on solutions. ### The Core Problem: Flow Control Limitations 1. **QoS and Flow Control:** At the heart of proper QoS implementation lies flow control. Flow control mechanisms exist at various levels: - MAC-level signaling (e.g., pause frames) - Queue management (e.g., stopping queues when the hardware is congested) The typical Linux driver uses flow control signaling from the MAC (e.g., stopping queues) to coordinate traffic, and depending on the Qdisc, this flow control can propagate up to user space applications. 2. **Challenges with DSA:** In DSA, we lose direct **flow control communication** between: - The host MAC - The MAC of a DSA user port. While internal flow control within the switch may still work, it does not extend to the host. Specifically: - Pause frames often affect **all priorities** and are not granular enough for low-latency applications. - The signaling from the MAC of the DSA user port to the host is either **not supported** or is **disabled** (often through device tree configuration). ### Why This Matters for QoS For traffic flowing **from the host** to DSA user ports: - Without proper flow control, congestion cannot be communicated back to the host, leading to buffer overruns and degraded QoS. - To address this, we need to compensate for the lack of flow control signaling by applying traffic limits (or shaping). ### Approach: Applying Limits on the Conduit Interface One way to solve this is by applying traffic shaping or limits directly on the **conduit MAC**. However, this approach has significant complexity: 1. **Hardware-Specific Details:** We would need deep hardware knowledge to set up traffic filters or disectors at the conduit level. This includes: - Parsing **CPU tags** specific to the switch in use. - Applying port-specific rules, some of which depend on **user port link speed**. 2. 
**Admin Burden:** Forcing network administrators to configure conduit-specific filters manually increases complexity and goes against the existing DSA abstractions, which are already well-integrated into the kernel. ### How Things Can Be Implemented To address QoS for host-to-user port traffic in DSA, I see two possible approaches: #### 1. Apply Rules on the Conduit Port (Using `dst_port`) In this approach, rules are applied to the **conduit interface**, and specific user ports are matched using **port indices**. # Conduit interface tc qdisc add dev conduit0 clsact # Match traffic for user port 1 (e.g., lan0) tc filter add dev conduit0 egress flower dst_port 1 \ action police rate 50mbit burst 5k drop # Match traffic for user port 2 (e.g., lan1) tc filter add dev conduit0 egress flower dst_port 2 \ action police rate 30mbit burst 3k drop #### 2. Apply Rules Directly on the User Ports (With Conduit Marker) In this approach, rules are applied **directly to the user-facing DSA ports** (e.g., `lan0`, `lan1`) with a **conduit-specific marker**. The kernel resolves the mapping internally. # Apply rules with conduit marker for user ports tc qdisc add dev lan0 root tbf rate 50mbit burst 5k conduit-only tc qdisc add dev lan1 root tbf rate 30mbit burst 3k conduit-only Here: - **`conduit-only`**: A marker (flag) indicating that the rule applies specifically to **host-to-port traffic** and not to L2-forwarded traffic within the switch. ### Recommendation The second approach (**user port-based with `conduit-only` marker**) is cleaner and more intuitive. It avoids exposing hardware details like port indices while letting the kernel handle conduit-specific behavior transparently. Best regards, Oleksij
> On Tue, Dec 17, 2024 at 10:11:46AM +0100, Lorenzo Bianconi wrote: > > > When you add an offloaded Qdisc to the egress of lan1, the expectation > > > is that packets from lan2 obey it too (offloaded tc goes hand in hand > > > with offloaded bridge). Whereas, by using GDM1/QDMA resources, you are > > > breaking that expectation, because packets from lan2 bridged by MT7530 > > > don't go to GDM1 (the "x"). > > > > ack, I got your point. I was assuming to cover this case (traffic from lan2 to > > lan1) maintaining the port_setup_tc() callback in dsa_user_setup_qdisc() (this > > traffic is not managed by ndo_setup_tc_conduit() callback). If this approach is > > not ok, I guess we will need to revisit the approach. > > To offload QoS on the egress of a DSA user port: > > port_setup_tc() is: > (a) necessary > (b) sufficient > > ndo_setup_tc_conduit() is: > (a) unnecessary I agree it is unnecessary, but w/o it we must rely on limited QoS capabilities of the hw dsa switch. The goal of the series is just exploit enhanced QoS capabilities available on the EN7581 SoC. > (b) insufficient > > > > But you call it a MAC chip because between the GDM1 and the MT7530 there's > > > an in-chip Ethernet MAC (GMII netlist), with a fixed packet rate, right? > > > > With "mac chip" I mean the set of PSE/PPE and QDMA blocks in the diagram > > (what is managed by airoha_eth driver). There is no other chip in between > > of GDM1 and MT7530 switch (sorry for the confusion). > > The MT7530 is also on the same chip as the GDM1, correct? I think so, but I am not sure. > > > > I'm asking again, are the channels completely independent of one another, > > > or are they consuming shared bandwidth in a way that with your proposal > > > is just not visible? If there is a GMII between the GDM1 and the MT7530, > > > how come the bandwidth between the channels is not shared in any way? > > > > Channels are logically independent. > > "Logically independent" does not mean "does not share resources", which > is what I asked. > > > GDM1 is connected to the MT7530 switch via a fixed speed link (10Gbps, similar > > to what we have for other MediaTek chipset, like MT7988 [0]). The fixed link speed > > is higher than the sum of DSA port link speeds (on my development boards I have > > 4 DSA ports @ 1Gbps); > > And this fixed connection is a pair of internal Ethernet MACs, correct? yes > I see on MT7988 we do have the "pause" property, which would suggest so, > since flow control is a MAC level feature. I assume 10 Gbps in the > device tree means it is an XGMII really limited at that speed, and that > speed is not just for phylink compliance, right? I think so > > What if we push your example to the extreme, and say that the DSA user > ports also have 10 Gbps links? How independent are the QDMA channels in > this case? What arbitration algorithm will be used for QoS among user > ports, when the combined bandwidth exceeds the CPU port capacity? AFIR there is even the possibility to configure inter-channel QoS on the chip, like a Round Robin scheduler or Strict-Priority between channels. Regards, Lorenzo
On Tue, Dec 17, 2024 at 11:01:28AM +0100, Lorenzo Bianconi wrote: > I agree it is unnecessary, but w/o it we must rely on limited QoS capabilities > of the hw dsa switch. The goal of the series is just exploit enhanced QoS > capabilities available on the EN7581 SoC. (...) > AFIR there is even the possibility to configure inter-channel QoS on the chip, > like a Round Robin scheduler or Strict-Priority between channels. So the conclusion of both these statements is that it would be good to understand the QoS algorithm among channels of the MAC chip, see if there is any classful root Qdisc which can describe that algorithm, then offload it with ndo_setup_tc(), then assign traffic to user ports probably using a software classifier + flowid. The ETS Qdiscs which you are trying to add here should not be attached to the root egress Qdisc of DSA user ports, but to the classes of this unknown conduit root Qdisc. Agree?
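In rough tc terms, and with the caveat that such a conduit root Qdisc does not
exist today (the "qdma" qdisc kind and the class layout below are purely
hypothetical), I mean something with this shape:

  # hypothetical classful root Qdisc on the conduit, modelling the
  # arbitration between QDMA channels (RR/SP/whatever the hardware does)
  tc qdisc add dev eth0 root handle 1: qdma

  # one class per channel <-> user port, with ETS offloaded under each class
  tc qdisc add dev eth0 parent 1:1 handle 10: ets bands 8 strict 2 quanta 1514 1514 1514 1514 1514 1514
  tc qdisc add dev eth0 parent 1:2 handle 20: ets bands 8 strict 2 quanta 1514 1514 1514 1514 1514 1514

  # plus one software classifier per user port (u32/flower on the DSA tag,
  # or something new), steering packets with "flowid 1:1", "flowid 1:2", ...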
On Tue, Dec 17, 2024 at 12:17:24PM +0200, Vladimir Oltean wrote:
> On Tue, Dec 17, 2024 at 11:01:28AM +0100, Lorenzo Bianconi wrote:
> > I agree it is unnecessary, but w/o it we must rely on limited QoS capabilities
> > of the hw dsa switch. The goal of the series is just exploit enhanced QoS
> > capabilities available on the EN7581 SoC.
> (...)
> > AFIR there is even the possibility to configure inter-channel QoS on the chip,
> > like a Round Robin scheduler or Strict-Priority between channels.
>
> So the conclusion of both these statements is that it would be good to understand
> the QoS algorithm among channels of the MAC chip, see if there is any classful
> root Qdisc which can describe that algorithm, then offload it with ndo_setup_tc(),
> then assign traffic to user ports probably using a software classifier + flowid.
> The ETS Qdiscs which you are trying to add here should not be attached to the root
> egress Qdisc of DSA user ports, but to the classes of this unknown conduit root
> Qdisc. Agree?

I would love to have some kind of easy-to-use DSA user port classifiers.
On Tue, Dec 17, 2024 at 10:38:21AM +0100, Oleksij Rempel wrote: > Hi, > > You are absolutely correct that offloading should accelerate what Linux already > supports in software, and we need to respect this model. However, I’d like to > step back for a moment to clarify the underlying problem before focusing too > much on solutions. > > ### The Core Problem: Flow Control Limitations > > 1. **QoS and Flow Control:** > > At the heart of proper QoS implementation lies flow control. Flow control > mechanisms exist at various levels: > > - MAC-level signaling (e.g., pause frames) > > - Queue management (e.g., stopping queues when the hardware is congested) > > The typical Linux driver uses flow control signaling from the MAC (e.g., > stopping queues) to coordinate traffic, and depending on the Qdisc, this > flow control can propagate up to user space applications. I read this section as "The Core Problem: Ethernet". > 2. **Challenges with DSA:** > In DSA, we lose direct **flow control communication** between: > > - The host MAC > > - The MAC of a DSA user port. > > While internal flow control within the switch may still work, it does not > extend to the host. Specifically: > > - Pause frames often affect **all priorities** and are not granular enough > for low-latency applications. > > - The signaling from the MAC of the DSA user port to the host is either > **not supported** or is **disabled** (often through device tree > configuration). And this as: "Challenges with DSA: uses Ethernet". I think we can all agree that standard Ethernet, with all the flexibility it gives to pair any discrete DSA switch to any host NIC, does not give us sufficient instruments for independent flow control of each user port. Food for thought: strongly coupled MAC + integrated DSA switch systems, like for example Broadcom, have custom methods of pacing transmission to user ports by selectively stopping conduit TX queues associated with those user ports on congestion: https://lore.kernel.org/netdev/7510c29a-b60f-e0d7-4129-cb90fe376c74@gmail.com/ > ### Why This Matters for QoS > > For traffic flowing **from the host** to DSA user ports: > > - Without proper flow control, congestion cannot be communicated back to the > host, leading to buffer overruns and degraded QoS. There are multiple, and sometimes conflicting, goals to QoS and strategies on congestion. Generally speaking, it is good to clarify that deterministic latency, high throughput and zero loss cannot be all achieved at the same time. It is also good to highlight the fact that you are focusing on zero loss and that this is not necessarily the full picture. Some AVB/TSN switches, like SJA1105, do not support pause frames at all, not even on user ports, because as you say, it's like the nuclear solution which stops the entire port regardless of packet priorities. And even if they did support it, for deterministic latency applications it is best to turn it off. If you make a port enter congestion by bombarding it with TC0 traffic, you'll incur latency to TC7 traffic until you exit the congestion condition. These switches just expect to have reservations very carefully configured by the system administrator. What exceeds reservations and cannot consume shared resources (because they are temporarily depleted) is dropped. > - To address this, we need to compensate for the lack of flow control signaling > by applying traffic limits (or shaping). A splendid idea in theory. 
In practice, the traffic rate at the egress of a user port is the sum of locally injected traffic plus autonomously forwarded traffic. The port can enter congestion even with shaping of CPU-injected traffic at a certain rate. Conduit | v +-------------------------+ | CPU port | | | | | +--------+ | | | | | +<---+ | | | | | | v | | | lan0 lan1 lan2 lan3 | +-------------------------+ | v Just 1Gbps. You _could_ apply this technique to achieve a different purpose than net zero packet loss: selective transmission guarantees for CPU-injected traffic. But you also need to ensure that injected packets have a higher strict priority than the rest, and that the switch resources are configured through devlink-sb to have enough reserved space to keep these high priority packets on congestion and drop something else instead. It's a tool to have for sure, but you need to be extremely specific and realistic about your goals. > ### Approach: Applying Limits on the Conduit Interface > > One way to solve this is by applying traffic shaping or limits directly on the > **conduit MAC**. However, this approach has significant complexity: > > 1. **Hardware-Specific Details:** > > We would need deep hardware knowledge to set up traffic filters or disectors > at the conduit level. This includes: > > - Parsing **CPU tags** specific to the switch in use. > > - Applying port-specific rules, some of which depend on **user port link > speed**. > > 2. **Admin Burden:** > > Forcing network administrators to configure conduit-specific filters > manually increases complexity and goes against the existing DSA abstractions, > which are already well-integrated into the kernel. Agree that there is high complexity. Just need to see a proposal which acknowledges that it's not for nothing. > ### How Things Can Be Implemented > > To address QoS for host-to-user port traffic in DSA, I see two possible > approaches: > > #### 1. Apply Rules on the Conduit Port (Using `dst_port`) > > In this approach, rules are applied to the **conduit interface**, and specific > user ports are matched using **port indices**. > > # Conduit interface > tc qdisc add dev conduit0 clsact > > # Match traffic for user port 1 (e.g., lan0) > tc filter add dev conduit0 egress flower dst_port 1 \ > action police rate 50mbit burst 5k drop > > # Match traffic for user port 2 (e.g., lan1) > tc filter add dev conduit0 egress flower dst_port 2 \ > action police rate 30mbit burst 3k drop Ok, so you propose an abstract key set for DSA in the flower classifier with mappings to concrete packet fields happening in the backend, probably done by the tagging protocol in use. The abstract key set represents the superset of all known DSA fields, united by a common interpretation, and each tagger rejects keys it cannot map to the physical DSA tag. I can immediately think of a challenge here, that we can dynamically change the tagging protocol while tc rules are present, and this can affect which flower keys can be mapped and which cannot. For example, the ocelot tagging protocol could map a virtual DSA key "TX timestamp type" to the REW_OP field, but the ocelot-8021q tagger cannot. Plus, you could add tc filters to a block shared by multiple devices. You can't always infer the physical tagging protocol from the device that the filters are attached to. > #### 2. Apply Rules Directly on the User Ports (With Conduit Marker) > > In this approach, rules are applied **directly to the user-facing DSA ports** > (e.g., `lan0`, `lan1`) with a **conduit-specific marker**. 
The kernel resolves > the mapping internally. > > # Apply rules with conduit marker for user ports > tc qdisc add dev lan0 root tbf rate 50mbit burst 5k conduit-only > tc qdisc add dev lan1 root tbf rate 30mbit burst 3k conduit-only > > Here: > - **`conduit-only`**: A marker (flag) indicating that the rule applies > specifically to **host-to-port traffic** and not to L2-forwarded traffic within > the switch. > > ### Recommendation > > The second approach (**user port-based with `conduit-only` marker**) is cleaner > and more intuitive. It avoids exposing hardware details like port indices while > letting the kernel handle conduit-specific behavior transparently. > > Best regards, > Oleksij The second approach that you recommend suffers from the same problem as Lorenzo's revised proposal, which is that it treats the conduit interface as a collection of independent pipes of infinite capacity to each user port, with no arbitration concerns of its own. The model is again great in theory, but maps really poorly on real life. Your proposal actively encourages users to look away from the scheduling algorithm of the conduit, and just look at user ports in isolation of each other. I strongly disagree with it.
On Tue, Dec 17, 2024 at 01:54:48PM +0200, Vladimir Oltean wrote: > On Tue, Dec 17, 2024 at 10:38:21AM +0100, Oleksij Rempel wrote: > > Hi, > > > > You are absolutely correct that offloading should accelerate what Linux already > > supports in software, and we need to respect this model. However, I’d like to > > step back for a moment to clarify the underlying problem before focusing too > > much on solutions. > > > > ### The Core Problem: Flow Control Limitations > > > > 1. **QoS and Flow Control:** > > > > At the heart of proper QoS implementation lies flow control. Flow control > > mechanisms exist at various levels: > > > > - MAC-level signaling (e.g., pause frames) > > > > - Queue management (e.g., stopping queues when the hardware is congested) > > > > The typical Linux driver uses flow control signaling from the MAC (e.g., > > stopping queues) to coordinate traffic, and depending on the Qdisc, this > > flow control can propagate up to user space applications. > > I read this section as "The Core Problem: Ethernet". ack. > > ### Why This Matters for QoS > > > > For traffic flowing **from the host** to DSA user ports: > > > > - Without proper flow control, congestion cannot be communicated back to the > > host, leading to buffer overruns and degraded QoS. > > There are multiple, and sometimes conflicting, goals to QoS and strategies on > congestion. Generally speaking, it is good to clarify that deterministic latency, > high throughput and zero loss cannot be all achieved at the same time. It is > also good to highlight the fact that you are focusing on zero loss and that > this is not necessarily the full picture. Some AVB/TSN switches, like SJA1105, > do not support pause frames at all, not even on user ports, because as you say, > it's like the nuclear solution which stops the entire port regardless of > packet priorities. And even if they did support it, for deterministic latency > applications it is best to turn it off. If you make a port enter congestion by > bombarding it with TC0 traffic, you'll incur latency to TC7 traffic until you > exit the congestion condition. These switches just expect to have reservations > very carefully configured by the system administrator. What exceeds reservations > and cannot consume shared resources (because they are temporarily depleted) is dropped. > > - To address this, we need to compensate for the lack of flow control signaling > > by applying traffic limits (or shaping). > > A splendid idea in theory. In practice, the traffic rate at the egress > of a user port is the sum of locally injected traffic plus autonomously > forwarded traffic. The port can enter congestion even with shaping of > CPU-injected traffic at a certain rate. > > Conduit > | > v > +-------------------------+ > | CPU port | > | | | > | +--------+ | > | | | > | +<---+ | > | | | | > | v | | > | lan0 lan1 lan2 lan3 | > +-------------------------+ > | > v Just 1Gbps. > > You _could_ apply this technique to achieve a different purpose than > net zero packet loss: selective transmission guarantees for CPU-injected > traffic. But you also need to ensure that injected packets have a higher > strict priority than the rest, and that the switch resources are > configured through devlink-sb to have enough reserved space to keep > these high priority packets on congestion and drop something else instead. > > It's a tool to have for sure, but you need to be extremely specific and > realistic about your goals. Yes, you are right. 
For my specific use case, the switch is used mostly as a port multiplier.

> > #### 2. Apply Rules Directly on the User Ports (With Conduit Marker)
> >
> > In this approach, rules are applied **directly to the user-facing DSA ports**
> > (e.g., `lan0`, `lan1`) with a **conduit-specific marker**. The kernel resolves
> > the mapping internally.
> >
> > # Apply rules with conduit marker for user ports
> > tc qdisc add dev lan0 root tbf rate 50mbit burst 5k conduit-only
> > tc qdisc add dev lan1 root tbf rate 30mbit burst 3k conduit-only
> >
> > Here:
> > - **`conduit-only`**: A marker (flag) indicating that the rule applies
> > specifically to **host-to-port traffic** and not to L2-forwarded traffic within
> > the switch.
> >
> > ### Recommendation
> >
> > The second approach (**user port-based with `conduit-only` marker**) is cleaner
> > and more intuitive. It avoids exposing hardware details like port indices while
> > letting the kernel handle conduit-specific behavior transparently.
> >
> > Best regards,
> > Oleksij
>
> The second approach that you recommend suffers from the same problem as Lorenzo's
> revised proposal, which is that it treats the conduit interface as a collection of
> independent pipes of infinite capacity to each user port, with no arbitration concerns
> of its own. The model is again great in theory, but maps really poorly on real life.
> Your proposal actively encourages users to look away from the scheduling algorithm
> of the conduit, and just look at user ports in isolation of each other. I strongly
> disagree with it.

I'm still thinking about the best way to classify DSA user port traffic.
Will it be enough to set a classid on the user port?

tc filter add dev lan0 protocol all flower skip_hw \
   classid 1:1
tc filter add dev lan1 protocol all flower skip_hw \
   classid 1:2

And then process it on the conduit port:

# Add HTB Qdisc on the conduit interface
tc qdisc add dev conduit0 root handle 1: htb default 1

# Define rate-limiting classes
tc class add dev conduit0 parent 1: classid 1:1 htb rate 100mbit burst 5k
tc class add dev conduit0 parent 1: classid 1:2 htb rate 100mbit burst 5k

Or will the classid not be transferred between devices, so that I'll need to
use something like skbedit?
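If the classid does not survive the hop to the conduit, the fallback I can
think of (completely untested, rates and handles are just examples) would be
to mark packets on the user port egress and rely on HTB classifying by
skb->priority on the conduit:

  # mark host-originated traffic per user port
  tc qdisc add dev lan0 clsact
  tc filter add dev lan0 egress protocol all matchall action skbedit priority 1:1
  tc qdisc add dev lan1 clsact
  tc filter add dev lan1 egress protocol all matchall action skbedit priority 1:2

  # classify by skb->priority on the conduit interface
  tc qdisc add dev conduit0 root handle 1: htb default 1
  tc class add dev conduit0 parent 1: classid 1:1 htb rate 100mbit burst 5k
  tc class add dev conduit0 parent 1: classid 1:2 htb rate 100mbit burst 5k

But I do not know whether skb->priority survives the DSA xmit path unmodified,
so this may have the same problem.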
On Tue, Dec 17, 2024 at 01:22:45PM +0100, Oleksij Rempel wrote: > I'm still thinking about best way to classify DSA user port traffic. > Will it be enough to set classid on user port? > > tc filter add dev lan0 protocol all flower skip_hw \ > action classid 1:1 > tc filter add dev lan1 protocol all flower skip_hw \ > action classid 1:2 > > And then process it on the conduit port: > # Add HTB Qdisc on the conduit interface > tc qdisc add dev conduit0 root handle 1: htb default 1 > > # Define rate-limiting classes > tc class add dev conduit0 parent 1: classid 1:1 htb rate 100mbit burst 5k > tc class add dev conduit0 parent 1: classid 1:2 htb rate 100mbit burst 5k > > Or the classid will not be transferred between devices and i'll need to > use something like skbedit? I don't know, if you find out, let us know.