Message ID | 20201027105117.23052-1-tobias@waldekranz.com (mailing list archive)
Series     | net: dsa: link aggregation support
Hi Tobias, On Tue, Oct 27, 2020 at 11:51:13AM +0100, Tobias Waldekranz wrote: > I would really appreciate feedback on the following: > > All LAG configuration is cached in `struct dsa_lag`s. I realize that > the standard M.O. of DSA is to read back information from hardware > when required. With LAGs this becomes very tricky though. For example, > the change of a link state on one switch will require re-balancing of > LAG hash buckets on another one, which in turn depends on the total > number of active links in the LAG. Do you agree that this is > motivated? I don't really have an issue with that. > The LAG driver ops all receive the LAG netdev as an argument when this > information is already available through the port's lag pointer. This > was done to match the way that the bridge netdev is passed to all VLAN > ops even though it is in the port's bridge_dev. Is there a reason for > this or should I just remove it from the LAG ops? Maybe because on "leave", the bridge/LAG net device pointer inside struct dsa_port is first set to NULL, then the DSA notifier is called? > At least on mv88e6xxx, the exact source port is not available when > packets are received on the CPU. The way I see it, there are two ways > around that problem: > > - Inject the packet directly on the LAG device (what this series > does). Feels right because it matches all that we actually know; the > packet came in on the LAG. It does complicate dsa_switch_rcv > somewhat as we can no longer assume that skb->dev is a DSA port. > > - Inject the packet on "the designated port", i.e. some port in the > LAG. This lets us keep the current Rx path untouched. The problem is > that (a) the port would have to be dynamically updated to match the > expectations of the LAG driver (team/bond) as links are > enabled/disabled and (b) we would be presenting a lie because > packets would appear to ingress on netdevs that they might not in > fact have been physically received on. Since ocelot/felix does not have this restriction, and supports individual port addressing even under a LAG, you can imagine I am not very happy to see the RX data path punishing everyone else that is not mv88e6xxx. > (mv88e6xxx) What is the policy regarding the use of DSA vs. EDSA? It > seems like all chips capable of doing EDSA are using that, except for > the Peridot. I have no documentation whatsoever for mv88e6xxx, but just wondering, what is the benefit brought by EDSA here vs DSA? Does DSA have the same restriction when the ports are in a LAG?
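For context on the ordering Vladimir refers to: this is roughly what the bridge leave path in net/dsa/port.c looked like around that time (simplified from memory; error handling and some notifier fields elided), and it shows why the bridge net device has to be passed along in the notifier info rather than read back from the port.

/* Simplified sketch of dsa_port_bridge_leave(). Note the ordering:
 * dp->bridge_dev is cleared *before* the notifier runs, so drivers can
 * only learn which bridge was left from the notifier info, not from
 * the port itself.
 */
void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br)
{
	struct dsa_notifier_bridge_info info = {
		.sw_index = dp->ds->index,
		.port = dp->index,
		.br = br,
	};

	/* Here the port is already unbridged. Reflect the current
	 * configuration so that drivers can program their chips accordingly.
	 */
	dp->bridge_dev = NULL;

	dsa_port_notify(dp, DSA_NOTIFIER_BRIDGE_LEAVE, &info);
}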
> (mv88e6xxx) What is the policy regarding the use of DSA vs. EDSA? It > seems like all chips capable of doing EDSA are using that, except for > the Peridot. Hi Tobias Marvell removed the ability to use EDSA, in the way we do in Linux DSA, on Peridot. One of the configuration bits is gone. So i had to use DSA. It could be i just don't understand how to configure Peridot, and EDSA could be used, but i never figured it out. I will get back to your other questions. Andrew
> > (mv88e6xxx) What is the policy regarding the use of DSA vs. EDSA? It
> > seems like all chips capable of doing EDSA are using that, except for
> > the Peridot.
>
> I have no documentation whatsoever for mv88e6xxx, but just wondering,
> what is the benefit brought by EDSA here vs DSA? Does DSA have the
> same restriction when the ports are in a LAG?

Hi Vladimir

One advantage of EDSA is that it uses a well-known Ether Type. It was easy for me to add support to tcpdump to spot this Ether Type, decode the EDSA header, and pass the rest of the frame on for further processing as normal. With DSA, you cannot look at the packet, know it is DSA, and then correctly decode it, so tcpdump just shows the packet as undecodable.

Florian fixed this basic problem a while ago, since not being able to decode packets is a problem for all taggers except EDSA. So now there is extra meta information inserted into the pcap file, which gives tcpdump the hint it needs to do the extra decoding of the tagger header. But before that was added, it was much easier to debug EDSA than DSA because of tcpdump decoding.

Andrew
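To make the difference concrete: both formats carry the same 4-byte Marvell tag, and EDSA simply prefixes it with the well-known Ether Type. A simplified sketch of the EDSA transmit side, loosely following net/dsa/tag_edsa.c (not the exact kernel code):

#define DSA_HLEN	4	/* Marvell tag, sits where the EtherType normally is */
#define EDSA_HLEN	8	/* ETH_P_EDSA (0xDADA) + 2 reserved bytes + the 4-byte tag */

static struct sk_buff *edsa_push_header(struct sk_buff *skb)
{
	u8 *edsa;

	if (skb_cow_head(skb, EDSA_HLEN) < 0)
		return NULL;

	/* Open a gap between the MAC addresses and the EtherType... */
	skb_push(skb, EDSA_HLEN);
	memmove(skb->data, skb->data + EDSA_HLEN, 2 * ETH_ALEN);

	/* ...and fill it in. tcpdump can key off the 0xDADA EtherType;
	 * with plain DSA there is nothing at this offset to recognize
	 * the frame by.
	 */
	edsa = skb->data + 2 * ETH_ALEN;
	edsa[0] = (ETH_P_EDSA >> 8) & 0xff;
	edsa[1] = ETH_P_EDSA & 0xff;
	edsa[2] = 0x00;
	edsa[3] = 0x00;
	/* edsa[4..7]: the TO_CPU/FROM_CPU/FORWARD tag that plain DSA
	 * would have written directly at the EtherType offset.
	 */

	return skb;
}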
On Tue, Oct 27, 2020 at 14:27, Vladimir Oltean <olteanv@gmail.com> wrote: >> The LAG driver ops all receive the LAG netdev as an argument when this >> information is already available through the port's lag pointer. This >> was done to match the way that the bridge netdev is passed to all VLAN >> ops even though it is in the port's bridge_dev. Is there a reason for >> this or should I just remove it from the LAG ops? > > Maybe because on "leave", the bridge/LAG net device pointer inside > struct dsa_port is first set to NULL, then the DSA notifier is called? Right, that makes sense. For LAGs I keep ds->lag set until the leave ops have run. But perhaps I should change it to match the VLAN ops? > Since ocelot/felix does not have this restriction, and supports > individual port addressing even under a LAG, you can imagine I am not > very happy to see the RX data path punishing everyone else that is not > mv88e6xxx. I understand that, for sure. Though to be clear, the only penalty in terms of performance is an extra call to dsa_slave_check, which is just a load and compare on skb->dev->netdev_ops. >> (mv88e6xxx) What is the policy regarding the use of DSA vs. EDSA? It >> seems like all chips capable of doing EDSA are using that, except for >> the Peridot. > > I have no documentation whatsoever for mv88e6xxx, but just wondering, > what is the benefit brought by EDSA here vs DSA? Does DSA have the > same restriction when the ports are in a LAG? The same restrictions apply I'm afraid. The only difference is that you prepend a proper ethertype before the tag. The idea (as far as I know) is that you can trap control traffic (TO_CPU in DSA parlance) to the CPU and receive (E)DSA tagged to implement things like STP and LLDP, but you receive the data traffic (FORWARD) untagged or with an 802.1Q tag. This means you can use standard VLAN accelerators on NICs to remove/insert the 1Q tags. In a routing scenario this can bring a significant speed-up as you skip two memcpys per packet to remove and insert the tag.
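The check Tobias mentions is essentially dsa_slave_dev_check() from net/dsa/. A rough illustration of how the RX path would branch (a sketch, not the actual patch; dsa_rx_port_or_null() is a made-up helper):

/* With LAG offload the tagger may set skb->dev to the team/bond upper
 * instead of a DSA slave, so the generic RX path needs one extra check
 * before it touches struct dsa_port. dsa_slave_dev_check() already
 * exists and is just a pointer comparison -- hence "a load and a
 * compare".
 */
static bool dsa_slave_dev_check(const struct net_device *dev)
{
	return dev->netdev_ops == &dsa_slave_netdev_ops;
}

/* Hypothetical helper showing how dsa_switch_rcv() would branch: */
static struct dsa_port *dsa_rx_port_or_null(struct sk_buff *skb)
{
	if (dsa_slave_dev_check(skb->dev))
		return dsa_slave_to_port(skb->dev);

	/* skb->dev is the LAG net device the frame was injected on; the
	 * exact member port is unknown (mv88e6xxx FORWARD frames).
	 */
	return NULL;
}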
When I first read about port trunking in the Peridot documentation, I immediately thought that it could be used to transparently offload what Linux calls bonding... Is this what you want to eventually do?

BTW, I thought about using port trunking to solve the multi-CPU DSA issue as well. On Turris Omnia we have 2 switch ports connected to the CPU. So I could trunk these 2 switch ports, and on the other side create a bonding interface from eth0 and eth1.

Andrew, what do you think about this? Is this something that can be done? Or is it too complicated?

Marek
On Tue, Oct 27, 2020 at 15:00, Andrew Lunn <andrew@lunn.ch> wrote: >> (mv88e6xxx) What is the policy regarding the use of DSA vs. EDSA? It >> seems like all chips capable of doing EDSA are using that, except for >> the Peridot. > > Hi Tobias > > Marvell removed the ability to use EDSA, in the way we do in Linux > DSA, on Peridot. One of the configuration bits is gone. So i had to > use DSA. It could be i just don't understand how to configure > Peridot, and EDSA could be used, but i never figured it out. > > I will get back to your other questions. > > Andrew Hi Andrew, Very interesting! Which bit is that? Looking at the datasheets for Agate and Peridot side-by-side, the sections named "Ether Type DSA Tag" seem to be identical. Thanks, Tobias
On Tue, Oct 27, 2020 at 04:05:30PM +0100, Marek Behun wrote: > When I first read about port trunking in the Peridot documentation, I > immediately thought that this could be used to transparently offload > that which is called Bonding in Linux... > > Is this what you want to eventually do? > > BTW, I thought about using port trunking to solve the multi-CPU DSA > issue as well. On Turris Omnia we have 2 switch ports connected to the > CPU. So I could trunk these 2 swtich ports, and on the other side > create a bonding interface from eth0 and eth1. > > Andrew, what do you think about this? Is this something that can be > done? Or is it too complicated? Hi Marek trunking is something i've looked at once, but never had time to work on. There are three different use cases i thought of: 1) trunk user ports, with team/bonding controlling it 2) trunk DSA ports, i.e. the ports between switches in a D in DSA setup 3) trunk CPU ports. What Tobias is implementing here is 1). This seems like a good first step. I'm not sure 3) is even possible. Or it might depend on the switch generation. The 6352 for example, the CPU Dest field is a port number. It does not appear to allow for a trunk. 6390 moved this register, but as far as i know, it did not add trunk support. It might be possible to have multiple SoC interfaces sending frames to the Switch using DSA tags, but i don't see a way to have the switch send frames to the SoC using multiple ports. Andrew
On Tue, Oct 27, 2020 at 16:23, Andrew Lunn <andrew@lunn.ch> wrote:
> Hi Marek
>
> trunking is something i've looked at once, but never had time to work
> on. There are three different use cases i thought of:
>
> 1) trunk user ports, with team/bonding controlling it
> 2) trunk DSA ports, i.e. the ports between switches in a D in DSA setup
> 3) trunk CPU ports.
>
> What Tobias is implementing here is 1). This seems like a good first
> step.
>
> I'm not sure 3) is even possible. Or it might depend on the switch
> generation. The 6352 for example, the CPU Dest field is a port
> number. It does not appear to allow for a trunk. 6390 moved this
> register, but as far as i know, it did not add trunk support. It
> might be possible to have multiple SoC interfaces sending frames to
> the Switch using DSA tags, but i don't see a way to have the switch
> send frames to the SoC using multiple ports.

I think that (2) and (3) are essentially the same problem, i.e. creating LAGs out of DSA links, be they switch-to-switch or switch-to-CPU connections. I think you are correct that the CPU port can not be a LAG/trunk, but I believe that limitation only applies to TO_CPU packets. If the CPU ports are configured as a LAG, all FORWARDs, i.e. the bulk traffic, would benefit from the load-balancing. Something like this:

    .-----. TO_CPU, FORWARD .-----. TO_CPU, FORWARD .-----.
    |     +-----------------+     +-----------------+     |
    | CPU |                 | sw0 |                 | sw1 |
    |     +-----------------+     +-----------------+     |
    '-----'     FORWARD     '-+-+-'     FORWARD     '-+-+-'
                              | |                     | |
                           swp1 swp2               swp3 swp4

So the links selected as CPU ports will see a marginally higher load, because all TO_CPU traffic is sent over them. But the hashing is not that great on this hardware anyway (DA/SA only), so some imbalance is unavoidable.

In order for this to work on transmit, we need to add forward offloading to the bridge so that we can, for example, send one FORWARD from the CPU to send an ARP broadcast to swp1..4 instead of four FROM_CPUs.

Tobias
> In order for this to work on transmit, we need to add forward offloading > to the bridge so that we can, for example, send one FORWARD from the CPU > to send an ARP broadcast to swp1..4 instead of four FROM_CPUs. Wouldn't this be solved if the CPU master interface was a bonding interface?
On Tue, Oct 27, 2020 at 07:25:16PM +0100, Tobias Waldekranz wrote: > > 1) trunk user ports, with team/bonding controlling it > > 2) trunk DSA ports, i.e. the ports between switches in a D in DSA setup > > 3) trunk CPU ports. [...] > I think that (2) and (3) are essentially the same problem, i.e. creating > LAGs out of DSA links, be they switch-to-switch or switch-to-cpu > connections. I think you are correct that the CPU port can not be a > LAG/trunk, but I believe that limitation only applies to TO_CPU packets. Which would still be ok? They are called "slow protocol PDUs" for a reason. > In order for this to work on transmit, we need to add forward offloading > to the bridge so that we can, for example, send one FORWARD from the CPU > to send an ARP broadcast to swp1..4 instead of four FROM_CPUs. That surely sounds like an interesting (and tough to implement) optimization to increase the throughput, but why would it be _needed_ for things to work? What's wrong with 4 FROM_CPU packets?
On Tue, Oct 27, 2020 at 07:33:37PM +0100, Marek Behun wrote: > > In order for this to work on transmit, we need to add forward offloading > > to the bridge so that we can, for example, send one FORWARD from the CPU > > to send an ARP broadcast to swp1..4 instead of four FROM_CPUs. > > Wouldn't this be solved if the CPU master interface was a bonding interface? I don't see how you would do that. Would DSA keep returning -EPROBE_DEFER until user space decides to set up a bond over the master interfaces? How would you even describe that in device tree?
On Tue, Oct 27, 2020 at 21:04, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Tue, Oct 27, 2020 at 07:33:37PM +0100, Marek Behun wrote:
>> > In order for this to work on transmit, we need to add forward offloading
>> > to the bridge so that we can, for example, send one FORWARD from the CPU
>> > to send an ARP broadcast to swp1..4 instead of four FROM_CPUs.
>>
>> Wouldn't this be solved if the CPU master interface was a bonding interface?
>
> I don't see how you would do that. Would DSA keep returning -EPROBE_DEFER
> until user space decides to set up a bond over the master interfaces?
> How would you even describe that in device tree?

Yeah, that would be very hard indeed. Since this is going to be completely transparent to the user, I think the best way is to just set up the hardware to see the two CPU ports as a LAG whenever you find e.g. "cpu0" and "cpu1", but have no representation of it as a separate netdev. DSA will have an rx_handler attached to both ports anyway, so it can just run the same handler for both. On Tx it can just load-balance in software like team does.
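A minimal sketch of what that software Tx load balancing could look like, assuming a hypothetical helper that the tagger (or dsa_slave_xmit()) would call to pick one of the two masters per flow; none of these names are from the series, they are purely illustrative:

/* Hypothetical: choose which DSA master (CPU port) a FROM_CPU frame
 * leaves through, hashing per flow much like team's hash-based Tx
 * policy does.
 */
static struct net_device *dsa_pick_master(struct dsa_port *cpu0,
					  struct dsa_port *cpu1,
					  struct sk_buff *skb)
{
	u32 hash = skb_get_hash(skb);	/* L2/L3/L4 flow hash */

	return (hash & 1) ? cpu1->master : cpu0->master;
}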
On Tue, Oct 27, 2020 at 21:00, Vladimir Oltean <olteanv@gmail.com> wrote: > On Tue, Oct 27, 2020 at 07:25:16PM +0100, Tobias Waldekranz wrote: >> > 1) trunk user ports, with team/bonding controlling it >> > 2) trunk DSA ports, i.e. the ports between switches in a D in DSA setup >> > 3) trunk CPU ports. > [...] >> I think that (2) and (3) are essentially the same problem, i.e. creating >> LAGs out of DSA links, be they switch-to-switch or switch-to-cpu >> connections. I think you are correct that the CPU port can not be a >> LAG/trunk, but I believe that limitation only applies to TO_CPU packets. > > Which would still be ok? They are called "slow protocol PDUs" for a reason. Oh yes, completely agree. That was the point I was trying to make :) >> In order for this to work on transmit, we need to add forward offloading >> to the bridge so that we can, for example, send one FORWARD from the CPU >> to send an ARP broadcast to swp1..4 instead of four FROM_CPUs. > > That surely sounds like an interesting (and tough to implement) > optimization to increase the throughput, but why would it be _needed_ > for things to work? What's wrong with 4 FROM_CPU packets? We have internal patches that do this, and I can confirm that it is tough :) I really would like to figure out a way to solve this, that would also be acceptable upstream. I have some ideas, it is on my TODO. In a single-chip system I agree that it is not needed, the CPU can do the load-balancing in software. But in order to have the hardware do load-balancing on a switch-to-switch LAG, you need to send a FORWARD. FROM_CPUs would just follow whatever is in the device mapping table. You essentially have the inverse of the TO_CPU problem, but on Tx FROM_CPU would make up 100% of traffic. Other than that there are some things that, while strictly speaking possible to do without FORWARDs, become much easier to deal with: - Multicast routing. This is one case where performance _really_ suffers from having to skb_clone() to each recipient. - Bridging between virtual interfaces and DSA ports. Typical example is an L2 VPN tunnel or one end of a veth pair. On FROM_CPUs, the switch can not perform SA learning, which means that once you bridge traffic from the VPN out to a DSA port, the return traffic will be classified as unknown unicast by the switch and be flooded everywhere.
On Tue, Oct 27, 2020 at 08:37:58PM +0100, Tobias Waldekranz wrote: > >> In order for this to work on transmit, we need to add forward offloading > >> to the bridge so that we can, for example, send one FORWARD from the CPU > >> to send an ARP broadcast to swp1..4 instead of four FROM_CPUs. [...] > In a single-chip system I agree that it is not needed, the CPU can do > the load-balancing in software. But in order to have the hardware do > load-balancing on a switch-to-switch LAG, you need to send a FORWARD. > > FROM_CPUs would just follow whatever is in the device mapping table. You > essentially have the inverse of the TO_CPU problem, but on Tx FROM_CPU > would make up 100% of traffic. Woah, hold on, could you explain in more detail for non-expert people like myself to understand. So FROM_CPU frames (what tag_edsa.c uses now in xmit) can encode a _single_ destination port in the frame header. Whereas the FORWARD frames encode a _source_ port in the frame header. You inject FORWARD frames from the CPU port, and you just let the L2 forwarding process select the adequate destination ports (or LAG, if any ports are under one) _automatically_. The reason why you do this, is because you want to take advantage of the switch's flooding abilities in order to replicate the packet into 4 packets. So you will avoid cloning that packet in the bridge in the first place. But correct me if I'm wrong, sending a FORWARD frame from the CPU is a slippery slope, since you're never sure that the switch will perform the replication exactly as you intended to. The switch will replicate a FORWARD frame by looking up the FDB, and we don't even attempt in DSA to keep the FDB in sync between software and hardware. And that's why we send FROM_CPU frames in tag_edsa.c and not FORWARD frames. What you are really looking for is hardware where the destination field for FROM_CPU packets is not a single port index, but a port mask. Right? Also, this problem is completely orthogonal to LAG? Where does LAG even come into play here? > Other than that there are some things that, while strictly speaking > possible to do without FORWARDs, become much easier to deal with: > > - Multicast routing. This is one case where performance _really_ suffers > from having to skb_clone() to each recipient. > > - Bridging between virtual interfaces and DSA ports. Typical example is > an L2 VPN tunnel or one end of a veth pair. On FROM_CPUs, the switch > can not perform SA learning, which means that once you bridge traffic > from the VPN out to a DSA port, the return traffic will be classified > as unknown unicast by the switch and be flooded everywhere. And how is this going to solve that problem? You mean that the switch learns only from FORWARD, but not from FROM_CPU? Why don't you attempt to solve this more generically somehow? Your switch is not the only one that can't perform source address learning for injected traffic, there are tons more, some are not even DSA. We can't have everybody roll their own solution.
On Tue, Oct 27, 2020 at 22:02, Vladimir Oltean <olteanv@gmail.com> wrote: > On Tue, Oct 27, 2020 at 08:37:58PM +0100, Tobias Waldekranz wrote: >> >> In order for this to work on transmit, we need to add forward offloading >> >> to the bridge so that we can, for example, send one FORWARD from the CPU >> >> to send an ARP broadcast to swp1..4 instead of four FROM_CPUs. > > [...] > >> In a single-chip system I agree that it is not needed, the CPU can do >> the load-balancing in software. But in order to have the hardware do >> load-balancing on a switch-to-switch LAG, you need to send a FORWARD. >> >> FROM_CPUs would just follow whatever is in the device mapping table. You >> essentially have the inverse of the TO_CPU problem, but on Tx FROM_CPU >> would make up 100% of traffic. > > Woah, hold on, could you explain in more detail for non-expert people > like myself to understand. > > So FROM_CPU frames (what tag_edsa.c uses now in xmit) can encode a > _single_ destination port in the frame header. Correct. > Whereas the FORWARD frames encode a _source_ port in the frame header. > You inject FORWARD frames from the CPU port, and you just let the L2 > forwarding process select the adequate destination ports (or LAG, if > any ports are under one) _automatically_. The reason why you do this, is > because you want to take advantage of the switch's flooding abilities in > order to replicate the packet into 4 packets. So you will avoid cloning > that packet in the bridge in the first place. Exactly so. > But correct me if I'm wrong, sending a FORWARD frame from the CPU is a > slippery slope, since you're never sure that the switch will perform the > replication exactly as you intended to. The switch will replicate a > FORWARD frame by looking up the FDB, and we don't even attempt in DSA to > keep the FDB in sync between software and hardware. And that's why we > send FROM_CPU frames in tag_edsa.c and not FORWARD frames. I'm not sure if I agree that it's a slippery slope. The whole point of the switchdev effort is to sync the switch with the bridge. We trust the fabric to do all the steps you describe for _all_ other ports. > What you are really looking for is hardware where the destination field > for FROM_CPU packets is not a single port index, but a port mask. > > Right? Sure, if that's available it's great. Chips from Marvell's Prestera line can do this, and many others I'm sure. Alas, LinkStreet devices can not, and I still want the best performance I can get i that case. > Also, this problem is completely orthogonal to LAG? Where does LAG even > come into play here? It matters if you setup switch-to-switch LAGs. FROM_CPU packets encode the final device/port, and switches will forward those packet according to their device mapping tables, which selects a _single_ local port to use to reach a remote device/port. So all FROM_CPU packets to a given device/port will always travel through the same set of ports. In the FORWARD case, you look up the destination in the FDB of each device, find that it is located on the other side of a LAG, and the hardware will perform load-balancing. >> Other than that there are some things that, while strictly speaking >> possible to do without FORWARDs, become much easier to deal with: >> >> - Multicast routing. This is one case where performance _really_ suffers >> from having to skb_clone() to each recipient. >> >> - Bridging between virtual interfaces and DSA ports. Typical example is >> an L2 VPN tunnel or one end of a veth pair. 
On FROM_CPUs, the switch >> can not perform SA learning, which means that once you bridge traffic >> from the VPN out to a DSA port, the return traffic will be classified >> as unknown unicast by the switch and be flooded everywhere. > > And how is this going to solve that problem? You mean that the switch > learns only from FORWARD, but not from FROM_CPU? Yes, so when you send the FORWARD the switch knows that the station is located somewhere behind the CPU port. It does not know exactly where, i.e. it has no knowledge of the VPN tunnel or anything. It just directs it towards the CPU and the bridge's FDB will take care of the rest. > Why don't you attempt to solve this more generically somehow? Your > switch is not the only one that can't perform source address learning > for injected traffic, there are tons more, some are not even DSA. We > can't have everybody roll their own solution. Who said anything about rolling my solution? I'm going for a generic solution where a netdev can announce to the bridge it is being added to that it can offload forwarding of packets for all ports belonging to the same switchdev device. Most probably modeled after how the macvlan offloading stuff is done. In the case of mv88e6xxx that would kill two birds with one stone - great! In other cases you might have to have the DSA subsystem listen to new neighbors appearing on the bridge and sync those to hardware or something. Hopefully someone working with that kind of hardware can solve that problem.
On Tue, Oct 27, 2020 at 09:53:45PM +0100, Tobias Waldekranz wrote: > So all FROM_CPU packets to a given device/port will always travel > through the same set of ports. Ah, this is the part that didn't click for me. For the simple case where you have a multi-CPU port system: Switch 0 +--------------------------------+ DSA master #1 |CPU port #1 | +------+ +------+ +-------+ | eth0 | ----> | | -----\ ----- | sw0p0 | ------> +------+ +------+ \ / +-------+ | \/ | DSA master #2 |CPU port #2 /\ | +------+ +------+ / \ +-------| | eth1 | ----> | | -----/ \-----| sw0p1 | ------> +------+ +------+ +-------+ | | +--------------------------------+ you can have Linux do load balancing of CPU ports on TX for many streams being delivered to the same egress port (sw0p0). But if you have a cascade: Switch 0 Switch 1 +--------------------------------+ +--------------------------------+ DSA master #1 |CPU port #1 DSA link #1 | |DSA link #1 | +------+ +------+ +-------+ +------+ +-------+ | eth0 | ----> | | -----\ ----- | | ----> | | -----\ ----- | sw1p0 | ----> +------+ +------+ \ / +-------+ +------+ \ / +-------+ | \/ | | \/ | DSA master #2 |CPU port #2 /\ DSA link #2 | |DSA link #2 /\ | +------+ +------+ / \ +-------| +------+ / \ +-------| | eth1 | ----> | | -----/ \-----| | ----> | | -----/ \-----| sw1p1 | ----> +------+ +------+ +-------+ +------+ +-------+ | | | | +--------------------------------+ +--------------------------------+ then you have no good way to spread the same load (many streams all delivered to the same egress port, sw1p0) between DSA link #1 and DSA link #2. DSA link #1 will get congested, while DSA link #2 will remain unused. And this all happens because for FROM_CPU packets, the hardware is configured in mv88e6xxx_devmap_setup to deliver all packets with a non-local switch ID towards the same "routing" port, right? Whereas for FORWARD frames, the destination port for non-local switch ID will not be established based on mv88e6xxx_devmap_setup, but based on FDB lookup of {DMAC, VID}. In the second case above, this is the only way for your hardware that the FDB could select the LAG as the destination based on the FDB. Then, the hash code would be determined from the packet, and the appropriate egress port within the LAG would be selected. So, to answer my own question: Q: What does using FORWARD frames to offload TX flooding from the bridge have to do with a LAG between 2 switches? A: Nothing, they would just both need FORWARD frames to be used. > > Why don't you attempt to solve this more generically somehow? Your > > switch is not the only one that can't perform source address learning > > for injected traffic, there are tons more, some are not even DSA. We > > can't have everybody roll their own solution. > > Who said anything about rolling my solution? I'm going for a generic > solution where a netdev can announce to the bridge it is being added to > that it can offload forwarding of packets for all ports belonging to the > same switchdev device. Most probably modeled after how the macvlan > offloading stuff is done. The fact that I have no idea how the macvlan offloading is done does not really help me, but here, the fact that I understood nothing doesn't appear to stem from that. "a netdev can announce to the bridge it is being added to that it can offload forwarding of packets for all ports belonging to the same switchdev device" What do you mean? skb->offload_fwd_mark? 
Or are you still talking about its TX-side equivalent here, which is what we've been talking about in these past few mails? If so, I'm confused by you calling it "offload forwarding of packets", I was expecting a description more in the lines of "offload flooding of packets coming from host" or something like that. > In the case of mv88e6xxx that would kill two birds with one stone - > great! In other cases you might have to have the DSA subsystem listen to > new neighbors appearing on the bridge and sync those to hardware or > something. Hopefully someone working with that kind of hardware can > solve that problem. If by "neighbors" you mean that you bridge a DSA swp0 with an e1000 eth0, then that is not going to be enough. The CPU port of swp0 will need to learn not eth0's MAC address, but in fact the MAC address of all stations that might be connected to eth0. There might even be a network switch connected to eth0, not just a directly connected link partner. So there are potentially many MAC addresses to be learnt, and all are unknown off-hand. I admit I haven't actually looked at implementing this, but I would expect that what needs to be done is that the local (master) FDB of the bridge (which would get populated on the RX side of the "foreign interface" through software learning) would need to get offloaded in its entirety towards all switchdev ports, via a new switchdev "host FDB" object or something of that kind (where a "host FDB" entry offloaded on a port would mean "see this {DMAC, VID} pair? send it to the CPU"). With your FORWARD frames life-hack you can eschew all of that, good for you. I was just speculatively hoping you might be interested in tackling the hard way. Anyway, this discussion has started mixing up basic stuff (like resolving your source address learning issue on the CPU port, when bridged with a foreign interface) with advanced / optimization stuff (LAG, offload flooding from host), the only commonality appearing to be a need for FORWARD frames. Can you even believe we are still commenting on a series about something as mundane as link aggregation on DSA user ports? At least I can't. I'll go off and start reviewing your patches, before we manage to lose everybody along the way.
Hi Tobias > All LAG configuration is cached in `struct dsa_lag`s. I realize that > the standard M.O. of DSA is to read back information from hardware > when required. With LAGs this becomes very tricky though. For example, > the change of a link state on one switch will require re-balancing of > LAG hash buckets on another one, which in turn depends on the total > number of active links in the LAG. Do you agree that this is > motivated? As you say, DSA tries to be stateless and not allocate any memory. That keeps things simple. If you cannot allocate the needed memory, you need to ensure you leave the system untouched. And that needs to happen all the way up the stack when you have nested devices etc. That is why many APIs have a prepare phase and then a commit phase. The prepare phase allocates all the needed memory, can fail, but does not otherwise touch the running system. The commit phase cannot fail, since it has everything it needs. If you are dynamically allocating dsa_lag structures, at run time, you need to think about this. But the number of LAGs is limited by the number of ports. So i would consider just allocating the worst case number at probe, and KISS for runtime. > At least on mv88e6xxx, the exact source port is not available when > packets are received on the CPU. The way I see it, there are two ways > around that problem: Does that break team/bonding? Do any of the algorithms send packets on specific ports to make sure they are alive? I've not studied how team/bonding works, but it must have a way to determine if a link has failed and it needs to fallover. > (mv88e6xxx) The cross-chip PVT changes required to allow a LAG to > communicate with the other ports do not feel quite right, but I'm > unsure about what the proper way of doing it would be. Any ideas? Vivien implemented all that. I hope he can help you, i've no real idea how that all works. > (mv88e6xxx) Marvell has historically used the idiosyncratic term > "trunk" to refer to link aggregates. Somewhere around the Peridot they > have switched and are now referring to the same registers/tables using > the term "LAG". In this series I've stuck to using LAG for all generic > stuff, and only used trunk for driver-internal functions. Do we want > to rename everything to use the LAG nomenclature? Where possible, i would keep to the datasheet terminology. So any 6352 specific function should use 6352 terminology. Any 6390 specific function should use 6390 terminology. For code which supports a range of generations, we have used the terminology from the first device which had the feature. In practice, this probably means trunk is going to be used most of the time, and LAG in just 6390 code. Often, the glue code in chip.c uses linux stack terminology. Andrew
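Andrew's "allocate the worst case at probe" suggestion amounts to something like the following at tree setup time. The struct layout and the dst->lags/lags_len fields are made up here for illustration; the key point is that a LAG needs at least one member port, so the number of ports is a safe upper bound:

/* Illustrative only: pre-allocate the worst-case number of LAGs at
 * probe time so the runtime join/leave paths never have to allocate
 * memory, and therefore can never fail halfway through.
 */
struct dsa_lag {
	struct net_device *dev;	/* team/bond upper, NULL while unused */
	refcount_t refcount;
	int id;
};

static int dsa_tree_setup_lags(struct dsa_switch_tree *dst, int num_ports)
{
	int i;

	dst->lags = kcalloc(num_ports, sizeof(*dst->lags), GFP_KERNEL);
	if (!dst->lags)
		return -ENOMEM;

	for (i = 0; i < num_ports; i++)
		dst->lags[i].id = i;

	dst->lags_len = num_ports;
	return 0;
}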
On Wed, Oct 28, 2020 at 00:32, Vladimir Oltean <olteanv@gmail.com> wrote: > And this all happens because for FROM_CPU packets, the hardware is > configured in mv88e6xxx_devmap_setup to deliver all packets with a > non-local switch ID towards the same "routing" port, right? Precisely. > Whereas for FORWARD frames, the destination port for non-local switch ID > will not be established based on mv88e6xxx_devmap_setup, but based on > FDB lookup of {DMAC, VID}. In the second case above, this is the only > way for your hardware that the FDB could select the LAG as the > destination based on the FDB. Then, the hash code would be determined > from the packet, and the appropriate egress port within the LAG would be > selected. That's it! > What do you mean? skb->offload_fwd_mark? Or are you still talking about > its TX-side equivalent here, which is what we've been talking about in > these past few mails? If so, I'm confused by you calling it "offload > forwarding of packets", I was expecting a description more in the lines > of "offload flooding of packets coming from host" or something like > that. I'm still talking about the TX-equivalent. I chose my words carefully because it is not _only_ for flooding, although that is the main benefit. If I've understood the basics of macvlan offloading correctly, it uses the ndo_dfwd_add/del_station ops to ask the lower device if it can offload transmissions on behalf of the macvlan device. If the lower is capable, the macvlan code will use dev_queue_xmit_accel to specify that the skb is being forwarded from a "subordinate" device. For a bridge, that would mean "forward this packet to the relevant ports, given the current configuration". This is just one possible approach though. >> In the case of mv88e6xxx that would kill two birds with one stone - >> great! In other cases you might have to have the DSA subsystem listen to >> new neighbors appearing on the bridge and sync those to hardware or >> something. Hopefully someone working with that kind of hardware can >> solve that problem. > > If by "neighbors" you mean that you bridge a DSA swp0 with an e1000 > eth0, then that is not going to be enough. The CPU port of swp0 will > need to learn not eth0's MAC address, but in fact the MAC address of all > stations that might be connected to eth0. There might even be a network > switch connected to eth0, not just a directly connected link partner. > So there are potentially many MAC addresses to be learnt, and all are > unknown off-hand. Yep, hence the "technically possible, but hard" remark I made earlier :) > I admit I haven't actually looked at implementing this, but I would > expect that what needs to be done is that the local (master) FDB of the > bridge (which would get populated on the RX side of the "foreign > interface" through software learning) would need to get offloaded in its > entirety towards all switchdev ports, via a new switchdev "host FDB" > object or something of that kind (where a "host FDB" entry offloaded on > a port would mean "see this {DMAC, VID} pair? send it to the CPU"). > > With your FORWARD frames life-hack you can eschew all of that, good for > you. I was just speculatively hoping you might be interested in tackling > the hard way. Being able to set host FDB entries like we can for host MDB is useful for other things as well, so I might very well be willing to do it. 
> Anyway, this discussion has started mixing up basic stuff (like > resolving your source address learning issue on the CPU port, when > bridged with a foreign interface) with advanced / optimization stuff > (LAG, offload flooding from host), the only commonality appearing to be > a need for FORWARD frames. Can you even believe we are still commenting > on a series about something as mundane as link aggregation on DSA user > ports? At least I can't. I'll go off and start reviewing your patches, > before we manage to lose everybody along the way. Agreed, we went deep down the rabbit hole! This might not have been the most natural place for these discussions, but it was fun nonetheless :)
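For readers unfamiliar with the macvlan offload Tobias used as a model a couple of mails up: a lower device advertises the capability with a pair of ndo hooks, and the upper then transmits via dev_queue_xmit_accel() so the lower driver knows which "subordinate" device an skb belongs to. Heavily simplified, with made-up foo_* helpers; see drivers/net/macvlan.c and the ixgbe L2 forward offload for the real thing:

/* Lower (physical) driver side: offer/withdraw a hardware forwarding
 * context for an upper device. macvlan calls these when an offloadable
 * upper is created/destroyed on top of a capable lower.
 */
static void *foo_dfwd_add_station(struct net_device *pdev,
				  struct net_device *vdev)
{
	return foo_alloc_l2_fwd_context(pdev, vdev);	/* or ERR_PTR() on failure */
}

static void foo_dfwd_del_station(struct net_device *pdev, void *priv)
{
	foo_free_l2_fwd_context(pdev, priv);
}

static const struct net_device_ops foo_netdev_ops = {
	/* ... */
	.ndo_dfwd_add_station	= foo_dfwd_add_station,
	.ndo_dfwd_del_station	= foo_dfwd_del_station,
};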
On Tue, Oct 27, 2020 at 23:36, Andrew Lunn <andrew@lunn.ch> wrote: > If you are dynamically allocating dsa_lag structures, at run time, you > need to think about this. But the number of LAGs is limited by the > number of ports. So i would consider just allocating the worst case > number at probe, and KISS for runtime. Oh OK, yeah that just makes stuff easier so that's absolutely fine. I got the sense that the overall movement within DSA was in the opposite direction. E.g. didn't the dst use to have an array of ds pointers? Whereas now you iterate through dst->ports to find them? >> At least on mv88e6xxx, the exact source port is not available when >> packets are received on the CPU. The way I see it, there are two ways >> around that problem: > > Does that break team/bonding? Do any of the algorithms send packets on > specific ports to make sure they are alive? I've not studied how > team/bonding works, but it must have a way to determine if a link has > failed and it needs to fallover. This limitation only applies to FORWARD packets. TO_CPU packets will still contain device/port. So you have to make sure that the control packets are trapped and not forwarded to the CPU (e.g. by setting the Resvd2CPU bits in Global2). > Where possible, i would keep to the datasheet terminology. So any 6352 > specific function should use 6352 terminology. Any 6390 specific > function should use 6390 terminology. For code which supports a range > of generations, we have used the terminology from the first device > which had the feature. In practice, this probably means trunk is going > to be used most of the time, and LAG in just 6390 code. Often, the > glue code in chip.c uses linux stack terminology. Fair enough, trunking it is then. I don't expect we'll have anything mv88e6xxx specific using the LAG term in that case. From what I can tell, the trunk settings have not changed since at least 6095.
On Wed, Oct 28, 2020 at 01:45:11AM +0100, Tobias Waldekranz wrote: > On Tue, Oct 27, 2020 at 23:36, Andrew Lunn <andrew@lunn.ch> wrote: > > If you are dynamically allocating dsa_lag structures, at run time, you > > need to think about this. But the number of LAGs is limited by the > > number of ports. So i would consider just allocating the worst case > > number at probe, and KISS for runtime. > > Oh OK, yeah that just makes stuff easier so that's absolutely fine. I > got the sense that the overall movement within DSA was in the opposite > direction. E.g. didn't the dst use to have an array of ds pointers? > Whereas now you iterate through dst->ports to find them? Yes, but they are all allocated at probe time. It saved a bit of heap for adding some code. Andrew
On Tue, 27 Oct 2020 19:25:16 +0100 Tobias Waldekranz <tobias@waldekranz.com> wrote:
>     .-----. TO_CPU, FORWARD .-----. TO_CPU, FORWARD .-----.
>     |     +-----------------+     +-----------------+     |
>     | CPU |                 | sw0 |                 | sw1 |
>     |     +-----------------+     +-----------------+     |
>     '-----'     FORWARD     '-+-+-'     FORWARD     '-+-+-'
>                               | |                     | |
>                            swp1 swp2               swp3 swp4
>
> So the links selected as CPU ports will see a marginally higher load,
> because all TO_CPU traffic is sent over them. But the hashing is not
> that great on this hardware anyway (DA/SA only), so some imbalance is
> unavoidable.

The hashing is horrible :(

On Turris Omnia we have 5 user ports and 2 CPU ports, and I suspect that for most of our users there is at most one peer MAC address on the other side of a user port. So if such a user has 5 devices, one behind each user port, there are 5 (DA,SA) pairs, and thus 2^5 = 32 equally likely assignments of (DA,SA) pairs to CPU ports. With probability 2/32 = 6.25% the traffic from all 5 user ports would go via one CPU port, and with probability 10/32 = 31.25% the traffic from 4 user ports would go via one CPU port. That is not good balancing :)

Marek
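Marek's figures check out; a quick brute-force verification (standalone C, nothing kernel-specific):

#include <stdio.h>

int main(void)
{
	int a, split50 = 0, split41 = 0;

	/* 5 (DA,SA) pairs, each hashed onto one of 2 CPU ports:
	 * 2^5 = 32 equally likely assignments.
	 */
	for (a = 0; a < 32; a++) {
		int on_port1 = __builtin_popcount(a);

		if (on_port1 == 0 || on_port1 == 5)
			split50++;	/* all 5 user ports on one CPU port */
		else if (on_port1 == 1 || on_port1 == 4)
			split41++;	/* 4 of the 5 on one CPU port */
	}

	printf("5-0 split: %d/32 = %.2f%%\n", split50, split50 * 100.0 / 32);
	printf("4-1 split: %d/32 = %.2f%%\n", split41, split41 * 100.0 / 32);
	return 0;
}

This prints 2/32 = 6.25% for the 5-0 split and 10/32 = 31.25% for the 4-1 split, matching the numbers above.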
On 10/27/2020 3:51 AM, Tobias Waldekranz wrote: > This series starts by adding the generic support required to offload > link aggregates to drivers built on top of the DSA subsystem. It then > implements offloading for the mv88e6xxx driver, i.e. Marvell's > LinkStreet family. > > Posting this as an RFC as there are some topics that I would like > feedback on before going further with testing. Thus far I've done some > basic tests to verify that: > > - A LAG can be used as a regular interface. > - Bridging a LAG with other DSA ports works as expected. > - Load balancing is done by the hardware, both in single- and > multi-chip systems. > - Load balancing is dynamically reconfigured when the state of > individual links change. > > Testing as been done on two systems: > > 1. Single-chip system with one Peridot (6390X). > 2. Multi-chip system with one Agate (6352) daisy-chained with an Opal > (6097F). > > I would really appreciate feedback on the following: > > All LAG configuration is cached in `struct dsa_lag`s. I realize that > the standard M.O. of DSA is to read back information from hardware > when required. With LAGs this becomes very tricky though. For example, > the change of a link state on one switch will require re-balancing of > LAG hash buckets on another one, which in turn depends on the total > number of active links in the LAG. Do you agree that this is > motivated? Yes, this makes sense, I did something quite similar in this branch nearly 3 years ago, it was tested to the point where the switch was programmed correctly but I had not configured the CPU port to support 2Gbits/sec (doh) to verify that we got 2x the desired throughput: https://github.com/ffainelli/linux/commits/b53-bond Your patch looks definitively more complete. > > The LAG driver ops all receive the LAG netdev as an argument when this > information is already available through the port's lag pointer. This > was done to match the way that the bridge netdev is passed to all VLAN > ops even though it is in the port's bridge_dev. Is there a reason for > this or should I just remove it from the LAG ops? > > At least on mv88e6xxx, the exact source port is not available when > packets are received on the CPU. The way I see it, there are two ways > around that problem: > > - Inject the packet directly on the LAG device (what this series > does). Feels right because it matches all that we actually know; the > packet came in on the LAG. It does complicate dsa_switch_rcv > somewhat as we can no longer assume that skb->dev is a DSA port. > > - Inject the packet on "the designated port", i.e. some port in the > LAG. This lets us keep the current Rx path untouched. The problem is > that (a) the port would have to be dynamically updated to match the > expectations of the LAG driver (team/bond) as links are > enabled/disabled and (b) we would be presenting a lie because > packets would appear to ingress on netdevs that they might not in > fact have been physically received on. > > (mv88e6xxx) What is the policy regarding the use of DSA vs. EDSA? It > seems like all chips capable of doing EDSA are using that, except for > the Peridot. > > (mv88e6xxx) The cross-chip PVT changes required to allow a LAG to > communicate with the other ports do not feel quite right, but I'm > unsure about what the proper way of doing it would be. Any ideas? > > (mv88e6xxx) Marvell has historically used the idiosyncratic term > "trunk" to refer to link aggregates. 
Somewhere around the Peridot they > have switched and are now referring to the same registers/tables using > the term "LAG". In this series I've stuck to using LAG for all generic > stuff, and only used trunk for driver-internal functions. Do we want > to rename everything to use the LAG nomenclature? Yes please!
On Tue, Oct 27, 2020 at 11:51:13AM +0100, Tobias Waldekranz wrote: > This series starts by adding the generic support required to offload > link aggregates to drivers built on top of the DSA subsystem. It then > implements offloading for the mv88e6xxx driver, i.e. Marvell's > LinkStreet family. > > Posting this as an RFC as there are some topics that I would like > feedback on before going further with testing. Thus far I've done some > basic tests to verify that: > > - A LAG can be used as a regular interface. > - Bridging a LAG with other DSA ports works as expected. > - Load balancing is done by the hardware, both in single- and > multi-chip systems. > - Load balancing is dynamically reconfigured when the state of > individual links change. > > Testing as been done on two systems: > > 1. Single-chip system with one Peridot (6390X). > 2. Multi-chip system with one Agate (6352) daisy-chained with an Opal > (6097F). > > I would really appreciate feedback on the following: > > All LAG configuration is cached in `struct dsa_lag`s. I realize that > the standard M.O. of DSA is to read back information from hardware > when required. With LAGs this becomes very tricky though. For example, > the change of a link state on one switch will require re-balancing of > LAG hash buckets on another one, which in turn depends on the total > number of active links in the LAG. Do you agree that this is > motivated? > > The LAG driver ops all receive the LAG netdev as an argument when this > information is already available through the port's lag pointer. This > was done to match the way that the bridge netdev is passed to all VLAN > ops even though it is in the port's bridge_dev. Is there a reason for > this or should I just remove it from the LAG ops? > > At least on mv88e6xxx, the exact source port is not available when > packets are received on the CPU. The way I see it, there are two ways > around that problem: > > - Inject the packet directly on the LAG device (what this series > does). Feels right because it matches all that we actually know; the > packet came in on the LAG. It does complicate dsa_switch_rcv > somewhat as we can no longer assume that skb->dev is a DSA port. > > - Inject the packet on "the designated port", i.e. some port in the > LAG. This lets us keep the current Rx path untouched. The problem is > that (a) the port would have to be dynamically updated to match the > expectations of the LAG driver (team/bond) as links are > enabled/disabled and (b) we would be presenting a lie because > packets would appear to ingress on netdevs that they might not in > fact have been physically received on. > > (mv88e6xxx) What is the policy regarding the use of DSA vs. EDSA? It > seems like all chips capable of doing EDSA are using that, except for > the Peridot. > > (mv88e6xxx) The cross-chip PVT changes required to allow a LAG to > communicate with the other ports do not feel quite right, but I'm > unsure about what the proper way of doing it would be. Any ideas? > > (mv88e6xxx) Marvell has historically used the idiosyncratic term > "trunk" to refer to link aggregates. Somewhere around the Peridot they > have switched and are now referring to the same registers/tables using > the term "LAG". In this series I've stuck to using LAG for all generic > stuff, and only used trunk for driver-internal functions. Do we want > to rename everything to use the LAG nomenclature? I have tested these patches on ocelot/felix and all is OK there, I would appreciate if you could resend as non-RFC. 
In the case of my hardware, it appears that I don't need the .port_lag_change callback, and that the source port that is being put in the DSA header is still the physical port ID and not the logical port ID (the LAG number). That being said, the framework you've built is clearly nice and works well even with bridging a bond.
On Thu Nov 19, 2020 at 1:51 PM CET, Vladimir Oltean wrote:
> I have tested these patches on ocelot/felix and all is OK there, I would
> appreciate if you could resend as non-RFC. In the case of my hardware,

For sure, I am working on it as we speak. I was mostly waiting for the dsa-tag-unification series to make its way to net-next first, as v1 depends on that. But then I remembered that I had to test against the bonding driver (I have used team up to this point), and there is some bug there that I need to squash first.

> it appears that I don't need the .port_lag_change callback, and that the

Ok, does ocelot automatically rebalance the LAG based on link state? I took a quick look through the datasheet for another switch from Vitesse, and it explicitly states that you need to update a table on link changes. I.e. in this situation:

      br0
     /   |
   lag   |
   /|\   |
  1 2 3  4
  | | |   \
  | | |    B
  | | |
  1 2 3
    A

If you unplug cable 1, does the hardware rebalance all flows between A<->B to only use 2 and 3 without software assistance? If not, you will lose 1/3 of your flows.

> source port that is being put in the DSA header is still the physical
> port ID and not the logical port ID (the LAG number). That being said,

Ok, yeah, I really wish this was true for mv88e6xxx as well.

> the framework you've built is clearly nice and works well even with
> bridging a bond.

Thank you!
On Thu, Nov 19, 2020 at 12:52:14PM +0100, Tobias Waldekranz wrote:
> > it appears that I don't need the .port_lag_change callback, and that the
>
> Ok, does ocelot automatically rebalance the LAG based on link state? I
> took a quick look through the datasheet for another switch from
> Vitesse, and it explicitly states that you need to update a table on
> link changes.
>
> I.e. in this situation:
>
>       br0
>      /   |
>    lag   |
>    /|\   |
>   1 2 3  4
>   | | |   \
>   | | |    B
>   | | |
>   1 2 3
>     A
>
> If you unplug cable 1, does the hardware rebalance all flows between
> A<->B to only use 2 and 3 without software assistance? If not, you
> will lose 1/3 of your flows.

Yes, you're right, the switch doesn't rebalance the aggregation codes across the remaining ports automatically. In my mind I was subconsciously hoping that would be the case, because I need to make use of the information in struct dsa_lag in non-DSA code (long story, but the drivers/net/dsa/ocelot code shares the implementation with drivers/net/ethernet/mscc/ocelot*, which is a switchdev-only driver). It doesn't mean that keeping state in dp->lag is the wrong thing to do, it's just that this is an extra challenge for an already odd driver, one that I will have to see how to deal with :D
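The rebalancing discussed here boils down to redistributing the hardware's hash buckets over the LAG members whose links are currently up. Something along these lines, with purely illustrative types and a hypothetical register-write helper (mv88e6xxx, for instance, has eight trunk mask entries to spread over the members):

#define NUM_BUCKETS	8	/* e.g. the eight trunk masks on mv88e6xxx */
#define MAX_MEMBERS	16

struct example_lag {
	int num_ports;
	int port[MAX_MEMBERS];	/* member port numbers */
	bool up[MAX_MEMBERS];	/* link state per member */
};

/* hypothetical: program which member port owns a given hash bucket */
static void example_write_bucket(int bucket, int port);

/* Called from .port_lag_change whenever a member's link state flips:
 * spread the hash buckets round-robin over the members that are up,
 * so that no flow keeps hashing onto a dead link.
 */
static void example_lag_rebalance(struct example_lag *lag)
{
	int active[MAX_MEMBERS];
	int i, n = 0;

	for (i = 0; i < lag->num_ports; i++)
		if (lag->up[i])
			active[n++] = lag->port[i];

	if (!n)
		return;	/* no active links, nothing to map the buckets to */

	for (i = 0; i < NUM_BUCKETS; i++)
		example_write_bucket(i, active[i % n]);
}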