Message ID | 20240402001137.2980589-1-Joseph.Huang@garmin.com (mailing list archive) |
---|---|
Headers | show |
Series | MC Flood disable and snooping | expand |
On 4/2/24 03:10, Joseph Huang wrote: > There is a use case where one would like to enable multicast snooping > on a bridge but disable multicast flooding on all bridge ports so that > registered multicast traffic will only reach the intended recipients and > unregistered multicast traffic will be dropped. However, with existing > bridge ports' mcast_flood flag implementation, it doesn't work as desired. > > This patchset aims to make multicast snooping work even when multicast > flooding is disabled on the bridge ports, without changing the semantic of > the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue. > > Also, in a network where more than one multicast snooping capable bridges > are interconnected without multicast routers being present, multicast > snooping fails if: > > 1. The source is not directly attached to the Querier > 2. The listener is beyond the mrouter port of the bridge where the > source is directly attached > 3. A hardware offloading switch is involved > > When all of the conditions are met, the listener will not receive any > multicast packets from the source. Patches 5 to 10 attempt to address this > issue. Specifically, patches 5 to 8 set up the infrastructure, patch 9 > handles unregistered multicast packets forwarding, and patch 10 handles > registered multicast packets forwarding to the mrouter port. > > The patches were developed against 5.15, and forward-ported to 6.8. > Tests were done on a Pi 4B + Marvell 6393X Eval board with a single > switch chip with no VLAN. > > V1 -> V2: > - Moved the bulk of the change from the bridge to the mv88e6xxx driver. > - Added more patches (specifically 3 and 4) to workaround some more > issues with multicast flooding being disabled. > > v1 here: > https://patchwork.kernel.org/project/netdevbpf/cover/20210504182259.5042-1-Joseph.Huang@garmin.com/ > For the bridge patches: Nacked-by: Nikolay Aleksandrov <razor@blackwall.org> You cannot break the multicast flood flag to add support for a custom use-case. This is unacceptable. The current bridge behaviour is correct your patch 02 doesn't fix anything, you should configure the bridge properly to avoid all those problems, not break protocols. Your special use case can easily be solved by a user-space helper or eBPF and nftables. You can set the mcast flood flag and bypass the bridge for these packets. I basically said the same in 2021, if this is going to be in the bridge it should be hidden behind an option that is default off. But in my opinion adding an option to solve such special cases is undesirable, they can be easily solved with what's currently available.
On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote: > There is a use case where one would like to enable multicast snooping > on a bridge but disable multicast flooding on all bridge ports so that > registered multicast traffic will only reach the intended recipients and > unregistered multicast traffic will be dropped. However, with existing > bridge ports' mcast_flood flag implementation, it doesn't work as desired. > > This patchset aims to make multicast snooping work even when multicast > flooding is disabled on the bridge ports, without changing the semantic of > the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue. > > Also, in a network where more than one multicast snooping capable bridges > are interconnected without multicast routers being present, multicast > snooping fails if: > > 1. The source is not directly attached to the Querier > 2. The listener is beyond the mrouter port of the bridge where the > source is directly attached > 3. A hardware offloading switch is involved I've not studied the details here, but that last point makes me think the offload driver is broken. There should not be any difference between software bridging and hardware bridging. The whole idea is that you take what Linux can do in software and accelerate it by offloading to hardware. Doing acceleration should not change the behaviour. > The patches were developed against 5.15, and forward-ported to 6.8. > Tests were done on a Pi 4B + Marvell 6393X Eval board with a single > switch chip with no VLAN. So what is the mv88e6xxx driver doing wrong? Andrew
Hi Nikolai, On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote: > For the bridge patches: > Nacked-by: Nikolay Aleksandrov <razor@blackwall.org> > > You cannot break the multicast flood flag to add support for a custom > use-case. This is unacceptable. The current bridge behaviour is correct > your patch 02 doesn't fix anything, you should configure the bridge > properly to avoid all those problems, not break protocols. > > Your special use case can easily be solved by a user-space helper or > eBPF and nftables. You can set the mcast flood flag and bypass the > bridge for these packets. I basically said the same in 2021, if this is > going to be in the bridge it should be hidden behind an option that is > default off. But in my opinion adding an option to solve such special > cases is undesirable, they can be easily solved with what's currently > available. I appreciate your time is limited, but could you please translate your suggestion, and detail your proposed alternative a bit, for those of us who are not very familiar with IP multicast snooping? Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't that break snooping? And then do what with the packets, forward them in another software layer than the bridge? I also don't quite understand the suggestion of turning on mcast flooding: isn't Joseph saying that he wants it off for the unregistered multicast data traffic?
On 4/2/24 20:43, Vladimir Oltean wrote: > Hi Nikolai, > > On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote: >> For the bridge patches: >> Nacked-by: Nikolay Aleksandrov <razor@blackwall.org> >> >> You cannot break the multicast flood flag to add support for a custom >> use-case. This is unacceptable. The current bridge behaviour is correct >> your patch 02 doesn't fix anything, you should configure the bridge >> properly to avoid all those problems, not break protocols. >> >> Your special use case can easily be solved by a user-space helper or >> eBPF and nftables. You can set the mcast flood flag and bypass the >> bridge for these packets. I basically said the same in 2021, if this is >> going to be in the bridge it should be hidden behind an option that is >> default off. But in my opinion adding an option to solve such special >> cases is undesirable, they can be easily solved with what's currently >> available. > > I appreciate your time is limited, but could you please translate your > suggestion, and detail your proposed alternative a bit, for those of us > who are not very familiar with IP multicast snooping? > My suggestion is not related to snooping really, but to the goal of patches 01-03. The bridge patches in this set are trying to forward traffic that is not supposed to be forwarded with the proposed configuration, so that can be done by a user-space helper that installs rules to bypass the bridge specifically for those packets while monitoring the bridge state to implement a policy and manage these rules in order to keep snooping working. > Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't > that break snooping? And then do what with the packets, forward them in > another software layer than the bridge? > The ones that are not supposed to be forwarded in the proposed config and are needed for this use case (control traffic and link-local). Obviously to have proper snooping you'd need to manage these bypass rules and use them only while needed. > I also don't quite understand the suggestion of turning on mcast flooding: > isn't Joseph saying that he wants it off for the unregistered multicast > data traffic? Ah my bad, I meant to turn off flooding and bypass the bridge for those packets and ports while necessary, under necessary can be any policy that the user-space helper wants to implement. In any case, if this is going to be yet another kernel solution then it must be a new option that is default off, and doesn't break current mcast flood flag behaviour. In general my opinion is that the whole snooping control must be in user-space and only have the dataplane in the kernel, but that is beyond the scope of this set.
On Tue, Apr 02, 2024 at 09:50:51PM +0300, Nikolay Aleksandrov wrote: > On 4/2/24 20:43, Vladimir Oltean wrote: > > Hi Nikolai, > > > > On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote: > > > For the bridge patches: > > > Nacked-by: Nikolay Aleksandrov <razor@blackwall.org> > > > > > > You cannot break the multicast flood flag to add support for a custom > > > use-case. This is unacceptable. The current bridge behaviour is correct > > > your patch 02 doesn't fix anything, you should configure the bridge > > > properly to avoid all those problems, not break protocols. > > > > > > Your special use case can easily be solved by a user-space helper or > > > eBPF and nftables. You can set the mcast flood flag and bypass the > > > bridge for these packets. I basically said the same in 2021, if this is > > > going to be in the bridge it should be hidden behind an option that is > > > default off. But in my opinion adding an option to solve such special > > > cases is undesirable, they can be easily solved with what's currently > > > available. > > > > I appreciate your time is limited, but could you please translate your > > suggestion, and detail your proposed alternative a bit, for those of us > > who are not very familiar with IP multicast snooping? > > > > My suggestion is not related to snooping really, but to the goal of > patches 01-03. The bridge patches in this set are trying to forward > traffic that is not supposed to be forwarded with the proposed > configuration, Correct up to a point. Reinterpreting the given user space configuration and trying to make it do something else seems like a mistake, but in principle one could also look at alternative bridge configurations like the one I described here: https://lore.kernel.org/netdev/20240402180805.yhhwj2f52sdc4dl2@skbuf/ > so that can be done by a user-space helper that installs > rules to bypass the bridge specifically for those packets while > monitoring the bridge state to implement a policy and manage these rules > in order to keep snooping working. > > > Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't > > that break snooping? And then do what with the packets, forward them in > > another software layer than the bridge? > > > > The ones that are not supposed to be forwarded in the proposed config > and are needed for this use case (control traffic and link-local). Obviously > to have proper snooping you'd need to manage these bypass > rules and use them only while needed. I think Joseph will end up in a situation where he needs IGMP control messages both in the bridge data path and outside of it :) Also, your proposal eliminates the possibility of cooperating with a hardware accelerator which can forward the IGMP messages where they need to go. As far as I understand, I don't think Joseph has a very "special" use case. Disabling flooding of unregistered multicast in the data plane sounds reasonable. There seems to be a gap in the bridge API, in that this operation also affects the control plane, which he is trying to fix with this "force flooding", because of insufficiently fine grained control. > > I also don't quite understand the suggestion of turning on mcast flooding: > > isn't Joseph saying that he wants it off for the unregistered multicast > > data traffic? > > Ah my bad, I meant to turn off flooding and bypass the bridge for those > packets and ports while necessary, under necessary can be any policy > that the user-space helper wants to implement. > > In any case, if this is going to be yet another kernel solution then it > must be a new option that is default off, and doesn't break current mcast > flood flag behaviour. Yeah, maybe something like this, simple and with clear offload semantics, as seen in existing hardware (not Marvell though): mcast_flood == off: - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off") - mcast_ipv4_data_flood: don't care - mcast_ipv6_ctrl_flood: don't care - mcast_ipv6_data_flood: don't care - mcast_l2_flood: don't care mcast_flood == on: - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood - Flood all other IPv4 multicast according to mcast_ipv4_data_flood - Flood ff02::/16 according to mcast_ipv6_ctrl_flood - Flood all other IPv6 multicast according to mcast_ipv6_data_flood - Flood L2 according to mcast_l2_flood
On 4/2/24 23:46, Vladimir Oltean wrote: > On Tue, Apr 02, 2024 at 09:50:51PM +0300, Nikolay Aleksandrov wrote: >> On 4/2/24 20:43, Vladimir Oltean wrote: >>> Hi Nikolai, >>> >>> On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote: >>>> For the bridge patches: >>>> Nacked-by: Nikolay Aleksandrov <razor@blackwall.org> >>>> >>>> You cannot break the multicast flood flag to add support for a custom >>>> use-case. This is unacceptable. The current bridge behaviour is correct >>>> your patch 02 doesn't fix anything, you should configure the bridge >>>> properly to avoid all those problems, not break protocols. >>>> >>>> Your special use case can easily be solved by a user-space helper or >>>> eBPF and nftables. You can set the mcast flood flag and bypass the >>>> bridge for these packets. I basically said the same in 2021, if this is >>>> going to be in the bridge it should be hidden behind an option that is >>>> default off. But in my opinion adding an option to solve such special >>>> cases is undesirable, they can be easily solved with what's currently >>>> available. >>> >>> I appreciate your time is limited, but could you please translate your >>> suggestion, and detail your proposed alternative a bit, for those of us >>> who are not very familiar with IP multicast snooping? >>> >> >> My suggestion is not related to snooping really, but to the goal of >> patches 01-03. The bridge patches in this set are trying to forward >> traffic that is not supposed to be forwarded with the proposed >> configuration, > > Correct up to a point. Reinterpreting the given user space configuration > and trying to make it do something else seems like a mistake, but in > principle one could also look at alternative bridge configurations like > the one I described here: > https://lore.kernel.org/netdev/20240402180805.yhhwj2f52sdc4dl2@skbuf/ > >> so that can be done by a user-space helper that installs >> rules to bypass the bridge specifically for those packets while >> monitoring the bridge state to implement a policy and manage these rules >> in order to keep snooping working. >> >>> Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't >>> that break snooping? And then do what with the packets, forward them in >>> another software layer than the bridge? >>> >> >> The ones that are not supposed to be forwarded in the proposed config >> and are needed for this use case (control traffic and link-local). Obviously >> to have proper snooping you'd need to manage these bypass >> rules and use them only while needed. > > I think Joseph will end up in a situation where he needs IGMP control > messages both in the bridge data path and outside of it :) > My solution does not exclude such scenario. With all unregistered mcast disabled it will be handled the same as with this proposed solution. > Also, your proposal eliminates the possibility of cooperating with a > hardware accelerator which can forward the IGMP messages where they need > to go. > > As far as I understand, I don't think Joseph has a very "special" use case. > Disabling flooding of unregistered multicast in the data plane sounds > reasonable. There seems to be a gap in the bridge API, in that this This we already have, but.. > operation also affects the control plane, which he is trying to fix with > this "force flooding", because of insufficiently fine grained control. > Right, this is the part that is more special, I'm not saying it's unreasonable. The proposition to make it optional and break it down to type of traffic sounds good to me. >>> I also don't quite understand the suggestion of turning on mcast flooding: >>> isn't Joseph saying that he wants it off for the unregistered multicast >>> data traffic? >> >> Ah my bad, I meant to turn off flooding and bypass the bridge for those >> packets and ports while necessary, under necessary can be any policy >> that the user-space helper wants to implement. >> >> In any case, if this is going to be yet another kernel solution then it >> must be a new option that is default off, and doesn't break current mcast >> flood flag behaviour. > > Yeah, maybe something like this, simple and with clear offload > semantics, as seen in existing hardware (not Marvell though): > > mcast_flood == off: > - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off") > - mcast_ipv4_data_flood: don't care > - mcast_ipv6_ctrl_flood: don't care > - mcast_ipv6_data_flood: don't care > - mcast_l2_flood: don't care > mcast_flood == on: > - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood > - Flood all other IPv4 multicast according to mcast_ipv4_data_flood > - Flood ff02::/16 according to mcast_ipv6_ctrl_flood > - Flood all other IPv6 multicast according to mcast_ipv6_data_flood > - Flood L2 according to mcast_l2_flood Yep, sounds good to me. I was thinking about something in these lines as well if doing a kernel solution in order to make it simpler and more generic. The ctrl flood bits need to be handled more carefully to make sure they match only control traffic and not link-local data. I think the old option can be converted to use this fine-grained control, that is if anyone sets the old flood on/off then the flood mask gets set properly so we can do just 1 & in the fast path and avoid adding more tests. It will also make it symmetric - if it can override the on case, then it will be able to override the off case. And to be more explicit you can pass a mask variable to br_multicast_rcv() which will get populated and then you can pass it down to br_flood(). That will also avoid adding new bits to the skb's bridge CB.
Hi Andrew, On 4/2/2024 8:43 AM, Andrew Lunn wrote: > On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote: >> There is a use case where one would like to enable multicast snooping >> on a bridge but disable multicast flooding on all bridge ports so that >> registered multicast traffic will only reach the intended recipients and >> unregistered multicast traffic will be dropped. However, with existing >> bridge ports' mcast_flood flag implementation, it doesn't work as desired. >> >> This patchset aims to make multicast snooping work even when multicast >> flooding is disabled on the bridge ports, without changing the semantic of >> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue. >> >> Also, in a network where more than one multicast snooping capable bridges >> are interconnected without multicast routers being present, multicast >> snooping fails if: >> >> 1. The source is not directly attached to the Querier >> 2. The listener is beyond the mrouter port of the bridge where the >> source is directly attached >> 3. A hardware offloading switch is involved > > I've not studied the details here, but that last point makes me think > the offload driver is broken. There should not be any difference > between software bridging and hardware bridging. The whole idea is > that you take what Linux can do in software and accelerate it by > offloading to hardware. Doing acceleration should not change the > behaviour. > In patch 10 I gave a little more detail about the fix, but basically this is what happened. Assuming we have a soft bridge like the following: bp1 +------------+ Querier <---- | bridge | +------------+ bp2 | | bp3 | | v v MC Source MC Listener Here bp1 is the mrouter port, bp2 is connected to the multicast source, and bp3 is connected to the multicast listener who wishes to receive multicast traffic for that group. After some Query/Report exchange, the snooping code in the bridge is going to learn about the Listener from bp3, and is going to create an MDB group which includes bp3 as the sole member. When the bridge receives a multicast packet for that group from bp2, the bridge will do a look up to find the members of that group (in this case, bp3) and forward the packet to every single member in that group. At the same time, the bridge will also forward the packet to every mrouter port so that listeners beyond mrouter ports can receive that multicast packet as well. Now consider the same scenario, but with a hardware-offloaded switch: +------------+ | bridge | +------------+ ^ | | p6 (Host CPU port) p1/bp1 +------------+ Querier <---- | sw | +------------+ p2/bp2 | | p3/bp3 | | v v MC Source MC Listener Same Query/Report exchange, same MDB group, except that this time around the MDB group will be offloaded to the switch as well. So in the switch's ATU we will now have an entry for the multicast group and with p3 being the only member of that ATU. When the multicast packet arrives at the switch from p2, the switch will do an ATU lookup, and forward the packet to p3 only. This means that the Host CPU (p6) will not get a copy of the packet, and so the soft bridge will not have the opportunity to forward that packet to the mrouter port. This is what patch 10 attempts to address. One possible solution of course is to add p6 to every MDB group, however that's probably not very desirable. Besides, it will be more efficient if the packet is forwarded to the mrouter port by the switch in hardware (i.e., offload mrouter forwarding), vs. being forwarded in the bridge by software. >> The patches were developed against 5.15, and forward-ported to 6.8. >> Tests were done on a Pi 4B + Marvell 6393X Eval board with a single >> switch chip with no VLAN. > > So what is the mv88e6xxx driver doing wrong? > > Andrew > The mv88e6xxx driver did nothing differently than the other DSA drivers. I chose the mv88e6xxx driver to implement the fix simply because that's the only platform I have at hand to verify the solution.
On Thu, Apr 04, 2024 at 05:35:41PM -0400, Joseph Huang wrote: > Hi Andrew, > > On 4/2/2024 8:43 AM, Andrew Lunn wrote: > > On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote: > > > There is a use case where one would like to enable multicast snooping > > > on a bridge but disable multicast flooding on all bridge ports so that > > > registered multicast traffic will only reach the intended recipients and > > > unregistered multicast traffic will be dropped. However, with existing > > > bridge ports' mcast_flood flag implementation, it doesn't work as desired. > > > > > > This patchset aims to make multicast snooping work even when multicast > > > flooding is disabled on the bridge ports, without changing the semantic of > > > the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue. > > > > > > Also, in a network where more than one multicast snooping capable bridges > > > are interconnected without multicast routers being present, multicast > > > snooping fails if: > > > > > > 1. The source is not directly attached to the Querier > > > 2. The listener is beyond the mrouter port of the bridge where the > > > source is directly attached > > > 3. A hardware offloading switch is involved > > > > I've not studied the details here, but that last point makes me think > > the offload driver is broken. There should not be any difference > > between software bridging and hardware bridging. The whole idea is > > that you take what Linux can do in software and accelerate it by > > offloading to hardware. Doing acceleration should not change the > > behaviour. > > > > In patch 10 I gave a little more detail about the fix, but basically this is > what happened. > > Assuming we have a soft bridge like the following: > > bp1 +------------+ > Querier <---- | bridge | > +------------+ > bp2 | | bp3 > | | > v v > MC Source MC Listener > > Here bp1 is the mrouter port, bp2 is connected to the multicast source, and > bp3 is connected to the multicast listener who wishes to receive multicast > traffic for that group. > > After some Query/Report exchange, the snooping code in the bridge is going > to learn about the Listener from bp3, and is going to create an MDB group > which includes bp3 as the sole member. When the bridge receives a multicast > packet for that group from bp2, the bridge will do a look up to find the > members of that group (in this case, bp3) and forward the packet to every > single member in that group. At the same time, the bridge will also forward > the packet to every mrouter port so that listeners beyond mrouter ports can > receive that multicast packet as well. > > Now consider the same scenario, but with a hardware-offloaded switch: > > +------------+ > | bridge | > +------------+ > ^ > | > | p6 (Host CPU port) > p1/bp1 +------------+ > Querier <---- | sw | > +------------+ > p2/bp2 | | p3/bp3 > | | > v v > MC Source MC Listener > > Same Query/Report exchange, same MDB group, except that this time around the > MDB group will be offloaded to the switch as well. So in the switch's ATU we > will now have an entry for the multicast group and with p3 being the only > member of that ATU. When the multicast packet arrives at the switch from p2, > the switch will do an ATU lookup, and forward the packet to p3 only. This > means that the Host CPU (p6) will not get a copy of the packet, and so the > soft bridge will not have the opportunity to forward that packet to the > mrouter port. This is what patch 10 attempts to address. > > One possible solution of course is to add p6 to every MDB group, however > that's probably not very desirable. Besides, it will be more efficient if > the packet is forwarded to the mrouter port by the switch in hardware (i.e., > offload mrouter forwarding), vs. being forwarded in the bridge by software. Thanks for the explanation. So i think the key part which you said above is: At the same time, the bridge will also forward the packet to every mrouter port so that listeners beyond mrouter ports can receive that multicast packet as well. How does the bridge know about the mrouter port? It seems like the bridge needs to pass that information down to the switch. Is the mrouter itself performing a join on the group when it has members of the group on the other side of it? Andrew
On 4/2/2024 5:59 PM, Nikolay Aleksandrov wrote: > On 4/2/24 23:46, Vladimir Oltean wrote: >> On Tue, Apr 02, 2024 at 09:50:51PM +0300, Nikolay Aleksandrov wrote: >>> On 4/2/24 20:43, Vladimir Oltean wrote: >>>> Hi Nikolai, >>>> >>>> On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote: >>>>> For the bridge patches: >>>>> Nacked-by: Nikolay Aleksandrov <razor@blackwall.org> >>>>> >>>>> You cannot break the multicast flood flag to add support for a custom >>>>> use-case. This is unacceptable. The current bridge behaviour is correct >>>>> your patch 02 doesn't fix anything, you should configure the bridge >>>>> properly to avoid all those problems, not break protocols. >>>>> >>>>> Your special use case can easily be solved by a user-space helper or >>>>> eBPF and nftables. You can set the mcast flood flag and bypass the >>>>> bridge for these packets. I basically said the same in 2021, if this is >>>>> going to be in the bridge it should be hidden behind an option that is >>>>> default off. But in my opinion adding an option to solve such special >>>>> cases is undesirable, they can be easily solved with what's currently >>>>> available. >>>> >>>> I appreciate your time is limited, but could you please translate your >>>> suggestion, and detail your proposed alternative a bit, for those of us >>>> who are not very familiar with IP multicast snooping? >>>> >>> >>> My suggestion is not related to snooping really, but to the goal of >>> patches 01-03. The bridge patches in this set are trying to forward >>> traffic that is not supposed to be forwarded with the proposed >>> configuration, >> >> Correct up to a point. Reinterpreting the given user space configuration >> and trying to make it do something else seems like a mistake, but in >> principle one could also look at alternative bridge configurations like >> the one I described here: >> https://lore.kernel.org/netdev/20240402180805.yhhwj2f52sdc4dl2@skbuf/ >> >>> so that can be done by a user-space helper that installs >>> rules to bypass the bridge specifically for those packets while >>> monitoring the bridge state to implement a policy and manage these rules >>> in order to keep snooping working. >>> >>>> Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't >>>> that break snooping? And then do what with the packets, forward them in >>>> another software layer than the bridge? >>>> >>> >>> The ones that are not supposed to be forwarded in the proposed config >>> and are needed for this use case (control traffic and link-local). Obviously >>> to have proper snooping you'd need to manage these bypass >>> rules and use them only while needed. >> >> I think Joseph will end up in a situation where he needs IGMP control >> messages both in the bridge data path and outside of it :) >> > > My solution does not exclude such scenario. With all unregistered mcast > disabled it will be handled the same as with this proposed solution. > >> Also, your proposal eliminates the possibility of cooperating with a >> hardware accelerator which can forward the IGMP messages where they need >> to go. >> >> As far as I understand, I don't think Joseph has a very "special" use case. >> Disabling flooding of unregistered multicast in the data plane sounds >> reasonable. There seems to be a gap in the bridge API, in that this > > This we already have, but.. > >> operation also affects the control plane, which he is trying to fix with >> this "force flooding", because of insufficiently fine grained control. >> > > Right, this is the part that is more special, I'm not saying it's > unreasonable. The proposition to make it optional and break it down to > type of traffic sounds good to me. > >>>> I also don't quite understand the suggestion of turning on mcast flooding: >>>> isn't Joseph saying that he wants it off for the unregistered multicast >>>> data traffic? >>> >>> Ah my bad, I meant to turn off flooding and bypass the bridge for those >>> packets and ports while necessary, under necessary can be any policy >>> that the user-space helper wants to implement. >>> >>> In any case, if this is going to be yet another kernel solution then it >>> must be a new option that is default off, and doesn't break current mcast >>> flood flag behaviour. >> >> Yeah, maybe something like this, simple and with clear offload >> semantics, as seen in existing hardware (not Marvell though): >> >> mcast_flood == off: >> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off") >> - mcast_ipv4_data_flood: don't care >> - mcast_ipv6_ctrl_flood: don't care >> - mcast_ipv6_data_flood: don't care >> - mcast_l2_flood: don't care >> mcast_flood == on: >> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood >> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood >> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood >> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood >> - Flood L2 according to mcast_l2_flood Did you mean if mcast_flood == on (meaning mcast_flood is ENABLED) - mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded anyway) ... if mcast_flood == off (meaning mcast_flood is DISABLED) - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood ... ? Otherwise the problem is still not solved when mcast_flood is disabled. > > Yep, sounds good to me. I was thinking about something in these lines > as well if doing a kernel solution in order to make it simpler and more > generic. The ctrl flood bits need to be handled more carefully to make > sure they match only control traffic and not link-local data. Do we consider 224.0.0.251 (mDNS) to be control or data? What qualifies as control I guess that's my question. > I think the old option can be converted to use this fine-grained > control, that is if anyone sets the old flood on/off then the flood > mask gets set properly so we can do just 1 & in the fast path and avoid > adding more tests. It will also make it symmetric - if it can override > the on case, then it will be able to override the off case. And to be > more explicit you can pass a mask variable to br_multicast_rcv() which > will get populated and then you can pass it down to br_flood(). That > will also avoid adding new bits to the skb's bridge CB. > >
Hi Andrew, On 4/4/2024 6:11 PM, Andrew Lunn wrote: > On Thu, Apr 04, 2024 at 05:35:41PM -0400, Joseph Huang wrote: >> Hi Andrew, >> >> On 4/2/2024 8:43 AM, Andrew Lunn wrote: >>> On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote: >>>> There is a use case where one would like to enable multicast snooping >>>> on a bridge but disable multicast flooding on all bridge ports so that >>>> registered multicast traffic will only reach the intended recipients and >>>> unregistered multicast traffic will be dropped. However, with existing >>>> bridge ports' mcast_flood flag implementation, it doesn't work as desired. >>>> >>>> This patchset aims to make multicast snooping work even when multicast >>>> flooding is disabled on the bridge ports, without changing the semantic of >>>> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue. >>>> >>>> Also, in a network where more than one multicast snooping capable bridges >>>> are interconnected without multicast routers being present, multicast >>>> snooping fails if: >>>> >>>> 1. The source is not directly attached to the Querier >>>> 2. The listener is beyond the mrouter port of the bridge where the >>>> source is directly attached >>>> 3. A hardware offloading switch is involved >>> >>> I've not studied the details here, but that last point makes me think >>> the offload driver is broken. There should not be any difference >>> between software bridging and hardware bridging. The whole idea is >>> that you take what Linux can do in software and accelerate it by >>> offloading to hardware. Doing acceleration should not change the >>> behaviour. >>> >> >> In patch 10 I gave a little more detail about the fix, but basically this is >> what happened. >> >> Assuming we have a soft bridge like the following: >> >> bp1 +------------+ >> Querier <---- | bridge | >> +------------+ >> bp2 | | bp3 >> | | >> v v >> MC Source MC Listener >> >> Here bp1 is the mrouter port, bp2 is connected to the multicast source, and >> bp3 is connected to the multicast listener who wishes to receive multicast >> traffic for that group. >> >> After some Query/Report exchange, the snooping code in the bridge is going >> to learn about the Listener from bp3, and is going to create an MDB group >> which includes bp3 as the sole member. When the bridge receives a multicast >> packet for that group from bp2, the bridge will do a look up to find the >> members of that group (in this case, bp3) and forward the packet to every >> single member in that group. At the same time, the bridge will also forward >> the packet to every mrouter port so that listeners beyond mrouter ports can >> receive that multicast packet as well. >> >> Now consider the same scenario, but with a hardware-offloaded switch: >> >> +------------+ >> | bridge | >> +------------+ >> ^ >> | >> | p6 (Host CPU port) >> p1/bp1 +------------+ >> Querier <---- | sw | >> +------------+ >> p2/bp2 | | p3/bp3 >> | | >> v v >> MC Source MC Listener >> >> Same Query/Report exchange, same MDB group, except that this time around the >> MDB group will be offloaded to the switch as well. So in the switch's ATU we >> will now have an entry for the multicast group and with p3 being the only >> member of that ATU. When the multicast packet arrives at the switch from p2, >> the switch will do an ATU lookup, and forward the packet to p3 only. This >> means that the Host CPU (p6) will not get a copy of the packet, and so the >> soft bridge will not have the opportunity to forward that packet to the >> mrouter port. This is what patch 10 attempts to address. >> >> One possible solution of course is to add p6 to every MDB group, however >> that's probably not very desirable. Besides, it will be more efficient if >> the packet is forwarded to the mrouter port by the switch in hardware (i.e., >> offload mrouter forwarding), vs. being forwarded in the bridge by software. > > Thanks for the explanation. So i think the key part which you said > above is: > > At the same time, the bridge will also forward the packet to every > mrouter port so that listeners beyond mrouter ports can receive that > multicast packet as well. > > How does the bridge know about the mrouter port? It seems like the The bridge learns about the existence of the Querier by the reception of Queries. The bridge will mark the port which it received Queries from as the mrouter port. > bridge needs to pass that information down to the switch. Is the The bridge does pass that information down to switchdev. Patch 5 adds DSA handling of that event as well. Patches 9 and 10 adds the support in the mv88e6xxx driver. > mrouter itself performing a join on the group when it has members of > the group on the other side of it? You hit a key point here. The Querier does not send Report (not with IGMPv2 anyway), so the bridge will never add the mrouter port to any MDB group. That's why patch 10 is needed. Prestera driver does something similar (which is what patches 6,7, and 10 are modeled after). > > Andrew
On Thu, Apr 04, 2024 at 06:16:12PM -0400, Joseph Huang wrote: > > > mcast_flood == off: > > > - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off") > > > - mcast_ipv4_data_flood: don't care > > > - mcast_ipv6_ctrl_flood: don't care > > > - mcast_ipv6_data_flood: don't care > > > - mcast_l2_flood: don't care > > > mcast_flood == on: > > > - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood > > > - Flood all other IPv4 multicast according to mcast_ipv4_data_flood > > > - Flood ff02::/16 according to mcast_ipv6_ctrl_flood > > > - Flood all other IPv6 multicast according to mcast_ipv6_data_flood > > > - Flood L2 according to mcast_l2_flood > > Did you mean > > if mcast_flood == on (meaning mcast_flood is ENABLED) > - mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded anyway) > ... > > if mcast_flood == off (meaning mcast_flood is DISABLED) > - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood > ... > > ? Otherwise the problem is still not solved when mcast_flood is disabled. No, I mean exactly as I said. My goal was not to "solve the problem" when mcast_flood is disabled, but to give you an option to configure the bridge to achieve what you want, in a way which I think is more acceptable. AFAIU, there is not really any "problem" - the bridge behaves exactly as instructed given the limited language available to instruct it ("mcast_flood" covers all multicast). So the other knobs have the role of fine-tuning what gets flooded when mcast_flood is on. Like "yes, but..." You can't "solve the problem" when it involves changing an established behavior that somebody probably depended on to be just like that. > > Yep, sounds good to me. I was thinking about something in these lines > > as well if doing a kernel solution in order to make it simpler and more > > generic. The ctrl flood bits need to be handled more carefully to make > > sure they match only control traffic and not link-local data. > > Do we consider 224.0.0.251 (mDNS) to be control or data? What qualifies as > control I guess that's my question. Well, as I said, I'm proposing that 224.0.0.x qualifies as control and the rest of IPv4 multicast as data. Which means that, applied to your case, "mcast_flood on mcast_ipv4_ctrl_flood on mcast_ipv4_data_flood off" will "force flood" mDNS just like the IGMP traffic from your patches. I'm not aware if this could be considered problematic (I don't think so). The reason behind this proposal is that, AFAIU, endpoints may choose to join IGMP groups in the 224.0.0.x range or not, but RFC4541 says that switches shouldn't prune the destinations towards endpoints that don't join this range anyway: https://www.rfc-editor.org/rfc/rfc4541#page-6 Whereas for IP multicast traffic towards an address outside 224.0.0.x, pruning will happen as per the IGMP join tracking mechanism.
On 4/5/24 13:20, Vladimir Oltean wrote: > On Thu, Apr 04, 2024 at 06:16:12PM -0400, Joseph Huang wrote: >>>> mcast_flood == off: >>>> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off") >>>> - mcast_ipv4_data_flood: don't care >>>> - mcast_ipv6_ctrl_flood: don't care >>>> - mcast_ipv6_data_flood: don't care >>>> - mcast_l2_flood: don't care >>>> mcast_flood == on: >>>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood >>>> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood >>>> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood >>>> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood >>>> - Flood L2 according to mcast_l2_flood >> >> Did you mean >> >> if mcast_flood == on (meaning mcast_flood is ENABLED) >> - mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded anyway) >> ... >> >> if mcast_flood == off (meaning mcast_flood is DISABLED) >> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood >> ... >> >> ? Otherwise the problem is still not solved when mcast_flood is disabled. > > No, I mean exactly as I said. My goal was not to "solve the problem" > when mcast_flood is disabled, but to give you an option to configure the > bridge to achieve what you want, in a way which I think is more acceptable. > > AFAIU, there is not really any "problem" - the bridge behaves exactly as > instructed given the limited language available to instruct it ("mcast_flood" > covers all multicast). So the other knobs have the role of fine-tuning > what gets flooded when mcast_flood is on. Like "yes, but..." > > You can't "solve the problem" when it involves changing an established > behavior that somebody probably depended on to be just like that. > >>> Yep, sounds good to me. I was thinking about something in these lines >>> as well if doing a kernel solution in order to make it simpler and more >>> generic. The ctrl flood bits need to be handled more carefully to make >>> sure they match only control traffic and not link-local data. >> >> Do we consider 224.0.0.251 (mDNS) to be control or data? What qualifies as >> control I guess that's my question. > > Well, as I said, I'm proposing that 224.0.0.x qualifies as control and > the rest of IPv4 multicast as data. Which means that, applied to your > case, "mcast_flood on mcast_ipv4_ctrl_flood on mcast_ipv4_data_flood off" > will "force flood" mDNS just like the IGMP traffic from your patches. > I'm not aware if this could be considered problematic (I don't think so). > > The reason behind this proposal is that, AFAIU, endpoints may choose to > join IGMP groups in the 224.0.0.x range or not, but RFC4541 says that > switches shouldn't prune the destinations towards endpoints that don't > join this range anyway: https://www.rfc-editor.org/rfc/rfc4541#page-6 > > Whereas for IP multicast traffic towards an address outside 224.0.0.x, > pruning will happen as per the IGMP join tracking mechanism. +1, non-IGMP traffic to 224.0.0.x must be flooded to all anyway so this should allow for a better control over it, but perhaps the naming should be link_local instead because control usually means IGMP, maybe something like mcast_ip_link_local_flood
> > Thanks for the explanation. So i think the key part which you said > > above is: > > > > At the same time, the bridge will also forward the packet to every > > mrouter port so that listeners beyond mrouter ports can receive that > > multicast packet as well. > > > > How does the bridge know about the mrouter port? It seems like the > > The bridge learns about the existence of the Querier by the reception of > Queries. The bridge will mark the port which it received Queries from as the > mrouter port. > > > bridge needs to pass that information down to the switch. Is the > > The bridge does pass that information down to switchdev. Patch 5 adds DSA > handling of that event as well. Patches 9 and 10 adds the support in the > mv88e6xxx driver. I've not been looking at the details too much for this patchset. It does however seem that some parts are controversial, and others just seem like a bug fix. Does it make sense to split this into two patchsets? Andrew
On 4/5/2024 7:00 AM, Nikolay Aleksandrov wrote: > On 4/5/24 13:20, Vladimir Oltean wrote: >> On Thu, Apr 04, 2024 at 06:16:12PM -0400, Joseph Huang wrote: >>>>> mcast_flood == off: >>>>> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off") >>>>> - mcast_ipv4_data_flood: don't care >>>>> - mcast_ipv6_ctrl_flood: don't care >>>>> - mcast_ipv6_data_flood: don't care >>>>> - mcast_l2_flood: don't care >>>>> mcast_flood == on: >>>>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood >>>>> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood >>>>> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood >>>>> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood >>>>> - Flood L2 according to mcast_l2_flood >>> >>> Did you mean >>> >>> if mcast_flood == on (meaning mcast_flood is ENABLED) >>> - mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded >>> anyway) >>> ... >>> >>> if mcast_flood == off (meaning mcast_flood is DISABLED) >>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood >>> ... >>> >>> ? Otherwise the problem is still not solved when mcast_flood is >>> disabled. >> >> No, I mean exactly as I said. My goal was not to "solve the problem" >> when mcast_flood is disabled, but to give you an option to configure the >> bridge to achieve what you want, in a way which I think is more >> acceptable. >> >> AFAIU, there is not really any "problem" - the bridge behaves exactly as >> instructed given the limited language available to instruct it >> ("mcast_flood" >> covers all multicast). So the other knobs have the role of fine-tuning >> what gets flooded when mcast_flood is on. Like "yes, but..." >> >> You can't "solve the problem" when it involves changing an established >> behavior that somebody probably depended on to be just like that. >> >>>> Yep, sounds good to me. I was thinking about something in these lines >>>> as well if doing a kernel solution in order to make it simpler and more >>>> generic. The ctrl flood bits need to be handled more carefully to make >>>> sure they match only control traffic and not link-local data. >>> >>> Do we consider 224.0.0.251 (mDNS) to be control or data? What >>> qualifies as >>> control I guess that's my question. >> >> Well, as I said, I'm proposing that 224.0.0.x qualifies as control and >> the rest of IPv4 multicast as data. Which means that, applied to your >> case, "mcast_flood on mcast_ipv4_ctrl_flood on mcast_ipv4_data_flood off" >> will "force flood" mDNS just like the IGMP traffic from your patches. >> I'm not aware if this could be considered problematic (I don't think so). >> >> The reason behind this proposal is that, AFAIU, endpoints may choose to >> join IGMP groups in the 224.0.0.x range or not, but RFC4541 says that >> switches shouldn't prune the destinations towards endpoints that don't >> join this range anyway: https://www.rfc-editor.org/rfc/rfc4541#page-6 >> >> Whereas for IP multicast traffic towards an address outside 224.0.0.x, >> pruning will happen as per the IGMP join tracking mechanism. > > +1, non-IGMP traffic to 224.0.0.x must be flooded to all anyway > so this should allow for a better control over it, but perhaps > the naming should be link_local instead because control usually > means IGMP, maybe something like mcast_ip_link_local_flood > Like this? bridge link set dev swp0 mcast_flood off - all flooding disabled bridge link set dev swp0 mcast_flood on - all flooding enabled bridge link set dev swp0 mcast_flood on mcast_ipv4_data_flood off mcast_ipv6_data_flood off - IPv4 data packets flooding disabled, IPv6 data packets flooding disabled, everything else floods (that is to say, only allow IPv4 local subnet and IPv6 link-local to flood) ? The syntax seems to be counterintuitive. Or like this? bridge link set dev swp0 mcast_flood on mcast_ipv4_ctrl_flood on - only allow IPv4 local subnet to flood, everything else off ? So basically the question is, what should the behavior be when something is omitted from the command line?
On Fri, Apr 05, 2024 at 04:22:43PM -0400, Joseph Huang wrote: > Like this? > > bridge link set dev swp0 mcast_flood off > - all flooding disabled > > bridge link set dev swp0 mcast_flood on > - all flooding enabled > > bridge link set dev swp0 mcast_flood on mcast_ipv4_data_flood off > mcast_ipv6_data_flood off > - IPv4 data packets flooding disabled, IPv6 data packets flooding > disabled, everything else floods (that is to say, only allow IPv4 local > subnet and IPv6 link-local to flood) > > ? Yeah. > The syntax seems to be counterintuitive. > > Or like this? > > bridge link set dev swp0 mcast_flood on mcast_ipv4_ctrl_flood on > - only allow IPv4 local subnet to flood, everything else off > > ? Nope. > So basically the question is, what should the behavior be when something is > omitted from the command line? The answer is always: "new options should default to behaving exactly like before". It's not just about the command line arguments, but also about the actual netlink attributes that iproute2 (and other tooling) creates when communicating with the kernel. Old user space has no idea about the existence of mcast_ipv4_ctrl_flood et. al. So, if netlink attributes specifying their value are not sent by user space, their value in the kernel must mimic the value of mcast_flood.
On 4/5/2024 5:15 PM, Vladimir Oltean wrote: > On Fri, Apr 05, 2024 at 04:22:43PM -0400, Joseph Huang wrote: >> Like this? >> >> bridge link set dev swp0 mcast_flood off >> - all flooding disabled >> >> bridge link set dev swp0 mcast_flood on >> - all flooding enabled >> >> bridge link set dev swp0 mcast_flood on mcast_ipv4_data_flood off >> mcast_ipv6_data_flood off >> - IPv4 data packets flooding disabled, IPv6 data packets flooding >> disabled, everything else floods (that is to say, only allow IPv4 local >> subnet and IPv6 link-local to flood) >> >> ? > > Yeah. > >> The syntax seems to be counterintuitive. >> >> Or like this? >> >> bridge link set dev swp0 mcast_flood on mcast_ipv4_ctrl_flood on >> - only allow IPv4 local subnet to flood, everything else off >> >> ? > > Nope. > >> So basically the question is, what should the behavior be when something is >> omitted from the command line? > > The answer is always: "new options should default to behaving exactly > like before". It's not just about the command line arguments, but also > about the actual netlink attributes that iproute2 (and other tooling) > creates when communicating with the kernel. Old user space has no idea > about the existence of mcast_ipv4_ctrl_flood et. al. So, if netlink > attributes specifying their value are not sent by user space, their > value in the kernel must mimic the value of mcast_flood. How about the following syntax? I think it satisfies all the "not breaking existing behavior" requirements (new option defaults to off, and missing user space netlink attributes does not change the existing behavior): mcast_flood off all off mcast_flood off mcast_flood_rfc4541 off all off mcast_flood off mcast_flood_rfc4541 on 224.0.0.X and ff02::1 on, the rest off mcast_flood on all on mcast_flood on mcast_flood_rfc4541 off all on (mcast_flood on overrides mcast_flood_rfc4541) mcast_flood on mcast_flood_rfc4541 on all on mcast_flood_rfc4541 off invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is specified first) mcast_flood_rfc4541 on invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is specified first) Think of mcast_flood_rfc4541 like a pet door if you will.
On Mon, Apr 29, 2024 at 04:14:03PM -0400, Joseph Huang wrote: > How about the following syntax? I think it satisfies all the "not breaking > existing behavior" requirements (new option defaults to off, and missing > user space netlink attributes does not change the existing behavior): > > mcast_flood off > all off > mcast_flood off mcast_flood_rfc4541 off > all off > mcast_flood off mcast_flood_rfc4541 on > 224.0.0.X and ff02::1 on, the rest off > mcast_flood on > all on > mcast_flood on mcast_flood_rfc4541 off > all on (mcast_flood on overrides mcast_flood_rfc4541) > mcast_flood on mcast_flood_rfc4541 on > all on > mcast_flood_rfc4541 off > invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is > specified first) > mcast_flood_rfc4541 on > invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is > specified first) A bridge port defaults to having BR_MCAST_FLOOD set - see new_nbp(). Netlink attributes are only there to _change_ the state of properties in the kernel. They don't need to be specified by user space if there's nothing to be changed. "Only valid if another netlink attribute comes first" makes no sense. You can alter 2 bridge port flags as part of the same netlink message, or as part of different netlink messages (sent over sockets of other processes). > > Think of mcast_flood_rfc4541 like a pet door if you will. Ultimately, as far as I see it, both the OR-based and the AND-based UAPI addition could be made to work in a way that's perhaps similarly backwards compatible. It needs to be worked out with the bridge maintainers. Given that I'm not doing great with my spare time, I will take a back seat on that. However, what I don't quite understand in your proposal is how many IPv4 addresses lie beyond the "224.0.0.X" notation? 256? Explain why there is such a large discrepancy in the number of IPv4 addresses you are in control of (256), vs only 1 IPv6 address with the new rfc4541 flag?
On 4/29/2024 9:21 PM, Vladimir Oltean wrote: > On Mon, Apr 29, 2024 at 04:14:03PM -0400, Joseph Huang wrote: >> How about the following syntax? I think it satisfies all the "not breaking >> existing behavior" requirements (new option defaults to off, and missing >> user space netlink attributes does not change the existing behavior): >> >> mcast_flood off >> all off >> mcast_flood off mcast_flood_rfc4541 off >> all off >> mcast_flood off mcast_flood_rfc4541 on >> 224.0.0.X and ff02::1 on, the rest off >> mcast_flood on >> all on >> mcast_flood on mcast_flood_rfc4541 off >> all on (mcast_flood on overrides mcast_flood_rfc4541) >> mcast_flood on mcast_flood_rfc4541 on >> all on >> mcast_flood_rfc4541 off >> invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is >> specified first) >> mcast_flood_rfc4541 on >> invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is >> specified first) > > A bridge port defaults to having BR_MCAST_FLOOD set - see new_nbp(). > Netlink attributes are only there to _change_ the state of properties in > the kernel. They don't need to be specified by user space if there's > nothing to be changed. "Only valid if another netlink attribute comes > first" makes no sense. You can alter 2 bridge port flags as part of the > same netlink message, or as part of different netlink messages (sent > over sockets of other processes). > >> >> Think of mcast_flood_rfc4541 like a pet door if you will. > > Ultimately, as far as I see it, both the OR-based and the AND-based UAPI > addition could be made to work in a way that's perhaps similarly backwards > compatible. It needs to be worked out with the bridge maintainers. Given > that I'm not doing great with my spare time, I will take a back seat on > that. Nik, do you have any objection to the following proposal? mcast_flood -> default/ off on (existing flag) missing (specified/ (specified/ (on) nlmsg) nlmsg) mcast_flood_rfc4541 (proposed new flag) | v default/ flood all no flood flood all missing (off) off flood all no flood flood all (specified/nlmsg) on flood all flood 4541 flood all (specified/nlmsg) ^^^^^^^^^^ only behavior change Basically the attributes are OR'ed together to form the final flooding decision. > > However, what I don't quite understand in your proposal is how many IPv4 > addresses lie beyond the "224.0.0.X" notation? 256? Explain why there is > such a large discrepancy in the number of IPv4 addresses you are in > control of (256), vs only 1 IPv6 address with the new rfc4541 flag? That's straight out of RFC-4541 ("coincidentally", these are also the IP addresses for which the bridge will not create mdb's): 2.1.2. 2) Packets with a destination IP (DIP) address in the 224.0.0.X range which are not IGMP must be forwarded on all ports. This recommendation is based on the fact that many host systems do not send Join IP multicast addresses in this range before sending or listening to IP multicast packets. Furthermore, since the 224.0.0.X address range is defined as link-local (not to be routed), it seems unnecessary to keep the state for each address in this range. Additionally, some routers operate in the 224.0.0.X address range without issuing IGMP Joins, and these applications would break if the switch were to prune them due to not having seen a Join Group message from the router. and 3. In IPv6, the data forwarding rules are more straight forward because MLD is mandated for addresses with scope 2 (link-scope) or greater. The only exception is the address FF02::1 which is the all hosts link-scope address for which MLD messages are never sent. Packets with the all hosts link-scope address should be forwarded on all ports.
On 30/04/2024 20:01, Joseph Huang wrote: > On 4/29/2024 9:21 PM, Vladimir Oltean wrote: >> On Mon, Apr 29, 2024 at 04:14:03PM -0400, Joseph Huang wrote: >>> How about the following syntax? I think it satisfies all the "not breaking >>> existing behavior" requirements (new option defaults to off, and missing >>> user space netlink attributes does not change the existing behavior): >>> >>> mcast_flood off >>> all off >>> mcast_flood off mcast_flood_rfc4541 off >>> all off >>> mcast_flood off mcast_flood_rfc4541 on >>> 224.0.0.X and ff02::1 on, the rest off >>> mcast_flood on >>> all on >>> mcast_flood on mcast_flood_rfc4541 off >>> all on (mcast_flood on overrides mcast_flood_rfc4541) >>> mcast_flood on mcast_flood_rfc4541 on >>> all on >>> mcast_flood_rfc4541 off >>> invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is >>> specified first) >>> mcast_flood_rfc4541 on >>> invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is >>> specified first) >> >> A bridge port defaults to having BR_MCAST_FLOOD set - see new_nbp(). >> Netlink attributes are only there to _change_ the state of properties in >> the kernel. They don't need to be specified by user space if there's >> nothing to be changed. "Only valid if another netlink attribute comes >> first" makes no sense. You can alter 2 bridge port flags as part of the >> same netlink message, or as part of different netlink messages (sent >> over sockets of other processes). >> >>> >>> Think of mcast_flood_rfc4541 like a pet door if you will. >> >> Ultimately, as far as I see it, both the OR-based and the AND-based UAPI >> addition could be made to work in a way that's perhaps similarly backwards >> compatible. It needs to be worked out with the bridge maintainers. Given >> that I'm not doing great with my spare time, I will take a back seat on >> that. > > Nik, do you have any objection to the following proposal? > > mcast_flood -> default/ off on > (existing flag) missing (specified/ (specified/ > (on) nlmsg) nlmsg) > > mcast_flood_rfc4541 > (proposed new flag) > | > v > default/ flood all no flood flood all > missing > (off) > > off flood all no flood flood all > (specified/nlmsg) > > on flood all flood 4541 flood all > (specified/nlmsg) ^^^^^^^^^^ > only behavior change > > > Basically the attributes are OR'ed together to form the final flooding decision. > > Looks good to me. Please make use of the boolopt uapi to avoid adding new nl attributes. Thanks, Nik
Sorry for chiming in so late in this conversation, missed it due to the mailing list address change and how my procmail sorts things... First of all, many thanks Joseph for looking into this! I think I agree with you that there is an actual RFC4541 compatiblity issue for multicast offloading switches which (try to) use the kernel space Linux bridge IGMP/MLD snooping code. For both routable and non-routable multicast traffic. I see a lot of confusion and misunderstandings in this thread. And initially got very confused by the cover letter, too, and needed to recheck the MCAST_FLOOD flag behaviour in the code and in practice :-). > There is a use case where one would like to enable multicast snooping > on a bridge but disable multicast flooding on all bridge ports so that > registered multicast traffic will only reach the intended recipients and > unregistered multicast traffic will be dropped. To clarify for others: This is exactly what the Linux bridge does right now. With bridge multicast snooping enabled and active (a querier is present) any snoopable, unregistered (no MDB entry) IP multicast payload traffic will only be forwarded to ports which were either manually set to a multicast router port or where a multicast router was detected via IGMP/MLD query, MRD or PIM snooping. And if no such port exists, will be dropped. The current per port MCAST_FLOOD flag implementation does not change this behaviour. Its current (unfortunately not very well documented) behaviour is basically, mainly to decide on what to do with packets which the bridge multicast snooping code *can not* deal with / can not learn about. Which can be because multicast snooping is disabled, because there is no IGMP/MLD querier, because they are IGMPv1/v2/MLDv1 packets, because they are not in the RFC4541 defined snoopable address ranges - or because it is a multicast packet which is not an IP packet after all. The last case should make it clear why the MCAST_FLOOD is 1/enabled by default. Which might be a bit confusing/counterintuitive initially when coming and thinking from the other, the IP snooping direction. > bridge ports' mcast_flood flag implementation, it doesn't work as desired. The important context to the "it doesn't work as desired" can be found some lines later: "3. A hardware offloading switch is involved". The main issue seems that the learned or manually set multicast router ports in the Linux bridge are not propagated down to the actual multicast offloading switches. And therefore these switches won't be able to follow RFC4541, which will potentially lead to packet loss for multicast packets, both routable ones - as multicast routers are not detected - but also for non-routable ones due to potential IGMPv1/v2/MLDv1 report suppression. This is the main issue this patchset tries to tackle, making multicast offloading switches with kernelspace IGMP/MLD snooping RFC4541 compliant, I believe? ----- > [PATCH RFC net-next 03/10] net: bridge: Always flood local subnet mc packets > ... > If multicast flooding is disabled on a bridge port, local subnet multicast > packets from the bridge will not be forwarded out of that port, even if > IGMP snooping is running and the hosts beyond the bridge port are sending > Reports to join these groups (e.g., 224.0.0.251) This is a fix for a regression which patch 01/10 introduced in the first place > [PATCH RFC net-next 02/10] net: bridge: Always multicast_flood Reports > ... > Fixes: b00589af3b04 ("bridge: disable snooping if there is no querier") This is a regression of patch 01/10, the "Fixes" line is incorrect. > [PATCH RFC net-next 09/10] net: dsa: mv88e6xxx: Enable mc flood for mrouter port So this is what the idea of repurposing/redefining of MCAST_FLOOD ultimately comes down to, I guess? I'm wondering, shouldn't the information you're looking for already be available inside the Linux bridge as is? As it right now correctly knows when to flood on a port? Hm, would it maybe alternatively be possible to somehow call the new dsa_switch_ops.port_mrouter you're adding - which seems to be a missing, key part to fix this in the DSA API - from for instance br_port_mc_router_state_change() instead? Without needing any new or redefined state or knob in the Linux bridge? I'm also not quite sure yet what effect your proposed changes to the MCAST_FLOOD knob would have to non-IP multicast packets - could it break them? ----- For the discussion regarding more knobs for a more fine-grained control for flooding various types of packets: I agree that would be nice to have. But I don't think it's necessary to fix the original issue which Joseph tries to address. Also I think they maybe should be added afterwards, after fixing the issue with offloading switches? As already as is the port MCAST_FLOOD and BR_INPUT_SKB_CB(skb)->mrouters_only flags interactions are quite hard to follow and confusing in the Linux bridge code (qed.?). Regards, Linus PS: Also adding Jan Hoffmann, Birger Koblitz and Sebastian Gottschall to CC, as they seem to have been working on getting such a feature running in OpenWrt/DD-Wrt for rtl83xx DSA switches, too.
On Wed, Feb 26, 2025 at 09:20:58PM +0100, Linus Lüssing wrote: > [...] > The main issue seems that the learned or manually set multicast > router ports in the Linux bridge are not propagated down to the > actual multicast offloading switches. Next to this issue I'm also wondering if the following might still need addressing (which this patchset does not try to address?) to support multicast offloading switches with kernelspace IGMP/MLD snooping - or if there are some switch chips which do not support this and hence won't be able to use kernelspace IGMP/MLD snooping. A rough first attempt of a guideline/checklist: Needed switch chip capabilities: 1) adding MDB entries to ports 2) adding multicast router ports 3) -> the switch chip must only apply these to a) IP packets with a matching protocol family (ether type 0x0800 || 0x86DD) and: b.1) snoopable IP multicast address ranges (224.0.0.0/4 minus 224.0.0.0/24, ff00::/8 minus ff02::1/128 IP destination addresses) b.2) alternatively to b.1 (+a), but less desirably a switch chip might match on layer 2: 01:00:5E:00:00:00, mask ff:ff:ff:f8:00:00 minus 01:00:5e:00:00:00, mask ff:ff:ff:ff:ff:00; 33:33:00:00:00:00, mask ff:ff:00:00:00:00 minus 33:33:00:00:00:01, mask ff:ff:ff:ff:ff:ff c.1) must *not* apply this to IGMP/MLD reports (especially IGMPv1/v2/MLDv1 reports), they must only be forwarded to the registered multicast router ports *plus* the CPU port / Linux bridge, for the latter to be able to learn c.2) may forward IGMP/MLD packets only to the CPU port / Linux bridge, if the Linux bridge (or DSA) can in turn selectively forward the IGMP/MLD to multicast router ports, excluding the incoming port -> DSA might need to inform the Linux bridge about the supported mode of the switch chip? (d) IGMP/MLD queries need to be flooded to all ports by default, but they would not match 3.b or 3.c.1 anyway; 3.c.2 may match, then the Linux bridge (or DSA) needs to make sure to reflect it back to all ports excluding the incoming port 4) Any frame that did not match via 3) within the switch chip must by default be flooded to all ports 4.1) this should be tunable and propagated from Linux bridge MCAST_FLOOD port flag to the switch chip 4.2) if a switch chip cannot comply with 3) and has bridge port isolation enabled then the Linux bridge should perform multicast forwarding and IGMP/MLD snooping fully in kernelspace and return a warning about missing hardware support 4.3) if a switch chip cannot adhere to neither 3) nor 4.2) then a user trying to enable bridge multicast snooping should be denied and return an error => no incomplete hacks allowed, which might break especially IP in specific scenarios Would it maybe make sense to add some guideline/checklist like this, which is more explicit than RFC4541 but should be compatible to it, to Documentation/networking/dsa/dsa.rst? (I'm not as familiar with DSA/switchdev/switch chips as with IP/IGMP/MLD/RFC4541 on layer 2+. So especially feedback from people more familiar with these lower layes would be appreciated.) ----- Why I'm also wondering if a guideline might be useful because: Saw this merging of multicast routers ports and MDB approach discussion here: https://lore.kernel.org/netdev/db38eb8f-9109-b142-6ef4-91df1c1c9de3@3e8.eu/ I have some suspicion what it might try to achieve, but am unsure if that can work reliably in all scenarios. Is this intended as a hack where the switch chip or DSA has no support to configure multicast router ports? If yes, what would happen if there is: 1) a layer 3 multicast router 2) a multicast sender with a routable destination address 3) no local multicast listener for 3), so no local reports for it? Would neither an MDB nor a multicast router port be configured then? Or with two multicast snooping switches, if one of them never sees the according IGMP/MLD reports due to RFC4541 forwarding restrictions? Regards, Linus