Message ID | 980e8b0a82042d7e5801e02bf16fe72a0bde6759.1441934465.git.dledford@redhat.com (mailing list archive)
---|---
State | Rejected
On Thu, 10 Sep 2015, Doug Ledford wrote:

> + * 1) ifdown/ifup
> + * 2) a regular mcast join/leave happens and we run
> + *    ipoib_mcast_restart_task
> + * 3) a REREGISTER event comes in from the SM
> + * 4) any other event that might cause a mcast flush

Could we have a timeout and leave the multicast group on process exit? The old code base did that with the ipoib_mcast_leave_task() function. With that timeout we would no longer accumulate MC send-only subscriptions on long-running systems.

Also, IPOIB_MAX_MCAST_QUEUE's default of 3 is not really enough to capture a burst of traffic sent to a multicast group. Can we make this configurable or increase the max?
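[For illustration, a minimal sketch of what such a knob could look like. The parameter name mcast_sendq_depth is invented here; nothing like it exists in the tree, where the cap is the compile-time IPOIB_MAX_MCAST_QUEUE constant:]

    /* Hypothetical sketch: replace the compile-time IPOIB_MAX_MCAST_QUEUE
     * cap with a module parameter so a site can absorb a larger burst of
     * packets while a send-only join is still in flight.
     */
    static unsigned int mcast_sendq_depth = 3;  /* matches today's hard-coded cap */
    module_param(mcast_sendq_depth, uint, 0644);
    MODULE_PARM_DESC(mcast_sendq_depth,
                     "Max packets queued per mcast group while a join is pending");

[The queueing sites in ipoib_multicast.c that currently compare skb_queue_len() against IPOIB_MAX_MCAST_QUEUE would then compare against mcast_sendq_depth instead.]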
On Fri, Sep 11, 2015 at 09:40:25AM -0500, Christoph Lameter wrote:
> On Thu, 10 Sep 2015, Doug Ledford wrote:
>
> > + * 1) ifdown/ifup
> > + * 2) a regular mcast join/leave happens and we run
> > + *    ipoib_mcast_restart_task
> > + * 3) a REREGISTER event comes in from the SM
> > + * 4) any other event that might cause a mcast flush
>
> Could we have a timeout and leave the multicast group on process
> exit?

At a minimum, when the socket that did the send closes, the send-only could be de-refed..

But really, send-only in IB is very similar to neighbour discovery, and should work the same, with a timer, etc. In my mind it would be ideal to even add them to the ND table, if the core would support that..

Jason
On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
> During the recent rework of the mcast handling in ipoib, the join
> tasks for regular and send-only joins were merged. In the old code,
> the comments indicated that the ipoib driver didn't send enough
> information to auto-create IB multicast groups when the join was a
> send-only join. The reality is that the comments said we didn't, but
> we actually did. Since we merged the two join tasks, we now follow
> the comments and don't auto-create IB multicast groups for an ipoib
> send-only multicast join. This has been reported to cause problems
> in certain environments that rely on this behavior. Specifically,
> if you have an IB <-> Ethernet gateway then there is a fundamental
> mismatch between the methodologies used on the two fabrics. On
> Ethernet, an app need not subscribe to a multicast group, merely
> listen.

This should probably be clarified. On all IP networks, IGMP/MLD is used to advertise listeners.

An IB/Eth gateway is a router, and IP routers are expected to process IGMP - so the gateway certainly can (and maybe must) be copying groups declared with IGMP from the eth side into listeners on IB MGIDs.

We may also be making a mistake in IPoIB - if the send-side MGID join fails, the send should probably go to the broadcast, and the join retried from time to time. This would at least let the gateway optimize a bit more by only creating groups in active use.

That would emulate the ethernet philosophy of degrade to broadcast.

TBH, fixing broken gateway devices by sending to broadcast appeals to me more than making a module option..

> on the IB side of the gateway. There are instances of installations
> with 100's (maybe 1000's) of multicast groups where static creation
> of all the groups is not practical that rely upon the send-only

With my last patch the SM now has enough information to auto-create the wonky send-only join attempts, if it wanted to. It just needs to fill in tclass, so it certainly is possible to address this long term without a kernel patch.

> joins creating the IB multicast group in order to function, so to
> preserve these existing installations, add a module option to the
> ipoib module to restore the previous behavior.

.. but an option to restore the behavior of an older kernel version makes sense - did we really have a kernel that completely converted a send-only join to a full join & create?

> + * An additional problem is that if we auto-create the IB
> + * mcast group in response to a send-only action, then we
> + * will be the creating entity, but we will not have any
> + * mechanism by which we will track when we should leave
> + * the group ourselves. We will occasionally leave and
> + * re-join the group when these events occur:

I would drop the language of creating-entity, that isn't something from the IBA. The uncontrolled lifetime of the join is not related to creating or not.

> + * 2) a regular mcast join/leave happens and we run
> + *    ipoib_mcast_restart_task

Really? All send-only MGIDs are discarded at that point? Ugly.

Jason
On 09/11/2015 02:21 PM, Jason Gunthorpe wrote:
> On Fri, Sep 11, 2015 at 09:40:25AM -0500, Christoph Lameter wrote:
>> On Thu, 10 Sep 2015, Doug Ledford wrote:
>>
>>> + * 1) ifdown/ifup
>>> + * 2) a regular mcast join/leave happens and we run
>>> + *    ipoib_mcast_restart_task
>>> + * 3) a REREGISTER event comes in from the SM
>>> + * 4) any other event that might cause a mcast flush
>>
>> Could we have a timeout and leave the multicast group on process
>> exit?
>
> At a minimum, when the socket that did the send closes, the send-only
> could be de-refed..

If we kept a ref count, but we don't. Tracking this is not a small change.

> But really, send-only in IB is very similar to neighbour discovery, and
> should work the same, with a timer, etc. In my mind it would be ideal
> to even add them to the ND table, if the core would support that..
On 09/11/2015 02:39 PM, Jason Gunthorpe wrote:
> On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
>> During the recent rework of the mcast handling in ipoib, the join
>> tasks for regular and send-only joins were merged. In the old code,
>> the comments indicated that the ipoib driver didn't send enough
>> information to auto-create IB multicast groups when the join was a
>> send-only join. The reality is that the comments said we didn't, but
>> we actually did. Since we merged the two join tasks, we now follow
>> the comments and don't auto-create IB multicast groups for an ipoib
>> send-only multicast join. This has been reported to cause problems
>> in certain environments that rely on this behavior. Specifically,
>> if you have an IB <-> Ethernet gateway then there is a fundamental
>> mismatch between the methodologies used on the two fabrics. On
>> Ethernet, an app need not subscribe to a multicast group, merely
>> listen.
>
> This should probably be clarified. On all IP networks, IGMP/MLD is used
> to advertise listeners.
>
> An IB/Eth gateway is a router, and IP routers are expected to process
> IGMP - so the gateway certainly can (and maybe must) be copying
> groups declared with IGMP from the eth side into listeners on IB MGIDs.

Obviously, the gateway in question currently is not doing this.

> We may also be making a mistake in IPoIB - if the send-side MGID join
> fails, the send should probably go to the broadcast, and the join
> retried from time to time. This would at least let the gateway
> optimize a bit more by only creating groups in active use.

That's not *too* far off from what we do. This change could be made without a huge amount of effort. Right now we queue up the packet and retry the join on a timer. Only after all of the join attempts have failed their timed retries do we dequeue and drop the packets. We could drop the queue backlog entirely and just send to broadcast when the multicast group is unsubscribed.

> That would emulate the ethernet philosophy of degrade to broadcast.

Indeed.

> TBH, fixing broken gateway devices by sending to broadcast appeals to
> me more than making a module option..

Well, we've already established that the gateway device might well be broken. That makes one wonder if this will work or if it might be broken too.

>> on the IB side of the gateway. There are instances of installations
>> with 100's (maybe 1000's) of multicast groups where static creation
>> of all the groups is not practical that rely upon the send-only
>
> With my last patch the SM now has enough information to auto-create
> the wonky send-only join attempts, if it wanted to. It just needs to fill
> in tclass, so it certainly is possible to address this long term
> without a kernel patch.
>
>> joins creating the IB multicast group in order to function, so to
>> preserve these existing installations, add a module option to the
>> ipoib module to restore the previous behavior.
>
> .. but an option to restore the behavior of an older kernel version makes
> sense - did we really have a kernel that completely converted a
> send-only join to a full join & create?
I just re-checked to verify this, and here's what I found: kernels v3.10, v3.19, and v4.0 have this in ipoib_mcast_sendonly_join():

	if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
		ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n");
		return -EBUSY;
	}

	rec.mgid     = mcast->mcmember.mgid;
	rec.port_gid = priv->local_gid;
	rec.pkey     = cpu_to_be16(priv->pkey);

	comp_mask =
		IB_SA_MCMEMBER_REC_MGID		|
		IB_SA_MCMEMBER_REC_PORT_GID	|
		IB_SA_MCMEMBER_REC_PKEY		|
		IB_SA_MCMEMBER_REC_JOIN_STATE;

	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
					 priv->port, &rec,
					 comp_mask,
					 GFP_ATOMIC,
					 ipoib_mcast_sendonly_join_complete,
					 mcast);

kernel v4.1 and on no longer have sendonly_join as a separate function. So, to answer your question, no, upstream didn't do this. This, however, is carried in OFED:

	rec.mgid     = mcast->mcmember.mgid;
	rec.port_gid = priv->local_gid;
	rec.pkey     = cpu_to_be16(priv->pkey);

	comp_mask =
		IB_SA_MCMEMBER_REC_MGID		|
		IB_SA_MCMEMBER_REC_PORT_GID	|
		IB_SA_MCMEMBER_REC_PKEY		|
		IB_SA_MCMEMBER_REC_JOIN_STATE;

	if (priv->broadcast) {
		comp_mask |=
			IB_SA_MCMEMBER_REC_QKEY			|
			IB_SA_MCMEMBER_REC_MTU_SELECTOR		|
			IB_SA_MCMEMBER_REC_MTU			|
			IB_SA_MCMEMBER_REC_TRAFFIC_CLASS	|
			IB_SA_MCMEMBER_REC_RATE_SELECTOR	|
			IB_SA_MCMEMBER_REC_RATE			|
			IB_SA_MCMEMBER_REC_SL			|
			IB_SA_MCMEMBER_REC_FLOW_LABEL		|
			IB_SA_MCMEMBER_REC_HOP_LIMIT;

		rec.qkey	  = priv->broadcast->mcmember.qkey;
		rec.mtu_selector  = IB_SA_EQ;
		rec.mtu		  = priv->broadcast->mcmember.mtu;
		rec.traffic_class = priv->broadcast->mcmember.traffic_class;
		rec.rate_selector = IB_SA_EQ;
		rec.rate	  = priv->broadcast->mcmember.rate;
		rec.sl		  = priv->broadcast->mcmember.sl;
		rec.flow_label	  = priv->broadcast->mcmember.flow_label;
		rec.hop_limit	  = priv->broadcast->mcmember.hop_limit;
	}

and so this has been happening since forever in OFED (the above is from 1.5.4.1). Part of the confusion here is that I was on a concall with the people this problem is affecting, and their claim was that this started with my cleanup of the multicast stuff in the v4.1 kernel. I double-checked the OFED sources at the time, but not the upstream kernel, and took them at their word that my patches caused the problem to appear on upstream kernels, which implies that the older upstream kernels, where the sendonly_join that I deleted still existed, were like OFED. If my patch set *did* start the problem, then it's not because of the auto-create issue. So in that sense, this would be more like creating an OFED compatibility hook.

>> + * An additional problem is that if we auto-create the IB
>> + * mcast group in response to a send-only action, then we
>> + * will be the creating entity, but we will not have any
>> + * mechanism by which we will track when we should leave
>> + * the group ourselves. We will occasionally leave and
>> + * re-join the group when these events occur:
>
> I would drop the language of creating-entity, that isn't something
> from the IBA.
>
> The uncontrolled lifetime of the join is not related to creating or
> not.
>
>> + * 2) a regular mcast join/leave happens and we run
>> + *    ipoib_mcast_restart_task
>
> Really? All send-only MGIDs are discarded at that point? Ugly.

Yep. Any time an app joins/leaves a multicast group, it updates our multicast list in the net core, and via our net update we schedule a multicast_restart_task().
When that happens, all full joins are matched against the new list of joined groups from the net core: any that no longer match a subscribed group we put on the remove list, any that still match we leave on the list, and any that are now in the net core but not in our list we add. Because the send-only groups are not tracked at the net core level, our only option is to move them all to the remove list and, when we get another send-only packet, rejoin. Unless we want them to stay around forever. But since they aren't real send-only joins - they are full joins where we simply ignore the incoming data - leaving them around seems a bad idea.
On Fri, Sep 11, 2015 at 04:09:49PM -0400, Doug Ledford wrote:
> On 09/11/2015 02:39 PM, Jason Gunthorpe wrote:
> > On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
> >> During the recent rework of the mcast handling in ipoib, the join
> >> tasks for regular and send-only joins were merged. In the old code,
> >> the comments indicated that the ipoib driver didn't send enough
> >> information to auto-create IB multicast groups when the join was a
> >> send-only join. The reality is that the comments said we didn't, but
> >> we actually did. Since we merged the two join tasks, we now follow
> >> the comments and don't auto-create IB multicast groups for an ipoib
> >> send-only multicast join. This has been reported to cause problems
> >> in certain environments that rely on this behavior. Specifically,
> >> if you have an IB <-> Ethernet gateway then there is a fundamental
> >> mismatch between the methodologies used on the two fabrics. On
> >> Ethernet, an app need not subscribe to a multicast group, merely
> >> listen.
> >
> > This should probably be clarified. On all IP networks, IGMP/MLD is used
> > to advertise listeners.
> >
> > An IB/Eth gateway is a router, and IP routers are expected to process
> > IGMP - so the gateway certainly can (and maybe must) be copying
> > groups declared with IGMP from the eth side into listeners on IB MGIDs.
>
> Obviously, the gateway in question currently is not doing this.

Sure, my remark was to clarify the commit comment so people don't think this is OK/expected behavior from a gateway.

> We could drop the queue backlog entirely and just send to broadcast
> when the multicast group is unsubscribed.

I'm pretty sure that would upset the people who care about this stuff.. Steady state operation has to eventually move to the optimal MLID.

> Well, we've already established that the gateway device might well be
> broken. That makes one wonder if this will work or if it might be
> broken too.

If it isn't subscribed to the broadcast MLID, it is violating MUST statements in the RFC...

> and so this has been happening since forever in OFED (the above is from
> 1.5.4.1).

But has this been dropped from the new 3.x series that tracks upstream exactly?

> our list we add. Because the send-only groups are not tracked at the net
> core level, our only option is to move them all to the remove list and,
> when we get another send-only packet, rejoin. Unless we want them to
> stay around forever. But since they aren't real send-only joins - they
> are full joins where we simply ignore the incoming data - leaving them
> around seems a bad idea.

It doesn't make any sense to work like that. As is, the send-only side looks pretty messed up to me.

It really needs to act like ND, and yah, that is a big change.

Just to be clear, I'm not entirely opposed to an OFED compatibility module option, but let's understand how this is broken, what the fix is we want to see for mainline, and why the OFED 'solution' is not acceptable for mainline.

Jason
On 09/11/2015 04:38 PM, Jason Gunthorpe wrote:
> On Fri, Sep 11, 2015 at 04:09:49PM -0400, Doug Ledford wrote:
>> On 09/11/2015 02:39 PM, Jason Gunthorpe wrote:
>>> On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
>>>> During the recent rework of the mcast handling in ipoib, the join
>>>> tasks for regular and send-only joins were merged. In the old code,
>>>> the comments indicated that the ipoib driver didn't send enough
>>>> information to auto-create IB multicast groups when the join was a
>>>> send-only join. The reality is that the comments said we didn't, but
>>>> we actually did. Since we merged the two join tasks, we now follow
>>>> the comments and don't auto-create IB multicast groups for an ipoib
>>>> send-only multicast join. This has been reported to cause problems
>>>> in certain environments that rely on this behavior. Specifically,
>>>> if you have an IB <-> Ethernet gateway then there is a fundamental
>>>> mismatch between the methodologies used on the two fabrics. On
>>>> Ethernet, an app need not subscribe to a multicast group, merely
>>>> listen.
>>>
>>> This should probably be clarified. On all IP networks, IGMP/MLD is used
>>> to advertise listeners.
>>>
>>> An IB/Eth gateway is a router, and IP routers are expected to process
>>> IGMP - so the gateway certainly can (and maybe must) be copying
>>> groups declared with IGMP from the eth side into listeners on IB MGIDs.
>>
>> Obviously, the gateway in question currently is not doing this.
>
> Sure, my remark was to clarify the commit comment so people don't think
> this is OK/expected behavior from a gateway.
>
>> We could drop the queue backlog entirely and just send to broadcast
>> when the multicast group is unsubscribed.
>
> I'm pretty sure that would upset the people who care about this
> stuff.. Steady state operation has to eventually move to the optimal
> MLID.

I didn't mean to imply that we wouldn't attempt to join the group still. Just that right now the process is this:

	send_mcast
	    detect no mcast group
	    initiate sendonly join, while queueing packet to be sent
	process join completion
	    success - send queued packet (which has been delayed)
	    fail - leave packet on queue, set backoff timer, try join
	           again after timer expires; if we exceed the number
	           of backoff/join attempts, drop packet

The people I was on the phone with indicated that they are seeing some packet loss of mcast packets on these send-only groups. For one thing, that backlog queue we have while waiting for the join to complete is capped at 3 deep and drops anything beyond that. So it is easy to imagine that even a small burst of packets could cause a drop. But this process also introduces delay on the packets being sent. My comment above then was to suggest the following change:

	send_mcast
	    no mcast group
	        initiate sendonly join
	        simultaneously send current packet to broadcast group
	    process join completion
	        success - mark mcast group live, no queue backlog to send
	        fail - retry as above
	    mcast group exists, but group still not live
	        send current packet to broadcast group
	    mcast group exists, group is live
	        send current packet to mcast group

The major change here is that we would never queue the packets any more. If the group is not already successfully joined, then we simply send to the broadcast group instead of queueing. Once the group is live, we start sending to the group. This would eliminate packet delays, assuming that the gateway properly forwards multicast packets received on the broadcast group versus on a specific IB multicast group.
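[To make the proposed flow concrete, here is a rough sketch of what the no-queue, degrade-to-broadcast path could look like. This is illustrative only: the in-tree ipoib_mcast_send() queues on mcast->pkt_queue instead, and locking and error handling are elided here:]

    /* Sketch: in ipoib_mcast_send(), never queue while a send-only
     * join is in flight; fall back to the broadcast group instead.
     */
    void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
    {
            struct ipoib_dev_priv *priv = netdev_priv(dev);
            struct ipoib_mcast *mcast = __ipoib_mcast_find(dev, daddr);

            if (!mcast) {
                    /* first packet to this group: start a send-only join */
                    mcast = ipoib_mcast_alloc(dev, 0);
                    set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags);
                    memcpy(mcast->mcmember.mgid.raw, daddr + 4,
                           sizeof(union ib_gid));
                    __ipoib_mcast_add(dev, mcast);
                    __ipoib_mcast_schedule_join_thread(priv, NULL, 0);
            }

            if (mcast->ah)          /* join completed: use the real group */
                    ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
            else                    /* join pending or failed: degrade to broadcast */
                    ipoib_send(dev, skb, priv->broadcast->ah, IB_MULTICAST_QPN);
    }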
Then a simple mcast_expire_task that runs every 10 minutes or so and leaves any send-only groups that haven't had a packet timestamp update in more than some arbitrary amount of time would be a very simple addition to make.

>> Well, we've already established that the gateway device might well be
>> broken. That makes one wonder if this will work or if it might be
>> broken too.
>
> If it isn't subscribed to the broadcast MLID, it is violating MUST
> statements in the RFC...

Yeah, my comment was in reference to whether or not it would receive a multicast-on-broadcast packet and forward it properly.

>> and so this has been happening since forever in OFED (the above is from
>> 1.5.4.1).
>
> But has this been dropped from the new 3.x series that tracks
> upstream exactly?

I haven't checked that (I don't have any of the later OFEDs downloaded, and I didn't see a git repo for the 3.x series under Vlad's area on the OpenFabrics git server).

>> our list we add. Because the send-only groups are not tracked at the net
>> core level, our only option is to move them all to the remove list and,
>> when we get another send-only packet, rejoin. Unless we want them to
>> stay around forever. But since they aren't real send-only joins - they
>> are full joins where we simply ignore the incoming data - leaving them
>> around seems a bad idea.
>
> It doesn't make any sense to work like that. As is, the send-only
> side looks pretty messed up to me.
>
> It really needs to act like ND, and yah, that is a big change.
>
> Just to be clear, I'm not entirely opposed to an OFED compatibility
> module option, but let's understand how this is broken, what the fix is
> we want to see for mainline, and why the OFED 'solution' is not
> acceptable for mainline.
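[A minimal sketch of the mcast_expire_task Doug describes above, under stated assumptions: the last_send timestamp and the mcast_expire_task delayed work are invented names, and no such fields exist in the driver today:]

    /* Hypothetical periodic reaper for idle send-only groups.
     * Assumes ipoib_mcast_send() stamps mcast->last_send = jiffies
     * and that this work item is first queued at interface-up time.
     */
    static void ipoib_mcast_expire_task(struct work_struct *work)
    {
            struct ipoib_dev_priv *priv =
                    container_of(work, struct ipoib_dev_priv,
                                 mcast_expire_task.work);
            struct ipoib_mcast *mcast, *tmcast;
            LIST_HEAD(remove_list);

            spin_lock_irq(&priv->lock);
            list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list)
                    if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) &&
                        time_after(jiffies, mcast->last_send + 30 * 60 * HZ))
                            list_move_tail(&mcast->list, &remove_list);
            spin_unlock_irq(&priv->lock);

            list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
                    ipoib_mcast_leave(priv->dev, mcast);
                    ipoib_mcast_free(mcast);
            }

            /* run again in ~10 minutes */
            queue_delayed_work(priv->wq, &priv->mcast_expire_task, 10 * 60 * HZ);
    }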
On Fri, Sep 11, 2015 at 05:40:41PM -0400, Doug Ledford wrote:
> Then a simple mcast_expire_task that runs every 10 minutes or so and
> leaves any send-only groups that haven't had a packet timestamp update
> in more than some arbitrary amount of time would be a very simple
> addition to make.

That all makes sense to me and addresses the backlog issue Christoph mentioned.

It would also be fairly simple to do send-only join properly with the above, as we simply switch to polling the SA to detect if the join is good or not on the same timer that would expire it.

> > If it isn't subscribed to the broadcast MLID, it is violating MUST
> > statements in the RFC...
>
> Yeah, my comment was in reference to whether or not it would receive a
> multicast-on-broadcast packet and forward it properly.

It would be hugely broken to be sensitive to the MLID when working with multicast routing.

Jason
On Fri, 11 Sep 2015, Doug Ledford wrote:
> > At a minimum, when the socket that did the send closes, the send-only
> > could be de-refed..
>
> If we kept a ref count, but we don't. Tracking this is not a small change.

We could call ip_mc_join_group() from ipoib_mcast_send(), which would join it at the socket layer. That layer would do the tracking for us and leave the group when the process terminates. The join would be visible the same way as if one had done an explicit setsockopt().
On 09/14/2015 01:03 PM, Christoph Lameter wrote:
> On Fri, 11 Sep 2015, Doug Ledford wrote:
>
>>> At a minimum, when the socket that did the send closes, the send-only
>>> could be de-refed..
>>
>> If we kept a ref count, but we don't. Tracking this is not a small change.
>
> We could call ip_mc_join_group() from ipoib_mcast_send(), which would join
> it at the socket layer. That layer would do the tracking for us and leave the
> group when the process terminates. The join would be visible the same way
> as if one had done an explicit setsockopt().

I actually think this is a step in the right direction, but I think we are talking about a layering violation doing it the way you are suggesting. What's more, there are a few difficulties here in that I'm fairly certain the core networking layer doesn't have the concept of a send-only join, yet we would need it to have such. If we had three apps do send-only sends, we would need to track all three of those sockets as being send-only socket joins, and then if someone did a full join, we would need to automatically upgrade the join, and then if that app with the full join left, we would need to gracefully downgrade to send-only again. So, I think it would take work at the core level to get this right, but I think that's probably the right place to do the work.
On Tue, 15 Sep 2015, Doug Ledford wrote:
> I actually think this is a step in the right direction, but I think we
> are talking about a layering violation doing it the way you are
> suggesting. What's more, there are a few difficulties here in that I'm

The function is used in various lower-layer processing functions throughout the network stack. As long as we still have a task context and a socket accessible, this should be fine.

> fairly certain the core networking layer doesn't have the concept of a
> send-only join, yet we would need it to have such. If we had three apps

The networking layer supports a setsockopt that can set IP_MULTICAST_LOOP behavior. The API is there to properly control this from user space, and thus it's possible to join send-only from user space. Actually, we would need that for the proper implementation of send-only joins at the fabric level, I would think. I can work on making this work properly with the IP stack.

> do send-only sends, we would need to track all three of those sockets as
> being send-only socket joins, and then if someone did a full join, we
> would need to automatically upgrade the join, and then if that app with
> the full join left, we would need to gracefully downgrade to send-only
> again. So, I think it would take work at the core level to get this
> right, but I think that's probably the right place to do the work.

Well, the IP stack already has to deal with this. See the mc_loop option in the socket as well as the sk_mc_loop() function in the network stack.
On Tue, Sep 15, 2015 at 02:24:21PM -0400, Doug Ledford wrote:
> suggesting. What's more, there are a few difficulties here in that I'm
> fairly certain the core networking layer doesn't have the concept of a
> send-only join, yet we would need it to have such.

The mcast list in the core is solely for listing subscriptions for inbound - i.e. receive. Abusing it for send-side is probably the wrong direction overall.

Send-side addressing is always done with the neighbour table. If core net infrastructure is to be done here, it makes a lot more sense to store multicast IPs in the neighbour table and drive the join lifetime from that side.

[mc_loop is not the same as IB send-only; IB send-only means no receive at all, from anyone]

Jason
On Tue, 15 Sep 2015, Jason Gunthorpe wrote:
> The mcast list in the core is solely for listing subscriptions for
> inbound - i.e. receive. Abusing it for send-side is probably the wrong
> direction overall.

Ok, then a simple approach would be to port the timeout logic from OFED-1.5.X.
On 09/15/2015 07:53 PM, Christoph Lameter wrote:
> On Tue, 15 Sep 2015, Jason Gunthorpe wrote:
>
>> The mcast list in the core is solely for listing subscriptions for
>> inbound - i.e. receive. Abusing it for send-side is probably the wrong
>> direction overall.
>
> Ok, then a simple approach would be to port the timeout logic from
> OFED-1.5.X.

It's the simple fix, but not the right fix. I would prefer to find the right fix for upstream.
On 09/15/2015 03:11 PM, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2015 at 02:24:21PM -0400, Doug Ledford wrote:
>
>> suggesting. What's more, there are a few difficulties here in that I'm
>> fairly certain the core networking layer doesn't have the concept of a
>> send-only join, yet we would need it to have such.
>
> The mcast list in the core is solely for listing subscriptions for
> inbound - ie receive.

Agreed.

> Abusing it for send-side is probably the wrong
> direction overall.

I wouldn't "abuse" it for such, I would suggest adding a proper notion of send-only registrations.

> Send-side addressing is always done with the neighbour table. If core
> net infrastructure is to be done here, it makes a lot more sense to
> store multicast IPs in the neighbour table and drive the join
> lifetime from that side.

Wherever the right place in the core would turn out to be, my point was that we should talk to netdev about extending the core multicast code to have a notion of IPoIB's need for send-only groups to be tracked, then set a flag on the device that notes it needs such tracking, and have the core code track these subscriptions. I would then modify the ipoib multicast code so that the join/leave handler could deal with both send-only and full joins/leaves, scrap the automatic send-only join in mcast_send, and modify mcast_restart_task to be able to promote/demote joins based upon changes in the core-maintained join list. That would be my preferred way to handle this.
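[Purely as a shape for that promote/demote idea - every name below is hypothetical, and no such core-tracked state exists today:]

    /* Hypothetical join level the core would track per group/device. */
    enum ipoib_join_level {
            IPOIB_JOIN_NONE,        /* no senders, no listeners */
            IPOIB_JOIN_SENDONLY,    /* senders only: send-only join */
            IPOIB_JOIN_FULL,        /* at least one listener: full join */
    };

    /* Would be called from a reworked ipoib_mcast_restart_task() with
     * the level the core now reports, upgrading or downgrading the IB
     * join to match.  'mcast->level' is an invented field.
     */
    static void ipoib_mcast_reconcile(struct net_device *dev,
                                      struct ipoib_mcast *mcast,
                                      enum ipoib_join_level wanted)
    {
            if (wanted == mcast->level)
                    return;
            ipoib_mcast_leave(dev, mcast);          /* drop the old membership */
            if (wanted != IPOIB_JOIN_NONE) {
                    if (wanted == IPOIB_JOIN_SENDONLY)
                            set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags);
                    else
                            clear_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags);
                    ipoib_mcast_join(dev, mcast);   /* re-join at the new level */
            }
            mcast->level = wanted;
    }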
On Wed, 16 Sep 2015, Doug Ledford wrote:
> On 09/15/2015 07:53 PM, Christoph Lameter wrote:
> > On Tue, 15 Sep 2015, Jason Gunthorpe wrote:
> >
> >> The mcast list in the core is solely for listing subscriptions for
> >> inbound - i.e. receive. Abusing it for send-side is probably the wrong
> >> direction overall.
> >
> > Ok, then a simple approach would be to port the timeout logic from
> > OFED-1.5.X.
>
> It's the simple fix, but not the right fix. I would prefer to find the
> right fix for upstream.

We would have to track which sockets have sent send-only multicast traffic. Some sort of a refcount on the send-only multicast group that gets decremented when the socket is closed down. We need some sort of custom callback during socket shutdown. The IPoIB layer is not a protocol; otherwise we would have a shutdown callback to work with. Hmmm...

For the UDP protocol, the shutdown function is not populated in the protocol methods. There is an encap_destroy() that is called on udp_destroy_sock(). We could add another check in udp_destroy_sock() that does a callback for IPoIB. That could then release the refcount.

The question then is how do we know which socket has done a send-only join to which multicast groups? We cannot use the regular multicast list for a socket. So add another list?
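[Sketching the shape of that callback idea. All of the names below are invented - there is no such hook today, only the encap_destroy() precedent in udp_destroy_sock():]

    /* Hypothetical per-socket bookkeeping: each send-only join the
     * socket triggered holds one reference that is dropped when the
     * socket is destroyed.
     */
    struct ipoib_sendonly_ref {
            struct list_head    list;
            struct ipoib_mcast *mcast;
    };

    /* Would be wired up as a destroy callback, by analogy with the
     * encap_destroy() hook that udp_destroy_sock() already invokes.
     * ipoib_sk_sendonly_list() and ipoib_mcast_put() are invented.
     */
    static void ipoib_sock_destroy_cb(struct sock *sk)
    {
            struct ipoib_sendonly_ref *ref, *tmp;

            list_for_each_entry_safe(ref, tmp, ipoib_sk_sendonly_list(sk), list) {
                    ipoib_mcast_put(ref->mcast);    /* leave group on last put */
                    list_del(&ref->list);
                    kfree(ref);
            }
    }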
On Wed, 16 Sep 2015, Doug Ledford wrote:
> > Abusing it for send-side is probably the wrong
> > direction overall.
>
> I wouldn't "abuse" it for such, I would suggest adding a proper notion
> of send-only registrations.

That is really not necessary for IP traffic. There is no need to track these since multicast can be sent without subscriptions. So I guess that there will not be much support on netdev for such an approach.
On 09/16/2015 11:59 AM, Christoph Lameter wrote:
> On Wed, 16 Sep 2015, Doug Ledford wrote:
>
>>> Abusing it for send-side is probably the wrong
>>> direction overall.
>>
>> I wouldn't "abuse" it for such, I would suggest adding a proper notion
>> of send-only registrations.
>
> That is really not necessary for IP traffic. There is no need to track
> these since multicast can be sent without subscriptions. So I guess that
> there will not be much support on netdev for such an approach.

I agree with you that it's not necessary for Ethernet/IP traffic. The point is that this isn't really Ethernet/IP traffic, it's InfiniBand/IP traffic. And there it *is* necessary. An argument can be made that the core code shouldn't assume Ethernet/IP and should support either Ethernet/IP or InfiniBand/IP.
Another approach may be to tie the unsubscribe from send-only multicast joins to the expiration of the layer 2 addresses in IPoIB. E.g., add code to __ipoib_reap_ah() to detect if the handle was used for a send-only multicast join; if so, unsubscribe from the MC group. This will result in behavior consistent with address resolution and caching on IPoIB.
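[In sketch form, against the real __ipoib_reap_ah() loop. The ah->mcast back-pointer is invented for illustration; the surrounding loop follows the existing code, with locking elided:]

    /* __ipoib_reap_ah() with a hypothetical hook: when an aged-out
     * address handle belonged to a send-only group, leave that group
     * too, so send-only subscriptions expire like cached L2 addresses.
     */
    list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list)
            if ((int) priv->tx_tail - (int) ah->last_send >= 0) {
                    list_del(&ah->list);
                    if (ah->mcast &&    /* hypothetical back-pointer */
                        test_bit(IPOIB_MCAST_FLAG_SENDONLY, &ah->mcast->flags))
                            ipoib_mcast_leave(dev, ah->mcast);
                    ib_destroy_ah(ah->ah);
                    kfree(ah);
            }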
On Wed, Sep 16, 2015 at 10:59:50AM -0500, Christoph Lameter wrote:
> On Wed, 16 Sep 2015, Doug Ledford wrote:
>
> > > Abusing it for send-side is probably the wrong
> > > direction overall.
> >
> > I wouldn't "abuse" it for such, I would suggest adding a proper notion
> > of send-only registrations.
>
> That is really not necessary for IP traffic. There is no need to track
> these since multicast can be sent without subscriptions. So I guess that
> there will not be much support on netdev for such an approach.

InfiniBand is not unique here; e.g., long ago, proposals for IPoATM had the same problem. Not sure if there are any other more current networks that work this way..

Jason
On Wed, Sep 16, 2015 at 7:31 PM, Christoph Lameter <cl@linux.com> wrote:
> Another approach may be to tie the unsubscribe from send-only multicast
> joins to the expiration of the layer 2 addresses in IPoIB. E.g., add code
> to __ipoib_reap_ah() to detect if the handle was used for a send-only
> multicast join; if so, unsubscribe from the MC group. This will result in
> behavior consistent with address resolution and caching on IPoIB.

yep, Erez has the patches to do so.

Or.
On Wed, 16 Sep 2015, Or Gerlitz wrote:
> On Wed, Sep 16, 2015 at 7:31 PM, Christoph Lameter <cl@linux.com> wrote:
> > Another approach may be to tie the unsubscribe from send-only multicast
> > joins to the expiration of the layer 2 addresses in IPoIB. E.g., add code
> > to __ipoib_reap_ah() to detect if the handle was used for a send-only
> > multicast join; if so, unsubscribe from the MC group. This will result in
> > behavior consistent with address resolution and caching on IPoIB.
>
> yep, Erez has the patches to do so.

Would you please share them?
On Wed, Sep 16, 2015 at 11:17 PM, Christoph Lameter <cl@linux.com> wrote:
> On Wed, 16 Sep 2015, Or Gerlitz wrote:
>
>> On Wed, Sep 16, 2015 at 7:31 PM, Christoph Lameter <cl@linux.com> wrote:
>> > Another approach may be to tie the unsubscribe from send-only multicast
>> > joins to the expiration of the layer 2 addresses in IPoIB. E.g., add code
>> > to __ipoib_reap_ah() to detect if the handle was used for a send-only
>> > multicast join; if so, unsubscribe from the MC group. This will result in
>> > behavior consistent with address resolution and caching on IPoIB.
>>
>> yep, Erez has the patches to do so.
>
> Would you please share them?

I will check with him tomorrow; basically I think he's pretty busy and hence didn't participate in these threads so far, we'll see.

Could you please post here a few (say 2-4) liner summary of what is still missing or done wrong in 4.3-rc1 and what your suggestion is for how to resolve that. The amount of text in the IPoIB patch by Doug that went into rc1, plus everything written so far over this thread, is just too much for a busy Erez and myself to grasp and act on; we need your help with summing this up...

Or.
On Wed, 16 Sep 2015, Or Gerlitz wrote:
> Could you please post here a few (say 2-4) liner summary of what is
> still missing or done wrong in 4.3-rc1 and what your suggestion is for
> how to resolve that.

With Doug's patch here, the only thing that is left to be done is to properly leave the multicast group. And it seems that Erez's patch does just that.

And then there are the 20 other things that I have pending with Mellanox, but those are different issues that do not belong here. This one is a critical bug for us.
On 9/17/2015 3:48 AM, Christoph Lameter wrote:
> On Wed, 16 Sep 2015, Or Gerlitz wrote:
>
>> Could you please post here a few (say 2-4) liner summary of what is
>> still missing or done wrong in 4.3-rc1 and what your suggestion is for
>> how to resolve that.
>
> With Doug's patch here, the only thing that is left to be done is to
> properly leave the multicast group. And it seems that Erez's patch does
> just that.

Sent it out now.

> And then there are the 20 other things that I have pending with Mellanox,
> but those are different issues that do not belong here. This one is a
> critical bug for us.

so 19 left?

Or.
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 09a1748f9d13..2d95b8ae379b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -47,6 +47,11 @@
 
 #include "ipoib.h"
 
+static bool __read_mostly mcast_auto_create;
+
+module_param(mcast_auto_create, bool, 0644);
+MODULE_PARM_DESC(mcast_auto_create, "Should multicast sends auto-create the IB multicast group? (Default: false)");
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 static int mcast_debug_level;
 
@@ -514,9 +519,34 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast)
 	 * detect if there are full members or not. A major problem
 	 * with supporting SEND ONLY is detecting when the group is
 	 * auto-destroyed as IPoIB will cache the MLID..
+	 *
+	 * An additional problem is that if we auto-create the IB
+	 * mcast group in response to a send-only action, then we
+	 * will be the creating entity, but we will not have any
+	 * mechanism by which we will track when we should leave
+	 * the group ourselves. We will occasionally leave and
+	 * re-join the group when these events occur:
+	 *
+	 * 1) ifdown/ifup
+	 * 2) a regular mcast join/leave happens and we run
+	 *    ipoib_mcast_restart_task
+	 * 3) a REREGISTER event comes in from the SM
+	 * 4) any other event that might cause a mcast flush
+	 *
+	 * However, these events are not deterministic and we can
+	 * leave unused groups subscribed for long periods of time.
+	 * In addition, since the core IB layer does not yet support
+	 * send-only IB joins, we have to do a regular join and then
+	 * simply never attach a QP to listen to the incoming data.
+	 * This means that phantom, wasted data will end up coming
+	 * across our inbound physical link only to be thrown away
+	 * by the multicast dispatch mechanism on the card or in
+	 * the kernel driver. For these reasons, we default to not
+	 * auto creating groups for send-only multicast operations.
 	 */
 #if 1
-	if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+	if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) &&
+	    !mcast_auto_create)
 		comp_mask &= ~IB_SA_MCMEMBER_REC_TRAFFIC_CLASS;
 #else
 	if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
During the recent rework of the mcast handling in ipoib, the join tasks for regular and send-only joins were merged. In the old code, the comments indicated that the ipoib driver didn't send enough information to auto-create IB multicast groups when the join was a send-only join. The reality is that the comments said we didn't, but we actually did. Since we merged the two join tasks, we now follow the comments and don't auto-create IB multicast groups for an ipoib send-only multicast join. This has been reported to cause problems in certain environments that rely on this behavior.

Specifically, if you have an IB <-> Ethernet gateway, then there is a fundamental mismatch between the methodologies used on the two fabrics. On Ethernet, an app need not subscribe to a multicast group, merely listen. As such, the Ethernet side of the gateway has no way of knowing if there are listeners. If we don't create groups for sends in this case, and the listeners are only on the Ethernet side of the gateway, the listeners will not get any of the packets sent on the IB side of the gateway. There are instances of installations with 100's (maybe 1000's) of multicast groups, where static creation of all the groups is not practical, that rely upon the send-only joins creating the IB multicast group in order to function, so to preserve these existing installations, add a module option to the ipoib module to restore the previous behavior.

Signed-off-by: Doug Ledford <dledford@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 32 +++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)
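[Usage note: since the parameter in the patch is registered with mode 0644, the behavior could presumably be enabled either at module load time (modprobe ib_ipoib mcast_auto_create=1) or toggled at runtime through /sys/module/ib_ipoib/parameters/mcast_auto_create; the module name ib_ipoib is assumed here.]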