[for-4.3] IB/ipoib: add module option for auto-creating mcast groups

Message ID 980e8b0a82042d7e5801e02bf16fe72a0bde6759.1441934465.git.dledford@redhat.com (mailing list archive)
State Rejected

Commit Message

Doug Ledford Sept. 11, 2015, 1:21 a.m. UTC
During the recent rework of the mcast handling in ipoib, the join
tasks for regular and send-only joins were merged.  In the old code,
the comments indicated that the ipoib driver didn't send enough
information to auto-create IB multicast groups when the join was a
send-only join.  The reality is that the comments said we didn't, but
we actually did.  Since we merged the two join tasks, we now follow
the comments and don't auto-create IB multicast groups for an ipoib
send-only multicast join.  This has been reported to cause problems
in certain environments that rely on this behavior.  Specifically,
if you have an IB <-> Ethernet gateway then there is a fundamental
mismatch between the methodologies used on the two fabrics.  On
Ethernet, an app need not subscribe to a multicast group, merely
listen.  As such, the Ethernet side of the gateway has no way of
knowing if there are listeners.  If we don't create groups for sends
in this case, and the listeners are only on the Ethernet side of
the gateway, the listeners will not get any of the packets sent
on the IB side of the gateway.  There are instances of installations
with 100's (maybe 1000's) of multicast groups where static creation
of all the groups is not practical that rely upon the send-only
joins creating the IB multicast group in order to function, so to
preserve these existing installations, add a module option to the
ipoib module to restore the previous behavior.

Signed-off-by: Doug Ledford <dledford@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 32 +++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)
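As a usage note, assuming the driver module is named ib_ipoib and given the
0644 parameter mode in the patch below, the option could presumably be set at
load time with "modprobe ib_ipoib mcast_auto_create=1", or flipped at runtime
through /sys/module/ib_ipoib/parameters/mcast_auto_create.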

Comments

Christoph Lameter (Ampere) Sept. 11, 2015, 2:40 p.m. UTC | #1
On Thu, 10 Sep 2015, Doug Ledford wrote:

> +		 * 1) ifdown/ifup
> +		 * 2) a regular mcast join/leave happens and we run
> +		 *    ipoib_mcast_restart_task
> +		 * 3) a REREGISTER event comes in from the SM
> +		 * 4) any other event that might cause a mcast flush

Could we have a timeout and leave the multicast group on process exit?
The old code base did that with the ipoib_mcast_leave_task() function.

With that timeout we no longer accumulate MC sendonly subscriptions
on long-running systems.

Also, IPOIB_MAX_MCAST_QUEUE's default of 3 is not really enough to capture
a burst of traffic sent to a multicast group. Can we make this
configurable or increase the max?
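A minimal sketch of what making it configurable might look like; the
parameter name below is an assumption, not existing code:

        static unsigned int mcast_queue_max = IPOIB_MAX_MCAST_QUEUE;

        module_param(mcast_queue_max, uint, 0644);
        MODULE_PARM_DESC(mcast_queue_max,
                         "Max packets queued per mcast group while a join is pending");

ipoib_mcast_send() would then compare skb_queue_len() against
mcast_queue_max instead of the compile-time constant.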

Jason Gunthorpe Sept. 11, 2015, 6:21 p.m. UTC | #2
On Fri, Sep 11, 2015 at 09:40:25AM -0500, Christoph Lameter wrote:
> On Thu, 10 Sep 2015, Doug Ledford wrote:
> 
> > +		 * 1) ifdown/ifup
> > +		 * 2) a regular mcast join/leave happens and we run
> > +		 *    ipoib_mcast_restart_task
> > +		 * 3) a REREGISTER event comes in from the SM
> > +		 * 4) any other event that might cause a mcast flush
> 
> Could we have a timeout and leave the multicast group on process
> exit?

At a minimum, when the socket that did the send closes the send-only
could be de-refed..

But really send-only in IB is very similar to neighbour discovery, and
should work the same, with a timer, etc. In my mind it would be ideal
to even add them to the ND table, if the core would support that..

Jason
Jason Gunthorpe Sept. 11, 2015, 6:39 p.m. UTC | #3
On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
> During the recent rework of the mcast handling in ipoib, the join
> tasks for regular and send-only joins were merged.  In the old code,
> the comments indicated that the ipoib driver didn't send enough
> information to auto-create IB multicast groups when the join was a
> send-only join.  The reality is that the comments said we didn't, but
> we actually did.  Since we merged the two join tasks, we now follow
> the comments and don't auto-create IB multicast groups for an ipoib
> send-only multicast join.  This has been reported to cause problems
> in certain environments that rely on this behavior.  Specifically,
> if you have an IB <-> Ethernet gateway then there is a fundamental
> mismatch between the methodologies used on the two fabrics.  On
> Ethernet, an app need not subscribe to a multicast group, merely
> listen.

This should probably be clarified. On all IP networks IGMP/MLD is used
to advertise listeners.

An IB/Eth gateway is a router, and IP routers are expected to process
IGMP - so the gateway certainly can (and maybe must) be copying
groups declared with IGMP from the eth side into listeners on IB MGIDs.

We may also be making a mistake in IPoIB - if the send side MGID join
fails, the send should probably go to the broadcast, and the join
retried from time to time. This would at least let the gateway
optimize a bit more by only creating groups in active use.

That would emulate the ethernet philosophy of degrade to broadcast.

TBH, fixing broken gateway devices by sending to broadcast appeals to
me more than making a module option..

> on the IB side of the gateway.  There are instances of installations
> with 100's (maybe 1000's) of multicast groups where static creation
> of all the groups is not practical that rely upon the send-only

With my last patch the SM now has enough information to auto-create
the wonky send-only join attempts, if it wants to. It just needs to fill
in tclass, so it certainly is possible to address this long term
without a kernel patch.

> joins creating the IB multicast group in order to function, so to
> preserve these existing installations, add a module option to the
> ipoib module to restore the previous behavior.

.. but an option to restore behavior of an older kernel version makes
sense - did we really have a kernel that completely converted a
send-only join to full join&create?

> +		 * An additional problem is that if we auto-create the IB
> +		 * mcast group in response to a send-only action, then we
> +		 * will be the creating entity, but we will not have any
> +		 * mechanism by which we will track when we should leave
> +		 * the group ourselves.  We will occasionally leave and
> +		 * re-join the group when these events occur:

I would drop the language of creating-entity; that isn't something
from the IBA.

The uncontrolled lifetime of the join is not related to creating or
not.

> +		 * 2) a regular mcast join/leave happens and we run
> +		 *    ipoib_mcast_restart_task

Really? All send only mgids are discarded at that point? Ugly.

Jason
Doug Ledford Sept. 11, 2015, 7:41 p.m. UTC | #4
On 09/11/2015 02:21 PM, Jason Gunthorpe wrote:
> On Fri, Sep 11, 2015 at 09:40:25AM -0500, Christoph Lameter wrote:
>> On Thu, 10 Sep 2015, Doug Ledford wrote:
>>
>>> +		 * 1) ifdown/ifup
>>> +		 * 2) a regular mcast join/leave happens and we run
>>> +		 *    ipoib_mcast_restart_task
>>> +		 * 3) a REREGISTER event comes in from the SM
>>> +		 * 4) any other event that might cause a mcast flush
>>
>> Could we have a timeout and leave the multicast group on process
>> exit?
> 
> At a minimum, when the socket that did the send closes the send-only
> could be de-refed..

If we kept a ref count, but we don't.  Tracking this is not a small change.

> But really send-only in IB is very similar to neighbour discovery, and
> should work the same, with a timer, etc. In my mind it would be ideal
> to even add them to the ND table, if the core would support that..
> 
> Jason
Doug Ledford Sept. 11, 2015, 8:09 p.m. UTC | #5
On 09/11/2015 02:39 PM, Jason Gunthorpe wrote:
> On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
>> During the recent rework of the mcast handling in ipoib, the join
>> tasks for regular and send-only joins were merged.  In the old code,
>> the comments indicated that the ipoib driver didn't send enough
>> information to auto-create IB multicast groups when the join was a
>> send-only join.  The reality is that the comments said we didn't, but
>> we actually did.  Since we merged the two join tasks, we now follow
>> the comments and don't auto-create IB multicast groups for an ipoib
>> send-only multicast join.  This has been reported to cause problems
>> in certain environments that rely on this behavior.  Specifically,
>> if you have an IB <-> Ethernet gateway then there is a fundamental
>> mismatch between the methodologies used on the two fabrics.  On
>> Ethernet, an app need not subscribe to a multicast group, merely
>> listen.
> 
> This should probably be clarified. On all IP networks IGMP/MLD is used
> to advertise listeners.
> 
> An IB/Eth gateway is a router, and IP routers are expected to process
> IGMP - so the gateway certainly can (and maybe must) be copying
> groups declared with IGMP from the eth side into listeners on IB MGIDs.

Obviously, the gateway in question currently is not doing this.

> We may also be making a mistake in IPoIB - if the send side MGID join
> fails, the send should probably go to the broadcast, and the join
> retried from time to time. This would at least let the gateway
> optimize a bit more by only creating groups in active use.

That's not *too* far off from what we do.  This change could be made
without a huge amount of effort.  Right now we queue up the packet and
retry the join on a timer.  Only after all of the join attempts have
failed their timed retries do we dequeue and drop the packets.  We could
drop the queue backlog entirely and just send to broadcast when the
multicast group is unsubscribed.

> That would emulate the ethernet philosophy of degrade to broadcast.

Indeed.

> TBH, fixing broken gateway devices by sending to broadcast appeals to
> me more than making a module option..

Well, we've already established that the gateway device might well be
broken.  That makes one wonder if this will work or if it might be
broken too.

>> on the IB side of the gateway.  There are instances of installations
>> with 100's (maybe 1000's) of multicast groups where static creation
>> of all the groups is not practical that rely upon the send-only
> 
> With my last patch the SM now has enough information to auto-create
> the wonky send-only join attempts, if it wants to. It just needs to fill
> in tclass, so it certainly is possible to address this long term
> without a kernel patch.
> 
>> joins creating the IB multicast group in order to function, so to
>> preserve these existing installations, add a module option to the
>> ipoib module to restore the previous behavior.
> 
> .. but an option to restore behavior of an older kernel version makes
> sense - did we really have a kernel that completely converted a
> send-only join to full join&create?

I just re-checked to verify this and here's what I found:

kernels v3.10, v3.19, and v4.0 have this in ipoib_mcast_sendonly_join:

        if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
                ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n");
                return -EBUSY;
        }

        rec.mgid     = mcast->mcmember.mgid;
        rec.port_gid = priv->local_gid;
        rec.pkey     = cpu_to_be16(priv->pkey);

        comp_mask =
                IB_SA_MCMEMBER_REC_MGID         |
                IB_SA_MCMEMBER_REC_PORT_GID     |
                IB_SA_MCMEMBER_REC_PKEY         |
                IB_SA_MCMEMBER_REC_JOIN_STATE;

        mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
                                         priv->port, &rec,
                                         comp_mask,
                                         GFP_ATOMIC,
                                         ipoib_mcast_sendonly_join_complete,
                                         mcast);

kernels v4.1 and on no longer have sendonly_join as a separate function.
So, to answer your question, no, upstream didn't do this.  This,
however, is carried in OFED:

        rec.mgid     = mcast->mcmember.mgid;
        rec.port_gid = priv->local_gid;
        rec.pkey     = cpu_to_be16(priv->pkey);

        comp_mask =
                IB_SA_MCMEMBER_REC_MGID         |
                IB_SA_MCMEMBER_REC_PORT_GID     |
                IB_SA_MCMEMBER_REC_PKEY         |
                IB_SA_MCMEMBER_REC_JOIN_STATE;

        if (priv->broadcast) {
                comp_mask |=
                        IB_SA_MCMEMBER_REC_QKEY                 |
                        IB_SA_MCMEMBER_REC_MTU_SELECTOR         |
                        IB_SA_MCMEMBER_REC_MTU                  |
                        IB_SA_MCMEMBER_REC_TRAFFIC_CLASS        |
                        IB_SA_MCMEMBER_REC_RATE_SELECTOR        |
                        IB_SA_MCMEMBER_REC_RATE                 |
                        IB_SA_MCMEMBER_REC_SL                   |
                        IB_SA_MCMEMBER_REC_FLOW_LABEL           |
                        IB_SA_MCMEMBER_REC_HOP_LIMIT;

                rec.qkey          = priv->broadcast->mcmember.qkey;
                rec.mtu_selector  = IB_SA_EQ;
                rec.mtu           = priv->broadcast->mcmember.mtu;
                rec.traffic_class = priv->broadcast->mcmember.traffic_class;
                rec.rate_selector = IB_SA_EQ;
                rec.rate          = priv->broadcast->mcmember.rate;
                rec.sl            = priv->broadcast->mcmember.sl;
                rec.flow_label    = priv->broadcast->mcmember.flow_label;
                rec.hop_limit     = priv->broadcast->mcmember.hop_limit;
        }

and so this has been happening since forever in OFED (the above is from
1.5.4.1).

Part of the confusion here is that I was on a concall with the people
this problem is affecting, and their claim was that this started with my
cleanup of the multicast stuff in the v4.1 kernel.  I double checked the
OFED sources at the time, but not the upstream kernel, and took them at
their word that my patches caused the problem to appear on upstream
kernels, which implies that the older upstream kernels, where the
sendonly_join that I deleted still existed, were like OFED.  If my patch
set *did* start the problem, then it's not because of the autocreate
issue.  So in that sense, this would be more like creating an OFED
compatibility hook.

>> +		 * An additional problem is that if we auto-create the IB
>> +		 * mcast group in response to a send-only action, then we
>> +		 * will be the creating entity, but we will not have any
>> +		 * mechanism by which we will track when we should leave
>> +		 * the group ourselves.  We will occasionally leave and
>> +		 * re-join the group when these events occur:
> 
> I would drop the language of creating-entity; that isn't something
> from the IBA.
> 
> The uncontrolled lifetime of the join is not related to creating or
> not.
> 
>> +		 * 2) a regular mcast join/leave happens and we run
>> +		 *    ipoib_mcast_restart_task
> 
> Really? All send only mgids are discarded at that point? Ugly.

Yep.  Any time an app joins/leaves a multicast group, it updates our
multicast list in the net core, and via our net update we schedule a
multicast_restart_task().  When that happens, all full joins are matched
against the new list of joined groups from the net core: any that no
longer match a subscribed group we put on the remove list, any that
match we leave on the list, and any that are now in the net core and not in
our list we add.  Because the sendonly groups are not tracked at the net
core level, our only option is to move them all to the remove list and
when we get another sendonly packet, rejoin.  Unless we want them to
stay around forever.  But since they aren't real send-only joins, they
are full joins where we simply ignore the incoming data, so leaving them
around seems a bad idea.
Jason Gunthorpe Sept. 11, 2015, 8:38 p.m. UTC | #6
On Fri, Sep 11, 2015 at 04:09:49PM -0400, Doug Ledford wrote:
> On 09/11/2015 02:39 PM, Jason Gunthorpe wrote:
> > On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
> >> During the recent rework of the mcast handling in ipoib, the join
> >> tasks for regular and send-only joins were merged.  In the old code,
> >> the comments indicated that the ipoib driver didn't send enough
> >> information to auto-create IB multicast groups when the join was a
> >> send-only join.  The reality is that the comments said we didn't, but
> >> we actually did.  Since we merged the two join tasks, we now follow
> >> the comments and don't auto-create IB multicast groups for an ipoib
> >> send-only multicast join.  This has been reported to cause problems
> >> in certain environments that rely on this behavior.  Specifically,
> >> if you have an IB <-> Ethernet gateway then there is a fundamental
> >> mismatch between the methodologies used on the two fabrics.  On
> >> Ethernet, an app need not subscribe to a multicast group, merely
> >> listen.
> > 
> > This should probably be clarified. On all IP networks IGMP/MLD is used
> > to advertise listeners.
> > 
> > An IB/Eth gateway is a router, and IP routers are expected to process
> > IGMP - so the gateway certainly can (and maybe must) be copying
> > groups declared with IGMP from the eth side into listeners on IB MGIDs.
> 
> Obviously, the gateway in question currently is not doing this.

Sure, my remark was to clarify the commit comment so people don't think
this is OK/expected behavior from a gateway.

> We could drop the queue backlog entirely and just send to broadcast
> when the multicast group is unsubscribed.

I'm pretty sure that would upset the people who care about this
stuff.. Steady state operation has to eventually move to the optimal
MLID.

> Well, we've already established that the gateway device might well be
> broken.  That makes one wonder if this will work or if it might be
> broken too.

If it isn't subscribed to the broadcast MLID, it is violating MUST
statements in the RFC...

> and so this has been happening since forever in OFED (the above is from
> 1.5.4.1).

But has this been dropped from the new 3.x series that tracks
upstream exactly?

> our list we add.  Because the sendonly groups are not tracked at the net
> core level, our only option is to move them all to the remove list and
> when we get another sendonly packet, rejoin.  Unless we want them to
> stay around forever.  But since they aren't real send-only joins, they
> are full joins where we simply ignore the incoming data, so leaving them
> around seems a bad idea.

It doesn't make any sense to work like that. As is, the send-only
side looks pretty messed up to me.

It really needs to act like ND, and yah, that is a big change.

Just to be clear, I'm not entirely opposed to an OFED compatibility
module option, but let's understand how this is broken, what fix we
want to see for mainline, and why the OFED 'solution' is not
acceptable for mainline.

Jason
Doug Ledford Sept. 11, 2015, 9:40 p.m. UTC | #7
On 09/11/2015 04:38 PM, Jason Gunthorpe wrote:
> On Fri, Sep 11, 2015 at 04:09:49PM -0400, Doug Ledford wrote:
>> On 09/11/2015 02:39 PM, Jason Gunthorpe wrote:
>>> On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
>>>> During the recent rework of the mcast handling in ipoib, the join
>>>> tasks for regular and send-only joins were merged.  In the old code,
>>>> the comments indicated that the ipoib driver didn't send enough
>>>> information to auto-create IB multicast groups when the join was a
>>>> send-only join.  The reality is that the comments said we didn't, but
>>>> we actually did.  Since we merged the two join tasks, we now follow
>>>> the comments and don't auto-create IB multicast groups for an ipoib
>>>> send-only multicast join.  This has been reported to cause problems
>>>> in certain environments that rely on this behavior.  Specifically,
>>>> if you have an IB <-> Ethernet gateway then there is a fundamental
>>>> mismatch between the methodologies used on the two fabrics.  On
>>>> Ethernet, an app need not subscribe to a multicast group, merely
>>>> listen.
>>>
>>> This should probably be clarified. On all IP networks IGMP/MLD is used
>>> to advertise listeners.
>>>
>>> An IB/Eth gateway is a router, and IP routers are expected to process
>>> IGMP - so the gateway certainly can (and maybe must) be copying
>>> groups declared with IGMP from the eth side into listeners on IB MGIDs.
>>
>> Obviously, the gateway in question currently is not doing this.
> 
> Sure, my remark was to clarify the commit comment so people don't think
> this is OK/expected behavior from a gateway.
> 
>> We could drop the queue backlog entirely and just send to broadcast
>> when the multicast group is unsubscribed.
> 
> I'm pretty sure that would upset the people who care about this
> stuff.. Steady state operation has to eventually move to the optimal
> MLID.

I didn't mean to imply that we wouldn't attempt to join the group still.
Just that right now the process is this:

send_mcast
  detect no-mcast group
    initiate sendonly join, while queueing packet to be sent
      process join completion
        success - send queued packet (which has been delayed)
        fail - leave packet on queue, set backoff timer, try
          join again after timer expires, if we exceed the
          number of backoff/join attempts, drop packet

The people I was on the phone with indicated that they are seeing some
packet loss of mcast packets on these sendonly groups.  For one thing,
that backlog queue we have while waiting for the join to complete is
capped at 3 deep and drops anything beyond that.  So it is easy to
imagine that even a small burst of packets could cause a drop.  But this
process also introduces delay on the packets being sent.  My comment
above then was to suggest the following change:

send_mcast
  no-mcast group
    initiate sendonly join
    simultaneously send current packet to broadcast group
      process join completion
        success - mark mcast group live, no queue backlog to send
        fail - retry as above
  mcast group, but group still not live
    send current packet to broadcast group
  mcast group, group is live
    send current packet to mcast group

The major change here is that we would never queue the packets any more.
If the group is not already successfully joined, then we simply send to
the broadcast group instead of queueing.  Once the group is live, we
start sending to the group.  This would eliminate packet delays assuming
that the gateway properly forwards multicast packets received on the
broadcast group versus on a specific IB multicast group.

Then a simple mcast_expire_task that runs every 10 minutes or so and
leaves any send-only groups that haven't had a packet timestamp update
in more than some arbitrary amount of time would be a very simple
addition to make.
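A rough sketch of that expiry task, assuming an invented last_send
timestamp on the group, an invented IPOIB_SENDONLY_EXPIRE interval, and a
new mcast_expire_task delayed work on the per-device workqueue (locking
elided for brevity):

        static void ipoib_mcast_expire_task(struct work_struct *work)
        {
                struct ipoib_dev_priv *priv =
                        container_of(work, struct ipoib_dev_priv,
                                     mcast_expire_task.work);
                struct ipoib_mcast *mcast, *tmp;

                list_for_each_entry_safe(mcast, tmp, &priv->multicast_list, list) {
                        if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) &&
                            time_after(jiffies,
                                       mcast->last_send + IPOIB_SENDONLY_EXPIRE)) {
                                list_del(&mcast->list);
                                ipoib_mcast_leave(priv->dev, mcast);
                                ipoib_mcast_free(mcast);
                        }
                }

                /* re-arm: check again in roughly 10 minutes */
                queue_delayed_work(priv->wq, &priv->mcast_expire_task,
                                   round_jiffies_relative(10 * 60 * HZ));
        }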

>> Well, we've already established that the gateway device might well be
>> broken.  That makes one wonder if this will work or if it might be
>> broken too.
> 
> If it isn't subscribed to the broadcast MLID, it is violating MUST
> statements in the RFC...

Yeah, my comment was in reference to whether or not it would receive a
multicast-on-broadcast packet and forward it properly.

>> and so this has been happening since forever in OFED (the above is from
>> 1.5.4.1).
> 
> But has this been dropped from the new 3.x series that tracks
> upstream exactly?

I haven't checked that (I don't have any of the later OFEDs downloaded,
and I didn't see a git repo for the 3.x series under Vlad's area on
the OpenFabrics git server).

>> our list we add.  Because the sendonly groups are not tracked at the net
>> core level, our only option is to move them all to the remove list and
>> when we get another sendonly packet, rejoin.  Unless we want them to
>> stay around forever.  But since they aren't real send-only joins, they
>> are full joins where we simply ignore the incoming data, so leaving them
>> around seems a bad idea.
> 
> It doesn't make any sense to work like that. As is, the send-only
> side looks pretty messed up to me.
> 
> It really needs to act like ND, and yah, that is a big change.
> 
> Just to be clear, I'm not entirely opposed to an OFED compatibility
> module option, but let's understand how this is broken, what fix we
> want to see for mainline, and why the OFED 'solution' is not
> acceptable for mainline.
> 
> Jason
Jason Gunthorpe Sept. 11, 2015, 10 p.m. UTC | #8
On Fri, Sep 11, 2015 at 05:40:41PM -0400, Doug Ledford wrote:
> Then a simple mcast_expire_task that runs every 10 minutes or so and
> leaves any send-only groups that haven't had a packet timestamp update
> in more than some arbitrary amount of time would be a very simple
> addition to make.

That all makes sense to me and addresses the backlog issue Christoph
mentioned.

It would also be fairly simple to do send-only join properly with the
above, as we simply switch to polling the SA to detect if the join is
good or not on the same timer that would expire it.

> > If it isn't subscribed to the broadcast MLID, it is violating MUST
> > statements in the RFC...
> 
> Yeah, my comment was in reference to whether or not it would receive a
> multicast-on-broadcast packet and forward it properly.

It would be hugely broken to be sensitive to the MLID when working
with multicast routing.

Jason
Christoph Lameter (Ampere) Sept. 14, 2015, 5:03 p.m. UTC | #9
On Fri, 11 Sep 2015, Doug Ledford wrote:

> > At a minimum, when the socket that did the send closes the send-only
> > could be de-refed..
>
> If we kept a ref count, but we don't.  Tracking this is not a small change.

We could call ip_mc_join_group() from ipoib_mcast_send(), which would join
it at the socket layer. That layer would do the tracking for us and leave the
group when the process terminates. The join would be visible the same way
as if one had done an explicit setsockopt().
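A sketch of the shape that could take; where ipoib_mcast_send() would get
a socket from is exactly the open question, so the sk argument here is an
assumption:

        #include <linux/igmp.h>
        #include <linux/rtnetlink.h>

        static int ipoib_track_sendonly_sender(struct sock *sk, __be32 group,
                                               int ifindex)
        {
                struct ip_mreqn imr = {
                        .imr_multiaddr.s_addr = group,
                        .imr_ifindex          = ifindex,
                };
                int ret;

                /* ip_mc_join_group() expects the rtnl lock to be held */
                rtnl_lock();
                ret = ip_mc_join_group(sk, &imr);
                rtnl_unlock();

                /* the membership is dropped automatically when sk closes */
                return ret;
        }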

Doug Ledford Sept. 15, 2015, 6:24 p.m. UTC | #10
On 09/14/2015 01:03 PM, Christoph Lameter wrote:
> On Fri, 11 Sep 2015, Doug Ledford wrote:
> 
>>> At a minimum, when the socket that did the send closes the send-only
>>> could be de-refed..
>>
>> If we kept a ref count, but we don't.  Tracking this is not a small change.
> 
> We could call ip_mc_join_group() from ipoib_mcast_send(), which would join
> it at the socket layer. That layer would do the tracking for us and leave the
> group when the process terminates. The join would be visible the same way
> as if one had done an explicit setsockopt().

I actually think this is a step in the right direction, but I think we
are talking about a layering violation doing it the way you are
suggesting.  What's more, there are a few difficulties here in that I'm
fairly certain the core networking layer doesn't have the concept of a
send-only join, yet we would need it to have such.  If we had three apps
do send only sends, we would need to track all three of those sockets as
being send only socket joins, and then if someone did a full join, we
would need to automatically upgrade the join, and then if that app with
the full join left, we would need to gracefully downgrade to send only
again.  So, I think it would take work at the core level to get this
right, but I think that's probably the right place to do the work.
Christoph Lameter (Ampere) Sept. 15, 2015, 6:42 p.m. UTC | #11
On Tue, 15 Sep 2015, Doug Ledford wrote:

>
> I actually think this is a step in the right direction, but I think we
> are talking about a layering violation doing it the way you are
> suggesting.  What's more, there are a few difficulties here in that I'm

The function is used in various lower layer processing functions
throughout the network stack. As long as we still have a task context and
a socket accessible this should be fine.


> fairly certain the core networking layer doesn't have the concept of a
> send-only join, yet we would need it to have such.  If we had three apps

The networking layer supports a setsockopt that can set IP_MULTICAST_LOOP
behavior. The API is there to properly control this from user space
and thus it's possible to join sendonly from user space. Actually we would
need that for the proper implementation of sendonly joins at the fabric
level I would think. I can work on making this work properly with the IP
stack.
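For reference, the user-space knob being referred to is the standard
setsockopt; note Jason's caveat below that this controls loopback of our
own sends, which is not the same thing as an IB send-only join:

        #include <netinet/in.h>
        #include <sys/socket.h>

        /* stop the stack looping our own multicast sends back to us */
        static int disable_mcast_loop(int fd)
        {
                unsigned char off = 0;

                return setsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP,
                                  &off, sizeof(off));
        }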

> do send only sends, we would need to track all three of those sockets as
> being send only socket joins, and then if someone did a full join, we
> would need to automatically upgrade the join, and then if that app with
> the full join left, we would need to gracefully downgrade to send only
> again.  So, I think it would take work at the core level to get this
> right, but I think that's probably the right place to do the work.

Well, the IP stack already has to deal with this. See the mc_loop field in
the socket as well as the sk_mc_loop() function in the network
stack.
Jason Gunthorpe Sept. 15, 2015, 7:11 p.m. UTC | #12
On Tue, Sep 15, 2015 at 02:24:21PM -0400, Doug Ledford wrote:

> suggesting.  What's more, there are a few difficulties here in that I'm
> fairly certain the core networking layer doesn't have the concept of a
> send-only join, yet we would need it to have such.

The mcast list in the core is solely for listing subscriptions for
inbound - ie receive. Abusing it for send-side is probably the wrong
direction overall.

send-side addressing is always done with the neighbour table. If core
net infrastructure is to be done here it makes a lot more sense to
store multicast IP's in the neighbour table and drive the join
lifetime from that side.

[mloop is not the same as IB send-only; IB send-only means no receive
at all, from anyone]

Jason
Christoph Lameter (Ampere) Sept. 15, 2015, 11:53 p.m. UTC | #13
On Tue, 15 Sep 2015, Jason Gunthorpe wrote:

> The mcast list in the core is solely for listing subscriptions for
> inbound - ie receive. Abusing it for send-side is probably the wrong
> direction overall.

Ok, then a simple approach would be to port the timeout logic from
OFED-1.5.X.

Doug Ledford Sept. 16, 2015, 3:14 p.m. UTC | #14
On 09/15/2015 07:53 PM, Christoph Lameter wrote:
> On Tue, 15 Sep 2015, Jason Gunthorpe wrote:
> 
>> The mcast list in the core is solely for listing subscriptions for
>> inbound - ie receive. Abusing it for send-side is probably the wrong
>> direction overall.
> 
> > Ok, then a simple approach would be to port the timeout logic from
> > OFED-1.5.X.

It's the simple fix, but not the right fix.  I would prefer to find the
right fix for upstream.
Doug Ledford Sept. 16, 2015, 3:21 p.m. UTC | #15
On 09/15/2015 03:11 PM, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2015 at 02:24:21PM -0400, Doug Ledford wrote:
> 
>> suggesting.  What's more, there are a few difficulties here in that I'm
>> fairly certain the core networking layer doesn't have the concept of a
>> send-only join, yet we would need it to have such.
> 
>> The mcast list in the core is solely for listing subscriptions for
> inbound - ie receive.

Agreed.

> Abusing it for send-side is probably the wrong
> direction overall.

I wouldn't "abuse" it for such, I would suggest adding a proper notion
of send-only registrations.

> send-side addressing is always done with the neighbour table. If core
> net infrastructure is to be done here it makes a lot more sense to
> store multicast IP's in the neighbour table and drive the join
> lifetime from that side.

Wherever the right place in the core would turn out to be, my point was
that we should talk to netdev about extending the core multicast code to
have a notion of IPoIB's need for send-only groups to be tracked, then
set a flag on the device that notes this needs tracking, and have the core
code track these subscriptions.  I would then modify the ipoib multicast
code so that the join/leave handler could deal with both sendonly and
full joins/leaves, scrap the auto send-only join in mcast_send, and
modify mcast_restart_task to be able to promote/demote joins based upon
changes in the core-maintained join list.  That would be my preferred
way to handle this.
Christoph Lameter (Ampere) Sept. 16, 2015, 3:38 p.m. UTC | #16
On Wed, 16 Sep 2015, Doug Ledford wrote:

> On 09/15/2015 07:53 PM, Christoph Lameter wrote:
> > On Tue, 15 Sep 2015, Jason Gunthorpe wrote:
> >
>> The mcast list in the core is solely for listing subscriptions for
> >> inbound - ie receive. Abusing it for send-side is probably the wrong
> >> direction overall.
> >
> > Ok, then a simple approach would be to port the timeout logic from
> > OFED-1.5.X.
>
> It's the simple fix, but not the right fix.  I would prefer to find the
> right fix for upstream.

We would have to track which sockets have sent sendonly multicast
traffic. Some sort of a refcount on the sendonly multicast group that
gets decremented when the socket is closed down. We need some sort of
custom callback during socket shutdown.

The IPoIB layer is not a protocol; otherwise we would have a shutdown
callback to work with.

Hmmm... For the UDP protocol the shutdown function is not populated in the
protocol methods. There is an encap_destroy() that is called on
udp_destroy_sock(). We could add another check in udp_destroy_sock()
that does a callback for IPoIB. That could then release the refcount.

Question then is how do we know which socket has done a sendonly join to
which multicast groups? We cannot use the regular multicast list for a
socket. So add another list?
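One possible shape, with every name below invented for illustration (the
extra per-socket list and the destroy hook do not exist today):

        /* entry on a new per-socket list of send-only groups */
        struct ipoib_sendonly_ref {
                struct list_head list;
                union ib_gid     mgid;          /* group this socket sent to */
        };

        /* hypothetical hook, called from udp_destroy_sock() alongside
         * encap_destroy(): drop the refs this socket took */
        void ipoib_sendonly_sock_destroy(struct sock *sk);
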
Christoph Lameter (Ampere) Sept. 16, 2015, 3:59 p.m. UTC | #17
On Wed, 16 Sep 2015, Doug Ledford wrote:

> > Abusing it for send-side is probably the wrong
> > direction overall.
>
> I wouldn't "abuse" it for such, I would suggest adding a proper notion
> of send-only registrations.

That is really not necessary for IP traffic. There is no need to track
these since multicast can be sent without subscriptions. So I guess that
there will not be much support on netdev for such an approach.
Doug Ledford Sept. 16, 2015, 4:26 p.m. UTC | #18
On 09/16/2015 11:59 AM, Christoph Lameter wrote:
> On Wed, 16 Sep 2015, Doug Ledford wrote:
> 
>>> Abusing it for send-side is probably the wrong
>>> direction overall.
>>
>> I wouldn't "abuse" it for such, I would suggest adding a proper notion
>> of send-only registrations.
> 
> That is really not necessary for IP traffic. There is no need to track
> these since multicast can be sent without subscriptions. So I guess that
> there will not be much support on netdev for such an approach.

I agree with you that it's not necessary for Ethernet/IP traffic.  The
point is that this isn't really Ethernet/IP traffic, it's InfiniBand/IP
traffic.  And there it *is* necessary.  An argument can be made that the
core code shouldn't assume Ethernet/IP and should support either
Ethernet/IP or InfiniBand/IP.
Christoph Lameter (Ampere) Sept. 16, 2015, 4:31 p.m. UTC | #19
Another approach may be to tie the unsub from sendonly multicast joins to
the expiration of the layer 2 addresses in IPoIB. F.e. add code to
__ipoib_reap_ah() to detect if the handle was used for a sendonly
multicast join. If so, unsubscribe from the MC group. This will result in
behavior consistent with address resolution and caching on IPoIB.
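A sketch of that tie-in; the sendonly_mcast back-pointer from the address
handle to its group is an invented field:

        static void __ipoib_reap_sendonly(struct net_device *dev,
                                          struct ipoib_ah *ah)
        {
                /* the AH aged out, so this group has seen no sends recently */
                if (ah->sendonly_mcast) {
                        ipoib_mcast_leave(dev, ah->sendonly_mcast);
                        ah->sendonly_mcast = NULL;
                }
        }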

Jason Gunthorpe Sept. 16, 2015, 5:03 p.m. UTC | #20
On Wed, Sep 16, 2015 at 10:59:50AM -0500, Christoph Lameter wrote:
> On Wed, 16 Sep 2015, Doug Ledford wrote:
> 
> > > Abusing it for send-side is probably the wrong
> > > direction overall.
> >
> > I wouldn't "abuse" it for such, I would suggest adding a proper notion
> > of send-only registrations.
> 
> That is really not necessary for IP traffic. There is no need to track
> these since multicast can be sent without subscriptions. So I guess that
> there will not be much support on netdev for such an approach.

InfiniBand is not unique here, e.g., long ago proposals for IPoATM had
the same problem. Not sure if there are any other more current networks
that work this way..

Jason
Or Gerlitz Sept. 16, 2015, 8:13 p.m. UTC | #21
On Wed, Sep 16, 2015 at 7:31 PM, Christoph Lameter <cl@linux.com> wrote:
> Another approach may be to tie the unsub from sendonly multicast joins to
> the expiration of the layer 2 addresses in IPoIB. F.e. add code to
>  __ipoib_reap_ah() to detect if the handle was used for a sendonly
> multicast join. If so unsubscribe from the MC group. This will result in
> behavior consistent with address resolution and caching on IPoIB.

yep, Erez has the patches to do so.

Or.
Christoph Lameter (Ampere) Sept. 16, 2015, 8:17 p.m. UTC | #22
On Wed, 16 Sep 2015, Or Gerlitz wrote:

> On Wed, Sep 16, 2015 at 7:31 PM, Christoph Lameter <cl@linux.com> wrote:
> > Another approach may be to tie the unsub from sendonly multicast joins to
> > the expiration of the layer 2 addresses in IPoIB. F.e. add code to
> >  __ipoib_reap_ah() to detect if the handle was used for a sendonly
> > multicast join. If so unsubscribe from the MC group. This will result in
> > behavior consistent with address resolution and caching on IPoIB.
>
> yep, Erez has the patches to do so.

Would you please share them?

Or Gerlitz Sept. 16, 2015, 8:52 p.m. UTC | #23
On Wed, Sep 16, 2015 at 11:17 PM, Christoph Lameter <cl@linux.com> wrote:
> On Wed, 16 Sep 2015, Or Gerlitz wrote:
>
>> On Wed, Sep 16, 2015 at 7:31 PM, Christoph Lameter <cl@linux.com> wrote:
>> > Another approach may be to tie the unsub from sendonly multicast joins to
>> > the expiration of the layer 2 addresses in IPoIB. F.e. add code to
>> >  __ipoib_reap_ah() to detect if the handle was used for a sendonly
>> > multicast join. If so unsubscribe from the MC group. This will result in
>> > behavior consistent with address resolution and caching on IPoIB.
>>
>> yep, Erez has the patches to do so.
>
> Would you please share them?

I will check with him tomorrow; basically I think he's pretty busy and
hence didn't participate in these threads so far. We'll see.

Could you please post here a few-line (say 2-4 lines) summary of what is
still missing or done wrong in 4.3-rc1, and what your suggestion is for how
to resolve that?

The amount of TEXT @ the IPoIB patch by Doug that went into rc1 and
the texts written so far over this thread are just too much for a busy
Erez and myself to grasp and act on; we need your help with summing this
up...

Or.
Christoph Lameter (Ampere) Sept. 17, 2015, 12:48 a.m. UTC | #24
On Wed, 16 Sep 2015, Or Gerlitz wrote:

> Could you please post here a few (say 2-4) liner summary of what is
> still missing or done wrong in 4.3-rc1 and what is your suggestion how
> to resolve that.

With Doug's patch here the only thing that is left to be done is to
properly leave the multicast group. And it seems that Erez's patch does just that.

And then there are the 20 other things that I have pending with Mellanox
but those are different issues that do not belong here. This one is a
critical bug for us.

Or Gerlitz Sept. 17, 2015, 10:39 a.m. UTC | #25
On 9/17/2015 3:48 AM, Christoph Lameter wrote:
> On Wed, 16 Sep 2015, Or Gerlitz wrote:
>
>> Could you please post here a few (say 2-4) liner summary of what is
>> still missing or done wrong in 4.3-rc1 and what is your suggestion how
>> to resolve that.
> With Doug's patch here the only thing that is left to be done is to
> properly leave the multicast group. And it seems that Erez's patch does just that.

sent it out now

> And then there are the 20 other things that I have pending with Mellanox
> but those are different issues that do not belong here. This one is a
> critical bug for us.
>

so 19 left?

Or.

Patch

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 09a1748f9d13..2d95b8ae379b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -47,6 +47,11 @@ 
 
 #include "ipoib.h"
 
+static bool __read_mostly mcast_auto_create;
+
+module_param(mcast_auto_create, bool, 0644);
+MODULE_PARM_DESC(mcast_auto_create, "Should multicast sends auto-create the IB multicast group? (Default: false)");
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 static int mcast_debug_level;
 
@@ -514,9 +519,34 @@  static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast)
 		 * detect if there are full members or not. A major problem
 		 * with supporting SEND ONLY is detecting when the group is
 		 * auto-destroyed as IPoIB will cache the MLID..
+		 *
+		 * An additional problem is that if we auto-create the IB
+		 * mcast group in response to a send-only action, then we
+		 * will be the creating entity, but we will not have any
+		 * mechanism by which we will track when we should leave
+		 * the group ourselves.  We will occasionally leave and
+		 * re-join the group when these events occur:
+		 *
+		 * 1) ifdown/ifup
+		 * 2) a regular mcast join/leave happens and we run
+		 *    ipoib_mcast_restart_task
+		 * 3) a REREGISTER event comes in from the SM
+		 * 4) any other event that might cause a mcast flush
+		 *
+		 * However, these events are not deterministic and we can
+		 * leave unused groups subscribed for long periods of time.
+		 * In addition, since the core IB layer does not yet support
+		 * send-only IB joins, we have to do a regular join and then
+		 * simply never attach a QP to listen to the incoming data.
+		 * This means that phantom, wasted data will end up coming
+		 * across our inbound physical link only to be thrown away
+		 * by the multicast dispatch mechanism on the card or in
+		 * the kernel driver.  For these reasons, we default to not
+		 * auto creating groups for send-only multicast operations.
 		 */
 #if 1
-		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) &&
+		    !mcast_auto_create)
 			comp_mask &= ~IB_SA_MCMEMBER_REC_TRAFFIC_CLASS;
 #else
 		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))