[1/9] IB/ipoib: factor out ah flushing

Message ID	b06eb720c2f654f5ecdb72c66f4e89149d1c24ec.1424562072.git.dledford@redhat.com (mailing list archive)
State	Rejected
Headers	show Return-Path: <linux-rdma-owner@kernel.org> From: Doug Ledford <dledford@redhat.com> To: linux-rdma@vger.kernel.org, roland@kernel.org Cc: Or Gerlitz <gerlitz.or@gmail.com>, Erez Shitrit <erezsh@mellanox.com>, Doug Ledford <dledford@redhat.com> Subject: [PATCH 1/9] IB/ipoib: factor out ah flushing Date: Sat, 21 Feb 2015 19:26:59 -0500 Message-Id: <b06eb720c2f654f5ecdb72c66f4e89149d1c24ec.1424562072.git.dledford@redhat.com> In-Reply-To: <cover.1424562072.git.dledford@redhat.com> References: <cover.1424562072.git.dledford@redhat.com> In-Reply-To: <cover.1424562072.git.dledford@redhat.com> References: <cover.1424562072.git.dledford@redhat.com> Sender: linux-rdma-owner@vger.kernel.org Precedence: bulk

Doug Ledford Feb. 22, 2015, 12:26 a.m. UTC

Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at
appropriate times to flush out all remaining ah entries before we shut
the device down.

Because neighbors and mcast entries can each have a reference on any
given ah, we must make sure to free all of those first before our ah
will actually have a 0 refcount and be able to be reaped.

This factoring is needed in preparation for having per-device work
queues.  The original per-device workqueue code resulted in the following
error message:

<ibdev>: ib_dealloc_pd failed

That error was tracked down to this issue.  With the changes to which
workqueues were flushed when, there were no flushes of the per device
workqueue after the last ah's were freed, resulting in an attempt to
dealloc the pd with outstanding resources still allocated.  This code
puts the explicit flushes in the needed places to avoid that problem.

Signed-off-by: Doug Ledford <dledford@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 ++++++++++++++++++++-------------
 1 file changed, 28 insertions(+), 18 deletions(-)

Erez Shitrit Feb. 26, 2015, 1:28 p.m. UTC | #1

On 2/22/2015 2:26 AM, Doug Ledford wrote:
> Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at
> appropriate times to flush out all remaining ah entries before we shut
> the device down.
>
> Because neighbors and mcast entries can each have a reference on any
> given ah, we must make sure to free all of those first before our ah
> will actually have a 0 refcount and be able to be reaped.
>
> This factoring is needed in preparation for having per-device work
> queues.  The original per-device workqueue code resulted in the following
> error message:
>
> <ibdev>: ib_dealloc_pd failed
>
> That error was tracked down to this issue.  With the changes to which
> workqueues were flushed when, there were no flushes of the per device
> workqueue after the last ah's were freed, resulting in an attempt to
> dealloc the pd with outstanding resources still allocated.  This code
> puts the explicit flushes in the needed places to avoid that problem.
>
> Signed-off-by: Doug Ledford <dledford@redhat.com>
> ---
>   drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 ++++++++++++++++++++-------------
>   1 file changed, 28 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> index 72626c34817..cb02466a0eb 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> @@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work)
>   				   round_jiffies_relative(HZ));
>   }
>   
> +static void ipoib_flush_ah(struct net_device *dev, int flush)
> +{
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +
> +	cancel_delayed_work(&priv->ah_reap_task);
> +	if (flush)
> +		flush_workqueue(ipoib_workqueue);
> +	ipoib_reap_ah(&priv->ah_reap_task.work);
> +}
> +
> +static void ipoib_stop_ah(struct net_device *dev, int flush)
> +{
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +
> +	set_bit(IPOIB_STOP_REAPER, &priv->flags);
> +	ipoib_flush_ah(dev, flush);
> +}
> +
>   static void ipoib_ib_tx_timer_func(unsigned long ctx)
>   {
>   	drain_tx_cq((struct net_device *)ctx);
> @@ -877,24 +895,7 @@ timeout:
>   	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
>   		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
>   
> -	/* Wait for all AHs to be reaped */
> -	set_bit(IPOIB_STOP_REAPER, &priv->flags);
> -	cancel_delayed_work(&priv->ah_reap_task);
> -	if (flush)
> -		flush_workqueue(ipoib_workqueue);
> -
> -	begin = jiffies;
> -
> -	while (!list_empty(&priv->dead_ahs)) {
> -		__ipoib_reap_ah(dev);
> -
> -		if (time_after(jiffies, begin + HZ)) {
> -			ipoib_warn(priv, "timing out; will leak address handles\n");
> -			break;
> -		}
> -
> -		msleep(1);
> -	}
> +	ipoib_flush_ah(dev, flush);
>   
>   	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
>   
> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
>   	if (level == IPOIB_FLUSH_LIGHT) {
>   		ipoib_mark_paths_invalid(dev);
>   		ipoib_mcast_dev_flush(dev);
> +		ipoib_flush_ah(dev, 0);

Why do you need to call the flush function here?
I can't see the reason to use the flush not from the stop_ah, meaning 
without setting the IPOIB_STOP_REAPER, the flush can send twice the same 
work.

>   	}
>   
>   	if (level >= IPOIB_FLUSH_NORMAL)
> @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
>   	ipoib_mcast_stop_thread(dev, 1);
>   	ipoib_mcast_dev_flush(dev);
>   
> +	/*
> +	 * All of our ah references aren't free until after
> +	 * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
> +	 * the neighbor garbage collection is stopped and reaped.
> +	 * That should all be done now, so make a final ah flush.
> +	 */
> +	ipoib_stop_ah(dev, 1);
> +
>   	ipoib_transport_dev_cleanup(dev);
>   }
>   

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Doug Ledford Feb. 26, 2015, 4:27 p.m. UTC | #2

On Thu, 2015-02-26 at 15:28 +0200, Erez Shitrit wrote:
> On 2/22/2015 2:26 AM, Doug Ledford wrote:
> > Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at
> > appropriate times to flush out all remaining ah entries before we shut
> > the device down.
> >
> > Because neighbors and mcast entries can each have a reference on any
> > given ah, we must make sure to free all of those first before our ah
> > will actually have a 0 refcount and be able to be reaped.
> >
> > This factoring is needed in preparation for having per-device work
> > queues.  The original per-device workqueue code resulted in the following
> > error message:
> >
> > <ibdev>: ib_dealloc_pd failed
> >
> > That error was tracked down to this issue.  With the changes to which
> > workqueues were flushed when, there were no flushes of the per device
> > workqueue after the last ah's were freed, resulting in an attempt to
> > dealloc the pd with outstanding resources still allocated.  This code
> > puts the explicit flushes in the needed places to avoid that problem.
> >
> > Signed-off-by: Doug Ledford <dledford@redhat.com>
> > ---
> >   drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 ++++++++++++++++++++-------------
> >   1 file changed, 28 insertions(+), 18 deletions(-)
> >
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> > index 72626c34817..cb02466a0eb 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> > @@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work)
> >   				   round_jiffies_relative(HZ));
> >   }
> >   
> > +static void ipoib_flush_ah(struct net_device *dev, int flush)
> > +{
> > +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> > +
> > +	cancel_delayed_work(&priv->ah_reap_task);
> > +	if (flush)
> > +		flush_workqueue(ipoib_workqueue);
> > +	ipoib_reap_ah(&priv->ah_reap_task.work);
> > +}
> > +
> > +static void ipoib_stop_ah(struct net_device *dev, int flush)
> > +{
> > +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> > +
> > +	set_bit(IPOIB_STOP_REAPER, &priv->flags);
> > +	ipoib_flush_ah(dev, flush);
> > +}
> > +
> >   static void ipoib_ib_tx_timer_func(unsigned long ctx)
> >   {
> >   	drain_tx_cq((struct net_device *)ctx);
> > @@ -877,24 +895,7 @@ timeout:
> >   	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
> >   		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
> >   
> > -	/* Wait for all AHs to be reaped */
> > -	set_bit(IPOIB_STOP_REAPER, &priv->flags);
> > -	cancel_delayed_work(&priv->ah_reap_task);
> > -	if (flush)
> > -		flush_workqueue(ipoib_workqueue);
> > -
> > -	begin = jiffies;
> > -
> > -	while (!list_empty(&priv->dead_ahs)) {
> > -		__ipoib_reap_ah(dev);
> > -
> > -		if (time_after(jiffies, begin + HZ)) {
> > -			ipoib_warn(priv, "timing out; will leak address handles\n");
> > -			break;
> > -		}
> > -
> > -		msleep(1);
> > -	}
> > +	ipoib_flush_ah(dev, flush);
> >   
> >   	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
> >   
> > @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
> >   	if (level == IPOIB_FLUSH_LIGHT) {
> >   		ipoib_mark_paths_invalid(dev);
> >   		ipoib_mcast_dev_flush(dev);
> > +		ipoib_flush_ah(dev, 0);
> 
> Why do you need to call the flush function here?

To remove all of the ah's that were reduced to a 0 refcount by the
previous two functions prior to restarting operations.  When we remove
an ah, it calls ib_destroy_ah which calls all the way down into the low
level driver.  This was to make sure that old, stale data was removed
all the way down to the card level before we started new queries for
paths and ahs.

> I can't see the reason to use the flush not from the stop_ah, meaning 
> without setting the IPOIB_STOP_REAPER, the flush can send twice the same 
> work.

No, it can't.  The ah flush routine does not search through ahs to find
ones to flush.  When you delete neighbors and mcasts, they release their
references to ahs.  When the refcount goes to 0, the put routine puts
the ah on the to-be-deleted ah list.  All the flush does is take that
list and delete the items.  If you run the flush twice, the first run
deletes all the items on the to-be-deleted list, the second run sees an
empty list and does nothing.

As for using flush versus stop: the flush function cancels any delayed
ah_flush work so that it isn't racing with the normally scheduled
ah_flush, then flushes the workqueue to make sure anything that might
result in an ah getting freed is done, then flushes, then schedules a
new delayed flush_ah work 1 second later.  So, it does exactly what a
flush should do: it removes what there is currently to remove, and in
the case of a periodically scheduled garbage collection, schedules a new
periodic flush at the maximum interval.

It is not appropriate to call stop_ah at this point because it will
cancel the delayed work, flush the ahs, then never reschedule the
garbage collection.  If we called stop here, we would have to call start
later.  But that's not really necessary as the flush cancels the
scheduled work and reschedules it for a second later.

> >   	}
> >   
> >   	if (level >= IPOIB_FLUSH_NORMAL)
> > @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
> >   	ipoib_mcast_stop_thread(dev, 1);
> >   	ipoib_mcast_dev_flush(dev);
> >   
> > +	/*
> > +	 * All of our ah references aren't free until after
> > +	 * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
> > +	 * the neighbor garbage collection is stopped and reaped.
> > +	 * That should all be done now, so make a final ah flush.
> > +	 */
> > +	ipoib_stop_ah(dev, 1);
> > +
> >   	ipoib_transport_dev_cleanup(dev);
> >   }
> >   
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Erez Shitrit March 1, 2015, 6:47 a.m. UTC | #3

On 2/26/2015 6:27 PM, Doug Ledford wrote:
> On Thu, 2015-02-26 at 15:28 +0200, Erez Shitrit wrote:
>> On 2/22/2015 2:26 AM, Doug Ledford wrote:
>>> Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at
>>> appropriate times to flush out all remaining ah entries before we shut
>>> the device down.
>>>
>>> Because neighbors and mcast entries can each have a reference on any
>>> given ah, we must make sure to free all of those first before our ah
>>> will actually have a 0 refcount and be able to be reaped.
>>>
>>> This factoring is needed in preparation for having per-device work
>>> queues.  The original per-device workqueue code resulted in the following
>>> error message:
>>>
>>> <ibdev>: ib_dealloc_pd failed
>>>
>>> That error was tracked down to this issue.  With the changes to which
>>> workqueues were flushed when, there were no flushes of the per device
>>> workqueue after the last ah's were freed, resulting in an attempt to
>>> dealloc the pd with outstanding resources still allocated.  This code
>>> puts the explicit flushes in the needed places to avoid that problem.
>>>
>>> Signed-off-by: Doug Ledford <dledford@redhat.com>
>>> ---
>>>    drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 ++++++++++++++++++++-------------
>>>    1 file changed, 28 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
>>> index 72626c34817..cb02466a0eb 100644
>>> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
>>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
>>> @@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work)
>>>    				   round_jiffies_relative(HZ));
>>>    }
>>>    
>>> +static void ipoib_flush_ah(struct net_device *dev, int flush)
>>> +{
>>> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
>>> +
>>> +	cancel_delayed_work(&priv->ah_reap_task);
>>> +	if (flush)
>>> +		flush_workqueue(ipoib_workqueue);
>>> +	ipoib_reap_ah(&priv->ah_reap_task.work);
>>> +}
>>> +
>>> +static void ipoib_stop_ah(struct net_device *dev, int flush)
>>> +{
>>> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
>>> +
>>> +	set_bit(IPOIB_STOP_REAPER, &priv->flags);
>>> +	ipoib_flush_ah(dev, flush);
>>> +}
>>> +
>>>    static void ipoib_ib_tx_timer_func(unsigned long ctx)
>>>    {
>>>    	drain_tx_cq((struct net_device *)ctx);
>>> @@ -877,24 +895,7 @@ timeout:
>>>    	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
>>>    		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
>>>    
>>> -	/* Wait for all AHs to be reaped */
>>> -	set_bit(IPOIB_STOP_REAPER, &priv->flags);
>>> -	cancel_delayed_work(&priv->ah_reap_task);
>>> -	if (flush)
>>> -		flush_workqueue(ipoib_workqueue);
>>> -
>>> -	begin = jiffies;
>>> -
>>> -	while (!list_empty(&priv->dead_ahs)) {
>>> -		__ipoib_reap_ah(dev);
>>> -
>>> -		if (time_after(jiffies, begin + HZ)) {
>>> -			ipoib_warn(priv, "timing out; will leak address handles\n");
>>> -			break;
>>> -		}
>>> -
>>> -		msleep(1);
>>> -	}
>>> +	ipoib_flush_ah(dev, flush);
>>>    
>>>    	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
>>>    
>>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
>>>    	if (level == IPOIB_FLUSH_LIGHT) {
>>>    		ipoib_mark_paths_invalid(dev);
>>>    		ipoib_mcast_dev_flush(dev);
>>> +		ipoib_flush_ah(dev, 0);
>> Why do you need to call the flush function here?
> To remove all of the ah's that were reduced to a 0 refcount by the
> previous two functions prior to restarting operations.  When we remove
> an ah, it calls ib_destroy_ah which calls all the way down into the low
> level driver.  This was to make sure that old, stale data was removed
> all the way down to the card level before we started new queries for
> paths and ahs.

Yes. but it is not needed.
The bug happened when the driver was removed (via modprobe -r etc.), and 
there were ah's in the dead_ah list, that was fixed by you in the 
function ipoib_ib_dev_cleanup, the call that you added here is not 
relevant to the bug (and IMHO is not needed at all)
So, the the task of cleaning the dead_ah is already there, no need to 
recall it, it will be called anyway 1 sec at the most from now.

You can try that, take of that call, no harm or memory leak will happened.

>> I can't see the reason to use the flush not from the stop_ah, meaning
>> without setting the IPOIB_STOP_REAPER, the flush can send twice the same
>> work.
> No, it can't.  The ah flush routine does not search through ahs to find
> ones to flush.  When you delete neighbors and mcasts, they release their
> references to ahs.  When the refcount goes to 0, the put routine puts
> the ah on the to-be-deleted ah list.  All the flush does is take that
> list and delete the items.  If you run the flush twice, the first run
> deletes all the items on the to-be-deleted list, the second run sees an
> empty list and does nothing.
>
> As for using flush versus stop: the flush function cancels any delayed
> ah_flush work so that it isn't racing with the normally scheduled

when you call cancel_delayed_work to work that can schedule itself, it 
is not help, the work can be at the middle of the run and re-schedule 
itself...


> ah_flush, then flushes the workqueue to make sure anything that might
> result in an ah getting freed is done, then flushes, then schedules a
> new delayed flush_ah work 1 second later.  So, it does exactly what a
> flush should do: it removes what there is currently to remove, and in
> the case of a periodically scheduled garbage collection, schedules a new
> periodic flush at the maximum interval.
>
> It is not appropriate to call stop_ah at this point because it will
> cancel the delayed work, flush the ahs, then never reschedule the
> garbage collection.  If we called stop here, we would have to call start
> later.  But that's not really necessary as the flush cancels the
> scheduled work and reschedules it for a second later.
>
>>>    	}
>>>    
>>>    	if (level >= IPOIB_FLUSH_NORMAL)
>>> @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
>>>    	ipoib_mcast_stop_thread(dev, 1);
>>>    	ipoib_mcast_dev_flush(dev);
>>>    
>>> +	/*
>>> +	 * All of our ah references aren't free until after
>>> +	 * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
>>> +	 * the neighbor garbage collection is stopped and reaped.
>>> +	 * That should all be done now, so make a final ah flush.
>>> +	 */
>>> +	ipoib_stop_ah(dev, 1);
>>> +
>>>    	ipoib_transport_dev_cleanup(dev);
>>>    }
>>>    
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Doug Ledford March 2, 2015, 3:09 p.m. UTC | #4

On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
> On 2/26/2015 6:27 PM, Doug Ledford wrote:
> >
> >>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
> >>>    	if (level == IPOIB_FLUSH_LIGHT) {
> >>>    		ipoib_mark_paths_invalid(dev);
> >>>    		ipoib_mcast_dev_flush(dev);
> >>> +		ipoib_flush_ah(dev, 0);
> >> Why do you need to call the flush function here?
> > To remove all of the ah's that were reduced to a 0 refcount by the
> > previous two functions prior to restarting operations.  When we remove
> > an ah, it calls ib_destroy_ah which calls all the way down into the low
> > level driver.  This was to make sure that old, stale data was removed
> > all the way down to the card level before we started new queries for
> > paths and ahs.
> 
> Yes. but it is not needed.

That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
isn't needed but doesn't hurt either.  For older cards (in particular,
mthca), the driver actually frees up card resources at the time of the
call.

> The bug happened when the driver was removed (via modprobe -r etc.), and 
> there were ah's in the dead_ah list, that was fixed by you in the 
> function ipoib_ib_dev_cleanup, the call that you added here is not 
> relevant to the bug (and IMHO is not needed at all)

I never said that this hunk was part of the original bug I saw before.

> So, the the task of cleaning the dead_ah is already there, no need to 
> recall it, it will be called anyway 1 sec at the most from now.
> 
> You can try that, take of that call, no harm or memory leak will happened.

I have no doubt that it will get freed later.  As I said, I never
considered this particular hunk part of that original bug.  But, as I
point out above, there is no harm in it for any hardware, and depending
on hardware it can help to make sure there isn't a shortage of
resources.  Given that fact, I see no reason to remove it.

> >> I can't see the reason to use the flush not from the stop_ah, meaning
> >> without setting the IPOIB_STOP_REAPER, the flush can send twice the same
> >> work.
> > No, it can't.  The ah flush routine does not search through ahs to find
> > ones to flush.  When you delete neighbors and mcasts, they release their
> > references to ahs.  When the refcount goes to 0, the put routine puts
> > the ah on the to-be-deleted ah list.  All the flush does is take that
> > list and delete the items.  If you run the flush twice, the first run
> > deletes all the items on the to-be-deleted list, the second run sees an
> > empty list and does nothing.
> >
> > As for using flush versus stop: the flush function cancels any delayed
> > ah_flush work so that it isn't racing with the normally scheduled
> 
> when you call cancel_delayed_work to work that can schedule itself, it 
> is not help, the work can be at the middle of the run and re-schedule 
> itself...

If it is in the middle of a run and reschedules itself, then it will
schedule itself at precisely the same time we would have anyway, and we
will get flushed properly, so the net result of this particular race is
that we end up doing exactly what we wanted to do anyway.

> 
> > ah_flush, then flushes the workqueue to make sure anything that might
> > result in an ah getting freed is done, then flushes, then schedules a
> > new delayed flush_ah work 1 second later.  So, it does exactly what a
> > flush should do: it removes what there is currently to remove, and in
> > the case of a periodically scheduled garbage collection, schedules a new
> > periodic flush at the maximum interval.
> >
> > It is not appropriate to call stop_ah at this point because it will
> > cancel the delayed work, flush the ahs, then never reschedule the
> > garbage collection.  If we called stop here, we would have to call start
> > later.  But that's not really necessary as the flush cancels the
> > scheduled work and reschedules it for a second later.
> >
> >>>    	}
> >>>    
> >>>    	if (level >= IPOIB_FLUSH_NORMAL)
> >>> @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
> >>>    	ipoib_mcast_stop_thread(dev, 1);
> >>>    	ipoib_mcast_dev_flush(dev);
> >>>    
> >>> +	/*
> >>> +	 * All of our ah references aren't free until after
> >>> +	 * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
> >>> +	 * the neighbor garbage collection is stopped and reaped.
> >>> +	 * That should all be done now, so make a final ah flush.
> >>> +	 */
> >>> +	ipoib_stop_ah(dev, 1);
> >>> +
> >>>    	ipoib_transport_dev_cleanup(dev);
> >>>    }
> >>>    
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Erez Shitrit March 3, 2015, 9:59 a.m. UTC | #5

On 3/2/2015 5:09 PM, Doug Ledford wrote:
> On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
>> On 2/26/2015 6:27 PM, Doug Ledford wrote:
>>>>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
>>>>>     	if (level == IPOIB_FLUSH_LIGHT) {
>>>>>     		ipoib_mark_paths_invalid(dev);
>>>>>     		ipoib_mcast_dev_flush(dev);
>>>>> +		ipoib_flush_ah(dev, 0);
>>>> Why do you need to call the flush function here?
>>> To remove all of the ah's that were reduced to a 0 refcount by the
>>> previous two functions prior to restarting operations.  When we remove
>>> an ah, it calls ib_destroy_ah which calls all the way down into the low
>>> level driver.  This was to make sure that old, stale data was removed
>>> all the way down to the card level before we started new queries for
>>> paths and ahs.
>> Yes. but it is not needed.
> That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
> isn't needed but doesn't hurt either.  For older cards (in particular,
> mthca), the driver actually frees up card resources at the time of the
> call.

Can you please elaborate more here, I took a look in the mthca and 
didn't see that.
anyway, what i don't understand is why you need to do that now, the ah 
is already in the dead_ah_list so, in at the most 1 sec will be cleared 
and if the driver goes down your other hunk fixed that.

>> The bug happened when the driver was removed (via modprobe -r etc.), and
>> there were ah's in the dead_ah list, that was fixed by you in the
>> function ipoib_ib_dev_cleanup, the call that you added here is not
>> relevant to the bug (and IMHO is not needed at all)
> I never said that this hunk was part of the original bug I saw before.
>
>> So, the the task of cleaning the dead_ah is already there, no need to
>> recall it, it will be called anyway 1 sec at the most from now.
>>
>> You can try that, take of that call, no harm or memory leak will happened.
> I have no doubt that it will get freed later.  As I said, I never
> considered this particular hunk part of that original bug.  But, as I
> point out above, there is no harm in it for any hardware, and depending
> on hardware it can help to make sure there isn't a shortage of
> resources.  Given that fact, I see no reason to remove it.
>
>>>> I can't see the reason to use the flush not from the stop_ah, meaning
>>>> without setting the IPOIB_STOP_REAPER, the flush can send twice the same
>>>> work.
>>> No, it can't.  The ah flush routine does not search through ahs to find
>>> ones to flush.  When you delete neighbors and mcasts, they release their
>>> references to ahs.  When the refcount goes to 0, the put routine puts
>>> the ah on the to-be-deleted ah list.  All the flush does is take that
>>> list and delete the items.  If you run the flush twice, the first run
>>> deletes all the items on the to-be-deleted list, the second run sees an
>>> empty list and does nothing.
>>>
>>> As for using flush versus stop: the flush function cancels any delayed
>>> ah_flush work so that it isn't racing with the normally scheduled
>> when you call cancel_delayed_work to work that can schedule itself, it
>> is not help, the work can be at the middle of the run and re-schedule
>> itself...
> If it is in the middle of a run and reschedules itself, then it will
> schedule itself at precisely the same time we would have anyway, and we
> will get flushed properly, so the net result of this particular race is
> that we end up doing exactly what we wanted to do anyway.
>
>>> ah_flush, then flushes the workqueue to make sure anything that might
>>> result in an ah getting freed is done, then flushes, then schedules a
>>> new delayed flush_ah work 1 second later.  So, it does exactly what a
>>> flush should do: it removes what there is currently to remove, and in
>>> the case of a periodically scheduled garbage collection, schedules a new
>>> periodic flush at the maximum interval.
>>>
>>> It is not appropriate to call stop_ah at this point because it will
>>> cancel the delayed work, flush the ahs, then never reschedule the
>>> garbage collection.  If we called stop here, we would have to call start
>>> later.  But that's not really necessary as the flush cancels the
>>> scheduled work and reschedules it for a second later.
>>>
>>>>>     	}
>>>>>     
>>>>>     	if (level >= IPOIB_FLUSH_NORMAL)
>>>>> @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
>>>>>     	ipoib_mcast_stop_thread(dev, 1);
>>>>>     	ipoib_mcast_dev_flush(dev);
>>>>>     
>>>>> +	/*
>>>>> +	 * All of our ah references aren't free until after
>>>>> +	 * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
>>>>> +	 * the neighbor garbage collection is stopped and reaped.
>>>>> +	 * That should all be done now, so make a final ah flush.
>>>>> +	 */
>>>>> +	ipoib_stop_ah(dev, 1);
>>>>> +
>>>>>     	ipoib_transport_dev_cleanup(dev);
>>>>>     }
>>>>>     
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Or Gerlitz March 13, 2015, 8:39 a.m. UTC | #6

On Tue, Mar 3, 2015 at 11:59 AM, Erez Shitrit <erezsh@dev.mellanox.co.il> wrote:
> On 3/2/2015 5:09 PM, Doug Ledford wrote:
>>
>> On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
>>>
>>> On 2/26/2015 6:27 PM, Doug Ledford wrote:
>>>>>>
>>>>>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct
>>>>>> ipoib_dev_priv *priv,
>>>>>>         if (level == IPOIB_FLUSH_LIGHT) {
>>>>>>                 ipoib_mark_paths_invalid(dev);
>>>>>>                 ipoib_mcast_dev_flush(dev);
>>>>>> +               ipoib_flush_ah(dev, 0);
>>>>>
>>>>> Why do you need to call the flush function here?
>>>>
>>>> To remove all of the ah's that were reduced to a 0 refcount by the
>>>> previous two functions prior to restarting operations.  When we remove
>>>> an ah, it calls ib_destroy_ah which calls all the way down into the low
>>>> level driver.  This was to make sure that old, stale data was removed
>>>> all the way down to the card level before we started new queries for
>>>> paths and ahs.
>>>
>>> Yes. but it is not needed.
>>
>> That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
>> isn't needed but doesn't hurt either.  For older cards (in particular,
>> mthca), the driver actually frees up card resources at the time of the
>> call.
>
>
> Can you please elaborate more here, I took a look in the mthca and didn't
> see that.
> anyway, what i don't understand is why you need to do that now, the ah is
> already in the dead_ah_list so, in at the most 1 sec will be cleared and if
> the driver goes down your other hunk fixed that.

Doug, ten days and no response from you... lets finalize the review on
this series so we have it safely done for 4.1 -- on top of it Erez
prepares a set of IPoIB fixes from our internal tree and we want that
for 4.1 too. Please address.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Doug Ledford March 15, 2015, 6:42 p.m. UTC | #7

> On Mar 13, 2015, at 1:39 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> 
> On Tue, Mar 3, 2015 at 11:59 AM, Erez Shitrit <erezsh@dev.mellanox.co.il> wrote:
>> On 3/2/2015 5:09 PM, Doug Ledford wrote:
>>> 
>>> On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
>>>> 
>>>> On 2/26/2015 6:27 PM, Doug Ledford wrote:
>>>>>>> 
>>>>>>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct
>>>>>>> ipoib_dev_priv *priv,
>>>>>>>        if (level == IPOIB_FLUSH_LIGHT) {
>>>>>>>                ipoib_mark_paths_invalid(dev);
>>>>>>>                ipoib_mcast_dev_flush(dev);
>>>>>>> +               ipoib_flush_ah(dev, 0);
>>>>>> 
>>>>>> Why do you need to call the flush function here?
>>>>> 
>>>>> To remove all of the ah's that were reduced to a 0 refcount by the
>>>>> previous two functions prior to restarting operations.  When we remove
>>>>> an ah, it calls ib_destroy_ah which calls all the way down into the low
>>>>> level driver.  This was to make sure that old, stale data was removed
>>>>> all the way down to the card level before we started new queries for
>>>>> paths and ahs.
>>>> 
>>>> Yes. but it is not needed.
>>> 
>>> That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
>>> isn't needed but doesn't hurt either.  For older cards (in particular,
>>> mthca), the driver actually frees up card resources at the time of the
>>> call.
>> 
>> 
>> Can you please elaborate more here, I took a look in the mthca and didn't
>> see that.
>> anyway, what i don't understand is why you need to do that now, the ah is
>> already in the dead_ah_list so, in at the most 1 sec will be cleared and if
>> the driver goes down your other hunk fixed that.
> 
> Doug, ten days and no response from you... lets finalize the review on
> this series so we have it safely done for 4.1 -- on top of it Erez
> prepares a set of IPoIB fixes from our internal tree and we want that
> for 4.1 too. Please address.

I didn’t have much to say here.  I said that mthca can have card resources freed by this call, which is backed up by this code in mthca_ah.c

int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
{
        switch (ah->type) {
        case MTHCA_AH_ON_HCA:
                mthca_free(&dev->av_table.alloc,
                           (ah->avdma - dev->av_table.ddr_av_base) /
                           MTHCA_AV_SIZE);
                break;


I’m not entirely sure how Erez missed that, but it’s there and it’s what gets called when we destroy an ah (depending on the card of course).  So, that represents one case where freeing the resources in a non-lazy fashion has a direct benefit.  And there is no cited drawback to freeing the resources in a non-lazy fashion on a net event, so I don’t see what there is to discuss further on the issue.

—
Doug Ledford <dledford@redhat.com>
	GPG Key ID: 0E572FDD

Erez Shitrit March 16, 2015, 3:24 p.m. UTC | #8

On 3/15/2015 8:42 PM, Doug Ledford wrote:
>> On Mar 13, 2015, at 1:39 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>
>> On Tue, Mar 3, 2015 at 11:59 AM, Erez Shitrit <erezsh@dev.mellanox.co.il> wrote:
>>> On 3/2/2015 5:09 PM, Doug Ledford wrote:
>>>> On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
>>>>> On 2/26/2015 6:27 PM, Doug Ledford wrote:
>>>>>>>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct
>>>>>>>> ipoib_dev_priv *priv,
>>>>>>>>         if (level == IPOIB_FLUSH_LIGHT) {
>>>>>>>>                 ipoib_mark_paths_invalid(dev);
>>>>>>>>                 ipoib_mcast_dev_flush(dev);
>>>>>>>> +               ipoib_flush_ah(dev, 0);
>>>>>>> Why do you need to call the flush function here?
>>>>>> To remove all of the ah's that were reduced to a 0 refcount by the
>>>>>> previous two functions prior to restarting operations.  When we remove
>>>>>> an ah, it calls ib_destroy_ah which calls all the way down into the low
>>>>>> level driver.  This was to make sure that old, stale data was removed
>>>>>> all the way down to the card level before we started new queries for
>>>>>> paths and ahs.
>>>>> Yes. but it is not needed.
>>>> That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
>>>> isn't needed but doesn't hurt either.  For older cards (in particular,
>>>> mthca), the driver actually frees up card resources at the time of the
>>>> call.
>>>
>>> Can you please elaborate more here, I took a look in the mthca and didn't
>>> see that.
>>> anyway, what i don't understand is why you need to do that now, the ah is
>>> already in the dead_ah_list so, in at the most 1 sec will be cleared and if
>>> the driver goes down your other hunk fixed that.
>> Doug, ten days and no response from you... lets finalize the review on
>> this series so we have it safely done for 4.1 -- on top of it Erez
>> prepares a set of IPoIB fixes from our internal tree and we want that
>> for 4.1 too. Please address.
> I didn’t have much to say here.  I said that mthca can have card resources freed by this call, which is backed up by this code in mthca_ah.c
>
> int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
> {
>          switch (ah->type) {
>          case MTHCA_AH_ON_HCA:
>                  mthca_free(&dev->av_table.alloc,
>                             (ah->avdma - dev->av_table.ddr_av_base) /
>                             MTHCA_AV_SIZE);
>                  break;
>
>
> I’m not entirely sure how Erez missed that, but it’s there and it’s what gets called when we destroy an ah (depending on the card of course).  So, that represents one case where freeing the resources in a non-lazy fashion has a direct benefit.  And there is no cited drawback to freeing the resources in a non-lazy fashion on a net event, so I don’t see what there is to discuss further on the issue.
sorry, but i still don't see the connection to the device type.
It will be deleted/freed with the regular flow, like it does in the rest 
of the life cycle cases of the ah (in neigh_dtor, path_free, etc.), so 
why here it should be directly after the event?

> —
> Doug Ledford <dledford@redhat.com>
> 	GPG Key ID: 0E572FDD
>
>
>
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Doug Ledford March 16, 2015, 4:06 p.m. UTC | #9

> On Mar 16, 2015, at 8:24 AM, Erez Shitrit <erezsh@dev.mellanox.co.il> wrote:
> 
> On 3/15/2015 8:42 PM, Doug Ledford wrote:
>> 
>>> Doug, ten days and no response from you... lets finalize the review on
>>> this series so we have it safely done for 4.1 -- on top of it Erez
>>> prepares a set of IPoIB fixes from our internal tree and we want that
>>> for 4.1 too. Please address.
>> I didn’t have much to say here.  I said that mthca can have card resources freed by this call, which is backed up by this code in mthca_ah.c
>> 
>> int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
>> {
>>         switch (ah->type) {
>>         case MTHCA_AH_ON_HCA:
>>                 mthca_free(&dev->av_table.alloc,
>>                            (ah->avdma - dev->av_table.ddr_av_base) /
>>                            MTHCA_AV_SIZE);
>>                 break;
>> 
>> 
>> I’m not entirely sure how Erez missed that, but it’s there and it’s what gets called when we destroy an ah (depending on the card of course).  So, that represents one case where freeing the resources in a non-lazy fashion has a direct benefit.  And there is no cited drawback to freeing the resources in a non-lazy fashion on a net event, so I don’t see what there is to discuss further on the issue.
> sorry, but i still don't see the connection to the device type.
> It will be deleted/freed with the regular flow, like it does in the rest of the life cycle cases of the ah (in neigh_dtor, path_free, etc.), so why here it should be directly after the event?

Because it’s the right thing to do.  The only reason to do lazy deletion is when there is a performance benefit to batching.  There is no performance benefit to batching here.  And because on certain hardware the action frees resources on the card, which are limited, doing non-lazy deletion can be beneficial.  Given that there is no downside to doing the deletions in a non-lazy fashion, and that there can be an upside depending on hardware, there is no reason to stick with the lazy deletions.

—
Doug Ledford <dledford@redhat.com>
	GPG Key ID: 0E572FDD

Erez Shitrit March 16, 2015, 4:51 p.m. UTC | #10

On 3/16/2015 6:06 PM, Doug Ledford wrote:
>> On Mar 16, 2015, at 8:24 AM, Erez Shitrit <erezsh@dev.mellanox.co.il> wrote:
>>
>> On 3/15/2015 8:42 PM, Doug Ledford wrote:
>>>> Doug, ten days and no response from you... lets finalize the review on
>>>> this series so we have it safely done for 4.1 -- on top of it Erez
>>>> prepares a set of IPoIB fixes from our internal tree and we want that
>>>> for 4.1 too. Please address.
>>> I didn’t have much to say here.  I said that mthca can have card resources freed by this call, which is backed up by this code in mthca_ah.c
>>>
>>> int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
>>> {
>>>          switch (ah->type) {
>>>          case MTHCA_AH_ON_HCA:
>>>                  mthca_free(&dev->av_table.alloc,
>>>                             (ah->avdma - dev->av_table.ddr_av_base) /
>>>                             MTHCA_AV_SIZE);
>>>                  break;
>>>
>>>
>>> I’m not entirely sure how Erez missed that, but it’s there and it’s what gets called when we destroy an ah (depending on the card of course).  So, that represents one case where freeing the resources in a non-lazy fashion has a direct benefit.  And there is no cited drawback to freeing the resources in a non-lazy fashion on a net event, so I don’t see what there is to discuss further on the issue.
>> sorry, but i still don't see the connection to the device type.
>> It will be deleted/freed with the regular flow, like it does in the rest of the life cycle cases of the ah (in neigh_dtor, path_free, etc.), so why here it should be directly after the event?
> Because it’s the right thing to do.  The only reason to do lazy deletion is when there is a performance benefit to batching.  There is no performance benefit to batching here.  And because on certain hardware the action frees resources on the card, which are limited, doing non-lazy deletion can be beneficial.  Given that there is no downside to doing the deletions in a non-lazy fashion, and that there can be an upside depending on hardware, there is no reason to stick with the lazy deletions.
OK, i understand your point, not sure why it is not always with the ah 
deletion, anyway it is harmless here.

> —
> Doug Ledford <dledford@redhat.com>
> 	GPG Key ID: 0E572FDD
>
>
>
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Doug Ledford March 16, 2015, 6 p.m. UTC | #11

> On Mar 16, 2015, at 8:24 AM, Erez Shitrit <erezsh@dev.mellanox.co.il> wrote:
> 
> On 3/15/2015 8:42 PM, Doug Ledford wrote:
>> 
>>> Doug, ten days and no response from you... lets finalize the review on
>>> this series so we have it safely done for 4.1 -- on top of it Erez
>>> prepares a set of IPoIB fixes from our internal tree and we want that
>>> for 4.1 too. Please address.
>> I didn’t have much to say here.  I said that mthca can have card resources freed by this call, which is backed up by this code in mthca_ah.c
>> 
>> int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
>> {
>>        switch (ah->type) {
>>        case MTHCA_AH_ON_HCA:
>>                mthca_free(&dev->av_table.alloc,
>>                           (ah->avdma - dev->av_table.ddr_av_base) /
>>                           MTHCA_AV_SIZE);
>>                break;
>> 
>> 
>> I’m not entirely sure how Erez missed that, but it’s there and it’s what gets called when we destroy an ah (depending on the card of course).  So, that represents one case where freeing the resources in a non-lazy fashion has a direct benefit.  And there is no cited drawback to freeing the resources in a non-lazy fashion on a net event, so I don’t see what there is to discuss further on the issue.
> sorry, but i still don't see the connection to the device type.
> It will be deleted/freed with the regular flow, like it does in the rest of the life cycle cases of the ah (in neigh_dtor, path_free, etc.), so why here it should be directly after the event?

Because it’s the right thing to do.  The only reason to do lazy deletion is when there is a performance benefit to batching.  There is no performance benefit to batching here.  And because on certain hardware the action frees resources on the card, which are limited, doing non-lazy deletion can be beneficial.  Given that there is no downside to doing the deletions in a non-lazy fashion, and that there can be an upside depending on hardware, there is no reason to stick with the lazy deletions.

—
Doug Ledford <dledford@redhat.com>
	GPG Key ID: 0E572FDD





--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[1/9] IB/ipoib: factor out ah flushing

Commit Message

Comments

Patch