diff mbox series

[RFC] mt76: fix tx hung regression on MT7630E

Message ID 1564143056-14610-1-git-send-email-sgruszka@redhat.com (mailing list archive)
State RFC
Delegated to: Felix Fietkau
Headers show
Series [RFC] mt76: fix tx hung regression on MT7630E | expand

Commit Message

Stanislaw Gruszka July 26, 2019, 12:10 p.m. UTC
Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet")
I can observe firmware hangs on MT7630E on station mode: tx stop
functioning after minor activity (rx keep working) and on module
unload device fail to stop with messages:

[ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop
[ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop

Loading module again results in failure to associate with AP.
Only machine power off / power on cycle can make device work again.

I have no idea why the commit caused F/W hangs. Most likely some proper
fix is needed of changing registers programming (or assuring proper order
of actions), but so far I can not came up with any better fix than
a partial revert of 41634aa8d6db.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 drivers/net/wireless/mediatek/mt76/tx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Stanislaw Gruszka July 29, 2019, 12:53 p.m. UTC | #1
On Fri, Jul 26, 2019 at 02:10:56PM +0200, Stanislaw Gruszka wrote:
> Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet")
> I can observe firmware hangs on MT7630E on station mode: tx stop
> functioning after minor activity (rx keep working) and on module
> unload device fail to stop with messages:
> 
> [ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop
> [ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop
> 
> Loading module again results in failure to associate with AP.
> Only machine power off / power on cycle can make device work again.
> 
> I have no idea why the commit caused F/W hangs. Most likely some proper
> fix is needed of changing registers programming (or assuring proper order
> of actions), but so far I can not came up with any better fix than
> a partial revert of 41634aa8d6db.

The difference is that with 41634aa8d6db we can run mt76x02_poll_tx()
and mt76x02_tx_tasklet() in parallel on 2 different CPUs. Without
the commit, since tasklet is not scheduled from mt76_wake_tx_queue(),
it does not happen.

I'm not quite sure why, but MT7630E does not like when we poll tx status
on 2 cpus at once. Change like below:

diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
index 467b28379870..622251faa415 100644
--- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
+++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
@@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
 					       mt76.tx_napi);
 	int i;
 
-	mt76x02_mac_poll_tx_status(dev, false);
+	mt76x02_mac_poll_tx_status(dev, true);
 
 	for (i = MT_TXQ_MCU; i >= 0; i--)
 		mt76_queue_tx_cleanup(dev, i, false);

is sufficient to avoid the hangs.

> diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> index 5397827668b9..fefe0ee52584 100644
> --- a/drivers/net/wireless/mediatek/mt76/tx.c
> +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
>  	if (!test_bit(MT76_STATE_RUNNING, &dev->state))
>  		return;
>  
> -	tasklet_schedule(&dev->tx_tasklet);
> +	mt76_txq_schedule(dev, txq->ac);

However I'm not sure if change to tasklet_schedule() is indeed correct,
since on tasklet we schedule all queues, hence queues that could
possibly be still blocked? And on mt76_wake_tx_queue() we schedule only
one queue.

Stanislaw
Lorenzo Bianconi July 29, 2019, 2:02 p.m. UTC | #2
> On Fri, Jul 26, 2019 at 02:10:56PM +0200, Stanislaw Gruszka wrote:
> > Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet")
> > I can observe firmware hangs on MT7630E on station mode: tx stop
> > functioning after minor activity (rx keep working) and on module
> > unload device fail to stop with messages:
> > 
> > [ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop
> > [ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop
> > 
> > Loading module again results in failure to associate with AP.
> > Only machine power off / power on cycle can make device work again.
> > 
> > I have no idea why the commit caused F/W hangs. Most likely some proper
> > fix is needed of changing registers programming (or assuring proper order
> > of actions), but so far I can not came up with any better fix than
> > a partial revert of 41634aa8d6db.
> 
> The difference is that with 41634aa8d6db we can run mt76x02_poll_tx()
> and mt76x02_tx_tasklet() in parallel on 2 different CPUs. Without
> the commit, since tasklet is not scheduled from mt76_wake_tx_queue(),
> it does not happen.
> 
> I'm not quite sure why, but MT7630E does not like when we poll tx status
> on 2 cpus at once. Change like below:

Hi Stanislaw,

have you tried to look at the commit: 6fe533378795f87
("mt76: mt76x02: remove irqsave/restore in locking for tx status fifo")?
Could it be a race between a registermap update and a stats load? (we are using a
different lock now)

> 
> diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> index 467b28379870..622251faa415 100644
> --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
>  					       mt76.tx_napi);
>  	int i;
>  
> -	mt76x02_mac_poll_tx_status(dev, false);
> +	mt76x02_mac_poll_tx_status(dev, true);

I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
difference doing so is we do not run mt76x02_send_tx_status().

Regards,
Lorenzo

>  
>  	for (i = MT_TXQ_MCU; i >= 0; i--)
>  		mt76_queue_tx_cleanup(dev, i, false);
> 
> is sufficient to avoid the hangs.
> 
> > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> > index 5397827668b9..fefe0ee52584 100644
> > --- a/drivers/net/wireless/mediatek/mt76/tx.c
> > +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> > @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
> >  	if (!test_bit(MT76_STATE_RUNNING, &dev->state))
> >  		return;
> >  
> > -	tasklet_schedule(&dev->tx_tasklet);
> > +	mt76_txq_schedule(dev, txq->ac);
> 
> However I'm not sure if change to tasklet_schedule() is indeed correct,
> since on tasklet we schedule all queues, hence queues that could
> possibly be still blocked? And on mt76_wake_tx_queue() we schedule only
> one queue.
> 
> Stanislaw
Stanislaw Gruszka July 30, 2019, 1:54 p.m. UTC | #3
On Mon, Jul 29, 2019 at 04:02:41PM +0200, Lorenzo Bianconi wrote:
> > On Fri, Jul 26, 2019 at 02:10:56PM +0200, Stanislaw Gruszka wrote:
> > > Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet")
> > > I can observe firmware hangs on MT7630E on station mode: tx stop
> > > functioning after minor activity (rx keep working) and on module
> > > unload device fail to stop with messages:
> > > 
> > > [ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop
> > > [ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop
> > > 
> > > Loading module again results in failure to associate with AP.
> > > Only machine power off / power on cycle can make device work again.
> > > 
> > > I have no idea why the commit caused F/W hangs. Most likely some proper
> > > fix is needed of changing registers programming (or assuring proper order
> > > of actions), but so far I can not came up with any better fix than
> > > a partial revert of 41634aa8d6db.
> > 
> > The difference is that with 41634aa8d6db we can run mt76x02_poll_tx()
> > and mt76x02_tx_tasklet() in parallel on 2 different CPUs. Without
> > the commit, since tasklet is not scheduled from mt76_wake_tx_queue(),
> > it does not happen.
> > 
> > I'm not quite sure why, but MT7630E does not like when we poll tx status
> > on 2 cpus at once. Change like below:
> 
> Hi Stanislaw,

Hi

> have you tried to look at the commit: 6fe533378795f87
> ("mt76: mt76x02: remove irqsave/restore in locking for tx status fifo")?
> Could it be a race between a registermap update and a stats load? (we are using a
> different lock now)

This commit seems to be fine, reverting it did not solve the issue.

> > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > index 467b28379870..622251faa415 100644
> > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> >  					       mt76.tx_napi);
> >  	int i;
> >  
> > -	mt76x02_mac_poll_tx_status(dev, false);
> > +	mt76x02_mac_poll_tx_status(dev, true);
> 
> I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> difference doing so is we do not run mt76x02_send_tx_status().

I thought this is the problem, but it was my mistake during testing.
I tested the above change together with mt76_txq_schedule(dev, txq->ac)
change and get wrong impression it fixes the issue. But above change
alone does not help.

I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
originally posted here make the problem gone.

Stanislaw
Lorenzo Bianconi July 30, 2019, 2:55 p.m. UTC | #4
> On Mon, Jul 29, 2019 at 04:02:41PM +0200, Lorenzo Bianconi wrote:
> > > On Fri, Jul 26, 2019 at 02:10:56PM +0200, Stanislaw Gruszka wrote:
> > > > Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet")
> > > > I can observe firmware hangs on MT7630E on station mode: tx stop
> > > > functioning after minor activity (rx keep working) and on module
> > > > unload device fail to stop with messages:
> > > > 
> > > > [ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop
> > > > [ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop
> > > > 
> > > > Loading module again results in failure to associate with AP.
> > > > Only machine power off / power on cycle can make device work again.
> > > > 
> > > > I have no idea why the commit caused F/W hangs. Most likely some proper
> > > > fix is needed of changing registers programming (or assuring proper order
> > > > of actions), but so far I can not came up with any better fix than
> > > > a partial revert of 41634aa8d6db.
> > > 
> > > The difference is that with 41634aa8d6db we can run mt76x02_poll_tx()
> > > and mt76x02_tx_tasklet() in parallel on 2 different CPUs. Without
> > > the commit, since tasklet is not scheduled from mt76_wake_tx_queue(),
> > > it does not happen.
> > > 
> > > I'm not quite sure why, but MT7630E does not like when we poll tx status
> > > on 2 cpus at once. Change like below:
> > 
> > Hi Stanislaw,
> 
> Hi
> 
> > have you tried to look at the commit: 6fe533378795f87
> > ("mt76: mt76x02: remove irqsave/restore in locking for tx status fifo")?
> > Could it be a race between a registermap update and a stats load? (we are using a
> > different lock now)
> 
> This commit seems to be fine, reverting it did not solve the issue.

ack

> 
> > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > index 467b28379870..622251faa415 100644
> > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> > >  					       mt76.tx_napi);
> > >  	int i;
> > >  
> > > -	mt76x02_mac_poll_tx_status(dev, false);
> > > +	mt76x02_mac_poll_tx_status(dev, true);
> > 
> > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> > difference doing so is we do not run mt76x02_send_tx_status().
> 
> I thought this is the problem, but it was my mistake during testing.
> I tested the above change together with mt76_txq_schedule(dev, txq->ac)
> change and get wrong impression it fixes the issue. But above change
> alone does not help.
> 
> I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
> and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
> originally posted here make the problem gone.

so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status()
in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run
mt76_txq_schedule_all() in mt76x02_poll_tx(), right?

Lorenzo

> 
> Stanislaw
>
Stanislaw Gruszka July 31, 2019, 8:19 a.m. UTC | #5
On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote:
> > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > index 467b28379870..622251faa415 100644
> > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> > > >  					       mt76.tx_napi);
> > > >  	int i;
> > > >  
> > > > -	mt76x02_mac_poll_tx_status(dev, false);
> > > > +	mt76x02_mac_poll_tx_status(dev, true);
> > > 
> > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> > > difference doing so is we do not run mt76x02_send_tx_status().
> > 
> > I thought this is the problem, but it was my mistake during testing.
> > I tested the above change together with mt76_txq_schedule(dev, txq->ac)
> > change and get wrong impression it fixes the issue. But above change
> > alone does not help.
> > 
> > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
> > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
> > originally posted here make the problem gone.
> 
> so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status()
> in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run
> mt76_txq_schedule_all() in mt76x02_poll_tx(), right?

Yes.

Stanislaw
Stanislaw Gruszka July 31, 2019, 8:51 a.m. UTC | #6
On Wed, Jul 31, 2019 at 10:19:58AM +0200, Stanislaw Gruszka wrote:
> On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote:
> > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > index 467b28379870..622251faa415 100644
> > > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> > > > >  					       mt76.tx_napi);
> > > > >  	int i;
> > > > >  
> > > > > -	mt76x02_mac_poll_tx_status(dev, false);
> > > > > +	mt76x02_mac_poll_tx_status(dev, true);
> > > > 
> > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> > > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> > > > difference doing so is we do not run mt76x02_send_tx_status().
> > > 
> > > I thought this is the problem, but it was my mistake during testing.
> > > I tested the above change together with mt76_txq_schedule(dev, txq->ac)
> > > change and get wrong impression it fixes the issue. But above change
> > > alone does not help.
> > > 
> > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
> > > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
> > > originally posted here make the problem gone.
> > 
> > so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status()
> > in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run
> > mt76_txq_schedule_all() in mt76x02_poll_tx(), right?
> 
> Yes.

Err, no, I should read more cerfully. It is partiall revert of 
41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") : 

diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
index 5397827668b9..fefe0ee52584 100644
--- a/drivers/net/wireless/mediatek/mt76/tx.c
+++ b/drivers/net/wireless/mediatek/mt76/tx.c
@@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
        if (!test_bit(MT76_STATE_RUNNING, &dev->state))
                return;
 
-       tasklet_schedule(&dev->tx_tasklet);
+       mt76_txq_schedule(dev, txq->ac);
 }
 EXPORT_SYMBOL_GPL(mt76_wake_tx_queue);
Lorenzo Bianconi July 31, 2019, 9:09 a.m. UTC | #7
> On Wed, Jul 31, 2019 at 10:19:58AM +0200, Stanislaw Gruszka wrote:
> > On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote:
> > > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > index 467b28379870..622251faa415 100644
> > > > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> > > > > >  					       mt76.tx_napi);
> > > > > >  	int i;
> > > > > >  
> > > > > > -	mt76x02_mac_poll_tx_status(dev, false);
> > > > > > +	mt76x02_mac_poll_tx_status(dev, true);
> > > > > 
> > > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> > > > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> > > > > difference doing so is we do not run mt76x02_send_tx_status().
> > > > 
> > > > I thought this is the problem, but it was my mistake during testing.
> > > > I tested the above change together with mt76_txq_schedule(dev, txq->ac)
> > > > change and get wrong impression it fixes the issue. But above change
> > > > alone does not help.
> > > > 
> > > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
> > > > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
> > > > originally posted here make the problem gone.
> > > 
> > > so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status()
> > > in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run
> > > mt76_txq_schedule_all() in mt76x02_poll_tx(), right?
> > 
> > Yes.
> 
> Err, no, I should read more cerfully. It is partiall revert of 
> 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") : 
> 
> diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> index 5397827668b9..fefe0ee52584 100644
> --- a/drivers/net/wireless/mediatek/mt76/tx.c
> +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
>         if (!test_bit(MT76_STATE_RUNNING, &dev->state))
>                 return;
>  
> -       tasklet_schedule(&dev->tx_tasklet);
> +       mt76_txq_schedule(dev, txq->ac);
>  }
>  EXPORT_SYMBOL_GPL(mt76_wake_tx_queue);

reviewing the code I think:

- we should not run mt76u_tx_tasklet() from mt76_wake_tx_queue() since we do
  not have tx_napi for usb and it will unnecessary go through tx queue checks.
  We should probably do in mt76_wake_tx_queue() something like:

  if (is_mmio())
	  tasklet_schedule(&dev->tx_tasklet);
  else
	  mt76_txq_schedule(dev, txq->ac);

  Another solution would be add a status_tasklet that just goes through the tx
  queues receiving the usb tx completion and it schedules the tx_tasklet
  What do you think?

- I guess it does not fix the 76x0e issue but we should just schedule tx queues in
  mt76x02_tx_tasklet() (like it is done for mt7603 and mt7615) and move status
  processing in mt76x02_poll_tx()

Regards,
Lorenzo
Stanislaw Gruszka Aug. 5, 2019, 10:01 a.m. UTC | #8
On Wed, Jul 31, 2019 at 11:09:27AM +0200, Lorenzo Bianconi wrote:
> > On Wed, Jul 31, 2019 at 10:19:58AM +0200, Stanislaw Gruszka wrote:
> > > On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote:
> > > > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > > index 467b28379870..622251faa415 100644
> > > > > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> > > > > > >  					       mt76.tx_napi);
> > > > > > >  	int i;
> > > > > > >  
> > > > > > > -	mt76x02_mac_poll_tx_status(dev, false);
> > > > > > > +	mt76x02_mac_poll_tx_status(dev, true);
> > > > > > 
> > > > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> > > > > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> > > > > > difference doing so is we do not run mt76x02_send_tx_status().
> > > > > 
> > > > > I thought this is the problem, but it was my mistake during testing.
> > > > > I tested the above change together with mt76_txq_schedule(dev, txq->ac)
> > > > > change and get wrong impression it fixes the issue. But above change
> > > > > alone does not help.
> > > > > 
> > > > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
> > > > > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
> > > > > originally posted here make the problem gone.
> > > > 
> > > > so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status()
> > > > in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run
> > > > mt76_txq_schedule_all() in mt76x02_poll_tx(), right?
> > > 
> > > Yes.
> > 
> > Err, no, I should read more cerfully. It is partiall revert of 
> > 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") : 
> > 
> > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> > index 5397827668b9..fefe0ee52584 100644
> > --- a/drivers/net/wireless/mediatek/mt76/tx.c
> > +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> > @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
> >         if (!test_bit(MT76_STATE_RUNNING, &dev->state))
> >                 return;
> >  
> > -       tasklet_schedule(&dev->tx_tasklet);
> > +       mt76_txq_schedule(dev, txq->ac);
> >  }
> >  EXPORT_SYMBOL_GPL(mt76_wake_tx_queue);
> 
> reviewing the code I think:
> 
> - we should not run mt76u_tx_tasklet() from mt76_wake_tx_queue() since we do
>   not have tx_napi for usb and it will unnecessary go through tx queue checks.
>   We should probably do in mt76_wake_tx_queue() something like:
> 
>   if (is_mmio())

Adding '&& !is_mt7630()' will solve the problem for MT7630E as well ...

> 	  tasklet_schedule(&dev->tx_tasklet);
>   else
> 	  mt76_txq_schedule(dev, txq->ac);
> 
>   Another solution would be add a status_tasklet that just goes through the tx
>   queues receiving the usb tx completion and it schedules the tx_tasklet
>   What do you think?
> 
> - I guess it does not fix the 76x0e issue but we should just schedule tx queues in
>   mt76x02_tx_tasklet() (like it is done for mt7603 and mt7615) and move status
>   processing in mt76x02_poll_tx()

... but I think we have bug when do mt76_txq_schedule_all() in
tx_tasklet, because we can schedule on queues that are stoped.
So reverting 41634aa8d6db and then optimize by removing tx_tasklet
for mmio and remove not needed mt76_txq_schedule_all() calls looks
more reasoneble to me.
 
Stanislaw
Lorenzo Bianconi Aug. 5, 2019, 11:27 a.m. UTC | #9
> On Wed, Jul 31, 2019 at 11:09:27AM +0200, Lorenzo Bianconi wrote:
> > > On Wed, Jul 31, 2019 at 10:19:58AM +0200, Stanislaw Gruszka wrote:
> > > > On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote:
> > > > > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > > > index 467b28379870..622251faa415 100644
> > > > > > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c
> > > > > > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget)
> > > > > > > >  					       mt76.tx_napi);
> > > > > > > >  	int i;
> > > > > > > >  
> > > > > > > > -	mt76x02_mac_poll_tx_status(dev, false);
> > > > > > > > +	mt76x02_mac_poll_tx_status(dev, true);
> > > > > > > 
> > > > > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run
> > > > > > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only
> > > > > > > difference doing so is we do not run mt76x02_send_tx_status().
> > > > > > 
> > > > > > I thought this is the problem, but it was my mistake during testing.
> > > > > > I tested the above change together with mt76_txq_schedule(dev, txq->ac)
> > > > > > change and get wrong impression it fixes the issue. But above change
> > > > > > alone does not help.
> > > > > > 
> > > > > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx()
> > > > > > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch
> > > > > > originally posted here make the problem gone.
> > > > > 
> > > > > so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status()
> > > > > in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run
> > > > > mt76_txq_schedule_all() in mt76x02_poll_tx(), right?
> > > > 
> > > > Yes.
> > > 
> > > Err, no, I should read more cerfully. It is partiall revert of 
> > > 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") : 
> > > 
> > > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> > > index 5397827668b9..fefe0ee52584 100644
> > > --- a/drivers/net/wireless/mediatek/mt76/tx.c
> > > +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> > > @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
> > >         if (!test_bit(MT76_STATE_RUNNING, &dev->state))
> > >                 return;
> > >  
> > > -       tasklet_schedule(&dev->tx_tasklet);
> > > +       mt76_txq_schedule(dev, txq->ac);
> > >  }
> > >  EXPORT_SYMBOL_GPL(mt76_wake_tx_queue);
> > 
> > reviewing the code I think:
> > 
> > - we should not run mt76u_tx_tasklet() from mt76_wake_tx_queue() since we do
> >   not have tx_napi for usb and it will unnecessary go through tx queue checks.
> >   We should probably do in mt76_wake_tx_queue() something like:
> > 
> >   if (is_mmio())
> 
> Adding '&& !is_mt7630()' will solve the problem for MT7630E as well ...
> 
> > 	  tasklet_schedule(&dev->tx_tasklet);
> >   else
> > 	  mt76_txq_schedule(dev, txq->ac);
> > 
> >   Another solution would be add a status_tasklet that just goes through the tx
> >   queues receiving the usb tx completion and it schedules the tx_tasklet
> >   What do you think?
> > 
> > - I guess it does not fix the 76x0e issue but we should just schedule tx queues in
> >   mt76x02_tx_tasklet() (like it is done for mt7603 and mt7615) and move status
> >   processing in mt76x02_poll_tx()
> 
> ... but I think we have bug when do mt76_txq_schedule_all() in
> tx_tasklet, because we can schedule on queues that are stoped.
> So reverting 41634aa8d6db and then optimize by removing tx_tasklet
> for mmio and remove not needed mt76_txq_schedule_all() calls looks
> more reasoneble to me.

schedule a stopped queue seems not harmful at a first glance since we do not
copy pending skbs if we have not enough room in the dma ring. Maybe we can be
more conservative doing something like:

diff --git a/drivers/net/wireless/mediatek/mt76/dma.c b/drivers/net/wireless/mediatek/mt76/dma.c
index d8f61e540bfd..c6482155e5e4 100644
--- a/drivers/net/wireless/mediatek/mt76/dma.c
+++ b/drivers/net/wireless/mediatek/mt76/dma.c
@@ -346,6 +346,11 @@ mt76_dma_tx_queue_skb(struct mt76_dev *dev, enum mt76_txq_id qid,
 		goto unmap;
 
 	if (q->queued + (tx_info.nbuf + 1) / 2 >= q->ndesc - 1) {
+		if (!q->stopped) {
+			ieee80211_stop_queue(dev->hw,
+					     skb_get_queue_mapping(skb));
+			q->stopped = true;
+		}
 		ret = -ENOMEM;
 		goto unmap;
 	}
diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
index 5397827668b9..bd2d34c4f326 100644
--- a/drivers/net/wireless/mediatek/mt76/tx.c
+++ b/drivers/net/wireless/mediatek/mt76/tx.c
@@ -495,6 +495,9 @@ mt76_txq_schedule_list(struct mt76_dev *dev, enum mt76_txq_id qid)
 	while (1) {
 		bool empty = false;
 
+		if (hwq->stopped)
+			break;
+
 		if (sq->swq_queued >= 4)
 			break;

Does it fix the issue you are facing?

Regards,
Lorenzo

>  
> Stanislaw
Stanislaw Gruszka Aug. 5, 2019, 12:39 p.m. UTC | #10
On Mon, Aug 05, 2019 at 01:27:19PM +0200, Lorenzo Bianconi wrote:
> > ... but I think we have bug when do mt76_txq_schedule_all() in
> > tx_tasklet, because we can schedule on queues that are stoped.
> > So reverting 41634aa8d6db and then optimize by removing tx_tasklet
> > for mmio and remove not needed mt76_txq_schedule_all() calls looks
> > more reasoneble to me.
> 
> schedule a stopped queue seems not harmful at a first glance since we do not
> copy pending skbs if we have not enough room in the dma ring.

mac80211 stop queues for various other reasons than 
IEEE80211_QUEUE_STOP_REASON_DRIVER .
 
> Maybe we can be
> more conservative doing something like:
>
> diff --git a/drivers/net/wireless/mediatek/mt76/dma.c b/drivers/net/wireless/mediatek/mt76/dma.c
> index d8f61e540bfd..c6482155e5e4 100644
> --- a/drivers/net/wireless/mediatek/mt76/dma.c
> +++ b/drivers/net/wireless/mediatek/mt76/dma.c
> @@ -346,6 +346,11 @@ mt76_dma_tx_queue_skb(struct mt76_dev *dev, enum mt76_txq_id qid,
>  		goto unmap;
>  
>  	if (q->queued + (tx_info.nbuf + 1) / 2 >= q->ndesc - 1) {
> +		if (!q->stopped) {
> +			ieee80211_stop_queue(dev->hw,
> +					     skb_get_queue_mapping(skb));
> +			q->stopped = true;
> +		}
>  		ret = -ENOMEM;
>  		goto unmap;
>  	}
> diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> index 5397827668b9..bd2d34c4f326 100644
> --- a/drivers/net/wireless/mediatek/mt76/tx.c
> +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> @@ -495,6 +495,9 @@ mt76_txq_schedule_list(struct mt76_dev *dev, enum mt76_txq_id qid)
>  	while (1) {
>  		bool empty = false;
>  
> +		if (hwq->stopped)
> +			break;
> +
>  		if (sq->swq_queued >= 4)
>  			break;
> 
> Does it fix the issue you are facing?

I'll not be able to test this patch this week. Will have access to
the hardware next week. 

I checeked before, if
'q->queued + (tx_info.nbuf + 1) / 2 >= q->ndesc - 1' is triggered
when MT7630E hangs and it is not. But maybe second part of the patch
will help.

Stanislaw
Stanislaw Gruszka Aug. 12, 2019, 11:50 a.m. UTC | #11
On Mon, Aug 05, 2019 at 02:39:16PM +0200, Stanislaw Gruszka wrote:
> On Mon, Aug 05, 2019 at 01:27:19PM +0200, Lorenzo Bianconi wrote:
> > > ... but I think we have bug when do mt76_txq_schedule_all() in
> > > tx_tasklet, because we can schedule on queues that are stoped.
> > > So reverting 41634aa8d6db and then optimize by removing tx_tasklet
> > > for mmio and remove not needed mt76_txq_schedule_all() calls looks
> > > more reasoneble to me.
> > 
> > schedule a stopped queue seems not harmful at a first glance since we do not
> > copy pending skbs if we have not enough room in the dma ring.
> 
> mac80211 stop queues for various other reasons than 
> IEEE80211_QUEUE_STOP_REASON_DRIVER .

After looking in more details, we check if queue is stopped in 
ieee80211_tx_dequeue(). But we do not check that for skb's queued
in mtxq->retry_q .

> > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
> > index 5397827668b9..bd2d34c4f326 100644
> > --- a/drivers/net/wireless/mediatek/mt76/tx.c
> > +++ b/drivers/net/wireless/mediatek/mt76/tx.c
> > @@ -495,6 +495,9 @@ mt76_txq_schedule_list(struct mt76_dev *dev, enum mt76_txq_id qid)
> >  	while (1) {
> >  		bool empty = false;
> >  
> > +		if (hwq->stopped)
> > +			break;
> > +
> >  		if (sq->swq_queued >= 4)
> >  			break;
> > 
> > Does it fix the issue you are facing?
> 
> I'll not be able to test this patch this week. Will have access to
> the hardware next week. 
> 
> I checeked before, if
> 'q->queued + (tx_info.nbuf + 1) / 2 >= q->ndesc - 1' is triggered
> when MT7630E hangs and it is not. But maybe second part of the patch
> will help.

Patch did not help.

I debugged problem a bit more and issue is related with HW encryption.
Using full mac80211 SW encyption by retuning -EOPNOTSUPP
mt76x02_set_key() make the problem gone as well. Looks there is
some race between setting HW keys and transmitting.

Stanislaw
diff mbox series

Patch

diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
index 5397827668b9..fefe0ee52584 100644
--- a/drivers/net/wireless/mediatek/mt76/tx.c
+++ b/drivers/net/wireless/mediatek/mt76/tx.c
@@ -598,7 +598,7 @@  void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
 	if (!test_bit(MT76_STATE_RUNNING, &dev->state))
 		return;
 
-	tasklet_schedule(&dev->tx_tasklet);
+	mt76_txq_schedule(dev, txq->ac);
 }
 EXPORT_SYMBOL_GPL(mt76_wake_tx_queue);