Message ID | 1564143056-14610-1-git-send-email-sgruszka@redhat.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Felix Fietkau |
Headers | show |
Series | [RFC] mt76: fix tx hung regression on MT7630E | expand |
On Fri, Jul 26, 2019 at 02:10:56PM +0200, Stanislaw Gruszka wrote: > Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") > I can observe firmware hangs on MT7630E on station mode: tx stop > functioning after minor activity (rx keep working) and on module > unload device fail to stop with messages: > > [ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop > [ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop > > Loading module again results in failure to associate with AP. > Only machine power off / power on cycle can make device work again. > > I have no idea why the commit caused F/W hangs. Most likely some proper > fix is needed of changing registers programming (or assuring proper order > of actions), but so far I can not came up with any better fix than > a partial revert of 41634aa8d6db. The difference is that with 41634aa8d6db we can run mt76x02_poll_tx() and mt76x02_tx_tasklet() in parallel on 2 different CPUs. Without the commit, since tasklet is not scheduled from mt76_wake_tx_queue(), it does not happen. I'm not quite sure why, but MT7630E does not like when we poll tx status on 2 cpus at once. Change like below: diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c index 467b28379870..622251faa415 100644 --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget) mt76.tx_napi); int i; - mt76x02_mac_poll_tx_status(dev, false); + mt76x02_mac_poll_tx_status(dev, true); for (i = MT_TXQ_MCU; i >= 0; i--) mt76_queue_tx_cleanup(dev, i, false); is sufficient to avoid the hangs. > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c > index 5397827668b9..fefe0ee52584 100644 > --- a/drivers/net/wireless/mediatek/mt76/tx.c > +++ b/drivers/net/wireless/mediatek/mt76/tx.c > @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq) > if (!test_bit(MT76_STATE_RUNNING, &dev->state)) > return; > > - tasklet_schedule(&dev->tx_tasklet); > + mt76_txq_schedule(dev, txq->ac); However I'm not sure if change to tasklet_schedule() is indeed correct, since on tasklet we schedule all queues, hence queues that could possibly be still blocked? And on mt76_wake_tx_queue() we schedule only one queue. Stanislaw
> On Fri, Jul 26, 2019 at 02:10:56PM +0200, Stanislaw Gruszka wrote: > > Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") > > I can observe firmware hangs on MT7630E on station mode: tx stop > > functioning after minor activity (rx keep working) and on module > > unload device fail to stop with messages: > > > > [ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop > > [ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop > > > > Loading module again results in failure to associate with AP. > > Only machine power off / power on cycle can make device work again. > > > > I have no idea why the commit caused F/W hangs. Most likely some proper > > fix is needed of changing registers programming (or assuring proper order > > of actions), but so far I can not came up with any better fix than > > a partial revert of 41634aa8d6db. > > The difference is that with 41634aa8d6db we can run mt76x02_poll_tx() > and mt76x02_tx_tasklet() in parallel on 2 different CPUs. Without > the commit, since tasklet is not scheduled from mt76_wake_tx_queue(), > it does not happen. > > I'm not quite sure why, but MT7630E does not like when we poll tx status > on 2 cpus at once. Change like below: Hi Stanislaw, have you tried to look at the commit: 6fe533378795f87 ("mt76: mt76x02: remove irqsave/restore in locking for tx status fifo")? Could it be a race between a registermap update and a stats load? (we are using a different lock now) > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > index 467b28379870..622251faa415 100644 > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget) > mt76.tx_napi); > int i; > > - mt76x02_mac_poll_tx_status(dev, false); > + mt76x02_mac_poll_tx_status(dev, true); I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only difference doing so is we do not run mt76x02_send_tx_status(). Regards, Lorenzo > > for (i = MT_TXQ_MCU; i >= 0; i--) > mt76_queue_tx_cleanup(dev, i, false); > > is sufficient to avoid the hangs. > > > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c > > index 5397827668b9..fefe0ee52584 100644 > > --- a/drivers/net/wireless/mediatek/mt76/tx.c > > +++ b/drivers/net/wireless/mediatek/mt76/tx.c > > @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq) > > if (!test_bit(MT76_STATE_RUNNING, &dev->state)) > > return; > > > > - tasklet_schedule(&dev->tx_tasklet); > > + mt76_txq_schedule(dev, txq->ac); > > However I'm not sure if change to tasklet_schedule() is indeed correct, > since on tasklet we schedule all queues, hence queues that could > possibly be still blocked? And on mt76_wake_tx_queue() we schedule only > one queue. > > Stanislaw
On Mon, Jul 29, 2019 at 04:02:41PM +0200, Lorenzo Bianconi wrote: > > On Fri, Jul 26, 2019 at 02:10:56PM +0200, Stanislaw Gruszka wrote: > > > Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") > > > I can observe firmware hangs on MT7630E on station mode: tx stop > > > functioning after minor activity (rx keep working) and on module > > > unload device fail to stop with messages: > > > > > > [ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop > > > [ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop > > > > > > Loading module again results in failure to associate with AP. > > > Only machine power off / power on cycle can make device work again. > > > > > > I have no idea why the commit caused F/W hangs. Most likely some proper > > > fix is needed of changing registers programming (or assuring proper order > > > of actions), but so far I can not came up with any better fix than > > > a partial revert of 41634aa8d6db. > > > > The difference is that with 41634aa8d6db we can run mt76x02_poll_tx() > > and mt76x02_tx_tasklet() in parallel on 2 different CPUs. Without > > the commit, since tasklet is not scheduled from mt76_wake_tx_queue(), > > it does not happen. > > > > I'm not quite sure why, but MT7630E does not like when we poll tx status > > on 2 cpus at once. Change like below: > > Hi Stanislaw, Hi > have you tried to look at the commit: 6fe533378795f87 > ("mt76: mt76x02: remove irqsave/restore in locking for tx status fifo")? > Could it be a race between a registermap update and a stats load? (we are using a > different lock now) This commit seems to be fine, reverting it did not solve the issue. > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > index 467b28379870..622251faa415 100644 > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget) > > mt76.tx_napi); > > int i; > > > > - mt76x02_mac_poll_tx_status(dev, false); > > + mt76x02_mac_poll_tx_status(dev, true); > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only > difference doing so is we do not run mt76x02_send_tx_status(). I thought this is the problem, but it was my mistake during testing. I tested the above change together with mt76_txq_schedule(dev, txq->ac) change and get wrong impression it fixes the issue. But above change alone does not help. I tried to add some locking to avoid parallel execution of mt76x02_poll_tx() and mt76x02_tx_tasklet(), but it didn't help either. So far only patch originally posted here make the problem gone. Stanislaw
> On Mon, Jul 29, 2019 at 04:02:41PM +0200, Lorenzo Bianconi wrote: > > > On Fri, Jul 26, 2019 at 02:10:56PM +0200, Stanislaw Gruszka wrote: > > > > Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") > > > > I can observe firmware hangs on MT7630E on station mode: tx stop > > > > functioning after minor activity (rx keep working) and on module > > > > unload device fail to stop with messages: > > > > > > > > [ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop > > > > [ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop > > > > > > > > Loading module again results in failure to associate with AP. > > > > Only machine power off / power on cycle can make device work again. > > > > > > > > I have no idea why the commit caused F/W hangs. Most likely some proper > > > > fix is needed of changing registers programming (or assuring proper order > > > > of actions), but so far I can not came up with any better fix than > > > > a partial revert of 41634aa8d6db. > > > > > > The difference is that with 41634aa8d6db we can run mt76x02_poll_tx() > > > and mt76x02_tx_tasklet() in parallel on 2 different CPUs. Without > > > the commit, since tasklet is not scheduled from mt76_wake_tx_queue(), > > > it does not happen. > > > > > > I'm not quite sure why, but MT7630E does not like when we poll tx status > > > on 2 cpus at once. Change like below: > > > > Hi Stanislaw, > > Hi > > > have you tried to look at the commit: 6fe533378795f87 > > ("mt76: mt76x02: remove irqsave/restore in locking for tx status fifo")? > > Could it be a race between a registermap update and a stats load? (we are using a > > different lock now) > > This commit seems to be fine, reverting it did not solve the issue. ack > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > index 467b28379870..622251faa415 100644 > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget) > > > mt76.tx_napi); > > > int i; > > > > > > - mt76x02_mac_poll_tx_status(dev, false); > > > + mt76x02_mac_poll_tx_status(dev, true); > > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only > > difference doing so is we do not run mt76x02_send_tx_status(). > > I thought this is the problem, but it was my mistake during testing. > I tested the above change together with mt76_txq_schedule(dev, txq->ac) > change and get wrong impression it fixes the issue. But above change > alone does not help. > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx() > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch > originally posted here make the problem gone. so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status() in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run mt76_txq_schedule_all() in mt76x02_poll_tx(), right? Lorenzo > > Stanislaw >
On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote: > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > index 467b28379870..622251faa415 100644 > > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget) > > > > mt76.tx_napi); > > > > int i; > > > > > > > > - mt76x02_mac_poll_tx_status(dev, false); > > > > + mt76x02_mac_poll_tx_status(dev, true); > > > > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run > > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only > > > difference doing so is we do not run mt76x02_send_tx_status(). > > > > I thought this is the problem, but it was my mistake during testing. > > I tested the above change together with mt76_txq_schedule(dev, txq->ac) > > change and get wrong impression it fixes the issue. But above change > > alone does not help. > > > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx() > > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch > > originally posted here make the problem gone. > > so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status() > in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run > mt76_txq_schedule_all() in mt76x02_poll_tx(), right? Yes. Stanislaw
On Wed, Jul 31, 2019 at 10:19:58AM +0200, Stanislaw Gruszka wrote: > On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote: > > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > > index 467b28379870..622251faa415 100644 > > > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget) > > > > > mt76.tx_napi); > > > > > int i; > > > > > > > > > > - mt76x02_mac_poll_tx_status(dev, false); > > > > > + mt76x02_mac_poll_tx_status(dev, true); > > > > > > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run > > > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only > > > > difference doing so is we do not run mt76x02_send_tx_status(). > > > > > > I thought this is the problem, but it was my mistake during testing. > > > I tested the above change together with mt76_txq_schedule(dev, txq->ac) > > > change and get wrong impression it fixes the issue. But above change > > > alone does not help. > > > > > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx() > > > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch > > > originally posted here make the problem gone. > > > > so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status() > > in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run > > mt76_txq_schedule_all() in mt76x02_poll_tx(), right? > > Yes. Err, no, I should read more cerfully. It is partiall revert of 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") : diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c index 5397827668b9..fefe0ee52584 100644 --- a/drivers/net/wireless/mediatek/mt76/tx.c +++ b/drivers/net/wireless/mediatek/mt76/tx.c @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq) if (!test_bit(MT76_STATE_RUNNING, &dev->state)) return; - tasklet_schedule(&dev->tx_tasklet); + mt76_txq_schedule(dev, txq->ac); } EXPORT_SYMBOL_GPL(mt76_wake_tx_queue);
> On Wed, Jul 31, 2019 at 10:19:58AM +0200, Stanislaw Gruszka wrote: > > On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote: > > > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > > > index 467b28379870..622251faa415 100644 > > > > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget) > > > > > > mt76.tx_napi); > > > > > > int i; > > > > > > > > > > > > - mt76x02_mac_poll_tx_status(dev, false); > > > > > > + mt76x02_mac_poll_tx_status(dev, true); > > > > > > > > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run > > > > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only > > > > > difference doing so is we do not run mt76x02_send_tx_status(). > > > > > > > > I thought this is the problem, but it was my mistake during testing. > > > > I tested the above change together with mt76_txq_schedule(dev, txq->ac) > > > > change and get wrong impression it fixes the issue. But above change > > > > alone does not help. > > > > > > > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx() > > > > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch > > > > originally posted here make the problem gone. > > > > > > so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status() > > > in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run > > > mt76_txq_schedule_all() in mt76x02_poll_tx(), right? > > > > Yes. > > Err, no, I should read more cerfully. It is partiall revert of > 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") : > > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c > index 5397827668b9..fefe0ee52584 100644 > --- a/drivers/net/wireless/mediatek/mt76/tx.c > +++ b/drivers/net/wireless/mediatek/mt76/tx.c > @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq) > if (!test_bit(MT76_STATE_RUNNING, &dev->state)) > return; > > - tasklet_schedule(&dev->tx_tasklet); > + mt76_txq_schedule(dev, txq->ac); > } > EXPORT_SYMBOL_GPL(mt76_wake_tx_queue); reviewing the code I think: - we should not run mt76u_tx_tasklet() from mt76_wake_tx_queue() since we do not have tx_napi for usb and it will unnecessary go through tx queue checks. We should probably do in mt76_wake_tx_queue() something like: if (is_mmio()) tasklet_schedule(&dev->tx_tasklet); else mt76_txq_schedule(dev, txq->ac); Another solution would be add a status_tasklet that just goes through the tx queues receiving the usb tx completion and it schedules the tx_tasklet What do you think? - I guess it does not fix the 76x0e issue but we should just schedule tx queues in mt76x02_tx_tasklet() (like it is done for mt7603 and mt7615) and move status processing in mt76x02_poll_tx() Regards, Lorenzo
On Wed, Jul 31, 2019 at 11:09:27AM +0200, Lorenzo Bianconi wrote: > > On Wed, Jul 31, 2019 at 10:19:58AM +0200, Stanislaw Gruszka wrote: > > > On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote: > > > > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > > > > index 467b28379870..622251faa415 100644 > > > > > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget) > > > > > > > mt76.tx_napi); > > > > > > > int i; > > > > > > > > > > > > > > - mt76x02_mac_poll_tx_status(dev, false); > > > > > > > + mt76x02_mac_poll_tx_status(dev, true); > > > > > > > > > > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run > > > > > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only > > > > > > difference doing so is we do not run mt76x02_send_tx_status(). > > > > > > > > > > I thought this is the problem, but it was my mistake during testing. > > > > > I tested the above change together with mt76_txq_schedule(dev, txq->ac) > > > > > change and get wrong impression it fixes the issue. But above change > > > > > alone does not help. > > > > > > > > > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx() > > > > > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch > > > > > originally posted here make the problem gone. > > > > > > > > so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status() > > > > in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run > > > > mt76_txq_schedule_all() in mt76x02_poll_tx(), right? > > > > > > Yes. > > > > Err, no, I should read more cerfully. It is partiall revert of > > 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") : > > > > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c > > index 5397827668b9..fefe0ee52584 100644 > > --- a/drivers/net/wireless/mediatek/mt76/tx.c > > +++ b/drivers/net/wireless/mediatek/mt76/tx.c > > @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq) > > if (!test_bit(MT76_STATE_RUNNING, &dev->state)) > > return; > > > > - tasklet_schedule(&dev->tx_tasklet); > > + mt76_txq_schedule(dev, txq->ac); > > } > > EXPORT_SYMBOL_GPL(mt76_wake_tx_queue); > > reviewing the code I think: > > - we should not run mt76u_tx_tasklet() from mt76_wake_tx_queue() since we do > not have tx_napi for usb and it will unnecessary go through tx queue checks. > We should probably do in mt76_wake_tx_queue() something like: > > if (is_mmio()) Adding '&& !is_mt7630()' will solve the problem for MT7630E as well ... > tasklet_schedule(&dev->tx_tasklet); > else > mt76_txq_schedule(dev, txq->ac); > > Another solution would be add a status_tasklet that just goes through the tx > queues receiving the usb tx completion and it schedules the tx_tasklet > What do you think? > > - I guess it does not fix the 76x0e issue but we should just schedule tx queues in > mt76x02_tx_tasklet() (like it is done for mt7603 and mt7615) and move status > processing in mt76x02_poll_tx() ... but I think we have bug when do mt76_txq_schedule_all() in tx_tasklet, because we can schedule on queues that are stoped. So reverting 41634aa8d6db and then optimize by removing tx_tasklet for mmio and remove not needed mt76_txq_schedule_all() calls looks more reasoneble to me. Stanislaw
> On Wed, Jul 31, 2019 at 11:09:27AM +0200, Lorenzo Bianconi wrote: > > > On Wed, Jul 31, 2019 at 10:19:58AM +0200, Stanislaw Gruszka wrote: > > > > On Tue, Jul 30, 2019 at 04:55:31PM +0200, Lorenzo Bianconi wrote: > > > > > > > > diff --git a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > > > > > index 467b28379870..622251faa415 100644 > > > > > > > > --- a/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > > > > > +++ b/drivers/net/wireless/mediatek/mt76/mt76x02_mmio.c > > > > > > > > @@ -170,7 +170,7 @@ static int mt76x02_poll_tx(struct napi_struct *napi, int budget) > > > > > > > > mt76.tx_napi); > > > > > > > > int i; > > > > > > > > > > > > > > > > - mt76x02_mac_poll_tx_status(dev, false); > > > > > > > > + mt76x02_mac_poll_tx_status(dev, true); > > > > > > > > > > > > > > I am not sure if we really need mt76x02_mac_poll_tx_status() here since we run > > > > > > > it in mt76x02_tx_complete_skb() and in mt76x02_tx_tasklet(). Anyway the only > > > > > > > difference doing so is we do not run mt76x02_send_tx_status(). > > > > > > > > > > > > I thought this is the problem, but it was my mistake during testing. > > > > > > I tested the above change together with mt76_txq_schedule(dev, txq->ac) > > > > > > change and get wrong impression it fixes the issue. But above change > > > > > > alone does not help. > > > > > > > > > > > > I tried to add some locking to avoid parallel execution of mt76x02_poll_tx() > > > > > > and mt76x02_tx_tasklet(), but it didn't help either. So far only patch > > > > > > originally posted here make the problem gone. > > > > > > > > > > so, in order to be on the same page, if you comment out mt76x02_mac_poll_tx_status() > > > > > in mt76x02_poll_tx() the issue will still occur. The only to 'fix' it is to run > > > > > mt76_txq_schedule_all() in mt76x02_poll_tx(), right? > > > > > > > > Yes. > > > > > > Err, no, I should read more cerfully. It is partiall revert of > > > 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") : > > > > > > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c > > > index 5397827668b9..fefe0ee52584 100644 > > > --- a/drivers/net/wireless/mediatek/mt76/tx.c > > > +++ b/drivers/net/wireless/mediatek/mt76/tx.c > > > @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq) > > > if (!test_bit(MT76_STATE_RUNNING, &dev->state)) > > > return; > > > > > > - tasklet_schedule(&dev->tx_tasklet); > > > + mt76_txq_schedule(dev, txq->ac); > > > } > > > EXPORT_SYMBOL_GPL(mt76_wake_tx_queue); > > > > reviewing the code I think: > > > > - we should not run mt76u_tx_tasklet() from mt76_wake_tx_queue() since we do > > not have tx_napi for usb and it will unnecessary go through tx queue checks. > > We should probably do in mt76_wake_tx_queue() something like: > > > > if (is_mmio()) > > Adding '&& !is_mt7630()' will solve the problem for MT7630E as well ... > > > tasklet_schedule(&dev->tx_tasklet); > > else > > mt76_txq_schedule(dev, txq->ac); > > > > Another solution would be add a status_tasklet that just goes through the tx > > queues receiving the usb tx completion and it schedules the tx_tasklet > > What do you think? > > > > - I guess it does not fix the 76x0e issue but we should just schedule tx queues in > > mt76x02_tx_tasklet() (like it is done for mt7603 and mt7615) and move status > > processing in mt76x02_poll_tx() > > ... but I think we have bug when do mt76_txq_schedule_all() in > tx_tasklet, because we can schedule on queues that are stoped. > So reverting 41634aa8d6db and then optimize by removing tx_tasklet > for mmio and remove not needed mt76_txq_schedule_all() calls looks > more reasoneble to me. schedule a stopped queue seems not harmful at a first glance since we do not copy pending skbs if we have not enough room in the dma ring. Maybe we can be more conservative doing something like: diff --git a/drivers/net/wireless/mediatek/mt76/dma.c b/drivers/net/wireless/mediatek/mt76/dma.c index d8f61e540bfd..c6482155e5e4 100644 --- a/drivers/net/wireless/mediatek/mt76/dma.c +++ b/drivers/net/wireless/mediatek/mt76/dma.c @@ -346,6 +346,11 @@ mt76_dma_tx_queue_skb(struct mt76_dev *dev, enum mt76_txq_id qid, goto unmap; if (q->queued + (tx_info.nbuf + 1) / 2 >= q->ndesc - 1) { + if (!q->stopped) { + ieee80211_stop_queue(dev->hw, + skb_get_queue_mapping(skb)); + q->stopped = true; + } ret = -ENOMEM; goto unmap; } diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c index 5397827668b9..bd2d34c4f326 100644 --- a/drivers/net/wireless/mediatek/mt76/tx.c +++ b/drivers/net/wireless/mediatek/mt76/tx.c @@ -495,6 +495,9 @@ mt76_txq_schedule_list(struct mt76_dev *dev, enum mt76_txq_id qid) while (1) { bool empty = false; + if (hwq->stopped) + break; + if (sq->swq_queued >= 4) break; Does it fix the issue you are facing? Regards, Lorenzo > > Stanislaw
On Mon, Aug 05, 2019 at 01:27:19PM +0200, Lorenzo Bianconi wrote: > > ... but I think we have bug when do mt76_txq_schedule_all() in > > tx_tasklet, because we can schedule on queues that are stoped. > > So reverting 41634aa8d6db and then optimize by removing tx_tasklet > > for mmio and remove not needed mt76_txq_schedule_all() calls looks > > more reasoneble to me. > > schedule a stopped queue seems not harmful at a first glance since we do not > copy pending skbs if we have not enough room in the dma ring. mac80211 stop queues for various other reasons than IEEE80211_QUEUE_STOP_REASON_DRIVER . > Maybe we can be > more conservative doing something like: > > diff --git a/drivers/net/wireless/mediatek/mt76/dma.c b/drivers/net/wireless/mediatek/mt76/dma.c > index d8f61e540bfd..c6482155e5e4 100644 > --- a/drivers/net/wireless/mediatek/mt76/dma.c > +++ b/drivers/net/wireless/mediatek/mt76/dma.c > @@ -346,6 +346,11 @@ mt76_dma_tx_queue_skb(struct mt76_dev *dev, enum mt76_txq_id qid, > goto unmap; > > if (q->queued + (tx_info.nbuf + 1) / 2 >= q->ndesc - 1) { > + if (!q->stopped) { > + ieee80211_stop_queue(dev->hw, > + skb_get_queue_mapping(skb)); > + q->stopped = true; > + } > ret = -ENOMEM; > goto unmap; > } > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c > index 5397827668b9..bd2d34c4f326 100644 > --- a/drivers/net/wireless/mediatek/mt76/tx.c > +++ b/drivers/net/wireless/mediatek/mt76/tx.c > @@ -495,6 +495,9 @@ mt76_txq_schedule_list(struct mt76_dev *dev, enum mt76_txq_id qid) > while (1) { > bool empty = false; > > + if (hwq->stopped) > + break; > + > if (sq->swq_queued >= 4) > break; > > Does it fix the issue you are facing? I'll not be able to test this patch this week. Will have access to the hardware next week. I checeked before, if 'q->queued + (tx_info.nbuf + 1) / 2 >= q->ndesc - 1' is triggered when MT7630E hangs and it is not. But maybe second part of the patch will help. Stanislaw
On Mon, Aug 05, 2019 at 02:39:16PM +0200, Stanislaw Gruszka wrote: > On Mon, Aug 05, 2019 at 01:27:19PM +0200, Lorenzo Bianconi wrote: > > > ... but I think we have bug when do mt76_txq_schedule_all() in > > > tx_tasklet, because we can schedule on queues that are stoped. > > > So reverting 41634aa8d6db and then optimize by removing tx_tasklet > > > for mmio and remove not needed mt76_txq_schedule_all() calls looks > > > more reasoneble to me. > > > > schedule a stopped queue seems not harmful at a first glance since we do not > > copy pending skbs if we have not enough room in the dma ring. > > mac80211 stop queues for various other reasons than > IEEE80211_QUEUE_STOP_REASON_DRIVER . After looking in more details, we check if queue is stopped in ieee80211_tx_dequeue(). But we do not check that for skb's queued in mtxq->retry_q . > > diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c > > index 5397827668b9..bd2d34c4f326 100644 > > --- a/drivers/net/wireless/mediatek/mt76/tx.c > > +++ b/drivers/net/wireless/mediatek/mt76/tx.c > > @@ -495,6 +495,9 @@ mt76_txq_schedule_list(struct mt76_dev *dev, enum mt76_txq_id qid) > > while (1) { > > bool empty = false; > > > > + if (hwq->stopped) > > + break; > > + > > if (sq->swq_queued >= 4) > > break; > > > > Does it fix the issue you are facing? > > I'll not be able to test this patch this week. Will have access to > the hardware next week. > > I checeked before, if > 'q->queued + (tx_info.nbuf + 1) / 2 >= q->ndesc - 1' is triggered > when MT7630E hangs and it is not. But maybe second part of the patch > will help. Patch did not help. I debugged problem a bit more and issue is related with HW encryption. Using full mac80211 SW encyption by retuning -EOPNOTSUPP mt76x02_set_key() make the problem gone as well. Looks there is some race between setting HW keys and transmitting. Stanislaw
diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c index 5397827668b9..fefe0ee52584 100644 --- a/drivers/net/wireless/mediatek/mt76/tx.c +++ b/drivers/net/wireless/mediatek/mt76/tx.c @@ -598,7 +598,7 @@ void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq) if (!test_bit(MT76_STATE_RUNNING, &dev->state)) return; - tasklet_schedule(&dev->tx_tasklet); + mt76_txq_schedule(dev, txq->ac); } EXPORT_SYMBOL_GPL(mt76_wake_tx_queue);
Since 41634aa8d6db ("mt76: only schedule txqs from the tx tasklet") I can observe firmware hangs on MT7630E on station mode: tx stop functioning after minor activity (rx keep working) and on module unload device fail to stop with messages: [ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop [ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop Loading module again results in failure to associate with AP. Only machine power off / power on cycle can make device work again. I have no idea why the commit caused F/W hangs. Most likely some proper fix is needed of changing registers programming (or assuring proper order of actions), but so far I can not came up with any better fix than a partial revert of 41634aa8d6db. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> --- drivers/net/wireless/mediatek/mt76/tx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)