[wpan-next,v4,00/11] ieee802154: Synchronous Tx support

Message ID	20220519150516.443078-1-miquel.raynal@bootlin.com (mailing list archive)
From: Miquel Raynal <miquel.raynal@bootlin.com>
To: Alexander Aring <alex.aring@gmail.com>, Stefan Schmidt <stefan@datenfreihafen.org>, linux-wpan@vger.kernel.org
Cc: David Girault <david.girault@qorvo.com>, Romuald Despres <romuald.despres@qorvo.com>, Frederic Blain <frederic.blain@qorvo.com>, Nicolas Schodet <nico@ni.fr.eu.org>, Thomas Petazzoni <thomas.petazzoni@bootlin.com>, "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, netdev@vger.kernel.org, Miquel Raynal <miquel.raynal@bootlin.com>
Subject: [PATCH wpan-next v4 00/11] ieee802154: Synchronous Tx support
Date: Thu, 19 May 2022 17:05:05 +0200
Message-Id: <20220519150516.443078-1-miquel.raynal@bootlin.com>

Miquel Raynal May 19, 2022, 3:05 p.m. UTC

Hello,

This series brings support for that famous synchronous Tx API for MLME
commands.

MLME commands will be used during scan operations. In this situation,
we need to be sure that all transfers have finished and that no new
transfer will be queued for a short moment.

Cheers,
Miquèl

Changes in v4:
* Made visible the mlme_tx{_pre,,_post} helpers, used them later in the
  scanning code where relevant.
* Used the atomic_fetch_inc() alternative to only stop the queue when
  necessary.
* Used the netif_running() helper in place of the manual check against
  the IFF_UP netdev flag.
* Changed the error codes to ENETDOWN if the device was closed.
* Reworked the MLME transmissions error path so that they would not keep
  the rtnl taken.
* Updated the logic to avoid erroring out on the mlme_op_pre() call
  which just returns the code of the previous transmission (which we
  likely do not care about here).
* Dropped the queue_stopped variable, used the existing "flags"
  variable, turning it into an unsigned long so that it would accept
  atomic operations. Created a WPAN_PHY_FLAG_STATE_QUEUE_STOPPED
  definition for this purpose.
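
To illustrate the "only stop the queue when necessary" idea from the
notes above, here is a minimal user-space sketch. It is not the code
from the series: it uses C11 atomic_fetch_add() where the kernel would
use atomic_fetch_inc(), every name in it (phy_model, model_hold_queue(),
and so on) is hypothetical, and it leaves out any locking, so it only
shows the counting logic.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins; none of these names come from the series. */
#define MODEL_QUEUE_STOPPED_BIT 0UL

struct phy_model {
	atomic_int hold_cnt;   /* number of callers currently holding the queue */
	atomic_ulong flags;    /* bit 0 mirrors a "queue stopped" state flag */
	int stop_calls;        /* how many times the queue was really stopped */
	int wake_calls;        /* how many times the queue was really woken up */
};

static void model_stop_queue(struct phy_model *phy)
{
	atomic_fetch_or(&phy->flags, 1UL << MODEL_QUEUE_STOPPED_BIT);
	phy->stop_calls++;
}

static void model_wake_queue(struct phy_model *phy)
{
	atomic_fetch_and(&phy->flags, ~(1UL << MODEL_QUEUE_STOPPED_BIT));
	phy->wake_calls++;
}

/* Stop the queue only when the counter goes from 0 to 1. */
static void model_hold_queue(struct phy_model *phy)
{
	if (atomic_fetch_add(&phy->hold_cnt, 1) == 0)
		model_stop_queue(phy);
}

/* Wake the queue only when the last holder releases it. */
static void model_release_queue(struct phy_model *phy)
{
	if (atomic_fetch_sub(&phy->hold_cnt, 1) == 1)
		model_wake_queue(phy);
}

static bool model_queue_stopped(struct phy_model *phy)
{
	return atomic_load(&phy->flags) & (1UL << MODEL_QUEUE_STOPPED_BIT);
}
```

With two nested holds, the queue is stopped once on the first hold and
woken up once on the second release; the intermediate calls only touch
the counter.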

Changes in v3:
* Tested with lockdep enabled, a more aggressive preemption level and
  the sleeping while atomic warnings enabled.
* Changed the hold/release queue mutex into a spinlock.
* Split the mlme_tx function into three: one helper to hold the queue,
  another part that takes the rtnl and has the real content, and a
  last helper to release the queue.
* Fixed the warning condition in the slow path.
* Used an unsigned long and test/set_bit helpers to follow the queue
  state instead of an atomic_t.

Changes in v2:
* Updated the main tx function error path.
* Added a missing atomic_dec_and_test() call on the hold counter.
* Always called (upon a certain condition) the queue wakeup helper from
  the release queue helper (and similarly in the hold helper) and
  squashed two existing patches in it to simplify the series.
* Introduced a mutex to serialize accesses to the increment/decrement of
  the hold counter and the wake up call.
* Added a warning in case an MLME Tx gets triggered while the device was
  stopped.
* Used the rtnl to ensure the device cannot be stopped while an MLME
  transmission is ongoing.

Changes in v1 since this series got extracted from a bigger change:
* Introduced a new atomic variable to know when the queue is actually
  stopped. So far we only had an atomic to know when the queue was held
  (indicates a transitioning state towards a stopped queue only) and
  another atomic indicating if a transfer was still ongoing at this
  point (used by the wait logic as a condition to wake up).

Miquel Raynal (11):
  net: mac802154: Rename the synchronous xmit worker
  net: mac802154: Rename the main tx_work struct
  net: mac802154: Enhance the error path in the main tx helper
  net: mac802154: Follow the count of ongoing transmissions
  net: mac802154: Bring the ability to hold the transmit queue
  net: mac802154: Create a hot tx path
  net: mac802154: Introduce a helper to disable the queue
  net: mac802154: Introduce a tx queue flushing mechanism
  net: mac802154: Introduce a synchronous API for MLME commands
  net: mac802154: Add a warning in the hot path
  net: mac802154: Add a warning in the slow path

 include/net/cfg802154.h      |  13 +++-
 include/net/mac802154.h      |  27 -------
 net/ieee802154/core.c        |   3 +
 net/mac802154/cfg.c          |   4 +-
 net/mac802154/ieee802154_i.h |  40 +++++++++-
 net/mac802154/main.c         |   2 +-
 net/mac802154/tx.c           | 147 +++++++++++++++++++++++++++++++----
 net/mac802154/util.c         |  71 +++++++++++++++--
 8 files changed, 252 insertions(+), 55 deletions(-)

Alexander Aring June 1, 2022, 3:30 a.m. UTC | #1

Hi,

On Thu, May 19, 2022 at 11:06 AM Miquel Raynal
<miquel.raynal@bootlin.com> wrote:
>
> Hello,
>
> This series brings support for that famous synchronous Tx API for MLME
> commands.
>
> MLME commands will be used during scan operations. In this situation,
> we need to be sure that all transfers finished and that no transfer
> will be queued for a short moment.
>

Acked-by: Alexander Aring <aahringo@redhat.com>

There will now be functions upstream which are never used; Stefan
should wait until they are used before sending this to net-next.

- Alex

Miquel Raynal June 1, 2022, 6:12 a.m. UTC | #2

Hi Alexander,

aahringo@redhat.com wrote on Tue, 31 May 2022 23:30:25 -0400:

> Hi,
> 
> On Thu, May 19, 2022 at 11:06 AM Miquel Raynal
> <miquel.raynal@bootlin.com> wrote:
> >
> > Hello,
> >
> > This series brings support for that famous synchronous Tx API for MLME
> > commands.
> >
> > MLME commands will be used during scan operations. In this situation,
> > we need to be sure that all transfers finished and that no transfer
> > will be queued for a short moment.
> >  
> 
> Acked-by: Alexander Aring <aahringo@redhat.com>
> 
> There will be now functions upstream which will never be used, Stefan
> should wait until they are getting used before sending it to net-next.

That's right.

Thanks for all the feedback so far!
Miquèl

Stefan Schmidt June 1, 2022, 9:01 p.m. UTC | #3

Hello.

On 01.06.22 05:30, Alexander Aring wrote:
> Hi,
> 
> On Thu, May 19, 2022 at 11:06 AM Miquel Raynal
> <miquel.raynal@bootlin.com> wrote:
>>
>> Hello,
>>
>> This series brings support for that famous synchronous Tx API for MLME
>> commands.
>>
>> MLME commands will be used during scan operations. In this situation,
>> we need to be sure that all transfers finished and that no transfer
>> will be queued for a short moment.
>>
> 
> Acked-by: Alexander Aring <aahringo@redhat.com>

These patches have been applied to the wpan-next tree. Thanks!

> There will be now functions upstream which will never be used, Stefan
> should wait until they are getting used before sending it to net-next.

Indeed this can wait until we have a consumer of the functions before 
pushing this forward to net-next. Pretty sure Miquel is happy to finally 
move on to other pieces of his puzzle and use them. :-)

regards
Stefan Schmidt

Miquel Raynal June 3, 2022, 5:55 p.m. UTC | #4

Hi Stefan, Alex,

stefan@datenfreihafen.org wrote on Wed, 1 Jun 2022 23:01:51 +0200:

> Hello.
> 
> On 01.06.22 05:30, Alexander Aring wrote:
> > Hi,
> > 
> > On Thu, May 19, 2022 at 11:06 AM Miquel Raynal
> > <miquel.raynal@bootlin.com> wrote:  
> >>
> >> Hello,
> >>
> >> This series brings support for that famous synchronous Tx API for MLME
> >> commands.
> >>
> >> MLME commands will be used during scan operations. In this situation,
> >> we need to be sure that all transfers finished and that no transfer
> >> will be queued for a short moment.
> >>  
> > 
> > Acked-by: Alexander Aring <aahringo@redhat.com>  
> 
> These patches have been applied to the wpan-next tree. Thanks!
> 
> > There will be now functions upstream which will never be used, Stefan
> > should wait until they are getting used before sending it to net-next.  
> 
> Indeed this can wait until we have a consumer of the functions before pushing this forward to net-next. Pretty sure Miquel is happy to finally move on to other pieces of his puzzle and use them. :-)

Next part is coming!

In the meantime I've experienced a new lockdep warning:

All the netlink commands are executed with the rtnl taken.
In my current implementation, when I configure/edit a scan request or a
beacon request I take a scan_lock or a beacons_lock, so they may only
be taken after the rtnl in this case, which leads to this sequence of
events:
- the rtnl is taken (by the net core)
- the beacon's lock is taken

But now in a beacon's work or an active scan work, what happens is:
- work gets woken up
- the beacon/scan lock is taken
- a beacon/beacon-request frame is transmitted
- the rtnl lock is taken during this transmission

Lockdep then detects a possible circular dependency:
[  490.153387]        CPU0                    CPU1
[  490.153391]        ----                    ----
[  490.153394]   lock(&local->beacons_lock);
[  490.153400]                                lock(rtnl_mutex);
[  490.153406]                                lock(&local->beacons_lock);
[  490.153412]   lock(rtnl_mutex);

So in practice, I always need to have the rtnl lock taken when
acquiring these other locks (beacon/scan_lock) which I think is far
from optimal.
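
The two orderings above can be modelled with a toy lock-order tracker.
This is only a sketch to show why the second ordering is flagged: it is
far simpler than what lockdep really does, and the names (lock_id,
record_order(), ...) are made up for the illustration.

```c
#include <stdbool.h>

/* Toy lock-order tracker (illustration only, far simpler than lockdep). */
enum lock_id { LOCK_RTNL, LOCK_BEACONS, LOCK_COUNT };

/* dep[a][b] is true once lock b has been taken while lock a was held. */
static bool dep[LOCK_COUNT][LOCK_COUNT];

/* Depth-first search: can 'to' be reached from 'from' through recorded edges? */
static bool reachable(enum lock_id from, enum lock_id to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < LOCK_COUNT; next++)
		if (dep[from][next] && !seen[next] &&
		    reachable((enum lock_id)next, to, seen))
			return true;
	return false;
}

/*
 * Record "acquiring 'inner' while holding 'outer'".
 * Returns false if this ordering closes a cycle with an earlier one.
 */
static bool record_order(enum lock_id outer, enum lock_id inner)
{
	bool seen[LOCK_COUNT] = { false };
	bool inverts = reachable(inner, outer, seen);

	dep[outer][inner] = true;
	return !inverts;
}
```

Recording the netlink path first (rtnl, then beacons lock) is accepted;
recording the work path afterwards (beacons lock, then rtnl) closes the
cycle and is reported, whether or not both paths ever run concurrently.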

1# One solution is to drop the beacons/scan locks because they are not
useful anymore and simply rely on the rtnl.

2# Another solution would be to change the mlme_tx() implementation to
finally not need the rtnl at all.

Note that just calling ASSERT_RTNL() makes no difference in 2#, it
still means that I always need to acquire the rtnl before acquiring the
beacons/scan locks, which greatly reduces their usefulness and leads to
solution 1# in the end.

IIRC I decided to introduce the rtnl to avoid ->ndo_stop() calls during
an MLME transmission. I don't know if it has another use there. If not,
we may perhaps get rid of the rtnl in mlme_tx() by really handling the
stop calls (but I was too lazy so far to do that).

What direction would you advise?

Thanks,
Miquèl

Alexander Aring June 4, 2022, 1:50 a.m. UTC | #5

Hi,

On Fri, Jun 3, 2022 at 1:55 PM Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>
> Hi Stefan, Alex,
>
> stefan@datenfreihafen.org wrote on Wed, 1 Jun 2022 23:01:51 +0200:
>
> > Hello.
> >
> > On 01.06.22 05:30, Alexander Aring wrote:
> > > Hi,
> > >
> > > On Thu, May 19, 2022 at 11:06 AM Miquel Raynal
> > > <miquel.raynal@bootlin.com> wrote:
> > >>
> > >> Hello,
> > >>
> > >> This series brings support for that famous synchronous Tx API for MLME
> > >> commands.
> > >>
> > >> MLME commands will be used during scan operations. In this situation,
> > >> we need to be sure that all transfers finished and that no transfer
> > >> will be queued for a short moment.
> > >>
> > >
> > > Acked-by: Alexander Aring <aahringo@redhat.com>
> >
> > These patches have been applied to the wpan-next tree. Thanks!
> >
> > > There will be now functions upstream which will never be used, Stefan
> > > should wait until they are getting used before sending it to net-next.
> >
> > Indeed this can wait until we have a consumer of the functions before pushing this forward to net-next. Pretty sure Miquel is happy to finally move on to other pieces of his puzzle and use them. :-)
>
> Next part is coming!
>
> In the mean time I've experienced a new lockdep warning:
>
> All the netlink commands are executed with the rtnl taken.
> In my current implementation, when I configure/edit a scan request or a
> beacon request I take a scan_lock or a beacons_lock, so they may only
> be taken after the rtnl in this case, which leads to this sequence of
> events:
> - the rtnl is taken (by the net core)
> - the beacon's lock is taken
>
> But now in a beacon's work or an active scan work, what happens is:
> - work gets woken up
> - the beacon/scan lock is taken
> - a beacon/beacon-request frame is transmitted
> - the rtnl lock is taken during this transmission
>
> Lockdep then detects a possible circular dependency:
> [  490.153387]        CPU0                    CPU1
> [  490.153391]        ----                    ----
> [  490.153394]   lock(&local->beacons_lock);
> [  490.153400]                                lock(rtnl_mutex);
> [  490.153406]                                lock(&local->beacons_lock);
> [  490.153412]   lock(rtnl_mutex);
>
> So in practice, I always need to have the rtnl lock taken when
> acquiring these other locks (beacon/scan_lock) which I think is far
> from optimal.
>

*Note that those can also be false positives.

> 1# One solution is to drop the beacons/scan locks because they are not
> useful anymore and simply rely on the rtnl.
>

depends on how long it will be held.

> 2# Another solution would be to change the mlme_tx() implementation to
> finally not need the rtnl at all.
>
> Note that just calling ASSERT_RTNL() makes no difference in 2#, it
> still means that I always need to acquire the rtnl before acquiring the
> beacons/scan locks, which greatly reduces their usefulness and leads to
> solution 1# in the end.
>
> IIRC I decided to introduce the rtnl to avoid ->ndo_stop() calls during
> an MLME transmission. I don't know if it has another use there. If not,
> we may perhaps get rid of the rtnl in mlme_tx() by really handling the
> stop calls (but I was too lazy so far to do that).
>
> What direction would you advise?

Hard to say without code. Please show us some code of the current
state... there should also be some stacktrace of the circular lock
dependency, please provide the full output _matching_ the provided
code.

Thanks.

- Alex

Miquel Raynal June 6, 2022, 5:03 p.m. UTC | #6

Hi Alex,

aahringo@redhat.com wrote on Fri, 3 Jun 2022 21:50:15 -0400:

> Hi,
> 
> On Fri, Jun 3, 2022 at 1:55 PM Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> >
> > Hi Stefan, Alex,
> >
> > stefan@datenfreihafen.org wrote on Wed, 1 Jun 2022 23:01:51 +0200:
> >  
> > > Hello.
> > >
> > > On 01.06.22 05:30, Alexander Aring wrote:  
> > > > Hi,
> > > >
> > > > On Thu, May 19, 2022 at 11:06 AM Miquel Raynal
> > > > <miquel.raynal@bootlin.com> wrote:  
> > > >>
> > > >> Hello,
> > > >>
> > > >> This series brings support for that famous synchronous Tx API for MLME
> > > >> commands.
> > > >>
> > > >> MLME commands will be used during scan operations. In this situation,
> > > >> we need to be sure that all transfers finished and that no transfer
> > > >> will be queued for a short moment.
> > > >>  
> > > >
> > > > Acked-by: Alexander Aring <aahringo@redhat.com>  
> > >
> > > These patches have been applied to the wpan-next tree. Thanks!
> > >  
> > > > There will be now functions upstream which will never be used, Stefan
> > > > should wait until they are getting used before sending it to net-next.  
> > >
> > > Indeed this can wait until we have a consumer of the functions before pushing this forward to net-next. Pretty sure Miquel is happy to finally move on to other pieces of his puzzle and use them. :-)  
> >
> > Next part is coming!
> >
> > In the mean time I've experienced a new lockdep warning:
> >
> > All the netlink commands are executed with the rtnl taken.
> > In my current implementation, when I configure/edit a scan request or a
> > beacon request I take a scan_lock or a beacons_lock, so they may only
> > be taken after the rtnl in this case, which leads to this sequence of
> > events:
> > - the rtnl is taken (by the net core)
> > - the beacon's lock is taken
> >
> > But now in a beacon's work or an active scan work, what happens is:
> > - work gets woken up
> > - the beacon/scan lock is taken
> > - a beacon/beacon-request frame is transmitted
> > - the rtnl lock is taken during this transmission
> >
> > Lockdep then detects a possible circular dependency:
> > [  490.153387]        CPU0                    CPU1
> > [  490.153391]        ----                    ----
> > [  490.153394]   lock(&local->beacons_lock);
> > [  490.153400]                                lock(rtnl_mutex);
> > [  490.153406]                                lock(&local->beacons_lock);
> > [  490.153412]   lock(rtnl_mutex);
> >
> > So in practice, I always need to have the rtnl lock taken when
> > acquiring these other locks (beacon/scan_lock) which I think is far
> > from optimal.
> >  
> 
> *Note that those can also be false positives.
> 
> > 1# One solution is to drop the beacons/scan locks because they are not
> > useful anymore and simply rely on the rtnl.
> >  
> 
> depends on how long it will be held.
> 
> > 2# Another solution would be to change the mlme_tx() implementation to
> > finally not need the rtnl at all.
> >
> > Note that just calling ASSERT_RTNL() makes no difference in 2#, it
> > still means that I always need to acquire the rtnl before acquiring the
> > beacons/scan locks, which greatly reduces their usefulness and leads to
> > solution 1# in the end.
> >
> > IIRC I decided to introduce the rtnl to avoid ->ndo_stop() calls during
> > an MLME transmission. I don't know if it has another use there. If not,
> > we may perhaps get rid of the rtnl in mlme_tx() by really handling the
> > stop calls (but I was too lazy so far to do that).
> >
> > What direction would you advise?  
> 
> Hard to say without code. Please show us some code of the current
> state... there should also be some stacktrace of the circular lock
> dependency, please provide the full output _matching_ the provided
> code.

Of course, here is the branch that I used to produce the warning:
https://github.com/miquelraynal/linux/ branch wpan-next/scan

Triggering this is just a matter of executing nl802154_send_beacons().
And here is the trace which appears in the dmesg:

[  234.224911] mac802154_hwsim mac802154_hwsim: Added 2 mac802154 hwsim hardware radios
[  257.846221] Sending beacon

[  257.847439] ======================================================
[  257.847446] WARNING: possible circular locking dependency detected
[  257.847463] 5.18.0-rc4-uwb+ #217 Not tainted
[  257.847473] ------------------------------------------------------
[  257.847479] kworker/u4:4/53 is trying to acquire lock:
[  257.847488] ffffffff9d049d48 (rtnl_mutex){+.+.}-{3:3}, at: ieee802154_mlme_tx+0xf/0x160 [mac802154]
[  257.847577] 
               but task is already holding lock:
[  257.847584] ffff89b082ea7ae0 (&local->beacons_lock){+.+.}-{3:3}, at: mac802154_beacons_work+0x1d/0xb0 [mac802154]
[  257.847651] 
               which lock already depends on the new lock.

[  257.847668] 
               the existing dependency chain (in reverse order) is:
[  257.847674] 
               -> #1 (&local->beacons_lock){+.+.}-{3:3}:
[  257.847702]        __mutex_lock+0x9d/0x9a0
[  257.847719]        mac802154_send_beacons+0x32/0x80 [mac802154]
[  257.847767]        nl802154_send_beacons+0xd7/0x1f0 [ieee802154]
[  257.847829]        genl_family_rcv_msg_doit+0xe5/0x140
[  257.847842]        genl_rcv_msg+0xd7/0x1e0
[  257.847852]        netlink_rcv_skb+0x4c/0xf0
[  257.847862]        genl_rcv+0x1f/0x30
[  257.847871]        netlink_unicast+0x191/0x260
[  257.847882]        netlink_sendmsg+0x22e/0x480
[  257.847892]        sock_sendmsg+0x59/0x60
[  257.847907]        ____sys_sendmsg+0x20c/0x260
[  257.847922]        ___sys_sendmsg+0x7c/0xc0
[  257.847932]        __sys_sendmsg+0x54/0xa0
[  257.847942]        do_syscall_64+0x3b/0x90
[  257.847956]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  257.847972] 
               -> #0 (rtnl_mutex){+.+.}-{3:3}:
[  257.847989]        __lock_acquire+0x1253/0x22e0
[  257.848002]        lock_acquire+0xca/0x2f0
[  257.848011]        __mutex_lock+0x9d/0x9a0
[  257.848023]        ieee802154_mlme_tx+0xf/0x160 [mac802154]
[  257.848058]        ieee802154_mlme_tx_one+0x2d/0x40 [mac802154]
[  257.848092]        mac802154_beacons_work.cold+0x100/0x110 [mac802154]
[  257.848135]        process_one_work+0x26f/0x5a0
[  257.848147]        worker_thread+0x4a/0x3d0
[  257.848158]        kthread+0xee/0x120
[  257.848168]        ret_from_fork+0x22/0x30
[  257.848180] 
               other info that might help us debug this:

[  257.848187]  Possible unsafe locking scenario:

[  257.848192]        CPU0                    CPU1
[  257.848198]        ----                    ----
[  257.848203]   lock(&local->beacons_lock);
[  257.848215]                                lock(rtnl_mutex);
[  257.848226]                                lock(&local->beacons_lock);
[  257.848236]   lock(rtnl_mutex);
[  257.848246] 
                *** DEADLOCK ***

[  257.848252] 3 locks held by kworker/u4:4/53:
[  257.848262]  #0: ffff89b0b66b4138 ((wq_completion)phy0){+.+.}-{0:0}, at: process_one_work+0x1ef/0x5a0
[  257.848290]  #1: ffffa021404e3e78 ((work_completion)(&(&local->beacons_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1ef/0x5a0
[  257.848317]  #2: ffff89b082ea7ae0 (&local->beacons_lock){+.+.}-{3:3}, at: mac802154_beacons_work+0x1d/0xb0 [mac802154]
[  257.848371] 
               stack backtrace:
[  257.848388] CPU: 1 PID: 53 Comm: kworker/u4:4 Not tainted 5.18.0-rc4-uwb+ #217
[  257.848404] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
[  257.848422] Workqueue: phy0 mac802154_beacons_work [mac802154]
[  257.848472] Call Trace:
[  257.848490]  <TASK>
[  257.848507]  dump_stack_lvl+0x45/0x59
[  257.848536]  check_noncircular+0xfe/0x110
[  257.848559]  __lock_acquire+0x1253/0x22e0
[  257.848581]  lock_acquire+0xca/0x2f0
[  257.848592]  ? ieee802154_mlme_tx+0xf/0x160 [mac802154]
[  257.848638]  __mutex_lock+0x9d/0x9a0
[  257.848652]  ? ieee802154_mlme_tx+0xf/0x160 [mac802154]
[  257.848690]  ? mark_held_locks+0x49/0x70
[  257.848701]  ? ieee802154_mlme_tx+0xf/0x160 [mac802154]
[  257.848737]  ? _raw_spin_unlock_irqrestore+0x28/0x50
[  257.848753]  ? lockdep_hardirqs_on+0x79/0x100
[  257.848770]  ? ieee802154_mlme_tx+0xf/0x160 [mac802154]
[  257.848804]  ieee802154_mlme_tx+0xf/0x160 [mac802154]
[  257.848841]  ieee802154_mlme_tx_one+0x2d/0x40 [mac802154]
[  257.848878]  mac802154_beacons_work.cold+0x100/0x110 [mac802154]
[  257.848924]  process_one_work+0x26f/0x5a0
[  257.848944]  worker_thread+0x4a/0x3d0
[  257.848959]  ? process_one_work+0x5a0/0x5a0
[  257.848971]  kthread+0xee/0x120
[  257.848981]  ? kthread_complete_and_exit+0x20/0x20
[  257.848995]  ret_from_fork+0x22/0x30
[  257.849022]  </TASK>

Miquel Raynal June 17, 2022, 2:20 p.m. UTC | #7

Hi Alex,

aahringo@redhat.com wrote on Fri, 3 Jun 2022 21:50:15 -0400:

> Hi,
> 
> On Fri, Jun 3, 2022 at 1:55 PM Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> >
> > Hi Stefan, Alex,
> >
> > stefan@datenfreihafen.org wrote on Wed, 1 Jun 2022 23:01:51 +0200:
> >  
> > > Hello.
> > >
> > > On 01.06.22 05:30, Alexander Aring wrote:  
> > > > Hi,
> > > >
> > > > On Thu, May 19, 2022 at 11:06 AM Miquel Raynal
> > > > <miquel.raynal@bootlin.com> wrote:  
> > > >>
> > > >> Hello,
> > > >>
> > > >> This series brings support for that famous synchronous Tx API for MLME
> > > >> commands.
> > > >>
> > > >> MLME commands will be used during scan operations. In this situation,
> > > >> we need to be sure that all transfers finished and that no transfer
> > > >> will be queued for a short moment.
> > > >>  
> > > >
> > > > Acked-by: Alexander Aring <aahringo@redhat.com>  
> > >
> > > These patches have been applied to the wpan-next tree. Thanks!
> > >  
> > > > There will be now functions upstream which will never be used, Stefan
> > > > should wait until they are getting used before sending it to net-next.  
> > >
> > > Indeed this can wait until we have a consumer of the functions before pushing this forward to net-next. Pretty sure Miquel is happy to finally move on to other pieces of his puzzle and use them. :-)  
> >
> > Next part is coming!
> >
> > In the mean time I've experienced a new lockdep warning:
> >
> > All the netlink commands are executed with the rtnl taken.
> > In my current implementation, when I configure/edit a scan request or a
> > beacon request I take a scan_lock or a beacons_lock, so they may only
> > be taken after the rtnl in this case, which leads to this sequence of
> > events:
> > - the rtnl is taken (by the net core)
> > - the beacon's lock is taken
> >
> > But now in a beacon's work or an active scan work, what happens is:
> > - work gets woken up
> > - the beacon/scan lock is taken
> > - a beacon/beacon-request frame is transmitted
> > - the rtnl lock is taken during this transmission
> >
> > Lockdep then detects a possible circular dependency:
> > [  490.153387]        CPU0                    CPU1
> > [  490.153391]        ----                    ----
> > [  490.153394]   lock(&local->beacons_lock);
> > [  490.153400]                                lock(rtnl_mutex);
> > [  490.153406]                                lock(&local->beacons_lock);
> > [  490.153412]   lock(rtnl_mutex);

So after a lot of thinking and several attempts, I've opted for a
slightly different approach regarding the rtnl being taken in the mlme
tx path. What we want there is actually to be sure that the device
won't be turned off during the transmission. Either the device is
stopped before, and the transmission just returns an error (which is
fine), or there is no ndo_stop() call and we are safe. So I've
introduced a mutex to serialize accesses to the "stop the device"
section, which is actually what we care about. It works well, avoids
keeping the rtnl in all the scan/beacons works (which would have been
a crazy thing to do IMHO) and allows keeping a beacons/scan mutex for
the configuration of these specific parts. I'll propose this change in
the upcoming series.
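
A rough user-space model of that idea, using a pthread mutex, could
look like the following. All names here (dev_model, stop_lock,
model_mlme_tx(), ...) are hypothetical and the real patch may well be
structured differently; the sketch only shows a dedicated lock shared
by the stop path and the MLME Tx path, with -ENETDOWN returned once the
device has been stopped.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model; field and function names are not from the kernel. */
struct dev_model {
	pthread_mutex_t stop_lock; /* serializes "stop the device" against MLME Tx */
	bool running;
	int frames_sent;
};

static void model_init(struct dev_model *dev)
{
	pthread_mutex_init(&dev->stop_lock, NULL);
	dev->running = true;
	dev->frames_sent = 0;
}

/* Stop path: cannot complete while a transmission holds stop_lock. */
static void model_stop(struct dev_model *dev)
{
	pthread_mutex_lock(&dev->stop_lock);
	dev->running = false;
	pthread_mutex_unlock(&dev->stop_lock);
}

/* MLME Tx path: no rtnl needed, only the dedicated lock. */
static int model_mlme_tx(struct dev_model *dev)
{
	int ret = 0;

	pthread_mutex_lock(&dev->stop_lock);
	if (!dev->running)
		ret = -ENETDOWN;    /* device already stopped: just report it */
	else
		dev->frames_sent++; /* stands in for the real transmission */
	pthread_mutex_unlock(&dev->stop_lock);

	return ret;
}
```

In this model a work item may take its own beacons/scan lock and then
call model_mlme_tx() without ever touching a global lock, which is what
removes the inverted ordering.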

Thanks,
Miquèl
