diff mbox series

station: delay roam until fully connected

Message ID 20230417185457.1000642-1-prestwoj@gmail.com (mailing list archive)
State New
Headers show
Series station: delay roam until fully connected | expand

Checks

Context Check Description
tedd_an/pre-ci_am success Success
prestwoj/iwd-alpine-ci-fetch success Fetch PR
prestwoj/iwd-ci-gitlint success GitLint
prestwoj/iwd-ci-fetch success Fetch PR
prestwoj/iwd-alpine-ci-makedistcheck fail Make Distcheck Make FAIL: make[2]: *** No rule to make target 'ell/sysctl.h', needed by 'distdir-am'. Stop. make[1]: *** [Makefile:3218: distdir] Error 2 make: *** [Makefile:3298: dist] Error 2
prestwoj/iwd-alpine-ci-incremental_build success Incremental build not run PASS
prestwoj/iwd-alpine-ci-build success Build - Configure
prestwoj/iwd-alpine-ci-makecheckvalgrind fail Make FAIL: make[1]: *** No rule to make target 'ell/sysctl.c', needed by 'ell/sysctl.lo'. Stop. make[1]: *** Waiting for unfinished jobs.... make: *** [Makefile:1710: all] Error 2
prestwoj/iwd-alpine-ci-makecheck pending makecheck SKIP
prestwoj/iwd-ci-incremental_build success Incremental build not run PASS
prestwoj/iwd-ci-build success Build - Configure
prestwoj/iwd-ci-makecheckvalgrind fail Make FAIL: make[1]: *** No rule to make target 'ell/sysctl.c', needed by 'ell/sysctl.lo'. Stop. make[1]: *** Waiting for unfinished jobs.... make: *** [Makefile:1709: all] Error 2
prestwoj/iwd-ci-clang fail Clang IWD - make FAIL: make[1]: *** No rule to make target 'ell/sysctl.c', needed by 'ell/sysctl.lo'. Stop. make[1]: *** Waiting for unfinished jobs.... make: *** [Makefile:1709: all] Error 2
prestwoj/iwd-ci-makecheck pending makecheck SKIP
prestwoj/iwd-ci-makedistcheck fail Make Distcheck Make FAIL: make[2]: *** No rule to make target 'ell/sysctl.h', needed by 'distdir-am'. Stop. make[1]: *** [Makefile:3217: distdir] Error 2 make: *** [Makefile:3297: dist] Error 2
prestwoj/iwd-ci-testrunner pending testrunner SKIP

Commit Message

James Prestwood April 17, 2023, 6:54 p.m. UTC
If DHCP/netconfig takes a long time to finish IWD could end up trying
to roam to another BSS. This could happen if the signal strength is
below the set threshold while connecting. Once authenticated to the
AP IWD would get a CQM event and start a roam timer. If DHCP did not
finish before that timer IWD would start a roam. IWD would not handle
this well since a roam doesn't trigger DHCP to start and likely the
network infrustructure would be very confused.

To fix this continue to delay the roam attempt until IWD reaches the
connected state.
---
 src/station.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comments

Denis Kenzior April 23, 2023, 11:39 p.m. UTC | #1
Hi James,

On 4/17/23 13:54, James Prestwood wrote:
> If DHCP/netconfig takes a long time to finish IWD could end up trying
> to roam to another BSS. This could happen if the signal strength is
> below the set threshold while connecting. Once authenticated to the
> AP IWD would get a CQM event and start a roam timer. If DHCP did not
> finish before that timer IWD would start a roam. IWD would not handle
> this well since a roam doesn't trigger DHCP to start and likely the
> network infrustructure would be very confused.

Typo here, 'infrustructure'.

> 
> To fix this continue to delay the roam attempt until IWD reaches the
> connected state.

I'm not convinced this is the right approach.  If the signal is bad enough that 
DHCP is failing, then delaying the roam is probably the worst choice we can 
make, no?

I would think the better way to fix this would be to track netconfig state, and 
if it never reached a fully configured state prior to roaming, reset & 
reconfigure upon a successful "roam".  Note that since we haven't signaled 
"connected" just yet, we're really still just in the "connecting" phase.

> ---
>   src/station.c | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
> 

Regards,
-Denis
James Prestwood April 24, 2023, 1:24 p.m. UTC | #2
Hi Denis,

On 4/23/23 4:39 PM, Denis Kenzior wrote:
> Hi James,
> 
> On 4/17/23 13:54, James Prestwood wrote:
>> If DHCP/netconfig takes a long time to finish IWD could end up trying
>> to roam to another BSS. This could happen if the signal strength is
>> below the set threshold while connecting. Once authenticated to the
>> AP IWD would get a CQM event and start a roam timer. If DHCP did not
>> finish before that timer IWD would start a roam. IWD would not handle
>> this well since a roam doesn't trigger DHCP to start and likely the
>> network infrustructure would be very confused.
> 
> Typo here, 'infrustructure'.
> 
>>
>> To fix this continue to delay the roam attempt until IWD reaches the
>> connected state.
> 
> I'm not convinced this is the right approach.  If the signal is bad 
> enough that DHCP is failing, then delaying the roam is probably the 
> worst choice we can make, no?

It may not be that the connection quality caused DHCP to take a while, 
could be the backend just was slow and not responding quickly.

> 
> I would think the better way to fix this would be to track netconfig 
> state, and if it never reached a fully configured state prior to 
> roaming, reset & reconfigure upon a successful "roam".  Note that since 
> we haven't signaled "connected" just yet, we're really still just in the 
> "connecting" phase.

The question is if the AP infrastructure could handle this. For example 
the worst case being if DHCP started and we roamed right in the 
middle... That's why I thinking the safest option was to complete DHCP, 
then roam. I never actually saw this behavior happening, just noticed 
that IWD would get very confused if it did happen.

Thanks,
James

> 
>> ---
>>   src/station.c | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
> 
> Regards,
> -Denis
Denis Kenzior April 30, 2023, 6:07 p.m. UTC | #3
Hi James,

>> I'm not convinced this is the right approach.  If the signal is bad enough 
>> that DHCP is failing, then delaying the roam is probably the worst choice we 
>> can make, no?
> 
> It may not be that the connection quality caused DHCP to take a while, could be 
> the backend just was slow and not responding quickly.

Possible.  But I don't think this is a good reason to remain on a bad AP. 
Besides, what if iwd is not running DHCP and it is being done by some external 
app?  In that case iwd will happily roam off anyway and the external app will 
restart dhcp.  So remaining on the bad AP is asymmetrical to the 
EnableNetworkConfiguration=False situation.

> 
>>
>> I would think the better way to fix this would be to track netconfig state, 
>> and if it never reached a fully configured state prior to roaming, reset & 
>> reconfigure upon a successful "roam".  Note that since we haven't signaled 
>> "connected" just yet, we're really still just in the "connecting" phase.
> 
> The question is if the AP infrastructure could handle this. For example the 
> worst case being if DHCP started and we roamed right in the middle... That's why 
> I thinking the safest option was to complete DHCP, then roam. I never actually 
> saw this behavior happening, just noticed that IWD would get very confused if it 
> did happen.

Well, see above.  Infrastructure has to handle this possibility anyway.  But 
yes, this is certainly a corner case we are not handling right now.

Regards,
-Denis
diff mbox series

Patch

diff --git a/src/station.c b/src/station.c
index dfc23e24..18c37f9c 100644
--- a/src/station.c
+++ b/src/station.c
@@ -2786,6 +2786,22 @@  static void station_roam_trigger_cb(struct l_timeout *timeout, void *user_data)
 	if (station_cannot_roam(station))
 		return;
 
+	/*
+	 * This is unlikely, but can still happen when network configuration is
+	 * enabled and DHCP is slow. If IWD connects to a BSS under the low RSSI
+	 * threshold it will immediately get a CQM event once connected (as far
+	 * as the kernel is concerned, NOT IWD's connected state). This starts
+	 * a 5 second timer and if DHCP has not completed by then IWD will
+	 * roam. Since this would likely confused the APs and IWD its best to
+	 * let DHCP complete i.e. wait until we are in a connected state.
+	 */
+	if (station->state != STATION_STATE_CONNECTED) {
+		l_debug("Tried to roam in unconnected state, will try "
+			"again in 5 seconds");
+		station_roam_timeout_rearm(station, 5);
+		return;
+	}
+
 	station_start_roam(station);
 }