Message ID | 20230403141927.235014-1-alvin@pqrs.dk (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | station: handle ROAMING state in disconnect event (again) | expand |
Hi Alvin, On 4/3/23 07:19, Alvin Šipraga wrote: > From: Alvin Šipraga <alsi@bang-olufsen.dk> > > The blamed commit clobbered the non-FT (Reassociation) roaming case > in the disconnect event handler, causing iwd to potentially get stuck in > the ROAMING state after a failed reassociation during non-FT roaming. > Add it back. > > The subject is inspired by a previous commit which did the same thing, > but introduced before the new roaming states were added, namely commit > 8758cc89487b ("station: handle ROAMING state in disconnect event"). The > rationale here is the same. > > Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)") IIRC leaving out the old ROAMING state in this switch was actually done intentionally. This is because netdev_reassociate() takes a callback and any failures with the roam *should* show up in netdev_connect_event(), and in turn call the callback which then leads to the same station_disassociated() call. We wanted to avoid the potential of a disconnect event AND the connect callback both calling station_disassocated(). Do you happen to have an iwmon pcap/log of this behavior? If its a kernel bug and a CMD_CONNECT connect event never comes then we probably do need this patch, but maybe instead with an L_WARN_ON or a big debug print because it really shouldn't be happening. Thanks, James > --- > src/station.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/src/station.c b/src/station.c > index d5650a2995a9..69f9be79880d 100644 > --- a/src/station.c > +++ b/src/station.c > @@ -3201,6 +3201,7 @@ static void station_disconnect_event(struct station *station, void *event_data) > event_data, station); > return; > case STATION_STATE_CONNECTED: > + case STATION_STATE_ROAMING: > case STATION_STATE_FT_ROAMING: > case STATION_STATE_FW_ROAMING: > station_disassociated(station);
Hi James, On Mon, Apr 03, 2023 at 03:35:28PM -0700, James Prestwood wrote: > Hi Alvin, > > On 4/3/23 07:19, Alvin Šipraga wrote: > > From: Alvin Šipraga <alsi@bang-olufsen.dk> > > > > The blamed commit clobbered the non-FT (Reassociation) roaming case > > in the disconnect event handler, causing iwd to potentially get stuck in > > the ROAMING state after a failed reassociation during non-FT roaming. > > Add it back. > > > > The subject is inspired by a previous commit which did the same thing, > > but introduced before the new roaming states were added, namely commit > > 8758cc89487b ("station: handle ROAMING state in disconnect event"). The > > rationale here is the same. > > > > Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)") > > IIRC leaving out the old ROAMING state in this switch was actually done > intentionally. This is because netdev_reassociate() takes a callback and any > failures with the roam *should* show up in netdev_connect_event(), and in > turn call the callback which then leads to the same station_disassociated() > call. We wanted to avoid the potential of a disconnect event AND the connect > callback both calling station_disassocated(). Yes, that makes sense. > > Do you happen to have an iwmon pcap/log of this behavior? If its a kernel > bug and a CMD_CONNECT connect event never comes then we probably do need > this patch, but maybe instead with an L_WARN_ON or a big debug print because > it really shouldn't be happening. Yes, this will be our next direction of inquiry. Easter is coming up and the issue is reproducible by colleagues who are on holiday. But I will follow up. Thanks for the quick reply. Kind regards, Alvin > > Thanks, > > James > > > --- > > src/station.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/src/station.c b/src/station.c > > index d5650a2995a9..69f9be79880d 100644 > > --- a/src/station.c > > +++ b/src/station.c > > @@ -3201,6 +3201,7 @@ static void station_disconnect_event(struct station *station, void *event_data) > > event_data, station); > > return; > > case STATION_STATE_CONNECTED: > > + case STATION_STATE_ROAMING: > > case STATION_STATE_FT_ROAMING: > > case STATION_STATE_FW_ROAMING: > > station_disassociated(station);
Hi again, On Wed, Apr 05, 2023 at 09:12:20AM +0000, Alvin Šipraga wrote: > Hi James, > > On Mon, Apr 03, 2023 at 03:35:28PM -0700, James Prestwood wrote: > > Hi Alvin, > > > > On 4/3/23 07:19, Alvin Šipraga wrote: > > > From: Alvin Šipraga <alsi@bang-olufsen.dk> > > > > > > The blamed commit clobbered the non-FT (Reassociation) roaming case > > > in the disconnect event handler, causing iwd to potentially get stuck in > > > the ROAMING state after a failed reassociation during non-FT roaming. > > > Add it back. > > > > > > The subject is inspired by a previous commit which did the same thing, > > > but introduced before the new roaming states were added, namely commit > > > 8758cc89487b ("station: handle ROAMING state in disconnect event"). The > > > rationale here is the same. > > > > > > Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)") > > > > IIRC leaving out the old ROAMING state in this switch was actually done > > intentionally. This is because netdev_reassociate() takes a callback and any > > failures with the roam *should* show up in netdev_connect_event(), and in > > turn call the callback which then leads to the same station_disassociated() > > call. We wanted to avoid the potential of a disconnect event AND the connect > > callback both calling station_disassocated(). > > Yes, that makes sense. > > > > > Do you happen to have an iwmon pcap/log of this behavior? If its a kernel > > bug and a CMD_CONNECT connect event never comes then we probably do need > > this patch, but maybe instead with an L_WARN_ON or a big debug print because > > it really shouldn't be happening. > > Yes, this will be our next direction of inquiry. Easter is coming up and the > issue is reproducible by colleagues who are on holiday. But I will follow up. Just a short update on this. Currently we root-caused the failure to a bug where the 2nd packet of the 4-Way Handshake gets silently dropped by the firmware we are using and the vendor is looking into it. Until this is fixed we can't point any fingers at IWD. I'll follow up once that is fixed. Thanks! Kind regards, Alvin > > Thanks for the quick reply. > > Kind regards, > Alvin > > > > > Thanks, > > > > James > > > > > --- > > > src/station.c | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/src/station.c b/src/station.c > > > index d5650a2995a9..69f9be79880d 100644 > > > --- a/src/station.c > > > +++ b/src/station.c > > > @@ -3201,6 +3201,7 @@ static void station_disconnect_event(struct station *station, void *event_data) > > > event_data, station); > > > return; > > > case STATION_STATE_CONNECTED: > > > + case STATION_STATE_ROAMING: > > > case STATION_STATE_FT_ROAMING: > > > case STATION_STATE_FW_ROAMING: > > > station_disassociated(station);
Hi Alvin, On 4/30/23 3:22 AM, Alvin Šipraga wrote: > Hi again, > > On Wed, Apr 05, 2023 at 09:12:20AM +0000, Alvin Šipraga wrote: >> Hi James, >> >> On Mon, Apr 03, 2023 at 03:35:28PM -0700, James Prestwood wrote: >>> Hi Alvin, >>> >>> On 4/3/23 07:19, Alvin Šipraga wrote: >>>> From: Alvin Šipraga <alsi@bang-olufsen.dk> >>>> >>>> The blamed commit clobbered the non-FT (Reassociation) roaming case >>>> in the disconnect event handler, causing iwd to potentially get stuck in >>>> the ROAMING state after a failed reassociation during non-FT roaming. >>>> Add it back. >>>> >>>> The subject is inspired by a previous commit which did the same thing, >>>> but introduced before the new roaming states were added, namely commit >>>> 8758cc89487b ("station: handle ROAMING state in disconnect event"). The >>>> rationale here is the same. >>>> >>>> Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)") >>> >>> IIRC leaving out the old ROAMING state in this switch was actually done >>> intentionally. This is because netdev_reassociate() takes a callback and any >>> failures with the roam *should* show up in netdev_connect_event(), and in >>> turn call the callback which then leads to the same station_disassociated() >>> call. We wanted to avoid the potential of a disconnect event AND the connect >>> callback both calling station_disassocated(). >> >> Yes, that makes sense. >> >>> >>> Do you happen to have an iwmon pcap/log of this behavior? If its a kernel >>> bug and a CMD_CONNECT connect event never comes then we probably do need >>> this patch, but maybe instead with an L_WARN_ON or a big debug print because >>> it really shouldn't be happening. >> >> Yes, this will be our next direction of inquiry. Easter is coming up and the >> issue is reproducible by colleagues who are on holiday. But I will follow up. > > Just a short update on this. Currently we root-caused the failure to a bug where > the 2nd packet of the 4-Way Handshake gets silently dropped by the firmware we > are using and the vendor is looking into it. Until this is fixed we can't point > any fingers at IWD. I'll follow up once that is fixed. Thanks! So was IWD not handling this? We keep a timer for eapol, so if packet 2/4 was lost IWD should recover regardless. Even if its a FW bug IWD shouldn't get stuck. > > Kind regards, > Alvin > >> >> Thanks for the quick reply. >> >> Kind regards, >> Alvin >> >>> >>> Thanks, >>> >>> James >>> >>>> --- >>>> src/station.c | 1 + >>>> 1 file changed, 1 insertion(+) >>>> >>>> diff --git a/src/station.c b/src/station.c >>>> index d5650a2995a9..69f9be79880d 100644 >>>> --- a/src/station.c >>>> +++ b/src/station.c >>>> @@ -3201,6 +3201,7 @@ static void station_disconnect_event(struct station *station, void *event_data) >>>> event_data, station); >>>> return; >>>> case STATION_STATE_CONNECTED: >>>> + case STATION_STATE_ROAMING: >>>> case STATION_STATE_FT_ROAMING: >>>> case STATION_STATE_FW_ROAMING: >>>> station_disassociated(station);
Hello again James, I'm resurrecting this old thread because we are hitting another roaming related issue and I am not sure whether the problem is in iwd or in the wireless driver/fw. I hope you can help me with some log analysis. On Mon, Apr 03, 2023 at 03:35:28PM -0700, James Prestwood wrote: > Hi Alvin, > > On 4/3/23 07:19, Alvin Šipraga wrote: > > From: Alvin Šipraga <alsi@bang-olufsen.dk> > > > > The blamed commit clobbered the non-FT (Reassociation) roaming case > > in the disconnect event handler, causing iwd to potentially get stuck in > > the ROAMING state after a failed reassociation during non-FT roaming. > > Add it back. > > > > The subject is inspired by a previous commit which did the same thing, > > but introduced before the new roaming states were added, namely commit > > 8758cc89487b ("station: handle ROAMING state in disconnect event"). The > > rationale here is the same. > > > > Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)") The last update I gave was on a silently dropped frame in the firmware when doing the 4-Way Handshake. This was fixed inside the vendor's firmware so there is nothing further to discuss there. See below for the current issue we are facing. > > IIRC leaving out the old ROAMING state in this switch was actually done > intentionally. This is because netdev_reassociate() takes a callback and any > failures with the roam *should* show up in netdev_connect_event(), and in > turn call the callback which then leads to the same station_disassociated() > call. We wanted to avoid the potential of a disconnect event AND the connect > callback both calling station_disassocated(). OK, so this makes sense, but I think we are hitting an edge case in our test setup which slips through this logic. Allow me to explain. We have an automated test setup with two BSSes in the same ESS and programmable attenuators used to simulate a low-RSSI event on the station. The sequence of events below begins with the station connected to the first BSS (BSS0), and it is expected that iwd will then direct the driver to associate to the second BSS (BSS1). This is without FT roaming. 1. iwd receives a CQM event and enters roaming state. 2. iwd issues a CMD_CONNECT. The driver defines .auth and .assoc ops, not .connect, so it takes the SME path in the kernel. 3. Auth/Assoc are successful and BSS1 kicks off the 4-Way Handshake. 4. After entering eapol_handle_ptk_3_of_4() iwd asynchronously installs the keys with CMD_NEW_KEY and then deletes the EAPOL timeout. 5. The group key gets installed successfully (CMD_NEW_KEY returns success). 6. In the mean time however the station receives a Deauth or Deassoc (not clear from our driver logs) and the driver sends this to iwd. In netdev_disconnect_event() iwd then follows this path: -> netdev_connect_free() -> handshake_state_free() -> netdev_handshake_state_free() -> netdev_handshake_state_cancel_all() -> l_genl_family_cancel(nl80211, nhs->pairwise_new_key_cmd_id); Then, re: the original patch in this thread, iwd's station code ignores this event, because the station state is ROAMING. 7. The subsequent CMD_NEW_KEY then fails in the driver, but the above genl callback is never invoked because it was cancelled. In this way neither the station nor the handshake code then report the error, so we get stuck in a roaming state. > > Do you happen to have an iwmon pcap/log of this behavior? If its a kernel > bug and a CMD_CONNECT connect event never comes then we probably do need > this patch, but maybe instead with an L_WARN_ON or a big debug print because > it really shouldn't be happening. Yes, I have an iwmon pcap and also a journal log with iwd and kernel/driver debug messages. Hope it's OK with you that I send it in a private mail. I've put some engineers from NXP on cc as well, as they are helping us with this case on the driver/FW side. The driver in question is the out-of-tree Marvell/moal/mlan driver. I have not yet tested the patch in this thread for this particular issue. First I would like to hear your thoughts/feedback. Thanks! Kind regards, Alvin
Hi Alvin, On 11/13/23 4:46 AM, Alvin Šipraga wrote: > Hello again James, > > I'm resurrecting this old thread because we are hitting another roaming related > issue and I am not sure whether the problem is in iwd or in the wireless > driver/fw. I hope you can help me with some log analysis. > > On Mon, Apr 03, 2023 at 03:35:28PM -0700, James Prestwood wrote: >> Hi Alvin, >> >> On 4/3/23 07:19, Alvin Šipraga wrote: >>> From: Alvin Šipraga <alsi@bang-olufsen.dk> >>> >>> The blamed commit clobbered the non-FT (Reassociation) roaming case >>> in the disconnect event handler, causing iwd to potentially get stuck in >>> the ROAMING state after a failed reassociation during non-FT roaming. >>> Add it back. >>> >>> The subject is inspired by a previous commit which did the same thing, >>> but introduced before the new roaming states were added, namely commit >>> 8758cc89487b ("station: handle ROAMING state in disconnect event"). The >>> rationale here is the same. >>> >>> Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)") > > The last update I gave was on a silently dropped frame in the firmware when > doing the 4-Way Handshake. This was fixed inside the vendor's firmware so there > is nothing further to discuss there. See below for the current issue we are facing. > >> >> IIRC leaving out the old ROAMING state in this switch was actually done >> intentionally. This is because netdev_reassociate() takes a callback and any >> failures with the roam *should* show up in netdev_connect_event(), and in >> turn call the callback which then leads to the same station_disassociated() >> call. We wanted to avoid the potential of a disconnect event AND the connect >> callback both calling station_disassocated(). > > OK, so this makes sense, but I think we are hitting an edge case in our test > setup which slips through this logic. Allow me to explain. > > We have an automated test setup with two BSSes in the same ESS and programmable > attenuators used to simulate a low-RSSI event on the station. The sequence of > events below begins with the station connected to the first BSS (BSS0), and it > is expected that iwd will then direct the driver to associate to the second BSS > (BSS1). This is without FT roaming. > > 1. iwd receives a CQM event and enters roaming state. > > 2. iwd issues a CMD_CONNECT. The driver defines .auth and .assoc ops, not > .connect, so it takes the SME path in the kernel. > > 3. Auth/Assoc are successful and BSS1 kicks off the 4-Way Handshake. > > 4. After entering eapol_handle_ptk_3_of_4() iwd asynchronously installs the keys > with CMD_NEW_KEY and then deletes the EAPOL timeout. > > 5. The group key gets installed successfully (CMD_NEW_KEY returns success). > > 6. In the mean time however the station receives a Deauth or Deassoc (not clear > from our driver logs) and the driver sends this to iwd. In > netdev_disconnect_event() iwd then follows this path: > -> netdev_connect_free() > -> handshake_state_free() > -> netdev_handshake_state_free() > -> netdev_handshake_state_cancel_all() > -> l_genl_family_cancel(nl80211, nhs->pairwise_new_key_cmd_id); > > Then, re: the original patch in this thread, iwd's station code ignores this > event, because the station state is ROAMING. > > 7. The subsequent CMD_NEW_KEY then fails in the driver, but the above genl > callback is never invoked because it was cancelled. Ok this does make sense to me. A disconnect from the AP could come in at any point during reassociation (during the ROAMING state). I think we assume the kernel handles this during the actual association step, i.e. relays this in the connect event, but during eapol/key setting I think your original patch holds some merit. Just so we have public record I've sanitized the logs you sent (removed dates, hostnames, driver prints), and its clear that a disconnect at this moment would trigger this "Unexpected disconnect" case: iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start() iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on ifindex 8 iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8 iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8 iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1 iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0 iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0 iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_new_group_key_cb() ifindex: 8, err: 0 iwd[3589]: ../iwd-2.8/src/netdev.c:try_handshake_complete() ptk_installed: 0, gtk_installed: 1, igtk_installed: 1 iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Disassociate(40) iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48) iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_disconnect_event() iwd[3589]: Received Deauthentication event, reason: 1, from_ap: true iwd[3589]: ../iwd-2.8/src/wiphy.c:wiphy_radio_work_done() Work item 15 done iwd[3589]: ../iwd-2.8/src/station.c:station_disconnect_event() 8 iwd[3589]: Unexpected disconnect event So I'm thinking we should merge the original patch adding ROAMING to the switch statement. Maybe Denis has some additional insight Thanks, James
Hi Alvin, > > 1. iwd receives a CQM event and enters roaming state. > > 2. iwd issues a CMD_CONNECT. The driver defines .auth and .assoc ops, not > .connect, so it takes the SME path in the kernel. > > 3. Auth/Assoc are successful and BSS1 kicks off the 4-Way Handshake. > > 4. After entering eapol_handle_ptk_3_of_4() iwd asynchronously installs the keys > with CMD_NEW_KEY and then deletes the EAPOL timeout. > > 5. The group key gets installed successfully (CMD_NEW_KEY returns success). > > 6. In the mean time however the station receives a Deauth or Deassoc (not clear > from our driver logs) and the driver sends this to iwd. In Do we know why? And the Deauth/Deassoc is from BSS1, not BSS0, correct? > netdev_disconnect_event() iwd then follows this path: > -> netdev_connect_free() > -> handshake_state_free() > -> netdev_handshake_state_free() > -> netdev_handshake_state_cancel_all() > -> l_genl_family_cancel(nl80211, nhs->pairwise_new_key_cmd_id); > > Then, re: the original patch in this thread, iwd's station code ignores this > event, because the station state is ROAMING. > > 7. The subsequent CMD_NEW_KEY then fails in the driver, but the above genl > callback is never invoked because it was cancelled. There should be three NEW_KEYs generated: set_gtk set_igtk (if you have MFP enabled) set_tk Normally in that order. Which CMD_NEW_KEY fails? > > In this way neither the station nor the handshake code then report the error, so > we get stuck in a roaming state. > Ok, I can see that happening. Regards, -Denis
Hi James, On Mon, Nov 13, 2023 at 05:10:20AM -0800, James Prestwood wrote: > Hi Alvin, > > On 11/13/23 4:46 AM, Alvin Šipraga wrote: > > Hello again James, > > > > I'm resurrecting this old thread because we are hitting another roaming related > > issue and I am not sure whether the problem is in iwd or in the wireless > > driver/fw. I hope you can help me with some log analysis. > > > > On Mon, Apr 03, 2023 at 03:35:28PM -0700, James Prestwood wrote: > > > Hi Alvin, > > > > > > On 4/3/23 07:19, Alvin Šipraga wrote: > > > > From: Alvin Šipraga <alsi@bang-olufsen.dk> > > > > > > > > The blamed commit clobbered the non-FT (Reassociation) roaming case > > > > in the disconnect event handler, causing iwd to potentially get stuck in > > > > the ROAMING state after a failed reassociation during non-FT roaming. > > > > Add it back. > > > > > > > > The subject is inspired by a previous commit which did the same thing, > > > > but introduced before the new roaming states were added, namely commit > > > > 8758cc89487b ("station: handle ROAMING state in disconnect event"). The > > > > rationale here is the same. > > > > > > > > Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)") > > > > The last update I gave was on a silently dropped frame in the firmware when > > doing the 4-Way Handshake. This was fixed inside the vendor's firmware so there > > is nothing further to discuss there. See below for the current issue we are facing. > > > > > > > > IIRC leaving out the old ROAMING state in this switch was actually done > > > intentionally. This is because netdev_reassociate() takes a callback and any > > > failures with the roam *should* show up in netdev_connect_event(), and in > > > turn call the callback which then leads to the same station_disassociated() > > > call. We wanted to avoid the potential of a disconnect event AND the connect > > > callback both calling station_disassocated(). > > > > OK, so this makes sense, but I think we are hitting an edge case in our test > > setup which slips through this logic. Allow me to explain. > > > > We have an automated test setup with two BSSes in the same ESS and programmable > > attenuators used to simulate a low-RSSI event on the station. The sequence of > > events below begins with the station connected to the first BSS (BSS0), and it > > is expected that iwd will then direct the driver to associate to the second BSS > > (BSS1). This is without FT roaming. > > > > 1. iwd receives a CQM event and enters roaming state. > > > > 2. iwd issues a CMD_CONNECT. The driver defines .auth and .assoc ops, not > > .connect, so it takes the SME path in the kernel. > > > > 3. Auth/Assoc are successful and BSS1 kicks off the 4-Way Handshake. > > > > 4. After entering eapol_handle_ptk_3_of_4() iwd asynchronously installs the keys > > with CMD_NEW_KEY and then deletes the EAPOL timeout. > > > > 5. The group key gets installed successfully (CMD_NEW_KEY returns success). > > > > 6. In the mean time however the station receives a Deauth or Deassoc (not clear > > from our driver logs) and the driver sends this to iwd. In > > netdev_disconnect_event() iwd then follows this path: > > -> netdev_connect_free() > > -> handshake_state_free() > > -> netdev_handshake_state_free() > > -> netdev_handshake_state_cancel_all() > > -> l_genl_family_cancel(nl80211, nhs->pairwise_new_key_cmd_id); > > > > Then, re: the original patch in this thread, iwd's station code ignores this > > event, because the station state is ROAMING. > > > > 7. The subsequent CMD_NEW_KEY then fails in the driver, but the above genl > > callback is never invoked because it was cancelled. > > Ok this does make sense to me. A disconnect from the AP could come in at any > point during reassociation (during the ROAMING state). I think we assume the > kernel handles this during the actual association step, i.e. relays this in > the connect event, but during eapol/key setting I think your original patch > holds some merit. > > Just so we have public record I've sanitized the logs you sent (removed > dates, hostnames, driver prints), and its clear that a disconnect at this > moment would trigger this "Unexpected disconnect" case: > > iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start() > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on ifindex > 8 > iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8 > iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8 > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1 > iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0 > iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0 > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_new_group_key_cb() ifindex: 8, > err: 0 > iwd[3589]: ../iwd-2.8/src/netdev.c:try_handshake_complete() ptk_installed: > 0, gtk_installed: 1, igtk_installed: 1 > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification > Disassociate(40) > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification > Disconnect(48) > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_disconnect_event() > iwd[3589]: Received Deauthentication event, reason: 1, from_ap: true > iwd[3589]: ../iwd-2.8/src/wiphy.c:wiphy_radio_work_done() Work item 15 done > iwd[3589]: ../iwd-2.8/src/station.c:station_disconnect_event() 8 > iwd[3589]: Unexpected disconnect event > > So I'm thinking we should merge the original patch adding ROAMING to the > switch statement. Maybe Denis has some additional insight Great, thank you very much James for the swift response. Let's see what Denis says too. In the mean time we will run the test with the patch applied (+ logging) and see if the problem is solved in this way. For the record the test also triggers other roaming bugs with the same overall symptom (stuck in roaming), but we suspect these bugs are on the driver or FW level. It might complicate validation of the patch though. I will keep you posted. Kind regards, Alvin > > Thanks, > James
Hi James, > > Ok this does make sense to me. A disconnect from the AP could come in at any > point during reassociation (during the ROAMING state). I think we assume the > kernel handles this during the actual association step, i.e. relays this in the > connect event, but during eapol/key setting I think your original patch holds > some merit. > > Just so we have public record I've sanitized the logs you sent (removed dates, > hostnames, driver prints), and its clear that a disconnect at this moment would > trigger this "Unexpected disconnect" case: > > iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start() > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on ifindex 8 > iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8 > iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8 > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1 > iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0 > iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0 Why is set_tk being called twice? Or is this a copy-paste artifact? Regards, -Denis
Hi Denis, On 11/13/23 7:26 AM, Denis Kenzior wrote: > Hi James, > >> >> Ok this does make sense to me. A disconnect from the AP could come in >> at any point during reassociation (during the ROAMING state). I think >> we assume the kernel handles this during the actual association step, >> i.e. relays this in the connect event, but during eapol/key setting I >> think your original patch holds some merit. >> >> Just so we have public record I've sanitized the logs you sent >> (removed dates, hostnames, driver prints), and its clear that a >> disconnect at this moment would trigger this "Unexpected disconnect" >> case: >> >> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start() >> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on >> ifindex 8 >> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8 >> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8 >> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1 >> iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting >> keys >> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0 >> iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting >> keys >> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0 > > Why is set_tk being called twice? Or is this a copy-paste artifact? Hmm, good question. I didn't notice that but its called 2x in the logs Alvin sent me. Alvin, was it a copy-paste error on your side? If not something weird is going on. Thanks, James > > Regards, > -Denis
Hi James, Alvin, On 11/13/23 09:29, James Prestwood wrote: > Hi Denis, > > On 11/13/23 7:26 AM, Denis Kenzior wrote: >> Hi James, >> >>> >>> Ok this does make sense to me. A disconnect from the AP could come in at any >>> point during reassociation (during the ROAMING state). I think we assume the >>> kernel handles this during the actual association step, i.e. relays this in >>> the connect event, but during eapol/key setting I think your original patch >>> holds some merit. >>> >>> Just so we have public record I've sanitized the logs you sent (removed >>> dates, hostnames, driver prints), and its clear that a disconnect at this >>> moment would trigger this "Unexpected disconnect" case: >>> >>> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start() >>> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on ifindex 8 >>> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8 >>> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8 >>> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1 >>> iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys >>> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0 >>> iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys >>> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0 >> >> Why is set_tk being called twice? Or is this a copy-paste artifact? > > Hmm, good question. I didn't notice that but its called 2x in the logs Alvin > sent me. Alvin, was it a copy-paste error on your side? If not something weird > is going on. > We're generating the Setting keys event twice. I don't really understand how this can happen, honestly. I hope this is an artifact of some sort. Anyhow, I think overall the patch approach is incorrect. When we are roaming, we expect the callback provided to netdev_reassociate() to be invoked on any error. So the real fix is in netdev.c somewhere. Regards, -Denis
Hi Denis, On Mon, Nov 13, 2023 at 09:20:14AM -0600, Denis Kenzior wrote: > Hi Alvin, > > > > > 1. iwd receives a CQM event and enters roaming state. > > > > 2. iwd issues a CMD_CONNECT. The driver defines .auth and .assoc ops, not > > .connect, so it takes the SME path in the kernel. > > > > 3. Auth/Assoc are successful and BSS1 kicks off the 4-Way Handshake. > > > > 4. After entering eapol_handle_ptk_3_of_4() iwd asynchronously installs the keys > > with CMD_NEW_KEY and then deletes the EAPOL timeout. > > > > 5. The group key gets installed successfully (CMD_NEW_KEY returns success). > > > > 6. In the mean time however the station receives a Deauth or Deassoc (not clear > > from our driver logs) and the driver sends this to iwd. In > > Do we know why? And the Deauth/Deassoc is from BSS1, not BSS0, correct? No, we don't know why. But it's coming from BSS1, correct, at least if the iwmon pcap is to be believed. The driver debug logs themselves don't print the MAC address and we don't have a sniffer log. Also, now I had a second look and it is indeed a Disassoc frame with unspecified reason code (1). > > > netdev_disconnect_event() iwd then follows this path: > > -> netdev_connect_free() > > -> handshake_state_free() > > -> netdev_handshake_state_free() > > -> netdev_handshake_state_cancel_all() > > -> l_genl_family_cancel(nl80211, nhs->pairwise_new_key_cmd_id); > > > > Then, re: the original patch in this thread, iwd's station code ignores this > > event, because the station state is ROAMING. > > > > 7. The subsequent CMD_NEW_KEY then fails in the driver, but the above genl > > callback is never invoked because it was cancelled. > > There should be three NEW_KEYs generated: > set_gtk > set_igtk (if you have MFP enabled) > set_tk > > Normally in that order. Which CMD_NEW_KEY fails? set_tk is failing. Here you can see the order (no set_igtk called): iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1 iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0 iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_rekey_offload() 8 iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_new_group_key_cb() ifindex: 8, err: 0 iwd[3589]: ../iwd-2.8/src/netdev.c:try_handshake_complete() ptk_installed: 0, gtk_installed: 1, igtk_installed: 1 iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Disassociate(40) iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48) iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_disconnect_event() iwd[3589]: Received Deauthentication event, reason: 1, from_ap: true iwd[3589]: ../iwd-2.8/src/wiphy.c:wiphy_radio_work_done() Work item 15 done iwd[3589]: ../iwd-2.8/src/station.c:station_disconnect_event() 8 iwd[3589]: Unexpected disconnect event Here you can see that set_gtk succeeds. You don't see the callback for set_tk for the reasons I explained in mail above, but in the iwmon pcap I can see that the CMD_NEW_KEY fails with -ENOLINK. Makes sense, since the driver got the Deauth/Disassoc in the mean time. > > > > > In this way neither the station nor the handshake code then report the error, so > > we get stuck in a roaming state. > > > > Ok, I can see that happening. > > Regards, > -Denis
Hi James, On Mon, Nov 13, 2023 at 07:29:14AM -0800, James Prestwood wrote: > Hi Denis, > > On 11/13/23 7:26 AM, Denis Kenzior wrote: > > Hi James, > > > > > > > > Ok this does make sense to me. A disconnect from the AP could come > > > in at any point during reassociation (during the ROAMING state). I > > > think we assume the kernel handles this during the actual > > > association step, i.e. relays this in the connect event, but during > > > eapol/key setting I think your original patch holds some merit. > > > > > > Just so we have public record I've sanitized the logs you sent > > > (removed dates, hostnames, driver prints), and its clear that a > > > disconnect at this moment would trigger this "Unexpected disconnect" > > > case: > > > > > > iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start() > > > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on > > > ifindex 8 > > > iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8 > > > iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8 > > > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1 > > > iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() > > > Setting keys > > Hmm, good question. I didn't notice that but its called 2x in the logs Alvin > sent me. Alvin, was it a copy-paste error on your side? If not something > weird is going on. It's my bad. I sent James a snippet in my email when attaching the logs and he just pasted it back. Something must have gotten screwed up on my end while copy-pasting. Here is the proper snippet one more time: iwd[3589]: ../iwd-2.8/src/station.c:station_enter_state() Old State: connected, new state: roaming iwd[3589]: ../iwd-2.8/src/wiphy.c:wiphy_radio_work_done() Work item 14 done iwd[3589]: ../iwd-2.8/src/wiphy.c:wiphy_radio_work_next() Starting work item 15 iwd[3589]: ../iwd-2.8/src/station.c:station_netdev_event() Associating iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Authenticate(37) iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_authenticate_event() iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Associate(38) iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_associate_event() iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Connect(46) iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_connect_event() iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_connect_event() aborting and ignore_connect_event not set, proceed iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_connect_event() expect_connect_failure not set, proceed iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_connect_event() Request / Response IEs parsed iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_get_oci() iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on ifindex 8 iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_get_oci_cb() Obtained OCI: freq: 2412, width: 1, center1: 2412, center2: 0 iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start() iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on ifindex 8 iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8 iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8 iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1 iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0 iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_rekey_offload() 8 iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_new_group_key_cb() ifindex: 8, err: 0 iwd[3589]: ../iwd-2.8/src/netdev.c:try_handshake_complete() ptk_installed: 0, gtk_installed: 1, igtk_installed: 1 iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Disassociate(40) iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48) iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_disconnect_event() iwd[3589]: Received Deauthentication event, reason: 1, from_ap: true iwd[3589]: ../iwd-2.8/src/wiphy.c:wiphy_radio_work_done() Work item 15 done iwd[3589]: ../iwd-2.8/src/station.c:station_disconnect_event() 8 iwd[3589]: Unexpected disconnect event Sorry about that. Kind regards, Alvin
diff --git a/src/station.c b/src/station.c index d5650a2995a9..69f9be79880d 100644 --- a/src/station.c +++ b/src/station.c @@ -3201,6 +3201,7 @@ static void station_disconnect_event(struct station *station, void *event_data) event_data, station); return; case STATION_STATE_CONNECTED: + case STATION_STATE_ROAMING: case STATION_STATE_FT_ROAMING: case STATION_STATE_FW_ROAMING: station_disassociated(station);
From: Alvin Šipraga <alsi@bang-olufsen.dk> The blamed commit clobbered the non-FT (Reassociation) roaming case in the disconnect event handler, causing iwd to potentially get stuck in the ROAMING state after a failed reassociation during non-FT roaming. Add it back. The subject is inspired by a previous commit which did the same thing, but introduced before the new roaming states were added, namely commit 8758cc89487b ("station: handle ROAMING state in disconnect event"). The rationale here is the same. Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)") --- src/station.c | 1 + 1 file changed, 1 insertion(+)