Message ID | 20230510204508.715558-2-prestwoj@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Fix wiphy regdom logic | expand |
Hi James, On 5/10/23 15:45, James Prestwood wrote: > Certain drivers change the regulatory domain when the client roams. > First XX, then the country the roam target is using. Depending on the > timing the second regdom event can come while the first is still > dumping. This was not handled gracefully/nicely in IWD and it would > just issue another dump (it tries to cancel but if the dump started > there isn't anything that can be done). AFAIK the kernel should be > able to handle this but based on the behavior we saw it seems to not > be able to, or at least not reliably. Okay. Might be useful to know what the kernel actually expects to happen in case we send on a socket during an ongoing command. > > When the second dump comes in and GET_WIPHY is issued the kernel > stops replying to any messages which obviously causes IWD to completely > hang up waiting for various acks (set_station, set_key, etc.). > Sounds like genl isn't doing the right thing... > Now if another dump is ongoing (and cant be safely canceled) a flag > will be set when another dump is needed. The first dump will be > waited for and only once completed will another start. This also > ensures there is at most 1 pending dump when before any number of > dumps could be queued which is pointless since only the last dump > request matters. NAK. Please stop right there. genl is already a queue. There's no sense adding queuing on top of it in this particular case. If l_genl_family_cancel isn't doing the right thing, then fix that instead. > --- > src/wiphy.c | 103 +++++++++++++++++++++++++++------------------------- > 1 file changed, 54 insertions(+), 49 deletions(-) > Regards, -Denis
Hi Denis, On 5/16/23 5:53 PM, Denis Kenzior wrote: > Hi James, > > On 5/10/23 15:45, James Prestwood wrote: >> Certain drivers change the regulatory domain when the client roams. >> First XX, then the country the roam target is using. Depending on the >> timing the second regdom event can come while the first is still >> dumping. This was not handled gracefully/nicely in IWD and it would >> just issue another dump (it tries to cancel but if the dump started >> there isn't anything that can be done). AFAIK the kernel should be >> able to handle this but based on the behavior we saw it seems to not >> be able to, or at least not reliably. > > Okay. Might be useful to know what the kernel actually expects to > happen in case we send on a socket during an ongoing command. I can try and get this to happen locally to investigate further, but so far its been a rare case. Maybe happening 1 time in a 24 hour period on 20-30 clients. I'm thinking maybe a locking/race problem in nl80211_dump_wiphy(), but its just a guess. > >> >> When the second dump comes in and GET_WIPHY is issued the kernel >> stops replying to any messages which obviously causes IWD to completely >> hang up waiting for various acks (set_station, set_key, etc.). >> > > Sounds like genl isn't doing the right thing... > >> Now if another dump is ongoing (and cant be safely canceled) a flag >> will be set when another dump is needed. The first dump will be >> waited for and only once completed will another start. This also >> ensures there is at most 1 pending dump when before any number of >> dumps could be queued which is pointless since only the last dump >> request matters. > > NAK. Please stop right there. genl is already a queue. There's no > sense adding queuing on top of it in this particular case. If > l_genl_family_cancel isn't doing the right thing, then fix that instead. Well l_genl_family_cancel really cant do anything about it. Once the dump hits the kernel you cant cancel it (this is why we added l_genl_family_request_sent). So the only alternative I can think of, if we cant do it in IWD, would be to make l_genl queue dumps and not issue more dumps until the previous one finishes. This is obviously a significant behavioral change. > >> --- >> src/wiphy.c | 103 +++++++++++++++++++++++++++------------------------- >> 1 file changed, 54 insertions(+), 49 deletions(-) >> > > Regards, > -Denis
Hi James, >> NAK. Please stop right there. genl is already a queue. There's no sense >> adding queuing on top of it in this particular case. If l_genl_family_cancel >> isn't doing the right thing, then fix that instead. > > Well l_genl_family_cancel really cant do anything about it. Once the dump hits > the kernel you cant cancel it (this is why we added l_genl_family_request_sent). This was done for a different reason. In the case of CMD_REMAIN_ON_CHANNEL we must wait for the ack to obtain the cookie. We can cancel the ROC only if we know the cookie. Otherwise, even if we call l_genl_family_cancel, the ROC would still go ahead. What you have here is a different and simpler case. > So the only alternative I can think of, if we cant do it in IWD, would be to > make l_genl queue dumps and not issue more dumps until the previous one > finishes. This is obviously a significant behavioral change. I'm lost? That is what is supposed to happen and happens already today. If you issue two dumps like: l_genl_family_dump(..., msg1, ...); l_genl_family_dump(..., msg2, ...); Then msg2 will not be sent to the kernel until NLMSG_DONE flagged message is received. So what exactly would be the behavioral change you are referring to? Anyway, looking at l_genl_family_cancel, it is doing the wrong thing, so that should be fixed. Regards, -Denis
Hi Denis, On 5/17/23 8:21 PM, Denis Kenzior wrote: > Hi James, > >>> NAK. Please stop right there. genl is already a queue. There's no >>> sense adding queuing on top of it in this particular case. If >>> l_genl_family_cancel isn't doing the right thing, then fix that instead. >> >> Well l_genl_family_cancel really cant do anything about it. Once the >> dump hits the kernel you cant cancel it (this is why we added >> l_genl_family_request_sent). > > This was done for a different reason. In the case of > CMD_REMAIN_ON_CHANNEL we must wait for the ack to obtain the cookie. We > can cancel the ROC only if we know the cookie. Otherwise, even if we > call l_genl_family_cancel, the ROC would still go ahead. > > What you have here is a different and simpler case. > >> So the only alternative I can think of, if we cant do it in IWD, would >> be to make l_genl queue dumps and not issue more dumps until the >> previous one finishes. This is obviously a significant behavioral change. > > I'm lost? That is what is supposed to happen and happens already > today. If you issue two dumps like: > > l_genl_family_dump(..., msg1, ...); > l_genl_family_dump(..., msg2, ...); > > Then msg2 will not be sent to the kernel until NLMSG_DONE flagged > message is received. So what exactly would be the behavioral change you > are referring to? Ok this was not my understanding of how l_genl messaging worked. I thought you could have multiple in-flight messages. But looking at it you can't, the IO writer only sends a single message. So I'll fix l_genl_family_cancel to not remove the message from the pending queue until we get NLMSG_DONE. > > Anyway, looking at l_genl_family_cancel, it is doing the wrong thing, so > that should be fixed. > > Regards, > -Denis
diff --git a/src/wiphy.c b/src/wiphy.c index 2db2d2cd..dbffa44e 100644 --- a/src/wiphy.c +++ b/src/wiphy.c @@ -67,6 +67,7 @@ static int mac_randomize_bytes = 6; static char regdom_country[2]; static uint32_t work_ids; static unsigned int wiphy_dump_id; +static bool dump_queued; enum driver_flag { DEFAULT_IF = 0x1, @@ -123,6 +124,7 @@ struct wiphy { bool work_in_callback; unsigned int get_reg_id; unsigned int dump_id; + bool dump_queued : 1; bool support_scheduled_scan:1; bool support_rekey_offload:1; @@ -2066,18 +2068,23 @@ static void wiphy_setup_rm_enabled_capabilities(struct wiphy *wiphy) */ } +static void wiphy_dump_after_regdom(struct wiphy *wiphy); + static void wiphy_dump_done(void *user_data) { struct wiphy *wiphy = user_data; const struct l_queue_entry *e; - /* This dump was canceled due to another dump */ - if ((wiphy && !wiphy->dump_id) || (!wiphy && !wiphy_dump_id)) - return; - if (wiphy) { wiphy->dump_id = 0; + if (wiphy->dump_queued) { + wiphy->dump_queued = false; + wiphy_dump_after_regdom(wiphy); + return; + } + + /* No other dumps queued, emit done */ WATCHLIST_NOTIFY(&wiphy->state_watches, wiphy_state_watch_func_t, wiphy, WIPHY_STATE_WATCH_EVENT_REGDOM_DONE); @@ -2087,6 +2094,12 @@ static void wiphy_dump_done(void *user_data) wiphy_dump_id = 0; + if (dump_queued) { + dump_queued = false; + wiphy_dump_after_regdom(NULL); + return; + } + for (e = l_queue_get_entries(wiphy_list); e; e = e->next) { wiphy = e->data; @@ -2160,38 +2173,48 @@ static void wiphy_dump_callback(struct l_genl_msg *msg, } } -static bool wiphy_cancel_last_dump(struct wiphy *wiphy) +static void wiphy_dump_after_regdom(struct wiphy *wiphy) { - unsigned int id = 0; + const struct l_queue_entry *e; + struct l_genl_msg *msg; + unsigned int *id = NULL; - /* - * Zero command ID to signal that wiphy_dump_done doesn't need to do - * anything. - */ - if (wiphy && wiphy->dump_id) { - id = wiphy->dump_id; - wiphy->dump_id = 0; - } else if (!wiphy && wiphy_dump_id) { - id = wiphy_dump_id; - wiphy_dump_id = 0; - } + if (wiphy) + id = &wiphy->dump_id; + else + id = &wiphy_dump_id; - if (id) { - l_debug("Canceling pending regdom wiphy dump (%s)", + if (*id) { + /* + * If the dump hasn't hit the kernel we can safely cancel it. + * Otherwise set the dump_queued flag and once the current dump + * has completed, another will start. This also ensures there is + * only at most 1 pending dump. + */ + if (!l_genl_family_request_sent(nl80211, *id)) { + l_debug("Canceling pending regdom wiphy dump (%s)", wiphy ? wiphy->name : "global"); - l_genl_family_cancel(nl80211, id); + l_genl_family_cancel(nl80211, *id); + + /* + * If there is already dump queued (dump_queued=true) + * the destroy callback will end up starting another + * dump. If this happens nothing more needs to be done. + */ + if (*id) + return; + } else if (wiphy) { + wiphy->dump_queued = true; + return; + } else { + dump_queued = true; + return; + } } - return id != 0; -} - -static void wiphy_dump_after_regdom(struct wiphy *wiphy) -{ - const struct l_queue_entry *e; - struct l_genl_msg *msg; - unsigned int id; - bool no_start_event; + l_debug("Dumping wiphy after regdom change (%s)", + wiphy ? wiphy->name : "global"); msg = l_genl_msg_new_sized(NL80211_CMD_GET_WIPHY, 128); @@ -2199,37 +2222,22 @@ static void wiphy_dump_after_regdom(struct wiphy *wiphy) l_genl_msg_append_attr(msg, NL80211_ATTR_WIPHY, 4, &wiphy->id); l_genl_msg_append_attr(msg, NL80211_ATTR_SPLIT_WIPHY_DUMP, 0, NULL); - id = l_genl_family_dump(nl80211, msg, wiphy_dump_callback, + *id = l_genl_family_dump(nl80211, msg, wiphy_dump_callback, wiphy, wiphy_dump_done); - if (!id) { + if (!*id) { l_error("Wiphy information dump failed"); l_genl_msg_unref(msg); return; } - /* - * Another update while dumping wiphy. This next dump should supercede - * the first and not result in a DONE event until this new dump is - * finished. This is because the disabled frequencies are in an unknown - * state and could cause incorrect behavior by any watchers. - */ - no_start_event = wiphy_cancel_last_dump(wiphy); - /* Limited dump so just emit the event for this wiphy */ if (wiphy) { - wiphy->dump_id = id; - - if (no_start_event) - return; - WATCHLIST_NOTIFY(&wiphy->state_watches, wiphy_state_watch_func_t, wiphy, WIPHY_STATE_WATCH_EVENT_REGDOM_STARTED); return; } - wiphy_dump_id = id; - /* Otherwise for a global regdom change notify for all wiphy's */ for (e = l_queue_get_entries(wiphy_list); e; e = e->next) { struct wiphy *w = e->data; @@ -2237,9 +2245,6 @@ static void wiphy_dump_after_regdom(struct wiphy *wiphy) if (w->self_managed) continue; - if (no_start_event) - continue; - WATCHLIST_NOTIFY(&w->state_watches, wiphy_state_watch_func_t, w, WIPHY_STATE_WATCH_EVENT_REGDOM_STARTED); }