Message ID | 20240111133331.1423853-1-prestwoj@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] dhcp: fix overflow causing retries to stop | expand |
Context | Check | Description |
---|---|---|
tedd_an/pre-ci_am | success | Success |
prestwoj/iwd-ci-makedistcheck | success | Make Distcheck |
prestwoj/iwd-ci-build | success | Build - Configure |
prestwoj/iwd-ci-makecheckvalgrind | success | Make Check w/Valgrind |
prestwoj/iwd-ci-makecheck | success | Make Check |
prestwoj/iwd-ci-clang | success | clang PASS |
prestwoj/iwd-ci-testrunner | success | test-runner PASS |
Hi James, On 1/11/24 07:33, James Prestwood wrote: > If DHCP is in a SELECTING/REQUESTING state and the number of attempts > reached a value where 2 << attempts overflowed an unsigned int the > next timeout would become zero, causing DHCP to never retry without > any event notification. > > Roughly 2 minutes has passed when we reach >5 attempts so it makes > little sense to retry indefinitely, at least without notifying the > upper layers (which could decided to retry themselves). Okay, but '5' is just a magic number in the code with no context. Lets avoid that. Can we make this limit configurable? Say between 3..30? For iwd, even 2 minutes might be too long. Regards, -Denis
Hi Denis, On 1/12/24 9:11 AM, Denis Kenzior wrote: > Hi James, > > On 1/11/24 07:33, James Prestwood wrote: >> If DHCP is in a SELECTING/REQUESTING state and the number of attempts >> reached a value where 2 << attempts overflowed an unsigned int the >> next timeout would become zero, causing DHCP to never retry without >> any event notification. >> >> Roughly 2 minutes has passed when we reach >5 attempts so it makes >> little sense to retry indefinitely, at least without notifying the >> upper layers (which could decided to retry themselves). > > Okay, but '5' is just a magic number in the code with no context. Lets > avoid that. > > Can we make this limit configurable? Say between 3..30? For iwd, > even 2 minutes might be too long. I'm thinking it may be better to leave the default unlimited retries but add a configurable time limit versus retry limit. Setting based on retries isn't intuitive a) because 99% of people have no idea what the backoff algorithm is, and b) time limits are fuzzed so its not an exact amount of time. I say this because if we want to include netconfig into IWD's Connect() DBus method return we kinda need a way to define a time limit as opposed to guessing how long 2-3 retries will actually take. We, potentially, could iterate through several BSS's and fail extending the time to connect, then start DHCP. We would need a way to limit netconfig to within the DBus method timeout (25 seconds IIRC), including how long all the connect attempts took prior. Maybe I'm overthinking it but theoretically we could hit the method timeout limit so we may want to add handling for it, which would be easier if we did the above with a timeout versus retries. Thanks, James > > Regards, > -Denis
Hi James, On 1/14/24 10:12, James Prestwood wrote: > Hi Denis, > > On 1/12/24 9:11 AM, Denis Kenzior wrote: >> Hi James, >> >> On 1/11/24 07:33, James Prestwood wrote: >>> If DHCP is in a SELECTING/REQUESTING state and the number of attempts >>> reached a value where 2 << attempts overflowed an unsigned int the >>> next timeout would become zero, causing DHCP to never retry without >>> any event notification. >>> >>> Roughly 2 minutes has passed when we reach >5 attempts so it makes >>> little sense to retry indefinitely, at least without notifying the >>> upper layers (which could decided to retry themselves). >> >> Okay, but '5' is just a magic number in the code with no context. Lets avoid >> that. >> >> Can we make this limit configurable? Say between 3..30? For iwd, even 2 >> minutes might be too long. > > I'm thinking it may be better to leave the default unlimited retries but add a > configurable time limit versus retry limit. Setting based on retries isn't > intuitive a) because 99% of people have no idea what the backoff algorithm is, > and b) time limits are fuzzed so its not an exact amount of time. Hopefully we know enough to tune the retries :) > > I say this because if we want to include netconfig into IWD's Connect() DBus > method return we kinda need a way to define a time limit as opposed to guessing > how long 2-3 retries will actually take. We, potentially, could iterate through > several BSS's and fail extending the time to connect, then start DHCP. We would > need a way to limit netconfig to within the DBus method timeout (25 seconds DBus method timeouts don't exist on the protocol level. ell doesn't even let you specify a method timeout for that reason. Some libraries use a locally generated timeout by default (such as libdbus-1), but such libraries usually provide a way to specify an arbitrary timeout, or even disable it. The problem with such timeouts is that when a library-local timeout is hit, creating a fake dbus method reply, there's no way to tell the callee. In the case of iwd, it will just continue its Connect() operation since nothing happened as far as it is concerned. Any dbus method timeouts are up to the application to deal with correctly... Also, having dhcp use its own timeout incurs a cost of an extra file-descriptor. Not worth it. > IIRC), including how long all the connect attempts took prior. Maybe I'm > overthinking it but theoretically we could hit the method timeout limit so we > may want to add handling for it, which would be easier if we did the above with > a timeout versus retries. All we can do is document that Connect might be a long-running operation (should be obvious) for the reasons you outline. Regards, -Denis
Hi Denis, On 1/15/24 6:30 AM, Denis Kenzior wrote: > Hi James, > > On 1/14/24 10:12, James Prestwood wrote: >> Hi Denis, >> >> On 1/12/24 9:11 AM, Denis Kenzior wrote: >>> Hi James, >>> >>> On 1/11/24 07:33, James Prestwood wrote: >>>> If DHCP is in a SELECTING/REQUESTING state and the number of attempts >>>> reached a value where 2 << attempts overflowed an unsigned int the >>>> next timeout would become zero, causing DHCP to never retry without >>>> any event notification. >>>> >>>> Roughly 2 minutes has passed when we reach >5 attempts so it makes >>>> little sense to retry indefinitely, at least without notifying the >>>> upper layers (which could decided to retry themselves). >>> >>> Okay, but '5' is just a magic number in the code with no context. >>> Lets avoid that. >>> >>> Can we make this limit configurable? Say between 3..30? For iwd, >>> even 2 minutes might be too long. >> >> I'm thinking it may be better to leave the default unlimited retries >> but add a configurable time limit versus retry limit. Setting based >> on retries isn't intuitive a) because 99% of people have no idea what >> the backoff algorithm is, and b) time limits are fuzzed so its not an >> exact amount of time. > > Hopefully we know enough to tune the retries :) I was more saying in the context of someone using the ELL API, not specifically for IWD. > >> >> I say this because if we want to include netconfig into IWD's >> Connect() DBus method return we kinda need a way to define a time >> limit as opposed to guessing how long 2-3 retries will actually take. >> We, potentially, could iterate through several BSS's and fail >> extending the time to connect, then start DHCP. We would need a way >> to limit netconfig to within the DBus method timeout (25 seconds > > DBus method timeouts don't exist on the protocol level. ell doesn't > even let you specify a method timeout for that reason. Some libraries > use a locally generated timeout by default (such as libdbus-1), but > such libraries usually provide a way to specify an arbitrary timeout, > or even disable it. > > The problem with such timeouts is that when a library-local timeout is > hit, creating a fake dbus method reply, there's no way to tell the > callee. In the case of iwd, it will just continue its Connect() > operation since nothing happened as far as it is concerned. Any dbus > method timeouts are up to the application to deal with correctly... > > Also, having dhcp use its own timeout incurs a cost of an extra > file-descriptor. Not worth it. > >> IIRC), including how long all the connect attempts took prior. Maybe >> I'm overthinking it but theoretically we could hit the method timeout >> limit so we may want to add handling for it, which would be easier if >> we did the above with a timeout versus retries. > > All we can do is document that Connect might be a long-running > operation (should be obvious) for the reasons you outline. So from an ELL perspective I'm fine adding a retries setter (though timeout makes more intuitive sense IMO) but I do want to explain some issues I see with utilizing this in IWD, and doing netconfig prior to the DBus method return. Adding netconfig into the mix would greatly increase the likelihood of a DBus method timeout in "semi-normal" cases depending on the retry limit chosen. I understand the dilemma you explained above but I do think IWD should try its best to stay within the default DBus method timeout of 25 seconds. Trying to do this while also doing netconfig seems difficult unless we really cut down the allowed retries, which has is drawbacks as poor RF could easily drop a few frames and we could quickly get up to 25 seconds with the nature of the DHCP timeouts and if the wifi association took a while (multiple BSS retries). I'm also not sure if a DBus method timeout is going to break or cause weird behavior with Connman/NetworkManager etc. And even if its handled, what exactly should they do in this case? Will DHCP complete? should they re-issue connect which could return -EBUSY? What happens when IWD does reply and a timeout has already occurred? It really opens up a lot of possibilities here. This would have to be an IWD 3.0 change, but adding netconfig as a DBus state may be the better decision here. Then netconfig could take much longer (maybe user configurable), and we avoid/lessen the risk of method timeouts. To me this is similar to how Scan() works where the method returns quickly, but the caller then needs to wait for the scanning property before getting results. Thanks, James > Regards, > -Denis
Hi James, > > So from an ELL perspective I'm fine adding a retries setter (though timeout > makes more intuitive sense IMO) but I do want to explain some issues I see with > utilizing this in IWD, and doing netconfig prior to the DBus method return. I disagree here. For the dhcp_client, max retries is a much clearer 'lever' to adjust than the overall timeout. As you said earlier, with fuzzing the number of retries in a given period can vary. If the upper layers want to enforce a given timeout, they can do so anyway. > > Adding netconfig into the mix would greatly increase the likelihood of a DBus > method timeout in "semi-normal" cases depending on the retry limit chosen. I > understand the dilemma you explained above but I do think IWD should try its > best to stay within the default DBus method timeout of 25 seconds. Trying to do You'll have to explain why you make this assertion? If DBus bindings being used don't allow setting an infinite timeout, get better bindings. I don't see any need to force iwd into fitting into this arbitrary 25 second limit. And anyway, Connect() already can invoke Agent methods, which can take arbitrarily long anyway. So netconfig is the least of your problems. > this while also doing netconfig seems difficult unless we really cut down the > allowed retries, which has is drawbacks as poor RF could easily drop a few > frames and we could quickly get up to 25 seconds with the nature of the DHCP > timeouts and if the wifi association took a while (multiple BSS retries). Do you have any data / examples to back up this assertion? > > I'm also not sure if a DBus method timeout is going to break or cause weird > behavior with Connman/NetworkManager etc. And even if its handled, what exactly Not if they're written properly. > should they do in this case? Will DHCP complete? should they re-issue connect > which could return -EBUSY? What happens when IWD does reply and a timeout has > already occurred? It really opens up a lot of possibilities here. That is why I said that timeout handling has to be handled by the application. Most I've seen in real life do this poorly since there is usually a fundamental misunderstanding of how D-Bus works under the hood. In practice, locally generated timeouts like the one used by default in libdbus-1 are just a bad idea. I have no clue why they were added originally. If you have a misbehaving service on your bus which leaks messages, you will end up clogging D-Bus at some point and only a service restart will fix that. Your service always has to reply to the message. If you trust your running-as-root services to do the right thing, then adding arbitrary timeouts just gets in the way. The fact that the timeout happened is unknown to the dbus peer. It will continue as if nothing happened, and if well-behaved will generate a reply message eventually. This has to happen for proper functioning of D-Bus. The dbus-daemon will track that reply and forward it to the caller. The caller's dbus implementation will just drop it on the floor since the reply-to serial number is no longer on the pending list. So your iwd client has only two choices: 1. Call Connect() with no timeout and trust iwd to do the right thing. 2. Handle timeouts by trying to abort the connection somehow. Perhaps via Station.Disconnect(). But that has its own side-effects. 3. If a timeout happens, assume iwd service is dead and restart it. Hint: Option 1 or 3 is preferred Regards, -Denis
Hi Denis, On 1/16/24 8:13 AM, Denis Kenzior wrote: > Hi James, > >> >> So from an ELL perspective I'm fine adding a retries setter (though >> timeout makes more intuitive sense IMO) but I do want to explain some >> issues I see with utilizing this in IWD, and doing netconfig prior to >> the DBus method return. > > I disagree here. For the dhcp_client, max retries is a much clearer > 'lever' to adjust than the overall timeout. As you said earlier, with > fuzzing the number of retries in a given period can vary. If the > upper layers want to enforce a given timeout, they can do so anyway. > >> >> Adding netconfig into the mix would greatly increase the likelihood >> of a DBus method timeout in "semi-normal" cases depending on the >> retry limit chosen. I understand the dilemma you explained above but >> I do think IWD should try its best to stay within the default DBus >> method timeout of 25 seconds. Trying to do > > You'll have to explain why you make this assertion? If DBus bindings > being used don't allow setting an infinite timeout, get better > bindings. I don't see any need to force iwd into fitting into this > arbitrary 25 second limit. And anyway, Connect() already can invoke > Agent methods, which can take arbitrarily long anyway. So netconfig > is the least of your problems. Just based on the bindings we use for test-runner and the bindings I've been using recently (dbus_next), both have a 25 second method timeout. It appears that "dbus" (what we use in test-runner) might let you specify a timeout, not sure about dbus_next. I just wanted to communicate that a timeout is common across these bindings, for better or worse. If we don't want to conform to that, that's fine. > >> this while also doing netconfig seems difficult unless we really cut >> down the allowed retries, which has is drawbacks as poor RF could >> easily drop a few frames and we could quickly get up to 25 seconds >> with the nature of the DHCP timeouts and if the wifi association took >> a while (multiple BSS retries). > > Do you have any data / examples to back up this assertion? We have seen DHCP take 10-15 seconds due to dropped frames. All I'm saying is that just a few dropped frames can drastically increase the time. If we don't care about the arbitrary 25 second timeout, then this is fine, I just was pointing that out. > >> >> I'm also not sure if a DBus method timeout is going to break or cause >> weird behavior with Connman/NetworkManager etc. And even if its >> handled, what exactly > > Not if they're written properly. > >> should they do in this case? Will DHCP complete? should they re-issue >> connect which could return -EBUSY? What happens when IWD does reply >> and a timeout has already occurred? It really opens up a lot of >> possibilities here. > > That is why I said that timeout handling has to be handled by the > application. Most I've seen in real life do this poorly since there is > usually a fundamental misunderstanding of how D-Bus works under the hood. Ok, I'll definitely add handling for this, hopefully everything else using Connect() already handles it. > > In practice, locally generated timeouts like the one used by default > in libdbus-1 are just a bad idea. I have no clue why they were added > originally. If you have a misbehaving service on your bus which leaks > messages, you will end up clogging D-Bus at some point and only a > service restart will fix that. Your service always has to reply to > the message. If you trust your running-as-root services to do the > right thing, then adding arbitrary timeouts just gets in the way. > > The fact that the timeout happened is unknown to the dbus peer. It > will continue as if nothing happened, and if well-behaved will > generate a reply message eventually. This has to happen for proper > functioning of D-Bus. The dbus-daemon will track that reply and > forward it to the caller. The caller's dbus implementation will just > drop it on the floor since the reply-to serial number is no longer on > the pending list. > > So your iwd client has only two choices: > 1. Call Connect() with no timeout and trust iwd to do the right thing. > 2. Handle timeouts by trying to abort the connection somehow. Perhaps > via Station.Disconnect(). But that has its own side-effects. > 3. If a timeout happens, assume iwd service is dead and restart it. > > Hint: Option 1 or 3 is preferred > > Regards, > -Denis
diff --git a/ell/dhcp.c b/ell/dhcp.c index ece3e23..79fc54c 100644 --- a/ell/dhcp.c +++ b/ell/dhcp.c @@ -557,10 +557,23 @@ static void dhcp_client_timeout_resend(struct l_timeout *timeout, * RFC 2131 Section 4.1: * "The retransmission delay SHOULD be doubled with subsequent * retransmissions up to a maximum of 64 seconds. + * + * The maximum is hit after 5 attempts (2 << 5 == 64) */ - client->attempt += 1; - next_timeout = minsize(2 << client->attempt, 64); - break; + if (client->attempt <= 5) { + next_timeout = 2 << client->attempt++; + break; + } + + /* + * DHCP server is non-responsive after ~2 minutes, relay this to + * upper layers do decide how to proceed + */ + CLIENT_DEBUG("Max request/discover retires reached"); + + dhcp_client_event_notify(client, + L_DHCP_CLIENT_EVENT_MAX_RETRIES_REACHED); + return; case DHCP_STATE_INIT: case DHCP_STATE_INIT_REBOOT: case DHCP_STATE_REBOOTING: diff --git a/ell/dhcp.h b/ell/dhcp.h index 6ce4dde..4db573d 100644 --- a/ell/dhcp.h +++ b/ell/dhcp.h @@ -41,6 +41,7 @@ enum l_dhcp_client_event { L_DHCP_CLIENT_EVENT_LEASE_EXPIRED, L_DHCP_CLIENT_EVENT_LEASE_RENEWED, L_DHCP_CLIENT_EVENT_NO_LEASE, + L_DHCP_CLIENT_EVENT_MAX_RETRIES_REACHED, }; enum l_dhcp_server_event { diff --git a/ell/netconfig.c b/ell/netconfig.c index ab59299..1e6912a 100644 --- a/ell/netconfig.c +++ b/ell/netconfig.c @@ -548,6 +548,12 @@ static void netconfig_dhcp_event_handler(struct l_dhcp_client *client, if (!l_dhcp_client_start(nc->dhcp_client)) netconfig_failed(nc, AF_INET); + break; + case L_DHCP_CLIENT_EVENT_MAX_RETRIES_REACHED: + L_WARN_ON(nc->v4_configured); + + netconfig_failed(nc, AF_INET); + break; } }