diff mbox series

dhcp: fix overflow causing retries to stop

Message ID 20240109040621.1248647-1-prestwoj@gmail.com (mailing list archive)
State New
Headers show
Series dhcp: fix overflow causing retries to stop | expand

Checks

Context Check Description
tedd_an/pre-ci_am success Success
prestwoj/iwd-ci-makedistcheck success Make Distcheck
prestwoj/iwd-ci-build success Build - Configure
prestwoj/iwd-ci-makecheckvalgrind success Make Check w/Valgrind
prestwoj/iwd-ci-clang success clang PASS
prestwoj/iwd-ci-makecheck success Make Check
prestwoj/iwd-ci-testrunner success test-runner PASS

Commit Message

James Prestwood Jan. 9, 2024, 4:06 a.m. UTC
If DHCP is in a SELECTING/REQUESTING state and the number of attempts
reached a value where 2 << attempts overflowed an unsigned int the
next timeout would become zero, causing DHCP to never retry.

Since 5 attempts results in a value of 64 we can instead just limit
the attempts to 5, and set next_timeout to 64 after that as the spec
requires.

Fixes: f130c448 ("dhcp: Introduce timeout fuzzing")
---
 ell/dhcp.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Denis Kenzior Jan. 9, 2024, 4:27 a.m. UTC | #1
Hi James,

On 1/8/24 22:06, James Prestwood wrote:
> If DHCP is in a SELECTING/REQUESTING state and the number of attempts
> reached a value where 2 << attempts overflowed an unsigned int the
> next timeout would become zero, causing DHCP to never retry.
> 
> Since 5 attempts results in a value of 64 we can instead just limit
> the attempts to 5, and set next_timeout to 64 after that as the spec
> requires.

Overflowing next_timeout implies that discovery is taking ~25 minutes?  I think 
I'd rather we gave up and sent a NO_LEASE event after retries hits a certain max 
number?

Regards,
-Denis
James Prestwood Jan. 9, 2024, 12:18 p.m. UTC | #2
Hi Denis,

On 1/8/24 8:27 PM, Denis Kenzior wrote:
> Hi James,
>
> On 1/8/24 22:06, James Prestwood wrote:
>> If DHCP is in a SELECTING/REQUESTING state and the number of attempts
>> reached a value where 2 << attempts overflowed an unsigned int the
>> next timeout would become zero, causing DHCP to never retry.
>>
>> Since 5 attempts results in a value of 64 we can instead just limit
>> the attempts to 5, and set next_timeout to 64 after that as the spec
>> requires.
>
> Overflowing next_timeout implies that discovery is taking ~25 
> minutes?  I think I'd rather we gave up and sent a NO_LEASE event 
> after retries hits a certain max number?

Yep, this is whats happening, the DHCP server is down. Obviously we 
could change the logic, but NO_LEASE appears to just restart the DHCP 
client, i.e. no real difference between trying discovery forever. Should 
we add another event, L_DHCP_CLIENT_EVENT_TIMEOUT?

Thanks,

james

>
> Regards,
> -Denis
Denis Kenzior Jan. 9, 2024, 5:49 p.m. UTC | #3
Hi James,

On 1/9/24 06:18, James Prestwood wrote:
> Hi Denis,
> 
> On 1/8/24 8:27 PM, Denis Kenzior wrote:
>> Hi James,
>>
>> On 1/8/24 22:06, James Prestwood wrote:
>>> If DHCP is in a SELECTING/REQUESTING state and the number of attempts
>>> reached a value where 2 << attempts overflowed an unsigned int the
>>> next timeout would become zero, causing DHCP to never retry.
>>>
>>> Since 5 attempts results in a value of 64 we can instead just limit
>>> the attempts to 5, and set next_timeout to 64 after that as the spec
>>> requires.
>>
>> Overflowing next_timeout implies that discovery is taking ~25 minutes?  I 
>> think I'd rather we gave up and sent a NO_LEASE event after retries hits a 
>> certain max number?
> 
> Yep, this is whats happening, the DHCP server is down. Obviously we could change 
> the logic, but NO_LEASE appears to just restart the DHCP client, i.e. no real 

I'm not sure why netconfig does that actually.  From the comment it seems like 
it expects NO_LEASE event when the old address could not be obtained.  But, if I 
recall correctly, ell's dhcp client implementation only includes the old address 
when using RENEW or REBIND, and never on a fresh dhcp_client_start()-up due to 
privacy policy.

In the end, it is up to netconfig to implement a policy it wants.  Re-starting 
discovery is a valid policy, but I think in iwd's case, we should have netconfig 
treat 'NO_LEASE' as a DHCP server going down / server not present.  Having a max 
number of retries can facilitate that.  For example, if we're connecting and 
reach max_attempts of 6 or so, then give up and have iwd fail the connection.

> difference between trying discovery forever. Should we add another event, 
> L_DHCP_CLIENT_EVENT_TIMEOUT?

l_dhcp_client sends NO_LEASE when it gets a NAK and there's no point in 
continuing without further intervention (or configuration change).  Server not 
responding seems to fall into the same category, but we can also add an event 
explicitly for this.  Maybe MAX_DISCOVER_ATTEMPTS_REACHED?

Regards,
-Denis
diff mbox series

Patch

diff --git a/ell/dhcp.c b/ell/dhcp.c
index ece3e23..4f8b369 100644
--- a/ell/dhcp.c
+++ b/ell/dhcp.c
@@ -557,9 +557,14 @@  static void dhcp_client_timeout_resend(struct l_timeout *timeout,
 		 * RFC 2131 Section 4.1:
 		 * "The retransmission delay SHOULD be doubled with subsequent
 		 * retransmissions up to a maximum of 64 seconds.
+		 *
+		 * More than 5 attempts will trigger the maximum, don't increase
+		 * any more to avoid overflowing next_timeout.
 		 */
-		client->attempt += 1;
-		next_timeout = minsize(2 << client->attempt, 64);
+		if (client->attempt > 5)
+			next_timeout = 64;
+		else
+			next_timeout = 2 << client->attempt++;
 		break;
 	case DHCP_STATE_INIT:
 	case DHCP_STATE_INIT_REBOOT: