dhcp: fix overflow causing retries to stop

Message ID	20240109040621.1248647-1-prestwoj@gmail.com (mailing list archive)
State	New
Headers	show Received: from mail-qk1-f170.google.com (mail-qk1-f170.google.com [209.85.222.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1A0736104 for <ell@lists.linux.dev>; Tue, 9 Jan 2024 04:06:27 +0000 (UTC) From: James Prestwood <prestwoj@gmail.com> To: ell@lists.linux.dev Cc: James Prestwood <prestwoj@gmail.com> Subject: [PATCH] dhcp: fix overflow causing retries to stop Date: Mon, 8 Jan 2024 20:06:21 -0800 Message-Id: <20240109040621.1248647-1-prestwoj@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	dhcp: fix overflow causing retries to stop \| expand dhcp: fix overflow causing retries to stop

Message ID

20240109040621.1248647-1-prestwoj@gmail.com (mailing list archive)

State

New

Headers

From: James Prestwood <prestwoj@gmail.com>
To: ell@lists.linux.dev
Cc: James Prestwood <prestwoj@gmail.com>
Subject: [PATCH] dhcp: fix overflow causing retries to stop
Date: Mon,  8 Jan 2024 20:06:21 -0800
Message-Id: <20240109040621.1248647-1-prestwoj@gmail.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

dhcp: fix overflow causing retries to stop | expand

Checks

Context	Check	Description
tedd_an/pre-ci_am	success	Success
prestwoj/iwd-ci-makedistcheck	success	Make Distcheck
prestwoj/iwd-ci-build	success	Build - Configure
prestwoj/iwd-ci-makecheckvalgrind	success	Make Check w/Valgrind
prestwoj/iwd-ci-clang	success	clang PASS
prestwoj/iwd-ci-makecheck	success	Make Check
prestwoj/iwd-ci-testrunner	success	test-runner PASS

Context

Check

Description

tedd_an/pre-ci_am

success

Success

prestwoj/iwd-ci-makedistcheck

success

Make Distcheck

prestwoj/iwd-ci-build

success

Build - Configure

prestwoj/iwd-ci-makecheckvalgrind

success

Make Check w/Valgrind

prestwoj/iwd-ci-clang

success

clang PASS

prestwoj/iwd-ci-makecheck

success

Make Check

prestwoj/iwd-ci-testrunner

success

test-runner PASS

Commit Message

James Prestwood Jan. 9, 2024, 4:06 a.m. UTC

If DHCP is in a SELECTING/REQUESTING state and the number of attempts
reached a value where 2 << attempts overflowed an unsigned int the
next timeout would become zero, causing DHCP to never retry.

Since 5 attempts results in a value of 64 we can instead just limit
the attempts to 5, and set next_timeout to 64 after that as the spec
requires.

Fixes: f130c448 ("dhcp: Introduce timeout fuzzing")
---
 ell/dhcp.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Denis Kenzior Jan. 9, 2024, 4:27 a.m. UTC | #1

Hi James,

On 1/8/24 22:06, James Prestwood wrote:
> If DHCP is in a SELECTING/REQUESTING state and the number of attempts
> reached a value where 2 << attempts overflowed an unsigned int the
> next timeout would become zero, causing DHCP to never retry.
> 
> Since 5 attempts results in a value of 64 we can instead just limit
> the attempts to 5, and set next_timeout to 64 after that as the spec
> requires.

Overflowing next_timeout implies that discovery is taking ~25 minutes?  I think 
I'd rather we gave up and sent a NO_LEASE event after retries hits a certain max 
number?

Regards,
-Denis

James Prestwood Jan. 9, 2024, 12:18 p.m. UTC | #2

Hi Denis,

On 1/8/24 8:27 PM, Denis Kenzior wrote:
> Hi James,
>
> On 1/8/24 22:06, James Prestwood wrote:
>> If DHCP is in a SELECTING/REQUESTING state and the number of attempts
>> reached a value where 2 << attempts overflowed an unsigned int the
>> next timeout would become zero, causing DHCP to never retry.
>>
>> Since 5 attempts results in a value of 64 we can instead just limit
>> the attempts to 5, and set next_timeout to 64 after that as the spec
>> requires.
>
> Overflowing next_timeout implies that discovery is taking ~25 
> minutes?  I think I'd rather we gave up and sent a NO_LEASE event 
> after retries hits a certain max number?

Yep, this is whats happening, the DHCP server is down. Obviously we 
could change the logic, but NO_LEASE appears to just restart the DHCP 
client, i.e. no real difference between trying discovery forever. Should 
we add another event, L_DHCP_CLIENT_EVENT_TIMEOUT?

Thanks,

james

>
> Regards,
> -Denis

Denis Kenzior Jan. 9, 2024, 5:49 p.m. UTC | #3

Hi James,

On 1/9/24 06:18, James Prestwood wrote:
> Hi Denis,
> 
> On 1/8/24 8:27 PM, Denis Kenzior wrote:
>> Hi James,
>>
>> On 1/8/24 22:06, James Prestwood wrote:
>>> If DHCP is in a SELECTING/REQUESTING state and the number of attempts
>>> reached a value where 2 << attempts overflowed an unsigned int the
>>> next timeout would become zero, causing DHCP to never retry.
>>>
>>> Since 5 attempts results in a value of 64 we can instead just limit
>>> the attempts to 5, and set next_timeout to 64 after that as the spec
>>> requires.
>>
>> Overflowing next_timeout implies that discovery is taking ~25 minutes?  I 
>> think I'd rather we gave up and sent a NO_LEASE event after retries hits a 
>> certain max number?
> 
> Yep, this is whats happening, the DHCP server is down. Obviously we could change 
> the logic, but NO_LEASE appears to just restart the DHCP client, i.e. no real 

I'm not sure why netconfig does that actually.  From the comment it seems like 
it expects NO_LEASE event when the old address could not be obtained.  But, if I 
recall correctly, ell's dhcp client implementation only includes the old address 
when using RENEW or REBIND, and never on a fresh dhcp_client_start()-up due to 
privacy policy.

In the end, it is up to netconfig to implement a policy it wants.  Re-starting 
discovery is a valid policy, but I think in iwd's case, we should have netconfig 
treat 'NO_LEASE' as a DHCP server going down / server not present.  Having a max 
number of retries can facilitate that.  For example, if we're connecting and 
reach max_attempts of 6 or so, then give up and have iwd fail the connection.

> difference between trying discovery forever. Should we add another event, 
> L_DHCP_CLIENT_EVENT_TIMEOUT?

l_dhcp_client sends NO_LEASE when it gets a NAK and there's no point in 
continuing without further intervention (or configuration change).  Server not 
responding seems to fall into the same category, but we can also add an event 
explicitly for this.  Maybe MAX_DISCOVER_ATTEMPTS_REACHED?

Regards,
-Denis

diff --git a/ell/dhcp.c b/ell/dhcp.c
index ece3e23..4f8b369 100644
--- a/ell/dhcp.c
+++ b/ell/dhcp.c
@@ -557,9 +557,14 @@  static void dhcp_client_timeout_resend(struct l_timeout *timeout,
 		 * RFC 2131 Section 4.1:
 		 * "The retransmission delay SHOULD be doubled with subsequent
 		 * retransmissions up to a maximum of 64 seconds.
+		 *
+		 * More than 5 attempts will trigger the maximum, don't increase
+		 * any more to avoid overflowing next_timeout.
 		 */
-		client->attempt += 1;
-		next_timeout = minsize(2 << client->attempt, 64);
+		if (client->attempt > 5)
+			next_timeout = 64;
+		else
+			next_timeout = 2 << client->attempt++;
 		break;
 	case DHCP_STATE_INIT:
 	case DHCP_STATE_INIT_REBOOT:

dhcp: fix overflow causing retries to stop

Checks

Commit Message

Comments

Patch