ntp: remove accidental integer wrap-around

Message ID	20240507-b4-sio-ntp-usec-v1-1-15003fc9c2b4@google.com (mailing list archive)
State	Superseded
Headers	show Received: from mail-io1-f73.google.com (mail-io1-f73.google.com [209.85.166.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 65F291E87F for <linux-hardening@vger.kernel.org>; Tue, 7 May 2024 04:34:14 +0000 (UTC) Date: Tue, 07 May 2024 04:34:13 +0000 Precedence: bulk Mime-Version: 1.0 Message-ID: <20240507-b4-sio-ntp-usec-v1-1-15003fc9c2b4@google.com> Subject: [PATCH] ntp: remove accidental integer wrap-around From: Justin Stitt <justinstitt@google.com> To: John Stultz <jstultz@google.com>, Thomas Gleixner <tglx@linutronix.de>, Stephen Boyd <sboyd@kernel.org>, Nathan Chancellor <nathan@kernel.org>, Bill Wendling <morbo@google.com> Cc: linux-kernel@vger.kernel.org, llvm@lists.linux.dev, linux-hardening@vger.kernel.org, Justin Stitt <justinstitt@google.com> Content-Type: text/plain; charset="utf-8"
Series	ntp: remove accidental integer wrap-around \| expand ntp: remove accidental integer wrap-around

Justin Stitt May 7, 2024, 4:34 a.m. UTC

Using syzkaller alongside the newly reintroduced signed integer overflow
sanitizer spits out this report:

[  138.454979] ------------[ cut here ]------------
[  138.458089] UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:461:16
[  138.462134] 9223372036854775807 + 500 cannot be represented in type 'long'
[  138.466234] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0-rc2-00038-gc0a509640e93-dirty #10
[  138.471498] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[  138.477110] Call Trace:
[  138.478657]  <IRQ>
[  138.479964]  dump_stack_lvl+0x93/0xd0
[  138.482276]  handle_overflow+0x171/0x1b0
[  138.484699]  second_overflow+0x2d6/0x500
[  138.487133]  accumulate_nsecs_to_secs+0x60/0x160
[  138.489931]  timekeeping_advance+0x1fe/0x890
[  138.492535]  update_wall_time+0x10/0x30
...

Historically, the signed integer overflow sanitizer did not work in the
kernel due to its interaction with `-fwrapv` but this has since been
changed [1] in the newest version of Clang. It was re-enabled in the
kernel with Commit 557f8c582a9ba8ab ("ubsan: Reintroduce signed overflow
sanitizer").

Let's introduce a new macro and use that against NTP_PHASE_LIMIT to
properly limit the max size of time_maxerror without overflowing during
the check itself.

Link: https://github.com/llvm/llvm-project/pull/82432 [1]
Closes: https://github.com/KSPP/linux/issues/354
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
---
 include/linux/timex.h | 1 +
 kernel/time/ntp.c     | 8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)


---
base-commit: 0106679839f7c69632b3b9833c3268c316c0a9fc
change-id: 20240507-b4-sio-ntp-usec-1a3ab67bdce1

Best regards,
--
Justin Stitt <justinstitt@google.com>

John Stultz May 7, 2024, 5:54 a.m. UTC | #1

On Mon, May 6, 2024 at 9:34 PM Justin Stitt <justinstitt@google.com> wrote:
> Let's introduce a new macro and use that against NTP_PHASE_LIMIT to
> properly limit the max size of time_maxerror without overflowing during
> the check itself.
>
> Link: https://github.com/llvm/llvm-project/pull/82432 [1]
> Closes: https://github.com/KSPP/linux/issues/354
> Cc: linux-hardening@vger.kernel.org
> Signed-off-by: Justin Stitt <justinstitt@google.com>
> ---
>  include/linux/timex.h | 1 +
>  kernel/time/ntp.c     | 8 ++++----
>  2 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/timex.h b/include/linux/timex.h
> index 3871b06bd302..976490a06915 100644
> --- a/include/linux/timex.h
> +++ b/include/linux/timex.h
> @@ -138,6 +138,7 @@ unsigned long random_get_entropy_fallback(void);
>  #define MINSEC 256             /* min interval between updates (s) */
>  #define MAXSEC 2048            /* max interval between updates (s) */
>  #define NTP_PHASE_LIMIT ((MAXPHASE / NSEC_PER_USEC) << 5) /* beyond max. dispersion */
> +#define NTP_MAXFREQ_USEC (MAXFREQ / NSEC_PER_USEC) /* scaled to microseconds */
>
>  /*
>   * kernel variables
> diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
> index 406dccb79c2b..19027b6d0827 100644
> --- a/kernel/time/ntp.c
> +++ b/kernel/time/ntp.c
> @@ -454,12 +454,12 @@ int second_overflow(time64_t secs)
>         }
>
>
> -       /* Bump the maxerror field */
> -       time_maxerror += MAXFREQ / NSEC_PER_USEC;
> -       if (time_maxerror > NTP_PHASE_LIMIT) {
> +       /* Bump the maxerror field, making sure not to exceed NTP_PHASE_LIMIT */
> +       if (NTP_PHASE_LIMIT - NTP_MAXFREQ_USEC < time_maxerror) {
>                 time_maxerror = NTP_PHASE_LIMIT;
>                 time_status |= STA_UNSYNC;
> -       }
> +       } else
> +               time_maxerror += NTP_MAXFREQ_USEC;
>
>         /* Compute the phase adjustment for the next second */
>         tick_length      = tick_length_base;
>

Looks reasonable to me.
Acked-by: John Stultz <jstultz@google.com>

thanks
-john

Thomas Gleixner May 14, 2024, 10:38 a.m. UTC | #2

On Tue, May 07 2024 at 04:34, Justin Stitt wrote:
> Using syzkaller alongside the newly reintroduced signed integer overflow
> sanitizer spits out this report:
>
> [  138.454979] ------------[ cut here ]------------
> [  138.458089] UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:461:16
> [  138.462134] 9223372036854775807 + 500 cannot be represented in type 'long'
> [  138.466234] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0-rc2-00038-gc0a509640e93-dirty #10
> [  138.471498] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [  138.477110] Call Trace:
> [  138.478657]  <IRQ>
> [  138.479964]  dump_stack_lvl+0x93/0xd0
> [  138.482276]  handle_overflow+0x171/0x1b0
> [  138.484699]  second_overflow+0x2d6/0x500
> [  138.487133]  accumulate_nsecs_to_secs+0x60/0x160
> [  138.489931]  timekeeping_advance+0x1fe/0x890
> [  138.492535]  update_wall_time+0x10/0x30

Same comment vs. trimming.

> Historically, the signed integer overflow sanitizer did not work in the
> kernel due to its interaction with `-fwrapv` but this has since been
> changed [1] in the newest version of Clang. It was re-enabled in the
> kernel with Commit 557f8c582a9ba8ab ("ubsan: Reintroduce signed overflow
> sanitizer").

Again. Irrelevant to the problem.

> Let's introduce a new macro and use that against NTP_PHASE_LIMIT to
> properly limit the max size of time_maxerror without overflowing during
> the check itself.

This fails to tell what is causing the issue and just talks about what
the patch is doing. The latter can be seen from the patch itself, no?

Something like this:

   On second overflow time_maxerror is unconditionally incremented and
   the result is checked against NTP_PHASE_LIMIT, but the increment can
   overflow into negative space.

   Prevent this by checking the overflow condition before incrementing.

Hmm?

But that obviously begs the question why this can happen at all.

#define MAXPHASE 500000000L       
#define NTP_PHASE_LIMIT ((MAXPHASE / NSEC_PER_USEC) << 5)

==> NTP_PHASE_LIMIT = 1.6e+07 = 0xf42400

#define MAXFREQ 500000

So how can 0xf42400 + 500000/1000 overflow in the first place?

It can't unless time_maxerror is somehow initialized to a bogus
value and indeed it is:

process_adjtimex_modes()
        ....
	if (txc->modes & ADJ_MAXERROR)
		time_maxerror = txc->maxerror;

So that wants to be fixed and not the symptom.

Thanks,

        tglx

Justin Stitt May 16, 2024, 11:40 p.m. UTC | #3

Hi,

On Tue, May 14, 2024 at 3:38 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, May 07 2024 at 04:34, Justin Stitt wrote:
> > Using syzkaller alongside the newly reintroduced signed integer overflow
> > sanitizer spits out this report:
> >
> > [  138.454979] ------------[ cut here ]------------
> > [  138.458089] UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:461:16
> > [  138.462134] 9223372036854775807 + 500 cannot be represented in type 'long'
> > [  138.466234] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0-rc2-00038-gc0a509640e93-dirty #10
> > [  138.471498] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > [  138.477110] Call Trace:
> > [  138.478657]  <IRQ>
> > [  138.479964]  dump_stack_lvl+0x93/0xd0
> > [  138.482276]  handle_overflow+0x171/0x1b0
> > [  138.484699]  second_overflow+0x2d6/0x500
> > [  138.487133]  accumulate_nsecs_to_secs+0x60/0x160
> > [  138.489931]  timekeeping_advance+0x1fe/0x890
> > [  138.492535]  update_wall_time+0x10/0x30
>
> Same comment vs. trimming.

Gotcha, in the next version this will be trimmed.

>
> > Historically, the signed integer overflow sanitizer did not work in the
> > kernel due to its interaction with `-fwrapv` but this has since been
> > changed [1] in the newest version of Clang. It was re-enabled in the
> > kernel with Commit 557f8c582a9ba8ab ("ubsan: Reintroduce signed overflow
> > sanitizer").
>
> Again. Irrelevant to the problem.

Right, I'll move it below the fold.

>
> > Let's introduce a new macro and use that against NTP_PHASE_LIMIT to
> > properly limit the max size of time_maxerror without overflowing during
> > the check itself.
>
> This fails to tell what is causing the issue and just talks about what
> the patch is doing. The latter can be seen from the patch itself, no?
>
> Something like this:
>
>    On second overflow time_maxerror is unconditionally incremented and
>    the result is checked against NTP_PHASE_LIMIT, but the increment can
>    overflow into negative space.
>
>    Prevent this by checking the overflow condition before incrementing.
>
> Hmm?

Sounds better :thumbs_up: I'll use this!

>
> But that obviously begs the question why this can happen at all.
>
> #define MAXPHASE 500000000L
> #define NTP_PHASE_LIMIT ((MAXPHASE / NSEC_PER_USEC) << 5)
>
> ==> NTP_PHASE_LIMIT = 1.6e+07 = 0xf42400
>
> #define MAXFREQ 500000
>
> So how can 0xf42400 + 500000/1000 overflow in the first place?
>
> It can't unless time_maxerror is somehow initialized to a bogus
> value and indeed it is:
>
> process_adjtimex_modes()
>         ....
>         if (txc->modes & ADJ_MAXERROR)
>                 time_maxerror = txc->maxerror;
>
> So that wants to be fixed and not the symptom.

Isn't this usually supplied from the user and can be some pretty
random stuff? Are you suggesting we update
timekeeping_validate_timex() to include a check to limit the maxerror
field to (NTP_PHASE_LIMIT-(MAXFREQ / NSEC_PER_USEC))? It seems like we
should handle the overflow case where it happens: in
second_overflow().

The clear intent of the existing code was to saturate at
NTP_PHASE_LIMIT, they just did it in a way where the check itself
triggers overflow sanitizers.

>
> Thanks,
>
>         tglx

Thanks
Justin

Justin Stitt May 16, 2024, 11:55 p.m. UTC | #4

On Thu, May 16, 2024 at 4:40 PM Justin Stitt <justinstitt@google.com> wrote:
> Isn't this usually supplied from the user and can be some pretty
> random stuff? Are you suggesting we update
> timekeeping_validate_timex() to include a check to limit the maxerror
> field to (NTP_PHASE_LIMIT-(MAXFREQ / NSEC_PER_USEC))? It seems like we
> should handle the overflow case where it happens: in
> second_overflow().

Or, I suppose we could add a check to timekeeping_validate_timex() like:

if (txc->modes & ADJ_MAXERROR) {
    if (txc->maxerror < 0 || txc->maxerror > NTP_PHASE_LIMIT)
        return -EINVAL;
}


> Thanks
> Justin

Thomas Gleixner May 17, 2024, 8:49 a.m. UTC | #5

On Thu, May 16 2024 at 16:40, Justin Stitt wrote:
> On Tue, May 14, 2024 at 3:38 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> So how can 0xf42400 + 500000/1000 overflow in the first place?
>>
>> It can't unless time_maxerror is somehow initialized to a bogus
>> value and indeed it is:
>>
>> process_adjtimex_modes()
>>         ....
>>         if (txc->modes & ADJ_MAXERROR)
>>                 time_maxerror = txc->maxerror;
>>
>> So that wants to be fixed and not the symptom.
>
> Isn't this usually supplied from the user and can be some pretty
> random stuff?

Sure it comes from user space and can contain random nonsense as
syzkaller demonstrated.

> Are you suggesting we update timekeeping_validate_timex() to include a
> check to limit the maxerror field to (NTP_PHASE_LIMIT-(MAXFREQ /
> NSEC_PER_USEC))? It seems like we should handle the overflow case
> where it happens: in second_overflow().
>
> The clear intent of the existing code was to saturate at
> NTP_PHASE_LIMIT, they just did it in a way where the check itself
> triggers overflow sanitizers.

The clear intent of the code is to do saturation of a bound value.

Clearly the user space interface fails to validate the input to be in a
sane range and that makes you go and prevent the resulting overflow at
the usage site. Seriously?

Obviously the sanitizer detects the stupid in second_overflow(), but
that does not mean that the proper solution is to add overflow
protection to that code.

Tools are good to pin-point symptoms, but they are by definition
patently bad in root cause analysis. Otherwise we could just let the
tool write the "fix".

The obvious first question in such a case is to ask _WHY_ does
time_maxerror have a bogus value, which clearly cannot be achieved from
regular operation.

Once you figured out that the only way to set time_maxerror to a bogus
value is the user space interface the obvious follow up question is
whether such a value has to be considered as valid or not.

As it is obviously invalid the logical consequence is to add a sanity
check and reject that nonsense at that boundary, no?

Thanks,

        tglx

ntp: remove accidental integer wrap-around

Commit Message

Comments

Patch