
RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - anyone else seeing this?

Message ID 20170822001928.74f01322@roar.ozlabs.ibm.com (mailing list archive)
State New, archived

Commit Message

Nicholas Piggin Aug. 21, 2017, 2:19 p.m. UTC
On Mon, 21 Aug 2017 11:18:33 +0100
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Mon, 21 Aug 2017 16:06:05 +1000
> Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> > On Mon, 21 Aug 2017 10:52:58 +1000
> > Nicholas Piggin <npiggin@gmail.com> wrote:
> >   
> > > On Sun, 20 Aug 2017 14:14:29 -0700
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > >     
> > > > On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:      
> > > > > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:        
> > > > > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > > > > Nicholas Piggin <npiggin@gmail.com> wrote:
> > > > > >         
> > > > > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:        
> > > > > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > > > > > > 
> > > > > > > > Thomas, John, am I misinterpreting the timer trace event messages?          
> > > > > > > 
> > > > > > > So I did some digging, and what you find is that rcu_sched seems to do a
> > > > > > > simple schedule_timeout(1) and just goes out to lunch for many seconds.
> > > > > > > The process_timeout timer never fires (when it finally does wake after
> > > > > > > one of these events, it usually removes the timer with del_timer_sync).
> > > > > > > 
> > > > > > > So this patch seems to fix it. Testing, comments welcome.        
> > > > > > 
> > > > > > Okay this had a problem of trying to forward the timer from a timer
> > > > > > callback function.
> > > > > > 
> > > > > > This was my other approach which also fixes the RCU warnings, but it's
> > > > > > a little more complex. I reworked it a bit so the mod_timer fast path
> > > > > > hopefully doesn't have much more overhead (actually by reading jiffies
> > > > > > only when needed, it probably saves a load).        
> > > > > 
> > > > > Giving this one a whirl!        
> > > > 
> > > > No joy here, but then again there are other reasons to believe that I
> > > > am seeing a different bug than Dave and Jonathan are.
> > > > 
> > > > OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> > > > is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> > > > not statistically different from what I see without either patch.
> > > > 
> > > > But no statistical difference compared to without patch, and I still
> > > > see the "rcu_sched kthread starved" messages.  For whatever it is worth,
> > > > by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> > > > Hmmm...  I am also seeing that without any of your patches.  Might
> > > > be hypervisor preemption, I guess.      
> > > 
> > > Okay it makes the warnings go away for me, but I'm just booting then
> > > leaving the system idle. You're doing some CPU hotplug activity?    
> > 
> > Okay found a bug in the patch (it was not forwarding properly before
> > adding the first timer after an idle period) and a few other concerns.
> > 
> > There's still a problem of a timer function doing a mod timer from
> > within expire_timers. It can't forward the base, which might currently
> > be quite a way behind. I *think* after we close these gaps and get
> > timely wakeups for timers on there, it should not get too far behind
> > for standard timers.
> > 
> > Deferrable is a different story. Firstly it has no idle tracking so we
> > never forward it. Even if we wanted to, we can't do it reliably because
> > it could contain timers way behind the base. They are "deferrable", so
> > you get what you pay for, but this still means there's a window where
> > you can add a deferrable timer and get a far later expiry than you
> > asked for despite the CPU never going idle after you added it.
> > 
> > All these problems would seem to go away if mod_timer just queued up
> > the timer to a single list on the base then pushed them into the
> > wheel during your wheel processing softirq... Although maybe you end
> > up with excessive passes over a big queue of timers. Anyway that
> > wouldn't be suitable for 4.13 even if it could work.
> > 
> > I'll send out an updated minimal fix after some more testing...  
> 
> Hi All,
> 
> I'm back in the office with hardware access on our D05 64 core ARM64
> boards.
> 
> I think we still have by far the quickest test cases for this so
> feel free to ping me anything you want tested quickly (we were
> looking at an average of less than 10 minutes to trigger
> with machine idling).
> 
> Nick, I'm currently running your previous version and we are over an
> hour in now without any instances of the issue, so it looks like a
> considerable improvement.  I'll see if I can line a couple of boards
> up for an overnight run if you have your updated version out by then.
> 
> Be great to finally put this one to bed.

Hi Jonathan,

Thanks here's an updated version with a couple more bugs fixed. If
you could try testing, that would be much appreciated.

Thanks,
Nick

---
 kernel/time/timer.c | 43 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 35 insertions(+), 8 deletions(-)

Comments

Jonathan Cameron Aug. 21, 2017, 3:02 p.m. UTC | #1
On Tue, 22 Aug 2017 00:19:28 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> On Mon, 21 Aug 2017 11:18:33 +0100
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> > On Mon, 21 Aug 2017 16:06:05 +1000
> > Nicholas Piggin <npiggin@gmail.com> wrote:
> >   
> > > On Mon, 21 Aug 2017 10:52:58 +1000
> > > Nicholas Piggin <npiggin@gmail.com> wrote:
> > >     
> > > > On Sun, 20 Aug 2017 14:14:29 -0700
> > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > >       
> > > > > On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:        
> > > > > > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:          
> > > > > > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > > > > > Nicholas Piggin <npiggin@gmail.com> wrote:
> > > > > > >           
> > > > > > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:          
> > > > > > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > > > > > > > 
> > > > > > > > > Thomas, John, am I misinterpreting the timer trace event messages?            
> > > > > > > > 
> > > > > > > > So I did some digging, and what you find is that rcu_sched seems to do a
> > > > > > > > simple schedule_timeout(1) and just goes out to lunch for many seconds.
> > > > > > > > The process_timeout timer never fires (when it finally does wake after
> > > > > > > > one of these events, it usually removes the timer with del_timer_sync).
> > > > > > > > 
> > > > > > > > So this patch seems to fix it. Testing, comments welcome.          
> > > > > > > 
> > > > > > > Okay this had a problem of trying to forward the timer from a timer
> > > > > > > callback function.
> > > > > > > 
> > > > > > > This was my other approach which also fixes the RCU warnings, but it's
> > > > > > > a little more complex. I reworked it a bit so the mod_timer fast path
> > > > > > > hopefully doesn't have much more overhead (actually by reading jiffies
> > > > > > > only when needed, it probably saves a load).          
> > > > > > 
> > > > > > Giving this one a whirl!          
> > > > > 
> > > > > No joy here, but then again there are other reasons to believe that I
> > > > > am seeing a different bug than Dave and Jonathan are.
> > > > > 
> > > > > OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> > > > > is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> > > > > not statistically different from what I see without either patch.
> > > > > 
> > > > > But no statistical difference compared to without patch, and I still
> > > > > see the "rcu_sched kthread starved" messages.  For whatever it is worth,
> > > > > by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> > > > > Hmmm...  I am also seeing that without any of your patches.  Might
> > > > > be hypervisor preemption, I guess.        
> > > > 
> > > > Okay it makes the warnings go away for me, but I'm just booting then
> > > > leaving the system idle. You're doing some CPU hotplug activity?      
> > > 
> > > Okay found a bug in the patch (it was not forwarding properly before
> > > adding the first timer after an idle period) and a few other concerns.
> > > 
> > > There's still a problem of a timer function doing a mod timer from
> > > within expire_timers. It can't forward the base, which might currently
> > > be quite a way behind. I *think* after we close these gaps and get
> > > timely wakeups for timers on there, it should not get too far behind
> > > for standard timers.
> > > 
> > > Deferrable is a different story. Firstly it has no idle tracking so we
> > > never forward it. Even if we wanted to, we can't do it reliably because
> > > it could contain timers way behind the base. They are "deferrable", so
> > > you get what you pay for, but this still means there's a window where
> > > you can add a deferrable timer and get a far later expiry than you
> > > asked for despite the CPU never going idle after you added it.
> > > 
> > > All these problems would seem to go away if mod_timer just queued up
> > > the timer to a single list on the base then pushed them into the
> > > wheel during your wheel processing softirq... Although maybe you end
> > > up with excessive passes over a big queue of timers. Anyway that
> > > wouldn't be suitable for 4.13 even if it could work.
> > > 
> > > I'll send out an updated minimal fix after some more testing...    
> > 
> > Hi All,
> > 
> > I'm back in the office with hardware access on our D05 64 core ARM64
> > boards.
> > 
> > I think we still have by far the quickest test cases for this so
> > feel free to ping me anything you want tested quickly (we were
> > looking at an average of less than 10 minutes to trigger
> > with machine idling).
> > 
> > Nick, I'm currently running your previous version and we are over an
> > hour in now without any instances of the issue, so it looks like a
> > considerable improvement.  I'll see if I can line a couple of boards
> > up for an overnight run if you have your updated version out by then.
> > 
> > Be great to finally put this one to bed.  
> 
> Hi Jonathan,
> 
> Thanks here's an updated version with a couple more bugs fixed. If
> you could try testing, that would be much appreciated.
> 
> Thanks,
> Nick

Running now on 1 board. I'll grab another in a few hours and report back
in the morning if we don't see issues before I head off.

We got to about 5 hours on the previous version without a problem, vs
sub-10 minutes on the two baseline tests I ran without it, so even
with its bugs that version seemed to have dealt with the issue itself.

15 minutes in so far and all good.

Jonathan

> 
> ---
>  kernel/time/timer.c | 43 +++++++++++++++++++++++++++++++++++--------
>  1 file changed, 35 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 8f5d1bf18854..2b9d2cdb3fac 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -203,6 +203,7 @@ struct timer_base {
>  	bool			migration_enabled;
>  	bool			nohz_active;
>  	bool			is_idle;
> +	bool			was_idle; /* was it idle since last run/fwded */
>  	DECLARE_BITMAP(pending_map, WHEEL_SIZE);
>  	struct hlist_head	vectors[WHEEL_SIZE];
>  } ____cacheline_aligned;
> @@ -856,13 +857,19 @@ get_target_base(struct timer_base *base, unsigned tflags)
>  
>  static inline void forward_timer_base(struct timer_base *base)
>  {
> -	unsigned long jnow = READ_ONCE(jiffies);
> +	unsigned long jnow;
>  
>  	/*
> -	 * We only forward the base when it's idle and we have a delta between
> -	 * base clock and jiffies.
> +	 * We only forward the base when we are idle or have just come out
> +	 * of idle (was_idle logic), and have a delta between base clock
> +	 * and jiffies. In the common case, run_timers will take care of it.
>  	 */
> -	if (!base->is_idle || (long) (jnow - base->clk) < 2)
> +	if (likely(!base->was_idle))
> +		return;
> +
> +	jnow = READ_ONCE(jiffies);
> +	base->was_idle = base->is_idle;
> +	if ((long)(jnow - base->clk) < 2)
>  		return;
>  
>  	/*
> @@ -938,6 +945,13 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
>  	 * same array bucket then just return:
>  	 */
>  	if (timer_pending(timer)) {
> +		/*
> +		 * The downside of this optimization is that it can result in
> +		 * larger granularity than you would get from adding a new
> +		 * timer with this expiry. Would a timer flag for networking
> +		 * be appropriate, then we can try to keep expiry of general
> +		 * timers within ~1/8th of their interval?
> +		 */
>  		if (timer->expires == expires)
>  			return 1;
>  
> @@ -948,6 +962,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
>  		 * dequeue/enqueue dance.
>  		 */
>  		base = lock_timer_base(timer, &flags);
> +		forward_timer_base(base);
>  
>  		clk = base->clk;
>  		idx = calc_wheel_index(expires, clk);
> @@ -964,6 +979,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
>  		}
>  	} else {
>  		base = lock_timer_base(timer, &flags);
> +		forward_timer_base(base);
>  	}
>  
>  	ret = detach_if_pending(timer, base, false);
> @@ -991,12 +1007,10 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
>  			raw_spin_lock(&base->lock);
>  			WRITE_ONCE(timer->flags,
>  				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);
> +			forward_timer_base(base);
>  		}
>  	}
>  
> -	/* Try to forward a stale timer base clock */
> -	forward_timer_base(base);
> -
>  	timer->expires = expires;
>  	/*
>  	 * If 'idx' was calculated above and the base time did not advance
> @@ -1499,8 +1513,10 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
>  		/*
>  		 * If we expect to sleep more than a tick, mark the base idle:
>  		 */
> -		if ((expires - basem) > TICK_NSEC)
> +		if ((expires - basem) > TICK_NSEC) {
> +			base->was_idle = true;
>  			base->is_idle = true;
> +		}
>  	}
>  	raw_spin_unlock(&base->lock);
>  
> @@ -1611,6 +1627,17 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h)
>  {
>  	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
>  
> +	/*
> +	 * was_idle must be cleared before running timers so that any timer
> +	 * functions that call mod_timer will not try to forward the base.
> +	 *
> +	 * The deferrable base does not do idle tracking at all, so we do
> +	 * not forward it. This can result in very large variations in
> +	 * granularity for deferrable timers, but they can be deferred for
> +	 * long periods due to idle.
> +	 */
> +	base->was_idle = false;
> +
>  	__run_timers(base);
>  	if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active)
>  		__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
David Miller Aug. 21, 2017, 8:55 p.m. UTC | #2
From: Nicholas Piggin <npiggin@gmail.com>
Date: Tue, 22 Aug 2017 00:19:28 +1000

> Thanks here's an updated version with a couple more bugs fixed. If
> you could try testing, that would be much appreciated.

I'm not getting RCU stalls on sparc64 any longer with this patch.

I'm really happy you guys were able to figure out what was going
wrong. :-)

Feel free to add my Tested-by:
Jonathan Cameron Aug. 22, 2017, 7:49 a.m. UTC | #3
On Mon, 21 Aug 2017 13:55:04 -0700
David Miller <davem@davemloft.net> wrote:

> From: Nicholas Piggin <npiggin@gmail.com>
> Date: Tue, 22 Aug 2017 00:19:28 +1000
> 
> > Thanks here's an updated version with a couple more bugs fixed. If
> > you could try testing, that would be much appreciated.  
> 
> I'm not getting RCU stalls on sparc64 any longer with this patch.
> 
> I'm really happy you guys were able to figure out what was going
> wrong. :-)
> 
> Feel free to add my Tested-by:
> 

Likewise - 16 hours of clean running with the latest

Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Thanks for all the hard work everyone put into this one, great to
cross it off the list!

Jonathan
Abdul Haleem Aug. 22, 2017, 8:51 a.m. UTC | #4
On Tue, 2017-08-22 at 08:49 +0100, Jonathan Cameron wrote:
> On Mon, 21 Aug 2017 13:55:04 -0700
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Nicholas Piggin <npiggin@gmail.com>
> > Date: Tue, 22 Aug 2017 00:19:28 +1000
> > 
> > > Thanks here's an updated version with a couple more bugs fixed. If
> > > you could try testing, that would be much appreciated.  
> > 
> > I'm not getting RCU stalls on sparc64 any longer with this patch.
> > 
> > I'm really happy you guys were able to figure out what was going
> > wrong. :-)
> > 
> > Feel free to add my Tested-by:
> > 
> 
> Likewise - 16 hours of clean running with the latest
> 
> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Thanks for all the hard work everyone put into this one, great to
> cross it off the list!
> 
> Jonathan
> 

No more RCU stalls on PowerPC, system is clean when idle or with some
test runs.

Thank you all for your time and efforts in fixing this.

Reported-and-Tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Paul E. McKenney Aug. 22, 2017, 3:26 p.m. UTC | #5
On Tue, Aug 22, 2017 at 02:21:32PM +0530, Abdul Haleem wrote:
> On Tue, 2017-08-22 at 08:49 +0100, Jonathan Cameron wrote:
> > On Mon, 21 Aug 2017 13:55:04 -0700
> > David Miller <davem@davemloft.net> wrote:
> > 
> > > From: Nicholas Piggin <npiggin@gmail.com>
> > > Date: Tue, 22 Aug 2017 00:19:28 +1000
> > > 
> > > > Thanks here's an updated version with a couple more bugs fixed. If
> > > > you could try testing, that would be much appreciated.  
> > > 
> > > I'm not getting RCU stalls on sparc64 any longer with this patch.
> > > 
> > > I'm really happy you guys were able to figure out what was going
> > > wrong. :-)
> > > 
> > > Feel free to add my Tested-by:
> > > 
> > 
> > Likewise - 16 hours of clean running with the latest
> > 
> > Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> > Thanks for all the hard work everyone put into this one, great to
> > cross it off the list!
> > 
> > Jonathan
> > 
> 
> No more RCU stalls on PowerPC, system is clean when idle or with some
> test runs.
> 
> Thank you all for your time and efforts in fixing this.
> 
> Reported-and-Tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>

I am still seeing failures, but then again I am running rcutorture with
lots of CPU hotplug activity.  So I am probably seeing some other bug,
though it still looks a lot like a lost timer.

							Thanx, Paul
Paul E. McKenney Sept. 6, 2017, 12:28 p.m. UTC | #6
On Tue, Aug 22, 2017 at 08:26:37AM -0700, Paul E. McKenney wrote:
> On Tue, Aug 22, 2017 at 02:21:32PM +0530, Abdul Haleem wrote:
> > On Tue, 2017-08-22 at 08:49 +0100, Jonathan Cameron wrote:

[ . . . ]

> > No more RCU stalls on PowerPC, system is clean when idle or with some
> > test runs.
> > 
> > Thank you all for your time and efforts in fixing this.
> > 
> > Reported-and-Tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
> 
> I am still seeing failures, but then again I am running rcutorture with
> lots of CPU hotplug activity.  So I am probably seeing some other bug,
> though it still looks a lot like a lost timer.

So one problem appears to be a timing-related deadlock between RCU and
timers.  The way that this can happen is that the outgoing CPU goes
offline (as in cpuhp_report_idle_dead() invoked from do_idle()) with
one of RCU's grace-period kthread's timers queued.  Now, if someone
waits for a grace period, either directly or indirectly, in a way that
blocks the hotplug notifiers, execution will never reach timers_dead_cpu(),
which means that the grace-period kthread will never wake, which will
mean that the grace period will never complete.  Classic deadlock.
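
As a rough illustration of the shape of that deadlock (the callback name
here is hypothetical and the exact teardown ordering is glossed over),
the waiting side looks something like this:

static int example_offline_teardown(unsigned int cpu)
{
	/*
	 * Suppose this teardown callback runs before timers_dead_cpu()
	 * while the grace-period kthread's schedule_timeout() timer is
	 * still queued on the outgoing CPU's timer base.  The grace
	 * period cannot end until that timer fires, the timer cannot
	 * fire until timers_dead_cpu() migrates it, and timers_dead_cpu()
	 * cannot run until this callback returns.
	 */
	synchronize_rcu();	/* never returns */
	return 0;
}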

I currently have an extremely ugly workaround for this deadlock, which
is to periodically and (usually) redundantly wake up all the RCU
grace-period kthreads from the scheduling-interrupt handler.  This is
of course completely inappropriate for mainline, but it does reliably
prevent the "kthread starved for %ld jiffies!" type of RCU CPU stall
warning that I would otherwise see.
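
A minimal sketch of that kind of kludge (a reconstruction, not the actual
workaround; the helper name, the rate limiting, and passing in the kthread
pointer are all assumptions) would be something like:

static void example_kick_gp_kthread(struct task_struct *gp_kthread)
{
	/*
	 * Called from the scheduling-clock interrupt path.  Redundant
	 * wakeups are harmless, so just poke the grace-period kthread
	 * every 16 ticks or so in case its timer wakeup was lost.
	 */
	if (gp_kthread && !(jiffies & 15))
		wake_up_process(gp_kthread);
}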

To mainline this, one approach would be to make the timers switch to
add_timer_on() to a surviving CPU once the offlining process starts.
Alternatively, I suppose that RCU could do the redundant-wakeup kludge,
but with checks to prevent it from happening unless (1) there is a CPU
in the process of going offline (2) there is an RCU grace period in
progress, and (3) the RCU grace period kthread has been blocked for
(say) three times longer than it should have.
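
For the add_timer_on() option, a sketch (helper name made up; add_timer_on(),
del_timer() and cpumask_any_but() are the existing interfaces) might look
like:

static void example_requeue_on_survivor(struct timer_list *timer,
					unsigned int dying_cpu)
{
	unsigned int target = cpumask_any_but(cpu_online_mask, dying_cpu);

	/* nr_cpu_ids means no surviving CPU was found. */
	if (target < nr_cpu_ids && del_timer(timer))
		add_timer_on(timer, target);
}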

Unfortunately, this is not sufficient to make rcutorture run reliably,
though it does help, which is of course to say that it makes debugging
slower.  ;-)

What happens now is that random rcutorture kthreads will hang waiting for
timeouts to complete.  This confused me for a while because I expected
that the timeouts would be delayed during offline processing, but that
my crude deadlock-resolution approach would eventually get things going.
My current suspicion is that the problem is due to a potential delay
between the time an outgoing CPU hits cpuhp_report_idle_dead() and the
time its timers get migrated by timers_dead_cpu().  This means that the CPU
adopting the timers might be a few ticks ahead of where the outgoing CPU
last processed timers.  My current guess is that any timers queued in
intervening indexes are going to wait one good long time.  And I don't see
any code in timers_dead_cpu() that would account for this possibility,
though I of course cannot claim to fully understand this code.

Is this plausible, or am I confused?  (Either way, -something- besides
just me is rather thoroughly confused!)

If this is plausible, my guess is that timers_dead_cpu() needs to check
for mismatched indexes (in timer->flags?) and force any intervening
timers to expire if so.
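
To make that guess concrete, here is a very rough sketch (speculative, and
ignoring locking, flag fixup and the existing migration path) of re-bucketing
a migrated timer against the new base's clock rather than splicing the
vectors across verbatim:

static void example_rebucket(struct timer_base *new_base,
			     struct timer_list *timer)
{
	/*
	 * Recompute the wheel index against the new base's clock, so a
	 * timer coming from a base whose clk lags behind cannot land in
	 * a bucket the new base has already passed.  calc_wheel_index()
	 * is the same helper __mod_timer() uses above.
	 */
	unsigned int idx = calc_wheel_index(timer->expires, new_base->clk);

	hlist_add_head(&timer->entry, new_base->vectors + idx);
	__set_bit(idx, new_base->pending_map);
}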

Thoughts?

							Thanx, Paul

Patch

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 8f5d1bf18854..2b9d2cdb3fac 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -203,6 +203,7 @@  struct timer_base {
 	bool			migration_enabled;
 	bool			nohz_active;
 	bool			is_idle;
+	bool			was_idle; /* was it idle since last run/fwded */
 	DECLARE_BITMAP(pending_map, WHEEL_SIZE);
 	struct hlist_head	vectors[WHEEL_SIZE];
 } ____cacheline_aligned;
@@ -856,13 +857,19 @@  get_target_base(struct timer_base *base, unsigned tflags)
 
 static inline void forward_timer_base(struct timer_base *base)
 {
-	unsigned long jnow = READ_ONCE(jiffies);
+	unsigned long jnow;
 
 	/*
-	 * We only forward the base when it's idle and we have a delta between
-	 * base clock and jiffies.
+	 * We only forward the base when we are idle or have just come out
+	 * of idle (was_idle logic), and have a delta between base clock
+	 * and jiffies. In the common case, run_timers will take care of it.
 	 */
-	if (!base->is_idle || (long) (jnow - base->clk) < 2)
+	if (likely(!base->was_idle))
+		return;
+
+	jnow = READ_ONCE(jiffies);
+	base->was_idle = base->is_idle;
+	if ((long)(jnow - base->clk) < 2)
 		return;
 
 	/*
@@ -938,6 +945,13 @@  __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 	 * same array bucket then just return:
 	 */
 	if (timer_pending(timer)) {
+		/*
+		 * The downside of this optimization is that it can result in
+		 * larger granularity than you would get from adding a new
+		 * timer with this expiry. Would a timer flag for networking
+		 * be appropriate, then we can try to keep expiry of general
+		 * timers within ~1/8th of their interval?
+		 */
 		if (timer->expires == expires)
 			return 1;
 
@@ -948,6 +962,7 @@  __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 		 * dequeue/enqueue dance.
 		 */
 		base = lock_timer_base(timer, &flags);
+		forward_timer_base(base);
 
 		clk = base->clk;
 		idx = calc_wheel_index(expires, clk);
@@ -964,6 +979,7 @@  __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 		}
 	} else {
 		base = lock_timer_base(timer, &flags);
+		forward_timer_base(base);
 	}
 
 	ret = detach_if_pending(timer, base, false);
@@ -991,12 +1007,10 @@  __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 			raw_spin_lock(&base->lock);
 			WRITE_ONCE(timer->flags,
 				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);
+			forward_timer_base(base);
 		}
 	}
 
-	/* Try to forward a stale timer base clock */
-	forward_timer_base(base);
-
 	timer->expires = expires;
 	/*
 	 * If 'idx' was calculated above and the base time did not advance
@@ -1499,8 +1513,10 @@  u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
 		/*
 		 * If we expect to sleep more than a tick, mark the base idle:
 		 */
-		if ((expires - basem) > TICK_NSEC)
+		if ((expires - basem) > TICK_NSEC) {
+			base->was_idle = true;
 			base->is_idle = true;
+		}
 	}
 	raw_spin_unlock(&base->lock);
 
@@ -1611,6 +1627,17 @@  static __latent_entropy void run_timer_softirq(struct softirq_action *h)
 {
 	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
 
+	/*
+	 * was_idle must be cleared before running timers so that any timer
+	 * functions that call mod_timer will not try to forward the base.
+	 *
+	 * The deferrable base does not do idle tracking at all, so we do
+	 * not forward it. This can result in very large variations in
+	 * granularity for deferrable timers, but they can be deferred for
+	 * long periods due to idle.
+	 */
+	base->was_idle = false;
+
 	__run_timers(base);
 	if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active)
 		__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));