Message ID | 20170822015439.11263-1-npiggin@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, 22 Aug 2017, Nicholas Piggin wrote: > I would have preferred to get comments from the timer maintainers, but > they've been busy or away for the past copule of weeks. Perhaps you > would consider carrying it until then? Yes, I was on vacation, but that patch never hit my LKML inbox .... > diff --git a/kernel/time/timer.c b/kernel/time/timer.c > index 8f5d1bf18854..dd7be9fe6839 100644 > --- a/kernel/time/timer.c > +++ b/kernel/time/timer.c > @@ -203,6 +203,7 @@ struct timer_base { > bool migration_enabled; > bool nohz_active; > bool is_idle; > + bool was_idle; /* was it idle since last run/fwded */ Please do not use tail comments. They are a horrible habit and disturb the reading flow. I'm not fond of that name either. Something like forward_clk or a similar self explaining name would be appreciated. > /* > @@ -938,6 +945,13 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only) > * same array bucket then just return: > */ > if (timer_pending(timer)) { > + /* > + * The downside of this optimization is that it can result in > + * larger granularity than you would get from adding a new > + * timer with this expiry. Would a timer flag for networking > + * be appropriate, then we can try to keep expiry of general > + * timers within ~1/8th of their interval? This really depends on the timer usage type. If you use it merily for TCP timeouts or similar things, then this does not matter at all because then the timer is the safety net in case something goes wrong. If you lose packets on the transport the larger granularity is the least of your worries. From earlier discussions with the networking folks these timeouts are the reason for this optimization because you avoid the expensive dequeue/requeue operation if the successful communication is fast. If the timer has a short relative expiry and is used for sending packets or similar purposes, then it usually sits in the first bucket and has no granularity issues anyway. But from staring at trace data provided by google and facebook these timer are not rearmed while pending, they fire, do networking stuff and then get rearmed. So I rather avoid that kind of misleading comment. The first sentence surely can stay. Other than that, nice detective work! Thanks, tglx
On Tue, 22 Aug 2017 09:45:46 +0200 (CEST) Thomas Gleixner <tglx@linutronix.de> wrote: > On Tue, 22 Aug 2017, Nicholas Piggin wrote: > > I would have preferred to get comments from the timer maintainers, but > > they've been busy or away for the past copule of weeks. Perhaps you > > would consider carrying it until then? > > Yes, I was on vacation, but that patch never hit my LKML inbox .... Hmm okay. Well that's fine, we're just getting done testing it here. > > > diff --git a/kernel/time/timer.c b/kernel/time/timer.c > > index 8f5d1bf18854..dd7be9fe6839 100644 > > --- a/kernel/time/timer.c > > +++ b/kernel/time/timer.c > > @@ -203,6 +203,7 @@ struct timer_base { > > bool migration_enabled; > > bool nohz_active; > > bool is_idle; > > + bool was_idle; /* was it idle since last run/fwded */ > > Please do not use tail comments. They are a horrible habit and disturb the > reading flow. Sure. > I'm not fond of that name either. Something like forward_clk > or a similar self explaining name would be appreciated. Well I actually had that initially (must_forward). But on the other hand that's less explanatory about the state of the timer base I thought. Anyway I could go either way and you're going to be looking at this code more than me in future :) > > > /* > > @@ -938,6 +945,13 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only) > > * same array bucket then just return: > > */ > > if (timer_pending(timer)) { > > + /* > > + * The downside of this optimization is that it can result in > > + * larger granularity than you would get from adding a new > > + * timer with this expiry. Would a timer flag for networking > > + * be appropriate, then we can try to keep expiry of general > > + * timers within ~1/8th of their interval? > > This really depends on the timer usage type. > > If you use it merily for TCP timeouts or similar things, then this does not > matter at all because then the timer is the safety net in case something > goes wrong. If you lose packets on the transport the larger granularity is > the least of your worries. > > From earlier discussions with the networking folks these timeouts are the > reason for this optimization because you avoid the expensive > dequeue/requeue operation if the successful communication is fast. > > If the timer has a short relative expiry and is used for sending packets or > similar purposes, then it usually sits in the first bucket and has no > granularity issues anyway. But from staring at trace data provided by > google and facebook these timer are not rearmed while pending, they fire, > do networking stuff and then get rearmed. > > So I rather avoid that kind of misleading comment. The first sentence > surely can stay. Right. The question was actually there for you to answer, so thanks. I can certainly understand the use-case and importance of keeping it light. > Other than that, nice detective work! Thanks for taking a look (and thanks everyone who's been testing and hacking on it). Do you want me to re-send with the changes? Thanks, Nick
diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 8f5d1bf18854..dd7be9fe6839 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -203,6 +203,7 @@ struct timer_base { bool migration_enabled; bool nohz_active; bool is_idle; + bool was_idle; /* was it idle since last run/fwded */ DECLARE_BITMAP(pending_map, WHEEL_SIZE); struct hlist_head vectors[WHEEL_SIZE]; } ____cacheline_aligned; @@ -856,13 +857,19 @@ get_target_base(struct timer_base *base, unsigned tflags) static inline void forward_timer_base(struct timer_base *base) { - unsigned long jnow = READ_ONCE(jiffies); + unsigned long jnow; /* - * We only forward the base when it's idle and we have a delta between - * base clock and jiffies. + * We only forward the base when we are idle or have just come out + * of idle (was_idle logic), and have a delta between base clock + * and jiffies. In the common case, run_timers will take care of it. */ - if (!base->is_idle || (long) (jnow - base->clk) < 2) + if (likely(!base->was_idle)) + return; + + jnow = READ_ONCE(jiffies); + base->was_idle = base->is_idle; + if ((long)(jnow - base->clk) < 2) return; /* @@ -938,6 +945,13 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only) * same array bucket then just return: */ if (timer_pending(timer)) { + /* + * The downside of this optimization is that it can result in + * larger granularity than you would get from adding a new + * timer with this expiry. Would a timer flag for networking + * be appropriate, then we can try to keep expiry of general + * timers within ~1/8th of their interval? + */ if (timer->expires == expires) return 1; @@ -948,6 +962,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only) * dequeue/enqueue dance. */ base = lock_timer_base(timer, &flags); + forward_timer_base(base); clk = base->clk; idx = calc_wheel_index(expires, clk); @@ -964,6 +979,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only) } } else { base = lock_timer_base(timer, &flags); + forward_timer_base(base); } ret = detach_if_pending(timer, base, false); @@ -991,12 +1007,10 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only) raw_spin_lock(&base->lock); WRITE_ONCE(timer->flags, (timer->flags & ~TIMER_BASEMASK) | base->cpu); + forward_timer_base(base); } } - /* Try to forward a stale timer base clock */ - forward_timer_base(base); - timer->expires = expires; /* * If 'idx' was calculated above and the base time did not advance @@ -1112,6 +1126,7 @@ void add_timer_on(struct timer_list *timer, int cpu) WRITE_ONCE(timer->flags, (timer->flags & ~TIMER_BASEMASK) | cpu); } + forward_timer_base(base); debug_activate(timer, timer->expires); internal_add_timer(base, timer); @@ -1499,8 +1514,10 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem) /* * If we expect to sleep more than a tick, mark the base idle: */ - if ((expires - basem) > TICK_NSEC) + if ((expires - basem) > TICK_NSEC) { + base->was_idle = true; base->is_idle = true; + } } raw_spin_unlock(&base->lock); @@ -1611,6 +1628,17 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h) { struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]); + /* + * was_idle must be cleared before running timers so that any timer + * functions that call mod_timer will not try to forward the base. + * + * The deferrable base does not do idle tracking at all, so we do + * not forward it. This can result in very large variations in + * granularity for deferrable timers, but they can be deferred for + * long periods due to idle. + */ + base->was_idle = false; + __run_timers(base); if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active) __run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));