Message ID | 20220330082046.3512424-2-asavkov@redhat.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Upper bound kernel timers | expand |
On Wed, 30 Mar 2022, Artem Savkov wrote: > Current timer wheel implementation is optimized for performance and > energy usage but lacks in precision. This, normally, is not a problem as > most timers that use timer wheel are used for timeouts and thus rarely > expire, instead they often get canceled or modified before expiration. > Even when they don't, expiring a bit late is not an issue for timeout > timers. > > TCP keepalive timer is a special case, it's aim is to prevent timeouts, > so triggering earlier rather than later is desired behavior. In a > reported case the user had a 3600s keepalive timer for preventing firewall > disconnects (on a 3650s interval). They observed keepalive timers coming > in up to four minutes late, causing unexpected disconnects. > > This commit adds upper_bound_timeout() function that takes a relative > timeout and adjusts it based on timer wheel granularity so that supplied > value effectively becomes an upper bound for the timer. > I think there is a problem with this approach. Please correct me, if I'm wrong. The timer wheel index and level calculation depends on timer_base::clk. The timeout/delta which is used for this calculation is relative to timer_base::clk (delta = expires - base::clk). timer_base::clk is not updated in sync with jiffies. It is forwarded before a new timer is queued. It is possible, that timer_base::clk is behind jiffies after forwarding because of a not yet expired timer. When calculating the level/index with a relative timeout, there is no guarantee that the result is the same when actual enqueueing the timer with expiry = jiffies + timeout . Thanks, Anna-Maria
On Wed, Mar 30, 2022 at 03:40:55PM +0200, Anna-Maria Behnsen wrote: > On Wed, 30 Mar 2022, Artem Savkov wrote: > > > Current timer wheel implementation is optimized for performance and > > energy usage but lacks in precision. This, normally, is not a problem as > > most timers that use timer wheel are used for timeouts and thus rarely > > expire, instead they often get canceled or modified before expiration. > > Even when they don't, expiring a bit late is not an issue for timeout > > timers. > > > > TCP keepalive timer is a special case, it's aim is to prevent timeouts, > > so triggering earlier rather than later is desired behavior. In a > > reported case the user had a 3600s keepalive timer for preventing firewall > > disconnects (on a 3650s interval). They observed keepalive timers coming > > in up to four minutes late, causing unexpected disconnects. > > > > This commit adds upper_bound_timeout() function that takes a relative > > timeout and adjusts it based on timer wheel granularity so that supplied > > value effectively becomes an upper bound for the timer. > > > > I think there is a problem with this approach. Please correct me, if I'm > wrong. The timer wheel index and level calculation depends on > timer_base::clk. The timeout/delta which is used for this calculation is > relative to timer_base::clk (delta = expires - base::clk). timer_base::clk > is not updated in sync with jiffies. It is forwarded before a new timer is > queued. It is possible, that timer_base::clk is behind jiffies after > forwarding because of a not yet expired timer. > > When calculating the level/index with a relative timeout, there is no > guarantee that the result is the same when actual enqueueing the timer with > expiry = jiffies + timeout . Yes, you are correct. This especially is a problem for timeouts placed just before LVL_START(x), which is a good chunk of cases. I don't think it is possible to get to timer_base clock without meddling with the hot-path. Is it possible to determine the upper limit of error margin here? My assumption is it shouldn't be very big, so maybe it would be enough to account for this when adjusting timeout at the edge of a level. I know this doesn't sound good but I am running out of ideas here.
On Sat, Apr 02 2022 at 08:55, Artem Savkov wrote: > On Wed, Mar 30, 2022 at 03:40:55PM +0200, Anna-Maria Behnsen wrote: >> When calculating the level/index with a relative timeout, there is no >> guarantee that the result is the same when actual enqueueing the timer with >> expiry = jiffies + timeout . > > Yes, you are correct. This especially is a problem for timeouts placed > just before LVL_START(x), which is a good chunk of cases. I don't > think it is possible to get to timer_base clock without meddling with > the hot-path. Especially not when the timer might end up being migrated to a different base, which would in the worst case require to do the calculation twice if the base clocks differ. The problem especially with network timers is that in the traffic case the timers are often rearmed at high frequency. See the optimizations in mod_timer() which try to avoid dequeue/enqueue sequences. One of the main reasons for giving up on accuracy for the timer wheel back then was to avoid two issues for networking: - Full rearming of a timer which ended up in the same expiry bucket due to software batching (incomplete, but with the same goal and the same 12.5% error margin). - Cascading bursts. We gathered statistics from various workloads which showed that 99+% of all timers were canceled or rearmed before expiry, but under certain load scenarios a large portion of those timers where cascaded into a lower wheel level with smaller granularity before cancelation or rearming. As the wheel level granularity was strictly exponential the event of cascading a large amount (hint: thousands) of timers in one go was not uncommon. Cascading happened with interrupts disabled and it touched a usualy cold cache line per cascaded timer... > Is it possible to determine the upper limit of error margin here? My > assumption is it shouldn't be very big, so maybe it would be enough to > account for this when adjusting timeout at the edge of a level. > I know this doesn't sound good but I am running out of ideas here. Let's just take a step back. So we know, that the maximal error margin in the wheel is 12.5%, right? That means, if you take your relative timeout and subtract 12.5% then you are in the right ballpark and the earliest expiry will not be before that point obviously, but it's also guaranteed not to expire later than the original timeout. Obviously this will converge towards the early expiry the longer the timeouts are, but it's bound. Also due to the properties of the wheel, the lag of base::clk will obviously only affect those levels where lag >= LVL_GRAN(level). That's not perfect, but perfect is the enemy of good. Thanks, tglx
On Wed, Apr 06 2022 at 11:52, Artem Savkov wrote: > On Tue, Apr 05, 2022 at 05:33:23PM +0200, Thomas Gleixner wrote: >> On Sat, Apr 02 2022 at 08:55, Artem Savkov wrote: >> > Is it possible to determine the upper limit of error margin here? My >> > assumption is it shouldn't be very big, so maybe it would be enough to >> > account for this when adjusting timeout at the edge of a level. >> > I know this doesn't sound good but I am running out of ideas here. >> >> Let's just take a step back. >> >> So we know, that the maximal error margin in the wheel is 12.5%, right? >> That means, if you take your relative timeout and subtract 12.5% then >> you are in the right ballpark and the earliest expiry will not be before >> that point obviously, but it's also guaranteed not to expire later than >> the original timeout. Obviously this will converge towards the early >> expiry the longer the timeouts are, but it's bound. > > Ok, I was trying to avoid a "hole" where any timeout < LVL_GRAN(lvl) > would be always substantially (LVL_GRAN(lvl) - LVL_GRAN(lvl - 1)) early > but looks like this is unavoidable with this approach. Right, but where is the problem you are trying to solve? Does it matter whether the keepalive timer fires after 28 minutes or after 30 minutes? Not really. All you are about that it does not fire 2 minutes late. So what? >> Also due to the properties of the wheel, the lag of base::clk will >> obviously only affect those levels where lag >= LVL_GRAN(level). > > Is this true? Won't it be enough for the lag to be just > lag >= (LVL_START(lvl) - adjusted_timeout) for the cases when we cross > level boundary on adjustment? The corner case is at the next boundary level. The resulting worst case there is one jiffy, which is below noise level :) Thanks, tglx
diff --git a/include/linux/timer.h b/include/linux/timer.h index fda13c9d1256c..b209d31d543f0 100644 --- a/include/linux/timer.h +++ b/include/linux/timer.h @@ -168,6 +168,7 @@ static inline int timer_pending(const struct timer_list * timer) return !hlist_unhashed_lockless(&timer->entry); } +extern unsigned long upper_bound_timeout(unsigned long timeout); extern void add_timer_on(struct timer_list *timer, int cpu); extern int del_timer(struct timer_list * timer); extern int mod_timer(struct timer_list *timer, unsigned long expires); diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 85f1021ad4595..76d4f26c991be 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -507,28 +507,38 @@ static inline unsigned calc_index(unsigned long expires, unsigned lvl, return LVL_OFFS(lvl) + (expires & LVL_MASK); } -static int calc_wheel_index(unsigned long expires, unsigned long clk, - unsigned long *bucket_expiry) +static inline int get_wheel_lvl(unsigned long delta) { - unsigned long delta = expires - clk; - unsigned int idx; - if (delta < LVL_START(1)) { - idx = calc_index(expires, 0, bucket_expiry); + return 0; } else if (delta < LVL_START(2)) { - idx = calc_index(expires, 1, bucket_expiry); + return 1; } else if (delta < LVL_START(3)) { - idx = calc_index(expires, 2, bucket_expiry); + return 2; } else if (delta < LVL_START(4)) { - idx = calc_index(expires, 3, bucket_expiry); + return 3; } else if (delta < LVL_START(5)) { - idx = calc_index(expires, 4, bucket_expiry); + return 4; } else if (delta < LVL_START(6)) { - idx = calc_index(expires, 5, bucket_expiry); + return 5; } else if (delta < LVL_START(7)) { - idx = calc_index(expires, 6, bucket_expiry); + return 6; } else if (LVL_DEPTH > 8 && delta < LVL_START(8)) { - idx = calc_index(expires, 7, bucket_expiry); + return 7; + } + + return -1; +} + +static int calc_wheel_index(unsigned long expires, unsigned long clk, + unsigned long *bucket_expiry) +{ + unsigned long delta = expires - clk; + unsigned int idx; + int lvl = get_wheel_lvl(delta); + + if (lvl >= 0) { + idx = calc_index(expires, lvl, bucket_expiry); } else if ((long) delta < 0) { idx = clk & LVL_MASK; *bucket_expiry = clk; @@ -545,6 +555,62 @@ static int calc_wheel_index(unsigned long expires, unsigned long clk, return idx; } +/** + * upper_bound_timeout - return granularity-adjusted timeout + * @timeout: timeout value in jiffies + * + * This function return supplied timeout adjusted based on timer wheel + * granularity effectively making supplied value an upper bound at which the + * timer will expire. + */ +unsigned long upper_bound_timeout(unsigned long timeout) +{ + int lvl = get_wheel_lvl(timeout); + + if (lvl < 0) { + if ((long) timeout < 0) { + /* + * This will expire immediately so no adjustment + * needed. + */ + return timeout; + } else { + if (timeout > WHEEL_TIMEOUT_CUTOFF) + timeout = WHEEL_TIMEOUT_CUTOFF; + lvl = LVL_DEPTH - 1; + } + } + + if (timeout - LVL_GRAN(lvl) < LVL_START(lvl)) { + /* + * Beginning of each level is a special case, we can't just + * subtract LVL_GRAN(lvl) because then timeout ends up on a + * previous level and is guaranteed to fire off early. We can't + * mitigate this completely, but we can try to minimize the + * margin. + */ + if (timeout - LVL_GRAN(lvl - 1) < LVL_START(lvl)) { + /* + * If timeout is within previous level's granularity + * adjust timeout using that level's granularity. + */ + return timeout - LVL_GRAN(lvl - 1); + } else { + /* + * If LVL_GRAN(lvl - 1) < timeout < LVL_GRAN(lvl) + * the best we can do is to set timeout to the end of + * previous level. This means that the timer will + * trigger early. In worst case it will be _at least_ + * (LVL_GRAN(lvl) - LVL_GRAN(lvl -1)) jiffies early. + */ + return LVL_START(lvl) - 1; + } + } else { + return timeout - LVL_GRAN(lvl); + } +} +EXPORT_SYMBOL(upper_bound_timeout); + static void trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer) {
Current timer wheel implementation is optimized for performance and energy usage but lacks in precision. This, normally, is not a problem as most timers that use timer wheel are used for timeouts and thus rarely expire, instead they often get canceled or modified before expiration. Even when they don't, expiring a bit late is not an issue for timeout timers. TCP keepalive timer is a special case, it's aim is to prevent timeouts, so triggering earlier rather than later is desired behavior. In a reported case the user had a 3600s keepalive timer for preventing firewall disconnects (on a 3650s interval). They observed keepalive timers coming in up to four minutes late, causing unexpected disconnects. This commit adds upper_bound_timeout() function that takes a relative timeout and adjusts it based on timer wheel granularity so that supplied value effectively becomes an upper bound for the timer. This was previously discussed here: https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Artem Savkov <asavkov@redhat.com> --- include/linux/timer.h | 1 + kernel/time/timer.c | 92 +++++++++++++++++++++++++++++++++++++------ 2 files changed, 80 insertions(+), 13 deletions(-)