Message ID | 1422022952-31552-5-git-send-email-daniel.thompson@linaro.org (mailing list archive)
---|---
State | New, archived
On Fri, 23 Jan 2015, Daniel Thompson wrote:
> This patch fixes that problem by providing banked clock data in a
> similar manner to Thomas Gleixner's 4396e058c52e ("timekeeping: Provide
> fast and NMI safe access to CLOCK_MONOTONIC").

By some definition of similar.

> -struct clock_data {
> -        ktime_t wrap_kt;
> +struct clock_data_banked {
>          u64 epoch_ns;
>          u64 epoch_cyc;
> -        seqcount_t seq;
> -        unsigned long rate;
> +        u64 (*read_sched_clock)(void);
> +        u64 sched_clock_mask;
>          u32 mult;
>          u32 shift;
>          bool suspended;
> };
>
> +struct clock_data {
> +        ktime_t wrap_kt;
> +        seqcount_t seq;
> +        unsigned long rate;
> +        struct clock_data_banked bank[2];
> +};

....

> -static u64 __read_mostly (*read_sched_clock)(void) = jiffy_sched_clock_read;
> +static struct clock_data cd = {
> +        .bank = {
> +                [0] = {
> +                        .mult = NSEC_PER_SEC / HZ,
> +                        .read_sched_clock = jiffy_sched_clock_read,
> +                },
> +        },
> +};

If you had carefully studied the changes which made it possible to do
the nmi safe clock monotonic accessor then you'd have noticed that I
went a great way to optimize the cache foot print first and then add
this new fangled thing.

So in the first place 'cd' lacks ____cacheline_aligned. It should have
been there before, but that's a different issue. You should have
noticed.

Secondly, I don't see any hint that you actually thought about the
cache foot print of the resulting struct clock_data.

struct clock_data {
        ktime_t wrap_kt;
        seqcount_t seq;
        unsigned long rate;
        struct clock_data_banked bank[2];
};

wrap_kt and rate are completely irrelevant for the hotpath. The whole
thing up to the last member of bank[0] still fits into 64 bytes on both
32 and 64bit, but that's not by design and not documented, so anyone
who is aware of cache foot print issues will go WTF when the first
member of a hot path data structure is completely irrelevant.

> static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
> {
> @@ -58,50 +65,82 @@ static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
>
> unsigned long long notrace sched_clock(void)
> {
> -        u64 epoch_ns;
> -        u64 epoch_cyc;
>          u64 cyc;
>          unsigned long seq;
> -
> -        if (cd.suspended)
> -                return cd.epoch_ns;
> +        struct clock_data_banked *b;
> +        u64 res;

So we now have

        u64 cyc;
        unsigned long seq;
        struct clock_data_banked *b;
        u64 res;

Let me try a different version of that:

        struct clock_data_banked *b;
        unsigned long seq;
        u64 res, cyc;

Can you spot the difference in the reading experience?

>          do {
> -                seq = raw_read_seqcount_begin(&cd.seq);
> -                epoch_cyc = cd.epoch_cyc;
> -                epoch_ns = cd.epoch_ns;
> +                seq = raw_read_seqcount(&cd.seq);
> +                b = cd.bank + (seq & 1);
> +                if (b->suspended) {
> +                        res = b->epoch_ns;

So now we have read_sched_clock as a pointer in the bank. Why do you
still need b->suspended?

What's wrong with setting b->read_sched_clock to NULL at suspend,
restoring the proper pointer on resume and using that as a conditional?

It would allow the compiler to generate better code, but that's
obviously not the goal here. Darn, this is hot path code and not some
random driver.

> +                } else {
> +                        cyc = b->read_sched_clock();
> +                        cyc = (cyc - b->epoch_cyc) & b->sched_clock_mask;
> +                        res = b->epoch_ns + cyc_to_ns(cyc, b->mult, b->shift);

It would allow the following optimization as well:

        res = b->epoch_ns;
        if (b->read_sched_clock) {
                ...
        }

If you think that compilers are smart enough to figure all that out
for you, you might get surprised. The clearer your code is, the better
the chance that the compiler gets it right. We have seen the opposite
of that as well, but that's clearly a compiler bug.

> +/*
> + * Start updating the banked clock data.
> + *
> + * sched_clock will never observe mis-matched data even if called from
> + * an NMI. We do this by maintaining an odd/even copy of the data and
> + * steering sched_clock to one or the other using a sequence counter.
> + * In order to preserve the data cache profile of sched_clock as much
> + * as possible the system reverts back to the even copy when the update
> + * completes; the odd copy is used *only* during an update.
> + *
> + * The caller is responsible for avoiding simultaneous updates.
> + */
> +static struct clock_data_banked *update_bank_begin(void)
> +{
> +        /* update the backup (odd) bank and steer readers towards it */
> +        memcpy(cd.bank + 1, cd.bank, sizeof(struct clock_data_banked));
> +        raw_write_seqcount_latch(&cd.seq);
> +
> +        return cd.bank;
> +}
> +
> +/*
> + * Finalize update of banked clock data.
> + *
> + * This is just a trivial switch back to the primary (even) copy.
> + */
> +static void update_bank_end(void)
> +{
> +        raw_write_seqcount_latch(&cd.seq);
> }

What's wrong with having a master struct

        struct master_data {
                struct clock_data_banked master_data;
                ktime_t wrap_kt;
                unsigned long rate;
                u64 (*real_read_sched_clock)(void);
        };

Then you only have to care about the serialization of the master_data
update, and then the hotpath data update would be the same simple
function as update_fast_timekeeper(). And it would have the same
ordering scheme, and aside from that the resulting code would be
simpler, more intuitive to read and I'm pretty sure faster.

Thanks,

        tglx
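To make the suggestion above concrete, the following is a minimal sketch of the master-data arrangement Thomas describes, assuming the clock_data_banked layout from the patch. The names clock_data_master, cm and update_clock_data() are invented here for illustration; this is not code from the thread, merely the update_fast_timekeeper()-style ordering applied to sched_clock's data.

#include <linux/cache.h>
#include <linux/ktime.h>
#include <linux/seqlock.h>
#include <linux/types.h>

/* Cold data: only the (serialized) updaters ever touch this. */
struct clock_data_master {
        struct clock_data_banked data;          /* next values to publish */
        ktime_t wrap_kt;
        unsigned long rate;
        u64 (*real_read_sched_clock)(void);
};

/* Hot data: nothing in front of the banks that the reader does not need. */
struct clock_data {
        seqcount_t seq;
        struct clock_data_banked bank[2];
};

static struct clock_data_master cm;
static struct clock_data cd ____cacheline_aligned;

/*
 * Publish cm.data to the hot-path banks, mirroring update_fast_timekeeper():
 * steer readers to the odd bank, refresh the even bank, steer readers back,
 * then refresh the odd bank so both copies match again.  Callers serialize
 * updates of cm themselves.
 */
static void update_clock_data(void)
{
        raw_write_seqcount_latch(&cd.seq);
        cd.bank[0] = cm.data;
        raw_write_seqcount_latch(&cd.seq);
        cd.bank[1] = cd.bank[0];
}

With this split, both update_sched_clock() and sched_clock_register() would prepare cm and then call update_clock_data(), and the reader never has to step over wrap_kt or rate to reach the bank it needs.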
On 24/01/15 22:40, Thomas Gleixner wrote:
> On Fri, 23 Jan 2015, Daniel Thompson wrote:
>> This patch fixes that problem by providing banked clock data in a
>> similar manner to Thomas Gleixner's 4396e058c52e ("timekeeping: Provide
>> fast and NMI safe access to CLOCK_MONOTONIC").
>
> By some definition of similar.

Fair point, I copied only the NMI-safety concept. Anyhow, thanks very
much for the review.

>> -struct clock_data {
>> -        ktime_t wrap_kt;
>> +struct clock_data_banked {
>>          u64 epoch_ns;
>>          u64 epoch_cyc;
>> -        seqcount_t seq;
>> -        unsigned long rate;
>> +        u64 (*read_sched_clock)(void);
>> +        u64 sched_clock_mask;
>>          u32 mult;
>>          u32 shift;
>>          bool suspended;
>> };
>>
>> +struct clock_data {
>> +        ktime_t wrap_kt;
>> +        seqcount_t seq;
>> +        unsigned long rate;
>> +        struct clock_data_banked bank[2];
>> +};
>
> ....
>
>> -static u64 __read_mostly (*read_sched_clock)(void) = jiffy_sched_clock_read;
>> +static struct clock_data cd = {
>> +        .bank = {
>> +                [0] = {
>> +                        .mult = NSEC_PER_SEC / HZ,
>> +                        .read_sched_clock = jiffy_sched_clock_read,
>> +                },
>> +        },
>> +};
>
> If you had carefully studied the changes which made it possible to do
> the nmi safe clock monotonic accessor then you'd have noticed that I
> went a great way to optimize the cache foot print first and then add
> this new fangled thing.
>
> So in the first place 'cd' lacks ____cacheline_aligned. It should have
> been there before, but that's a different issue. You should have
> noticed.
>
> Secondly, I don't see any hint that you actually thought about the
> cache foot print of the resulting struct clock_data.

I did think about the cache footprint, but only to the point of
believing my patch was unlikely to regress performance. As it happens,
it was the absence of __cacheline_aligned on cd in the current code
that made me believe absence of regression would be enough (once I'd
managed that, I ordered the members within the structure to get the
best locality of reference within the *patch* in order to make code
review easier).

I guess I did two things wrong here: inadequately documenting what work
I did and possessing insufficient ambition to improve! I'll work on
both of these.

> struct clock_data {
>         ktime_t wrap_kt;
>         seqcount_t seq;
>         unsigned long rate;
>         struct clock_data_banked bank[2];
> };
>
> wrap_kt and rate are completely irrelevant for the hotpath. The whole
> thing up to the last member of bank[0] still fits into 64 bytes on both
> 32 and 64bit, but that's not by design and not documented, so anyone
> who is aware of cache foot print issues will go WTF when the first
> member of a hot path data structure is completely irrelevant.

Agreed. It looks like I also put the function pointer in the wrong
place within clock_data_banked. It should occupy the space between the
64-bit and 32-bit members, shouldn't it?

>> static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
>> {
>> @@ -58,50 +65,82 @@ static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
>>
>> unsigned long long notrace sched_clock(void)
>> {
>> -        u64 epoch_ns;
>> -        u64 epoch_cyc;
>>          u64 cyc;
>>          unsigned long seq;
>> -
>> -        if (cd.suspended)
>> -                return cd.epoch_ns;
>> +        struct clock_data_banked *b;
>> +        u64 res;
>
> So we now have
>
>         u64 cyc;
>         unsigned long seq;
>         struct clock_data_banked *b;
>         u64 res;
>
> Let me try a different version of that:
>
>         struct clock_data_banked *b;
>         unsigned long seq;
>         u64 res, cyc;
>
> Can you spot the difference in the reading experience?

Will fix.

>>          do {
>> -                seq = raw_read_seqcount_begin(&cd.seq);
>> -                epoch_cyc = cd.epoch_cyc;
>> -                epoch_ns = cd.epoch_ns;
>> +                seq = raw_read_seqcount(&cd.seq);
>> +                b = cd.bank + (seq & 1);
>> +                if (b->suspended) {
>> +                        res = b->epoch_ns;
>
> So now we have read_sched_clock as a pointer in the bank. Why do you
> still need b->suspended?
>
> What's wrong with setting b->read_sched_clock to NULL at suspend,
> restoring the proper pointer on resume and using that as a conditional?
>
> It would allow the compiler to generate better code, but that's
> obviously not the goal here. Darn, this is hot path code and not some
> random driver.

The update code probably won't be as easy to read but, as you say, this
is hot path code.

>> +                } else {
>> +                        cyc = b->read_sched_clock();
>> +                        cyc = (cyc - b->epoch_cyc) & b->sched_clock_mask;
>> +                        res = b->epoch_ns + cyc_to_ns(cyc, b->mult, b->shift);
>
> It would allow the following optimization as well:
>
>         res = b->epoch_ns;
>         if (b->read_sched_clock) {
>                 ...
>         }
>
> If you think that compilers are smart enough to figure all that out
> for you, you might get surprised. The clearer your code is, the better
> the chance that the compiler gets it right. We have seen the opposite
> of that as well, but that's clearly a compiler bug.

Good idea and, in this case, there is a function pointer with unknown
side effects, so a compiler would never be able to make that
optimization.

>> +/*
>> + * Start updating the banked clock data.
>> + *
>> + * sched_clock will never observe mis-matched data even if called from
>> + * an NMI. We do this by maintaining an odd/even copy of the data and
>> + * steering sched_clock to one or the other using a sequence counter.
>> + * In order to preserve the data cache profile of sched_clock as much
>> + * as possible the system reverts back to the even copy when the update
>> + * completes; the odd copy is used *only* during an update.
>> + *
>> + * The caller is responsible for avoiding simultaneous updates.
>> + */
>> +static struct clock_data_banked *update_bank_begin(void)
>> +{
>> +        /* update the backup (odd) bank and steer readers towards it */
>> +        memcpy(cd.bank + 1, cd.bank, sizeof(struct clock_data_banked));
>> +        raw_write_seqcount_latch(&cd.seq);
>> +
>> +        return cd.bank;
>> +}
>> +
>> +/*
>> + * Finalize update of banked clock data.
>> + *
>> + * This is just a trivial switch back to the primary (even) copy.
>> + */
>> +static void update_bank_end(void)
>> +{
>> +        raw_write_seqcount_latch(&cd.seq);
>> }
>
> What's wrong with having a master struct
>
>         struct master_data {
>                 struct clock_data_banked master_data;
>                 ktime_t wrap_kt;
>                 unsigned long rate;
>                 u64 (*real_read_sched_clock)(void);
>         };
>
> Then you only have to care about the serialization of the master_data
> update, and then the hotpath data update would be the same simple
> function as update_fast_timekeeper(). And it would have the same
> ordering scheme, and aside from that the resulting code would be
> simpler, more intuitive to read and I'm pretty sure faster.

Sorry, I don't quite understand this. Is the intent to have a single
function to update the hotpath data, used by both update_sched_clock()
and sched_clock_register(), to replace the pairing of
update_bank_begin/end()?

If so, I started out doing that but eventually concluded that
update_sched_clock() didn't really benefit from having to make a third
copy of the values it consumes rather than updates. However, if that's
an unconvincing reason, I'm happy to switch to having an update
structure.

Daniel.
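For reference, folding the declaration-order and NULL-pointer suggestions into the reader might look roughly like the sketch below. This is illustrative only and not the posted patch; it assumes the banked cd, cyc_to_ns() and struct clock_data_banked from the diff, with read_sched_clock set to NULL while suspended instead of a separate flag.

unsigned long long notrace sched_clock(void)
{
        struct clock_data_banked *b;
        unsigned long seq;
        u64 res, cyc;

        do {
                seq = raw_read_seqcount(&cd.seq);
                b = cd.bank + (seq & 1);

                res = b->epoch_ns;
                if (b->read_sched_clock) {      /* NULL while suspended */
                        cyc = (b->read_sched_clock() - b->epoch_cyc) &
                              b->sched_clock_mask;
                        res += cyc_to_ns(cyc, b->mult, b->shift);
                }
        } while (read_seqcount_retry(&cd.seq, seq));

        return res;
}

The suspended member (and its extra branch) disappears from the bank entirely, and the common path reduces to a single test of a pointer that has to be loaded anyway.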
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 01d2d15aa662..a2ea66944bc1 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -18,28 +18,28 @@
 #include <linux/seqlock.h>
 #include <linux/bitops.h>
 
-struct clock_data {
-        ktime_t wrap_kt;
+struct clock_data_banked {
         u64 epoch_ns;
         u64 epoch_cyc;
-        seqcount_t seq;
-        unsigned long rate;
+        u64 (*read_sched_clock)(void);
+        u64 sched_clock_mask;
         u32 mult;
         u32 shift;
         bool suspended;
 };
 
+struct clock_data {
+        ktime_t wrap_kt;
+        seqcount_t seq;
+        unsigned long rate;
+        struct clock_data_banked bank[2];
+};
+
 static struct hrtimer sched_clock_timer;
 static int irqtime = -1;
 
 core_param(irqtime, irqtime, int, 0400);
 
-static struct clock_data cd = {
-        .mult = NSEC_PER_SEC / HZ,
-};
-
-static u64 __read_mostly sched_clock_mask;
-
 static u64 notrace jiffy_sched_clock_read(void)
 {
         /*
@@ -49,7 +49,14 @@ static u64 notrace jiffy_sched_clock_read(void)
         return (u64)(jiffies - INITIAL_JIFFIES);
 }
 
-static u64 __read_mostly (*read_sched_clock)(void) = jiffy_sched_clock_read;
+static struct clock_data cd = {
+        .bank = {
+                [0] = {
+                        .mult = NSEC_PER_SEC / HZ,
+                        .read_sched_clock = jiffy_sched_clock_read,
+                },
+        },
+};
 
 static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
 {
@@ -58,50 +65,82 @@ static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
 
 unsigned long long notrace sched_clock(void)
 {
-        u64 epoch_ns;
-        u64 epoch_cyc;
         u64 cyc;
         unsigned long seq;
-
-        if (cd.suspended)
-                return cd.epoch_ns;
+        struct clock_data_banked *b;
+        u64 res;
 
         do {
-                seq = raw_read_seqcount_begin(&cd.seq);
-                epoch_cyc = cd.epoch_cyc;
-                epoch_ns = cd.epoch_ns;
+                seq = raw_read_seqcount(&cd.seq);
+                b = cd.bank + (seq & 1);
+                if (b->suspended) {
+                        res = b->epoch_ns;
+                } else {
+                        cyc = b->read_sched_clock();
+                        cyc = (cyc - b->epoch_cyc) & b->sched_clock_mask;
+                        res = b->epoch_ns + cyc_to_ns(cyc, b->mult, b->shift);
+                }
         } while (read_seqcount_retry(&cd.seq, seq));
 
-        cyc = read_sched_clock();
-        cyc = (cyc - epoch_cyc) & sched_clock_mask;
-        return epoch_ns + cyc_to_ns(cyc, cd.mult, cd.shift);
+        return res;
+}
+
+/*
+ * Start updating the banked clock data.
+ *
+ * sched_clock will never observe mis-matched data even if called from
+ * an NMI. We do this by maintaining an odd/even copy of the data and
+ * steering sched_clock to one or the other using a sequence counter.
+ * In order to preserve the data cache profile of sched_clock as much
+ * as possible the system reverts back to the even copy when the update
+ * completes; the odd copy is used *only* during an update.
+ *
+ * The caller is responsible for avoiding simultaneous updates.
+ */
+static struct clock_data_banked *update_bank_begin(void)
+{
+        /* update the backup (odd) bank and steer readers towards it */
+        memcpy(cd.bank + 1, cd.bank, sizeof(struct clock_data_banked));
+        raw_write_seqcount_latch(&cd.seq);
+
+        return cd.bank;
+}
+
+/*
+ * Finalize update of banked clock data.
+ *
+ * This is just a trivial switch back to the primary (even) copy.
+ */
+static void update_bank_end(void)
+{
+        raw_write_seqcount_latch(&cd.seq);
 }
 
 /*
  * Atomically update the sched_clock epoch.
  */
-static void notrace update_sched_clock(void)
+static void notrace update_sched_clock(bool suspended)
 {
-        unsigned long flags;
+        struct clock_data_banked *b;
         u64 cyc;
         u64 ns;
 
-        cyc = read_sched_clock();
-        ns = cd.epoch_ns +
-                cyc_to_ns((cyc - cd.epoch_cyc) & sched_clock_mask,
-                          cd.mult, cd.shift);
-
-        raw_local_irq_save(flags);
-        raw_write_seqcount_begin(&cd.seq);
-        cd.epoch_ns = ns;
-        cd.epoch_cyc = cyc;
-        raw_write_seqcount_end(&cd.seq);
-        raw_local_irq_restore(flags);
+        b = update_bank_begin();
+
+        cyc = b->read_sched_clock();
+        ns = b->epoch_ns + cyc_to_ns((cyc - b->epoch_cyc) & b->sched_clock_mask,
+                                     b->mult, b->shift);
+
+        b->epoch_ns = ns;
+        b->epoch_cyc = cyc;
+        b->suspended = suspended;
+
+        update_bank_end();
 }
 
 static enum hrtimer_restart sched_clock_poll(struct hrtimer *hrt)
 {
-        update_sched_clock();
+        update_sched_clock(false);
         hrtimer_forward_now(hrt, cd.wrap_kt);
         return HRTIMER_RESTART;
 }
@@ -111,9 +150,9 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 {
         u64 res, wrap, new_mask, new_epoch, cyc, ns;
         u32 new_mult, new_shift;
-        ktime_t new_wrap_kt;
         unsigned long r;
         char r_unit;
+        struct clock_data_banked *b;
 
         if (cd.rate > rate)
                 return;
@@ -122,29 +161,30 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
         /* calculate the mult/shift to convert counter ticks to ns. */
         clocks_calc_mult_shift(&new_mult, &new_shift, rate, NSEC_PER_SEC, 3600);
 
+        cd.rate = rate;
         new_mask = CLOCKSOURCE_MASK(bits);
 
         /* calculate how many ns until we wrap */
         wrap = clocks_calc_max_nsecs(new_mult, new_shift, 0, new_mask);
-        new_wrap_kt = ns_to_ktime(wrap - (wrap >> 3));
+        cd.wrap_kt = ns_to_ktime(wrap - (wrap >> 3));
+
+        b = update_bank_begin();
 
         /* update epoch for new counter and update epoch_ns from old counter*/
         new_epoch = read();
-        cyc = read_sched_clock();
-        ns = cd.epoch_ns + cyc_to_ns((cyc - cd.epoch_cyc) & sched_clock_mask,
-                          cd.mult, cd.shift);
+        cyc = b->read_sched_clock();
+        ns = b->epoch_ns + cyc_to_ns((cyc - b->epoch_cyc) & b->sched_clock_mask,
+                                     b->mult, b->shift);
 
-        raw_write_seqcount_begin(&cd.seq);
-        read_sched_clock = read;
-        sched_clock_mask = new_mask;
-        cd.rate = rate;
-        cd.wrap_kt = new_wrap_kt;
-        cd.mult = new_mult;
-        cd.shift = new_shift;
-        cd.epoch_cyc = new_epoch;
-        cd.epoch_ns = ns;
-        raw_write_seqcount_end(&cd.seq);
+        b->read_sched_clock = read;
+        b->sched_clock_mask = new_mask;
+        b->mult = new_mult;
+        b->shift = new_shift;
+        b->epoch_cyc = new_epoch;
+        b->epoch_ns = ns;
+
+        update_bank_end();
 
         r = rate;
         if (r >= 4000000) {
@@ -175,10 +215,10 @@ void __init sched_clock_postinit(void)
          * If no sched_clock function has been provided at that point,
          * make it the final one one.
          */
-        if (read_sched_clock == jiffy_sched_clock_read)
+        if (cd.bank[0].read_sched_clock == jiffy_sched_clock_read)
                 sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);
 
-        update_sched_clock();
+        update_sched_clock(false);
 
         /*
          * Start the timer to keep sched_clock() properly updated and
@@ -191,17 +231,20 @@ void __init sched_clock_postinit(void)
 
 static int sched_clock_suspend(void)
 {
-        update_sched_clock();
+        update_sched_clock(true);
         hrtimer_cancel(&sched_clock_timer);
-        cd.suspended = true;
         return 0;
 }
 
 static void sched_clock_resume(void)
 {
-        cd.epoch_cyc = read_sched_clock();
+        struct clock_data_banked *b;
+
+        b = update_bank_begin();
+        b->epoch_cyc = b->read_sched_clock();
         hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
-        cd.suspended = false;
+        b->suspended = false;
+        update_bank_end();
 }
 
 static struct syscore_ops sched_clock_ops = {
Currently it is possible for an NMI (or FIQ on ARM) to come in and read
sched_clock() whilst update_sched_clock() has locked the seqcount for
writing. This results in the NMI handler locking up when it calls
raw_read_seqcount_begin().

This patch fixes that problem by providing banked clock data in a
similar manner to Thomas Gleixner's 4396e058c52e ("timekeeping: Provide
fast and NMI safe access to CLOCK_MONOTONIC").

Changing the mode of operation of the seqcount away from the
traditional LSB-means-interrupted-write to a banked approach also
revealed a good deal of "fake" write locking within
sched_clock_register(). This is likely to be a latent issue because
sched_clock_register() is typically called before we enable interrupts.
Nevertheless, the issue has been eliminated by increasing the scope of
the read locking performed by sched_clock().

Suggested-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: John Stultz <john.stultz@linaro.org>
---
 kernel/time/sched_clock.c | 157 +++++++++++++++++++++++++++++-----------------
 1 file changed, 100 insertions(+), 57 deletions(-)
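The lock-up described in the first paragraph comes from the ordinary seqcount read side used before this patch. Condensed from the '-' lines of the diff (the name old_sched_clock() and the comment are added here purely for illustration), the pre-patch reader looked like this:

static u64 old_sched_clock(void)
{
        unsigned long seq;
        u64 epoch_ns, epoch_cyc, cyc;

        do {
                /*
                 * raw_read_seqcount_begin() spins until the sequence count
                 * is even.  An NMI/FIQ that interrupts update_sched_clock()
                 * between raw_write_seqcount_begin() and
                 * raw_write_seqcount_end() on the same CPU therefore never
                 * leaves this loop: the interrupted writer cannot run to
                 * make the count even again.
                 */
                seq = raw_read_seqcount_begin(&cd.seq);
                epoch_cyc = cd.epoch_cyc;
                epoch_ns = cd.epoch_ns;
        } while (read_seqcount_retry(&cd.seq, seq));

        cyc = (read_sched_clock() - epoch_cyc) & sched_clock_mask;
        return epoch_ns + cyc_to_ns(cyc, cd.mult, cd.shift);
}

The banked reader in the patch never waits: an odd sequence count merely selects the other copy of the data, so a reader in NMI context always makes forward progress.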