Message ID | 35FD53F367049845BC99AC72306C23D1044A02027E0A@CNBJMBX05.corpusers.net (mailing list archive) |
---|---|
State | New, archived |
On 2/1/2015 7:55 PM, Wang, Yalin wrote:
> This patch change non-atomic bitops,
> add a if() condition to test it, before set/clear the bit.
> so that we don't need dirty the cache line, if this bit
> have been set or clear. On SMP system, dirty cache line will
> need invalidate other processors cache line, this will have
> some impact on SMP systems.
>

Any actual numbers to give an idea of the impact?

Thanks,
Laura
On Mon, Feb 02, 2015 at 11:55:03AM +0800, Wang, Yalin wrote:
> This patch change non-atomic bitops,
> add a if() condition to test it, before set/clear the bit.
> so that we don't need dirty the cache line, if this bit
> have been set or clear. On SMP system, dirty cache line will
> need invalidate other processors cache line, this will have
> some impact on SMP systems.
>
> Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> ---
>  include/asm-generic/bitops/non-atomic.h | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/include/asm-generic/bitops/non-atomic.h b/include/asm-generic/bitops/non-atomic.h
> index 697cc2b..e4ef18a 100644
> --- a/include/asm-generic/bitops/non-atomic.h
> +++ b/include/asm-generic/bitops/non-atomic.h
> @@ -17,7 +17,9 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
>  	unsigned long mask = BIT_MASK(nr);
>  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
>
> -	*p  |= mask;
> +	if ((*p & mask) == 0)
> +		*p  |= mask;

Care to fix the double space here while touching the code?

I think the more natural check here is:

	if ((~*p & mask) != 0)
		*p |= mask;

Might be a matter of taste, but this check is equivalent to

	*p != (*p | mask)

which is what you really want to test for. (Your check only has this
property for values of mask that have a single bit set, which is ok here
of course.)

> +
>  }
>
>  static inline void __clear_bit(int nr, volatile unsigned long *addr)
> @@ -25,7 +27,8 @@ static inline void __clear_bit(int nr, volatile unsigned long *addr)
>  	unsigned long mask = BIT_MASK(nr);
>  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
>
> -	*p &= ~mask;
> +	if ((*p & mask) != 0)
> +		*p &= ~mask;

This is already fine.

>  }
>
>  /**
> @@ -60,7 +63,8 @@ static inline int __test_and_set_bit(int nr, volatile unsigned long *addr)
>  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
>  	unsigned long old = *p;
>
> -	*p = old | mask;
> +	if ((old & mask) == 0)
> +		*p = old | mask;

Here it would be:

	if ((~old & mask) != 0)

>  	return (old & mask) != 0;
>  }

Best regards
Uwe
On Mon, 2 Feb 2015 11:55:03 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:

> This patch change non-atomic bitops,
> add a if() condition to test it, before set/clear the bit.
> so that we don't need dirty the cache line, if this bit
> have been set or clear. On SMP system, dirty cache line will
> need invalidate other processors cache line, this will have
> some impact on SMP systems.
>
> --- a/include/asm-generic/bitops/non-atomic.h
> +++ b/include/asm-generic/bitops/non-atomic.h
> @@ -17,7 +17,9 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
>  	unsigned long mask = BIT_MASK(nr);
>  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
>
> -	*p  |= mask;
> +	if ((*p & mask) == 0)
> +		*p  |= mask;
> +
>  }

hm, maybe.

It will speed up set_bit on an already-set bit. But it will slow down
set_bit on a not-set bit. And the latter case is presumably much, much
more common.

How do we know the patch is a net performance gain?
On Mon, Feb 02, 2015 at 03:29:09PM -0800, Andrew Morton wrote:
> On Mon, 2 Feb 2015 11:55:03 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:
>
> > This patch change non-atomic bitops,
> > add a if() condition to test it, before set/clear the bit.
> > so that we don't need dirty the cache line, if this bit
> > have been set or clear. On SMP system, dirty cache line will
> > need invalidate other processors cache line, this will have
> > some impact on SMP systems.
> >
> > --- a/include/asm-generic/bitops/non-atomic.h
> > +++ b/include/asm-generic/bitops/non-atomic.h
> > @@ -17,7 +17,9 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
> >  	unsigned long mask = BIT_MASK(nr);
> >  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
> >
> > -	*p  |= mask;
> > +	if ((*p & mask) == 0)
> > +		*p  |= mask;
> > +
> >  }
>
> hm, maybe.
>
> It will speed up set_bit on an already-set bit. But it will slow down
> set_bit on a not-set bit. And the latter case is presumably much, much
> more common.
>
> How do we know the patch is a net performance gain?

Yes, we do need to know the performance impact of changes like this - as
Laura said in her reply already...
On Mon, Feb 02, 2015 at 03:29:09PM -0800, Andrew Morton wrote:
> On Mon, 2 Feb 2015 11:55:03 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:
>
> > This patch change non-atomic bitops,
> > add a if() condition to test it, before set/clear the bit.
> > so that we don't need dirty the cache line, if this bit
> > have been set or clear. On SMP system, dirty cache line will
> > need invalidate other processors cache line, this will have
> > some impact on SMP systems.
> >
> > --- a/include/asm-generic/bitops/non-atomic.h
> > +++ b/include/asm-generic/bitops/non-atomic.h
> > @@ -17,7 +17,9 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
> >  	unsigned long mask = BIT_MASK(nr);
> >  	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
> >
> > -	*p  |= mask;
> > +	if ((*p & mask) == 0)
> > +		*p  |= mask;
> > +
> >  }
>
> hm, maybe.
>
> It will speed up set_bit on an already-set bit. But it will slow down
> set_bit on a not-set bit. And the latter case is presumably much, much
> more common.
>
> How do we know the patch is a net performance gain?

Let's try to measure. The micro benchmark:

#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#ifdef CACHE_HOT
#define SIZE (2UL << 20)
#define TIMES 10000000
#else
#define SIZE (1UL << 30)
#define TIMES 10000
#endif

int main(int argc, char **argv)
{
	struct timespec a, b, diff;
	unsigned long i, *p, times = TIMES;

	p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);

	clock_gettime(CLOCK_MONOTONIC, &a);
	while (times--) {
		for (i = 0; i < SIZE/64/sizeof(*p); i++) {
#ifdef CHECK_BEFORE_SET
			if (p[i] != times)
#endif
				p[i] = times;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &b);

	diff.tv_sec = b.tv_sec - a.tv_sec;
	if (a.tv_nsec > b.tv_nsec) {
		diff.tv_sec--;
		diff.tv_nsec = 1000000000 + b.tv_nsec - a.tv_nsec;
	} else
		diff.tv_nsec = b.tv_nsec - a.tv_nsec;

	printf("%lu.%09lu\n", diff.tv_sec, diff.tv_nsec);
	return 0;
}

Results for 10 runs on my laptop -- i5-3427U (IvyBridge 1.8 Ghz, 2.8Ghz Turbo
with 3MB LLC):

				Avg		Stddev
baseline			21.5351		0.5315
-DCHECK_BEFORE_SET		21.9834		0.0789
-DCACHE_HOT			14.9987		0.0365
-DCACHE_HOT -DCHECK_BEFORE_SET	29.9010		0.0204

Difference between -DCACHE_HOT and -DCACHE_HOT -DCHECK_BEFORE_SET appears
huge, but if you recalculate it to CPU cycles per inner loop @ 2.8 Ghz,
it's 1.02530 and 2.04401 CPU cycles respectively.

Basically, the check is free on decent CPU.
> -----Original Message-----
> From: Kirill A. Shutemov [mailto:kirill@shutemov.name]
> Sent: Tuesday, February 03, 2015 9:18 AM
> To: Andrew Morton
> Cc: Wang, Yalin; 'arnd@arndb.de'; 'linux-arch@vger.kernel.org';
> 'linux-kernel@vger.kernel.org'; 'linux@arm.linux.org.uk';
> 'linux-arm-kernel@lists.infradead.org'
> Subject: Re: [RFC] change non-atomic bitops method
>
> [...]
>
> Difference between -DCACHE_HOT and -DCACHE_HOT -DCHECK_BEFORE_SET appears
> huge, but if you recalculate it to CPU cycles per inner loop @ 2.8 Ghz,
> it's 1.02530 and 2.04401 CPU cycles respectively.
>
> Basically, the check is free on decent CPU.
>

Awesome test, but it only measures the one CPU that runs this code. It
does not consider the other CPUs, whose cache lines are invalidated when
the writer CPU dirties the line. So another test should run two threads
bound to two different CPUs, one writing and one reading, to see the
impact on the reader CPU.
> -----Original Message-----
> From: Wang, Yalin
> Sent: Tuesday, February 03, 2015 10:13 AM
> To: 'Kirill A. Shutemov'; Andrew Morton
> Cc: 'arnd@arndb.de'; 'linux-arch@vger.kernel.org';
> 'linux-kernel@vger.kernel.org'; 'linux@arm.linux.org.uk';
> 'linux-arm-kernel@lists.infradead.org'
> Subject: RE: [RFC] change non-atomic bitops method
>
> [...]
>
> Awesome test, but you only test the one cpu which running this code,
> Have not consider the other CPUs, whose cache line will be invalidate if
> The cache is dirtied by writer CPU,
> So another test should be running 2 thread on two different CPUs(bind to
> CPU),
> One write , one read, to see the impact on the reader CPU.

I made a small change to your test program: I added a new thread to test
the SMP cache impact.
---
#define _GNU_SOURCE	/* must precede every #include to expose CPU_SET() etc. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <errno.h>
#include <sched.h>
#include <pthread.h>
#include <sys/mman.h>

#ifdef CACHE_HOT
#define SIZE (2UL << 20)
#define TIMES 100000
#else
#define SIZE (1UL << 20)
#define TIMES 10000
#endif

static void *reader_thread(void *arg)
{
	struct timespec a, b, diff;
	unsigned long *p = arg;
	volatile unsigned long temp;
	unsigned long i, times = TIMES;
	int ret;
	cpu_set_t set;

	/* bind the reader to CPU1; pid 0 means the calling thread */
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	ret = sched_setaffinity(0, sizeof(cpu_set_t), &set);
	if (ret < 0)
		printf("sched_setaffinity error:%s\n", strerror(errno));

	clock_gettime(CLOCK_MONOTONIC, &a);
	while (times--) {
		for (i = 0; i < SIZE/sizeof(*p); i++)
			temp = p[i];
	}
	clock_gettime(CLOCK_MONOTONIC, &b);

	diff.tv_sec = b.tv_sec - a.tv_sec;
	if (a.tv_nsec > b.tv_nsec) {
		diff.tv_sec--;
		diff.tv_nsec = 1000000000 + b.tv_nsec - a.tv_nsec;
	} else
		diff.tv_nsec = b.tv_nsec - a.tv_nsec;

	printf("reader:%lu.%09lu\n", diff.tv_sec, diff.tv_nsec);
	return NULL;
}

int main(int argc, char **argv)
{
	struct timespec a, b, diff;
	unsigned long i, *p, times = TIMES;
	int ret;
	pthread_t thread;
	cpu_set_t set;

	/* bind the writer (main thread) to CPU0 */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	ret = sched_setaffinity(0, sizeof(cpu_set_t), &set);
	if (ret < 0)
		printf("sched_setaffinity error:%s\n", strerror(errno));

	p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			MAP_LOCKED | MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED) {
		printf("mmap error:%s\n", strerror(errno));
		return 1;
	}

	pthread_create(&thread, NULL, reader_thread, p);

	clock_gettime(CLOCK_MONOTONIC, &a);
	while (times--) {
		for (i = 0; i < SIZE/sizeof(*p); i++) {
#ifdef CHECK_BEFORE_SET
			if (p[i] != times)
#endif
				p[i] = times;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &b);

	diff.tv_sec = b.tv_sec - a.tv_sec;
	if (a.tv_nsec > b.tv_nsec) {
		diff.tv_sec--;
		diff.tv_nsec = 1000000000 + b.tv_nsec - a.tv_nsec;
	} else
		diff.tv_nsec = b.tv_nsec - a.tv_nsec;

	printf("%lu.%09lu\n", diff.tv_sec, diff.tv_nsec);

	/* wait for the reader so that its result is always printed */
	pthread_join(thread, NULL);
	return 0;
}
----

The writer runs on CPU0 and the reader thread runs on CPU1.
Test result:

sudo ./cache_test
reader:8.426228173
8.672198335
With -DCHECK_BEFORE_SET:

sudo ./cache_test_check
reader:7.537036819
10.799746531

You can see that the reader saves some time if the cache line is not
dirtied. We can also see that the writer gets slower, because it needs to
read the data before changing it. I think that on a system with many
cores, the improvement in reader performance is more useful.

My CPU info:

28851195@cnbjlx20570:~/test$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 37
model name	: Intel(R) Core(TM) i5 CPU 660 @ 3.33GHz
stepping	: 5
microcode	: 0x2
cpu MHz		: 1199.000
cache size	: 4096 KB
physical id	: 0
siblings	: 4

Thank you very much for your test program!
On Tue, 3 Feb 2015 13:42:45 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:

>
> ...
>
> #ifdef CHECK_BEFORE_SET
> 			if (p[i] != times)
> #endif
>
> ...
>
> ----
> One run on CPU0, reader thread run on CPU1,
> Test result:
> sudo ./cache_test
> reader:8.426228173
> 8.672198335
>
> With -DCHECK_BEFORE_SET
> sudo ./cache_test_check
> reader:7.537036819
> 10.799746531
>

You aren't measuring the right thing. You should compare

	if (p[i] != x)
		p[i] = x;

versus

	p[i] = x;

and you should do this for two cases:

a) p[i] == x

b) p[i] != x

The first code sequence will be slower when (p[i] != x) and faster when
(p[i] == x).

Next, we should instrument the kernel to work out the frequency of
set_bit on an already-set bit.

It is only with both these ratios that we can work out whether the
patch is a net gain. My suspicion is that set_bit on an already-set
bit is so rare that the patch will be a loss.
> -----Original Message-----
> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> Sent: Tuesday, February 03, 2015 2:39 PM
> To: Wang, Yalin
> Cc: 'Kirill A. Shutemov'; 'arnd@arndb.de'; 'linux-arch@vger.kernel.org';
> 'linux-kernel@vger.kernel.org'; 'linux@arm.linux.org.uk';
> 'linux-arm-kernel@lists.infradead.org'
> Subject: Re: [RFC] change non-atomic bitops method
>
> [...]
>
> You aren't measuring the right thing. You should compare
>
> 	if (p[i] != x)
> 		p[i] = x;
>
> versus
>
> 	p[i] = x;
>
> and you should do this for two cases:
>
> a) p[i] == x
>
> b) p[i] != x
>
> [...]

I see, let's change the test a little:

1)
	memset(p, 0, SIZE);
	if (p[i] != 0)
		p[i] = 0;	// never called

#sudo ./cache_test_check
6.698153838
reader:7.529402625

2)
	memset(p, 0, SIZE);
	if (p[i] == 0)
		p[i] = 0;	// always called

#sudo ./cache_test_check
reader:7.895421311
9.000889973

Thanks
From: Andrew Morton <akpm@linux-foundation.org>
Date: Mon, 2 Feb 2015 22:38:51 -0800

> It is only with both these ratios that we can work out whether the
> patch is a net gain. My suspicion is that set_bit on an already-set
> bit is so rare that the patch will be a loss.

A common pattern is implementing a "referenced" bit, and in that case
the bit is often already set, and in such a scenario the proposed
change is a huge win.
On Tue, 03 Feb 2015 00:40:31 -0800 (PST) David Miller <davem@davemloft.net> wrote:

> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Mon, 2 Feb 2015 22:38:51 -0800
>
> > It is only with both these ratios that we can work out whether the
> > patch is a net gain. My suspicion is that set_bit on an already-set
> > bit is so rare that the patch will be a loss.
>
> A common pattern is implementing a "referenced" bit, and in that case
> the bit is often already set, and in such a scenario the proposed
> change is a huge win.

pagecache, dcache and icache already perform this optimisation (and
only pagecache uses bitops for it anyway). I'm not sure what's left.

But there's really no point in speculating about this - it's trivial to
instrument the kernel and get real numbers.
On Tue, Feb 03 2015, Andrew Morton <akpm@linux-foundation.org> wrote:

>
> You aren't measuring the right thing. You should compare
>
> 	if (p[i] != x)
> 		p[i] = x;
>
> versus
>
> 	p[i] = x;
>
> and you should do this for two cases:
>
> a) p[i] == x
>
> b) p[i] != x
>
>
> The first code sequence will be slower when (p[i] != x) and faster when
> (p[i] == x).
>
>
> Next, we should instrument the kernel to work out the frequency of
> set_bit on an already-set bit.
>
> It is only with both these ratios that we can work out whether the
> patch is a net gain. My suspicion is that set_bit on an already-set
> bit is so rare that the patch will be a loss.

There's also the code-bloat issue to consider (instruction cache and all
that); the conditional versions will usually require three extra
instructions and an extra register. Also, the cache line might already
be dirty because of something in the surrounding code. Instruction cache
misses and larger stack footprint (from larger register pressure) won't
show up in a microbenchmark, so I think this needs a real-world example
to justify.

But even if one finds some hot spot that would benefit from the
conditional, that should simply be added explicitly there, instead of
pessimizing every other user. (A good example of that is 358eec18243a
("vfs: decrapify dput(), fix cache behavior under normal load")).

Rasmus
> -----Original Message-----
> From: Rasmus Villemoes [mailto:linux@rasmusvillemoes.dk]
> Sent: Tuesday, February 03, 2015 5:34 PM
> To: Andrew Morton
> Cc: Wang, Yalin; 'Kirill A. Shutemov'; 'arnd@arndb.de';
> 'linux-arch@vger.kernel.org'; 'linux-kernel@vger.kernel.org';
> 'linux@arm.linux.org.uk'; 'linux-arm-kernel@lists.infradead.org'
> Subject: Re: [RFC] change non-atomic bitops method
>
> [...]
>
> But even if one finds some hot spot that would benefit from the
> conditional, that should simply be added explicitly there, instead of
> pessimizing every other user. (A good example of that is 358eec18243a
> ("vfs: decrapify dput(), fix cache behavior under normal load")).

Oh, thank you, that really is a very nice example.
On Tue, Feb 03, 2015 at 03:17:30AM +0200, Kirill A. Shutemov wrote:
> Results for 10 runs on my laptop -- i5-3427U (IvyBridge 1.8 Ghz, 2.8Ghz Turbo
> with 3MB LLC):

I've screwed up the inner loop condition and step. As result the benchmark
touches the same cache line 8 times and scan SIZE/8 of memory. Fixed test
is in attach.

				Avg		Stddev
baseline			14.0663		0.0182
-DCHECK_BEFORE_SET		13.8594		0.0458
-DCACHE_HOT			12.3896		0.0867
-DCACHE_HOT -DCHECK_BEFORE_SET	11.7480		0.2497

And now it's faster *with* the check. Sometimes CPU is just too clever. ;)
Uwe Kleine-König wrote:

> Might be a matter of taste, but this check is equivalent to
>
> 	*p != (*p | mask)
>
> which is what you really want to test for.

I would argue that this is less clear as to what's going on.

David
Hello,

[added some more context again]

On Tue, Feb 03, 2015 at 03:14:43PM +0000, David Howells wrote:
> > > -	*p  |= mask;
> > > +	if ((*p & mask) == 0)
> > > +		*p  |= mask;
> > Care to fix the double space here while touching the code?
> >
> > I think the more natural check here is:
> >
> > 	if ((~*p & mask) != 0)
> > 		*p |= mask;
> >
> > Might be a matter of taste, but this check is equivalent to
> >
> > 	*p != (*p | mask)
> >
> > which is what you really want to test for.
> I would argue that this is less clear as to what's going on.

OK, I admit that this equivalence is not obvious. Then maybe let the
compiler find the equivalence and do:

-	*p  |= mask;
+	if (*p != (*p | mask))
+		*p |= mask;

?

Best regards
Uwe
diff --git a/include/asm-generic/bitops/non-atomic.h b/include/asm-generic/bitops/non-atomic.h
index 697cc2b..e4ef18a 100644
--- a/include/asm-generic/bitops/non-atomic.h
+++ b/include/asm-generic/bitops/non-atomic.h
@@ -17,7 +17,9 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
 	unsigned long mask = BIT_MASK(nr);
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 
-	*p  |= mask;
+	if ((*p & mask) == 0)
+		*p  |= mask;
+
 }
 
 static inline void __clear_bit(int nr, volatile unsigned long *addr)
@@ -25,7 +27,8 @@ static inline void __clear_bit(int nr, volatile unsigned long *addr)
 	unsigned long mask = BIT_MASK(nr);
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 
-	*p &= ~mask;
+	if ((*p & mask) != 0)
+		*p &= ~mask;
 }
 
 /**
@@ -60,7 +63,8 @@ static inline int __test_and_set_bit(int nr, volatile unsigned long *addr)
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 	unsigned long old = *p;
 
-	*p = old | mask;
+	if ((old & mask) == 0)
+		*p = old | mask;
 	return (old & mask) != 0;
 }
 
@@ -79,7 +83,8 @@ static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 	unsigned long old = *p;
 
-	*p = old & ~mask;
+	if ((old & mask) != 0)
+		*p = old & ~mask;
 	return (old & mask) != 0;
 }
This patch changes the non-atomic bitops to add an if() test before
setting or clearing the bit, so that we don't dirty the cache line if
the bit is already set or already clear. On SMP systems, dirtying a cache
line requires invalidating the copies held in other processors' caches,
which has some performance impact.

Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
---
 include/asm-generic/bitops/non-atomic.h | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)