Message ID | 20231011150252.32737-1-jack@suse.cz (mailing list archive) |
---|---|
State | New, archived |
Series | lib/find: Fix KCSAN warnings in find_*_bit() functions |
Long story short: KCSAN found some potential issues related to how people use the bitmap API. And instead of working through those issues, the following code shuts down KCSAN by applying READ_ONCE() here and there.

On Wed, Oct 11, 2023 at 05:02:31PM +0200, Jan Kara wrote:
> From: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
>
> Some users of lib/find functions can operate on bitmaps that change
> while we are looking at the bitmap. For example xarray code looks at tag
> bitmaps only with RCU protection. The xarray users are prepared to
> handle a situation were tag modification races with reading of a tag
> bitmap (and thus we get back a stale information) but the find_*bit()
> functions should return based on bitmap contents at *some* point in time.
> As UBSAN properly warns, the code like:
>
> 	val = *addr;
> 	if (val)
> 		__ffs(val)
>
> prone to refetching 'val' contents from addr by the compiler and thus
> passing 0 to __ffs() which has undefined results.
>
> Fix the problem by using READ_ONCE() in all the appropriate places so
> that the compiler cannot accidentally refetch the contents of addr
> between the test and __ffs(). This makes the undefined behavior
> impossible. The generated code is somewhat larger due to gcc tracking
> both the index and target fetch address in separate registers (according
> to GCC folks the volatile cast confuses their optimizer a bit, they are
> looking into a fix). The difference in performance is minimal though.
> Targetted microbenchmark (in userspace) of the bit searching loop shows
> about 2% regression on some x86 microarchitectures so for normal kernel
> workloads this should be in the noise and not worth introducing special
> set of bitmap searching functions.
>
> [JK: Wrote commit message]

READ_ONCE() fixes nothing because nothing is broken in the find_bit() API. As I suspected, and as Matthew confirmed in his email, the true reason for READ_ONCE() here is to disable the KCSAN check:

    READ_ONCE() serves two functions here; one is that it tells the compiler
    not to try anything fancy, and the other is that it tells KCSAN to not
    bother instrumenting this load; no load-delay-reload.

    https://lkml.kernel.org/linux-mm/ZQkhgVb8nWGxpSPk@casper.infradead.org/

And as a side effect, it of course hurts performance. In the same email Matthew said he doesn't believe me that READ_ONCE would do that, so thank you for testing and confirming that it does.

Matthew, in my experience compilers do really well at these low-level things, and gambling with compilers usually makes things worse. x86_64 is one of the most strongly ordered architectures that I've been working with, and even here READ_ONCE has a visible effect on performance. I expect that it will get even worse on more weakly ordered platforms.

I'm OK that you don't believe me; probably you can believe Paul McKenney and his paper on the kernel memory model, which says in the very first paragraph:

    Applying ACCESS_ONCE() to a large array or structure is unlikely
    to do anything useful, and use of READ_ONCE() and WRITE_ONCE()
    in this situation can result in load-tearing and store-tearing.

    https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4444.html

Jan, I think that in your last email you confirmed that the xarray problem that you're trying to solve is about a lack of proper locking:

    Well, for xarray the write side is synchronized with a spinlock but the read
    side is not (only RCU protected).

    https://lkml.kernel.org/linux-mm/20230918155403.ylhfdbscgw6yek6p@quack3/

If there's not enough synchronization, why not just add it?
Regardless of performance considerations, my main concern is that this patch treats the bitmap as an atomic structure, which it intentionally is not. There are just a few single-bit atomic operations like set_bit() and clear_bit(). All other functions are non-atomic, including those find_bit() operations.

There are quite a lot of examples of wrong use of bitmaps wrt atomicity, the most typical is like:

	for (idx = 0; idx < num; idx++) {
		...
		set_bit(idx, bitmap);
	}

This is wrong because a series of atomic ops is not atomic itself, and if you see something like this in your code, it should be converted to using the non-atomic __set_bit(), and protected externally if needed.

Similarly, READ_ONCE() in a for-loop doesn't guarantee any ordering or atomicity, and only hurts the performance. And this is exactly what this patch does.

Thanks,
Yury

> CC: Yury Norov <yury.norov@gmail.com> > Link: https://lore.kernel.org/all/20230918044739.29782-1-mirsad.todorovac@alu.unizg.hr/ > Signed-off-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> > Signed-off-by: Jan Kara <jack@suse.cz> > --- > include/linux/find.h | 40 ++++++++++++++++++++++++---------------- > lib/find_bit.c | 39 +++++++++++++++++++++++---------------- > 2 files changed, 47 insertions(+), 32 deletions(-) > > diff --git a/include/linux/find.h b/include/linux/find.h > index 5e4f39ef2e72..5ef6466dc7cc 100644 > --- a/include/linux/find.h > +++ b/include/linux/find.h > @@ -60,7 +60,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size, > if (unlikely(offset >= size)) > return size; > > - val = *addr & GENMASK(size - 1, offset); > + val = READ_ONCE(*addr) & GENMASK(size - 1, offset); > return val ? __ffs(val) : size; > } > > @@ -90,7 +90,8 @@ unsigned long find_next_and_bit(const unsigned long *addr1, > if (unlikely(offset >= size)) > return size; > > - val = *addr1 & *addr2 & GENMASK(size - 1, offset); > + val = READ_ONCE(*addr1) & READ_ONCE(*addr2) & > + GENMASK(size - 1, offset); > return val ? __ffs(val) : size; > } > > @@ -121,7 +122,8 @@ unsigned long find_next_andnot_bit(const unsigned long *addr1, > if (unlikely(offset >= size)) > return size; > > - val = *addr1 & ~*addr2 & GENMASK(size - 1, offset); > + val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) & > + GENMASK(size - 1, offset); > return val ? __ffs(val) : size; > } > > @@ -151,7 +153,8 @@ unsigned long find_next_or_bit(const unsigned long *addr1, > if (unlikely(offset >= size)) > return size; > > - val = (*addr1 | *addr2) & GENMASK(size - 1, offset); > + val = (READ_ONCE(*addr1) | READ_ONCE(*addr2)) & > + GENMASK(size - 1, offset); > return val ? __ffs(val) : size; > } > > @@ -179,7 +182,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size, > if (unlikely(offset >= size)) > return size; > > - val = *addr | ~GENMASK(size - 1, offset); > + val = READ_ONCE(*addr) | ~GENMASK(size - 1, offset); > return val == ~0UL ? size : ffz(val); > } > > @@ -200,7 +203,7 @@ static inline > unsigned long find_first_bit(const unsigned long *addr, unsigned long size) > { > if (small_const_nbits(size)) { > - unsigned long val = *addr & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0); > > return val ? __ffs(val) : size; > } > @@ -229,7 +232,7 @@ unsigned long find_nth_bit(const unsigned long *addr, unsigned long size, unsign > return size; > > if (small_const_nbits(size)) { > - unsigned long val = *addr & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0); > > return val ?
fns(val, n) : size; > } > @@ -255,7 +258,8 @@ unsigned long find_nth_and_bit(const unsigned long *addr1, const unsigned long * > return size; > > if (small_const_nbits(size)) { > - unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) > + & GENMASK(size - 1, 0); > > return val ? fns(val, n) : size; > } > @@ -282,7 +286,8 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon > return size; > > if (small_const_nbits(size)) { > - unsigned long val = *addr1 & (~*addr2) & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) & > + GENMASK(size - 1, 0); > > return val ? fns(val, n) : size; > } > @@ -312,7 +317,8 @@ unsigned long find_nth_and_andnot_bit(const unsigned long *addr1, > return size; > > if (small_const_nbits(size)) { > - unsigned long val = *addr1 & *addr2 & (~*addr3) & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) & > + ~READ_ONCE(*addr3) & GENMASK(size - 1, 0); > > return val ? fns(val, n) : size; > } > @@ -336,7 +342,8 @@ unsigned long find_first_and_bit(const unsigned long *addr1, > unsigned long size) > { > if (small_const_nbits(size)) { > - unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) & > + GENMASK(size - 1, 0); > > return val ? __ffs(val) : size; > } > @@ -358,7 +365,7 @@ static inline > unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size) > { > if (small_const_nbits(size)) { > - unsigned long val = *addr | ~GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr) | ~GENMASK(size - 1, 0); > > return val == ~0UL ? size : ffz(val); > } > @@ -379,7 +386,7 @@ static inline > unsigned long find_last_bit(const unsigned long *addr, unsigned long size) > { > if (small_const_nbits(size)) { > - unsigned long val = *addr & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0); > > return val ? __fls(val) : size; > } > @@ -505,7 +512,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned > long size, unsigned long offset) > { > if (small_const_nbits(size)) { > - unsigned long val = *(const unsigned long *)addr; > + unsigned long val = READ_ONCE(*(const unsigned long *)addr); > > if (unlikely(offset >= size)) > return size; > @@ -523,7 +530,8 @@ static inline > unsigned long find_first_zero_bit_le(const void *addr, unsigned long size) > { > if (small_const_nbits(size)) { > - unsigned long val = swab(*(const unsigned long *)addr) | ~GENMASK(size - 1, 0); > + unsigned long val = swab(READ_ONCE(*(const unsigned long *)addr)) > + | ~GENMASK(size - 1, 0); > > return val == ~0UL ? 
size : ffz(val); > } > @@ -538,7 +546,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned > long size, unsigned long offset) > { > if (small_const_nbits(size)) { > - unsigned long val = *(const unsigned long *)addr; > + unsigned long val = READ_ONCE(*(const unsigned long *)addr); > > if (unlikely(offset >= size)) > return size; > diff --git a/lib/find_bit.c b/lib/find_bit.c > index 32f99e9a670e..6867ef8a85e0 100644 > --- a/lib/find_bit.c > +++ b/lib/find_bit.c > @@ -98,7 +98,7 @@ out: \ > */ > unsigned long _find_first_bit(const unsigned long *addr, unsigned long size) > { > - return FIND_FIRST_BIT(addr[idx], /* nop */, size); > + return FIND_FIRST_BIT(READ_ONCE(addr[idx]), /* nop */, size); > } > EXPORT_SYMBOL(_find_first_bit); > #endif > @@ -111,7 +111,8 @@ unsigned long _find_first_and_bit(const unsigned long *addr1, > const unsigned long *addr2, > unsigned long size) > { > - return FIND_FIRST_BIT(addr1[idx] & addr2[idx], /* nop */, size); > + return FIND_FIRST_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]), > + /* nop */, size); > } > EXPORT_SYMBOL(_find_first_and_bit); > #endif > @@ -122,7 +123,7 @@ EXPORT_SYMBOL(_find_first_and_bit); > */ > unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size) > { > - return FIND_FIRST_BIT(~addr[idx], /* nop */, size); > + return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), /* nop */, size); > } > EXPORT_SYMBOL(_find_first_zero_bit); > #endif > @@ -130,28 +131,30 @@ EXPORT_SYMBOL(_find_first_zero_bit); > #ifndef find_next_bit > unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start) > { > - return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start); > + return FIND_NEXT_BIT(READ_ONCE(addr[idx]), /* nop */, nbits, start); > } > EXPORT_SYMBOL(_find_next_bit); > #endif > > unsigned long __find_nth_bit(const unsigned long *addr, unsigned long size, unsigned long n) > { > - return FIND_NTH_BIT(addr[idx], size, n); > + return FIND_NTH_BIT(READ_ONCE(addr[idx]), size, n); > } > EXPORT_SYMBOL(__find_nth_bit); > > unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long *addr2, > unsigned long size, unsigned long n) > { > - return FIND_NTH_BIT(addr1[idx] & addr2[idx], size, n); > + return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]), > + size, n); > } > EXPORT_SYMBOL(__find_nth_and_bit); > > unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2, > unsigned long size, unsigned long n) > { > - return FIND_NTH_BIT(addr1[idx] & ~addr2[idx], size, n); > + return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]), > + size, n); > } > EXPORT_SYMBOL(__find_nth_andnot_bit); > > @@ -160,7 +163,8 @@ unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1, > const unsigned long *addr3, > unsigned long size, unsigned long n) > { > - return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n); > + return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]) & > + ~READ_ONCE(addr3[idx]), size, n); > } > EXPORT_SYMBOL(__find_nth_and_andnot_bit); > > @@ -168,7 +172,8 @@ EXPORT_SYMBOL(__find_nth_and_andnot_bit); > unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2, > unsigned long nbits, unsigned long start) > { > - return FIND_NEXT_BIT(addr1[idx] & addr2[idx], /* nop */, nbits, start); > + return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]), > + /* nop */, nbits, start); > } > EXPORT_SYMBOL(_find_next_and_bit); > #endif > @@ -177,7 +182,8 @@ 
EXPORT_SYMBOL(_find_next_and_bit); > unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2, > unsigned long nbits, unsigned long start) > { > - return FIND_NEXT_BIT(addr1[idx] & ~addr2[idx], /* nop */, nbits, start); > + return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]), > + /* nop */, nbits, start); > } > EXPORT_SYMBOL(_find_next_andnot_bit); > #endif > @@ -186,7 +192,8 @@ EXPORT_SYMBOL(_find_next_andnot_bit); > unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2, > unsigned long nbits, unsigned long start) > { > - return FIND_NEXT_BIT(addr1[idx] | addr2[idx], /* nop */, nbits, start); > + return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) | READ_ONCE(addr2[idx]), > + /* nop */, nbits, start); > } > EXPORT_SYMBOL(_find_next_or_bit); > #endif > @@ -195,7 +202,7 @@ EXPORT_SYMBOL(_find_next_or_bit); > unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits, > unsigned long start) > { > - return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start); > + return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), /* nop */, nbits, start); > } > EXPORT_SYMBOL(_find_next_zero_bit); > #endif > @@ -208,7 +215,7 @@ unsigned long _find_last_bit(const unsigned long *addr, unsigned long size) > unsigned long idx = (size-1) / BITS_PER_LONG; > > do { > - val &= addr[idx]; > + val &= READ_ONCE(addr[idx]); > if (val) > return idx * BITS_PER_LONG + __fls(val); > > @@ -242,7 +249,7 @@ EXPORT_SYMBOL(find_next_clump8); > */ > unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size) > { > - return FIND_FIRST_BIT(~addr[idx], swab, size); > + return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), swab, size); > } > EXPORT_SYMBOL(_find_first_zero_bit_le); > > @@ -252,7 +259,7 @@ EXPORT_SYMBOL(_find_first_zero_bit_le); > unsigned long _find_next_zero_bit_le(const unsigned long *addr, > unsigned long size, unsigned long offset) > { > - return FIND_NEXT_BIT(~addr[idx], swab, size, offset); > + return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), swab, size, offset); > } > EXPORT_SYMBOL(_find_next_zero_bit_le); > #endif > @@ -261,7 +268,7 @@ EXPORT_SYMBOL(_find_next_zero_bit_le); > unsigned long _find_next_bit_le(const unsigned long *addr, > unsigned long size, unsigned long offset) > { > - return FIND_NEXT_BIT(addr[idx], swab, size, offset); > + return FIND_NEXT_BIT(READ_ONCE(addr[idx]), swab, size, offset); > } > EXPORT_SYMBOL(_find_next_bit_le); > > -- > 2.35.3
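[Editorial note: for readers who want a feel for the "userspace microbenchmark" numbers discussed in the commit message, here is an illustrative sketch of what such a benchmark might look like. It is not the benchmark Jan or Mirsad ran; the array size, repetition count, worst-case bit placement, and the plain volatile cast standing in for the kernel's READ_ONCE() are all assumptions.]

	/*
	 * Illustrative userspace sketch: time a find_first_bit()-style scan
	 * using a plain load vs. a volatile load (the word-sized core of what
	 * READ_ONCE() does). Build with e.g. gcc -O2.
	 */
	#include <stdio.h>
	#include <string.h>
	#include <time.h>

	#define NWORDS 1024
	#define BITS_PER_LONG (8 * sizeof(unsigned long))

	static unsigned long bitmap[NWORDS];

	/* Plain load: the compiler may cache, reorder or refetch addr[i]. */
	static unsigned long find_first_bit_plain(const unsigned long *addr, unsigned long nwords)
	{
		for (unsigned long i = 0; i < nwords; i++) {
			unsigned long val = addr[i];
			if (val)
				return i * BITS_PER_LONG + __builtin_ctzl(val);
		}
		return nwords * BITS_PER_LONG;
	}

	/* Volatile load: exactly one access per word, never refetched. */
	static unsigned long find_first_bit_once(const unsigned long *addr, unsigned long nwords)
	{
		for (unsigned long i = 0; i < nwords; i++) {
			unsigned long val = *(const volatile unsigned long *)&addr[i];
			if (val)
				return i * BITS_PER_LONG + __builtin_ctzl(val);
		}
		return nwords * BITS_PER_LONG;
	}

	static double bench(unsigned long (*fn)(const unsigned long *, unsigned long))
	{
		struct timespec t0, t1;
		unsigned long sink = 0;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (int rep = 0; rep < 100000; rep++)
			sink += fn(bitmap, NWORDS);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		/* Keep the result alive so the loop is not optimized away. */
		__asm__ volatile("" :: "r"(sink));
		return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	}

	int main(void)
	{
		memset(bitmap, 0, sizeof(bitmap));
		bitmap[NWORDS - 1] = 1UL << 7;	/* worst case: set bit near the end */

		printf("plain: %.0f ns\n", bench(find_first_bit_plain));
		printf("once:  %.0f ns\n", bench(find_first_bit_once));
		return 0;
	}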
On Wed, Oct 11, 2023 at 11:26:29AM -0700, Yury Norov wrote: > Long story short: KCSAN found some potential issues related to how > people use bitmap API. And instead of working through that issues, > the following code shuts down KCSAN by applying READ_ONCE() here > and there. Pfft. > READ_ONCE() fixes nothing because nothing is broken in find_bit() API. > As I suspected, and as Matthew confirmed in his email, the true reason > for READ_ONCE() here is to disable KCSAN check: > > READ_ONCE() serves two functions here; > one is that it tells the compiler not to try anything fancy, and > the other is that it tells KCSAN to not bother instrumenting this > load; no load-delay-reload. > > https://lkml.kernel.org/linux-mm/ZQkhgVb8nWGxpSPk@casper.infradead.org/ > > And as side-effect, it of course hurts the performance. In the same > email Matthew said he doesn't believe me that READ_ONCE would do that, > so thank you for testing and confirming that it does. You really misinterpreted what Jan wrote to accomplish this motivated reasoning. > Jan, I think that in your last email you confirmed that the xarray > problem that you're trying to solve is about a lack of proper locking: > > Well, for xarray the write side is synchronized with a spinlock but the read > side is not (only RCU protected). > > https://lkml.kernel.org/linux-mm/20230918155403.ylhfdbscgw6yek6p@quack3/ > > If there's no enough synchronization, why not just adding it? You don't understand. We _intend_ for there to be no locking. We_understand_ there is a race here. We're _fine_ with there being a race here. > Regardless performance consideration, my main concern is that this patch > considers bitmap as an atomic structure, which is intentionally not. > There are just a few single-bit atomic operations like set_bit() and > clear_bit(). All other functions are non-atomic, including those > find_bit() operations. ... and for KCSAN to understand that, we have to use READ_ONCE. > There is quite a lot of examples of wrong use of bitmaps wrt > atomicity, the most typical is like: > for(idx = 0; idx < num; idx++) { > ... > set_bit(idx, bitmap); > } > > This is wrong because a series of atomic ops is not atomic itself, and > if you see something like this in you code, it should be converted to > using non-atomic __set_bit(), and protected externally if needed. That is a bad use of set_bit()! I agree! See, for example, commit b21866f514cb where I remove precisely this kind of code. > Similarly, READ_ONCE() in a for-loop doesn't guarantee any ordering or > atomicity, and only hurts the performance. And this is exactly what > this patch does. Go back and read Jan's patch again, instead of cherry-picking some little bits that confirm your prejudices.
On 10/11/23 20:49, Matthew Wilcox wrote: > On Wed, Oct 11, 2023 at 11:26:29AM -0700, Yury Norov wrote: >> Long story short: KCSAN found some potential issues related to how >> people use bitmap API. And instead of working through that issues, >> the following code shuts down KCSAN by applying READ_ONCE() here >> and there. > > Pfft. > >> READ_ONCE() fixes nothing because nothing is broken in find_bit() API. >> As I suspected, and as Matthew confirmed in his email, the true reason >> for READ_ONCE() here is to disable KCSAN check: >> >> READ_ONCE() serves two functions here; >> one is that it tells the compiler not to try anything fancy, and >> the other is that it tells KCSAN to not bother instrumenting this >> load; no load-delay-reload. >> >> https://lkml.kernel.org/linux-mm/ZQkhgVb8nWGxpSPk@casper.infradead.org/ >> >> And as side-effect, it of course hurts the performance. In the same >> email Matthew said he doesn't believe me that READ_ONCE would do that, >> so thank you for testing and confirming that it does. > > You really misinterpreted what Jan wrote to accomplish this motivated > reasoning. > >> Jan, I think that in your last email you confirmed that the xarray >> problem that you're trying to solve is about a lack of proper locking: >> >> Well, for xarray the write side is synchronized with a spinlock but the read >> side is not (only RCU protected). >> >> https://lkml.kernel.org/linux-mm/20230918155403.ylhfdbscgw6yek6p@quack3/ >> >> If there's no enough synchronization, why not just adding it? > > You don't understand. We _intend_ for there to be no locking. > We_understand_ there is a race here. We're _fine_ with there being > a race here. > >> Regardless performance consideration, my main concern is that this patch >> considers bitmap as an atomic structure, which is intentionally not. >> There are just a few single-bit atomic operations like set_bit() and >> clear_bit(). All other functions are non-atomic, including those >> find_bit() operations. > > ... and for KCSAN to understand that, we have to use READ_ONCE. > >> There is quite a lot of examples of wrong use of bitmaps wrt >> atomicity, the most typical is like: >> for(idx = 0; idx < num; idx++) { >> ... >> set_bit(idx, bitmap); >> } >> >> This is wrong because a series of atomic ops is not atomic itself, and >> if you see something like this in you code, it should be converted to >> using non-atomic __set_bit(), and protected externally if needed. > > That is a bad use of set_bit()! I agree! See, for example, commit > b21866f514cb where I remove precisely this kind of code. > >> Similarly, READ_ONCE() in a for-loop doesn't guarantee any ordering or >> atomicity, and only hurts the performance. And this is exactly what >> this patch does. > > Go back and read Jan's patch again, instead of cherry-picking some > little bits that confirm your prejudices. Hey Yuri, No hard feelings, but I tend to agree with Mr. Wilcox and Jan. set_bit just as any atomic increment or memory barrier - by the same author you quoted - causes a LOCK prefix to the assembler instruction. By my modest knowledge of the machine language, this will cause the CPU core to LOCK the memory bus for the time the byte, word, longword or quadword (or even bit) is being read, changed, and written back. If I am not making a stupid logical mistake, this LOCK on the memory bus by a core is going to prevent the other cores from accessing memory or filling or flushing their caches. I agree with Mr. 
Wilcox that locking would have a much worse performance penalty than a simple READ_ONCE(), which is designed to prevent the compiler from applying "funny" optimisations, such as using two faster instructions instead of a single atomic load - which might in the worst case be interrupted just between the two half-loads. It does nothing like hurting performance on the level of a memory read or write barrier, or the memory bus lock that stalls all cores.

So, it silences KCSAN and I am happy with it, but I will not proceed with a formal patch proposal until we have a consensus about it.

The data race actually means that another core can tear your load in half and you get unexpected results. Why it happens more often on my configuration than on the others I cannot tell. But it might hurt the integrity of any filesystem relying on the find_first_bit() and find_next_bit() primitives. I mean, in the worst case scenario.

Meaning, I might not opt to go to Mars with the ship's computer running with data races ;-)

Best regards,
Mirsad Todorovac
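[Editorial note: for context on the "funny optimisations" mentioned above, here is a simplified sketch of what READ_ONCE() amounts to for a word-sized load, and of the load-delay-reload pattern it prevents. The real macro lives in include/asm-generic/rwonce.h and carries extra type checking and KCSAN hooks; READ_ONCE_APPROX and the helper functions are illustrative names only.]

	/* Rough word-sized approximation of the kernel's READ_ONCE(): one
	 * volatile load that the compiler may neither tear nor refetch. */
	#define READ_ONCE_APPROX(x)	(*(const volatile typeof(x) *)&(x))

	/* Racy form: nothing stops the compiler from re-reading *addr between
	 * the test and __ffs(), so __ffs() may be reached with 0 (undefined). */
	static unsigned long first_bit_plain(const unsigned long *addr)
	{
		unsigned long val = *addr;

		return val ? __ffs(val) : BITS_PER_LONG;
	}

	/* READ_ONCE form: the value is fetched exactly once, so the value
	 * tested is the value passed to __ffs(). */
	static unsigned long first_bit_once(const unsigned long *addr)
	{
		unsigned long val = READ_ONCE_APPROX(*addr);

		return val ? __ffs(val) : BITS_PER_LONG;
	}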
On 10/11/23 17:02, Jan Kara wrote: > From: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> > > Some users of lib/find functions can operate on bitmaps that change > while we are looking at the bitmap. For example xarray code looks at tag > bitmaps only with RCU protection. The xarray users are prepared to > handle a situation were tag modification races with reading of a tag > bitmap (and thus we get back a stale information) but the find_*bit() > functions should return based on bitmap contents at *some* point in time. > As UBSAN properly warns, the code like: > > val = *addr; > if (val) > __ffs(val) > > prone to refetching 'val' contents from addr by the compiler and thus > passing 0 to __ffs() which has undefined results. > > Fix the problem by using READ_ONCE() in all the appropriate places so > that the compiler cannot accidentally refetch the contents of addr > between the test and __ffs(). This makes the undefined behavior > impossible. The generated code is somewhat larger due to gcc tracking > both the index and target fetch address in separate registers (according > to GCC folks the volatile cast confuses their optimizer a bit, they are > looking into a fix). The difference in performance is minimal though. > Targetted microbenchmark (in userspace) of the bit searching loop shows > about 2% regression on some x86 microarchitectures so for normal kernel > workloads this should be in the noise and not worth introducing special > set of bitmap searching functions. > > [JK: Wrote commit message] > > CC: Yury Norov <yury.norov@gmail.com> > Link: https://lore.kernel.org/all/20230918044739.29782-1-mirsad.todorovac@alu.unizg.hr/ > Signed-off-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> > Signed-off-by: Jan Kara <jack@suse.cz> > --- > include/linux/find.h | 40 ++++++++++++++++++++++++---------------- > lib/find_bit.c | 39 +++++++++++++++++++++++---------------- > 2 files changed, 47 insertions(+), 32 deletions(-) > > diff --git a/include/linux/find.h b/include/linux/find.h > index 5e4f39ef2e72..5ef6466dc7cc 100644 > --- a/include/linux/find.h > +++ b/include/linux/find.h > @@ -60,7 +60,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size, > if (unlikely(offset >= size)) > return size; > > - val = *addr & GENMASK(size - 1, offset); > + val = READ_ONCE(*addr) & GENMASK(size - 1, offset); > return val ? __ffs(val) : size; > } > > @@ -90,7 +90,8 @@ unsigned long find_next_and_bit(const unsigned long *addr1, > if (unlikely(offset >= size)) > return size; > > - val = *addr1 & *addr2 & GENMASK(size - 1, offset); > + val = READ_ONCE(*addr1) & READ_ONCE(*addr2) & > + GENMASK(size - 1, offset); > return val ? __ffs(val) : size; > } > > @@ -121,7 +122,8 @@ unsigned long find_next_andnot_bit(const unsigned long *addr1, > if (unlikely(offset >= size)) > return size; > > - val = *addr1 & ~*addr2 & GENMASK(size - 1, offset); > + val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) & > + GENMASK(size - 1, offset); > return val ? __ffs(val) : size; > } > > @@ -151,7 +153,8 @@ unsigned long find_next_or_bit(const unsigned long *addr1, > if (unlikely(offset >= size)) > return size; > > - val = (*addr1 | *addr2) & GENMASK(size - 1, offset); > + val = (READ_ONCE(*addr1) | READ_ONCE(*addr2)) & > + GENMASK(size - 1, offset); > return val ? 
__ffs(val) : size; > } > > @@ -179,7 +182,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size, > if (unlikely(offset >= size)) > return size; > > - val = *addr | ~GENMASK(size - 1, offset); > + val = READ_ONCE(*addr) | ~GENMASK(size - 1, offset); > return val == ~0UL ? size : ffz(val); > } > > @@ -200,7 +203,7 @@ static inline > unsigned long find_first_bit(const unsigned long *addr, unsigned long size) > { > if (small_const_nbits(size)) { > - unsigned long val = *addr & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0); > > return val ? __ffs(val) : size; > } > @@ -229,7 +232,7 @@ unsigned long find_nth_bit(const unsigned long *addr, unsigned long size, unsign > return size; > > if (small_const_nbits(size)) { > - unsigned long val = *addr & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0); > > return val ? fns(val, n) : size; > } > @@ -255,7 +258,8 @@ unsigned long find_nth_and_bit(const unsigned long *addr1, const unsigned long * > return size; > > if (small_const_nbits(size)) { > - unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) > + & GENMASK(size - 1, 0); > > return val ? fns(val, n) : size; > } > @@ -282,7 +286,8 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon > return size; > > if (small_const_nbits(size)) { > - unsigned long val = *addr1 & (~*addr2) & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) & > + GENMASK(size - 1, 0); > > return val ? fns(val, n) : size; > } > @@ -312,7 +317,8 @@ unsigned long find_nth_and_andnot_bit(const unsigned long *addr1, > return size; > > if (small_const_nbits(size)) { > - unsigned long val = *addr1 & *addr2 & (~*addr3) & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) & > + ~READ_ONCE(*addr3) & GENMASK(size - 1, 0); > > return val ? fns(val, n) : size; > } > @@ -336,7 +342,8 @@ unsigned long find_first_and_bit(const unsigned long *addr1, > unsigned long size) > { > if (small_const_nbits(size)) { > - unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) & > + GENMASK(size - 1, 0); > > return val ? __ffs(val) : size; > } > @@ -358,7 +365,7 @@ static inline > unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size) > { > if (small_const_nbits(size)) { > - unsigned long val = *addr | ~GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr) | ~GENMASK(size - 1, 0); > > return val == ~0UL ? size : ffz(val); > } > @@ -379,7 +386,7 @@ static inline > unsigned long find_last_bit(const unsigned long *addr, unsigned long size) > { > if (small_const_nbits(size)) { > - unsigned long val = *addr & GENMASK(size - 1, 0); > + unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0); > > return val ? 
__fls(val) : size; > } > @@ -505,7 +512,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned > long size, unsigned long offset) > { > if (small_const_nbits(size)) { > - unsigned long val = *(const unsigned long *)addr; > + unsigned long val = READ_ONCE(*(const unsigned long *)addr); > > if (unlikely(offset >= size)) > return size; > @@ -523,7 +530,8 @@ static inline > unsigned long find_first_zero_bit_le(const void *addr, unsigned long size) > { > if (small_const_nbits(size)) { > - unsigned long val = swab(*(const unsigned long *)addr) | ~GENMASK(size - 1, 0); > + unsigned long val = swab(READ_ONCE(*(const unsigned long *)addr)) > + | ~GENMASK(size - 1, 0); > > return val == ~0UL ? size : ffz(val); > } > @@ -538,7 +546,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned > long size, unsigned long offset) > { > if (small_const_nbits(size)) { > - unsigned long val = *(const unsigned long *)addr; > + unsigned long val = READ_ONCE(*(const unsigned long *)addr); > > if (unlikely(offset >= size)) > return size; > diff --git a/lib/find_bit.c b/lib/find_bit.c > index 32f99e9a670e..6867ef8a85e0 100644 > --- a/lib/find_bit.c > +++ b/lib/find_bit.c > @@ -98,7 +98,7 @@ out: \ > */ > unsigned long _find_first_bit(const unsigned long *addr, unsigned long size) > { > - return FIND_FIRST_BIT(addr[idx], /* nop */, size); > + return FIND_FIRST_BIT(READ_ONCE(addr[idx]), /* nop */, size); > } > EXPORT_SYMBOL(_find_first_bit); > #endif > @@ -111,7 +111,8 @@ unsigned long _find_first_and_bit(const unsigned long *addr1, > const unsigned long *addr2, > unsigned long size) > { > - return FIND_FIRST_BIT(addr1[idx] & addr2[idx], /* nop */, size); > + return FIND_FIRST_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]), > + /* nop */, size); > } > EXPORT_SYMBOL(_find_first_and_bit); > #endif > @@ -122,7 +123,7 @@ EXPORT_SYMBOL(_find_first_and_bit); > */ > unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size) > { > - return FIND_FIRST_BIT(~addr[idx], /* nop */, size); > + return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), /* nop */, size); > } > EXPORT_SYMBOL(_find_first_zero_bit); > #endif > @@ -130,28 +131,30 @@ EXPORT_SYMBOL(_find_first_zero_bit); > #ifndef find_next_bit > unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start) > { > - return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start); > + return FIND_NEXT_BIT(READ_ONCE(addr[idx]), /* nop */, nbits, start); > } > EXPORT_SYMBOL(_find_next_bit); > #endif > > unsigned long __find_nth_bit(const unsigned long *addr, unsigned long size, unsigned long n) > { > - return FIND_NTH_BIT(addr[idx], size, n); > + return FIND_NTH_BIT(READ_ONCE(addr[idx]), size, n); > } > EXPORT_SYMBOL(__find_nth_bit); > > unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long *addr2, > unsigned long size, unsigned long n) > { > - return FIND_NTH_BIT(addr1[idx] & addr2[idx], size, n); > + return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]), > + size, n); > } > EXPORT_SYMBOL(__find_nth_and_bit); > > unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2, > unsigned long size, unsigned long n) > { > - return FIND_NTH_BIT(addr1[idx] & ~addr2[idx], size, n); > + return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]), > + size, n); > } > EXPORT_SYMBOL(__find_nth_andnot_bit); > > @@ -160,7 +163,8 @@ unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1, > const unsigned long *addr3, > unsigned long size, 
unsigned long n) > { > - return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n); > + return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]) & > + ~READ_ONCE(addr3[idx]), size, n); > } > EXPORT_SYMBOL(__find_nth_and_andnot_bit); > > @@ -168,7 +172,8 @@ EXPORT_SYMBOL(__find_nth_and_andnot_bit); > unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2, > unsigned long nbits, unsigned long start) > { > - return FIND_NEXT_BIT(addr1[idx] & addr2[idx], /* nop */, nbits, start); > + return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]), > + /* nop */, nbits, start); > } > EXPORT_SYMBOL(_find_next_and_bit); > #endif > @@ -177,7 +182,8 @@ EXPORT_SYMBOL(_find_next_and_bit); > unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2, > unsigned long nbits, unsigned long start) > { > - return FIND_NEXT_BIT(addr1[idx] & ~addr2[idx], /* nop */, nbits, start); > + return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]), > + /* nop */, nbits, start); > } > EXPORT_SYMBOL(_find_next_andnot_bit); > #endif > @@ -186,7 +192,8 @@ EXPORT_SYMBOL(_find_next_andnot_bit); > unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2, > unsigned long nbits, unsigned long start) > { > - return FIND_NEXT_BIT(addr1[idx] | addr2[idx], /* nop */, nbits, start); > + return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) | READ_ONCE(addr2[idx]), > + /* nop */, nbits, start); > } > EXPORT_SYMBOL(_find_next_or_bit); > #endif > @@ -195,7 +202,7 @@ EXPORT_SYMBOL(_find_next_or_bit); > unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits, > unsigned long start) > { > - return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start); > + return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), /* nop */, nbits, start); > } > EXPORT_SYMBOL(_find_next_zero_bit); > #endif > @@ -208,7 +215,7 @@ unsigned long _find_last_bit(const unsigned long *addr, unsigned long size) > unsigned long idx = (size-1) / BITS_PER_LONG; > > do { > - val &= addr[idx]; > + val &= READ_ONCE(addr[idx]); > if (val) > return idx * BITS_PER_LONG + __fls(val); > > @@ -242,7 +249,7 @@ EXPORT_SYMBOL(find_next_clump8); > */ > unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size) > { > - return FIND_FIRST_BIT(~addr[idx], swab, size); > + return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), swab, size); > } > EXPORT_SYMBOL(_find_first_zero_bit_le); > > @@ -252,7 +259,7 @@ EXPORT_SYMBOL(_find_first_zero_bit_le); > unsigned long _find_next_zero_bit_le(const unsigned long *addr, > unsigned long size, unsigned long offset) > { > - return FIND_NEXT_BIT(~addr[idx], swab, size, offset); > + return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), swab, size, offset); > } > EXPORT_SYMBOL(_find_next_zero_bit_le); > #endif > @@ -261,7 +268,7 @@ EXPORT_SYMBOL(_find_next_zero_bit_le); > unsigned long _find_next_bit_le(const unsigned long *addr, > unsigned long size, unsigned long offset) > { > - return FIND_NEXT_BIT(addr[idx], swab, size, offset); > + return FIND_NEXT_BIT(READ_ONCE(addr[idx]), swab, size, offset); > } > EXPORT_SYMBOL(_find_next_bit_le); > Works like charm. Nothing in KCSAN. Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> Best regards, Mirsad
On Wed 11-10-23 11:26:29, Yury Norov wrote: > Long story short: KCSAN found some potential issues related to how > people use bitmap API. And instead of working through that issues, > the following code shuts down KCSAN by applying READ_ONCE() here > and there. I'm sorry but this is not what the patch does. I'm not sure how to get the message across so maybe let me start from a different angle: Bitmaps are perfectly fine to be used without any external locking if only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are used. This is a significant performance gain compared to using a spinlock or other locking and people do this for a long time. I hope we agree on that. Now it is also common that you need to find a set / clear bit in a bitmap. To maintain lockless protocol and deal with races people employ schemes like (the dumbest form): do { bit = find_first_bit(bitmap, n); if (bit >= n) abort... } while (!test_and_clear_bit(bit, bitmap)); So the code loops until it finds a set bit that is successfully cleared by it. This is perfectly fine and safe lockless code and such use should be supported. Agreed? *Except* that the above actually is not safe due to find_first_bit() implementation and KCSAN warns about that. The problem is that: Assume *addr == 1 CPU1 CPU2 find_first_bit(addr, 64) val = *addr; if (val) -> true clear_bit(0, addr) val = *addr -> compiler decided to refetch addr contents for whatever reason in the generated assembly __ffs(val) -> now executed for value 0 which has undefined results. And the READ_ONCE() this patch adds prevents the compiler from adding the refetching of addr into the assembly. Are we on the same page now? Honza > On Wed, Oct 11, 2023 at 05:02:31PM +0200, Jan Kara wrote: > > From: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> > > > > Some users of lib/find functions can operate on bitmaps that change > > while we are looking at the bitmap. For example xarray code looks at tag > > bitmaps only with RCU protection. The xarray users are prepared to > > handle a situation were tag modification races with reading of a tag > > bitmap (and thus we get back a stale information) but the find_*bit() > > functions should return based on bitmap contents at *some* point in time. > > As UBSAN properly warns, the code like: > > > > val = *addr; > > if (val) > > __ffs(val) > > > > prone to refetching 'val' contents from addr by the compiler and thus > > passing 0 to __ffs() which has undefined results. > > > > Fix the problem by using READ_ONCE() in all the appropriate places so > > that the compiler cannot accidentally refetch the contents of addr > > between the test and __ffs(). This makes the undefined behavior > > impossible. The generated code is somewhat larger due to gcc tracking > > both the index and target fetch address in separate registers (according > > to GCC folks the volatile cast confuses their optimizer a bit, they are > > looking into a fix). The difference in performance is minimal though. > > Targetted microbenchmark (in userspace) of the bit searching loop shows > > about 2% regression on some x86 microarchitectures so for normal kernel > > workloads this should be in the noise and not worth introducing special > > set of bitmap searching functions. > > > > [JK: Wrote commit message] > > READ_ONCE() fixes nothing because nothing is broken in find_bit() API. 
> As I suspected, and as Matthew confirmed in his email, the true reason > for READ_ONCE() here is to disable KCSAN check: > > READ_ONCE() serves two functions here; > one is that it tells the compiler not to try anything fancy, and > the other is that it tells KCSAN to not bother instrumenting this > load; no load-delay-reload. > > https://lkml.kernel.org/linux-mm/ZQkhgVb8nWGxpSPk@casper.infradead.org/ > > And as side-effect, it of course hurts the performance. In the same > email Matthew said he doesn't believe me that READ_ONCE would do that, > so thank you for testing and confirming that it does. > > Matthew, in my experience compilers do really well in that low-level > things, and gambling with compilers usually makes thing worse. x86_64 > is one of the most strong-ordered architectures that I've been working > with, and even here READ_ONCE has visible effect on performance. I > expect that it will get even worse on more weakly ordered platforms. > > I'm OK that you don't believe me; probably you can believe Paul > McKenney and his paper on kernel memory model, which says in the very > first paragraph: > > Applying ACCESS_ONCE() to a large array or structure is unlikely > to do anything useful, and use of READ_ONCE() and WRITE_ONCE() > in this situation can result in load-tearing and store-tearing. > > https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4444.html > > Jan, I think that in your last email you confirmed that the xarray > problem that you're trying to solve is about a lack of proper locking: > > Well, for xarray the write side is synchronized with a spinlock but the read > side is not (only RCU protected). > > https://lkml.kernel.org/linux-mm/20230918155403.ylhfdbscgw6yek6p@quack3/ > > If there's no enough synchronization, why not just adding it? > > Regardless performance consideration, my main concern is that this patch > considers bitmap as an atomic structure, which is intentionally not. > There are just a few single-bit atomic operations like set_bit() and > clear_bit(). All other functions are non-atomic, including those > find_bit() operations. > > There is quite a lot of examples of wrong use of bitmaps wrt > atomicity, the most typical is like: > for(idx = 0; idx < num; idx++) { > ... > set_bit(idx, bitmap); > } > > This is wrong because a series of atomic ops is not atomic itself, and > if you see something like this in you code, it should be converted to > using non-atomic __set_bit(), and protected externally if needed. > > Similarly, READ_ONCE() in a for-loop doesn't guarantee any ordering or > atomicity, and only hurts the performance. And this is exactly what > this patch does. 
> > Thanks, > Yury > > > CC: Yury Norov <yury.norov@gmail.com> > > Link: https://lore.kernel.org/all/20230918044739.29782-1-mirsad.todorovac@alu.unizg.hr/ > > Signed-off-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr> > > Signed-off-by: Jan Kara <jack@suse.cz> > > --- > > include/linux/find.h | 40 ++++++++++++++++++++++++---------------- > > lib/find_bit.c | 39 +++++++++++++++++++++++---------------- > > 2 files changed, 47 insertions(+), 32 deletions(-) > > > > diff --git a/include/linux/find.h b/include/linux/find.h > > index 5e4f39ef2e72..5ef6466dc7cc 100644 > > --- a/include/linux/find.h > > +++ b/include/linux/find.h > > @@ -60,7 +60,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size, > > if (unlikely(offset >= size)) > > return size; > > > > - val = *addr & GENMASK(size - 1, offset); > > + val = READ_ONCE(*addr) & GENMASK(size - 1, offset); > > return val ? __ffs(val) : size; > > } > > > > @@ -90,7 +90,8 @@ unsigned long find_next_and_bit(const unsigned long *addr1, > > if (unlikely(offset >= size)) > > return size; > > > > - val = *addr1 & *addr2 & GENMASK(size - 1, offset); > > + val = READ_ONCE(*addr1) & READ_ONCE(*addr2) & > > + GENMASK(size - 1, offset); > > return val ? __ffs(val) : size; > > } > > > > @@ -121,7 +122,8 @@ unsigned long find_next_andnot_bit(const unsigned long *addr1, > > if (unlikely(offset >= size)) > > return size; > > > > - val = *addr1 & ~*addr2 & GENMASK(size - 1, offset); > > + val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) & > > + GENMASK(size - 1, offset); > > return val ? __ffs(val) : size; > > } > > > > @@ -151,7 +153,8 @@ unsigned long find_next_or_bit(const unsigned long *addr1, > > if (unlikely(offset >= size)) > > return size; > > > > - val = (*addr1 | *addr2) & GENMASK(size - 1, offset); > > + val = (READ_ONCE(*addr1) | READ_ONCE(*addr2)) & > > + GENMASK(size - 1, offset); > > return val ? __ffs(val) : size; > > } > > > > @@ -179,7 +182,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size, > > if (unlikely(offset >= size)) > > return size; > > > > - val = *addr | ~GENMASK(size - 1, offset); > > + val = READ_ONCE(*addr) | ~GENMASK(size - 1, offset); > > return val == ~0UL ? size : ffz(val); > > } > > > > @@ -200,7 +203,7 @@ static inline > > unsigned long find_first_bit(const unsigned long *addr, unsigned long size) > > { > > if (small_const_nbits(size)) { > > - unsigned long val = *addr & GENMASK(size - 1, 0); > > + unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0); > > > > return val ? __ffs(val) : size; > > } > > @@ -229,7 +232,7 @@ unsigned long find_nth_bit(const unsigned long *addr, unsigned long size, unsign > > return size; > > > > if (small_const_nbits(size)) { > > - unsigned long val = *addr & GENMASK(size - 1, 0); > > + unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0); > > > > return val ? fns(val, n) : size; > > } > > @@ -255,7 +258,8 @@ unsigned long find_nth_and_bit(const unsigned long *addr1, const unsigned long * > > return size; > > > > if (small_const_nbits(size)) { > > - unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0); > > + unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) > > + & GENMASK(size - 1, 0); > > > > return val ? 
fns(val, n) : size; > > } > > @@ -282,7 +286,8 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon > > return size; > > > > if (small_const_nbits(size)) { > > - unsigned long val = *addr1 & (~*addr2) & GENMASK(size - 1, 0); > > + unsigned long val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) & > > + GENMASK(size - 1, 0); > > > > return val ? fns(val, n) : size; > > } > > @@ -312,7 +317,8 @@ unsigned long find_nth_and_andnot_bit(const unsigned long *addr1, > > return size; > > > > if (small_const_nbits(size)) { > > - unsigned long val = *addr1 & *addr2 & (~*addr3) & GENMASK(size - 1, 0); > > + unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) & > > + ~READ_ONCE(*addr3) & GENMASK(size - 1, 0); > > > > return val ? fns(val, n) : size; > > } > > @@ -336,7 +342,8 @@ unsigned long find_first_and_bit(const unsigned long *addr1, > > unsigned long size) > > { > > if (small_const_nbits(size)) { > > - unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0); > > + unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) & > > + GENMASK(size - 1, 0); > > > > return val ? __ffs(val) : size; > > } > > @@ -358,7 +365,7 @@ static inline > > unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size) > > { > > if (small_const_nbits(size)) { > > - unsigned long val = *addr | ~GENMASK(size - 1, 0); > > + unsigned long val = READ_ONCE(*addr) | ~GENMASK(size - 1, 0); > > > > return val == ~0UL ? size : ffz(val); > > } > > @@ -379,7 +386,7 @@ static inline > > unsigned long find_last_bit(const unsigned long *addr, unsigned long size) > > { > > if (small_const_nbits(size)) { > > - unsigned long val = *addr & GENMASK(size - 1, 0); > > + unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0); > > > > return val ? __fls(val) : size; > > } > > @@ -505,7 +512,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned > > long size, unsigned long offset) > > { > > if (small_const_nbits(size)) { > > - unsigned long val = *(const unsigned long *)addr; > > + unsigned long val = READ_ONCE(*(const unsigned long *)addr); > > > > if (unlikely(offset >= size)) > > return size; > > @@ -523,7 +530,8 @@ static inline > > unsigned long find_first_zero_bit_le(const void *addr, unsigned long size) > > { > > if (small_const_nbits(size)) { > > - unsigned long val = swab(*(const unsigned long *)addr) | ~GENMASK(size - 1, 0); > > + unsigned long val = swab(READ_ONCE(*(const unsigned long *)addr)) > > + | ~GENMASK(size - 1, 0); > > > > return val == ~0UL ? 
size : ffz(val); > > } > > @@ -538,7 +546,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned > > long size, unsigned long offset) > > { > > if (small_const_nbits(size)) { > > - unsigned long val = *(const unsigned long *)addr; > > + unsigned long val = READ_ONCE(*(const unsigned long *)addr); > > > > if (unlikely(offset >= size)) > > return size; > > diff --git a/lib/find_bit.c b/lib/find_bit.c > > index 32f99e9a670e..6867ef8a85e0 100644 > > --- a/lib/find_bit.c > > +++ b/lib/find_bit.c > > @@ -98,7 +98,7 @@ out: \ > > */ > > unsigned long _find_first_bit(const unsigned long *addr, unsigned long size) > > { > > - return FIND_FIRST_BIT(addr[idx], /* nop */, size); > > + return FIND_FIRST_BIT(READ_ONCE(addr[idx]), /* nop */, size); > > } > > EXPORT_SYMBOL(_find_first_bit); > > #endif > > @@ -111,7 +111,8 @@ unsigned long _find_first_and_bit(const unsigned long *addr1, > > const unsigned long *addr2, > > unsigned long size) > > { > > - return FIND_FIRST_BIT(addr1[idx] & addr2[idx], /* nop */, size); > > + return FIND_FIRST_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]), > > + /* nop */, size); > > } > > EXPORT_SYMBOL(_find_first_and_bit); > > #endif > > @@ -122,7 +123,7 @@ EXPORT_SYMBOL(_find_first_and_bit); > > */ > > unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size) > > { > > - return FIND_FIRST_BIT(~addr[idx], /* nop */, size); > > + return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), /* nop */, size); > > } > > EXPORT_SYMBOL(_find_first_zero_bit); > > #endif > > @@ -130,28 +131,30 @@ EXPORT_SYMBOL(_find_first_zero_bit); > > #ifndef find_next_bit > > unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start) > > { > > - return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start); > > + return FIND_NEXT_BIT(READ_ONCE(addr[idx]), /* nop */, nbits, start); > > } > > EXPORT_SYMBOL(_find_next_bit); > > #endif > > > > unsigned long __find_nth_bit(const unsigned long *addr, unsigned long size, unsigned long n) > > { > > - return FIND_NTH_BIT(addr[idx], size, n); > > + return FIND_NTH_BIT(READ_ONCE(addr[idx]), size, n); > > } > > EXPORT_SYMBOL(__find_nth_bit); > > > > unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long *addr2, > > unsigned long size, unsigned long n) > > { > > - return FIND_NTH_BIT(addr1[idx] & addr2[idx], size, n); > > + return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]), > > + size, n); > > } > > EXPORT_SYMBOL(__find_nth_and_bit); > > > > unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2, > > unsigned long size, unsigned long n) > > { > > - return FIND_NTH_BIT(addr1[idx] & ~addr2[idx], size, n); > > + return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]), > > + size, n); > > } > > EXPORT_SYMBOL(__find_nth_andnot_bit); > > > > @@ -160,7 +163,8 @@ unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1, > > const unsigned long *addr3, > > unsigned long size, unsigned long n) > > { > > - return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n); > > + return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]) & > > + ~READ_ONCE(addr3[idx]), size, n); > > } > > EXPORT_SYMBOL(__find_nth_and_andnot_bit); > > > > @@ -168,7 +172,8 @@ EXPORT_SYMBOL(__find_nth_and_andnot_bit); > > unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2, > > unsigned long nbits, unsigned long start) > > { > > - return FIND_NEXT_BIT(addr1[idx] & addr2[idx], /* nop */, nbits, 
start); > > + return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]), > > + /* nop */, nbits, start); > > } > > EXPORT_SYMBOL(_find_next_and_bit); > > #endif > > @@ -177,7 +182,8 @@ EXPORT_SYMBOL(_find_next_and_bit); > > unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2, > > unsigned long nbits, unsigned long start) > > { > > - return FIND_NEXT_BIT(addr1[idx] & ~addr2[idx], /* nop */, nbits, start); > > + return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]), > > + /* nop */, nbits, start); > > } > > EXPORT_SYMBOL(_find_next_andnot_bit); > > #endif > > @@ -186,7 +192,8 @@ EXPORT_SYMBOL(_find_next_andnot_bit); > > unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2, > > unsigned long nbits, unsigned long start) > > { > > - return FIND_NEXT_BIT(addr1[idx] | addr2[idx], /* nop */, nbits, start); > > + return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) | READ_ONCE(addr2[idx]), > > + /* nop */, nbits, start); > > } > > EXPORT_SYMBOL(_find_next_or_bit); > > #endif > > @@ -195,7 +202,7 @@ EXPORT_SYMBOL(_find_next_or_bit); > > unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits, > > unsigned long start) > > { > > - return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start); > > + return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), /* nop */, nbits, start); > > } > > EXPORT_SYMBOL(_find_next_zero_bit); > > #endif > > @@ -208,7 +215,7 @@ unsigned long _find_last_bit(const unsigned long *addr, unsigned long size) > > unsigned long idx = (size-1) / BITS_PER_LONG; > > > > do { > > - val &= addr[idx]; > > + val &= READ_ONCE(addr[idx]); > > if (val) > > return idx * BITS_PER_LONG + __fls(val); > > > > @@ -242,7 +249,7 @@ EXPORT_SYMBOL(find_next_clump8); > > */ > > unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size) > > { > > - return FIND_FIRST_BIT(~addr[idx], swab, size); > > + return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), swab, size); > > } > > EXPORT_SYMBOL(_find_first_zero_bit_le); > > > > @@ -252,7 +259,7 @@ EXPORT_SYMBOL(_find_first_zero_bit_le); > > unsigned long _find_next_zero_bit_le(const unsigned long *addr, > > unsigned long size, unsigned long offset) > > { > > - return FIND_NEXT_BIT(~addr[idx], swab, size, offset); > > + return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), swab, size, offset); > > } > > EXPORT_SYMBOL(_find_next_zero_bit_le); > > #endif > > @@ -261,7 +268,7 @@ EXPORT_SYMBOL(_find_next_zero_bit_le); > > unsigned long _find_next_bit_le(const unsigned long *addr, > > unsigned long size, unsigned long offset) > > { > > - return FIND_NEXT_BIT(addr[idx], swab, size, offset); > > + return FIND_NEXT_BIT(READ_ONCE(addr[idx]), swab, size, offset); > > } > > EXPORT_SYMBOL(_find_next_bit_le); > > > > -- > > 2.35.3
Restore LKML On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote: > On Wed 11-10-23 11:26:29, Yury Norov wrote: > > Long story short: KCSAN found some potential issues related to how > > people use bitmap API. And instead of working through that issues, > > the following code shuts down KCSAN by applying READ_ONCE() here > > and there. > > I'm sorry but this is not what the patch does. I'm not sure how to get the > message across so maybe let me start from a different angle: > > Bitmaps are perfectly fine to be used without any external locking if > only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are > used. This is a significant performance gain compared to using a spinlock > or other locking and people do this for a long time. I hope we agree on > that. > > Now it is also common that you need to find a set / clear bit in a bitmap. > To maintain lockless protocol and deal with races people employ schemes > like (the dumbest form): > > do { > bit = find_first_bit(bitmap, n); > if (bit >= n) > abort... > } while (!test_and_clear_bit(bit, bitmap)); > > So the code loops until it finds a set bit that is successfully cleared by > it. This is perfectly fine and safe lockless code and such use should be > supported. Agreed? Great example. When you're running non-atomic functions concurrently, the result may easily become incorrect, and this is what you're demonstrating here. Regarding find_first_bit() it means that: - it may erroneously return unset bit; - it may erroneously return non-first set bit; - it may erroneously return no bits for non-empty bitmap. Effectively it means that find_first bit may just return a random number. Let's take another example: do { bit = get_random_number(); if (bit >= n) abort... } while (!test_and_clear_bit(bit, bitmap)); When running concurrently, the difference between this and your code is only in probability of getting set bit somewhere from around the beginning of bitmap. The key point is that find_bit() may return undef even if READ_ONCE() is used. If bitmap gets changed anytime in the process, the result becomes invalid. It may happen even after returning from find_first_bit(). And if my understanding correct, your code is designed in the assumption that find_first_bit() may return garbage, so handles it correctly. > *Except* that the above actually is not safe due to find_first_bit() > implementation and KCSAN warns about that. The problem is that: > > Assume *addr == 1 > CPU1 CPU2 > find_first_bit(addr, 64) > val = *addr; > if (val) -> true > clear_bit(0, addr) > val = *addr -> compiler decided to refetch addr contents for whatever > reason in the generated assembly > __ffs(val) -> now executed for value 0 which has undefined results. Yes, __ffs(0) is undef. But the whole function is undef when accessing bitmap concurrently. > And the READ_ONCE() this patch adds prevents the compiler from adding the > refetching of addr into the assembly. That's true. But it doesn't improve on the situation. It was an undef before, and it's undef after, but a 2% slower undef. Now on that KCSAN warning. If I understand things correctly, for the example above, KCSAN warning is false-positive, because you're intentionally running lockless. But for some other people it may be a true error, and now they'll have no chance to catch it if KCSAN is forced to ignore find_bit() entirely. We've got the whole class of lockless algorithms that allow safe concurrent access to the memory. 
And now that there's a tool that searches for them (concurrent accesses), we need to have an option to somehow teach it to suppress irrelevant warnings. Maybe something like this? lockless_algorithm_begin(bitmap, bitmap_size(nbits)); do { bit = find_first_bit(bitmap, nbits); if (bit >= nbits) break; } while (!test_and_clear_bit(bit, bitmap)); lockless_algorithm_end(bitmap, bitmap_size(nbits)); And, of course, as I suggested a couple iterations ago, you can invent a thread-safe version of find_bit(), that would be perfectly correct for lockless use: unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size) { unsigned long bit = 0; while (!test_and_clear_bit(bit, bitmap) { bit = FIND_FIRST_BIT(addr[idx], /* nop */, size); if (bit >= size) return size; } return bit; } Didn't test that, but I hope 'volatile' specifier should be enough for compiler to realize that it shouldn't optimize memory access, and for KCSAN that everything's OK here. By the way, thank you for respectful and professional communication. Thanks, Yury
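For reference, the sketch above has a missing closing parenthesis and mixes up 'addr' and 'bitmap'. A corrected rendering of the same idea, built only on the existing find_first_bit() and test_and_clear_bit() primitives, could look like the following (a sketch of the proposal only, not necessarily what a final implementation would look like):

        /*
         * Find a set bit and atomically claim (clear) it, or return @size if
         * nothing could be claimed.  The find_first_bit() snapshot may be
         * stale; the atomic test_and_clear_bit() revalidates it, and we retry
         * on a miss.
         */
        static inline unsigned long find_and_clear_bit(volatile unsigned long *addr,
                                                       unsigned long size)
        {
                unsigned long bit;

                do {
                        /* cast only drops 'volatile' to match find_first_bit()'s prototype */
                        bit = find_first_bit((const unsigned long *)addr, size);
                        if (bit >= size)
                                return size;
                } while (!test_and_clear_bit(bit, addr));

                return bit;
        }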
On 10/14/2023 2:15 AM, Yury Norov wrote: > Restore LKML > > On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote: >> On Wed 11-10-23 11:26:29, Yury Norov wrote: >>> Long story short: KCSAN found some potential issues related to how >>> people use bitmap API. And instead of working through that issues, >>> the following code shuts down KCSAN by applying READ_ONCE() here >>> and there. >> >> I'm sorry but this is not what the patch does. I'm not sure how to get the >> message across so maybe let me start from a different angle: >> >> Bitmaps are perfectly fine to be used without any external locking if >> only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are >> used. This is a significant performance gain compared to using a spinlock >> or other locking and people do this for a long time. I hope we agree on >> that. >> >> Now it is also common that you need to find a set / clear bit in a bitmap. >> To maintain lockless protocol and deal with races people employ schemes >> like (the dumbest form): >> >> do { >> bit = find_first_bit(bitmap, n); >> if (bit >= n) >> abort... >> } while (!test_and_clear_bit(bit, bitmap)); >> >> So the code loops until it finds a set bit that is successfully cleared by >> it. This is perfectly fine and safe lockless code and such use should be >> supported. Agreed? > > Great example. When you're running non-atomic functions concurrently, > the result may easily become incorrect, and this is what you're > demonstrating here. > > Regarding find_first_bit() it means that: > - it may erroneously return unset bit; > - it may erroneously return non-first set bit; > - it may erroneously return no bits for non-empty bitmap. > > Effectively it means that find_first bit may just return a random number. > > Let's take another example: > > do { > bit = get_random_number(); > if (bit >= n) > abort... > } while (!test_and_clear_bit(bit, bitmap)); > > When running concurrently, the difference between this and your code > is only in probability of getting set bit somewhere from around the > beginning of bitmap. > > The key point is that find_bit() may return undef even if READ_ONCE() is > used. If bitmap gets changed anytime in the process, the result becomes > invalid. It may happen even after returning from find_first_bit(). > > And if my understanding correct, your code is designed in the > assumption that find_first_bit() may return garbage, so handles it > correctly. > >> *Except* that the above actually is not safe due to find_first_bit() >> implementation and KCSAN warns about that. The problem is that: >> >> Assume *addr == 1 >> CPU1 CPU2 >> find_first_bit(addr, 64) >> val = *addr; >> if (val) -> true >> clear_bit(0, addr) >> val = *addr -> compiler decided to refetch addr contents for whatever >> reason in the generated assembly >> __ffs(val) -> now executed for value 0 which has undefined results. > > Yes, __ffs(0) is undef. But the whole function is undef when accessing > bitmap concurrently. > >> And the READ_ONCE() this patch adds prevents the compiler from adding the >> refetching of addr into the assembly. > > That's true. But it doesn't improve on the situation. It was an undef > before, and it's undef after, but a 2% slower undef. > > Now on that KCSAN warning. If I understand things correctly, for the > example above, KCSAN warning is false-positive, because you're > intentionally running lockless. 
> > But for some other people it may be a true error, and now they'll have > no chance to catch it if KCSAN is forced to ignore find_bit() entirely. > > We've got the whole class of lockless algorithms that allow safe concurrent > access to the memory. And now that there's a tool that searches for them > (concurrent accesses), we need to have an option to somehow teach it > to suppress irrelevant warnings. Maybe something like this? > > lockless_algorithm_begin(bitmap, bitmap_size(nbits)); > do { > bit = find_first_bit(bitmap, nbits); > if (bit >= nbits) > break; > } while (!test_and_clear_bit(bit, bitmap)); > lockless_algorithm_end(bitmap, bitmap_size(nbits)); > > And, of course, as I suggested a couple iterations ago, you can invent > a thread-safe version of find_bit(), that would be perfectly correct > for lockless use: > > unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size) > { > unsigned long bit = 0; > > while (!test_and_clear_bit(bit, bitmap) { > bit = FIND_FIRST_BIT(addr[idx], /* nop */, size); > if (bit >= size) > return size; > } > > return bit; > } Hi, Yuri, But the code above effectively does the same as the READ_ONCE() macro as defined in rwonce.h: #ifndef __READ_ONCE #define __READ_ONCE(x) (*(const volatile __unqual_scalar_typeof(x) *)&(x)) #endif #define READ_ONCE(x) \ ({ \ compiletime_assert_rwonce_type(x); \ __READ_ONCE(x); \ }) Both uses only prevent the funny stuff the compiler might have done to the read of the addr[idx], there's no black magic in READ_ONCE(). Both examples would probably result in the same assembly and produce the same 2% slowdown ... Only you declare volatile in one place, and READ_ONCE() in each read, but this will only compile a bit slower and generate the same machine code. Best regards, Mirsad Todorovac > Didn't test that, but I hope 'volatile' specifier should be enough > for compiler to realize that it shouldn't optimize memory access, and > for KCSAN that everything's OK here. > > By the way, thank you for respectful and professional communication. > > Thanks, > Yury
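To make the difference concrete: the "funny stuff" both variants prevent is the compiler re-fetching the word between the zero test and the __ffs() call. A simplified contrast of the two shapes being argued about (the function names ffs_plain()/ffs_once() are purely illustrative):

        /*
         * Plain load: the compiler may legally re-load *word when computing
         * __ffs(), so the value tested and the value consumed can differ if
         * the bitmap changes concurrently.
         */
        static unsigned long ffs_plain(const unsigned long *word)
        {
                unsigned long val = *word;

                return val ? __ffs(val) : BITS_PER_LONG;
        }

        /*
         * Volatile-qualified load: exactly one fetch, so the tested and the
         * consumed value are guaranteed to be the same snapshot.
         */
        static unsigned long ffs_once(const unsigned long *word)
        {
                unsigned long val = READ_ONCE(*word);

                return val ? __ffs(val) : BITS_PER_LONG;
        }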
On Sat, Oct 14, 2023 at 04:21:50AM +0200, Mirsad Goran Todorovac wrote: > On 10/14/2023 2:15 AM, Yury Norov wrote: > > Restore LKML > > > > On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote: > > > On Wed 11-10-23 11:26:29, Yury Norov wrote: > > > > Long story short: KCSAN found some potential issues related to how > > > > people use bitmap API. And instead of working through that issues, > > > > the following code shuts down KCSAN by applying READ_ONCE() here > > > > and there. > > > > > > I'm sorry but this is not what the patch does. I'm not sure how to get the > > > message across so maybe let me start from a different angle: > > > > > > Bitmaps are perfectly fine to be used without any external locking if > > > only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are > > > used. This is a significant performance gain compared to using a spinlock > > > or other locking and people do this for a long time. I hope we agree on > > > that. > > > > > > Now it is also common that you need to find a set / clear bit in a bitmap. > > > To maintain lockless protocol and deal with races people employ schemes > > > like (the dumbest form): > > > > > > do { > > > bit = find_first_bit(bitmap, n); > > > if (bit >= n) > > > abort... > > > } while (!test_and_clear_bit(bit, bitmap)); > > > > > > So the code loops until it finds a set bit that is successfully cleared by > > > it. This is perfectly fine and safe lockless code and such use should be > > > supported. Agreed? > > > > Great example. When you're running non-atomic functions concurrently, > > the result may easily become incorrect, and this is what you're > > demonstrating here. > > > > Regarding find_first_bit() it means that: > > - it may erroneously return unset bit; > > - it may erroneously return non-first set bit; > > - it may erroneously return no bits for non-empty bitmap. > > > > Effectively it means that find_first bit may just return a random number. > > > > Let's take another example: > > > > do { > > bit = get_random_number(); > > if (bit >= n) > > abort... > > } while (!test_and_clear_bit(bit, bitmap)); > > > > When running concurrently, the difference between this and your code > > is only in probability of getting set bit somewhere from around the > > beginning of bitmap. > > > > The key point is that find_bit() may return undef even if READ_ONCE() is > > used. If bitmap gets changed anytime in the process, the result becomes > > invalid. It may happen even after returning from find_first_bit(). > > > > And if my understanding correct, your code is designed in the > > assumption that find_first_bit() may return garbage, so handles it > > correctly. > > > > > *Except* that the above actually is not safe due to find_first_bit() > > > implementation and KCSAN warns about that. The problem is that: > > > > > > Assume *addr == 1 > > > CPU1 CPU2 > > > find_first_bit(addr, 64) > > > val = *addr; > > > if (val) -> true > > > clear_bit(0, addr) > > > val = *addr -> compiler decided to refetch addr contents for whatever > > > reason in the generated assembly > > > __ffs(val) -> now executed for value 0 which has undefined results. > > > > Yes, __ffs(0) is undef. But the whole function is undef when accessing > > bitmap concurrently. > > > > > And the READ_ONCE() this patch adds prevents the compiler from adding the > > > refetching of addr into the assembly. > > > > That's true. But it doesn't improve on the situation. It was an undef > > before, and it's undef after, but a 2% slower undef. 
> > > > Now on that KCSAN warning. If I understand things correctly, for the > > example above, KCSAN warning is false-positive, because you're > > intentionally running lockless. > > > > But for some other people it may be a true error, and now they'll have > > no chance to catch it if KCSAN is forced to ignore find_bit() entirely. > > > > We've got the whole class of lockless algorithms that allow safe concurrent > > access to the memory. And now that there's a tool that searches for them > > (concurrent accesses), we need to have an option to somehow teach it > > to suppress irrelevant warnings. Maybe something like this? > > > > lockless_algorithm_begin(bitmap, bitmap_size(nbits)); > > do { > > bit = find_first_bit(bitmap, nbits); > > if (bit >= nbits) > > break; > > } while (!test_and_clear_bit(bit, bitmap)); > > lockless_algorithm_end(bitmap, bitmap_size(nbits)); > > > > And, of course, as I suggested a couple iterations ago, you can invent > > a thread-safe version of find_bit(), that would be perfectly correct > > for lockless use: > > > > unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size) > > { > > unsigned long bit = 0; > > while (!test_and_clear_bit(bit, bitmap) { > > bit = FIND_FIRST_BIT(addr[idx], /* nop */, size); > > if (bit >= size) > > return size; > > } > > > > return bit; > > } > > Hi, Yuri, > > But the code above effectively does the same as the READ_ONCE() macro > as defined in rwonce.h: > > #ifndef __READ_ONCE > #define __READ_ONCE(x) (*(const volatile __unqual_scalar_typeof(x) *)&(x)) > #endif > > #define READ_ONCE(x) \ > ({ \ > compiletime_assert_rwonce_type(x); \ > __READ_ONCE(x); \ > }) > > Both uses only prevent the funny stuff the compiler might have done to the > read of the addr[idx], there's no black magic in READ_ONCE(). > > Both examples would probably result in the same assembly and produce the > same 2% slowdown ... > > Only you declare volatile in one place, and READ_ONCE() in each read, but > this will only compile a bit slower and generate the same machine code. The difference is that find_and_clear_bit() has a semantics of atomic operation. Those who will decide to use it will also anticipate associate downsides. And other hundreds (or thousands) users of non-atomic find_bit() functions will not have to pay extra buck for unneeded atomicity. Check how 'volatile' is used in test_and_clear_bit(), and consider find_and_clear_bit() as a wrapper around test_and_clear_bit(). In other words, this patch suggests to make find_bit() thread-safe by using READ_ONCE(), and it doesn't work. find_and_clear_bit(), on the other hand, is simply a wrapper around test_and_clear_bit(), and doesn't imply any new restriction that test_and_clear_bit() doesn't. Think of it as an optimized version of: while (bit < nbits && !test_and_clear_bit(bit, bitmap) bit++; If you think it's worth to try in your code, I can send a patch for you. Thanks, Yury
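Spelled out with the parenthesis closed, that comparison reads as the linear claim loop below; the FIND_FIRST_BIT()-based helper differs from it only in skipping over already-clear bits without issuing a locked operation for each of them (again a sketch, the name is illustrative):

        /* Claim (clear) the first bit we win atomically, scanning linearly. */
        static inline unsigned long find_and_clear_bit_naive(volatile unsigned long *addr,
                                                             unsigned long nbits)
        {
                unsigned long bit;

                for (bit = 0; bit < nbits; bit++)
                        if (test_and_clear_bit(bit, addr))
                                return bit;

                return nbits;   /* nothing left to claim */
        }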
On 10/14/23 04:53, Yury Norov wrote: > On Sat, Oct 14, 2023 at 04:21:50AM +0200, Mirsad Goran Todorovac wrote: >> On 10/14/2023 2:15 AM, Yury Norov wrote: >>> Restore LKML >>> >>> On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote: >>>> On Wed 11-10-23 11:26:29, Yury Norov wrote: >>>>> Long story short: KCSAN found some potential issues related to how >>>>> people use bitmap API. And instead of working through that issues, >>>>> the following code shuts down KCSAN by applying READ_ONCE() here >>>>> and there. >>>> >>>> I'm sorry but this is not what the patch does. I'm not sure how to get the >>>> message across so maybe let me start from a different angle: >>>> >>>> Bitmaps are perfectly fine to be used without any external locking if >>>> only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are >>>> used. This is a significant performance gain compared to using a spinlock >>>> or other locking and people do this for a long time. I hope we agree on >>>> that. >>>> >>>> Now it is also common that you need to find a set / clear bit in a bitmap. >>>> To maintain lockless protocol and deal with races people employ schemes >>>> like (the dumbest form): >>>> >>>> do { >>>> bit = find_first_bit(bitmap, n); >>>> if (bit >= n) >>>> abort... >>>> } while (!test_and_clear_bit(bit, bitmap)); >>>> >>>> So the code loops until it finds a set bit that is successfully cleared by >>>> it. This is perfectly fine and safe lockless code and such use should be >>>> supported. Agreed? >>> >>> Great example. When you're running non-atomic functions concurrently, >>> the result may easily become incorrect, and this is what you're >>> demonstrating here. >>> >>> Regarding find_first_bit() it means that: >>> - it may erroneously return unset bit; >>> - it may erroneously return non-first set bit; >>> - it may erroneously return no bits for non-empty bitmap. >>> >>> Effectively it means that find_first bit may just return a random number. >>> >>> Let's take another example: >>> >>> do { >>> bit = get_random_number(); >>> if (bit >= n) >>> abort... >>> } while (!test_and_clear_bit(bit, bitmap)); >>> >>> When running concurrently, the difference between this and your code >>> is only in probability of getting set bit somewhere from around the >>> beginning of bitmap. >>> >>> The key point is that find_bit() may return undef even if READ_ONCE() is >>> used. If bitmap gets changed anytime in the process, the result becomes >>> invalid. It may happen even after returning from find_first_bit(). >>> >>> And if my understanding correct, your code is designed in the >>> assumption that find_first_bit() may return garbage, so handles it >>> correctly. >>> >>>> *Except* that the above actually is not safe due to find_first_bit() >>>> implementation and KCSAN warns about that. The problem is that: >>>> >>>> Assume *addr == 1 >>>> CPU1 CPU2 >>>> find_first_bit(addr, 64) >>>> val = *addr; >>>> if (val) -> true >>>> clear_bit(0, addr) >>>> val = *addr -> compiler decided to refetch addr contents for whatever >>>> reason in the generated assembly >>>> __ffs(val) -> now executed for value 0 which has undefined results. >>> >>> Yes, __ffs(0) is undef. But the whole function is undef when accessing >>> bitmap concurrently. >>> >>>> And the READ_ONCE() this patch adds prevents the compiler from adding the >>>> refetching of addr into the assembly. >>> >>> That's true. But it doesn't improve on the situation. It was an undef >>> before, and it's undef after, but a 2% slower undef. >>> >>> Now on that KCSAN warning. 
If I understand things correctly, for the >>> example above, KCSAN warning is false-positive, because you're >>> intentionally running lockless. >>> >>> But for some other people it may be a true error, and now they'll have >>> no chance to catch it if KCSAN is forced to ignore find_bit() entirely. >>> >>> We've got the whole class of lockless algorithms that allow safe concurrent >>> access to the memory. And now that there's a tool that searches for them >>> (concurrent accesses), we need to have an option to somehow teach it >>> to suppress irrelevant warnings. Maybe something like this? >>> >>> lockless_algorithm_begin(bitmap, bitmap_size(nbits)); >>> do { >>> bit = find_first_bit(bitmap, nbits); >>> if (bit >= nbits) >>> break; >>> } while (!test_and_clear_bit(bit, bitmap)); >>> lockless_algorithm_end(bitmap, bitmap_size(nbits)); >>> >>> And, of course, as I suggested a couple iterations ago, you can invent >>> a thread-safe version of find_bit(), that would be perfectly correct >>> for lockless use: >>> >>> unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size) >>> { >>> unsigned long bit = 0; >>> while (!test_and_clear_bit(bit, bitmap) { >>> bit = FIND_FIRST_BIT(addr[idx], /* nop */, size); >>> if (bit >= size) >>> return size; >>> } >>> >>> return bit; >>> } >> >> Hi, Yuri, >> >> But the code above effectively does the same as the READ_ONCE() macro >> as defined in rwonce.h: >> >> #ifndef __READ_ONCE >> #define __READ_ONCE(x) (*(const volatile __unqual_scalar_typeof(x) *)&(x)) >> #endif >> >> #define READ_ONCE(x) \ >> ({ \ >> compiletime_assert_rwonce_type(x); \ >> __READ_ONCE(x); \ >> }) >> >> Both uses only prevent the funny stuff the compiler might have done to the >> read of the addr[idx], there's no black magic in READ_ONCE(). >> >> Both examples would probably result in the same assembly and produce the >> same 2% slowdown ... >> >> Only you declare volatile in one place, and READ_ONCE() in each read, but >> this will only compile a bit slower and generate the same machine code. > > The difference is that find_and_clear_bit() has a semantics of > atomic operation. Those who will decide to use it will also anticipate > associate downsides. And other hundreds (or thousands) users of > non-atomic find_bit() functions will not have to pay extra buck > for unneeded atomicity. > > Check how 'volatile' is used in test_and_clear_bit(), and consider > find_and_clear_bit() as a wrapper around test_and_clear_bit(). > > In other words, this patch suggests to make find_bit() thread-safe by > using READ_ONCE(), and it doesn't work. find_and_clear_bit(), on the > other hand, is simply a wrapper around test_and_clear_bit(), and > doesn't imply any new restriction that test_and_clear_bit() doesn't. > > Think of it as an optimized version of: > while (bit < nbits && !test_and_clear_bit(bit, bitmap) > bit++; > > If you think it's worth to try in your code, I can send a patch for > you. > > Thanks, > Yury After some thinking, your declaration: >>> unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size) OK, this makes "addr" a pointer to a volatile array of unsigned longs. But to this I have an objection: > while (bit < nbits && !test_and_clear_bit(bit, bitmap) > bit++; Note that there is nothing magical in an atomic test_and_clear_bit(): it has to read entire (quad)word, remember the stat of the bit, clear it, and write it back. 
The problem is that the LOCK prefix comes before the assembled instruction LOCK BTR r/m32, imm8 so that it is executed atomically. Otherwise there are no guarantees that another core wouldn't write its own idea of the value. But an atomic test_and_clear_bit() is not a free lunch: you would LOCK the bus for all cores except yours once for each bit, i.e. 32/64 times per (quad)word tested. That is 32/64 times more than optimal, and it looks like a real hog. Do you see the difference? Needless to say, your atomicity works for one bit only, and nothing prevents e.g. core 5 from atomically modifying/setting bits you have already tested and found clear ... Ideally, you would use an atomic ffs() which is compiled to a single (and thus atomic) BSF/BSR instruction on x86. BSR r32, r/m32 (Alternatively you might want BSF in your algorithm, scanning from the least significant bit.) Even better would be if we could also atomically clear that bit in memory and have the instruction return its index. The test is atomic because it is a single instruction, but it can also be prefixed with a LOCK. LOCK BSR reg, mem LOCK BTR mem, reg would unfortunately not work, because something unfortunate could change the memory location in between. But your proposed algorithm is no more atomic than bit = ffs(bitmap); test_and_clear_bit(bit, bitmap); and only less efficient, since you use on average 16/32 bus LOCKs instead of one assembly instruction BSF/BSR. Those LOCK prefixes mean that the entire set of cores is prevented from reading or modifying memory while the bus is LOCKed, or only from writing on smarter architectures with a WRITE LOCK in assembly, as reads won't hurt the process of test_and_clear_bit(), they might only give an out-of-sync value. Whatever fancy C function or macro you write, it cannot outsmart the CPU instruction set if the instruction set doesn't have such an instruction. For x86/x86_64, I couldn't find an atomic instruction that finds the first set bit and atomically clears it, returning its index ... but that doesn't mean nobody else knows of one. Regards, Mirsad
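One way to get close to "one locked operation per claimed bit" without a dedicated instruction is to work a word at a time with cmpxchg(): take a snapshot, compute the candidate bit with __ffs() (well defined because the snapshot is non-zero), and try to clear it atomically, retrying only when another CPU raced on the same word. A sketch under the assumption that nbits is a multiple of BITS_PER_LONG (the function name is illustrative):

        static unsigned long find_and_clear_bit_cmpxchg(unsigned long *addr,
                                                        unsigned long nbits)
        {
                unsigned long idx, old, prev, bit;

                for (idx = 0; idx * BITS_PER_LONG < nbits; idx++) {
                        old = READ_ONCE(addr[idx]);
                        while (old) {
                                bit = __ffs(old);       /* old != 0, so well defined */
                                prev = cmpxchg(&addr[idx], old, old & ~BIT(bit));
                                if (prev == old)        /* we cleared the bit atomically */
                                        return idx * BITS_PER_LONG + bit;
                                old = prev;             /* lost a race; retry with the fresh value */
                        }
                }

                return nbits;                           /* no set bit found */
        }

This keeps the locked bus traffic proportional to the number of bits actually claimed rather than the number of bits tested, which is the cost being objected to above.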
On Fri 13-10-23 17:15:28, Yury Norov wrote: > On Thu, Oct 12, 2023 at 02:21:10PM +0200, Jan Kara wrote: > > On Wed 11-10-23 11:26:29, Yury Norov wrote: > > > Long story short: KCSAN found some potential issues related to how > > > people use bitmap API. And instead of working through that issues, > > > the following code shuts down KCSAN by applying READ_ONCE() here > > > and there. > > > > I'm sorry but this is not what the patch does. I'm not sure how to get the > > message across so maybe let me start from a different angle: > > > > Bitmaps are perfectly fine to be used without any external locking if > > only atomic bit ops (set_bit, clear_bit, test_and_{set/clear}_bit) are > > used. This is a significant performance gain compared to using a spinlock > > or other locking and people do this for a long time. I hope we agree on > > that. > > > > Now it is also common that you need to find a set / clear bit in a bitmap. > > To maintain lockless protocol and deal with races people employ schemes > > like (the dumbest form): > > > > do { > > bit = find_first_bit(bitmap, n); > > if (bit >= n) > > abort... > > } while (!test_and_clear_bit(bit, bitmap)); > > > > So the code loops until it finds a set bit that is successfully cleared by > > it. This is perfectly fine and safe lockless code and such use should be > > supported. Agreed? > > Great example. When you're running non-atomic functions concurrently, > the result may easily become incorrect, and this is what you're > demonstrating here. > > Regarding find_first_bit() it means that: > - it may erroneously return unset bit; > - it may erroneously return non-first set bit; > - it may erroneously return no bits for non-empty bitmap. Correct. > Effectively it means that find_first bit may just return a random number. I prefer to think that it can return a result that is no longer valid by the time we further use it :) > Let's take another example: > > do { > bit = get_random_number(); > if (bit >= n) > abort... > } while (!test_and_clear_bit(bit, bitmap)); > > When running concurrently, the difference between this and your code > is only in probability of getting set bit somewhere from around the > beginning of bitmap. Well, as you say the difference is in the probability - i.e., average number of loops taken is higher with using truly random number and that is the whole point. We bother with complexity of lockless access exactly because of performance :). As long as find_first_bit() returns set bit in case there's no collision with other bitmap modification, we are fine with its results (usually we don't expect the collision to happen, often the bitmap users also employ schemes to spread different processes modifying the bitmap to different parts of the bitmap to further reduce likelyhood of a collision). > The key point is that find_bit() may return undef even if READ_ONCE() is > used. If bitmap gets changed anytime in the process, the result becomes > invalid. It may happen even after returning from find_first_bit(). > > And if my understanding correct, your code is designed in the > assumption that find_first_bit() may return garbage, so handles it > correctly. Yes, that is true. > > *Except* that the above actually is not safe due to find_first_bit() > > implementation and KCSAN warns about that. 
The problem is that: > > > > Assume *addr == 1 > > CPU1 CPU2 > > find_first_bit(addr, 64) > > val = *addr; > > if (val) -> true > > clear_bit(0, addr) > > val = *addr -> compiler decided to refetch addr contents for whatever > > reason in the generated assembly > > __ffs(val) -> now executed for value 0 which has undefined results. > > Yes, __ffs(0) is undef. But the whole function is undef when accessing > bitmap concurrently. So here I think we get at the core of our misunderstanding :): Yes, find_first_bit() may return a bit number that is not set any longer. But it is guaranteed to return some number between 0 and n where n is the bitmap size. What __ffs() does when passed 0 value is unclear and likely will be architecture dependent. If we are guaranteed it returns some number between 0 and 8*sizeof(unsigned long), then we are fine. But I'm concerned it may throw exception (similarly to division by 0) or return number greater than 8*sizeof(unsigned long) for some architecture and that would be a problem. E.g. reading the x86 bsf instruction documentation, the destination register is untouched if there is no set bit so the result can indeed be > 8*sizeof(unsigned long). So __ffs(0) can result in returning a number beyond the end of the bitmap (e.g. 0xffffffff). And that is IMO unacceptable output for find_first_bit(). > > And the READ_ONCE() this patch adds prevents the compiler from adding the > > refetching of addr into the assembly. > > That's true. But it doesn't improve on the situation. It was an undef > before, and it's undef after, but a 2% slower undef. > > Now on that KCSAN warning. If I understand things correctly, for the > example above, KCSAN warning is false-positive, because you're > intentionally running lockless. As I wrote above, there are different levels of "undefinedness" and that matters in this case. KCSAN is complaining that the value passed to __ffs() function may be different one from the one tested in the condition before it. Depending on exact __ffs() behavior this may be fine or it may be not. > But for some other people it may be a true error, and now they'll have > no chance to catch it if KCSAN is forced to ignore find_bit() entirely. I agree some people may accidentally use bitmap function unlocked without properly handling the races. However in this case KCSAN does not warn about unsafe use of the result from find_bit() (which is what should happen for those unsafe uses). It complains about unsafe internal implementation of find_bit() when it is used without external synchronization. These two are different things so I don't think this is a good argument for leaving the race in find_bit(). Furthermore I'd note that READ_ONCE() does not make KCSAN ignore find_bit() completely. READ_ONCE() forces the compiler to use the same value for the test and __ffs() argument (by telling it it cannot assume the standard C memory model using "volatile" keyword for this fetch). That's all. That makes it impossible for KCSAN to inject a modification of the bitmap & refetch from memory inbetween the two uses of the local variable and thus it doesn't generate the warning anymore. > We've got the whole class of lockless algorithms that allow safe concurrent > access to the memory. And now that there's a tool that searches for them > (concurrent accesses), we need to have an option to somehow teach it > to suppress irrelevant warnings. Maybe something like this? 
> > lockless_algorithm_begin(bitmap, bitmap_size(nbits)); > do { > bit = find_first_bit(bitmap, nbits); > if (bit >= nbits) > break; > } while (!test_and_clear_bit(bit, bitmap)); > lockless_algorithm_end(bitmap, bitmap_size(nbits)); > > And, of course, as I suggested a couple iterations ago, you can invent > a thread-safe version of find_bit(), that would be perfectly correct > for lockless use: > > unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long size) > { > unsigned long bit = 0; > > while (!test_and_clear_bit(bit, bitmap) { > bit = FIND_FIRST_BIT(addr[idx], /* nop */, size); > if (bit >= size) > return size; > } > > return bit; > } > > Didn't test that, but I hope 'volatile' specifier should be enough > for compiler to realize that it shouldn't optimize memory access, and > for KCSAN that everything's OK here. Based on my research regarding __ffs() we indeed do need find_*_bit() functions that are guaranteed to return number in 0-n range even in presence of concurrent bitmap modifications. Do I get it right you'd rather prefer cloning all the find_*_bit() implementations to create such variants? IMO that's worse both in terms of maintainability (more code) and usability (users have to be aware special functions are needed for lockless code) so the 2% of performance overhead until gcc is fixed isn't IMO worth it but you are the maintainer... Honza
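The range guarantee discussed above hinges on the tested value and the value handed to __ffs() being the same snapshot. Rendered as a plain function rather than the actual FIND_FIRST_BIT() macro (which additionally handles the swab munging for the _le variants), the guarded lookup is roughly:

        static unsigned long first_bit_snapshot(const unsigned long *addr,
                                                unsigned long size)
        {
                unsigned long idx, val;

                for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
                        val = READ_ONCE(addr[idx]);     /* one load, tested and consumed */
                        if (val)                        /* val != 0, so __ffs(val) < BITS_PER_LONG */
                                return min(idx * BITS_PER_LONG + __ffs(val), size);
                }

                return size;                            /* no set bit in any snapshot */
        }

The result may be stale by the time the caller acts on it, but it is always within [0, size], which is the property being asked for.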
Hi Jan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on linus/master]
[also build test WARNING on v6.6-rc6 next-20231018]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Jan-Kara/lib-find-Make-functions-safe-on-changing-bitmaps/20231011-230553
base: linus/master
patch link: https://lore.kernel.org/r/20231011150252.32737-1-jack%40suse.cz
patch subject: [PATCH 1/2] lib/find: Make functions safe on changing bitmaps
config: i386-randconfig-002-20231018 (https://download.01.org/0day-ci/archive/20231019/202310190005.NqJcRXtK-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231019/202310190005.NqJcRXtK-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202310190005.NqJcRXtK-lkp@intel.com/
All warnings (new ones prefixed by >>):
mm/percpu.c: In function 'pcpu_build_alloc_info':
>> mm/percpu.c:2885:26: warning: array subscript 32 is above array bounds of 'int[32]' [-Warray-bounds]
2885 | group_map[cpu] = group;
| ~~~~~~~~~^~~~~
mm/percpu.c:2842:20: note: while referencing 'group_map'
2842 | static int group_map[NR_CPUS] __initdata;
| ^~~~~~~~~
vim +2885 mm/percpu.c
3c9a024fde58b0 Tejun Heo 2010-09-09 2813
3c9a024fde58b0 Tejun Heo 2010-09-09 2814 /* pcpu_build_alloc_info() is used by both embed and page first chunk */
3c9a024fde58b0 Tejun Heo 2010-09-09 2815 #if defined(BUILD_EMBED_FIRST_CHUNK) || defined(BUILD_PAGE_FIRST_CHUNK)
3c9a024fde58b0 Tejun Heo 2010-09-09 2816 /**
3c9a024fde58b0 Tejun Heo 2010-09-09 2817 * pcpu_build_alloc_info - build alloc_info considering distances between CPUs
3c9a024fde58b0 Tejun Heo 2010-09-09 2818 * @reserved_size: the size of reserved percpu area in bytes
3c9a024fde58b0 Tejun Heo 2010-09-09 2819 * @dyn_size: minimum free size for dynamic allocation in bytes
3c9a024fde58b0 Tejun Heo 2010-09-09 2820 * @atom_size: allocation atom size
3c9a024fde58b0 Tejun Heo 2010-09-09 2821 * @cpu_distance_fn: callback to determine distance between cpus, optional
3c9a024fde58b0 Tejun Heo 2010-09-09 2822 *
3c9a024fde58b0 Tejun Heo 2010-09-09 2823 * This function determines grouping of units, their mappings to cpus
3c9a024fde58b0 Tejun Heo 2010-09-09 2824 * and other parameters considering needed percpu size, allocation
3c9a024fde58b0 Tejun Heo 2010-09-09 2825 * atom size and distances between CPUs.
3c9a024fde58b0 Tejun Heo 2010-09-09 2826 *
bffc4375897ea0 Yannick Guerrini 2015-03-06 2827 * Groups are always multiples of atom size and CPUs which are of
3c9a024fde58b0 Tejun Heo 2010-09-09 2828 * LOCAL_DISTANCE both ways are grouped together and share space for
3c9a024fde58b0 Tejun Heo 2010-09-09 2829 * units in the same group. The returned configuration is guaranteed
3c9a024fde58b0 Tejun Heo 2010-09-09 2830 * to have CPUs on different nodes on different groups and >=75% usage
3c9a024fde58b0 Tejun Heo 2010-09-09 2831 * of allocated virtual address space.
3c9a024fde58b0 Tejun Heo 2010-09-09 2832 *
3c9a024fde58b0 Tejun Heo 2010-09-09 2833 * RETURNS:
3c9a024fde58b0 Tejun Heo 2010-09-09 2834 * On success, pointer to the new allocation_info is returned. On
3c9a024fde58b0 Tejun Heo 2010-09-09 2835 * failure, ERR_PTR value is returned.
3c9a024fde58b0 Tejun Heo 2010-09-09 2836 */
258e0815e2b170 Dennis Zhou 2021-02-14 2837 static struct pcpu_alloc_info * __init __flatten pcpu_build_alloc_info(
3c9a024fde58b0 Tejun Heo 2010-09-09 2838 size_t reserved_size, size_t dyn_size,
3c9a024fde58b0 Tejun Heo 2010-09-09 2839 size_t atom_size,
3c9a024fde58b0 Tejun Heo 2010-09-09 2840 pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
3c9a024fde58b0 Tejun Heo 2010-09-09 2841 {
3c9a024fde58b0 Tejun Heo 2010-09-09 2842 static int group_map[NR_CPUS] __initdata;
3c9a024fde58b0 Tejun Heo 2010-09-09 2843 static int group_cnt[NR_CPUS] __initdata;
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2844 static struct cpumask mask __initdata;
3c9a024fde58b0 Tejun Heo 2010-09-09 2845 const size_t static_size = __per_cpu_end - __per_cpu_start;
3c9a024fde58b0 Tejun Heo 2010-09-09 2846 int nr_groups = 1, nr_units = 0;
3c9a024fde58b0 Tejun Heo 2010-09-09 2847 size_t size_sum, min_unit_size, alloc_size;
3f649ab728cda8 Kees Cook 2020-06-03 2848 int upa, max_upa, best_upa; /* units_per_alloc */
3c9a024fde58b0 Tejun Heo 2010-09-09 2849 int last_allocs, group, unit;
3c9a024fde58b0 Tejun Heo 2010-09-09 2850 unsigned int cpu, tcpu;
3c9a024fde58b0 Tejun Heo 2010-09-09 2851 struct pcpu_alloc_info *ai;
3c9a024fde58b0 Tejun Heo 2010-09-09 2852 unsigned int *cpu_map;
3c9a024fde58b0 Tejun Heo 2010-09-09 2853
3c9a024fde58b0 Tejun Heo 2010-09-09 2854 /* this function may be called multiple times */
3c9a024fde58b0 Tejun Heo 2010-09-09 2855 memset(group_map, 0, sizeof(group_map));
3c9a024fde58b0 Tejun Heo 2010-09-09 2856 memset(group_cnt, 0, sizeof(group_cnt));
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2857 cpumask_clear(&mask);
3c9a024fde58b0 Tejun Heo 2010-09-09 2858
3c9a024fde58b0 Tejun Heo 2010-09-09 2859 /* calculate size_sum and ensure dyn_size is enough for early alloc */
3c9a024fde58b0 Tejun Heo 2010-09-09 2860 size_sum = PFN_ALIGN(static_size + reserved_size +
3c9a024fde58b0 Tejun Heo 2010-09-09 2861 max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE));
3c9a024fde58b0 Tejun Heo 2010-09-09 2862 dyn_size = size_sum - static_size - reserved_size;
3c9a024fde58b0 Tejun Heo 2010-09-09 2863
3c9a024fde58b0 Tejun Heo 2010-09-09 2864 /*
3c9a024fde58b0 Tejun Heo 2010-09-09 2865 * Determine min_unit_size, alloc_size and max_upa such that
3c9a024fde58b0 Tejun Heo 2010-09-09 2866 * alloc_size is multiple of atom_size and is the smallest
25985edcedea63 Lucas De Marchi 2011-03-30 2867 * which can accommodate 4k aligned segments which are equal to
3c9a024fde58b0 Tejun Heo 2010-09-09 2868 * or larger than min_unit_size.
3c9a024fde58b0 Tejun Heo 2010-09-09 2869 */
3c9a024fde58b0 Tejun Heo 2010-09-09 2870 min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
3c9a024fde58b0 Tejun Heo 2010-09-09 2871
9c01516278ef87 Dennis Zhou (Facebook 2017-07-15 2872) /* determine the maximum # of units that can fit in an allocation */
3c9a024fde58b0 Tejun Heo 2010-09-09 2873 alloc_size = roundup(min_unit_size, atom_size);
3c9a024fde58b0 Tejun Heo 2010-09-09 2874 upa = alloc_size / min_unit_size;
f09f1243ca2d5d Alexander Kuleshov 2015-11-05 2875 while (alloc_size % upa || (offset_in_page(alloc_size / upa)))
3c9a024fde58b0 Tejun Heo 2010-09-09 2876 upa--;
3c9a024fde58b0 Tejun Heo 2010-09-09 2877 max_upa = upa;
3c9a024fde58b0 Tejun Heo 2010-09-09 2878
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2879 cpumask_copy(&mask, cpu_possible_mask);
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2880
3c9a024fde58b0 Tejun Heo 2010-09-09 2881 /* group cpus according to their proximity */
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2882 for (group = 0; !cpumask_empty(&mask); group++) {
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2883 /* pop the group's first cpu */
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2884 cpu = cpumask_first(&mask);
3c9a024fde58b0 Tejun Heo 2010-09-09 @2885 group_map[cpu] = group;
3c9a024fde58b0 Tejun Heo 2010-09-09 2886 group_cnt[group]++;
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2887 cpumask_clear_cpu(cpu, &mask);
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2888
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2889 for_each_cpu(tcpu, &mask) {
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2890 if (!cpu_distance_fn ||
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2891 (cpu_distance_fn(cpu, tcpu) == LOCAL_DISTANCE &&
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2892 cpu_distance_fn(tcpu, cpu) == LOCAL_DISTANCE)) {
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2893 group_map[tcpu] = group;
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2894 group_cnt[group]++;
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2895 cpumask_clear_cpu(tcpu, &mask);
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2896 }
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2897 }
3c9a024fde58b0 Tejun Heo 2010-09-09 2898 }
d7d29ac76f7efb Wonhyuk Yang 2020-10-30 2899 nr_groups = group;
3c9a024fde58b0 Tejun Heo 2010-09-09 2900
3c9a024fde58b0 Tejun Heo 2010-09-09 2901 /*
9c01516278ef87 Dennis Zhou (Facebook 2017-07-15 2902) * Wasted space is caused by a ratio imbalance of upa to group_cnt.
9c01516278ef87 Dennis Zhou (Facebook 2017-07-15 2903) * Expand the unit_size until we use >= 75% of the units allocated.
9c01516278ef87 Dennis Zhou (Facebook 2017-07-15 2904) * Related to atom_size, which could be much larger than the unit_size.
3c9a024fde58b0 Tejun Heo 2010-09-09 2905 */
3c9a024fde58b0 Tejun Heo 2010-09-09 2906 last_allocs = INT_MAX;
4829c791b22f98 Dennis Zhou 2021-06-14 2907 best_upa = 0;
3c9a024fde58b0 Tejun Heo 2010-09-09 2908 for (upa = max_upa; upa; upa--) {
3c9a024fde58b0 Tejun Heo 2010-09-09 2909 int allocs = 0, wasted = 0;
3c9a024fde58b0 Tejun Heo 2010-09-09 2910
f09f1243ca2d5d Alexander Kuleshov 2015-11-05 2911 if (alloc_size % upa || (offset_in_page(alloc_size / upa)))
3c9a024fde58b0 Tejun Heo 2010-09-09 2912 continue;
3c9a024fde58b0 Tejun Heo 2010-09-09 2913
3c9a024fde58b0 Tejun Heo 2010-09-09 2914 for (group = 0; group < nr_groups; group++) {
3c9a024fde58b0 Tejun Heo 2010-09-09 2915 int this_allocs = DIV_ROUND_UP(group_cnt[group], upa);
3c9a024fde58b0 Tejun Heo 2010-09-09 2916 allocs += this_allocs;
3c9a024fde58b0 Tejun Heo 2010-09-09 2917 wasted += this_allocs * upa - group_cnt[group];
3c9a024fde58b0 Tejun Heo 2010-09-09 2918 }
3c9a024fde58b0 Tejun Heo 2010-09-09 2919
3c9a024fde58b0 Tejun Heo 2010-09-09 2920 /*
3c9a024fde58b0 Tejun Heo 2010-09-09 2921 * Don't accept if wastage is over 1/3. The
3c9a024fde58b0 Tejun Heo 2010-09-09 2922 * greater-than comparison ensures upa==1 always
3c9a024fde58b0 Tejun Heo 2010-09-09 2923 * passes the following check.
3c9a024fde58b0 Tejun Heo 2010-09-09 2924 */
3c9a024fde58b0 Tejun Heo 2010-09-09 2925 if (wasted > num_possible_cpus() / 3)
3c9a024fde58b0 Tejun Heo 2010-09-09 2926 continue;
3c9a024fde58b0 Tejun Heo 2010-09-09 2927
3c9a024fde58b0 Tejun Heo 2010-09-09 2928 /* and then don't consume more memory */
3c9a024fde58b0 Tejun Heo 2010-09-09 2929 if (allocs > last_allocs)
3c9a024fde58b0 Tejun Heo 2010-09-09 2930 break;
3c9a024fde58b0 Tejun Heo 2010-09-09 2931 last_allocs = allocs;
3c9a024fde58b0 Tejun Heo 2010-09-09 2932 best_upa = upa;
3c9a024fde58b0 Tejun Heo 2010-09-09 2933 }
4829c791b22f98 Dennis Zhou 2021-06-14 2934 BUG_ON(!best_upa);
3c9a024fde58b0 Tejun Heo 2010-09-09 2935 upa = best_upa;
3c9a024fde58b0 Tejun Heo 2010-09-09 2936
3c9a024fde58b0 Tejun Heo 2010-09-09 2937 /* allocate and fill alloc_info */
3c9a024fde58b0 Tejun Heo 2010-09-09 2938 for (group = 0; group < nr_groups; group++)
3c9a024fde58b0 Tejun Heo 2010-09-09 2939 nr_units += roundup(group_cnt[group], upa);
3c9a024fde58b0 Tejun Heo 2010-09-09 2940
3c9a024fde58b0 Tejun Heo 2010-09-09 2941 ai = pcpu_alloc_alloc_info(nr_groups, nr_units);
3c9a024fde58b0 Tejun Heo 2010-09-09 2942 if (!ai)
3c9a024fde58b0 Tejun Heo 2010-09-09 2943 return ERR_PTR(-ENOMEM);
3c9a024fde58b0 Tejun Heo 2010-09-09 2944 cpu_map = ai->groups[0].cpu_map;
3c9a024fde58b0 Tejun Heo 2010-09-09 2945
3c9a024fde58b0 Tejun Heo 2010-09-09 2946 for (group = 0; group < nr_groups; group++) {
3c9a024fde58b0 Tejun Heo 2010-09-09 2947 ai->groups[group].cpu_map = cpu_map;
3c9a024fde58b0 Tejun Heo 2010-09-09 2948 cpu_map += roundup(group_cnt[group], upa);
3c9a024fde58b0 Tejun Heo 2010-09-09 2949 }
3c9a024fde58b0 Tejun Heo 2010-09-09 2950
3c9a024fde58b0 Tejun Heo 2010-09-09 2951 ai->static_size = static_size;
3c9a024fde58b0 Tejun Heo 2010-09-09 2952 ai->reserved_size = reserved_size;
3c9a024fde58b0 Tejun Heo 2010-09-09 2953 ai->dyn_size = dyn_size;
3c9a024fde58b0 Tejun Heo 2010-09-09 2954 ai->unit_size = alloc_size / upa;
3c9a024fde58b0 Tejun Heo 2010-09-09 2955 ai->atom_size = atom_size;
3c9a024fde58b0 Tejun Heo 2010-09-09 2956 ai->alloc_size = alloc_size;
3c9a024fde58b0 Tejun Heo 2010-09-09 2957
2de7852fe9096e Peng Fan 2019-02-20 2958 for (group = 0, unit = 0; group < nr_groups; group++) {
3c9a024fde58b0 Tejun Heo 2010-09-09 2959 struct pcpu_group_info *gi = &ai->groups[group];
3c9a024fde58b0 Tejun Heo 2010-09-09 2960
3c9a024fde58b0 Tejun Heo 2010-09-09 2961 /*
3c9a024fde58b0 Tejun Heo 2010-09-09 2962 * Initialize base_offset as if all groups are located
3c9a024fde58b0 Tejun Heo 2010-09-09 2963 * back-to-back. The caller should update this to
3c9a024fde58b0 Tejun Heo 2010-09-09 2964 * reflect actual allocation.
3c9a024fde58b0 Tejun Heo 2010-09-09 2965 */
3c9a024fde58b0 Tejun Heo 2010-09-09 2966 gi->base_offset = unit * ai->unit_size;
3c9a024fde58b0 Tejun Heo 2010-09-09 2967
3c9a024fde58b0 Tejun Heo 2010-09-09 2968 for_each_possible_cpu(cpu)
3c9a024fde58b0 Tejun Heo 2010-09-09 2969 if (group_map[cpu] == group)
3c9a024fde58b0 Tejun Heo 2010-09-09 2970 gi->cpu_map[gi->nr_units++] = cpu;
3c9a024fde58b0 Tejun Heo 2010-09-09 2971 gi->nr_units = roundup(gi->nr_units, upa);
3c9a024fde58b0 Tejun Heo 2010-09-09 2972 unit += gi->nr_units;
3c9a024fde58b0 Tejun Heo 2010-09-09 2973 }
3c9a024fde58b0 Tejun Heo 2010-09-09 2974 BUG_ON(unit != nr_units);
3c9a024fde58b0 Tejun Heo 2010-09-09 2975
3c9a024fde58b0 Tejun Heo 2010-09-09 2976 return ai;
3c9a024fde58b0 Tejun Heo 2010-09-09 2977 }
23f917169ef157 Kefeng Wang 2022-01-19 2978
Hello, kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on: commit: df671b17195cd6526e029c70d04dfb72561082d7 ("[PATCH 1/2] lib/find: Make functions safe on changing bitmaps") url: https://github.com/intel-lab-lkp/linux/commits/Jan-Kara/lib-find-Make-functions-safe-on-changing-bitmaps/20231011-230553 base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 1c8b86a3799f7e5be903c3f49fcdaee29fd385b5 patch link: https://lore.kernel.org/all/20231011150252.32737-1-jack@suse.cz/ patch subject: [PATCH 1/2] lib/find: Make functions safe on changing bitmaps testcase: will-it-scale test machine: 104 threads 2 sockets (Skylake) with 192G memory parameters: nr_task: 50% mode: thread test: tlb_flush3 cpufreq_governor: performance Details are as below: --------------------------------------------------------------------------------------------------> The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20231025/202310251458.48b4452d-oliver.sang@intel.com ========================================================================================= compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase: gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/tlb_flush3/will-it-scale commit: 1c8b86a379 ("Merge tag 'xsa441-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip") df671b1719 ("lib/find: Make functions safe on changing bitmaps") 1c8b86a3799f7e5b df671b17195cd6526e029c70d04 ---------------- --------------------------- %stddev %change %stddev \ | \ 0.14 ± 19% +36.9% 0.19 ± 17% perf-sched.wait_time.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write_killable.__vm_munmap 2.26e+08 +3.6% 2.343e+08 proc-vmstat.pgfault 0.04 +25.0% 0.05 turbostat.IPC 32666 -15.5% 27605 ± 2% turbostat.POLL 7856 +2.2% 8025 vmstat.system.cs 6331931 +2.3% 6478704 vmstat.system.in 700119 +3.7% 725931 will-it-scale.52.threads 13463 +3.7% 13959 will-it-scale.per_thread_ops 700119 +3.7% 725931 will-it-scale.workload 8.36 -7.3% 7.74 perf-stat.i.MPKI 4.591e+09 +3.4% 4.747e+09 perf-stat.i.branch-instructions 1.832e+08 +2.8% 1.883e+08 perf-stat.i.branch-misses 26.70 -0.3 26.40 perf-stat.i.cache-miss-rate% 7852 +2.2% 8021 perf-stat.i.context-switches 6.43 -7.2% 5.97 perf-stat.i.cpi 769.61 +1.8% 783.29 perf-stat.i.cpu-migrations 6.39e+09 +3.4% 6.606e+09 perf-stat.i.dTLB-loads 2.94e+09 +3.2% 3.035e+09 perf-stat.i.dTLB-stores 78.29 -0.9 77.44 perf-stat.i.iTLB-load-miss-rate% 18959450 +3.5% 19621273 perf-stat.i.iTLB-load-misses 5254435 +8.7% 5713444 perf-stat.i.iTLB-loads 2.236e+10 +7.7% 2.408e+10 perf-stat.i.instructions 1181 +4.0% 1228 perf-stat.i.instructions-per-iTLB-miss 0.16 +7.7% 0.17 perf-stat.i.ipc 0.02 ± 36% -49.6% 0.01 ± 53% perf-stat.i.major-faults 485.08 +3.0% 499.67 perf-stat.i.metric.K/sec 141.71 +3.2% 146.25 perf-stat.i.metric.M/sec 747997 +3.7% 775416 perf-stat.i.minor-faults 3127957 -13.9% 2693728 perf-stat.i.node-loads 26089697 +3.4% 26965335 perf-stat.i.node-store-misses 767569 +3.7% 796095 perf-stat.i.node-stores 747997 +3.7% 775416 perf-stat.i.page-faults 8.35 -7.3% 7.74 perf-stat.overall.MPKI 26.70 -0.3 26.40 perf-stat.overall.cache-miss-rate% 6.43 -7.1% 5.97 perf-stat.overall.cpi 78.30 -0.9 77.45 perf-stat.overall.iTLB-load-miss-rate% 1179 +4.0% 1226 perf-stat.overall.instructions-per-iTLB-miss 0.16 +7.7% 0.17 perf-stat.overall.ipc 9644584 +3.8% 10011125 perf-stat.overall.path-length 4.575e+09 +3.4% 4.731e+09 
perf-stat.ps.branch-instructions 1.825e+08 +2.8% 1.876e+08 perf-stat.ps.branch-misses 7825 +2.2% 7995 perf-stat.ps.context-switches 767.16 +1.8% 780.76 perf-stat.ps.cpu-migrations 6.368e+09 +3.4% 6.583e+09 perf-stat.ps.dTLB-loads 2.93e+09 +3.2% 3.025e+09 perf-stat.ps.dTLB-stores 18896725 +3.5% 19555325 perf-stat.ps.iTLB-load-misses 5236456 +8.7% 5693636 perf-stat.ps.iTLB-loads 2.229e+10 +7.6% 2.399e+10 perf-stat.ps.instructions 745423 +3.7% 772705 perf-stat.ps.minor-faults 3117663 -13.9% 2684861 perf-stat.ps.node-loads 26002765 +3.4% 26875267 perf-stat.ps.node-store-misses 764789 +3.7% 793098 perf-stat.ps.node-stores 745423 +3.7% 772705 perf-stat.ps.page-faults 6.752e+12 +7.6% 7.267e+12 perf-stat.total.instructions 19.21 -1.0 18.18 perf-profile.calltrace.cycles-pp.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu 17.00 -0.9 16.09 perf-profile.calltrace.cycles-pp.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range 65.30 -0.6 64.69 perf-profile.calltrace.cycles-pp.zap_page_range_single.madvise_vma_behavior.do_madvise.__x64_sys_madvise.do_syscall_64 65.34 -0.6 64.75 perf-profile.calltrace.cycles-pp.madvise_vma_behavior.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe 65.98 -0.5 65.45 perf-profile.calltrace.cycles-pp.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise 65.96 -0.5 65.42 perf-profile.calltrace.cycles-pp.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise 9.72 ± 2% -0.5 9.20 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range 66.33 -0.5 65.81 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise 66.46 -0.5 65.95 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__madvise 31.88 -0.4 31.43 perf-profile.calltrace.cycles-pp.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu.zap_page_range_single 67.72 -0.4 67.28 perf-profile.calltrace.cycles-pp.__madvise 32.15 -0.4 31.73 perf-profile.calltrace.cycles-pp.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu.zap_page_range_single.madvise_vma_behavior 32.60 -0.4 32.21 perf-profile.calltrace.cycles-pp.flush_tlb_mm_range.tlb_finish_mmu.zap_page_range_single.madvise_vma_behavior.do_madvise 32.93 -0.3 32.58 perf-profile.calltrace.cycles-pp.tlb_finish_mmu.zap_page_range_single.madvise_vma_behavior.do_madvise.__x64_sys_madvise 31.07 -0.3 30.74 perf-profile.calltrace.cycles-pp.flush_tlb_mm_range.zap_pte_range.zap_pmd_range.unmap_page_range.zap_page_range_single 31.58 -0.3 31.28 perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.zap_page_range_single.madvise_vma_behavior 31.61 -0.3 31.30 perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.zap_page_range_single.madvise_vma_behavior.do_madvise 31.80 -0.3 31.51 perf-profile.calltrace.cycles-pp.unmap_page_range.zap_page_range_single.madvise_vma_behavior.do_madvise.__x64_sys_madvise 8.34 -0.1 8.22 perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask 8.06 -0.1 7.95 perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond 7.98 -0.1 7.87 
perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.llist_add_batch 0.59 ± 3% +0.1 0.65 ± 2% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.testcase 1.46 +0.1 1.53 perf-profile.calltrace.cycles-pp.filemap_map_pages.do_read_fault.do_fault.__handle_mm_fault.handle_mm_fault 1.48 +0.1 1.55 perf-profile.calltrace.cycles-pp.do_read_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault 1.53 +0.1 1.62 perf-profile.calltrace.cycles-pp.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault 2.92 +0.1 3.02 perf-profile.calltrace.cycles-pp.flush_tlb_func.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function 1.26 ± 2% +0.1 1.36 perf-profile.calltrace.cycles-pp.default_send_IPI_mask_sequence_phys.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range 1.84 +0.1 1.96 perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault 7.87 +0.1 8.00 perf-profile.calltrace.cycles-pp.llist_reverse_order.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function 2.03 ± 2% +0.1 2.17 perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase 2.90 +0.2 3.06 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range 2.62 ± 3% +0.2 2.80 perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.testcase 2.58 ± 3% +0.2 2.76 perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase 2.95 ± 3% +0.2 3.14 perf-profile.calltrace.cycles-pp.asm_exc_page_fault.testcase 2.75 +0.2 2.94 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu 4.96 +0.3 5.29 perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask 4.92 +0.3 5.25 perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond 5.13 +0.3 5.46 perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range 5.08 +0.4 5.44 perf-profile.calltrace.cycles-pp.testcase 37.25 -2.0 35.24 perf-profile.children.cycles-pp.llist_add_batch 62.82 -0.8 62.04 perf-profile.children.cycles-pp.on_each_cpu_cond_mask 62.82 -0.8 62.04 perf-profile.children.cycles-pp.smp_call_function_many_cond 63.70 -0.7 62.98 perf-profile.children.cycles-pp.flush_tlb_mm_range 65.30 -0.6 64.70 perf-profile.children.cycles-pp.zap_page_range_single 65.34 -0.6 64.75 perf-profile.children.cycles-pp.madvise_vma_behavior 65.98 -0.5 65.45 perf-profile.children.cycles-pp.__x64_sys_madvise 65.96 -0.5 65.43 perf-profile.children.cycles-pp.do_madvise 66.52 -0.5 66.01 perf-profile.children.cycles-pp.do_syscall_64 66.65 -0.5 66.16 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe 67.79 -0.4 67.36 perf-profile.children.cycles-pp.__madvise 32.94 -0.3 32.60 perf-profile.children.cycles-pp.tlb_finish_mmu 31.74 -0.3 31.43 perf-profile.children.cycles-pp.zap_pte_range 31.76 -0.3 31.46 perf-profile.children.cycles-pp.zap_pmd_range 31.95 -0.3 31.66 
perf-profile.children.cycles-pp.unmap_page_range 0.42 ± 2% +0.0 0.46 perf-profile.children.cycles-pp.error_entry 0.20 ± 3% +0.0 0.24 ± 5% perf-profile.children.cycles-pp.up_read 0.69 +0.0 0.74 perf-profile.children.cycles-pp.native_flush_tlb_local 1.47 +0.1 1.55 perf-profile.children.cycles-pp.filemap_map_pages 1.48 +0.1 1.56 perf-profile.children.cycles-pp.do_read_fault 1.54 +0.1 1.62 perf-profile.children.cycles-pp.do_fault 2.75 +0.1 2.86 perf-profile.children.cycles-pp.default_send_IPI_mask_sequence_phys 1.85 +0.1 1.98 perf-profile.children.cycles-pp.__handle_mm_fault 2.04 ± 2% +0.1 2.18 perf-profile.children.cycles-pp.handle_mm_fault 2.63 ± 3% +0.2 2.81 perf-profile.children.cycles-pp.exc_page_fault 2.62 ± 3% +0.2 2.80 perf-profile.children.cycles-pp.do_user_addr_fault 3.24 ± 3% +0.2 3.44 perf-profile.children.cycles-pp.asm_exc_page_fault 3.83 +0.2 4.04 perf-profile.children.cycles-pp.flush_tlb_func 0.69 ± 2% +0.2 0.92 perf-profile.children.cycles-pp._find_next_bit 9.92 +0.3 10.23 perf-profile.children.cycles-pp.llist_reverse_order 5.45 +0.4 5.81 perf-profile.children.cycles-pp.testcase 18.42 +0.5 18.96 perf-profile.children.cycles-pp.asm_sysvec_call_function 16.24 +0.5 16.78 perf-profile.children.cycles-pp.__flush_smp_call_function_queue 15.78 +0.5 16.32 perf-profile.children.cycles-pp.__sysvec_call_function 16.36 +0.5 16.90 perf-profile.children.cycles-pp.sysvec_call_function 27.92 -1.9 26.04 perf-profile.self.cycles-pp.llist_add_batch 0.16 ± 2% +0.0 0.18 ± 4% perf-profile.self.cycles-pp.up_read 0.42 ± 2% +0.0 0.45 perf-profile.self.cycles-pp.error_entry 0.21 ± 4% +0.0 0.24 ± 5% perf-profile.self.cycles-pp.down_read 0.26 ± 2% +0.0 0.29 ± 3% perf-profile.self.cycles-pp.tlb_finish_mmu 2.01 +0.0 2.05 perf-profile.self.cycles-pp.default_send_IPI_mask_sequence_phys 0.68 +0.0 0.73 perf-profile.self.cycles-pp.native_flush_tlb_local 3.10 +0.2 3.26 perf-profile.self.cycles-pp.flush_tlb_func 0.50 ± 2% +0.2 0.68 perf-profile.self.cycles-pp._find_next_bit 9.92 +0.3 10.22 perf-profile.self.cycles-pp.llist_reverse_order 16.10 +0.5 16.64 perf-profile.self.cycles-pp.smp_call_function_many_cond Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
On 25/10/2023 09.18, kernel test robot wrote: > > > Hello, > > kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on: So with that, can we please just finally say "yeah, let's make the generic bitmap library functions correct and usable in more cases" instead of worrying about random micro-benchmarks that just show you-win-some-you-lose-some. Yes, users will have to treat results from the find routines carefully if their bitmap may be concurrently modified. They do. Nobody wins if those users are forced to implement their own bitmap routines for their lockless algorithms. Rasmus
On Wed, Oct 25, 2023 at 10:18:00AM +0200, Rasmus Villemoes wrote:
> On 25/10/2023 09.18, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:
>
> So with that, can we please just finally say "yeah, let's make the
> generic bitmap library functions correct

They are all correct already.

> and usable in more cases"

See below.

> instead of worrying about random micro-benchmarks that just show
> you-win-some-you-lose-some.

With that I agree. I don't worry about either a +2% or a -3% benchmark,
and I don't think that they alone can justify such a radical change as
making all find_bit functions volatile and shutting down a newborn KCSAN.

Keeping that in mind, my best guess is that Jan's and Mirsad's test that
shows +2% was run against stable bitmaps, and that what the robot measured
is most likely heavily concurrent access to some bitmap in the kernel.

I didn't look at the sources of either test, but that at least makes some
sense, because if GCC optimizes code against properly described memory
correctly, this is exactly what we can expect.

> Yes, users will have to treat results from the find routines carefully
> if their bitmap may be concurrently modified. They do. Nobody wins if
> those users are forced to implement their own bitmap routines for their
> lockless algorithms.

Again, I agree with this point, and I'm trying to address exactly this.

I'm working on a series that introduces lockless find_bit functions
based on the existing FIND_BIT() engine. It's not ready yet, but I hope
I'll submit it in the next merge window.

https://github.com/norov/linux/commits/find_and_bit

Now that we've got a test that presumably works faster if the find_bit()
functions are all switched to be volatile, it would be great to get into
the details and understand:
 - which find_bit function or functions give that gain in performance;
 - on what bitmap(s);
 - whether the reason is concurrent memory access (my guess: yes), and if so,
 - whether we can refactor the code to use the lockless find_and_bit()
   functions mentioned above;
 - if not, how else we can address this.

If you or someone else has a spare time slot to dig deeper into that,
I'll be really thankful.

Thanks,
Yury
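[The branch above is the place to look for the real series. Purely as an
illustration of the idea, and not Yury's actual implementation, a lockless
"find and set" can be built from the existing primitives as a
search-then-claim retry loop; my_find_and_set_bit is a made-up name for
this sketch.]

	#include <linux/bitops.h>	/* test_and_set_bit() */
	#include <linux/find.h>		/* find_first_zero_bit() */

	/* Sketch only: search non-atomically, claim atomically, retry on a lost race. */
	static inline unsigned long my_find_and_set_bit(unsigned long *addr,
							unsigned long nbits)
	{
		unsigned long bit;

		do {
			bit = find_first_zero_bit(addr, nbits);
			if (bit >= nbits)
				return nbits;		/* no free bit found */
		} while (test_and_set_bit(bit, addr));	/* somebody else won: search again */

		return bit;
	}

[The point of the pattern is that the caller only ever gets a bit it
actually owns, so a stale value seen by the search costs nothing worse
than a retry.]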
On Thu 26-10-23 20:51:22, Yury Norov wrote:
> On Wed, Oct 25, 2023 at 10:18:00AM +0200, Rasmus Villemoes wrote:
> > Yes, users will have to treat results from the find routines carefully
> > if their bitmap may be concurrently modified. They do. Nobody wins if
> > those users are forced to implement their own bitmap routines for their
> > lockless algorithms.
>
> Again, I agree with this point, and I'm trying to address exactly this.
>
> I'm working on a series that introduces lockless find_bit functions
> based on the existing FIND_BIT() engine. It's not ready yet, but I hope
> I'll submit it in the next merge window.
>
> https://github.com/norov/linux/commits/find_and_bit

I agree that the find_and_{set|clear}_bit() functions are useful and that
we'll be able to remove some boilerplate code with them. But also note
that you will need to duplicate practically all of the bitmap API to
provide similar "atomic" functionality. For example, the sbitmap
conversion you have in your branch has a bug in that it drops the 'lock'
memory ordering from the bitmap manipulation. So you need something like
find_and_set_bit_lock() (which you already have) and
find_and_set_bit_wrap_lock() (which you don't have yet). If you are to
convert the bitmap code in filesystems (some of which is lockless as
well), you will need to add little-endian and big-endian variants of the
volatile bitmap functions. Finally, there are users like lib/xarray.c
which don't want to set or clear the found bit at all (we just want to
quickly find a good guess for a set bit in the bitmap and then verify in
another structure whether the guess was right or not), so we'll also need
volatile variants of plain find_first_bit() and find_next_bit(). And once
we have variants of bitmap functions that are safe wrt parallel changes
and variants that are not, the function names should probably indicate
which are which. So as much as I agree that your solution is
theoretically the cleanest, I personally don't think the cost in terms of
code duplication and code churn is really worth it.

> Now that we've got a test that presumably works faster if the find_bit()
> functions are all switched to be volatile, it would be great to get into
> the details and understand:
>  - which find_bit function or functions give that gain in performance;
>  - on what bitmap(s);
>  - whether the reason is concurrent memory access (my guess: yes), and if so,
>  - whether we can refactor the code to use the lockless find_and_bit()
>    functions mentioned above;
>  - if not, how else we can address this.

Frankly, I don't think there's any substantial reason why the volatile
or the non-volatile code should be faster. The compiler team looked at
the x86 disassembly and said both variants should be the same speed
based on instruction costs. What could be causing these small
performance differences is that the resulting code is laid out slightly
differently (the volatile bitmap functions end up being somewhat larger),
and that somehow interferes with instruction caching or CPU-internal
out-of-order execution.

								Honza
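[To make the 'lock' ordering point above concrete, here is a minimal
sketch with made-up slot names (this is not sbitmap code). When the found
bit guards an object, the claiming atomic needs acquire semantics and the
clearing atomic needs release semantics, which the kernel already spells
test_and_set_bit_lock()/clear_bit_unlock().]

	#include <linux/bitops.h>	/* test_and_set_bit_lock(), clear_bit_unlock() */

	struct my_slot {
		void *data;
	};

	static struct my_slot my_slots[BITS_PER_LONG];
	static unsigned long my_slot_used[1];

	/* Claim slot i: the lock-flavoured atomic orders later slot accesses after the claim. */
	static struct my_slot *my_slot_get(unsigned int i)
	{
		if (test_and_set_bit_lock(i, my_slot_used))
			return NULL;		/* already owned by someone else */
		return &my_slots[i];
	}

	/* Release slot i: prior stores to the slot are visible before the bit clears. */
	static void my_slot_put(unsigned int i)
	{
		clear_bit_unlock(i, my_slot_used);
	}

[A find-and-set helper built on plain test_and_set_bit() would silently
lose this ordering, which appears to be the kind of bug Jan points out in
the sbitmap conversion.]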
On 10/27/2023 5:51 AM, Yury Norov wrote:
> On Wed, Oct 25, 2023 at 10:18:00AM +0200, Rasmus Villemoes wrote:
>> On 25/10/2023 09.18, kernel test robot wrote:
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:
>>
>> So with that, can we please just finally say "yeah, let's make the
>> generic bitmap library functions correct
>
> They are all correct already.
>
>> and usable in more cases"
>
> See below.
>
>> instead of worrying about random micro-benchmarks that just show
>> you-win-some-you-lose-some.
>
> With that I agree. I don't worry about either a +2% or a -3% benchmark,
> and I don't think that they alone can justify such a radical change as
> making all find_bit functions volatile and shutting down a newborn KCSAN.
>
> Keeping that in mind, my best guess is that Jan's and Mirsad's test that
> shows +2% was run against stable bitmaps, and that what the robot measured
> is most likely heavily concurrent access to some bitmap in the kernel.
>
> I didn't look at the sources of either test, but that at least makes some
> sense, because if GCC optimizes code against properly described memory
> correctly, this is exactly what we can expect.
>
>> Yes, users will have to treat results from the find routines carefully
>> if their bitmap may be concurrently modified. They do. Nobody wins if
>> those users are forced to implement their own bitmap routines for their
>> lockless algorithms.
>
> Again, I agree with this point, and I'm trying to address exactly this.
>
> I'm working on a series that introduces lockless find_bit functions
> based on the existing FIND_BIT() engine. It's not ready yet, but I hope
> I'll submit it in the next merge window.
>
> https://github.com/norov/linux/commits/find_and_bit
>
> Now that we've got a test that presumably works faster if the find_bit()
> functions are all switched to be volatile, it would be great to get into
> the details and understand:
>  - which find_bit function or functions give that gain in performance;
>  - on what bitmap(s);

I am positive that a per-bit test_and_clear_bit() loop wouldn't improve
performance. No insult, but for every bit it has to:

- LOCK the bus
- READ the (byte/word/longword/quadword) from memory
- TEST the bit and remember the result in FLAGS
- CLEAR the bit
- WRITE back the entire byte/word/longword/quadword

So, instead of speeding up, you'd end up doing that locked
read-modify-write up to 64 times in the worst case, where only the last
bit in the quadword is set.

What we need is an ffs() that expands to the assembly instruction BSF
(bit scan forward)

[1] https://www.felixcloutier.com/x86/bsf

followed by a BTR (bit test and reset)

[2] https://www.felixcloutier.com/x86/btr

Ideally, we'd have a single instruction that does both atomically, or a
way to LOCK the bus across two instructions. But bad guys could use that
to stall all cores indefinitely:

DON'T DO THIS!
-----------------------------
LOCK
loop:
	BSF	r16, m16/m32/m64
	BTR	m16/m32/m64, r16
	JMP	loop
UNLOCK
-----------------------------

This would work better with a hardware-assisted CAM locking device than
by stopping all cores from reading and writing on each memory barrier.
But that is a long story.

>  - whether the reason is concurrent memory access (my guess: yes), and if so,
>  - whether we can refactor the code to use the lockless find_and_bit()
>    functions mentioned above;
>  - if not, how else we can address this.
>
> If you or someone else has a spare time slot to dig deeper into that,
> I'll be really thankful.
I don't know whether the Linux kernel takes advantage of a transactional
memory device when it finds one. The idea of each mutex() or even
user-space futex() stalling all cores with
"asm LOCK CMPXCHG m8/m16/m32/m64, r8/r16/r32/r64" simply doesn't scale
well with e.g. 128 cores that all have to wait for one long
read-modify-write cycle that bypasses the cache and talks to very slow
memory.

I see some progress with per-CPU variables.

[3] https://stackoverflow.com/questions/58664496/locking-on-per-cpu-variable/67997961#67997961

On a multiprocessor system, to protect percpu variables we can just
disable preemption and do local_irq_save(); this way we avoid taking a
spinlock. A spinlock requires atomicity across multiple CPUs, which per-CPU
variables don't need:

	local_irq_save(flags);
	preempt_disable();
	/* modify the percpu variable */
	preempt_enable();
	local_irq_restore(flags);

That compiles roughly to:

	unsigned long flags;

	asm volatile("cli": : :"memory");
	asm volatile(
		"mrs %0, " IRQMASK_REG_NAME_R " @local_save_flags"
		: "=r" (flags) : : "memory", "cc");
	preempt_count_inc();
	__asm__ __volatile__("": : :"memory");	/* barrier() */

	/* --- your stuff here --- */

	if (flags & X86_EFLAGS_IF)
		asm volatile("sti": : :"memory");
	__asm__ __volatile__("": : :"memory");	/* barrier() */
	preempt_count_dec();

With this code the other CPUs can work in parallel and no CPU spins. If
we take a spinlock instead, we modify an atomic variable, and any other
CPU that comes along has to wait/spin while the spinlock is held.

Best regards,
Mirsad

> Thanks,
> Yury
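[For what it's worth, the "BSF followed by BTR" sequence discussed above
can already be expressed with existing kernel primitives without holding
a bus lock across the whole scan; only the clear of the single bit that
was found needs to be atomic. A minimal sketch; my_find_and_clear_bit is
a made-up name and this is not the code from the find_and_bit branch.]

	#include <linux/bitops.h>	/* __ffs(), test_and_clear_bit(), BITS_PER_LONG */
	#include <linux/compiler.h>	/* READ_ONCE() */

	/* Sketch only: scan one word and atomically claim the first set bit. */
	static inline unsigned long my_find_and_clear_bit(unsigned long *word)
	{
		unsigned long val, bit;

		for (;;) {
			val = READ_ONCE(*word);			/* one plain load, no LOCK */
			if (!val)
				return BITS_PER_LONG;		/* nothing set */
			bit = __ffs(val);			/* the BSF step */
			if (test_and_clear_bit(bit, word))	/* LOCK BTR on a single bit */
				return bit;
			/* Another CPU cleared that bit first; rescan the word. */
		}
	}

[The worst case here is one rescan per lost race, not one locked
read-modify-write per bit.]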
diff --git a/include/linux/find.h b/include/linux/find.h
index 5e4f39ef2e72..5ef6466dc7cc 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -60,7 +60,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 	if (unlikely(offset >= size))
 		return size;
 
-	val = *addr & GENMASK(size - 1, offset);
+	val = READ_ONCE(*addr) & GENMASK(size - 1, offset);
 	return val ? __ffs(val) : size;
 }
 
@@ -90,7 +90,8 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 	if (unlikely(offset >= size))
 		return size;
 
-	val = *addr1 & *addr2 & GENMASK(size - 1, offset);
+	val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
+		GENMASK(size - 1, offset);
 	return val ? __ffs(val) : size;
 }
 
@@ -121,7 +122,8 @@ unsigned long find_next_andnot_bit(const unsigned long *addr1,
 	if (unlikely(offset >= size))
 		return size;
 
-	val = *addr1 & ~*addr2 & GENMASK(size - 1, offset);
+	val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) &
+		GENMASK(size - 1, offset);
 	return val ? __ffs(val) : size;
 }
 
@@ -151,7 +153,8 @@ unsigned long find_next_or_bit(const unsigned long *addr1,
 	if (unlikely(offset >= size))
 		return size;
 
-	val = (*addr1 | *addr2) & GENMASK(size - 1, offset);
+	val = (READ_ONCE(*addr1) | READ_ONCE(*addr2)) &
+		GENMASK(size - 1, offset);
 	return val ? __ffs(val) : size;
 }
 
@@ -179,7 +182,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 	if (unlikely(offset >= size))
 		return size;
 
-	val = *addr | ~GENMASK(size - 1, offset);
+	val = READ_ONCE(*addr) | ~GENMASK(size - 1, offset);
 	return val == ~0UL ? size : ffz(val);
 }
 
@@ -200,7 +203,7 @@ static inline
 unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = *addr & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
 
 		return val ? __ffs(val) : size;
 	}
@@ -229,7 +232,7 @@ unsigned long find_nth_bit(const unsigned long *addr, unsigned long size, unsign
 		return size;
 
 	if (small_const_nbits(size)) {
-		unsigned long val = *addr & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
 
 		return val ? fns(val, n) : size;
 	}
@@ -255,7 +258,8 @@ unsigned long find_nth_and_bit(const unsigned long *addr1, const unsigned long *
 		return size;
 
 	if (small_const_nbits(size)) {
-		unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2)
+					& GENMASK(size - 1, 0);
 
 		return val ? fns(val, n) : size;
 	}
@@ -282,7 +286,8 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon
 		return size;
 
 	if (small_const_nbits(size)) {
-		unsigned long val = *addr1 & (~*addr2) & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr1) & ~READ_ONCE(*addr2) &
+					GENMASK(size - 1, 0);
 
 		return val ? fns(val, n) : size;
 	}
@@ -312,7 +317,8 @@ unsigned long find_nth_and_andnot_bit(const unsigned long *addr1,
 		return size;
 
 	if (small_const_nbits(size)) {
-		unsigned long val = *addr1 & *addr2 & (~*addr3) & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
+					~READ_ONCE(*addr3) & GENMASK(size - 1, 0);
 
 		return val ? fns(val, n) : size;
 	}
@@ -336,7 +342,8 @@ unsigned long find_first_and_bit(const unsigned long *addr1,
 				 unsigned long size)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr1) & READ_ONCE(*addr2) &
+					GENMASK(size - 1, 0);
 
 		return val ? __ffs(val) : size;
 	}
@@ -358,7 +365,7 @@ static inline
 unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = *addr | ~GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr) | ~GENMASK(size - 1, 0);
 
 		return val == ~0UL ? size : ffz(val);
 	}
@@ -379,7 +386,7 @@ static inline
 unsigned long find_last_bit(const unsigned long *addr, unsigned long size)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = *addr & GENMASK(size - 1, 0);
+		unsigned long val = READ_ONCE(*addr) & GENMASK(size - 1, 0);
 
 		return val ? __fls(val) : size;
 	}
@@ -505,7 +512,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
 		long size, unsigned long offset)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = *(const unsigned long *)addr;
+		unsigned long val = READ_ONCE(*(const unsigned long *)addr);
 
 		if (unlikely(offset >= size))
 			return size;
@@ -523,7 +530,8 @@ static inline
 unsigned long find_first_zero_bit_le(const void *addr, unsigned long size)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = swab(*(const unsigned long *)addr) | ~GENMASK(size - 1, 0);
+		unsigned long val = swab(READ_ONCE(*(const unsigned long *)addr))
+					| ~GENMASK(size - 1, 0);
 
 		return val == ~0UL ? size : ffz(val);
 	}
@@ -538,7 +546,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
 		long size, unsigned long offset)
 {
 	if (small_const_nbits(size)) {
-		unsigned long val = *(const unsigned long *)addr;
+		unsigned long val = READ_ONCE(*(const unsigned long *)addr);
 
 		if (unlikely(offset >= size))
 			return size;
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 32f99e9a670e..6867ef8a85e0 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -98,7 +98,7 @@ out:								\
  */
 unsigned long _find_first_bit(const unsigned long *addr, unsigned long size)
 {
-	return FIND_FIRST_BIT(addr[idx], /* nop */, size);
+	return FIND_FIRST_BIT(READ_ONCE(addr[idx]), /* nop */, size);
 }
 EXPORT_SYMBOL(_find_first_bit);
 #endif
@@ -111,7 +111,8 @@ unsigned long _find_first_and_bit(const unsigned long *addr1,
 				  const unsigned long *addr2,
 				  unsigned long size)
 {
-	return FIND_FIRST_BIT(addr1[idx] & addr2[idx], /* nop */, size);
+	return FIND_FIRST_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
+			      /* nop */, size);
 }
 EXPORT_SYMBOL(_find_first_and_bit);
 #endif
@@ -122,7 +123,7 @@ EXPORT_SYMBOL(_find_first_and_bit);
  */
 unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size)
 {
-	return FIND_FIRST_BIT(~addr[idx], /* nop */, size);
+	return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), /* nop */, size);
 }
 EXPORT_SYMBOL(_find_first_zero_bit);
 #endif
@@ -130,28 +131,30 @@ EXPORT_SYMBOL(_find_first_zero_bit);
 #ifndef find_next_bit
 unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
 {
-	return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
+	return FIND_NEXT_BIT(READ_ONCE(addr[idx]), /* nop */, nbits, start);
 }
 EXPORT_SYMBOL(_find_next_bit);
 #endif
 
 unsigned long __find_nth_bit(const unsigned long *addr, unsigned long size, unsigned long n)
 {
-	return FIND_NTH_BIT(addr[idx], size, n);
+	return FIND_NTH_BIT(READ_ONCE(addr[idx]), size, n);
 }
 EXPORT_SYMBOL(__find_nth_bit);
 
 unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long *addr2,
 				 unsigned long size, unsigned long n)
 {
-	return FIND_NTH_BIT(addr1[idx] & addr2[idx], size, n);
+	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
+			    size, n);
 }
 EXPORT_SYMBOL(__find_nth_and_bit);
 
 unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
 				    unsigned long size, unsigned long n)
 {
-	return FIND_NTH_BIT(addr1[idx] & ~addr2[idx], size, n);
+	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]),
+			    size, n);
 }
 EXPORT_SYMBOL(__find_nth_andnot_bit);
 
@@ -160,7 +163,8 @@ unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1,
 					const unsigned long *addr3,
 					unsigned long size, unsigned long n)
 {
-	return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n);
+	return FIND_NTH_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]) &
+			    ~READ_ONCE(addr3[idx]), size, n);
 }
 EXPORT_SYMBOL(__find_nth_and_andnot_bit);
 
@@ -168,7 +172,8 @@ EXPORT_SYMBOL(__find_nth_and_andnot_bit);
 unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start)
 {
-	return FIND_NEXT_BIT(addr1[idx] & addr2[idx], /* nop */, nbits, start);
+	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & READ_ONCE(addr2[idx]),
+			     /* nop */, nbits, start);
 }
 EXPORT_SYMBOL(_find_next_and_bit);
 #endif
@@ -177,7 +182,8 @@ EXPORT_SYMBOL(_find_next_and_bit);
 unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start)
 {
-	return FIND_NEXT_BIT(addr1[idx] & ~addr2[idx], /* nop */, nbits, start);
+	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) & ~READ_ONCE(addr2[idx]),
+			     /* nop */, nbits, start);
 }
 EXPORT_SYMBOL(_find_next_andnot_bit);
 #endif
@@ -186,7 +192,8 @@ EXPORT_SYMBOL(_find_next_andnot_bit);
 unsigned long _find_next_or_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start)
 {
-	return FIND_NEXT_BIT(addr1[idx] | addr2[idx], /* nop */, nbits, start);
+	return FIND_NEXT_BIT(READ_ONCE(addr1[idx]) | READ_ONCE(addr2[idx]),
+			     /* nop */, nbits, start);
 }
 EXPORT_SYMBOL(_find_next_or_bit);
 #endif
@@ -195,7 +202,7 @@ EXPORT_SYMBOL(_find_next_or_bit);
 unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
 					 unsigned long start)
 {
-	return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
+	return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), /* nop */, nbits, start);
 }
 EXPORT_SYMBOL(_find_next_zero_bit);
 #endif
@@ -208,7 +215,7 @@ unsigned long _find_last_bit(const unsigned long *addr, unsigned long size)
 		unsigned long idx = (size-1) / BITS_PER_LONG;
 
 		do {
-			val &= addr[idx];
+			val &= READ_ONCE(addr[idx]);
 			if (val)
 				return idx * BITS_PER_LONG + __fls(val);
 
@@ -242,7 +249,7 @@ EXPORT_SYMBOL(find_next_clump8);
  */
 unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size)
 {
-	return FIND_FIRST_BIT(~addr[idx], swab, size);
+	return FIND_FIRST_BIT(~READ_ONCE(addr[idx]), swab, size);
 }
 EXPORT_SYMBOL(_find_first_zero_bit_le);
 
@@ -252,7 +259,7 @@ EXPORT_SYMBOL(_find_first_zero_bit_le);
 unsigned long _find_next_zero_bit_le(const unsigned long *addr,
 					unsigned long size, unsigned long offset)
 {
-	return FIND_NEXT_BIT(~addr[idx], swab, size, offset);
+	return FIND_NEXT_BIT(~READ_ONCE(addr[idx]), swab, size, offset);
 }
 EXPORT_SYMBOL(_find_next_zero_bit_le);
 #endif
 
@@ -261,7 +268,7 @@ EXPORT_SYMBOL(_find_next_zero_bit_le);
 unsigned long _find_next_bit_le(const unsigned long *addr, unsigned
 				long size, unsigned long offset)
 {
-	return FIND_NEXT_BIT(addr[idx], swab, size, offset);
+	return FIND_NEXT_BIT(READ_ONCE(addr[idx]), swab, size, offset);
 }
 EXPORT_SYMBOL(_find_next_bit_le);
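[For readers following the patch, the hazard the READ_ONCE() conversions
close can be reproduced in a few lines of userspace C. This is an
illustrative sketch, not kernel code: MY_READ_ONCE is a simplified
stand-in for the kernel macro, and __builtin_ctzl() (GCC/Clang) plays the
role of __ffs().]

	#include <stddef.h>

	/* Simplified stand-in for the kernel's READ_ONCE() (uses the typeof extension). */
	#define MY_READ_ONCE(x)	(*(const volatile typeof(x) *)&(x))

	/*
	 * Without the volatile access the compiler is allowed to reload *addr
	 * for the __builtin_ctzl() call, so a concurrent writer clearing the
	 * last set bit between the test and the reload makes us call ctzl(0),
	 * which is undefined -- exactly the pattern described in the commit
	 * message.
	 */
	static size_t first_bit_racy(const unsigned long *addr, size_t size)
	{
		unsigned long val = *addr;

		return val ? (size_t)__builtin_ctzl(val) : size;
	}

	/* With a single forced load, the tested value and the scanned value are the same. */
	static size_t first_bit_fixed(const unsigned long *addr, size_t size)
	{
		unsigned long val = MY_READ_ONCE(*addr);

		return val ? (size_t)__builtin_ctzl(val) : size;
	}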