Message ID | tencent_B51A9DA220288A95A435E3435A0443BEB007@qq.com (mailing list archive)
---|---
State | New
Series | rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug
On 3/4/2024 5:54 AM, linke li wrote:
> Some changes are done to fix a data race in commit 202489101f2e ("rcutorture:
> Fix rcu_torture_one_read()/rcu_torture_writer() data race")
>
> 	{
> 		int i;
>
> -		i = rp->rtort_pipe_count;
> +		i = READ_ONCE(rp->rtort_pipe_count);
> 		if (i > RCU_TORTURE_PIPE_LEN)
> 			i = RCU_TORTURE_PIPE_LEN;
> 		atomic_inc(&rcu_torture_wcount[i]);
> -		if (++rp->rtort_pipe_count >= RCU_TORTURE_PIPE_LEN) {
> +		WRITE_ONCE(rp->rtort_pipe_count, i + 1);
> +		if (rp->rtort_pipe_count >= RCU_TORTURE_PIPE_LEN) {
> 			rp->rtort_mbtest = 0;
> 			return true;
> 		}
>
> But ++rp->rtort_pipe_count is meant to add itself by 1, not give i+1 to
> rp->rtort_pipe_count, because rp->rtort_pipe_count may write by
> rcu_torture_writer() concurrently.
>
> Also, rp->rtort_pipe_count in the next line should be read using
> READ_ONCE() because of data race.
>
> Signed-off-by: linke li <lilinke99@qq.com>
> ---
>  kernel/rcu/rcutorture.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index 7567ca8e743c..00059ace4fd5 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -465,8 +465,8 @@ rcu_torture_pipe_update_one(struct rcu_torture *rp)
>  	if (i > RCU_TORTURE_PIPE_LEN)
>  		i = RCU_TORTURE_PIPE_LEN;
>  	atomic_inc(&rcu_torture_wcount[i]);
> -	WRITE_ONCE(rp->rtort_pipe_count, i + 1);
> -	if (rp->rtort_pipe_count >= RCU_TORTURE_PIPE_LEN) {
> +	WRITE_ONCE(rp->rtort_pipe_count, rp->rtort_pipe_count + 1);
> +	if (READ_ONCE(rp->rtort_pipe_count) >= RCU_TORTURE_PIPE_LEN) {

I want to say, I am not convinced with the patch because what's wrong with
writing to an old index?

You win/lose the race anyway, say the CPU executed the WRITE_ONCE() a bit too
early/late and another WRITE_ONCE() lost/won, regardless of whether you wrote
the "incremented i" or "the increment from the latest value of pipe_count".
Anyway, a slightly related/different question:

Should that:
	WRITE_ONCE(rp->rtort_pipe_count, rp->rtort_pipe_count + 1);

Be:
	WRITE_ONCE(rp->rtort_pipe_count, READ_ONCE(rp->rtort_pipe_count) + 1);

?

thanks,

 - Joel
On Mon, Mar 04, 2024 at 11:19:21AM -0500, Joel Fernandes wrote:
> On 3/4/2024 5:54 AM, linke li wrote:
[ . . . ]
> > -	WRITE_ONCE(rp->rtort_pipe_count, i + 1);
> > -	if (rp->rtort_pipe_count >= RCU_TORTURE_PIPE_LEN) {
> > +	WRITE_ONCE(rp->rtort_pipe_count, rp->rtort_pipe_count + 1);
> > +	if (READ_ONCE(rp->rtort_pipe_count) >= RCU_TORTURE_PIPE_LEN) {
>
> I want to say, I am not convinced with the patch because what's wrong with
> writing to an old index?
>
> You win/lose the race anyway, say the CPU executed the WRITE_ONCE() a bit too
> early/late and another WRITE_ONCE() lost/won, regardless of whether you wrote
> the "incremented i" or "the increment from the latest value of pipe_count".
>
> Anyway, a slightly related/different question:
>
> Should that:
> 	WRITE_ONCE(rp->rtort_pipe_count, rp->rtort_pipe_count + 1);
>
> Be:
> 	WRITE_ONCE(rp->rtort_pipe_count, READ_ONCE(rp->rtort_pipe_count) + 1);
>
> ?

Thank you both!

At first glance, I would argue for something like this:

------------------------------------------------------------------------

static bool
rcu_torture_pipe_update_one(struct rcu_torture *rp)
{
	int i;
	struct rcu_torture_reader_check *rtrcp = READ_ONCE(rp->rtort_chkp);

	if (rtrcp) {
		WRITE_ONCE(rp->rtort_chkp, NULL);
		smp_store_release(&rtrcp->rtc_ready, 1); // Pair with smp_load_acquire().
	}
	i = READ_ONCE(rp->rtort_pipe_count) + 1;
	if (i > RCU_TORTURE_PIPE_LEN)
		i = RCU_TORTURE_PIPE_LEN;
	atomic_inc(&rcu_torture_wcount[i]);
	WRITE_ONCE(rp->rtort_pipe_count, i);
	if (i >= RCU_TORTURE_PIPE_LEN) {
		rp->rtort_mbtest = 0;
		return true;
	}
	return false;
}

------------------------------------------------------------------------

That is, move the increment to the read and replace the re-read with
the value "i" that was just written.

Thoughts?

							Thanx, Paul
On 3/4/2024 12:14 PM, Paul E. McKenney wrote:
> On Mon, Mar 04, 2024 at 11:19:21AM -0500, Joel Fernandes wrote:
[ . . . ]
> At first glance, I would argue for something like this:
>
> ------------------------------------------------------------------------
>
> static bool
> rcu_torture_pipe_update_one(struct rcu_torture *rp)
> {
> 	int i;
> 	struct rcu_torture_reader_check *rtrcp = READ_ONCE(rp->rtort_chkp);
>
> 	if (rtrcp) {
> 		WRITE_ONCE(rp->rtort_chkp, NULL);
> 		smp_store_release(&rtrcp->rtc_ready, 1); // Pair with smp_load_acquire().
> 	}
> 	i = READ_ONCE(rp->rtort_pipe_count) + 1;
> 	if (i > RCU_TORTURE_PIPE_LEN)
> 		i = RCU_TORTURE_PIPE_LEN;
> 	atomic_inc(&rcu_torture_wcount[i]);
> 	WRITE_ONCE(rp->rtort_pipe_count, i);
> 	if (i >= RCU_TORTURE_PIPE_LEN) {
> 		rp->rtort_mbtest = 0;
> 		return true;
> 	}
> 	return false;
> }
>
> ------------------------------------------------------------------------
>
> That is, move the increment to the read and replace the re-read with
> the value "i" that was just written.

But that changes the original logic as well? It looks like with the above
change, you're now writing to rcu_torture_wcount[READ_ONCE(rp->rtort_pipe_count) + 1]
instead of rcu_torture_wcount[READ_ONCE(rp->rtort_pipe_count)].

I think that might break rcutorture, because there is an increment outside of
the first 2 entries in rcu_torture_wcount but not sure (need to look more).

Thanks.
On Mon, Mar 04, 2024 at 02:10:09PM -0500, Joel Fernandes wrote:
> On 3/4/2024 12:14 PM, Paul E. McKenney wrote:
[ . . . ]
> > That is, move the increment to the read and replace the re-read with
> > the value "i" that was just written.
>
> But that changes the original logic as well? It looks like with the above
> change, you're now writing to rcu_torture_wcount[READ_ONCE(rp->rtort_pipe_count) + 1]
> instead of rcu_torture_wcount[READ_ONCE(rp->rtort_pipe_count)].
>
> I think that might break rcutorture, because there is an increment outside of
> the first 2 entries in rcu_torture_wcount but not sure (need to look more).

Good point on never incrementing the zeroth entry!  Clearly I should
have waited before replying.

How about the following?

------------------------------------------------------------------------

static bool
rcu_torture_pipe_update_one(struct rcu_torture *rp)
{
	int i;
	struct rcu_torture_reader_check *rtrcp = READ_ONCE(rp->rtort_chkp);

	if (rtrcp) {
		WRITE_ONCE(rp->rtort_chkp, NULL);
		smp_store_release(&rtrcp->rtc_ready, 1); // Pair with smp_load_acquire().
	}
	i = READ_ONCE(rp->rtort_pipe_count);
	if (i > RCU_TORTURE_PIPE_LEN)
		i = RCU_TORTURE_PIPE_LEN;
	atomic_inc(&rcu_torture_wcount[i]);
	WRITE_ONCE(rp->rtort_pipe_count, i + 1);
	if (i + 1 >= RCU_TORTURE_PIPE_LEN) {
		rp->rtort_mbtest = 0;
		return true;
	}
	return false;
}

------------------------------------------------------------------------

							Thanx, Paul
On 3/4/2024 2:44 PM, Paul E. McKenney wrote:
> On Mon, Mar 04, 2024 at 02:10:09PM -0500, Joel Fernandes wrote:
[ . . . ]
> Good point on never incrementing the zeroth entry!  Clearly I should
> have waited before replying.
>
> How about the following?
>
> static bool
> rcu_torture_pipe_update_one(struct rcu_torture *rp)
> {
> 	int i;
> 	struct rcu_torture_reader_check *rtrcp = READ_ONCE(rp->rtort_chkp);
>
> 	if (rtrcp) {
> 		WRITE_ONCE(rp->rtort_chkp, NULL);
> 		smp_store_release(&rtrcp->rtc_ready, 1); // Pair with smp_load_acquire().
> 	}
> 	i = READ_ONCE(rp->rtort_pipe_count);
> 	if (i > RCU_TORTURE_PIPE_LEN)
> 		i = RCU_TORTURE_PIPE_LEN;
> 	atomic_inc(&rcu_torture_wcount[i]);
> 	WRITE_ONCE(rp->rtort_pipe_count, i + 1);
> 	if (i + 1 >= RCU_TORTURE_PIPE_LEN) {
> 		rp->rtort_mbtest = 0;
> 		return true;
> 	}
> 	return false;
> }

Yes, this looks good to me. Thanks,

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
On Mon, Mar 04, 2024 at 03:13:10PM -0500, Joel Fernandes wrote:
> On 3/4/2024 2:44 PM, Paul E. McKenney wrote:
[ . . . ]
> Yes, this looks good to me. Thanks,
> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Again, thank you.

linke li, does this approach work for you?  If so, would you be willing to
send a new patch along these lines?  If it does not work, what additional
problems do you see?

							Thanx, Paul
Thank you both. This looks good to me. I will send a new patch.

> On Mar 5, 2024, at 04:47, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Mar 04, 2024 at 03:13:10PM -0500, Joel Fernandes wrote:
[ . . . ]
> Again, thank you.
>
> linke li, does this approach work for you?  If so, would you be willing to
> send a new patch along these lines?  If it does not work, what additional
> problems do you see?
>
> 							Thanx, Paul
> Anyway, a slightly related/different question:
>
> Should that:
> 	WRITE_ONCE(rp->rtort_pipe_count, rp->rtort_pipe_count + 1);
>
> Be:
> 	WRITE_ONCE(rp->rtort_pipe_count, READ_ONCE(rp->rtort_pipe_count) + 1);
>
> ?

Hi, Joel. Sorry, I am not very sure about this. I did a little research
on it.

I looked through all the code in the kernel that looks like this; both
kinds are used. I also tried to compile this file with and without the
READ_ONCE() inside the WRITE_ONCE() using gcc-9. The resulting binaries
are identical.

Thus I think with current compiler versions the two forms make no
difference, but if the compiler wants to optimize more aggressively,
maybe the latter one is safer.
On Tue, 5 Mar 2024 14:24:20 +0800, linke li <lilinke99@qq.com> wrote:
> > Anyway, a slightly related/different question:
> >
> > Should that:
> > 	WRITE_ONCE(rp->rtort_pipe_count, rp->rtort_pipe_count + 1);
> >
> > Be:
> > 	WRITE_ONCE(rp->rtort_pipe_count, READ_ONCE(rp->rtort_pipe_count) + 1);
> >
> > ?
>
> Hi, Joel. Sorry, I am not very sure about this. I do a little research on
> it.
>
> I looked through all code in kernel that looks like this, both kinds are
> used. I also try to compile this file with and without the READ_ONCE in
> WRITE_ONCE using gcc-9. Their binary is just the same.
>
> Thus I think in the current compiler version, they do not have a big
> difference, but if the compiler wants to optimize more, maybe the latter
> one is better.

I'm sorry, but all these READ_ONCE()/WRITE_ONCE() additions that are being
added because there's a fear of the compiler tearing long words are going to
make the code really ugly and hard to read.

If we take the policy of handling a compiler that can tear reads and writes
of any size word, then we should have proper macros to handle it. Perhaps
READ_SHARED(), WRITE_SHARED(), ADD_SHARED(), SUB_SHARED(). The ONCE has
nothing to do with the reasons for these changes. But at least "SHARED"
can be considered "this variable is shared between different contexts".
Note, this is different than "atomic". It's just to document that this
variable must be loaded or stored in one transaction.

I don't know if Linus even cares about fixing "read/write tearing", which is
why I Cc'd him.

But I'm not going to take any patches that add these macros to fix
compilers that tear words on load and store until we have a set policy on
what to do with them.

-- Steve
On Wed, Mar 06, 2024 at 10:37:19AM -0500, Steven Rostedt wrote:
> On Tue, 5 Mar 2024 14:24:20 +0800, linke li <lilinke99@qq.com> wrote:
[ . . . ]
> I'm sorry but all these READ_ONCE/WRITE_ONCE() additions that are being
> added because there's a fear of the compiler tearing long words, is going to
> make the code really ugly and hard to read.

There are quite a few other things to fear besides tearing. The compiler
can and does invent and fuse normal loads, and it can and does fuse normal
stores. And there really are compilers that tear stores of certain
constants on systems with short immediate fields.

I would argue that use of normal C-language loads and stores should be
accompanied by comments saying why the compiler cannot mess things up.
But maintainer's choice.

Those maintainers who keep a close personal relationship with the
ever-growing list of optimizations should have no problem working out
which use cases are safe for normal C-language loads and stores. Me, I
would really rather play it safe, especially in the innards of something
like RCU. ;-)

> If we take the policy of handling a compiler that can tear reads and writes
> of any size word, then we should have proper macros to handle it.

Those are in fact READ_ONCE() and WRITE_ONCE() when given machine-word
sized/aligned variables.

> Perhaps READ_SHARED(), WRITE_SHARED(), ADD_SHARED(), SUB_SHARED(). The ONCE
> has nothing to do with the reasons for these changes. But at least "SHARED"
> can be considered "this variable is shared between different contexts".
> Note, this is different than "atomic". It's just to document that this
> variable must be loaded or stored in one transaction.

We already have READ_ONCE() and WRITE_ONCE(). An ADD_SHARED() might
be useful, though compilers are starting to learn how to emit good code
for things like WRITE_ONCE(a, READ_ONCE(a) + 1). But such things should
also be documented and added to LKMM.

> I don't know if Linus even cares about fixing "read/write tearing" which is
> why I Cc'd him.

I am sure that whatever his views, he will not suffer in silence. ;-)

> But I'm not going to take any patches that add these macros to fix
> compilers that tear words on load and store until we have a set policy on
> what to do with them.

Maintainer's choice! For RCU, I want the code to just work with future
compiler optimizations as well as with current ones. This stuff is fun
enough without giving the compiler opportunities for more mischief!

							Thanx, Paul
On Wed, 6 Mar 2024 09:36:16 -0800 "Paul E. McKenney" <paulmck@kernel.org> wrote: > > If we take the policy of handling a compiler that can tear reads and writes > > of any size word, then we should have proper macros to handle it. > > Those are in fact READ_ONCE() and WRITE_ONCE() when given machine-word > sized/aligned variables. IIRC, the original purpose of READ_ONCE() and WRITE_ONCE() was to make sure that the compiler only reads or writes the variable "once". Hence the name. That way after a load, you don't need to worry that the content of the variable you read isn't going to be read again from the original location because the compiler decided to save stack space and registers. But that macro has now been extended for other purposes. > > > Perhaps READ_SHARED(), WRITE_SHARED(), ADD_SHARED(), SUB_SHARED(). The ONCE > > has nothing to do with the reasons for these changes. But at least "SHARED" > > can be considered "this variable is shared between different contexts". > > Note, this is different than "atomic". It's just to document that this > > variable must be loaded or stored in one transaction. > > We already have READ_ONCE() and WRITE_ONCE(). An ADD_SHARED() might > be useful, though compilers are starting to learn how to emit good code > for things like WRITE_ONCE(a, READ_ONCE(a) + 1). Well, if we keep the _ONCE() naming, it should be ADD_ONCE(). Because WRITE_ONCE(a, READ_ONCE(a) + 1) is an abomination and should only be present in obfuscation contests. > > But such things should also be documented and added to LKMM. > > > I don't know if Linus even cares about fixing "read/write tearing" which is > > why I Cc'd him. > > I am sure that whatever his views, he will not suffer in silence. ;-) > > > But I'm not going to take any patches that add these macros to fix > > compilers that tear words on load and store until we have a set policy on > > what to do with them. > > Maintainer's choice! 
> > For RCU, I want the code to just work with future compiler optimizations > as well as with current ones. This stuff is fun enough without giving > the compiler opportunities for more mischief! I'm not against the changes. I'm against the ugliness of the changes. Should we just create an ADD_ONCE() macro? If the approach is now to find all places that access a variable between different contexts, and create READ_ONCE()/WRITE_ONCE() around them, I'm fine with it. Perhaps we need a way to annotate them, like we have with __rcu. "__shared"? Then all accesses to that variable must be wrapped with a READ_ONCE() or WRITE_ONCE()? I mean, if this can cause legitimate bugs, we should probably address it like we do with locking and RCU. -- Steve
On Wed, Mar 06, 2024 at 01:01:03PM -0500, Steven Rostedt wrote: > On Wed, 6 Mar 2024 09:36:16 -0800 > "Paul E. McKenney" <paulmck@kernel.org> wrote: > > > > If we take the policy of handling a compiler that can tear reads and writes > > > of any size word, then we should have proper macros to handle it. > > > > Those are in fact READ_ONCE() and WRITE_ONCE() when given machine-word > > sized/aligned variables. > > IIRC, the original purpose of READ_ONCE() and WRITE_ONCE() was to make sure > that the compiler only reads or writes the variable "once". Hence the name. > That way after a load, you don't need to worry that the content of the > variable you read isn't going to be read again from the original location > because the compiler decided to save stack space and registers. > > But that macro has now been extended for other purposes. If I remember correctly, some 32-bit system had 64-bit PTEs that it wanted to use WRITE_ONCE() on. Does Linux still support that system? If not, maybe it is time to remove that extension. > > > Perhaps READ_SHARED(), WRITE_SHARED(), ADD_SHARED(), SUB_SHARED(). The ONCE > > > has nothing to do with the reasons for these changes. But at least "SHARED" > > > can be considered "this variable is shared between different contexts". > > > Note, this is different than "atomic". It's just to document that this > > > variable must be loaded or stored in one transaction. > > > > We already have READ_ONCE() and WRITE_ONCE(). An ADD_SHARED() might > > be useful, though compilers are starting to learn how to emit good code > > for things like WRITE_ONCE(a, READ_ONCE(a) + 1). > > Well, if we keep the _ONCE() naming, it should be ADD_ONCE(). Because > > WRITE_ONCE(a, READ_ONCE(a) + 1) > > is an abomination and should only be present in obfuscation contests. I have no problem with replacing that sort of thing with ADD_ONCE(). > > But such things should also be documented and added to LKMM. 
> > > > > I don't know if Linus even cares about fixing "read/write tearing" which is > > > why I Cc'd him. > > > > I am sure that whatever his views, he will not suffer in silence. ;-) > > > > > But I'm not going to take any patches that add these macros to fix > > > compilers that tear words on load and store until we have a set policy on > > > what to do with them. > > > > Maintainer's choice! > > > > For RCU, I want the code to just work with future compiler optimizations > > as well as with current ones. This stuff is fun enough without giving > > the compiler opportunities for more mischief! > > I'm not against the changes. I'm against the ugliness of the changes. > Should we just create a ADD_ONCE() macro? Works for me! We should also update tools/memory-model/linux-kernel.defs to allow it to be used in litmus tests. (I can help with that.) Plus of course documentation. > If the approach is now to find all places that access a variable between > different contexts, and create READ_ONCE()/WRITE_ONCE() around them, I'm > fine with it. I don't know that the entire kernel is going that far, but RCU has had that philosophy for a very long time. Yes, KCSAN sometimes finds places where we slipped up, but those get fixed. > Perhaps we need a way to annotate them, like we have with __rcu. "__shared"? > > Then all accesses to that variable must be wrapped with a READ_ONCE() or > WRITE_ONCE()? I mean, if this can cause legitimate bugs, we should probably > address it like we do with locking and RCU. If we want that, just mark the field "volatile", as in "jiffies". And one of the strengths of READ_ONCE() and WRITE_ONCE() is that they allow non-volatile access where it is safe. For example, if you hold the lock protecting all stores to that variable, you still need WRITE_ONCE() but not READ_ONCE(). In initialization and cleanup code, you don't need either. Thanx, Paul
On Wed, 6 Mar 2024 10:09:48 -0800 "Paul E. McKenney" <paulmck@kernel.org> wrote: > > Perhaps we need a way to annotate them, like we have with __rcu. "__shared"? > > > > Then all accesses to that variable must be wrapped with a READ_ONCE() or > > WRITE_ONCE()? I mean, if this can cause legitimate bugs, we should probably > > address it like we do with locking and RCU. > > If we want that, just mark the field "volatile", as in "jiffies". I already know Linus's view on "volatile" variables ;-) > > And one of the strengths of READ_ONCE() and WRITE_ONCE() is that they > allow non-volatile access where it is safe. For example, if you hold the > lock protecting all stores to that variable, you still need WRITE_ONCE() > but not READ_ONCE(). In initialization and cleanup code, you don't > need either. I guess the current static analyzers just look to see where READ_ONCE() or WRITE_ONCE() is used and check to see if other places have them properly used. I'm guessing that's where the OP patch came from. Sounds like we just need an ADD_ONCE() or INC_ONCE() then. Because I am not taking a WRITE_ONCE(a, READ_ONCE(a) + 1); patch that replaces a simple "a++". -- Steve
On Wed, 6 Mar 2024 at 09:59, Steven Rostedt <rostedt@goodmis.org> wrote: > > IIRC, the original purpose of READ_ONCE() and WRITE_ONCE() was to make sure > that the compiler only reads or writes the variable "once". Hence the name. > That way after a load, you don't need to worry that the content of the > variable you read isn't going to be read again from the original location > because the compiler decided to save stack space and registers. > > But that macro has now been extended for other purposes. Not really. Tearing of simple types (as opposed to structures or bitfields or "more than one word" or whatever) has never really been a real concern. It keeps being brought up as a "compilers could do this", but it's basically just BS fear-mongering. Compilers _don't_ do it, and the reason why compilers don't do it isn't some "compilers are trying to be nice" issue, but simply a "it is insane and generates worse code" issue. So what happens is that READ_ONCE() and WRITE_ONCE() have always been about reading and writing *consistent* values. There is no locking, but the idea is - and has always been - that you get one *single* answer from READ_ONCE(), and that single answer will always be consistent with something that has been written by WRITE_ONCE. That's often useful - lots of code doesn't really care if you get the old or the new value, but the code *does* care that it gets *one* value, and not some random mix of "I tested one value for validity, then it got reloaded due to register pressure, and I actually used another value". And not some "I read one value, and it was a mix of two other values". But in order to get those semantics, the READ_ONCE() and WRITE_ONCE() macros don't do just the 'volatile' (to get the "no reloads" guarantee), but they also do that "simple types" check. So READ_ONCE/WRITE_ONCE has never really been "extended for other purposes". The purpose has always been the same: one single consistent value. 
What did happen is that our *original* name for this was not "read vs write", but just "access". So instead of "READ_ONCE(x)" you'd do "ACCESS_ONCE(x)", and instead of "WRITE_ONCE(x,y)" you'd do "ACCESS_ONCE(x) = y". And, to make matters more interesting, we had code that did that on things that were *not* simple values. IOW, we'd have things like ACCESS_ONCE() on things that literally *couldn't* be accessed as one single value. The most notable was accessing page table entries, which on multiple architectures (including plain old 32-bit x86) ended up being two words. So the extension that *did* happen is that READ_ONCE and WRITE_ONCE actually verify that the type is simple, and that you can't do a 64-bit READ_ONCE on a 32-bit architecture. Because then while you might guarantee that the value isn't reloaded multiple times, you cannot guarantee that you actually get a value that is consistent with a WRITE_ONCE (because the reads and writes are both two operations). Now, we've gotten rid of the whole ACCESS_ONCE() thing, and so some of that history is no longer visible (although you can still see that pattern in the rseq self-tests). So yes, READ_ONCE/WRITE_ONCE do control "tearing", but realistically, it was always only about the "complex values" kind of tearing that the old ACCESS_ONCE() model silently and incorrectly allowed. Linus
On Wed, 6 Mar 2024 10:43:25 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: Thanks for the history lesson ;-) > So yes, READ_ONCE/WRITE_ONCE do control "tearing", but realistically, > it was always only about the "complex values" kind of tearing that the > old ACCESS_ONCE() model silently and incorrectly allowed. Now, are you OK with an addition of ADD_ONCE() and/or INC_ONCE()? So that we don't have to look at: WRITE_ONCE(a, READ_ONCE(a) + 1); ? -- Steve
On Wed, 6 Mar 2024 at 10:53, Steven Rostedt <rostedt@goodmis.org> wrote: > > Now, are you OK with an addition of ADD_ONCE() and/or INC_ONCE()? So that we > don't have to look at: > > WRITE_ONCE(a, READ_ONCE(a) + 1); > > ? In a generic header file under include/linux/? Absolutely not. The above is a completely broken operation. There is no way in hell we should expose it as a general helper. So there is no way we'd add that kind of sh*t-for-brains operation in (for example) our <asm/rwonce.h> header file next to the normal READ/WRITE_ONCE defines. In some individual tracing C file where it has a comment above it how it's braindamaged and unsafe and talking about why it's ok in that particular context? Go wild. But honestly, I do not see when a ADD_ONCE() would ever be a valid thing to do, and *if* it's a valid thing to do, why you'd do it with READ_ONCE and WRITE_ONCE. If you don't care about races, just do a simple "++" and be done with it. The end result is random. Adding a "ADD_ONCE()" macro doesn't make it one whit less random. It just makes a broken concept even uglier. So honestly, I think the ADD_ONCE macro not only needs to be in some tracing-specific C file, the comment needs to be pretty damn big too. Because as a random number generator, it's not even a very good one. So you need to explain *why* you want a particularly bad random number generator in the first place. Linus
On Wed, 6 Mar 2024 at 11:01, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > In some individual tracing C file where it has a comment above it how > it's braindamaged and unsafe and talking about why it's ok in that > particular context? Go wild. Actually, I take that back. Even in a random C file, the naming makes no sense. There's no "once" about it. So if you want to do something like #define UNSAFE_INCREMENTISH(x) (WRITE_ONCE(x, READ_ONCE(x) + 1)) then that's fine, I guess. Because that's what the operation is. It's not safe, and it's not an increment, but it _approximates_ an increment most of the time. So UNSAFE_INCREMENTISH() pretty much perfectly describes what it is doing. Note that you'll also almost certainly end up with worse code generation, ie don't expect to see a single "inc" instruction (or "add $1") for the above. Because at least for gcc, the volatiles involved with those "ONCE" operations end up often generating much worse code, so rather than an "inc" instruction, you'll almost certainly get "load+add+store" and the inevitable code expansion and extra register use. I really don't know what you want to do, but it smells bad. A big comment about why you'd want that "incrementish" operation will be needed. To me, this smells like "Steven did something fundamentally wrong again, some tool is now complaining about it, and Steven doesn't want to fix the problem but instead paper over it again". Not a good look. But I don't have a link to the original report, and I'm not thrilled enough about this to go looking for it. Linus
On Wed, 6 Mar 2024 11:01:55 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, 6 Mar 2024 at 10:53, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > Now, are you OK with an addition of ADD_ONCE() and/or INC_ONCE()? So that we > > don't have to look at: > > > > WRITE_ONCE(a, READ_ONCE(a) + 1); > > > > ? > > In a generic header file under include/linux/? > > Absolutely not. The above is a completely broken operation. There is > no way in hell we should expose it as a general helper. > > So there is no way we'd add that kind of sh*t-for-brains operation in > (for example) our <asm/rwonce.h> header file next to the normal > READ/WRITE_ONCE defines. > > In some individual tracing C file where it has a comment above it how > it's braindamaged and unsafe and talking about why it's ok in that > particular context? Go wild. Note this has nothing to do with tracing. This thread is in RCU. I just happen to receive the same patch "fix" for my code. > > But honestly, I do not see when a ADD_ONCE() would ever be a valid > thing to do, and *if* it's a valid thing to do, why you'd do it with > READ_ONCE and WRITE_ONCE. > > If you don't care about races, just do a simple "++" and be done with > it. The end result is random. That was my feeling. But when I saw this going into RCU, I was thinking there was a more fundamental problem here. > > Adding a "ADD_ONCE()" macro doesn't make it one whit less random. It > just makes a broken concept even uglier. > > So honestly, I think the ADD_ONCE macro not only needs to be in some > tracing-specific C file, the comment needs to be pretty damn big too. > Because as a random number generator, it's not even a very good one. > So you need to explain *why* you want a particularly bad random number > generator in the first place. Again, this has nothing to do with tracing. The code here is solely in RCU. 
I did receive a patch in the tracing code, but that had to deal with wakeups of readers with respect to writers which is a common thing across the kernel and is not anything tracing specific. I wasn't about to take the patch to my code, but when I saw the same changes in RCU, then I thought this might be something I need to worry about. -- Steve
On Wed, 6 Mar 2024 at 11:27, Steven Rostedt <rostedt@goodmis.org> wrote: > > Note this has nothing to do with tracing. This thread is in RCU. I just > happen to receive the same patch "fix" for my code. Ok, googling for rtort_pipe_count, I can only state that that code is complete garbage. And no amount of READ_ONCE/WRITE_ONCE will fix it. For one thing, we have this code: WRITE_ONCE(rp->rtort_pipe_count, i + 1); if (rp->rtort_pipe_count >= RCU_TORTURE_PIPE_LEN) { which is broken by design. The compiler is allowed to (and probably does) turn that into just WRITE_ONCE(rp->rtort_pipe_count, i + 1); if (i + 1 >= RCU_TORTURE_PIPE_LEN) { which only results in the question "Why didn't the source code do that obvious simplification itself?" So that code is actively *STUPID*. It's randomly mixing READ_ONCE and regular reads in ways that just make me go: "there's no saving this shit". This needs fixing. Having tests that have random code in them only makes me doubt that the *TEST* itself is correct, rather than the code it is trying to actually test. And dammit, none of that makes sense anyway. This is not some performance-critical code. Why is it not using proper atomics if there is an actual data race? The reason to use READ_ONCE() and WRITE_ONCE() is that they can be a lot faster than atomics, or - more commonly - because you have some fundamental algorithm that doesn't do arithmetic, but cares about some "state at time X" (the RCU _pointer_ being one such obvious case, but doing an *increment* sure as hell isn't). So using those READ_ONCE/WRITE_ONCE macros for that thing is fundamentally wrong to begin with. The question should not be "should we add another READ_ONCE()". The question should be "what drugs were people on when writing this code"? People - please just stop writing garbage. That 'rtort_pipe_count' should be an atomic_t, and the "add one and return the old value" should be an "atomic_inc_return()-1" (the "-1" is because "inc_return" returns the *new* value).
And feel free to add "_relaxed()" to that atomic op because this code doesn't care about ordering of that counter. It will help on some architectures, but as mentioned, this is not performance-critical code to begin with. Linus
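A userspace C11 sketch of the suggestion above, using stdatomic.h in place of the kernel's atomic_t API (atomic_fetch_add_explicit() with memory_order_relaxed plays the role of a relaxed fetch-inc; the function name and the standalone counter are illustrative, not the actual rcutorture patch):

```c
#include <assert.h>
#include <stdatomic.h>

#define RCU_TORTURE_PIPE_LEN 10

static _Atomic int pipe_count;

/* Atomically increment the counter and return the OLD value, clamped
 * the way rcutorture clamps 'i' for indexing rcu_torture_wcount[]. */
static int pipe_count_fetch_inc(void)
{
	int i = atomic_fetch_add_explicit(&pipe_count, 1,
					  memory_order_relaxed);

	if (i > RCU_TORTURE_PIPE_LEN)
		i = RCU_TORTURE_PIPE_LEN;
	return i;
}
```

Because the read-modify-write is a single atomic operation, there is no window between the read and the write for another updater to slip into, which is the property the WRITE_ONCE(a, READ_ONCE(a) + 1) pattern cannot provide.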
On Wed, 6 Mar 2024 11:27:04 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, 6 Mar 2024 at 11:01, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > In some individual tracing C file where it has a comment above it how > > it's braindamaged and unsafe and talking about why it's ok in that > > particular context? Go wild. > > Actually, I take that back. > > Even in a random C file, the naming makes no sense. There's no "once" about it. > > So if you want to do something like > > #define UNSAFE_INCREMENTISH(x) (WRITE_ONCE(a, READ_ONCE(a) + 1)) > > then that's fine, I guess. Because that's what the operation is. > > It's not safe, and it's not an increment, but it _approximates_ an > increment most of the time. So UNSAFE_INCREMENTISH() pretty much > perfectly describes what it is doing. > > Note that you'll also almost certainly end up with worse code > generation, ie don't expect to see a single "inc" instruction (or "add > $1") for the above. > > Because at least for gcc, the volatiles involved with those "ONCE" > operations end up often generating much worse code, so rather than an > "inc" instruction, you'll almost certainly get "load+add+store" and > the inevitable code expansion and extra register use. > > I really don't know what you want to do, but it smells bad. A big > comment about why you'd want that "incrementish" operation will be > needed. > > To me, this smells like "Steven did something fundamentally wrong > again, some tool is now complaining about it, and Steven doesn't want > to fix the problem but instead paper over it again". > > Not a good look. > > But I don't have a link to the original report, and I'm not thrilled > enough about this to go looking for it. Well, it's not the above, and I was afraid you would even think that. Here's the back story. I received the following patch: https://lore.kernel.org/all/tencent_BA1473492BC618B473864561EA3AB1418908@qq.com/ I didn't like it. 
My reply was: > - rbwork->wait_index++; > + WRITE_ONCE(rbwork->wait_index, READ_ONCE(rbwork->wait_index) + 1); I mean the above is really ugly. If this is the new thing to do, we need better macros. If anything, just convert it to an atomic_t. Then because I'm a reviewer of RCU patches, I saw the same fix for RCU (this thread): https://lore.kernel.org/all/tencent_B51A9DA220288A95A435E3435A0443BEB007@qq.com/ Where it was recommended to do the same WRITE_ONCE(a, READ_ONCE(a) + 1), and this is when I Cc'd you into the conversation to get your view of the situation. Sounds like my original gut feeling that this is a non-issue is proving to be correct. I even argued against using the _ONCE() macros. If WRITE_ONCE(a, READ_ONCE(a) + 1) is a valid operation, I sure as hell don't want it in my code, but I would be OK if it had a macro wrapper. Which you seem to be against, so I'm against it too. -- Steve
On Wed, 6 Mar 2024 at 11:45, Steven Rostedt <rostedt@goodmis.org> wrote: > > Here's the back story. I received the following patch: > > https://lore.kernel.org/all/tencent_BA1473492BC618B473864561EA3AB1418908@qq.com/ > > I didn't like it. My reply was: > > > - rbwork->wait_index++; > > + WRITE_ONCE(rbwork->wait_index, READ_ONCE(rbwork->wait_index) + 1); > > I mean the above is really ugly. If this is the new thing to do, we need > better macros. > > If anything, just convert it to an atomic_t. The right thing is definitely to convert it to an atomic_t. The memory barriers can probably also be turned into atomic ordering, although we don't always have all the variants. But for example, that /* Make sure to see the new wait index */ smp_rmb(); if (wait_index != work->wait_index) break; looks odd, and should probably do an "atomic_read_acquire()" instead of a rmb and a (non-atomic and non-READ_ONCE thing). The first READ_ONCE() should probably also be that atomic_read_acquire() op. On the writing side, my gut feel is that the rbwork->wait_index++; /* make sure the waiters see the new index */ smp_wmb(); should be an "atomic_inc_release(&rbwork->wait_index);" but we don't actually have that operation. We only have the "release" versions for things that return a value. So it would probably need to be either atomic_inc(&rbwork->wait_index); /* make sure the waiters see the new index */ smp_wmb(); or atomic_inc_return_release(&rbwork->wait_index); or we'd need to add the "basic atomics with ordering semantics" (which we aren't going to do unless we end up with a lot more people who want them). I dunno. I didn't look all *that* closely at the code. The above might be garbage too. Somebody who actually knows the code should think about what ordering they actually were looking for.
(And I note that 'wait_index' is of type 'long' in 'struct rb_irq_work', so I guess it should be "atomic_long_t" instead - just shows how little attention I paid on the first read-through, which should make everybody go "I need to double-check Linus here") Linus
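A userspace C11 sketch of the release/acquire handoff described above, with stdatomic.h orderings standing in for smp_store_release()/smp_load_acquire() and for the proposed release-increment. The names (wait_index, shared_data) are illustrative; this is not the ring-buffer code itself:

```c
#include <assert.h>
#include <stdatomic.h>

static _Atomic long wait_index;
static long shared_data;	/* plain (non-atomic) payload */

static void writer_bump(long payload)
{
	shared_data = payload;	/* plain store of the payload... */
	/* ...ordered before the index bump by the release increment,
	 * the role atomic_inc_return_release() would play in the kernel */
	atomic_fetch_add_explicit(&wait_index, 1, memory_order_release);
}

static long reader_check(long old_index)
{
	/* acquire load pairs with the release increment in writer_bump(),
	 * replacing the smp_rmb() + plain read in the original code */
	if (atomic_load_explicit(&wait_index, memory_order_acquire) != old_index)
		return shared_data;	/* index moved: payload is visible */
	return -1;			/* no new index yet */
}

static long handoff_demo(void)
{
	writer_bump(42);
	return reader_check(0);
}
```

The release store and acquire load form the pairing that guarantees a reader who observes the new index also observes the payload written before it, which is exactly what the open-coded smp_wmb()/smp_rmb() pair was trying to express.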
On Wed, 6 Mar 2024 at 11:46, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > That 'rtort_pipe_count' should be an atomic_t, and the "add one and > return the old value" should be an "atomic_inc_return()-1" (the "-1" > is because "inc_return" returns the *new* value). Bah. I am lost in a twisty maze of operations, all the same. One final correction to myself: if you want the old value, the nicer thing to use is probably just "atomic_fetch_inc()". It generates the same result as "atomic_inc_return()-1", but since we do have that native "return old value" variant of this, let's just use it. So the rules are "atomic_op_return()" returns the new value after the op, and "atomic_fetch_op()" returns the old value. For some ops, this matters more than for others. For 'add'-like operations, you can deduce the old from the new (and vice versa). But for bitwise ops, only the 'fetch' version makes much sense, because you can see the end result from that, but you can't figure out the original value from the final one. And to *really* confuse things, as with the memory ordering variants, we don't always have the full complement of operations. So we have atomic_fetch_and() (returns old version) and atomic_and() (doesn't return any version), but we don't have "atomic_and_return()" because it's less useful. But for 'inc' we have all three. Linus
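The fetch-vs-return naming rule above can be shown with C11 atomics (C11 only has the "return old value" fetch forms, so the "return new value" variant is derived; the function names mirror the kernel spellings but these are sketches, not the kernel API):

```c
#include <assert.h>
#include <stdatomic.h>

static _Atomic int v;

/* Like the kernel's atomic_fetch_inc(): returns the OLD value. */
static int fetch_inc(void)
{
	return atomic_fetch_add(&v, 1);
}

/* Like the kernel's atomic_inc_return(): returns the NEW value.
 * C11 has no native add-and-return-new, so derive it from the old. */
static int inc_return(void)
{
	return atomic_fetch_add(&v, 1) + 1;
}
```

For additive ops either value lets you compute the other, which is why both spellings exist; for bitwise ops only the fetch form preserves the pre-op value.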
On Wed, Mar 06, 2024 at 11:46:10AM -0800, Linus Torvalds wrote: > On Wed, 6 Mar 2024 at 11:27, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > Note this has nothing to do with tracing. This thread is in RCU. I just > > happen to receive the same patch "fix" for my code. > > Ok, googling for rtort_pipe_count, I can only state that that code is > complete garbage. Well, you all do have to admit that I was right about something, namely that Linus did not suffer in silence. ;-) TL;DR: Those ->rtort_pipe_count increments cannot run concurrently with each other or any other update of that field, so that update-side READ_ONCE() call is unnecessary and the update-side plain C-language read is OK. The WRITE_ONCE() calls are there for the benefit of the lockless read-side accesses to rtort_pipe_count. Of course, I will fix. > And no amount of READ_ONCE/WRITE_ONCE will fix it. > > For one thing, we have this code: > > WRITE_ONCE(rp->rtort_pipe_count, i + 1); > if (rp->rtort_pipe_count >= RCU_TORTURE_PIPE_LEN) { > > which is broken by design. The compiler is allowed to (and probably > does) turn that into just > > WRITE_ONCE(rp->rtort_pipe_count, i + 1); > if (i + 1 >= RCU_TORTURE_PIPE_LEN) { > > which only results in the question "Why didn't the source code do that > obvious simplification itself?" Oddly enough, we got a patch to this effect just yesterday evening, Pacific Time: https://lore.kernel.org/all/tencent_5D8919B7D1F21906868DE81406015360270A@qq.com/ But that of course only fixes one aspect of this mess. > So that code is actively *STUPID*. It's randomly mixing READ_ONCE and > regular reads in ways that just makes me go: "there's no saving this > shit". > > This needs fixing. Having tests that have random code in them only > makes me doubt that the *TEST* itself is correct, rather than the code > it is trying to actually test. Huh. There should only be one CPU updating ->rtort_pipe_count at a time.
Which to your point would make that READ_ONCE() call not only unnecessary, but confusing as well. Here are the intended uses of ->rtort_pipe_count: 1. Zeroed when the item is first allocated. 2. Incremented at the end of each subsequent grace period, so there is no concurrency with previous instances of this increment and also none with the zeroing. This happens in two different ways. When testing things like call_rcu(), the RCU callback does the update, and if we have not yet reached the limit, passes that same callback to call_rcu(). When testing other grace-period primitives (synchronize_rcu(), for example), we put the item on a list, and update each element of that list after each subsequent synchronous grace period. Once an element has passed through RCU_TORTURE_PIPE_LEN grace periods, it is removed from that list and freed. Either way, there is only ever one CPU incrementing a given ->rtort_pipe_count at any given time. 3. Freed, and not passed to another grace period, thus avoiding concurrency with a future #2. Allocation and free is protected by a spinlock, so there is no concurrency with #1 above. The same CPU that did the last increment does the free, so there is also no concurrency with the last #2. I fired up a KCSAN run with added ASSERT_EXCLUSIVE_WRITER() calls, which agrees with this analysis. (And I will run longer runs to double-check.) > And dammit, none of that makes sense anyway. This is not some > performance-critical code. Why is it not using proper atomics if there > is an actual data race? > > The reason to use READ_ONCE() and WRITE_ONCE() is that they can be a > lot faster than atomics, or - more commonly - because you have some > fundamental algorithm that doesn't do arithmetic, but cares about some > "state at time X" (the RCU _pointer_ being one such obvious case, but > doing an *increment* sure as hell isn't). The only data race is between the one-at-a-time increment and the lockless reads in the RCU readers.
So the RCU readers need READ_ONCE() for ->rtort_pipe_count, but again the updater does not. Which means that the compiler optimization that Linus pointed out above really is an optimization for once because the compiler is for once correct that nothing else is updating ->rtort_pipe_count. > So using those READ_ONCE/WRITE_ONCE macros for that thing is > fundamentally wrong to begin with. > > The question should not be "should we add another READ_ONCE()". The > question should be "what drugs were people on when writing this code"? So what (if anything) was I thinking when I committed those update-side READ_ONCE() calls? 202489101f2e6c ("rcutorture: Fix rcu_torture_one_read()/rcu_torture_writer() data race") The commit log says the right thing, but I nevertheless added that unnecessary READ_ONCE() call. And here I was hoping to be able to blame someone else! ;-) > People - please just stop writing garbage. > > That 'rtort_pipe_count' should be an atomic_t, and the "add one and > return the old value" should be an "atomic_inc_return()-1" (the "-1" > is because "inc_return" returns the *new* value). > > And feel free to add "_relaxed()" to that atomic op because this code > doesn't care about ordering of that counter. It will help on some > architectures, but as mentioned, this is not performance-critical code > to begin with. There is no update-side concurrency, so there is no need for atomics. I just need to get rid of that extraneous update-side READ_ONCE() call. Although I am not sure that I can credibly promise to never ever write garbage, I most certainly can fix this particular instance. :-/ Thanx, Paul
On Wed, 6 Mar 2024 at 18:29, Paul E. McKenney <paulmck@kernel.org> wrote: > > TL;DR: Those ->rtort_pipe_count increments cannot run concurrently > with each other or any other update of that field, so that update-side > READ_ONCE() call is unnecessary and the update-side plain C-language > read is OK. The WRITE_ONCE() calls are there for the benefit of the > lockless read-side accesses to rtort_pipe_count. Ahh. Ok. That makes a bit more sense. So if that's the case, then the "updating side" should never use READ_ONCE, because there's nothing else to protect against. Honestly, this all makes me think that we'd be *much* better off showing the real "handoff" with smp_store_release() and smp_load_acquire(). IOW, something like this (TOTALLY UNTESTED!) patch, perhaps? And please note that this patch is not only untested, it really is a very handwavy patch. I'm sending it as a patch just because it's a more precise way of saying "I think the writers and readers could use the store-release -> load-acquire not just to avoid any worries about accessing things once, but also as a way to show the directional 'flow' of the data". I dunno. Linus
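The release-to-acquire "handoff" suggested above looks like this in a userspace C11 sketch (the kernel analogues are noted in comments; all names here are illustrative, not from the actual patch):

```c
#include <stdatomic.h>

/* The writer fills the payload, then publishes it with a release store;
 * a reader that observes the new sequence value with an acquire load is
 * guaranteed to also observe the payload write. */
static int payload;
static _Atomic int seq;

void writer_publish(int value)
{
	payload = value;	/* ordinary write, ordered before the release store */
	/* kernel analogue: smp_store_release(&rp->rtort_pipe_count, ++i); */
	atomic_store_explicit(&seq,
			      atomic_load_explicit(&seq, memory_order_relaxed) + 1,
			      memory_order_release);
}

int reader_consume(int *out)
{
	/* kernel analogue: smp_load_acquire(&p->rtort_pipe_count); */
	int s = atomic_load_explicit(&seq, memory_order_acquire);

	if (s == 0)
		return 0;	/* nothing published yet */
	*out = payload;		/* ordered after the acquire load */
	return 1;
}
```

The acquire/release pairing is what expresses the directional "flow" of data from writer to reader that the message above is getting at.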
On Wed, 6 Mar 2024 at 18:43, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> I dunno.

Oh, and just looking at that patch, I still think the code is confused.

On the reading side, we have:

        pipe_count = smp_load_acquire(&p->rtort_pipe_count);
        if (pipe_count > RCU_TORTURE_PIPE_LEN) {
                /* Should not happen, but... */

where that comment clearly says that the pipe_count we read (whether with READ_ONCE() or with my smp_load_acquire() suggestion) should never be larger than RCU_TORTURE_PIPE_LEN.

But the writing side very clearly did:

        i = rp->rtort_pipe_count;
        if (i > RCU_TORTURE_PIPE_LEN)
                i = RCU_TORTURE_PIPE_LEN;
        ...
        smp_store_release(&rp->rtort_pipe_count, ++i);

(again, syntactically it could have been "i + 1" instead of my "++i" - same value), so clearly the writing side *can* write a value that is > RCU_TORTURE_PIPE_LEN.

So while the whole READ/WRITE_ONCE vs smp_load_acquire/store_release is one thing that might be worth looking at, I think there are other very confusing aspects here.

                 Linus
On Wed, Mar 06, 2024 at 06:43:42PM -0800, Linus Torvalds wrote: > On Wed, 6 Mar 2024 at 18:29, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > TL;DR: Those ->rtort_pipe_count increments cannot run concurrently > > with each other or any other update of that field, so that update-side > > READ_ONCE() call is unnecessary and the update-side plain C-language > > read is OK. The WRITE_ONCE() calls are there for the benefit of the > > lockless read-side accesses to rtort_pipe_count. > > Ahh. Ok. That makes a bit more sense. > > So if that's the case, then the "updating side" should never use > READ_ONCE, because there's nothing else to protect against. > > Honestly, this all makes me think that we'd be *much* better off > showing the real "handoff" with smp_store_release() and > smp_load_acquire(). > > IOW, something like this (TOTALLY UNTESTED!) patch, perhaps? > > And please note that this patch is not only untested, it really is a > very handwavy patch. > > I'm sending it as a patch just because it's a more precise way of > saying "I think the writers and readers could use the store-release -> > load-acquire not just to avoid any worries about accessing things > once, but also as a way to show the directional 'flow' of the data". > > I dunno. Thank you for looking at this! I will look carefully at this, but the reason I didn't do it this way to begin with is that I did not want false negatives that let weak-memory RCU bugs escape unnoticed due to that synchronization and its overhead. Of course on x86, that synchronization is (nearly) free. 
Thanx, Paul

> Linus

>  kernel/rcu/rcutorture.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index 7567ca8e743c..60b74df3eae2 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -461,12 +461,12 @@ rcu_torture_pipe_update_one(struct rcu_torture *rp)
>  		WRITE_ONCE(rp->rtort_chkp, NULL);
>  		smp_store_release(&rtrcp->rtc_ready, 1); // Pair with smp_load_acquire().
>  	}
> -	i = READ_ONCE(rp->rtort_pipe_count);
> +	i = rp->rtort_pipe_count;
>  	if (i > RCU_TORTURE_PIPE_LEN)
>  		i = RCU_TORTURE_PIPE_LEN;
>  	atomic_inc(&rcu_torture_wcount[i]);
> -	WRITE_ONCE(rp->rtort_pipe_count, i + 1);
> -	if (rp->rtort_pipe_count >= RCU_TORTURE_PIPE_LEN) {
> +	smp_store_release(&rp->rtort_pipe_count, ++i);
> +	if (i >= RCU_TORTURE_PIPE_LEN) {
>  		rp->rtort_mbtest = 0;
>  		return true;
>  	}
> @@ -1408,8 +1408,7 @@ rcu_torture_writer(void *arg)
>  	if (i > RCU_TORTURE_PIPE_LEN)
>  		i = RCU_TORTURE_PIPE_LEN;
>  	atomic_inc(&rcu_torture_wcount[i]);
> -	WRITE_ONCE(old_rp->rtort_pipe_count,
> -		   old_rp->rtort_pipe_count + 1);
> +	smp_store_release(&old_rp->rtort_pipe_count, ++i);
>
>  	// Make sure readers block polled grace periods.
>  	if (cur_ops->get_gp_state && cur_ops->poll_gp_state) {
> @@ -1991,7 +1990,7 @@ static bool rcu_torture_one_read(struct torture_random_state *trsp, long myid)
>  	rcu_torture_reader_do_mbchk(myid, p, trsp);
>  	rtrsp = rcutorture_loop_extend(&readstate, trsp, rtrsp);
>  	preempt_disable();
> -	pipe_count = READ_ONCE(p->rtort_pipe_count);
> +	pipe_count = smp_load_acquire(&p->rtort_pipe_count);
>  	if (pipe_count > RCU_TORTURE_PIPE_LEN) {
>  		/* Should not happen, but... */
>  		pipe_count = RCU_TORTURE_PIPE_LEN;
On 2024-03-06 21:43, Linus Torvalds wrote: [...] > > Honestly, this all makes me think that we'd be *much* better off > showing the real "handoff" with smp_store_release() and > smp_load_acquire(). We've done something similar in liburcu (userspace code) to allow Thread Sanitizer to understand the happens-before relationships within the RCU implementations and lock-free data structures. Moving to load-acquire/store-release (C11 model in our case) allowed us to provide enough happens-before relationship for Thread Sanitizer to understand what is happening under the hood in liburcu and perform relevant race detection of user code. As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern is concerned, the only valid use-case I can think of is split counters or RCU implementations where there is a single updater doing the increment, and one or more concurrent reader threads that need to snapshot a consistent value with READ_ONCE(). Thanks, Mathieu
On Wed, Mar 06, 2024 at 06:49:38PM -0800, Linus Torvalds wrote: > On Wed, 6 Mar 2024 at 18:43, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > I dunno. > > Oh, and just looking at that patch, I still think the code is confused. > > On the reading side, we have: > > pipe_count = smp_load_acquire(&p->rtort_pipe_count); > if (pipe_count > RCU_TORTURE_PIPE_LEN) { > /* Should not happen, but... */ > > where that comment clearly says that the pipe_count we read (whether > with READ_ONCE() or with my smp_load_acquire() suggestion) should > never be larger than RCU_TORTURE_PIPE_LEN. I will fix that comment. It should not happen *if* RCU is working correctly. It can happen if you have an RCU that is so broken that a single RCU reader can span more than ten grace periods. An example of an RCU that really is this broken can be selected using rcutorture's torture_type=busted module parameter. No surprise, given that its implementation of call_rcu() invokes the callback function directly and its implementation of synchronize_rcu() is a complete no-op. ;-) Of course, the purpose of that value of the torture_type module parameter (along with all other possible values containing the string "busted") is to test rcutorture itself. > But the writing side very clearly did: > > i = rp->rtort_pipe_count; > if (i > RCU_TORTURE_PIPE_LEN) > i = RCU_TORTURE_PIPE_LEN; > ... > smp_store_release(&rp->rtort_pipe_count, ++i); > > (again, syntactically it could have been "i + 1" instead of my "++i" - > same value), so clearly the writing side *can* write a value that is > > RCU_TORTURE_PIPE_LEN. > > So while the whole READ/WRITE_ONCE vs smp_load_acquire/store_release > is one thing that might be worth looking at, I think there are other > very confusing aspects here. With this change in that comment, are things better? 
Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 6b821a7037b03..0cb5452ecd945 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -2000,7 +2000,8 @@ static bool rcu_torture_one_read(struct torture_random_state *trsp, long myid)
 	preempt_disable();
 	pipe_count = READ_ONCE(p->rtort_pipe_count);
 	if (pipe_count > RCU_TORTURE_PIPE_LEN) {
-		/* Should not happen, but... */
+		// Should not happen in a correct RCU implementation,
+		// happens quite often for torture_type=busted.
 		pipe_count = RCU_TORTURE_PIPE_LEN;
 	}
 	completed = cur_ops->get_gp_seq();
On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote: > On 2024-03-06 21:43, Linus Torvalds wrote: > [...] > > > > Honestly, this all makes me think that we'd be *much* better off > > showing the real "handoff" with smp_store_release() and > > smp_load_acquire(). > > We've done something similar in liburcu (userspace code) to allow > Thread Sanitizer to understand the happens-before relationships > within the RCU implementations and lock-free data structures. > > Moving to load-acquire/store-release (C11 model in our case) > allowed us to provide enough happens-before relationship for > Thread Sanitizer to understand what is happening under the > hood in liburcu and perform relevant race detection of user > code. Good point! In the kernel, though, KCSAN understands the Linux-kernel memory model, and so we don't get that sort of false positive. > As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern > is concerned, the only valid use-case I can think of is > split counters or RCU implementations where there is a > single updater doing the increment, and one or more > concurrent reader threads that need to snapshot a > consistent value with READ_ONCE(). It is wrong here. OK, not wrong from a functional viewpoint, but it is at best very confusing. I am applying patches to fix this. I will push out a new "dev" branch on -rcu that will make this function appear as shown below. So what would you use that pattern for? One possibility is a per-CPU statistical counter in userspace on a fastpath, in cases where losing the occasional count is OK. Then learning your CPU (and possibly being immediately migrated to some other CPU), READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not) make sense. I suppose the same in the kernel if there was a fastpath so extreme you could not afford to disable preemption. At best, very niche. Or am I suffering a failure of imagination yet again? 
;-)

Thanx, Paul

------------------------------------------------------------------------

static bool
rcu_torture_pipe_update_one(struct rcu_torture *rp)
{
	int i;
	struct rcu_torture_reader_check *rtrcp = READ_ONCE(rp->rtort_chkp);

	if (rtrcp) {
		WRITE_ONCE(rp->rtort_chkp, NULL);
		smp_store_release(&rtrcp->rtc_ready, 1); // Pair with smp_load_acquire().
	}
	i = rp->rtort_pipe_count;
	if (i > RCU_TORTURE_PIPE_LEN)
		i = RCU_TORTURE_PIPE_LEN;
	atomic_inc(&rcu_torture_wcount[i]);
	WRITE_ONCE(rp->rtort_pipe_count, i + 1);
	ASSERT_EXCLUSIVE_WRITER(rp->rtort_pipe_count);
	if (i + 1 >= RCU_TORTURE_PIPE_LEN) {
		rp->rtort_mbtest = 0;
		return true;
	}
	return false;
}
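The reason the per-CPU statistical counter discussed above can only tolerate lossy counting is the classic lost update. The losing interleaving can be replayed deterministically in a single-threaded C11 sketch (relaxed atomics standing in for READ_ONCE()/WRITE_ONCE(); the names are made up):

```c
#include <stdatomic.h>

/* A READ_ONCE/WRITE_ONCE-style increment is a non-atomic
 * read-modify-write: if two updaters interleave, one count is lost.
 * This replays the losing interleaving by hand in one thread. */
static _Atomic int stat_count;

int lost_update_demo(void)
{
	/* Updater A and updater B both "READ_ONCE" the same old value... */
	int a = atomic_load_explicit(&stat_count, memory_order_relaxed);
	int b = atomic_load_explicit(&stat_count, memory_order_relaxed);

	/* ...then both "WRITE_ONCE" their increment: A's count is lost. */
	atomic_store_explicit(&stat_count, a + 1, memory_order_relaxed);
	atomic_store_explicit(&stat_count, b + 1, memory_order_relaxed);

	/* Two increments ran, but the counter moved by only one. */
	return atomic_load_explicit(&stat_count, memory_order_relaxed);
}
```

This is exactly why the pattern is only valid when there is a single updater, or when the occasional lost count genuinely does not matter.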
On 3/6/2024 10:37 PM, Paul E. McKenney wrote: > On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote: >> On 2024-03-06 21:43, Linus Torvalds wrote: >> [...] >>> >>> Honestly, this all makes me think that we'd be *much* better off >>> showing the real "handoff" with smp_store_release() and >>> smp_load_acquire(). >> >> We've done something similar in liburcu (userspace code) to allow >> Thread Sanitizer to understand the happens-before relationships >> within the RCU implementations and lock-free data structures. >> >> Moving to load-acquire/store-release (C11 model in our case) >> allowed us to provide enough happens-before relationship for >> Thread Sanitizer to understand what is happening under the >> hood in liburcu and perform relevant race detection of user >> code. > > Good point! > > In the kernel, though, KCSAN understands the Linux-kernel memory model, > and so we don't get that sort of false positive. > >> As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern >> is concerned, the only valid use-case I can think of is >> split counters or RCU implementations where there is a >> single updater doing the increment, and one or more >> concurrent reader threads that need to snapshot a >> consistent value with READ_ONCE(). > > It is wrong here. OK, not wrong from a functional viewpoint, but it > is at best very confusing. I am applying patches to fix this. I will > push out a new "dev" branch on -rcu that will make this function appear > as shown below. > > So what would you use that pattern for? > > One possibility is a per-CPU statistical counter in userspace on a > fastpath, in cases where losing the occasional count is OK. Then learning > your CPU (and possibly being immediately migrated to some other CPU), > READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not) > make sense. > > I suppose the same in the kernel if there was a fastpath so extreme you > could not afford to disable preemption. > > At best, very niche. 
>
> Or am I suffering a failure of imagination yet again? ;-)
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> static bool
> rcu_torture_pipe_update_one(struct rcu_torture *rp)
> {
> 	int i;
> 	struct rcu_torture_reader_check *rtrcp = READ_ONCE(rp->rtort_chkp);
>
> 	if (rtrcp) {
> 		WRITE_ONCE(rp->rtort_chkp, NULL);
> 		smp_store_release(&rtrcp->rtc_ready, 1); // Pair with smp_load_acquire().
> 	}
> 	i = rp->rtort_pipe_count;
> 	if (i > RCU_TORTURE_PIPE_LEN)
> 		i = RCU_TORTURE_PIPE_LEN;
> 	atomic_inc(&rcu_torture_wcount[i]);
> 	WRITE_ONCE(rp->rtort_pipe_count, i + 1);
> 	ASSERT_EXCLUSIVE_WRITER(rp->rtort_pipe_count);

I was going to suggest adding a comment here for the future reader, noting that update-side ->rtort_pipe_count reads/writes are already mutually exclusive, but this ASSERT already documents it ;-)

Also FWIW I confirmed after staring at the code that indeed only one update-side access is possible at a time! Thanks,

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

- Joel
On Wed, 6 Mar 2024 12:06:00 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 6 Mar 2024 at 11:45, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Here's the back story. I received the following patch:
> >
> > https://lore.kernel.org/all/tencent_BA1473492BC618B473864561EA3AB1418908@qq.com/
> >
> > I didn't like it. My reply was:
> >
> > > -	rbwork->wait_index++;
> > > +	WRITE_ONCE(rbwork->wait_index, READ_ONCE(rbwork->wait_index) + 1);
> >
> > I mean the above is really ugly. If this is the new thing to do, we need
> > better macros.
> >
> > If anything, just convert it to an atomic_t.
>
> The right thing is definitely to convert it to an atomic_t.

Agreed.

> The memory barriers can probably also be turned into atomic ordering,
> although we don't always have all the variants.
>
> But for example, that
>
>         /* Make sure to see the new wait index */
>         smp_rmb();
>         if (wait_index != work->wait_index)
>                 break;
>
> looks odd, and should probably do an "atomic_read_acquire()" instead
> of a rmb and a (non-atomic and non-READ_ONCE) thing.
>
> The first READ_ONCE() should probably also be that atomic_read_acquire() op.
>
> On the writing side, my gut feel is that the
>
>         rbwork->wait_index++;
>         /* make sure the waiters see the new index */
>         smp_wmb();
>
> should be an "atomic_inc_release(&rbwork->wait_index);" but we don't
> actually have that operation. We only have the "release" versions for
> things that return a value.
>
> So it would probably need to be either
>
>         atomic_inc(&rbwork->wait_index);
>         /* make sure the waiters see the new index */
>         smp_wmb();
>
> or
>
>         atomic_inc_return_release(&rbwork->wait_index);
>
> or we'd need to add the "basic atomics with ordering semantics" (which
> we aren't going to do unless we end up with a lot more people who want
> them).
>
> I dunno. I didn't look all *that* closely at the code. The above might
> be garbage too. Somebody who actually knows the code should think
> about what ordering they actually were looking for.
>
> (And I note that 'wait_index' is of type 'long' in 'struct
> rb_irq_work', so I guess it should be "atomic_long_t" instead - just
> shows how little attention I paid on the first read-through, which
> should make everybody go "I need to double-check Linus here")

It doesn't need to be long. I'm not even sure why I made it long. Probably for natural alignment.

The code isn't that complex. You can have a list of waiters waiting for the ring buffer to fill to various watermarks (in most cases, it's just one waiter). When a write happens, it looks to see if the smallest watermark is hit, and if so, calls irqwork to wake up all the waiters.

The waiters will wake up, check to see if a signal is pending or if the ring buffer has hit the watermark the waiter was waiting for, and exit the wait loop.

What the wait_index does is just provide a way to force all waiters out of the wait loop regardless of the watermark each waiter is waiting for. Before a waiter goes into the wait loop, it saves the current wait_index. The waker will increment the wait_index and then call the same irq_work to wake up all the waiters.

After the wakeup happens, the waiter will test if the new wait_index matches what it was before it entered the loop, and if it is different, it falls out of the loop. Then the caller of ring_buffer_wait() can re-evaluate if it needs to enter the wait again.

The wait_index loop exit was needed for when the file descriptor of a file that uses a ring buffer closes and it needs to wake up all the readers of that file descriptor to notify their tasks that the file closed.

So we can switch the:

	rbwork->wait_index++;
	smp_wmb();

into just a:

	(void)atomic_inc_return_release(&rbwork->wait_index);

and the:

	smp_rmb();
	if (wait_index != work->wait_index)

into:

	if (wait_index != atomic_read_acquire(&rb->wait_index))

I'll write up a patch.
Hmm, I have the same wait_index logic at the higher level doing basically the same thing (at the close of the file). I'll switch that over too. -- Steve
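The conversion Steve describes can be sketched with C11 atomics, which here play the role of the kernel's atomic_long_t and its _acquire/_release accessors (the function names are illustrative, not the ring-buffer API):

```c
#include <stdatomic.h>

/* wait_index as an atomic: the waker's release increment pairs with the
 * waiters' acquire reads, replacing the smp_wmb()/smp_rmb() pair. */
static atomic_long wait_index;

/* Waker: kernel analogue (void)atomic_inc_return_release(&rbwork->wait_index); */
void force_waiters_out(void)
{
	atomic_fetch_add_explicit(&wait_index, 1, memory_order_release);
}

/* Waiter, before sleeping: snapshot the current index. */
long wait_index_snapshot(void)
{
	return atomic_load_explicit(&wait_index, memory_order_acquire);
}

/* Waiter, after waking: kernel analogue
 * wait_index != atomic_read_acquire(&rbwork->wait_index) */
int should_bail(long saved)
{
	return saved != atomic_load_explicit(&wait_index, memory_order_acquire);
}
```

The release/acquire pairing guarantees that a waiter which sees the bumped index also sees whatever state the waker published before bumping it.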
On 2024-03-06 22:37, Paul E. McKenney wrote:
> On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote:
> [...]
>
>> As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern
>> is concerned, the only valid use-case I can think of is
>> split counters or RCU implementations where there is a
>> single updater doing the increment, and one or more
>> concurrent reader threads that need to snapshot a
>> consistent value with READ_ONCE().
>
> [...]
>
> So what would you use that pattern for?
>
> One possibility is a per-CPU statistical counter in userspace on a
> fastpath, in cases where losing the occasional count is OK. Then learning
> your CPU (and possibly being immediately migrated to some other CPU),
> READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not)
> make sense.
>
> I suppose the same in the kernel if there was a fastpath so extreme you
> could not afford to disable preemption.
>
> At best, very niche.
>
> Or am I suffering a failure of imagination yet again? ;-)

The (niche) use-cases I have in mind are split-counters and RCU grace period tracking, where precise counter totals are needed (no lost counts).

In the kernel, this could be:

- A per-cpu counter, each counter incremented from thread context with preemption disabled (single updater per counter), read concurrently by other threads. WRITE_ONCE()/READ_ONCE() is useful to make sure there is no store/load tearing there. Atomics on the update would be stronger than necessary on the increment fast-path.

- A per-thread counter (e.g. within task_struct), only incremented by the single thread, read by various threads concurrently.

- A counter whose increment happens to be already protected by a lock, read by various threads without taking the lock. (Technically doable, but I'm not sure I see a relevant use-case for it.)

In user-space:

- The "per-cpu" counter would have to use rseq for increments to prevent inopportune migrations, which needs to be implemented in assembler anyway.
The counter reads would have to use READ_ONCE(). - The per-thread counter (Thread-Local Storage) incremented by a single thread, read by various threads concurrently, is a good target for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in various liburcu implementations which track read-side critical sections per-thread. - Same as for the kernel, a counter increment protected by a lock which needs to be read from various threads concurrently without taking the lock could be a valid use-case, though I fail to see how it is useful due to lack of imagination on my part. ;-) I'm possibly missing other use-cases, but those come to mind as not involving racy counter increments. I agree that use-cases are so niche that we probably do _not_ want to introduce ADD_SHARED() for that purpose in a common header file, because I suspect that it would then become misused in plenty of scenarios where the updates are actually racy and would be better served by atomics or local-atomics. Thanks, Mathieu
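The split-counter case from the list above can be sketched in C11: each slot has exactly one incrementing owner, so the increment needs no atomic read-modify-write, while any thread may snapshot the total. The slot count and names here are made up for illustration:

```c
#include <stdatomic.h>

/* Split counter: one owner (a CPU or thread) per slot. Marked accesses
 * only prevent load/store tearing against concurrent readers; they are
 * the userspace analogue of WRITE_ONCE()/READ_ONCE(). */
#define NSLOTS 4
static _Atomic long counters[NSLOTS];

/* Called only by the single owner of slot s: no atomic RMW needed. */
void slot_inc(int s)
{
	long v = atomic_load_explicit(&counters[s], memory_order_relaxed);

	atomic_store_explicit(&counters[s], v + 1, memory_order_relaxed);
}

/* Any thread: no counts are ever lost, though the total may lag
 * increments that race with the summation. */
long counters_total(void)
{
	long sum = 0;

	for (int i = 0; i < NSLOTS; i++)
		sum += atomic_load_explicit(&counters[i], memory_order_relaxed);
	return sum;
}
```

Because every slot has a single writer, this is one of the few places where the WRITE_ONCE(x, READ_ONCE(x) + 1)-style increment is not racy.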
On Thu, 7 Mar 2024 08:20:59 -0500 Steven Rostedt <rostedt@goodmis.org> wrote:

> When a write happens, it looks to see if the smallest watermark is hit,
> if so, calls irqwork to wakeup all the waiters.
>
> The waiters will wake up, check to see if a signal is pending or if the
> ring buffer has hit the watermark the waiter was waiting for and exit the
> wait loop.
>
> What the wait_index does, is just a way to force all waiters out of the wait
> loop regardless of the watermark the waiter is waiting for. Before a waiter
> goes into the wait loop, it saves the current wait_index. The waker will
> increment the wait_index and then call the same irq_work to wake up all the
> waiters.
>
> After the wakeup happens, the waiter will test if the new wait_index
> matches what it was before it entered the loop, and if it is different, it
> falls out of the loop. Then the caller of the ring_buffer_wait() can
> re-evaluate if it needs to enter the wait again.
>
> The wait_index loop exit was needed for when the file descriptor of a file
> that uses a ring buffer closes and it needs to wake up all the readers of
> that file descriptor to notify their tasks that the file closed.
>
> So we can switch the:
>
> 	rbwork->wait_index++;
> 	smp_wmb();
>
> into just a:
>
> 	(void)atomic_inc_return_release(&rbwork->wait_index);
>
> and the:
>
> 	smp_rmb();
> 	if (wait_index != work->wait_index)
>
> into:
>
> 	if (wait_index != atomic_read_acquire(&rb->wait_index))
>
> I'll write up a patch.
>
> Hmm, I have the same wait_index logic at the higher level doing basically
> the same thing (at the close of the file). I'll switch that over too.

Discussing this with Mathieu on IRC, we found two bugs in the current implementation. One was a stupid bug with an easy fix, and the other is actually a design flaw.

The first bug was that the (wait_index != work->wait_index) check was done *after* the schedule() call and not before it.
The second, more fundamental bug is that there's still a race between the first read of wait_index and the call to prepare_to_wait(). The ring_buffer code doesn't have enough context to know whether to loop or not. If a file is being closed when another thread is just entering this code, it could miss the wakeup.

As the callers of ring_buffer_wait() also do a loop, it's redundant to have ring_buffer_wait() do a loop. It should just do a single wait, then exit and let the callers decide if it should loop again. This will get rid of the need for rbwork->wait_index and simplify the code.

Working on that patch now.

-- Steve
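The restructuring Steve describes (a single wait pass, with the retry loop owned by the caller) can be sketched as follows. All names, including the shutdown flag, are illustrative and not the actual ring-buffer API:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Because the primitive does not loop, no wait_index is needed to force
 * waiters out: closing the file just sets a flag and wakes everyone once. */
static atomic_size_t bytes_ready;
static atomic_bool closing;

void produce(size_t n)
{
	atomic_fetch_add_explicit(&bytes_ready, n, memory_order_release);
}

/* One pass: true when the caller's condition (or shutdown) holds.
 * A real implementation would prepare_to_wait() and sleep on false. */
bool one_wait_pass(size_t watermark)
{
	if (atomic_load_explicit(&closing, memory_order_acquire))
		return true;
	return atomic_load_explicit(&bytes_ready, memory_order_acquire) >= watermark;
}

/* The retry loop lives in the caller, as proposed above. */
size_t wait_for_data(size_t watermark)
{
	while (!one_wait_pass(watermark))
		;	/* sketch only: a real caller would sleep again here */
	return atomic_load_explicit(&bytes_ready, memory_order_relaxed);
}
```

The caller re-evaluating the condition each iteration is what closes the missed-wakeup window that wait_index was papering over.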
On Thu, Mar 07, 2024 at 12:44:39AM -0500, Joel Fernandes wrote: > > > On 3/6/2024 10:37 PM, Paul E. McKenney wrote: > > On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote: > >> On 2024-03-06 21:43, Linus Torvalds wrote: > >> [...] > >>> > >>> Honestly, this all makes me think that we'd be *much* better off > >>> showing the real "handoff" with smp_store_release() and > >>> smp_load_acquire(). > >> > >> We've done something similar in liburcu (userspace code) to allow > >> Thread Sanitizer to understand the happens-before relationships > >> within the RCU implementations and lock-free data structures. > >> > >> Moving to load-acquire/store-release (C11 model in our case) > >> allowed us to provide enough happens-before relationship for > >> Thread Sanitizer to understand what is happening under the > >> hood in liburcu and perform relevant race detection of user > >> code. > > > > Good point! > > > > In the kernel, though, KCSAN understands the Linux-kernel memory model, > > and so we don't get that sort of false positive. > > > >> As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern > >> is concerned, the only valid use-case I can think of is > >> split counters or RCU implementations where there is a > >> single updater doing the increment, and one or more > >> concurrent reader threads that need to snapshot a > >> consistent value with READ_ONCE(). > > > > It is wrong here. OK, not wrong from a functional viewpoint, but it > > is at best very confusing. I am applying patches to fix this. I will > > push out a new "dev" branch on -rcu that will make this function appear > > as shown below. > > > > So what would you use that pattern for? > > > > One possibility is a per-CPU statistical counter in userspace on a > > fastpath, in cases where losing the occasional count is OK. 
Then learning > > your CPU (and possibly being immediately migrated to some other CPU), > > READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not) > > make sense. > > > > I suppose the same in the kernel if there was a fastpath so extreme you > > could not afford to disable preemption. > > > > At best, very niche. > > > > Or am I suffering a failure of imagination yet again? ;-) > > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > static bool > > rcu_torture_pipe_update_one(struct rcu_torture *rp) > > { > > int i; > > struct rcu_torture_reader_check *rtrcp = READ_ONCE(rp->rtort_chkp); > > > > if (rtrcp) { > > WRITE_ONCE(rp->rtort_chkp, NULL); > > smp_store_release(&rtrcp->rtc_ready, 1); // Pair with smp_load_acquire(). > > } > > i = rp->rtort_pipe_count; > > if (i > RCU_TORTURE_PIPE_LEN) > > i = RCU_TORTURE_PIPE_LEN; > > atomic_inc(&rcu_torture_wcount[i]); > > WRITE_ONCE(rp->rtort_pipe_count, i + 1); > > ASSERT_EXCLUSIVE_WRITER(rp->rtort_pipe_count); > > I was going to say to add a comment here for the future reader, that update-side > ->rtort_pipe_count READ/WRITE are already mutually exclusive, but this ASSERT > already documents it ;-) Plus KCSAN is way better at repeatedly inspecting code for this sort of issue than I am. ;-) > Also FWIW I confirmed after starting at code that indeed only one update-side > access is possible at a time! Thanks, > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Thank you very much! I will apply your Reviewed-by to these commits on my next rebase: 28455c73b620 ("rcutorture: ASSERT_EXCLUSIVE_WRITER() for ->rtort_pipe_count updates") b0b99e7db12e ("rcutorture: Remove extraneous rcu_torture_pipe_update_one() READ_ONCE()") Thanx, Paul
On Thu, Mar 07, 2024 at 08:53:05AM -0500, Mathieu Desnoyers wrote:
> On 2024-03-06 22:37, Paul E. McKenney wrote:
> > On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote:
> [...]
> > > As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern
> > > is concerned, the only valid use-case I can think of is
> > > split counters or RCU implementations where there is a
> > > single updater doing the increment, and one or more
> > > concurrent reader threads that need to snapshot a
> > > consistent value with READ_ONCE().
> > [...]
> >
> > So what would you use that pattern for?
> >
> > One possibility is a per-CPU statistical counter in userspace on a
> > fastpath, in cases where losing the occasional count is OK. Then learning
> > your CPU (and possibly being immediately migrated to some other CPU),
> > READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not)
> > make sense.
> >
> > I suppose the same in the kernel if there was a fastpath so extreme you
> > could not afford to disable preemption.
> >
> > At best, very niche.
> >
> > Or am I suffering a failure of imagination yet again? ;-)
>
> The (niche) use-cases I have in mind are split-counters and RCU
> grace period tracking, where precise counter totals are needed
> (no lost counts).
>
> In the kernel, this could be:

Thank you for looking into this!

> - A per-cpu counter, each counter incremented from thread context with
>   preemption disabled (single updater per counter), read concurrently by
>   other threads. WRITE_ONCE/READ_ONCE is useful to make sure there
>   is no store/load tearing there. Atomics on the update would be stronger
>   than necessary on the increment fast-path.

But if preemption is disabled, the updater can read the value without READ_ONCE() without risk of concurrent update. Or are you concerned about interrupt handlers? This would have to be a read from the interrupt handler, given that an update from the interrupt handler could result in a lost count.
> - A per-thread counter (e.g. within task_struct), only incremented by the > single thread, read by various threads concurrently. Ditto. > - A counter which increment happens to be already protected by a lock, read > by various threads without taking the lock. (technically doable, but > I'm not sure I see a relevant use-case for it) In that case, the lock would exclude concurrent updates, so the lock holder would not need READ_ONCE(), correct? > In user-space: > > - The "per-cpu" counter would have to use rseq for increments to prevent > inopportune migrations, which needs to be implemented in assembler anyway. > The counter reads would have to use READ_ONCE(). Fair enough! > - The per-thread counter (Thread-Local Storage) incremented by a single > thread, read by various threads concurrently, is a good target > for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in > various liburcu implementations which track read-side critical sections > per-thread. Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or similar? > - Same as for the kernel, a counter increment protected by a lock which > needs to be read from various threads concurrently without taking > the lock could be a valid use-case, though I fail to see how it is > useful due to lack of imagination on my part. ;-) In RCU, we have "WRITE_ONCE(*sp, *sp + 1)" for this use case, though here we have the WRITE_ONCE() but not the READ_ONCE() because we hold the lock excluding any other updates. > I'm possibly missing other use-cases, but those come to mind as not > involving racy counter increments. > > I agree that use-cases are so niche that we probably do _not_ want to > introduce ADD_SHARED() for that purpose in a common header file, > because I suspect that it would then become misused in plenty of > scenarios where the updates are actually racy and would be better > served by atomics or local-atomics. 
I agree that unless or until we have a reasonable number of use cases, we should open-code it. With a big fat comment explaining how it works. ;-) Thanx, Paul
On 2024-03-07 14:47, Paul E. McKenney wrote: > On Thu, Mar 07, 2024 at 08:53:05AM -0500, Mathieu Desnoyers wrote: >> On 2024-03-06 22:37, Paul E. McKenney wrote: >>> On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote: >> [...] >>> >>>> As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern >>>> is concerned, the only valid use-case I can think of is >>>> split counters or RCU implementations where there is a >>>> single updater doing the increment, and one or more >>>> concurrent reader threads that need to snapshot a >>>> consistent value with READ_ONCE(). >>> >> [...] >>> >>> So what would you use that pattern for? >>> >>> One possibility is a per-CPU statistical counter in userspace on a >>> fastpath, in cases where losing the occasional count is OK. Then learning >>> your CPU (and possibly being immediately migrated to some other CPU), >>> READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not) >>> make sense. >>> >>> I suppose the same in the kernel if there was a fastpath so extreme you >>> could not afford to disable preemption. >>> >>> At best, very niche. >>> >>> Or am I suffering a failure of imagination yet again? ;-) >> >> The (niche) use-cases I have in mind are split-counters and RCU >> grace period tracking, where precise counters totals are needed >> (no lost count). >> >> In the kernel, this could be: > > Thank you for looking into this! > >> - A per-cpu counter, each counter incremented from thread context with >> preemption disabled (single updater per counter), read concurrently by >> other threads. WRITE_ONCE/READ_ONCE is useful to make sure there >> is no store/load tearing there. Atomics on the update would be stronger >> than necessary on the increment fast-path. > > But if preemption is disabled, the updater can read the value without > READ_ONCE() without risk of concurrent update. Or are you concerned about > interrupt handlers? 
This would have to be a read from the interrupt > handler, given that an update from the interrupt handler could result > in a lost count. You are correct that the updater doesn't need READ_ONCE() there. It would however require a WRITE_ONCE() to match READ_ONCE() from concurrent reader threads. > >> - A per-thread counter (e.g. within task_struct), only incremented by the >> single thread, read by various threads concurrently. > > Ditto. Right, only WRITE_ONCE() on the single updater, READ_ONCE() on readers. > >> - A counter which increment happens to be already protected by a lock, read >> by various threads without taking the lock. (technically doable, but >> I'm not sure I see a relevant use-case for it) > > In that case, the lock would exclude concurrent updates, so the lock > holder would not need READ_ONCE(), correct? Correct. > >> In user-space: >> >> - The "per-cpu" counter would have to use rseq for increments to prevent >> inopportune migrations, which needs to be implemented in assembler anyway. >> The counter reads would have to use READ_ONCE(). > > Fair enough! > >> - The per-thread counter (Thread-Local Storage) incremented by a single >> thread, read by various threads concurrently, is a good target >> for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in >> various liburcu implementations which track read-side critical sections >> per-thread. > > Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or > similar? Not quite, I recall it's more like WRITE_ONCE(x, READ_ONCE(y)) or such, so we can grab the value of the current gp counter and store it into a TLS variable. > >> - Same as for the kernel, a counter increment protected by a lock which >> needs to be read from various threads concurrently without taking >> the lock could be a valid use-case, though I fail to see how it is >> useful due to lack of imagination on my part.
;-) > > In RCU, we have "WRITE_ONCE(*sp, *sp + 1)" for this use case, though > here we have the WRITE_ONCE() but not the READ_ONCE() because we hold > the lock excluding any other updates. You are right, the READ_ONCE() is not needed in this case for the updater, only for the concurrent readers. Thanks, Mathieu
On Thu, 7 Mar 2024 at 11:47, Paul E. McKenney <paulmck@kernel.org> wrote: > > > - The per-thread counter (Thread-Local Storage) incremented by a single > > thread, read by various threads concurrently, is a good target > > for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in > > various liburcu implementations which track read-side critical sections > > per-thread. > > Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or > similar? Absolutely not. The READ_ONCE->WRITE_ONCE pattern is almost certainly a bug. The valid reason to have a WRITE_ONCE() is that there's a _later_ READ_ONCE() on another CPU. So WRITE_ONCE->READ_ONCE (across threads) is very valid. But READ_ONCE->WRITE_ONCE (inside a thread) simply is not a valid operation. We do have things like "local_t", which allows for non-smp-safe local thread atomic accesses, but they explicitly are *NOT* about some kind of READ_ONCE -> WRITE_ONCE sequence that by definition cannot be atomic unless you disable interrupts and disable preemption (at which point they become pointless and only generate worse code). But the point of "local_t" is that you can do things that are safe if there are no SMP issues. They are kind of an extension of the percpu_add() kind of operations. In fact, I think it might be interesting to catch those READ_ONCE->WRITE_ONCE chains (perhaps with coccinelle?) because they are a sign of bugs. Now, there's certainly some possibility of "I really don't care about some stats, I'm willing to do non-smp-safe and non-thread safe operations if they are faster". So I'm not saying a READ_ONCE->WRITE_ONCE data dependency is _always_ a bug, but I do think it's a pattern that is very very close to being one. Linus
On Thu, Mar 07, 2024 at 12:00:44PM -0800, Linus Torvalds wrote: > On Thu, 7 Mar 2024 at 11:47, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > - The per-thread counter (Thread-Local Storage) incremented by a single > > > thread, read by various threads concurrently, is a good target > > > for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in > > > various liburcu implementations which track read-side critical sections > > > per-thread. > > > > Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or > > similar? > > Absolutely not. > > The READ_ONCE->WRITE_ONCE pattern is almost certainly a bug. > > The valid reason to have a WRITE_ONCE() is that there's a _later_ > READ_ONCE() on another CPU. > > So WRITE_ONCE->READ_ONCE (across threads) is very valid. But > READ_ONCE->WRITE_ONCE (inside a thread) simply is not a valid > operation. > > We do have things like "local_t", which allows for non-smp-safe local > thread atomic accesses, but they explicitly are *NOT* about some kind > of READ_ONCE -> WRITE_ONCE sequence that by definition cannot be > atomic unless you disable interrupts and disable preemption (at which > point they become pointless and only generate worse code). > > But the point of "local_t" is that you can do things that aresafe if > there is no SMP issues. They are kind of an extension of the > percpu_add() kind of operations. > > In fact, I think it might be interesting to catch those > READ_ONCE->WRITE_ONCE chains (perhaps with coccinelle?) because they > are a sign of bugs. Good point, adding Julia Lawall on CC. A really stupid version of this pattern (WRITE_ONCE.*READ_ONCE) found four more in RCU, so I will take a look at those and either fix or add comments as appropriate. > Now, there's certainly some possibility of "I really don't care about > some stats, I'm willing to do non-smp-safe and non-thread safe > operations if they are faster". 
So I'm not saying a > READ_ONCE->WRITE_ONCE data dependency is _always_ a bug, but I do > think it's a pattern that is very very close to being one. I agree that valid use cases should be quite rare. Thanx, Paul
On Thu, 7 Mar 2024, Paul E. McKenney wrote: > On Thu, Mar 07, 2024 at 12:00:44PM -0800, Linus Torvalds wrote: > > On Thu, 7 Mar 2024 at 11:47, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > - The per-thread counter (Thread-Local Storage) incremented by a single > > > > thread, read by various threads concurrently, is a good target > > > > for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in > > > > various liburcu implementations which track read-side critical sections > > > > per-thread. > > > > > > Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or > > > similar? > > > > Absolutely not. > > > > The READ_ONCE->WRITE_ONCE pattern is almost certainly a bug. > > > > The valid reason to have a WRITE_ONCE() is that there's a _later_ > > READ_ONCE() on another CPU. > > > > So WRITE_ONCE->READ_ONCE (across threads) is very valid. But > > READ_ONCE->WRITE_ONCE (inside a thread) simply is not a valid > > operation. > > > > We do have things like "local_t", which allows for non-smp-safe local > > thread atomic accesses, but they explicitly are *NOT* about some kind > > of READ_ONCE -> WRITE_ONCE sequence that by definition cannot be > > atomic unless you disable interrupts and disable preemption (at which > > point they become pointless and only generate worse code). > > > > But the point of "local_t" is that you can do things that aresafe if > > there is no SMP issues. They are kind of an extension of the > > percpu_add() kind of operations. > > > > In fact, I think it might be interesting to catch those > > READ_ONCE->WRITE_ONCE chains (perhaps with coccinelle?) because they > > are a sign of bugs. > > Good point, adding Julia Lawall on CC. I tried the following: @@ expression x; @@ *WRITE_ONCE(x,<+...READ_ONCE(x)...+>) This gave a number of results, shown below. Let me know if some of them are undesirable. I also tried: @@ expression x,v; *v = READ_ONCE(x) ... 
when != v *WRITE_ONCE(x,<+...v...+>) I found the results somewhat less compelling. Maybe the following are interesting (note that - means this line is interesting, not a suggestion to remove it): 479 files match diff -u -p /home/julia/linux/net/netfilter/nf_tables_api.c /tmp/nothing/net/netfilter/nf_tables_api.c --- /home/julia/linux/net/netfilter/nf_tables_api.c +++ /tmp/nothing/net/netfilter/nf_tables_api.c @@ -10026,8 +10026,6 @@ static unsigned int nft_gc_seq_begin(str unsigned int gc_seq; /* Bump gc counter, it becomes odd, this is the busy mark. */ - gc_seq = READ_ONCE(nft_net->gc_seq); - WRITE_ONCE(nft_net->gc_seq, ++gc_seq); return gc_seq; } diff -u -p /home/julia/linux/fs/xfs/xfs_icache.c /tmp/nothing/fs/xfs/xfs_icache.c --- /home/julia/linux/fs/xfs/xfs_icache.c +++ /tmp/nothing/fs/xfs/xfs_icache.c @@ -2076,8 +2076,6 @@ xfs_inodegc_queue( cpu_nr = get_cpu(); gc = this_cpu_ptr(mp->m_inodegc); llist_add(&ip->i_gclist, &gc->list); - items = READ_ONCE(gc->items); - WRITE_ONCE(gc->items, items + 1); shrinker_hits = READ_ONCE(gc->shrinker_hits); /* @@ -2199,9 +2197,7 @@ xfs_inodegc_shrinker_scan( for_each_cpu(cpu, &mp->m_inodegc_cpumask) { gc = per_cpu_ptr(mp->m_inodegc, cpu); if (!llist_empty(&gc->list)) { - unsigned int h = READ_ONCE(gc->shrinker_hits); - WRITE_ONCE(gc->shrinker_hits, h + 1); mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); no_items = false; } In other cases, more work is done between the read and the write. I can send them if of interest. julia diff -u -p /home/julia/linux/kernel/sched/fair.c /tmp/nothing/kernel/sched/fair.c --- /home/julia/linux/kernel/sched/fair.c +++ /tmp/nothing/kernel/sched/fair.c @@ -3153,7 +3153,6 @@ static void reset_ptenuma_scan(struct ta * statistical sampling.
Use READ_ONCE/WRITE_ONCE, which are not * expensive, to avoid any form of compiler optimizations: */ - WRITE_ONCE(p->mm->numa_scan_seq, READ_ONCE(p->mm->numa_scan_seq) + 1); p->mm->numa_scan_offset = 0; } diff -u -p /home/julia/linux/net/ipv4/inet_hashtables.c /tmp/nothing/net/ipv4/inet_hashtables.c --- /home/julia/linux/net/ipv4/inet_hashtables.c +++ /tmp/nothing/net/ipv4/inet_hashtables.c @@ -1109,7 +1109,6 @@ ok: * it may be inexistent. */ i = max_t(int, i, get_random_u32_below(8) * step); - WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + step); /* Head lock still held and bh's disabled */ inet_bind_hash(sk, tb, tb2, port); diff -u -p /home/julia/linux/kernel/rcu/tree.c /tmp/nothing/kernel/rcu/tree.c --- /home/julia/linux/kernel/rcu/tree.c +++ /tmp/nothing/kernel/rcu/tree.c @@ -1620,8 +1620,6 @@ static void rcu_gp_fqs(bool first_time) /* Clear flag to prevent immediate re-entry. */ if (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) { raw_spin_lock_irq_rcu_node(rnp); - WRITE_ONCE(rcu_state.gp_flags, - READ_ONCE(rcu_state.gp_flags) & ~RCU_GP_FLAG_FQS); raw_spin_unlock_irq_rcu_node(rnp); } } @@ -1882,8 +1880,6 @@ static void rcu_report_qs_rsp(unsigned l { raw_lockdep_assert_held_rcu_node(rcu_get_root()); WARN_ON_ONCE(!rcu_gp_in_progress()); - WRITE_ONCE(rcu_state.gp_flags, - READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS); raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(), flags); rcu_gp_kthread_wake(); } @@ -2392,8 +2388,6 @@ void rcu_force_quiescent_state(void) raw_spin_unlock_irqrestore_rcu_node(rnp_old, flags); return; /* Someone beat us to it. 
*/ } - WRITE_ONCE(rcu_state.gp_flags, - READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS); raw_spin_unlock_irqrestore_rcu_node(rnp_old, flags); rcu_gp_kthread_wake(); } diff -u -p /home/julia/linux/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c /tmp/nothing/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c --- /home/julia/linux/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c +++ /tmp/nothing/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c @@ -80,8 +80,6 @@ static int cn23xx_vf_reset_io_queues(str q_no); return -1; } - WRITE_ONCE(reg_val, READ_ONCE(reg_val) & - ~CN23XX_PKT_INPUT_CTL_RST); octeon_write_csr64(oct, CN23XX_VF_SLI_IQ_PKT_CONTROL64(q_no), READ_ONCE(reg_val)); diff -u -p /home/julia/linux/kernel/rcu/srcutree.c /tmp/nothing/kernel/rcu/srcutree.c --- /home/julia/linux/kernel/rcu/srcutree.c +++ /tmp/nothing/kernel/rcu/srcutree.c @@ -628,7 +628,6 @@ static unsigned long srcu_get_delay(stru if (time_after(j, gpstart)) jbase += j - gpstart; if (!jbase) { - WRITE_ONCE(sup->srcu_n_exp_nodelay, READ_ONCE(sup->srcu_n_exp_nodelay) + 1); if (READ_ONCE(sup->srcu_n_exp_nodelay) > srcu_max_nodelay_phase) jbase = 1; } @@ -1799,7 +1798,6 @@ static void process_srcu(struct work_str } else { j = jiffies; if (READ_ONCE(sup->reschedule_jiffies) == j) { - WRITE_ONCE(sup->reschedule_count, READ_ONCE(sup->reschedule_count) + 1); if (READ_ONCE(sup->reschedule_count) > srcu_max_nodelay) curdelay = 1; } else { diff -u -p /home/julia/linux/kernel/kcsan/kcsan_test.c /tmp/nothing/kernel/kcsan/kcsan_test.c --- /home/julia/linux/kernel/kcsan/kcsan_test.c +++ /tmp/nothing/kernel/kcsan/kcsan_test.c @@ -381,7 +381,6 @@ static noinline void test_kernel_change_ test_var ^= TEST_CHANGE_BITS; kcsan_nestable_atomic_end(); } else - WRITE_ONCE(test_var, READ_ONCE(test_var) ^ TEST_CHANGE_BITS); } static noinline void test_kernel_assert_bits_change(void) diff -u -p /home/julia/linux/arch/s390/kernel/idle.c /tmp/nothing/arch/s390/kernel/idle.c --- 
/home/julia/linux/arch/s390/kernel/idle.c +++ /tmp/nothing/arch/s390/kernel/idle.c @@ -43,8 +43,6 @@ void account_idle_time_irq(void) S390_lowcore.last_update_timer = S390_lowcore.sys_enter_timer; /* Account time spent with enabled wait psw loaded as idle time. */ - WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time); - WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1); account_idle_time(cputime_to_nsecs(idle_time)); } diff -u -p /home/julia/linux/mm/mmap.c /tmp/nothing/mm/mmap.c --- /home/julia/linux/mm/mmap.c +++ /tmp/nothing/mm/mmap.c @@ -3476,7 +3476,6 @@ bool may_expand_vm(struct mm_struct *mm, void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages) { - WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm)+npages); if (is_exec_mapping(flags)) mm->exec_vm += npages; diff -u -p /home/julia/linux/fs/xfs/libxfs/xfs_iext_tree.c /tmp/nothing/fs/xfs/libxfs/xfs_iext_tree.c --- /home/julia/linux/fs/xfs/libxfs/xfs_iext_tree.c +++ /tmp/nothing/fs/xfs/libxfs/xfs_iext_tree.c @@ -618,7 +618,6 @@ xfs_iext_realloc_root( */ static inline void xfs_iext_inc_seq(struct xfs_ifork *ifp) { - WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1); } void diff -u -p /home/julia/linux/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c /tmp/nothing/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c --- /home/julia/linux/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c +++ /tmp/nothing/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c @@ -379,8 +379,6 @@ static int cn23xx_reset_io_queues(struct q_no); return -1; } - WRITE_ONCE(reg_val, READ_ONCE(reg_val) & - ~CN23XX_PKT_INPUT_CTL_RST); octeon_write_csr64(oct, CN23XX_SLI_IQ_PKT_CONTROL64(q_no), READ_ONCE(reg_val)); @@ -879,9 +877,6 @@ static void cn23xx_disable_io_queues(str /* start the Reset for a particular ring */ WRITE_ONCE(d64, octeon_read_csr64( oct, CN23XX_SLI_IQ_PKT_CONTROL64(q_no))); - WRITE_ONCE(d64, READ_ONCE(d64) & - (~(CN23XX_PKT_INPUT_CTL_RING_ENB))); - 
WRITE_ONCE(d64, READ_ONCE(d64) | CN23XX_PKT_INPUT_CTL_RST); octeon_write_csr64(oct, CN23XX_SLI_IQ_PKT_CONTROL64(q_no), READ_ONCE(d64)); diff -u -p /home/julia/linux/mm/kfence/kfence_test.c /tmp/nothing/mm/kfence/kfence_test.c --- /home/julia/linux/mm/kfence/kfence_test.c +++ /tmp/nothing/mm/kfence/kfence_test.c @@ -501,7 +501,6 @@ static void test_kmalloc_aligned_oob_wri * fault immediately after it. */ expect.addr = buf + size; - WRITE_ONCE(*expect.addr, READ_ONCE(*expect.addr) + 1); KUNIT_EXPECT_FALSE(test, report_available()); test_free(buf); KUNIT_EXPECT_TRUE(test, report_matches(&expect)); diff -u -p /home/julia/linux/io_uring/io_uring.c /tmp/nothing/io_uring/io_uring.c --- /home/julia/linux/io_uring/io_uring.c +++ /tmp/nothing/io_uring/io_uring.c @@ -363,7 +363,6 @@ static void io_account_cq_overflow(struc { struct io_rings *r = ctx->rings; - WRITE_ONCE(r->cq_overflow, READ_ONCE(r->cq_overflow) + 1); ctx->cq_extra--; } @@ -2403,8 +2402,6 @@ static bool io_get_sqe(struct io_ring_ct spin_lock(&ctx->completion_lock); ctx->cq_extra--; spin_unlock(&ctx->completion_lock); - WRITE_ONCE(ctx->rings->sq_dropped, - READ_ONCE(ctx->rings->sq_dropped) + 1); return false; } } diff -u -p /home/julia/linux/security/apparmor/apparmorfs.c /tmp/nothing/security/apparmor/apparmorfs.c --- /home/julia/linux/security/apparmor/apparmorfs.c +++ /tmp/nothing/security/apparmor/apparmorfs.c @@ -596,7 +596,6 @@ static __poll_t ns_revision_poll(struct void __aa_bump_ns_revision(struct aa_ns *ns) { - WRITE_ONCE(ns->revision, READ_ONCE(ns->revision) + 1); wake_up_interruptible(&ns->wait); } diff -u -p /home/julia/linux/arch/riscv/kvm/vmid.c /tmp/nothing/arch/riscv/kvm/vmid.c --- /home/julia/linux/arch/riscv/kvm/vmid.c +++ /tmp/nothing/arch/riscv/kvm/vmid.c @@ -90,7 +90,6 @@ void kvm_riscv_gstage_vmid_update(struct /* First user of a new VMID version? */ if (unlikely(vmid_next == 0)) { - WRITE_ONCE(vmid_version, READ_ONCE(vmid_version) + 1); vmid_next = 1; /*
On Thu, 7 Mar 2024 at 13:40, Julia Lawall <julia.lawall@inria.fr> wrote: > > I tried the following: > > @@ > expression x; > @@ > > *WRITE_ONCE(x,<+...READ_ONCE(x)...+>) > > This gave a number of results, shown below. Let me know if some of them > are undesirable. Well, all the ones you list do look like garbage. That said, quite often the garbage does seem to be "we don't actually care about the result". Several of them look like statistics. Some of them look outright nasty, though: > --- /home/julia/linux/net/netfilter/nf_tables_api.c > +++ /tmp/nothing/net/netfilter/nf_tables_api.c > @@ -10026,8 +10026,6 @@ static unsigned int nft_gc_seq_begin(str > unsigned int gc_seq; > > /* Bump gc counter, it becomes odd, this is the busy mark. */ > - gc_seq = READ_ONCE(nft_net->gc_seq); > - WRITE_ONCE(nft_net->gc_seq, ++gc_seq); The above is garbage code, and the comment implies that it is garbage code that _should_ be reliable. > diff -u -p /home/julia/linux/fs/xfs/xfs_icache.c /tmp/nothing/fs/xfs/xfs_icache.c > --- /home/julia/linux/fs/xfs/xfs_icache.c > +++ /tmp/nothing/fs/xfs/xfs_icache.c > @@ -2076,8 +2076,6 @@ xfs_inodegc_queue( > cpu_nr = get_cpu(); > gc = this_cpu_ptr(mp->m_inodegc); > llist_add(&ip->i_gclist, &gc->list); > - items = READ_ONCE(gc->items); > - WRITE_ONCE(gc->items, items + 1); In contrast, this is also garbage code, but the only user of it seems to be a heuristic, so if 'items' is off by one (or by a hundred), it probably doesn't matter. The xfs code is basically using that 'items' count to decide if it really wants to do GC or not. This is actually a case where having a "UNSAFE_INCREMENTISH()" macro might make sense. That said, this is also a case where using a "local_t" and using "local_add_return()" might be a better option. It falls back on true atomics, but at least on x86 you probably get *better* code generation for the "incrementish" operation than you get with READ_ONCE -> WRITE_ONCE. 
> diff -u -p /home/julia/linux/kernel/rcu/tree.c /tmp/nothing/kernel/rcu/tree.c > --- /home/julia/linux/kernel/rcu/tree.c > +++ /tmp/nothing/kernel/rcu/tree.c > @@ -1620,8 +1620,6 @@ static void rcu_gp_fqs(bool first_time) > /* Clear flag to prevent immediate re-entry. */ > if (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) { > raw_spin_lock_irq_rcu_node(rnp); > - WRITE_ONCE(rcu_state.gp_flags, > - READ_ONCE(rcu_state.gp_flags) & ~RCU_GP_FLAG_FQS); > raw_spin_unlock_irq_rcu_node(rnp); This smells bad to me. The code is holding a lock, but apparently not one that protects gp_flags. And that READ_ONCE->WRITE_ONCE sequence can corrupt all the other flags. Maybe it's fine for some reason (that reason being either that the ONCE operations aren't actually needed at all, or because nobody *really* cares about the flags), but it smells. > @@ -1882,8 +1880,6 @@ static void rcu_report_qs_rsp(unsigned l > { > raw_lockdep_assert_held_rcu_node(rcu_get_root()); > WARN_ON_ONCE(!rcu_gp_in_progress()); > - WRITE_ONCE(rcu_state.gp_flags, > - READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS); > raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(), flags); Same field, same lock held, same odd smelly pattern. > - WRITE_ONCE(rcu_state.gp_flags, > - READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS); > raw_spin_unlock_irqrestore_rcu_node(rnp_old, flags); .. and again. > --- /home/julia/linux/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c > +++ /tmp/nothing/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c > @@ -80,8 +80,6 @@ static int cn23xx_vf_reset_io_queues(str > q_no); > return -1; > } > - WRITE_ONCE(reg_val, READ_ONCE(reg_val) & > - ~CN23XX_PKT_INPUT_CTL_RST); > octeon_write_csr64(oct, CN23XX_VF_SLI_IQ_PKT_CONTROL64(q_no), > READ_ONCE(reg_val)); I suspect this is garbage that has been triggered by the usual mindless "fix the symptoms, not the bug" as a result of a "undefined behavior report". 
>> --- /home/julia/linux/kernel/kcsan/kcsan_test.c > +++ /tmp/nothing/kernel/kcsan/kcsan_test.c > @@ -381,7 +381,6 @@ static noinline void test_kernel_change_ > test_var ^= TEST_CHANGE_BITS; > kcsan_nestable_atomic_end(); > } else > - WRITE_ONCE(test_var, READ_ONCE(test_var) ^ TEST_CHANGE_BITS); Presumably this is intentionally testing whether KCSAN notices these things at all. > diff -u -p /home/julia/linux/arch/s390/kernel/idle.c /tmp/nothing/arch/s390/kernel/idle.c > /* Account time spent with enabled wait psw loaded as idle time. */ > - WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time); > - WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1); > account_idle_time(cputime_to_nsecs(idle_time)); This looks like another "UNSAFE_INCREMENTISH()" case. > --- /home/julia/linux/mm/mmap.c > +++ /tmp/nothing/mm/mmap.c > @@ -3476,7 +3476,6 @@ bool may_expand_vm(struct mm_struct *mm, > > void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages) > { > - WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm)+npages); As does this. > diff -u -p /home/julia/linux/fs/xfs/libxfs/xfs_iext_tree.c /tmp/nothing/fs/xfs/libxfs/xfs_iext_tree.c > static inline void xfs_iext_inc_seq(struct xfs_ifork *ifp) > { > - WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1); > } Ugh. A sequence count that is "incrementish"? That smells wrong to me. But I didn't go look at the users. Maybe it's another case of "we don't *actually* care about the sequence count". > > +++ /tmp/nothing/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c > @@ -379,8 +379,6 @@ static int cn23xx_reset_io_queues(struct > q_no); > return -1; > } > - WRITE_ONCE(reg_val, READ_ONCE(reg_val) & > - ~CN23XX_PKT_INPUT_CTL_RST); > .... > - WRITE_ONCE(d64, READ_ONCE(d64) & > - (~(CN23XX_PKT_INPUT_CTL_RING_ENB))); > - WRITE_ONCE(d64, READ_ONCE(d64) | CN23XX_PKT_INPUT_CTL_RST); More "likely wrong" cases. 
> +++ /tmp/nothing/mm/kfence/kfence_test.c > @@ -501,7 +501,6 @@ static void test_kmalloc_aligned_oob_wri > * fault immediately after it. > */ > expect.addr = buf + size; > - WRITE_ONCE(*expect.addr, READ_ONCE(*expect.addr) + 1); Looks like questionable test-code again. > +++ /tmp/nothing/io_uring/io_uring.c > @@ -363,7 +363,6 @@ static void io_account_cq_overflow(struc > { > struct io_rings *r = ctx->rings; > > - WRITE_ONCE(r->cq_overflow, READ_ONCE(r->cq_overflow) + 1); > ctx->cq_extra--; Bah. Looks like garbage, but the kernel doesn't actually use that value. Looks like a random number generator exposed to user space. Presumably this is another "statistics, but I don't care enough". > @@ -2403,8 +2402,6 @@ static bool io_get_sqe(struct io_ring_ct > - WRITE_ONCE(ctx->rings->sq_dropped, > - READ_ONCE(ctx->rings->sq_dropped) + 1); As is the above. > +++ /tmp/nothing/security/apparmor/apparmorfs.c > @@ -596,7 +596,6 @@ static __poll_t ns_revision_poll(struct > > void __aa_bump_ns_revision(struct aa_ns *ns) > { > - WRITE_ONCE(ns->revision, READ_ONCE(ns->revision) + 1); > wake_up_interruptible(&ns->wait); This looks like somebody copied the RCU / tracing pattern? > +++ /tmp/nothing/arch/riscv/kvm/vmid.c > @@ -90,7 +90,6 @@ void kvm_riscv_gstage_vmid_update(struct > > /* First user of a new VMID version? */ > if (unlikely(vmid_next == 0)) { > - WRITE_ONCE(vmid_version, READ_ONCE(vmid_version) + 1); > vmid_next = 1; Looks bogus and wrong. An unreliable address space version does _not_ sound sane, but who knows. Anyway, from a quick look, there's a mix of "this is just wrong" and a couple of "this seems to just want approximate statistics". Maybe the RCU 'flags' field is using WRITE_ONCE() because while the spinlock protects the bit changes, there are readers that look at other bits with READ_ONCE. That would imply that the READ_ONCE->WRITE_ONCE is just broken garbage - the WRITE_ONCE() part may be right, but the READ_ONCE is wrong because the value is stable. Linus
On Thu, Mar 07, 2024 at 02:09:00PM -0800, Linus Torvalds wrote: > On Thu, 7 Mar 2024 at 13:40, Julia Lawall <julia.lawall@inria.fr> wrote: > > > > I tried the following: > > > > @@ > > expression x; > > @@ > > > > *WRITE_ONCE(x,<+...READ_ONCE(x)...+>) > > > > This gave a number of results, shown below. Let me know if some of them > > are undesirable. > > Well, all the ones you list do look like garbage. > > That said, quite often the garbage does seem to be "we don't actually > care about the result". Several of them look like statistics. [ . . . ] > > diff -u -p /home/julia/linux/kernel/rcu/tree.c /tmp/nothing/kernel/rcu/tree.c > > --- /home/julia/linux/kernel/rcu/tree.c > > +++ /tmp/nothing/kernel/rcu/tree.c > > @@ -1620,8 +1620,6 @@ static void rcu_gp_fqs(bool first_time) > > /* Clear flag to prevent immediate re-entry. */ > > if (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) { > > raw_spin_lock_irq_rcu_node(rnp); > > - WRITE_ONCE(rcu_state.gp_flags, > > - READ_ONCE(rcu_state.gp_flags) & ~RCU_GP_FLAG_FQS); > > raw_spin_unlock_irq_rcu_node(rnp); > > This smells bad to me. The code is holding a lock, but apparently not > one that protects gp_flags. > > And that READ_ONCE->WRITE_ONCE sequence can corrupt all the other flags. > > Maybe it's fine for some reason (that reason being either that the > ONCE operations aren't actually needed at all, or because nobody > *really* cares about the flags), but it smells. > > > @@ -1882,8 +1880,6 @@ static void rcu_report_qs_rsp(unsigned l > > { > > raw_lockdep_assert_held_rcu_node(rcu_get_root()); > > WARN_ON_ONCE(!rcu_gp_in_progress()); > > - WRITE_ONCE(rcu_state.gp_flags, > > - READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS); > > raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(), flags); > > Same field, same lock held, same odd smelly pattern. > > > - WRITE_ONCE(rcu_state.gp_flags, > > - READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS); > > raw_spin_unlock_irqrestore_rcu_node(rnp_old, flags); > > .. and again. 
In all three cases, the updates are protected by the lock, so the READ_ONCE() is unnecessary. I have queued a commit to remove the READ_ONCE()s. Thanks to both of you (Julia and Linus) for spotting these! Thanx, Paul
On Thu, Mar 07, 2024 at 02:53:43PM -0500, Mathieu Desnoyers wrote: > On 2024-03-07 14:47, Paul E. McKenney wrote: > > On Thu, Mar 07, 2024 at 08:53:05AM -0500, Mathieu Desnoyers wrote: > > > On 2024-03-06 22:37, Paul E. McKenney wrote: > > > > On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote: > > > [...] > > > > > > > > > As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern > > > > > is concerned, the only valid use-case I can think of is > > > > > split counters or RCU implementations where there is a > > > > > single updater doing the increment, and one or more > > > > > concurrent reader threads that need to snapshot a > > > > > consistent value with READ_ONCE(). > > > > > > > [...] > > > > > > > > So what would you use that pattern for? > > > > > > > > One possibility is a per-CPU statistical counter in userspace on a > > > > fastpath, in cases where losing the occasional count is OK. Then learning > > > > your CPU (and possibly being immediately migrated to some other CPU), > > > > READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not) > > > > make sense. > > > > > > > > I suppose the same in the kernel if there was a fastpath so extreme you > > > > could not afford to disable preemption. > > > > > > > > At best, very niche. > > > > > > > > Or am I suffering a failure of imagination yet again? ;-) > > > > > > The (niche) use-cases I have in mind are split-counters and RCU > > > grace period tracking, where precise counters totals are needed > > > (no lost count). > > > > > > In the kernel, this could be: > > > > Thank you for looking into this! > > > > > - A per-cpu counter, each counter incremented from thread context with > > > preemption disabled (single updater per counter), read concurrently by > > > other threads. WRITE_ONCE/READ_ONCE is useful to make sure there > > > is no store/load tearing there. Atomics on the update would be stronger > > > than necessary on the increment fast-path. 
> > > > But if preemption is disabled, the updater can read the value without > > READ_ONCE() without risk of concurrent update. Or are you concerned about > > interrupt handlers? This would have to be a read from the interrupt > > handler, given that an updated from the interrupt handler could result > > in a lost count. > > You are correct that the updater don't need READ_ONCE there. It would > however require a WRITE_ONCE() to match READ_ONCE() from concurrent > reader threads. > > > > > > - A per-thread counter (e.g. within task_struct), only incremented by the > > > single thread, read by various threads concurrently. > > > > Ditto. > > Right, only WRITE_ONCE() on the single updater, READ_ONCE() on readers. > > > > > > - A counter which increment happens to be already protected by a lock, read > > > by various threads without taking the lock. (technically doable, but > > > I'm not sure I see a relevant use-case for it) > > > > In that case, the lock would exclude concurrent updates, so the lock > > holder would not need READ_ONCE(), correct? > > Correct. > > > > > > In user-space: > > > > > > - The "per-cpu" counter would have to use rseq for increments to prevent > > > inopportune migrations, which needs to be implemented in assembler anyway. > > > The counter reads would have to use READ_ONCE(). > > > > Fair enough! > > > > > - The per-thread counter (Thread-Local Storage) incremented by a single > > > thread, read by various threads concurrently, is a good target > > > for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in > > > various liburcu implementations which track read-side critical sections > > > per-thread. > > > > Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or > > similar? > > Not quite, I recall it's more like WRITE_ONCE(x, READ_ONCE(y)) or such, > so we can grab the value of the current gp counter and store it into a > TLS variable. Good point, you could use that pattern to grab a shared snapshot. 
Still sounds niche, but you never know! ;-) Thanx, Paul > > > - Same as for the kernel, a counter increment protected by a lock which > > > needs to be read from various threads concurrently without taking > > > the lock could be a valid use-case, though I fail to see how it is > > > useful due to lack of imagination on my part. ;-) > > > > In RCU, we have "WRITE_ONCE(*sp, *sp + 1)" for this use case, though > > here we have the WRITE_ONCE() but not the READ_ONCE() because we hold > > the lock excluding any other updates. > > You are right, the READ_ONCE() is not needed in this case for the > updater, only for the concurrent readers. > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 7567ca8e743c..00059ace4fd5 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -465,8 +465,8 @@ rcu_torture_pipe_update_one(struct rcu_torture *rp) if (i > RCU_TORTURE_PIPE_LEN) i = RCU_TORTURE_PIPE_LEN; atomic_inc(&rcu_torture_wcount[i]); - WRITE_ONCE(rp->rtort_pipe_count, i + 1); - if (rp->rtort_pipe_count >= RCU_TORTURE_PIPE_LEN) { + WRITE_ONCE(rp->rtort_pipe_count, rp->rtort_pipe_count + 1); + if (READ_ONCE(rp->rtort_pipe_count) >= RCU_TORTURE_PIPE_LEN) { rp->rtort_mbtest = 0; return true; }
Commit 202489101f2e ("rcutorture: Fix rcu_torture_one_read()/rcu_torture_writer() data race") made the following changes to fix a data race: { int i; - i = rp->rtort_pipe_count; + i = READ_ONCE(rp->rtort_pipe_count); if (i > RCU_TORTURE_PIPE_LEN) i = RCU_TORTURE_PIPE_LEN; atomic_inc(&rcu_torture_wcount[i]); - if (++rp->rtort_pipe_count >= RCU_TORTURE_PIPE_LEN) { + WRITE_ONCE(rp->rtort_pipe_count, i + 1); + if (rp->rtort_pipe_count >= RCU_TORTURE_PIPE_LEN) { rp->rtort_mbtest = 0; return true; } But ++rp->rtort_pipe_count is meant to increment rp->rtort_pipe_count by 1, not to assign i + 1 to it, because rp->rtort_pipe_count may be written concurrently by rcu_torture_writer(). Also, rp->rtort_pipe_count on the next line should be read using READ_ONCE() because of the data race. Signed-off-by: linke li <lilinke99@qq.com> --- kernel/rcu/rcutorture.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)