
[RFC,v2] Fix get_state_synchronize_rcu_full() GP-start detection

Message ID da5065c4-79ba-431f-9d7e-1ca314394443@paulmck-laptop (mailing list archive)
State Accepted
Commit c5d3a3eaefafd5554a775763e1bbaffcd49e851f
Series [RFC,v2] Fix get_state_synchronize_rcu_full() GP-start detection

Commit Message

Paul E. McKenney Dec. 13, 2024, 7:49 p.m. UTC
The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full()
functions use the root rcu_node structure's ->gp_seq field to detect
the beginnings and ends of grace periods, respectively.  This choice is
necessary for the poll_state_synchronize_rcu_full() function because
(give or take counter wrap), the following sequence is guaranteed not
to trigger:

        get_state_synchronize_rcu_full(&rgos);
        synchronize_rcu();
        WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));

The RCU callbacks that awaken synchronize_rcu() instances are
guaranteed not to be invoked before the root rcu_node structure's
->gp_seq field is updated to indicate the end of the grace period.
However, these callbacks might start being invoked immediately
thereafter, in particular, before rcu_state.gp_seq has been updated.
Therefore, poll_state_synchronize_rcu_full() must refer to the
root rcu_node structure's ->gp_seq field.  Because this field is
updated under this structure's ->lock, any code following a call to
poll_state_synchronize_rcu_full() will be fully ordered after the
full grace-period computation, as is required by RCU's memory-ordering
semantics.
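
For illustration, the following is a hedged usage sketch of the polled
full-state API discussed above.  The API calls (struct rcu_gp_oldstate,
get_state_synchronize_rcu_full(), poll_state_synchronize_rcu_full(),
rcu_replace_pointer(), and call_rcu()) are the kernel's; the surrounding
struct foo, gp pointer, lock, and callback are hypothetical scaffolding:

        /* Kernel context: linux/rcupdate.h, linux/slab.h, linux/spinlock.h. */
        struct foo {
                struct rcu_head rh;
                int data;
        };

        static struct foo __rcu *gp;
        static DEFINE_SPINLOCK(mylock);

        static void free_foo_cb(struct rcu_head *rhp)
        {
                kfree(container_of(rhp, struct foo, rh));
        }

        static void update_foo(struct foo *newp)
        {
                struct rcu_gp_oldstate rgos;
                struct foo *old;

                spin_lock(&mylock);
                old = rcu_replace_pointer(gp, newp, lockdep_is_held(&mylock));
                spin_unlock(&mylock);
                if (!old)
                        return;

                /* Snapshot the full polled grace-period state. */
                get_state_synchronize_rcu_full(&rgos);

                /* ... presumably much later ... */
                if (poll_state_synchronize_rcu_full(&rgos))
                        kfree(old);     /* Grace period elapsed, fully ordered. */
                else
                        call_rcu(&old->rh, free_foo_cb);        /* Not yet, defer. */
        }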

By symmetry, the get_state_synchronize_rcu_full() function should also
use this same root rcu_node structure's ->gp_seq field.  But it turns out
that symmetry is profoundly (though extremely infrequently) destructive
in this case.  To see this, consider the following sequence of events:

1.      CPU 0 starts a new grace period, and updates rcu_state.gp_seq
        accordingly.

2.      As its first step of grace-period initialization, CPU 0 examines
        the current CPU hotplug state and decides that it need not wait
        for CPU 1, which is currently offline.

3.      CPU 1 comes online, and updates its state.  But this affects
        not the current grace period, but rather the one after that.
        After all, CPU 1 was offline when the current grace period
        started, so all pre-existing RCU readers on CPU 1 must have
        completed or been preempted before it last went offline.
        The current grace period therefore has nothing it needs to wait
        for on CPU 1.

4.      CPU 1 switches to an rcutorture kthread which is running
        rcutorture's rcu_torture_reader() function, which starts a new
        RCU reader.

5.      CPU 2 is running rcutorture's rcu_torture_writer() function
        and collects a new polled grace-period "cookie" using
        get_state_synchronize_rcu_full().  Because the newly started
        grace period has not completed initialization, the root rcu_node
        structure's ->gp_seq field has not yet been updated to indicate
        that this new grace period has already started.

        This cookie is therefore set up for the end of the current grace
        period (rather than the end of the following grace period).

6.      CPU 0 finishes grace-period initialization.

7.      If CPU 1's rcutorture reader is preempted, it will be added to
        the ->blkd_tasks list, but because CPU 1's ->qsmask bit is not
        set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer
        will not be updated.  Thus, this grace period will not wait on
        it.  Which is only fair, given that the CPU did not come online
        until after the grace period officially started.

8.      CPUs 0 and 2 then detect the new grace period and report
        a quiescent state to the RCU core.

9.      Because CPU 1 was offline at the start of the current grace
        period, CPUs 0 and 2 are the only CPUs that this grace period
        needs to wait on.  So the grace period ends and post-grace-period
        cleanup starts.  In particular, the root rcu_node structure's
        ->gp_seq field is updated to indicate that this grace period
        has now ended.

10.     CPU 2 continues running rcu_torture_writer() and sees that,
        from the viewpoint of the root rcu_node structure consulted by
        the poll_state_synchronize_rcu_full() function, the grace period
        has ended.  It therefore updates state accordingly.

11.     CPU 1 is still running the same RCU reader, which notices this
        update and thus complains about the too-short grace period.

The fix is for the get_state_synchronize_rcu_full() function to use
rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq
field.  With this change in place, if step 5's cookie indicates that the
grace period has not yet started, then any prior code executed by CPU 2
must have happened before CPU 1 came online.  This will in turn prevent
CPU 1's code in steps 3 and 11 from spanning CPU 2's grace-period wait,
thus preventing CPU 1 from being subjected to a too-short grace period.
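
In code terms, the change amounts to roughly the following one-line
adjustment inside get_state_synchronize_rcu_full() (a sketch assuming the
mainline rgos_norm field and rcu_seq_snap() helper, not the actual patch
hunk, which is not reproduced in this message):

        -       rgosp->rgos_norm = rcu_seq_snap(&rnp->gp_seq);          /* Root rcu_node. */
        +       rgosp->rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);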

This commit therefore makes this change.  Note that there is no change to
the poll_state_synchronize_rcu_full() function, which as noted above,
must continue to use the root rcu_node structure's ->gp_seq field.
This is of course an asymmetry between these two functions, but is an
asymmetry that is absolutely required for correct operation.  It is a
common human tendency to greatly value symmetry, and sometimes symmetry
is a wonderful thing.  Other times, symmetry results in poor performance.
But in this case, symmetry is just plain wrong.

Nevertheless, the asymmetry does require an additional adjustment.
It is possible for get_state_synchronize_rcu_full() to see a given
grace period as having started, but for an immediately following
poll_state_synchronize_rcu_full() to see it as having not yet started.
Given the current rcu_seq_done_exact() implementation, this will
result in a false-positive indication that the grace period is done
from poll_state_synchronize_rcu_full().  This is dealt with by making
rcu_seq_done_exact() reach back three grace periods rather than just
two of them.
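
The arithmetic can be checked with the following self-contained userspace
model.  The comparison macros and constants mirror kernel/rcu/rcu.h, and
rcu_seq_snap() is paraphrased from the kernel source; the concrete sequence
values are made up for illustration, and this is a model rather than
kernel code:

        /* Userspace model of the gp_seq reach-back arithmetic. */
        #include <stdio.h>
        #include <limits.h>

        #define ULONG_CMP_GE(a, b)      (ULONG_MAX / 2 >= (a) - (b))
        #define ULONG_CMP_LT(a, b)      (ULONG_MAX / 2 < (a) - (b))

        #define RCU_SEQ_CTR_SHIFT       2
        #define RCU_SEQ_STATE_MASK      ((1UL << RCU_SEQ_CTR_SHIFT) - 1)

        /* What get_state_synchronize_rcu_full() folds into the cookie. */
        static unsigned long rcu_seq_snap(unsigned long s)
        {
                return (s + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
        }

        /* rcu_seq_done_exact() with a parameterized reach-back. */
        static int done_exact(unsigned long cur_s, unsigned long s,
                              unsigned long reach)
        {
                return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - reach);
        }

        int main(void)
        {
                unsigned long rnp_gp_seq = 100UL << RCU_SEQ_CTR_SHIFT;  /* Idle. */
                unsigned long rcu_state_gp_seq = rnp_gp_seq + 1;        /* Started. */
                unsigned long cookie = rcu_seq_snap(rcu_state_gp_seq);

                /* poll_state_synchronize_rcu_full() still sees the root value. */
                printf("two-GP reach-back:   done=%d (false positive)\n",
                       done_exact(rnp_gp_seq, cookie, 2 * RCU_SEQ_STATE_MASK + 1));
                printf("three-GP reach-back: done=%d (correctly not done)\n",
                       done_exact(rnp_gp_seq, cookie, 3 * RCU_SEQ_STATE_MASK + 1));
                return 0;
        }

With the two-grace-period reach-back, the model reports a false-positive
"done"; with the three-grace-period reach-back, it correctly reports "not
done".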

Although this fixes 91a967fd6934 ("rcu: Add full-sized polling for
get_completed*() and poll_state*()"), it is not clear that it is worth
backporting this commit.  First, it took me many weeks to convince
rcutorture to reproduce this more frequently than once per year.  Second,
this cannot be reproduced at all without frequent CPU-hotplug operations,
as in waiting all of 50 milliseconds from the end of the previous
operation until starting the next one.  Third, the TREE03.boot settings
cause multi-millisecond delays during RCU grace-period initialization,
which greatly increase the probability of the above sequence of events.
(Don't do this in production workloads!)  Fourth, extremely heavy use of
get_state_synchronize_rcu_full() and/or poll_state_synchronize_rcu_full()
is required to reproduce this, and as of v6.12, only kfree_rcu() uses it,
and even then not particularly heavily.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

---

Changes since RFC v1:

o	Update rcu_seq_done_exact() to handle backwards-gp_seq effects
	of get_state_synchronize_rcu_full() using rcu_state.gp_seq and
	poll_state_synchronize_rcu_full() continuing to use the root
	rcu_node structure's ->gp_seq field.  With this commit, rcutorture
	passes 10 hours of 400 instances of TREE03.  Without this commit,
	there are about 50 too-short grace periods per hour, again with
	400 instances of TREE03.

o	Add TREE03.boot's multi-millisecond delays for grace-period
	initialization.

Comments

Frederic Weisbecker Jan. 24, 2025, 2:49 p.m. UTC | #1
On Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney wrote:
> The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full()
> functions use the root rcu_node structure's ->gp_seq field to detect
> the beginnings and ends of grace periods, respectively.  This choice is
> necessary for the poll_state_synchronize_rcu_full() function because
> (give or take counter wrap), the following sequence is guaranteed not
> to trigger:
> 
>         get_state_synchronize_rcu_full(&rgos);
>         synchronize_rcu();
>         WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));
> 
> The RCU callbacks that awaken synchronize_rcu() instances are
> guaranteed not to be invoked before the root rcu_node structure's
> ->gp_seq field is updated to indicate the end of the grace period.
> However, these callbacks might start being invoked immediately
> thereafter, in particular, before rcu_state.gp_seq has been updated.
> Therefore, poll_state_synchronize_rcu_full() must refer to the
> root rcu_node structure's ->gp_seq field.  Because this field is
> updated under this structure's ->lock, any code following a call to
> poll_state_synchronize_rcu_full() will be fully ordered after the
> full grace-period computation, as is required by RCU's memory-ordering
> semantics.
> 
> By symmetry, the get_state_synchronize_rcu_full() function should also
> use this same root rcu_node structure's ->gp_seq field.  But it turns out
> that symmetry is profoundly (though extremely infrequently) destructive
> in this case.  To see this, consider the following sequence of events:
> 
> 1.      CPU 0 starts a new grace period, and updates rcu_state.gp_seq
>         accordingly.
> 
> 2.      As its first step of grace-period initialization, CPU 0 examines
>         the current CPU hotplug state and decides that it need not wait
>         for CPU 1, which is currently offline.
> 
> 3.      CPU 1 comes online, and updates its state.  But this does not
>         affect the current grace period, but rather the one after that.
>         After all, CPU 1 was offline when the current grace period
>         started, so all pre-existing RCU readers on CPU 1 must have
>         completed or been preempted before it last went offline.
>         The current grace period therefore has nothing it needs to wait
>         for on CPU 1.
> 
> 4.      CPU 1 switches to an rcutorture kthread which is running
>         rcutorture's rcu_torture_reader() function, which starts a new
>         RCU reader.
> 
> 5.      CPU 2 is running rcutorture's rcu_torture_writer() function
>         and collects a new polled grace-period "cookie" using
>         get_state_synchronize_rcu_full().  Because the newly started
>         grace period has not completed initialization, the root rcu_node
>         structure's ->gp_seq field has not yet been updated to indicate
>         that this new grace period has already started.
> 
>         This cookie is therefore set up for the end of the current grace
>         period (rather than the end of the following grace period).
> 
> 6.      CPU 0 finishes grace-period initialization.
> 
> 7.      If CPU 1’s rcutorture reader is preempted, it will be added to
>         the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not
>         set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer
>         will not be updated.  Thus, this grace period will not wait on
>         it.  Which is only fair, given that the CPU did not come online
>         until after the grace period officially started.
> 
> 8.      CPUs 0 and 2 then detect the new grace period and then report
>         a quiescent state to the RCU core.
> 
> 9.      Because CPU 1 was offline at the start of the current grace
>         period, CPUs 0 and 2 are the only CPUs that this grace period
>         needs to wait on.  So the grace period ends and post-grace-period
>         cleanup starts.  In particular, the root rcu_node structure's
>         ->gp_seq field is updated to indicate that this grace period
>         has now ended.
> 
> 10.     CPU 2 continues running rcu_torture_writer() and sees that,
>         from the viewpoint of the root rcu_node structure consulted by
>         the poll_state_synchronize_rcu_full() function, the grace period
>         has ended.  It therefore updates state accordingly.
> 
> 11.     CPU 1 is still running the same RCU reader, which notices this
>         update and thus complains about the too-short grace period.

I think I get the race, but I must confess I'm not very familiar with how this
all materializes in CPU 2's rcu_torture_writer() and CPU 1's rcu_torture_reader().

Basically this could trigger on CPU 1 with just doing the following, right?

rcu_read_lock()
get_state_synchronize_rcu_full(&rgos);
WARN_ON_ONCE(poll_state_synchronize_rcu_full(&rgos))
rcu_read_unlock()

> 
> The fix is for the get_state_synchronize_rcu_full() function to use
> rcu_state.gp_seq instead of the the root rcu_node structure's ->gp_seq
> field.  With this change in place, if step 5's cookie indicates that the
> grace period has not yet started, then any prior code executed by CPU 2
> must have happened before CPU 1 came online.  This will in turn prevent
> CPU 1's code in steps 3 and 11 from spanning CPU 2's grace-period wait,
> thus preventing CPU 1 from being subjected to a too-short grace period.
> 
> This commit therefore makes this change.  Note that there is no change to
> the poll_state_synchronize_rcu_full() function, which as noted above,
> must continue to use the root rcu_node structure's ->gp_seq field.
> This is of course an asymmetry between these two functions, but is an
> asymmetry that is absolutely required for correct operation.  It is a
> common human tendency to greatly value symmetry, and sometimes symmetry
> is a wonderful thing.  Other times, symmetry results in poor performance.
> But in this case, symmetry is just plain wrong.
> 
> Nevertheless, the asymmetry does require an additional adjustment.
> It is possible for get_state_synchronize_rcu_full() to see a given
> grace period as having started, but for an immediately following
> poll_state_synchronize_rcu_full() to see it as having not yet started.
> Given the current rcu_seq_done_exact() implementation, this will
> result in a false-positive indication that the grace period is done
> from poll_state_synchronize_rcu_full().  This is dealt with by making
> rcu_seq_done_exact() reach back three grace periods rather than just
> two of them.
> 
> Although this fixes 91a967fd6934 ("rcu: Add full-sized polling for
> get_completed*() and poll_state*()"), it is not clear that it is worth
> backporting this commit.  First, it took me many weeks to convince
> rcutorture to reproduce this more frequently than once per year.  Second,
> this cannot be reproduced at all without frequent CPU-hotplug operations,
> as in waiting all of 50 milliseconds from the end of the previous
> operation until starting the next one.  Third, the TREE03.boot settings
> cause multi-millisecond delays during RCU grace-period initialization,
> which greatly increase the probability of the above sequence of events.
> (Don't do this in production workloads!)  Fourth, extremely heavy use of
> get_state_synchronize_rcu_full() and/or poll_state_synchronize_rcu_full()
> is required to reproduce this, and as of v6.12, only kfree_rcu() uses it,
> and even then not particularly heavily.

I'm wondering: what prevents us from removing rcu_state.gp_seq and relying
only on the root node for the global state?

Thanks.
Paul E. McKenney Jan. 24, 2025, 3:58 p.m. UTC | #2
On Fri, Jan 24, 2025 at 03:49:24PM +0100, Frederic Weisbecker wrote:
> Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit :
> > The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full()
> > functions use the root rcu_node structure's ->gp_seq field to detect
> > the beginnings and ends of grace periods, respectively.  This choice is
> > necessary for the poll_state_synchronize_rcu_full() function because
> > (give or take counter wrap), the following sequence is guaranteed not
> > to trigger:
> > 
> >         get_state_synchronize_rcu_full(&rgos);
> >         synchronize_rcu();
> >         WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));
> > 
> > The RCU callbacks that awaken synchronize_rcu() instances are
> > guaranteed not to be invoked before the root rcu_node structure's
> > ->gp_seq field is updated to indicate the end of the grace period.
> > However, these callbacks might start being invoked immediately
> > thereafter, in particular, before rcu_state.gp_seq has been updated.
> > Therefore, poll_state_synchronize_rcu_full() must refer to the
> > root rcu_node structure's ->gp_seq field.  Because this field is
> > updated under this structure's ->lock, any code following a call to
> > poll_state_synchronize_rcu_full() will be fully ordered after the
> > full grace-period computation, as is required by RCU's memory-ordering
> > semantics.
> > 
> > By symmetry, the get_state_synchronize_rcu_full() function should also
> > use this same root rcu_node structure's ->gp_seq field.  But it turns out
> > that symmetry is profoundly (though extremely infrequently) destructive
> > in this case.  To see this, consider the following sequence of events:
> > 
> > 1.      CPU 0 starts a new grace period, and updates rcu_state.gp_seq
> >         accordingly.
> > 
> > 2.      As its first step of grace-period initialization, CPU 0 examines
> >         the current CPU hotplug state and decides that it need not wait
> >         for CPU 1, which is currently offline.
> > 
> > 3.      CPU 1 comes online, and updates its state.  But this does not
> >         affect the current grace period, but rather the one after that.
> >         After all, CPU 1 was offline when the current grace period
> >         started, so all pre-existing RCU readers on CPU 1 must have
> >         completed or been preempted before it last went offline.
> >         The current grace period therefore has nothing it needs to wait
> >         for on CPU 1.
> > 
> > 4.      CPU 1 switches to an rcutorture kthread which is running
> >         rcutorture's rcu_torture_reader() function, which starts a new
> >         RCU reader.
> > 
> > 5.      CPU 2 is running rcutorture's rcu_torture_writer() function
> >         and collects a new polled grace-period "cookie" using
> >         get_state_synchronize_rcu_full().  Because the newly started
> >         grace period has not completed initialization, the root rcu_node
> >         structure's ->gp_seq field has not yet been updated to indicate
> >         that this new grace period has already started.
> > 
> >         This cookie is therefore set up for the end of the current grace
> >         period (rather than the end of the following grace period).
> > 
> > 6.      CPU 0 finishes grace-period initialization.
> > 
> > 7.      If CPU 1’s rcutorture reader is preempted, it will be added to
> >         the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not
> >         set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer
> >         will not be updated.  Thus, this grace period will not wait on
> >         it.  Which is only fair, given that the CPU did not come online
> >         until after the grace period officially started.
> > 
> > 8.      CPUs 0 and 2 then detect the new grace period and then report
> >         a quiescent state to the RCU core.
> > 
> > 9.      Because CPU 1 was offline at the start of the current grace
> >         period, CPUs 0 and 2 are the only CPUs that this grace period
> >         needs to wait on.  So the grace period ends and post-grace-period
> >         cleanup starts.  In particular, the root rcu_node structure's
> >         ->gp_seq field is updated to indicate that this grace period
> >         has now ended.
> > 
> > 10.     CPU 2 continues running rcu_torture_writer() and sees that,
> >         from the viewpoint of the root rcu_node structure consulted by
> >         the poll_state_synchronize_rcu_full() function, the grace period
> >         has ended.  It therefore updates state accordingly.
> > 
> > 11.     CPU 1 is still running the same RCU reader, which notices this
> >         update and thus complains about the too-short grace period.
> 
> I think I get the race but I must confess I'm not very familiar with how this
> all materialize on CPU 2's rcu_torture_writer() and CPU 1's rcu_torture_reader().
> 
> Basically this could trigger on CPU 1 with just doing the following, right?
> 
> rcu_read_lock()
> get_state_synchronize_rcu_full(&rgos);
> WARN_ON_ONCE(poll_state_synchronize_rcu_full(&rgos))
> rcu_read_unlock()

CPU 1 would do rcu_read_lock()/checks/rcu_read_unlock() as the
reader, and CPU 2 would do get_state_synchronize_rcu_full(), later
poll_state_synchronize_rcu_full(), which would (erroneously) indicate
a completed grace period, so it would update the state, triggering CPU
1's checks.

> > The fix is for the get_state_synchronize_rcu_full() function to use
> > rcu_state.gp_seq instead of the the root rcu_node structure's ->gp_seq
> > field.  With this change in place, if step 5's cookie indicates that the
> > grace period has not yet started, then any prior code executed by CPU 2
> > must have happened before CPU 1 came online.  This will in turn prevent
> > CPU 1's code in steps 3 and 11 from spanning CPU 2's grace-period wait,
> > thus preventing CPU 1 from being subjected to a too-short grace period.
> > 
> > This commit therefore makes this change.  Note that there is no change to
> > the poll_state_synchronize_rcu_full() function, which as noted above,
> > must continue to use the root rcu_node structure's ->gp_seq field.
> > This is of course an asymmetry between these two functions, but is an
> > asymmetry that is absolutely required for correct operation.  It is a
> > common human tendency to greatly value symmetry, and sometimes symmetry
> > is a wonderful thing.  Other times, symmetry results in poor performance.
> > But in this case, symmetry is just plain wrong.
> > 
> > Nevertheless, the asymmetry does require an additional adjustment.
> > It is possible for get_state_synchronize_rcu_full() to see a given
> > grace period as having started, but for an immediately following
> > poll_state_synchronize_rcu_full() to see it as having not yet started.
> > Given the current rcu_seq_done_exact() implementation, this will
> > result in a false-positive indication that the grace period is done
> > from poll_state_synchronize_rcu_full().  This is dealt with by making
> > rcu_seq_done_exact() reach back three grace periods rather than just
> > two of them.
> > 
> > Although this fixes 91a967fd6934 ("rcu: Add full-sized polling for
> > get_completed*() and poll_state*()"), it is not clear that it is worth
> > backporting this commit.  First, it took me many weeks to convince
> > rcutorture to reproduce this more frequently than once per year.  Second,
> > this cannot be reproduced at all without frequent CPU-hotplug operations,
> > as in waiting all of 50 milliseconds from the end of the previous
> > operation until starting the next one.  Third, the TREE03.boot settings
> > cause multi-millisecond delays during RCU grace-period initialization,
> > which greatly increase the probability of the above sequence of events.
> > (Don't do this in production workloads!)  Fourth, extremely heavy use of
> > get_state_synchronize_rcu_full() and/or poll_state_synchronize_rcu_full()
> > is required to reproduce this, and as of v6.12, only kfree_rcu() uses it,
> > and even then not particularly heavily.
> 
> I'm wondering, what prevents us from removing rcu_state.gp_seq and rely only on
> the root node for the global state ?

One scenario comes to mind immediately.  There may be others.

Suppose we were running with default configuration on a system with
"only" eight CPUs.  Then there is only the one rcu_node structure,
which is both root and leaf.  Without rcu_state.gp_seq, there
would be no way to communicate the beginning of the grace period to
get_state_synchronize_rcu_full() without also allowing quiescent states
to be reported.  There would thus be no time in which to check for newly
onlined/offlined CPUs.

							Thanx, Paul
Frederic Weisbecker Jan. 24, 2025, 4:42 p.m. UTC | #3
On Fri, Jan 24, 2025 at 07:58:20AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 24, 2025 at 03:49:24PM +0100, Frederic Weisbecker wrote:
> > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit :
> > > The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full()
> > > functions use the root rcu_node structure's ->gp_seq field to detect
> > > the beginnings and ends of grace periods, respectively.  This choice is
> > > necessary for the poll_state_synchronize_rcu_full() function because
> > > (give or take counter wrap), the following sequence is guaranteed not
> > > to trigger:
> > > 
> > >         get_state_synchronize_rcu_full(&rgos);
> > >         synchronize_rcu();
> > >         WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));
> > > 
> > > The RCU callbacks that awaken synchronize_rcu() instances are
> > > guaranteed not to be invoked before the root rcu_node structure's
> > > ->gp_seq field is updated to indicate the end of the grace period.
> > > However, these callbacks might start being invoked immediately
> > > thereafter, in particular, before rcu_state.gp_seq has been updated.
> > > Therefore, poll_state_synchronize_rcu_full() must refer to the
> > > root rcu_node structure's ->gp_seq field.  Because this field is
> > > updated under this structure's ->lock, any code following a call to
> > > poll_state_synchronize_rcu_full() will be fully ordered after the
> > > full grace-period computation, as is required by RCU's memory-ordering
> > > semantics.
> > > 
> > > By symmetry, the get_state_synchronize_rcu_full() function should also
> > > use this same root rcu_node structure's ->gp_seq field.  But it turns out
> > > that symmetry is profoundly (though extremely infrequently) destructive
> > > in this case.  To see this, consider the following sequence of events:
> > > 
> > > 1.      CPU 0 starts a new grace period, and updates rcu_state.gp_seq
> > >         accordingly.
> > > 
> > > 2.      As its first step of grace-period initialization, CPU 0 examines
> > >         the current CPU hotplug state and decides that it need not wait
> > >         for CPU 1, which is currently offline.
> > > 
> > > 3.      CPU 1 comes online, and updates its state.  But this does not
> > >         affect the current grace period, but rather the one after that.
> > >         After all, CPU 1 was offline when the current grace period
> > >         started, so all pre-existing RCU readers on CPU 1 must have
> > >         completed or been preempted before it last went offline.
> > >         The current grace period therefore has nothing it needs to wait
> > >         for on CPU 1.
> > > 
> > > 4.      CPU 1 switches to an rcutorture kthread which is running
> > >         rcutorture's rcu_torture_reader() function, which starts a new
> > >         RCU reader.
> > > 
> > > 5.      CPU 2 is running rcutorture's rcu_torture_writer() function
> > >         and collects a new polled grace-period "cookie" using
> > >         get_state_synchronize_rcu_full().  Because the newly started
> > >         grace period has not completed initialization, the root rcu_node
> > >         structure's ->gp_seq field has not yet been updated to indicate
> > >         that this new grace period has already started.
> > > 
> > >         This cookie is therefore set up for the end of the current grace
> > >         period (rather than the end of the following grace period).
> > > 
> > > 6.      CPU 0 finishes grace-period initialization.
> > > 
> > > 7.      If CPU 1’s rcutorture reader is preempted, it will be added to
> > >         the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not
> > >         set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer
> > >         will not be updated.  Thus, this grace period will not wait on
> > >         it.  Which is only fair, given that the CPU did not come online
> > >         until after the grace period officially started.
> > > 
> > > 8.      CPUs 0 and 2 then detect the new grace period and then report
> > >         a quiescent state to the RCU core.
> > > 
> > > 9.      Because CPU 1 was offline at the start of the current grace
> > >         period, CPUs 0 and 2 are the only CPUs that this grace period
> > >         needs to wait on.  So the grace period ends and post-grace-period
> > >         cleanup starts.  In particular, the root rcu_node structure's
> > >         ->gp_seq field is updated to indicate that this grace period
> > >         has now ended.
> > > 
> > > 10.     CPU 2 continues running rcu_torture_writer() and sees that,
> > >         from the viewpoint of the root rcu_node structure consulted by
> > >         the poll_state_synchronize_rcu_full() function, the grace period
> > >         has ended.  It therefore updates state accordingly.
> > > 
> > > 11.     CPU 1 is still running the same RCU reader, which notices this
> > >         update and thus complains about the too-short grace period.
> > 
> > I think I get the race but I must confess I'm not very familiar with how this
> > all materialize on CPU 2's rcu_torture_writer() and CPU 1's rcu_torture_reader().
> > 
> > Basically this could trigger on CPU 1 with just doing the following, right?
> > 
> > rcu_read_lock()
> > get_state_synchronize_rcu_full(&rgos);
> > WARN_ON_ONCE(poll_state_synchronize_rcu_full(&rgos))
> > rcu_read_unlock()
> 
> CPU 1 would do rcu_read_lock()/checks/rcu_read_unlock() as the
> reader, and CPU 2 would do get_state_synchronize_rcu_full(), later
> poll_state_synchronize_rcu_full(), which would (erroneously) indicate
> a completed grace period, so it would update the state, triggering CPU
> 1's checks.

I see, so if I generalize the problem beyond rcutorture, it looks like this,
right?

CPU 0                                    CPU 1                       CPU 2
-----                                    ------                      -----

rcu_seq_start(rcu_state.gp_seq)
                                         // CPU boots
                                         rcu_read_lock()
                                         r0 = rcu_dereference(*X)
                                                                     r1 = *X
                                                                     *X = NULL
                                                                     // relies on rnp->gp_seq and not rcu_state.gp_seq
                                                                     get_state_synchronize_rcu_full(&rgos)
WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);
rcu_report_qs_rdp()
                                                                      rcu_report_qs_rdp()
                                                                      rcu_report_qs_rnp()
                                                                      rcu_report_qs_rsp()
                                                                      if (poll_state_synchronize_rcu_full(&rgos))
                                                                          kfree(r1)
                                          // play with r0 and crash


> > I'm wondering, what prevents us from removing rcu_state.gp_seq and rely only on
> > the root node for the global state ?
> 
> One scenario comes to mind immediately.  There may be others.
> 
> Suppose we were running with default configuration on a system with
> "only" eight CPUs.  Then there is only the one rcu_node structure,
> which is both root and leaf.  Without rcu_state.gp_seq, there
> would be no way to communicate the beginning of the grace period to
> get_state_synchronize_rcu_full() without also allowing quiescent states
> to be reported.  There would thus be no time in which to check for newly
> onlined/offlined CPUs.

Heh, that makes sense! Though perhaps that qsmaskinit[next] handling part
could be done before rcu_seq_start()?

Thanks.

> 
> 							Thanx, Paul
Paul E. McKenney Jan. 24, 2025, 7:40 p.m. UTC | #4
On Fri, Jan 24, 2025 at 05:42:29PM +0100, Frederic Weisbecker wrote:
> Le Fri, Jan 24, 2025 at 07:58:20AM -0800, Paul E. McKenney a écrit :
> > On Fri, Jan 24, 2025 at 03:49:24PM +0100, Frederic Weisbecker wrote:
> > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit :
> > > > The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full()
> > > > functions use the root rcu_node structure's ->gp_seq field to detect
> > > > the beginnings and ends of grace periods, respectively.  This choice is
> > > > necessary for the poll_state_synchronize_rcu_full() function because
> > > > (give or take counter wrap), the following sequence is guaranteed not
> > > > to trigger:
> > > > 
> > > >         get_state_synchronize_rcu_full(&rgos);
> > > >         synchronize_rcu();
> > > >         WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));
> > > > 
> > > > The RCU callbacks that awaken synchronize_rcu() instances are
> > > > guaranteed not to be invoked before the root rcu_node structure's
> > > > ->gp_seq field is updated to indicate the end of the grace period.
> > > > However, these callbacks might start being invoked immediately
> > > > thereafter, in particular, before rcu_state.gp_seq has been updated.
> > > > Therefore, poll_state_synchronize_rcu_full() must refer to the
> > > > root rcu_node structure's ->gp_seq field.  Because this field is
> > > > updated under this structure's ->lock, any code following a call to
> > > > poll_state_synchronize_rcu_full() will be fully ordered after the
> > > > full grace-period computation, as is required by RCU's memory-ordering
> > > > semantics.
> > > > 
> > > > By symmetry, the get_state_synchronize_rcu_full() function should also
> > > > use this same root rcu_node structure's ->gp_seq field.  But it turns out
> > > > that symmetry is profoundly (though extremely infrequently) destructive
> > > > in this case.  To see this, consider the following sequence of events:
> > > > 
> > > > 1.      CPU 0 starts a new grace period, and updates rcu_state.gp_seq
> > > >         accordingly.
> > > > 
> > > > 2.      As its first step of grace-period initialization, CPU 0 examines
> > > >         the current CPU hotplug state and decides that it need not wait
> > > >         for CPU 1, which is currently offline.
> > > > 
> > > > 3.      CPU 1 comes online, and updates its state.  But this does not
> > > >         affect the current grace period, but rather the one after that.
> > > >         After all, CPU 1 was offline when the current grace period
> > > >         started, so all pre-existing RCU readers on CPU 1 must have
> > > >         completed or been preempted before it last went offline.
> > > >         The current grace period therefore has nothing it needs to wait
> > > >         for on CPU 1.
> > > > 
> > > > 4.      CPU 1 switches to an rcutorture kthread which is running
> > > >         rcutorture's rcu_torture_reader() function, which starts a new
> > > >         RCU reader.
> > > > 
> > > > 5.      CPU 2 is running rcutorture's rcu_torture_writer() function
> > > >         and collects a new polled grace-period "cookie" using
> > > >         get_state_synchronize_rcu_full().  Because the newly started
> > > >         grace period has not completed initialization, the root rcu_node
> > > >         structure's ->gp_seq field has not yet been updated to indicate
> > > >         that this new grace period has already started.
> > > > 
> > > >         This cookie is therefore set up for the end of the current grace
> > > >         period (rather than the end of the following grace period).
> > > > 
> > > > 6.      CPU 0 finishes grace-period initialization.
> > > > 
> > > > 7.      If CPU 1’s rcutorture reader is preempted, it will be added to
> > > >         the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not
> > > >         set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer
> > > >         will not be updated.  Thus, this grace period will not wait on
> > > >         it.  Which is only fair, given that the CPU did not come online
> > > >         until after the grace period officially started.
> > > > 
> > > > 8.      CPUs 0 and 2 then detect the new grace period and then report
> > > >         a quiescent state to the RCU core.
> > > > 
> > > > 9.      Because CPU 1 was offline at the start of the current grace
> > > >         period, CPUs 0 and 2 are the only CPUs that this grace period
> > > >         needs to wait on.  So the grace period ends and post-grace-period
> > > >         cleanup starts.  In particular, the root rcu_node structure's
> > > >         ->gp_seq field is updated to indicate that this grace period
> > > >         has now ended.
> > > > 
> > > > 10.     CPU 2 continues running rcu_torture_writer() and sees that,
> > > >         from the viewpoint of the root rcu_node structure consulted by
> > > >         the poll_state_synchronize_rcu_full() function, the grace period
> > > >         has ended.  It therefore updates state accordingly.
> > > > 
> > > > 11.     CPU 1 is still running the same RCU reader, which notices this
> > > >         update and thus complains about the too-short grace period.
> > > 
> > > I think I get the race but I must confess I'm not very familiar with how this
> > > all materialize on CPU 2's rcu_torture_writer() and CPU 1's rcu_torture_reader().
> > > 
> > > Basically this could trigger on CPU 1 with just doing the following, right?
> > > 
> > > rcu_read_lock()
> > > get_state_synchronize_rcu_full(&rgos);
> > > WARN_ON_ONCE(poll_state_synchronize_rcu_full(&rgos))
> > > rcu_read_unlock()
> > 
> > CPU 1 would do rcu_read_lock()/checks/rcu_read_unlock() as the
> > reader, and CPU 2 would do get_state_synchronize_rcu_full(), later
> > poll_state_synchronize_rcu_full(), which would (erroneously) indicate
> > a completed grace period, so it would update the state, triggering CPU
> > 1's checks.
> 
> I see, so if I generalize the problem beyond rcutorture, this looks like this,
> right?
> 
> CPU 0                                    CPU 1                       CPU 2
> -----                                    ------                      -----
> 
> rcu_seq_start(rcu_state.gp_seq)
>                                          // CPU boots
>                                          rcu_read_lock()
>                                          r0 = rcu_dereference(*X)
>                                                                      r1 = *X
>                                                                      *X = NULL
>                                                                      // relies on rnp->gp_seq and not rcu_state.gp_seq
>                                                                      get_state_synchronize_rcu_full(&gros)
> WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);
> rcu_report_qs_rdp()
>                                                                       rcu_report_qs_rdp()
>                                                                       rcu_report_qs_rnp()
>                                                                       rcu_report_qs_rsp()
>                                                                       if (poll_state_synchronize_rcu_full(&rgos))
>                                                                           kfree(r1)
>                                           // play with r0 and crash

That would do it!

> > > I'm wondering, what prevents us from removing rcu_state.gp_seq and rely only on
> > > the root node for the global state ?
> > 
> > One scenario comes to mind immediately.  There may be others.
> > 
> > Suppose we were running with default configuration on a system with
> > "only" eight CPUs.  Then there is only the one rcu_node structure,
> > which is both root and leaf.  Without rcu_state.gp_seq, there
> > would be no way to communicate the beginning of the grace period to
> > get_state_synchronize_rcu_full() without also allowing quiescent states
> > to be reported.  There would thus be no time in which to check for newly
> > onlined/offlined CPUs.
> 
> Heh, that makes sense! Though perhaps that qsmaskinit[next] handling part
> could be done before rcu_seq_start()?

If we do that, aren't we vulnerable to a CPU coming online just after
we handled qsmaskinit{,next} and just before we do rcu_seq_start()?

							Thanx, Paul
Frederic Weisbecker Jan. 24, 2025, 10:25 p.m. UTC | #5
On Fri, Jan 24, 2025 at 11:40:54AM -0800, Paul E. McKenney wrote:
> > > > I'm wondering, what prevents us from removing rcu_state.gp_seq and rely only on
> > > > the root node for the global state ?
> > > 
> > > One scenario comes to mind immediately.  There may be others.
> > > 
> > > Suppose we were running with default configuration on a system with
> > > "only" eight CPUs.  Then there is only the one rcu_node structure,
> > > which is both root and leaf.  Without rcu_state.gp_seq, there
> > > would be no way to communicate the beginning of the grace period to
> > > get_state_synchronize_rcu_full() without also allowing quiescent states
> > > to be reported.  There would thus be no time in which to check for newly
> > > onlined/offlined CPUs.
> > 
> > Heh, that makes sense! Though perhaps that qsmaskinit[next] handling part
> > could be done before rcu_seq_start()?
> 
> If we do that, aren't we vulnerable to a CPU coming online just after
> we handled qsmaskinit{,next} and just before we do rcu_seq_start()?

But then that CPU is guaranteed to see the pre-GP accesses because
rcutree_report_cpu_starting() holds the rnp lock.

Hmm but it won't see the starting gp_seq and it might therefore be one GP
too early while accelerating its own callbacks (mis-attributing them to the
current GP in progress instead of a subsequent one) even though other CPUs
may have reported their own QS already.

Because I only now realize this necessary order while starting a grace period:

1) rcu_seq_start(rcu_state.gp_seq)
2) go through ->qsmaskinit[next] _while holding leaf rnp locks_
3) go through the whole tree breadth first to reflect rcu_state.gp_seq to
   rnp->gp_seq. CPUs may start reporting QS concurrently once the first leaf
   node is reached

And step 2), with its fully ordered locking on the leaves, which releases the
write to rcu_state.gp_seq, is what makes sure that if any CPU from step 3) has
already reported a QS, any call to rcu_seq_snap() under any other leaf rnp lock
is guaranteed to return the sequence number of the subsequent GP and not that
of the one currently in progress.

This is what makes RCU an entertaining companion. You (think you) understand
it in the evening, and then you forget it all over again the next morning.
(And at lunchtime you stare at things...).

Thanks.
Paul E. McKenney Jan. 24, 2025, 10:50 p.m. UTC | #6
On Fri, Jan 24, 2025 at 11:25:01PM +0100, Frederic Weisbecker wrote:
> Le Fri, Jan 24, 2025 at 11:40:54AM -0800, Paul E. McKenney a écrit :
> > > > > I'm wondering, what prevents us from removing rcu_state.gp_seq and rely only on
> > > > > the root node for the global state ?
> > > > 
> > > > One scenario comes to mind immediately.  There may be others.
> > > > 
> > > > Suppose we were running with default configuration on a system with
> > > > "only" eight CPUs.  Then there is only the one rcu_node structure,
> > > > which is both root and leaf.  Without rcu_state.gp_seq, there
> > > > would be no way to communicate the beginning of the grace period to
> > > > get_state_synchronize_rcu_full() without also allowing quiescent states
> > > > to be reported.  There would thus be no time in which to check for newly
> > > > onlined/offlined CPUs.
> > > 
> > > Heh, that makes sense! Though perhaps that qsmaskinit[next] handling part
> > > could be done before rcu_seq_start()?
> > 
> > If we do that, aren't we vulnerable to a CPU coming online just after
> > we handled qsmaskinit{,next} and just before we do rcu_seq_start()?
> 
> But then that CPU is guaranteed to see the pre-GP accesses because
> rcutree_report_cpu_starting() holds the rnp lock.
> 
> Hmm but it won't see the starting gp_seq and it might therefore be one GP
> too early while accelerating its own callbacks (mis-attributing them the
> current GP in progress instead of a subsequent one) even though other CPUs
> may have reported their own QS already.
> 
> Because I only now realize this necessary order while starting a grace period:
> 
> 1) rcu_seq_start(rcu_state.gp_seq)
> 2) go through ->qsmaskinit[next] _while holding leaf rnp locks_
> 3) go through the whole tree breadth first to reflect rcu_state.gp_seq to
>    rnp->gp_seq. CPUs may start reporting QS concurrently once the first leaf
>    node is reached
> 
> And the step 2) with its fully ordered locking on leaves which release the write
> to rcu_state.gp_seq is what makes sure that if any CPU from step 3) has already
> reported a QS, it is guaranteed that any call to rcu_seq_snap() under any other
> leaf rnp locking will return the sequence of the subsequent GP number and not the
> current one in progress.

Very good!!!

> This is what makes RCU an entertaining companion. You (think you) understand
> it in the evening and then you forget it all over again in the next morning.
> (And in lunch time you stare at things...).

That sums up many of my entertaining experiences with RCU over the past
three decades.  ;-)

							Thanx, Paul
Frederic Weisbecker Jan. 24, 2025, 11:03 p.m. UTC | #7
On Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney wrote:
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index 2f9c9272cd486..d2a91f705a4ab 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
>  {
>  	unsigned long cur_s = READ_ONCE(*sp);
>  
> -	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
> +	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));

This might need a comment.

The way I understand it is that rcu_state.gp_seq might be seen as started while
root_rnp->gp_seq is not.  So rcu_seq_snap() on the started rcu_state.gp_seq may
return a value at most 2 full GPs ahead of root_rnp->gp_seq.  Therefore, only a
value more than 2 full GPs below the snapshot can safely be taken as a wrap-around.

Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1)))?

Or am I missing something?

Thanks.
Paul E. McKenney Jan. 25, 2025, 12:01 a.m. UTC | #8
On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote:
> Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit :
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index 2f9c9272cd486..d2a91f705a4ab 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
> >  {
> >  	unsigned long cur_s = READ_ONCE(*sp);
> >  
> > -	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
> > +	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
> 
> This might need a comment.

Good point!  Would you like to propose one?

> The way I understand it is that rcu_state.gp_seq might be seen started while
> root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq
> may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below
> 2 GPs to safely deduce we wrapped around.

Exactly!

> Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ?

Quite possibly.  I freely admit that I allowed a bit of slop because
time was of the essence (holidays and all that) and also it does not
hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing
of the common-case 2^64 cycle.  It would not hurt to be exact, but it
would be necessary to convince ourselves that we were not off by one in
the wrong direction.

I would be happy to see a patch, as long as it was sufficiently
convincing.

> Or am I missing something?

Not that I can see.  So the answer is probably "yes".  ;-)

							Thanx, Paul
Frederic Weisbecker Jan. 25, 2025, 2:56 p.m. UTC | #9
On Fri, Jan 24, 2025 at 04:01:55PM -0800, Paul E. McKenney wrote:
> On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote:
> > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit :
> > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > index 2f9c9272cd486..d2a91f705a4ab 100644
> > > --- a/kernel/rcu/rcu.h
> > > +++ b/kernel/rcu/rcu.h
> > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
> > >  {
> > >  	unsigned long cur_s = READ_ONCE(*sp);
> > >  
> > > -	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
> > > +	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
> > 
> > This might need a comment.
> 
> Good point!  Would you like to propose one?

Ok.

> 
> > The way I understand it is that rcu_state.gp_seq might be seen started while
> > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq
> > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below
> > 2 GPs to safely deduce we wrapped around.
> 
> Exactly!
> 
> > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ?
> 
> Quite possibly.  I freely admit that I allowed a bit of slop because
> time was of the essence (holidays and all that) and also it does not
> hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing
> of the common-case 2^64 cycle.  It would not hurt to be exact, but it
> would be necessary to convince ourselves that we were not off by one in
> the wrong direction.
> 
> I would be happy to see a patch, as long as it was sufficiently
> convincing.

I'm not so much concerned about being exact as about making sure we
will still understand what we did a year from now.  We can leave one
more grace period than we expect out of paranoia, but the most
important thing is that we comment on what we expect and why.  Let me
prepare a patch for that.

In the meantime for your patch:

   Reviewed-by: Frederic Weisbecker <frederic@kernel.org>


> 
> > Or am I missing something?
> 
> Not that I can see.  So the answer is probably "yes".  ;-)

Thanks! :-)

> 
> 							Thanx, Paul
Paul E. McKenney Jan. 25, 2025, 6:39 p.m. UTC | #10
On Sat, Jan 25, 2025 at 03:56:16PM +0100, Frederic Weisbecker wrote:
> Le Fri, Jan 24, 2025 at 04:01:55PM -0800, Paul E. McKenney a écrit :
> > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote:
> > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit :
> > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > > index 2f9c9272cd486..d2a91f705a4ab 100644
> > > > --- a/kernel/rcu/rcu.h
> > > > +++ b/kernel/rcu/rcu.h
> > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
> > > >  {
> > > >  	unsigned long cur_s = READ_ONCE(*sp);
> > > >  
> > > > -	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
> > > > +	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
> > > 
> > > This might need a comment.
> > 
> > Good point!  Would you like to propose one?
> 
> Ok.

Sounds good!

> > > The way I understand it is that rcu_state.gp_seq might be seen started while
> > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq
> > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below
> > > 2 GPs to safely deduce we wrapped around.
> > 
> > Exactly!
> > 
> > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ?
> > 
> > Quite possibly.  I freely admit that I allowed a bit of slop because
> > time was of the essence (holidays and all that) and also it does not
> > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing
> > of the common-case 2^64 cycle.  It would not hurt to be exact, but it
> > would be necessary to convince ourselves that we were not off by one in
> > the wrong direction.
> > 
> > I would be happy to see a patch, as long as it was sufficiently
> > convincing.
> 
> I'm not so much concerned about being exact but rather about making
> sure we still understand what we did within one year. We can leave one
> more grace period than what we expect out of paranoia but, the most
> important is that we comment about what we expect and why. Let me
> prepare a patch for that.

Even better!

> In the meantime for your patch:
> 
>    Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Thank you, and I will apply this on my next rebase.

							Thanx, Paul

> > > Or am I missing something?
> > 
> > Not that I can see.  So the answer is probably "yes".  ;-)
> 
> Thanks! :-)
> 
> > 
> > 							Thanx, Paul
Joel Fernandes Jan. 27, 2025, 1:13 a.m. UTC | #11
Hi, Paul and Frederic,

[...]
> > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote:
> > > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit :
> > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > > > index 2f9c9272cd486..d2a91f705a4ab 100644
> > > > > --- a/kernel/rcu/rcu.h
> > > > > +++ b/kernel/rcu/rcu.h
> > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
> > > > >  {
> > > > >         unsigned long cur_s = READ_ONCE(*sp);
> > > > >
> > > > > -       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
> > > > > +       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
> > > >
[...]
> > > > The way I understand it is that rcu_state.gp_seq might be seen started while
> > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq
> > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below
> > > > 2 GPs to safely deduce we wrapped around.
> > >
> > > Exactly!
> > >
> > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ?
> > >
> > > Quite possibly.  I freely admit that I allowed a bit of slop because
> > > time was of the essence (holidays and all that) and also it does not
> > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing
> > > of the common-case 2^64 cycle.  It would not hurt to be exact, but it
> > > would be necessary to convince ourselves that we were not off by one in
> > > the wrong direction.
> > >
> > > I would be happy to see a patch, as long as it was sufficiently
> > > convincing.
> >
> > I'm not so much concerned about being exact but rather about making
> > sure we still understand what we did within one year. We can leave one
> > more grace period than what we expect out of paranoia but, the most
> > important is that we comment about what we expect and why. Let me
> > prepare a patch for that.
>
> Even better!

Do we really have users who could pass such a hugely wrapped-around
delta to poll(), i.e., > ULONG_MAX/2?  For 32-bit, that would be a
count of 2 billion since the get() (500 million GPs on 32-bit?).  I am
curious whether such a scenario should be a WARN(), because also: if
more than ULONG_MAX/2 counts are possible after the get(), then a full
or multiple ULONG_MAX wraparound is also possible.  This means that
both rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway
for such stale get() values.

I do agree with both your points on the side effect of the patch on
rcu_seq_done_exact(), but I am not fully convinced myself about the
utility of rcu_seq_done_exact() versus rcu_seq_done(); I could be
missing something, though.

Otherwise I analyzed the patch and it makes sense to me:

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel
Joel Fernandes Jan. 27, 2025, 1:22 a.m. UTC | #12
On Sun, Jan 26, 2025 at 8:13 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi, Paul and Frederic,
>
> [...]
> > > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote:
> > > > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit :
> > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > > > > index 2f9c9272cd486..d2a91f705a4ab 100644
> > > > > > --- a/kernel/rcu/rcu.h
> > > > > > +++ b/kernel/rcu/rcu.h
> > > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
> > > > > >  {
> > > > > >         unsigned long cur_s = READ_ONCE(*sp);
> > > > > >
> > > > > > -       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
> > > > > > +       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
> > > > >
> [...]
> > > > > The way I understand it is that rcu_state.gp_seq might be seen started while
> > > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq
> > > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below
> > > > > 2 GPs to safely deduce we wrapped around.
> > > >
> > > > Exactly!
> > > >
> > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ?
> > > >
> > > > Quite possibly.  I freely admit that I allowed a bit of slop because
> > > > time was of the essence (holidays and all that) and also it does not
> > > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing
> > > > of the common-case 2^64 cycle.  It would not hurt to be exact, but it
> > > > would be necessary to convince ourselves that we were not off by one in
> > > > the wrong direction.
> > > >
> > > > I would be happy to see a patch, as long as it was sufficiently
> > > > convincing.
> > >
> > > I'm not so much concerned about being exact but rather about making
> > > sure we still understand what we did within one year. We can leave one
> > > more grace period than what we expect out of paranoia but, the most
> > > important is that we comment about what we expect and why. Let me
> > > prepare a patch for that.
> >
> > Even better!
>
> Do we really have users who could pass such a huge delta wrapped
> around value to poll() i.e > ULONG_MAX/2 ?  For 32-bit, that would be
> 2 billion count since the get() (500 million GPs on 32-bit?). I am
> curious if such a scenario should be a WARN() because also: If more
> than ULONG_MAX/2 values are possible after get(), then a full or
> multiple ULONG_MAX wraparound is also possible. This means then both
> rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway for
> such stale get() values.
>
> I do agree with both your points on the side effect of the patch to
> rcu_seq_done_exact(), but I am not fully convinced myself about
> utility of rcu_seq_done_exact() versus the rcu_seq_done() but I could
> be missing something.

I want to modify my comment on reliability: rcu_seq_done_exact() is
certainly more "reliable" than rcu_seq_done() for a wraparound delta
> ULONG_MAX/2.  Still, with such a huge wraparound it is not
fail-proof if the counter lands within the "3 grace period" window.
So if it is not fully reliable and the user should not depend on it,
I wonder whether it would be better not to do it at all and instead
warn the user that they are playing with fire.  But a counter-argument
might be that landing within the 3-GP window is quite low
probability...
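
To put a rough number on "quite low probability" (assuming
RCU_SEQ_STATE_MASK == 0x3 and a uniformly random landing point after a
full wrap, both of which are simplifications):

#include <stdio.h>

#define RCU_SEQ_STATE_MASK 0x3UL	/* assumed value */

int main(void)
{
	unsigned long window = 3 * RCU_SEQ_STATE_MASK + 1;	/* 10 counts */
	double gps = window / (double)(RCU_SEQ_STATE_MASK + 1);	/* ~2.5 GPs */
	double p32 = window / 4294967296.0;	/* chance a wrapped 32-bit
						 * counter lands inside */

	printf("window = %lu counts (~%.1f GPs), p ~ %.1e\n", window, gps, p32);
	return 0;
}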

thanks,

 - Joel



>
> Otherwise I analyzed the patch and it makes sense to me:
>
> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>
> thanks,
>
>  - Joel
Paul E. McKenney Jan. 27, 2025, 2:03 a.m. UTC | #13
On Sun, Jan 26, 2025 at 08:22:23PM -0500, Joel Fernandes wrote:
> On Sun, Jan 26, 2025 at 8:13 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > Hi, Paul and Frederic,
> >
> > [...]
> > > > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote:
> > > > > > On Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney wrote:
> > > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > > > > > index 2f9c9272cd486..d2a91f705a4ab 100644
> > > > > > > --- a/kernel/rcu/rcu.h
> > > > > > > +++ b/kernel/rcu/rcu.h
> > > > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
> > > > > > >  {
> > > > > > >         unsigned long cur_s = READ_ONCE(*sp);
> > > > > > >
> > > > > > > -       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
> > > > > > > +       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
> > > > > >
> > [...]
> > > > > > The way I understand it is that rcu_state.gp_seq might be seen started while
> > > > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq
> > > > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below
> > > > > > 2 GPs to safely deduce we wrapped around.
> > > > >
> > > > > Exactly!
> > > > >
> > > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ?
> > > > >
> > > > > Quite possibly.  I freely admit that I allowed a bit of slop because
> > > > > time was of the essence (holidays and all that) and also it does not
> > > > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing
> > > > > of the common-case 2^64 cycle.  It would not hurt to be exact, but it
> > > > > would be necessary to convince ourselves that we were not off by one in
> > > > > the wrong direction.
> > > > >
> > > > > I would be happy to see a patch, as long as it was sufficiently
> > > > > convincing.
> > > >
> > > > I'm not so much concerned about being exact but rather about making
> > > > sure we still understand what we did within one year. We can leave one
> > > > more grace period than what we expect out of paranoia but, the most
> > > > important is that we comment about what we expect and why. Let me
> > > > prepare a patch for that.
> > >
> > > Even better!
> >
> > Do we really have users who could pass such a huge delta wrapped
> > around value to poll() i.e > ULONG_MAX/2 ?  For 32-bit, that would be
> > 2 billion count since the get() (500 million GPs on 32-bit?). I am
> > curious if such a scenario should be a WARN() because also: If more
> > than ULONG_MAX/2 values are possible after get(), then a full or
> > multiple ULONG_MAX wraparound is also possible. This means then both
> > rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway for
> > such stale get() values.
> >
> > I do agree with both your points on the side effect of the patch to
> > rcu_seq_done_exact(), but I am not fully convinced myself about
> > utility of rcu_seq_done_exact() versus the rcu_seq_done() but I could
> > be missing something.
> 
> I want to modify my comment on reliability. rcu_seq_done_exact() is
> certainly more "reliable" than rcu_seq_done() for wraparound delta  >
> ULONG_MAX/2. Still with such a huge wrap around it is not fail proof
> if it lands within the "3 Grace period" window, so if it is not super
> reliable and if the user should not rely on it, then I wonder if it is
> better to not do it and instead warn the user they are playing with
> fire. But a counter-argument might be, landing within the 3 GP window
> is quite low probability...

It is also a harmless false negative.  And another poll within a few
hundred milliseconds will set things straight.  In contrast, if we
used rcu_seq_done(), it would be a good long time before we got out of
false-negative territory.

On a 32-bit system, if there was an expedited grace period every 10
microseconds (just barely possible on small systems), then a 32-bit
counter would wrap in about 12 hours.  So there would be a six-hour
false-negative zone.

So an expedited grace period every 10 microseconds combined with
a six-hour polling delay is unlikely, but not out of the realm of
possibility.
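
As a back-of-the-envelope check of those figures, treating the counter
as advancing by one step per 10-microsecond expedited grace period, as
in the rough estimate above:

#include <stdio.h>

int main(void)
{
	double gp_us = 10.0;				/* one expedited GP per 10 us */
	double wrap_s = 4294967296.0 * gp_us / 1e6;	/* time to step through 2^32 counts */
	double zone_h = wrap_s / 2.0 / 3600.0;		/* rcu_seq_done() false-negative zone */

	printf("wrap ~ %.1f h, false-negative zone ~ %.1f h\n",
	       wrap_s / 3600.0, zone_h);
	return 0;
}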

Please feel free to nominate a comment.

> > Otherwise I analyzed the patch and it makes sense to me:
> >
> > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Thank you!  I will apply this on my next review.

						Thanx, Paul
Joel Fernandes Jan. 27, 2025, 2:55 a.m. UTC | #14
On Sun, Jan 26, 2025 at 9:03 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Sun, Jan 26, 2025 at 08:22:23PM -0500, Joel Fernandes wrote:
> > On Sun, Jan 26, 2025 at 8:13 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > Hi, Paul and Frederic,
> > >
> > > [...]
> > > > > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote:
> > > > > > > On Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney wrote:
> > > > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > > > > > > index 2f9c9272cd486..d2a91f705a4ab 100644
> > > > > > > > --- a/kernel/rcu/rcu.h
> > > > > > > > +++ b/kernel/rcu/rcu.h
> > > > > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
> > > > > > > >  {
> > > > > > > >         unsigned long cur_s = READ_ONCE(*sp);
> > > > > > > >
> > > > > > > > -       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
> > > > > > > > +       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
> > > > > > >
> > > [...]
> > > > > > > The way I understand it is that rcu_state.gp_seq might be seen started while
> > > > > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq
> > > > > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below
> > > > > > > 2 GPs to safely deduce we wrapped around.
> > > > > >
> > > > > > Exactly!
> > > > > >
> > > > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ?
> > > > > >
> > > > > > Quite possibly.  I freely admit that I allowed a bit of slop because
> > > > > > time was of the essence (holidays and all that) and also it does not
> > > > > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing
> > > > > > of the common-case 2^64 cycle.  It would not hurt to be exact, but it
> > > > > > would be necessary to convince ourselves that we were not off by one in
> > > > > > the wrong direction.
> > > > > >
> > > > > > I would be happy to see a patch, as long as it was sufficiently
> > > > > > convincing.
> > > > >
> > > > > I'm not so much concerned about being exact but rather about making
> > > > > sure we still understand what we did within one year. We can leave one
> > > > > more grace period than what we expect out of paranoia but, the most
> > > > > important is that we comment about what we expect and why. Let me
> > > > > prepare a patch for that.
> > > >
> > > > Even better!
> > >
> > > Do we really have users who could pass such a huge delta wrapped
> > > around value to poll() i.e > ULONG_MAX/2 ?  For 32-bit, that would be
> > > 2 billion count since the get() (500 million GPs on 32-bit?). I am
> > > curious if such a scenario should be a WARN() because also: If more
> > > than ULONG_MAX/2 values are possible after get(), then a full or
> > > multiple ULONG_MAX wraparound is also possible. This means then both
> > > rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway for
> > > such stale get() values.
> > >
> > > I do agree with both your points on the side effect of the patch to
> > > rcu_seq_done_exact(), but I am not fully convinced myself about
> > > utility of rcu_seq_done_exact() versus the rcu_seq_done() but I could
> > > be missing something.
> >
> > I want to modify my comment on reliability. rcu_seq_done_exact() is
> > certainly more "reliable" than rcu_seq_done() for wraparound delta  >
> > ULONG_MAX/2. Still with such a huge wrap around it is not fail proof
> > if it lands within the "3 Grace period" window, so if it is not super
> > reliable and if the user should not rely on it, then I wonder if it is
> > better to not do it and instead warn the user they are playing with
> > fire. But a counter-argument might be, landing within the 3 GP window
> > is quite low probability...
>
> It is also a harmless false negative.  And another poll within a few
> hundred milliseconds will set things straight.  In contrast, if we
> used rcu_seq_done(), it would be a good long time before we got out of
> false-negative territory.

True!

> On a 32-bit system, if there was an expedited grace period every 10
> microseconds (just barely possible on small systems), then a 32-bit
> counter would wrap in about 12 hours.  So there would be a six-hour
> false-negative zone.
>
> So an expedited grace period every 10 microseconds combined with
> a six-hour polling delay is unlikely, but not out of the realm of
> possibility.

That assumes every 10 microseconds of the entire 6 hours is used for
an expedited GP, but I agree with your point.

> Please feel free to nominate a comment.

Ok thanks, I'll see what Frederic comes up with and can take it from there.

Thanks!

- Joel
Joel Fernandes Jan. 27, 2025, 2:58 a.m. UTC | #15
On Sun, Jan 26, 2025 at 9:55 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Sun, Jan 26, 2025 at 9:03 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Sun, Jan 26, 2025 at 08:22:23PM -0500, Joel Fernandes wrote:
> > > On Sun, Jan 26, 2025 at 8:13 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > > Hi, Paul and Frederic,
> > > >
> > > > [...]
> > > > > > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote:
> > > > > > > > On Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney wrote:
> > > > > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > > > > > > > index 2f9c9272cd486..d2a91f705a4ab 100644
> > > > > > > > > --- a/kernel/rcu/rcu.h
> > > > > > > > > +++ b/kernel/rcu/rcu.h
> > > > > > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
> > > > > > > > >  {
> > > > > > > > >         unsigned long cur_s = READ_ONCE(*sp);
> > > > > > > > >
> > > > > > > > > -       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
> > > > > > > > > +       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
> > > > > > > >
> > > > [...]
> > > > > > > > The way I understand it is that rcu_state.gp_seq might be seen started while
> > > > > > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq
> > > > > > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below
> > > > > > > > 2 GPs to safely deduce we wrapped around.
> > > > > > >
> > > > > > > Exactly!
> > > > > > >
> > > > > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ?
> > > > > > >
> > > > > > > Quite possibly.  I freely admit that I allowed a bit of slop because
> > > > > > > time was of the essence (holidays and all that) and also it does not
> > > > > > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing
> > > > > > > of the common-case 2^64 cycle.  It would not hurt to be exact, but it
> > > > > > > would be necessary to convince ourselves that we were not off by one in
> > > > > > > the wrong direction.
> > > > > > >
> > > > > > > I would be happy to see a patch, as long as it was sufficiently
> > > > > > > convincing.
> > > > > >
> > > > > > I'm not so much concerned about being exact but rather about making
> > > > > > sure we still understand what we did within one year. We can leave one
> > > > > > more grace period than what we expect out of paranoia but, the most
> > > > > > important is that we comment about what we expect and why. Let me
> > > > > > prepare a patch for that.
> > > > >
> > > > > Even better!
> > > >
> > > > Do we really have users who could pass such a huge delta wrapped
> > > > around value to poll() i.e > ULONG_MAX/2 ?  For 32-bit, that would be
> > > > 2 billion count since the get() (500 million GPs on 32-bit?). I am
> > > > curious if such a scenario should be a WARN() because also: If more
> > > > than ULONG_MAX/2 values are possible after get(), then a full or
> > > > multiple ULONG_MAX wraparound is also possible. This means then both
> > > > rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway for
> > > > such stale get() values.
> > > >
> > > > I do agree with both your points on the side effect of the patch to
> > > > rcu_seq_done_exact(), but I am not fully convinced myself about
> > > > utility of rcu_seq_done_exact() versus the rcu_seq_done() but I could
> > > > be missing something.
> > >
> > > I want to modify my comment on reliability. rcu_seq_done_exact() is
> > > certainly more "reliable" than rcu_seq_done() for wraparound delta  >
> > > ULONG_MAX/2. Still with such a huge wrap around it is not fail proof
> > > if it lands within the "3 Grace period" window, so if it is not super
> > > reliable and if the user should not rely on it, then I wonder if it is
> > > better to not do it and instead warn the user they are playing with
> > > fire. But a counter-argument might be, landing within the 3 GP window
> > > is quite low probability...
> >
> > It is also a harmless false negative.  And another poll within a few
> > hundred milliseconds will set things straight.  In contrast, if we
> > used rcu_seq_done(), it would be a good long time before we got out of
> > false-negative territory.
>
> True!
>
> > On a 32-bit system, if there was an expedited grace period every 10
> > microseconds (just barely possible on small systems), then a 32-bit
> > counter would wrap in about 12 hours.  So there would be a six-hour
> > false-negative zone.
> >
> > So an expedited grace period every 10 microseconds combined with
> > a six-hour polling delay is unlikely, but not out of the realm of
> > possibility.
>
> Assuming that every 10 microsecond of the entire 6 hours is used for
> an expedited GP, but I agree with your point.
>
> > Please feel free to nominate a comment.
>
> Ok thanks, I'll see what Frederic comes up with and can take it from there.
>

I forgot to ask: is the reason for keeping rcu_seq_done_exact()
separate to take on the extra complexity only where it is needed, or
does it make sense to rename rcu_seq_done_exact() as
rcu_seq_done_done() and get rid of the old rcu_seq_done()?

thanks,

 - Joel
Paul E. McKenney Jan. 27, 2025, 4:49 p.m. UTC | #16
On Sun, Jan 26, 2025 at 09:58:11PM -0500, Joel Fernandes wrote:
> On Sun, Jan 26, 2025 at 9:55 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Sun, Jan 26, 2025 at 9:03 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Sun, Jan 26, 2025 at 08:22:23PM -0500, Joel Fernandes wrote:
> > > > On Sun, Jan 26, 2025 at 8:13 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > >
> > > > > Hi, Paul and Frederic,
> > > > >
> > > > > [...]
> > > > > > > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote:
> > > > > > > > > On Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney wrote:
> > > > > > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > > > > > > > > index 2f9c9272cd486..d2a91f705a4ab 100644
> > > > > > > > > > --- a/kernel/rcu/rcu.h
> > > > > > > > > > +++ b/kernel/rcu/rcu.h
> > > > > > > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
> > > > > > > > > >  {
> > > > > > > > > >         unsigned long cur_s = READ_ONCE(*sp);
> > > > > > > > > >
> > > > > > > > > > -       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
> > > > > > > > > > +       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
> > > > > > > > >
> > > > > [...]
> > > > > > > > > The way I understand it is that rcu_state.gp_seq might be seen started while
> > > > > > > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq
> > > > > > > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below
> > > > > > > > > 2 GPs to safely deduce we wrapped around.
> > > > > > > >
> > > > > > > > Exactly!
> > > > > > > >
> > > > > > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ?
> > > > > > > >
> > > > > > > > Quite possibly.  I freely admit that I allowed a bit of slop because
> > > > > > > > time was of the essence (holidays and all that) and also it does not
> > > > > > > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing
> > > > > > > > of the common-case 2^64 cycle.  It would not hurt to be exact, but it
> > > > > > > > would be necessary to convince ourselves that we were not off by one in
> > > > > > > > the wrong direction.
> > > > > > > >
> > > > > > > > I would be happy to see a patch, as long as it was sufficiently
> > > > > > > > convincing.
> > > > > > >
> > > > > > > I'm not so much concerned about being exact but rather about making
> > > > > > > sure we still understand what we did within one year. We can leave one
> > > > > > > more grace period than what we expect out of paranoia but, the most
> > > > > > > important is that we comment about what we expect and why. Let me
> > > > > > > prepare a patch for that.
> > > > > >
> > > > > > Even better!
> > > > >
> > > > > Do we really have users who could pass such a huge delta wrapped
> > > > > around value to poll() i.e > ULONG_MAX/2 ?  For 32-bit, that would be
> > > > > 2 billion count since the get() (500 million GPs on 32-bit?). I am
> > > > > curious if such a scenario should be a WARN() because also: If more
> > > > > than ULONG_MAX/2 values are possible after get(), then a full or
> > > > > multiple ULONG_MAX wraparound is also possible. This means then both
> > > > > rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway for
> > > > > such stale get() values.
> > > > >
> > > > > I do agree with both your points on the side effect of the patch to
> > > > > rcu_seq_done_exact(), but I am not fully convinced myself about
> > > > > utility of rcu_seq_done_exact() versus the rcu_seq_done() but I could
> > > > > be missing something.
> > > >
> > > > I want to modify my comment on reliability. rcu_seq_done_exact() is
> > > > certainly more "reliable" than rcu_seq_done() for wraparound delta  >
> > > > ULONG_MAX/2. Still with such a huge wrap around it is not fail proof
> > > > if it lands within the "3 Grace period" window, so if it is not super
> > > > reliable and if the user should not rely on it, then I wonder if it is
> > > > better to not do it and instead warn the user they are playing with
> > > > fire. But a counter-argument might be, landing within the 3 GP window
> > > > is quite low probability...
> > >
> > > It is also a harmless false negative.  And another poll within a few
> > > hundred milliseconds will set things straight.  In contrast, if we
> > > used rcu_seq_done(), it would be a good long time before we got out of
> > > false-negative territory.
> >
> > True!
> >
> > > On a 32-bit system, if there was an expedited grace period every 10
> > > microseconds (just barely possible on small systems), then a 32-bit
> > > counter would wrap in about 12 hours.  So there would be a six-hour
> > > false-negative zone.
> > >
> > > So an expedited grace period every 10 microseconds combined with
> > > a six-hour polling delay is unlikely, but not out of the realm of
> > > possibility.
> >
> > Assuming that every 10 microsecond of the entire 6 hours is used for
> > an expedited GP, but I agree with your point.
> >
> > > Please feel free to nominate a comment.
> >
> > Ok thanks, I'll see what Frederic comes up with and can take it from there.
> 
> I forgot to ask is the reason for keeping rcu_seq_done_exact() to have
> more complexity only when needed, or does it make sense to rename
> rcu_seq_done_exact() as rcu_seq_done_done() and get rid of the old
> rcu_seq_done()?

We might well be able to use the rcu_seq_done_exact() algorithm for
all the uses, but that will require very careful review and testing.
Please keep in mind that there are quite a few uses, that the benefit
is small, and I was lazy.  ;-)

The difference with the polling APIs is that a user might well
legitimately hang onto a cookie for six hours, for example, by having
used it to tag a data structure in a block cache or similar.
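
For instance, a purely illustrative sketch of that pattern (the
blkcache_entry names are made up, and the entry is assumed to already
be unreachable by new readers by the time it is evicted):

#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Hypothetical block-cache entry tagged with a full polled-GP cookie. */
struct blkcache_entry {
	struct rcu_gp_oldstate rgos;
	/* ... payload ... */
};

static void blkcache_entry_insert(struct blkcache_entry *e)
{
	get_state_synchronize_rcu_full(&e->rgos);	/* tag at insertion time */
	/* ... publish e into the RCU-protected cache ... */
}

/* Possibly hours later, after e has been removed from the cache. */
static void blkcache_entry_evict(struct blkcache_entry *e)
{
	if (poll_state_synchronize_rcu_full(&e->rgos)) {
		kfree(e);		/* a full grace period already elapsed */
	} else {
		synchronize_rcu();	/* rare slow path; call_rcu() also works */
		kfree(e);
	}
}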

							Thanx, Paul
Joel Fernandes Jan. 27, 2025, 6:45 p.m. UTC | #17
On Mon, Jan 27, 2025 at 11:49 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Sun, Jan 26, 2025 at 09:58:11PM -0500, Joel Fernandes wrote:
> > On Sun, Jan 26, 2025 at 9:55 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Sun, Jan 26, 2025 at 9:03 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Sun, Jan 26, 2025 at 08:22:23PM -0500, Joel Fernandes wrote:
> > > > > On Sun, Jan 26, 2025 at 8:13 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > >
> > > > > > Hi, Paul and Frederic,
> > > > > >
> > > > > > [...]
> > > > > > > > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote:
> > > > > > > > > > On Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney wrote:
> > > > > > > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > > > > > > > > > index 2f9c9272cd486..d2a91f705a4ab 100644
> > > > > > > > > > > --- a/kernel/rcu/rcu.h
> > > > > > > > > > > +++ b/kernel/rcu/rcu.h
> > > > > > > > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
> > > > > > > > > > >  {
> > > > > > > > > > >         unsigned long cur_s = READ_ONCE(*sp);
> > > > > > > > > > >
> > > > > > > > > > > -       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
> > > > > > > > > > > +       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
> > > > > > > > > >
> > > > > > [...]
> > > > > > > > > > The way I understand it is that rcu_state.gp_seq might be seen started while
> > > > > > > > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq
> > > > > > > > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below
> > > > > > > > > > 2 GPs to safely deduce we wrapped around.
> > > > > > > > >
> > > > > > > > > Exactly!
> > > > > > > > >
> > > > > > > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ?
> > > > > > > > >
> > > > > > > > > Quite possibly.  I freely admit that I allowed a bit of slop because
> > > > > > > > > time was of the essence (holidays and all that) and also it does not
> > > > > > > > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing
> > > > > > > > > of the common-case 2^64 cycle.  It would not hurt to be exact, but it
> > > > > > > > > would be necessary to convince ourselves that we were not off by one in
> > > > > > > > > the wrong direction.
> > > > > > > > >
> > > > > > > > > I would be happy to see a patch, as long as it was sufficiently
> > > > > > > > > convincing.
> > > > > > > >
> > > > > > > > I'm not so much concerned about being exact but rather about making
> > > > > > > > sure we still understand what we did within one year. We can leave one
> > > > > > > > more grace period than what we expect out of paranoia but, the most
> > > > > > > > important is that we comment about what we expect and why. Let me
> > > > > > > > prepare a patch for that.
> > > > > > >
> > > > > > > Even better!
> > > > > >
> > > > > > Do we really have users who could pass such a huge delta wrapped
> > > > > > around value to poll() i.e > ULONG_MAX/2 ?  For 32-bit, that would be
> > > > > > 2 billion count since the get() (500 million GPs on 32-bit?). I am
> > > > > > curious if such a scenario should be a WARN() because also: If more
> > > > > > than ULONG_MAX/2 values are possible after get(), then a full or
> > > > > > multiple ULONG_MAX wraparound is also possible. This means then both
> > > > > > rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway for
> > > > > > such stale get() values.
> > > > > >
> > > > > > I do agree with both your points on the side effect of the patch to
> > > > > > rcu_seq_done_exact(), but I am not fully convinced myself about
> > > > > > utility of rcu_seq_done_exact() versus the rcu_seq_done() but I could
> > > > > > be missing something.
> > > > >
> > > > > I want to modify my comment on reliability. rcu_seq_done_exact() is
> > > > > certainly more "reliable" than rcu_seq_done() for wraparound delta  >
> > > > > ULONG_MAX/2. Still with such a huge wrap around it is not fail proof
> > > > > if it lands within the "3 Grace period" window, so if it is not super
> > > > > reliable and if the user should not rely on it, then I wonder if it is
> > > > > better to not do it and instead warn the user they are playing with
> > > > > fire. But a counter-argument might be, landing within the 3 GP window
> > > > > is quite low probability...
> > > >
> > > > It is also a harmless false negative.  And another poll within a few
> > > > hundred milliseconds will set things straight.  In contrast, if we
> > > > used rcu_seq_done(), it would be a good long time before we got out of
> > > > false-negative territory.
> > >
> > > True!
> > >
> > > > On a 32-bit system, if there was an expedited grace period every 10
> > > > microseconds (just barely possible on small systems), then a 32-bit
> > > > counter would wrap in about 12 hours.  So there would be a six-hour
> > > > false-negative zone.
> > > >
> > > > So an expedited grace period every 10 microseconds combined with
> > > > a six-hour polling delay is unlikely, but not out of the realm of
> > > > possibility.
> > >
> > > Assuming that every 10 microsecond of the entire 6 hours is used for
> > > an expedited GP, but I agree with your point.
> > >
> > > > Please feel free to nominate a comment.
> > >
> > > Ok thanks, I'll see what Frederic comes up with and can take it from there.
> >
> > I forgot to ask is the reason for keeping rcu_seq_done_exact() to have
> > more complexity only when needed, or does it make sense to rename
> > rcu_seq_done_exact() as rcu_seq_done_done() and get rid of the old
> > rcu_seq_done()?
>
> We might well be able to use the rcu_seq_done_exact() algorithm for
> all the uses, but that will require very careful review and testing.
> Please keep in mind that there are quite a few uses, that the benefit
> is small, and I was lazy.  ;-)
>
> The difference with the polling APIs is that a user might well
> legitimately hang onto a cookie for six hours, for example, by having
> used it to tag a data structure in a block cache or similar.

Thanks for clarification. I will work on a patch along these lines and
see if it is worth doing.

 - Joel
diff mbox series

Patch

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 2f9c9272cd486..d2a91f705a4ab 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -162,7 +162,7 @@  static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
 {
 	unsigned long cur_s = READ_ONCE(*sp);
 
-	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
+	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
 }
 
 /*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 980f1fa719665..71620a8a2eb3d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4204,14 +4204,17 @@  EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
  */
 void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 {
-	struct rcu_node *rnp = rcu_get_root();
-
 	/*
 	 * Any prior manipulation of RCU-protected data must happen
 	 * before the loads from ->gp_seq and ->expedited_sequence.
 	 */
 	smp_mb();  /* ^^^ */
-	rgosp->rgos_norm = rcu_seq_snap(&rnp->gp_seq);
+
+	// Yes, rcu_state.gp_seq, not rnp_root->gp_seq, the latter's use
+	// in poll_state_synchronize_rcu_full() notwithstanding.  Use of
+	// the latter here would result in too-short grace periods due to
+	// interactions with newly onlined CPUs.
+	rgosp->rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
 	rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence);
 }
 EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_full);