Message ID | da5065c4-79ba-431f-9d7e-1ca314394443@paulmck-laptop (mailing list archive) |
---|---|
State | Accepted |
Commit | c5d3a3eaefafd5554a775763e1bbaffcd49e851f |
Series | [RFC,v2] Fix get_state_synchronize_rcu_full() GP-start detection |
Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() > functions use the root rcu_node structure's ->gp_seq field to detect > the beginnings and ends of grace periods, respectively. This choice is > necessary for the poll_state_synchronize_rcu_full() function because > (give or take counter wrap), the following sequence is guaranteed not > to trigger: > > get_state_synchronize_rcu_full(&rgos); > synchronize_rcu(); > WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos)); > > The RCU callbacks that awaken synchronize_rcu() instances are > guaranteed not to be invoked before the root rcu_node structure's > ->gp_seq field is updated to indicate the end of the grace period. > However, these callbacks might start being invoked immediately > thereafter, in particular, before rcu_state.gp_seq has been updated. > Therefore, poll_state_synchronize_rcu_full() must refer to the > root rcu_node structure's ->gp_seq field. Because this field is > updated under this structure's ->lock, any code following a call to > poll_state_synchronize_rcu_full() will be fully ordered after the > full grace-period computation, as is required by RCU's memory-ordering > semantics. > > By symmetry, the get_state_synchronize_rcu_full() function should also > use this same root rcu_node structure's ->gp_seq field. But it turns out > that symmetry is profoundly (though extremely infrequently) destructive > in this case. To see this, consider the following sequence of events: > > 1. CPU 0 starts a new grace period, and updates rcu_state.gp_seq > accordingly. > > 2. As its first step of grace-period initialization, CPU 0 examines > the current CPU hotplug state and decides that it need not wait > for CPU 1, which is currently offline. > > 3. CPU 1 comes online, and updates its state. But this does not > affect the current grace period, but rather the one after that. > After all, CPU 1 was offline when the current grace period > started, so all pre-existing RCU readers on CPU 1 must have > completed or been preempted before it last went offline. > The current grace period therefore has nothing it needs to wait > for on CPU 1. > > 4. CPU 1 switches to an rcutorture kthread which is running > rcutorture's rcu_torture_reader() function, which starts a new > RCU reader. > > 5. CPU 2 is running rcutorture's rcu_torture_writer() function > and collects a new polled grace-period "cookie" using > get_state_synchronize_rcu_full(). Because the newly started > grace period has not completed initialization, the root rcu_node > structure's ->gp_seq field has not yet been updated to indicate > that this new grace period has already started. > > This cookie is therefore set up for the end of the current grace > period (rather than the end of the following grace period). > > 6. CPU 0 finishes grace-period initialization. > > 7. If CPU 1’s rcutorture reader is preempted, it will be added to > the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not > set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer > will not be updated. Thus, this grace period will not wait on > it. Which is only fair, given that the CPU did not come online > until after the grace period officially started. > > 8. CPUs 0 and 2 then detect the new grace period and then report > a quiescent state to the RCU core. > > 9. Because CPU 1 was offline at the start of the current grace > period, CPUs 0 and 2 are the only CPUs that this grace period > needs to wait on. 
So the grace period ends and post-grace-period > cleanup starts. In particular, the root rcu_node structure's > ->gp_seq field is updated to indicate that this grace period > has now ended. > > 10. CPU 2 continues running rcu_torture_writer() and sees that, > from the viewpoint of the root rcu_node structure consulted by > the poll_state_synchronize_rcu_full() function, the grace period > has ended. It therefore updates state accordingly. > > 11. CPU 1 is still running the same RCU reader, which notices this > update and thus complains about the too-short grace period. I think I get the race but I must confess I'm not very familiar with how this all materialize on CPU 2's rcu_torture_writer() and CPU 1's rcu_torture_reader(). Basically this could trigger on CPU 1 with just doing the following, right? rcu_read_lock() get_state_synchronize_rcu_full(&rgos); WARN_ON_ONCE(poll_state_synchronize_rcu_full(&rgos)) rcu_read_unlock() > > The fix is for the get_state_synchronize_rcu_full() function to use > rcu_state.gp_seq instead of the the root rcu_node structure's ->gp_seq > field. With this change in place, if step 5's cookie indicates that the > grace period has not yet started, then any prior code executed by CPU 2 > must have happened before CPU 1 came online. This will in turn prevent > CPU 1's code in steps 3 and 11 from spanning CPU 2's grace-period wait, > thus preventing CPU 1 from being subjected to a too-short grace period. > > This commit therefore makes this change. Note that there is no change to > the poll_state_synchronize_rcu_full() function, which as noted above, > must continue to use the root rcu_node structure's ->gp_seq field. > This is of course an asymmetry between these two functions, but is an > asymmetry that is absolutely required for correct operation. It is a > common human tendency to greatly value symmetry, and sometimes symmetry > is a wonderful thing. Other times, symmetry results in poor performance. > But in this case, symmetry is just plain wrong. > > Nevertheless, the asymmetry does require an additional adjustment. > It is possible for get_state_synchronize_rcu_full() to see a given > grace period as having started, but for an immediately following > poll_state_synchronize_rcu_full() to see it as having not yet started. > Given the current rcu_seq_done_exact() implementation, this will > result in a false-positive indication that the grace period is done > from poll_state_synchronize_rcu_full(). This is dealt with by making > rcu_seq_done_exact() reach back three grace periods rather than just > two of them. > > Although this fixes 91a967fd6934 ("rcu: Add full-sized polling for > get_completed*() and poll_state*()"), it is not clear that it is worth > backporting this commit. First, it took me many weeks to convince > rcutorture to reproduce this more frequently than once per year. Second, > this cannot be reproduced at all without frequent CPU-hotplug operations, > as in waiting all of 50 milliseconds from the end of the previous > operation until starting the next one. Third, the TREE03.boot settings > cause multi-millisecond delays during RCU grace-period initialization, > which greatly increase the probability of the above sequence of events. > (Don't do this in production workloads!) Fourth, extremely heavy use of > get_state_synchronize_rcu_full() and/or poll_state_synchronize_rcu_full() > is required to reproduce this, and as of v6.12, only kfree_rcu() uses it, > and even then not particularly heavily. 
I'm wondering, what prevents us from removing rcu_state.gp_seq and relying only on the root node for the global state? Thanks.
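[For readers less familiar with the polled grace-period API, here is a minimal sketch of the two sequences under discussion. The wrapper function names are purely illustrative; the calls themselves are the ones from the quoted commit log and from the reader-side snippet above.]

	#include <linux/rcupdate.h>

	/* Updater-side sequence from the commit log: this WARN must never fire. */
	static void updater_sequence_sketch(void)
	{
		struct rcu_gp_oldstate rgos;

		get_state_synchronize_rcu_full(&rgos);
		synchronize_rcu();
		WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));
	}

	/*
	 * Reader-side pattern asked about above: with the fix applied, a cookie
	 * collected inside a read-side critical section should never poll as
	 * completed while that same critical section is still running.
	 */
	static void reader_sequence_sketch(void)
	{
		struct rcu_gp_oldstate rgos;

		rcu_read_lock();
		get_state_synchronize_rcu_full(&rgos);
		WARN_ON_ONCE(poll_state_synchronize_rcu_full(&rgos));
		rcu_read_unlock();
	}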
On Fri, Jan 24, 2025 at 03:49:24PM +0100, Frederic Weisbecker wrote: > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > > The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() > > functions use the root rcu_node structure's ->gp_seq field to detect > > the beginnings and ends of grace periods, respectively. This choice is > > necessary for the poll_state_synchronize_rcu_full() function because > > (give or take counter wrap), the following sequence is guaranteed not > > to trigger: > > > > get_state_synchronize_rcu_full(&rgos); > > synchronize_rcu(); > > WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos)); > > > > The RCU callbacks that awaken synchronize_rcu() instances are > > guaranteed not to be invoked before the root rcu_node structure's > > ->gp_seq field is updated to indicate the end of the grace period. > > However, these callbacks might start being invoked immediately > > thereafter, in particular, before rcu_state.gp_seq has been updated. > > Therefore, poll_state_synchronize_rcu_full() must refer to the > > root rcu_node structure's ->gp_seq field. Because this field is > > updated under this structure's ->lock, any code following a call to > > poll_state_synchronize_rcu_full() will be fully ordered after the > > full grace-period computation, as is required by RCU's memory-ordering > > semantics. > > > > By symmetry, the get_state_synchronize_rcu_full() function should also > > use this same root rcu_node structure's ->gp_seq field. But it turns out > > that symmetry is profoundly (though extremely infrequently) destructive > > in this case. To see this, consider the following sequence of events: > > > > 1. CPU 0 starts a new grace period, and updates rcu_state.gp_seq > > accordingly. > > > > 2. As its first step of grace-period initialization, CPU 0 examines > > the current CPU hotplug state and decides that it need not wait > > for CPU 1, which is currently offline. > > > > 3. CPU 1 comes online, and updates its state. But this does not > > affect the current grace period, but rather the one after that. > > After all, CPU 1 was offline when the current grace period > > started, so all pre-existing RCU readers on CPU 1 must have > > completed or been preempted before it last went offline. > > The current grace period therefore has nothing it needs to wait > > for on CPU 1. > > > > 4. CPU 1 switches to an rcutorture kthread which is running > > rcutorture's rcu_torture_reader() function, which starts a new > > RCU reader. > > > > 5. CPU 2 is running rcutorture's rcu_torture_writer() function > > and collects a new polled grace-period "cookie" using > > get_state_synchronize_rcu_full(). Because the newly started > > grace period has not completed initialization, the root rcu_node > > structure's ->gp_seq field has not yet been updated to indicate > > that this new grace period has already started. > > > > This cookie is therefore set up for the end of the current grace > > period (rather than the end of the following grace period). > > > > 6. CPU 0 finishes grace-period initialization. > > > > 7. If CPU 1’s rcutorture reader is preempted, it will be added to > > the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not > > set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer > > will not be updated. Thus, this grace period will not wait on > > it. Which is only fair, given that the CPU did not come online > > until after the grace period officially started. > > > > 8. 
CPUs 0 and 2 then detect the new grace period and then report > > a quiescent state to the RCU core. > > > > 9. Because CPU 1 was offline at the start of the current grace > > period, CPUs 0 and 2 are the only CPUs that this grace period > > needs to wait on. So the grace period ends and post-grace-period > > cleanup starts. In particular, the root rcu_node structure's > > ->gp_seq field is updated to indicate that this grace period > > has now ended. > > > > 10. CPU 2 continues running rcu_torture_writer() and sees that, > > from the viewpoint of the root rcu_node structure consulted by > > the poll_state_synchronize_rcu_full() function, the grace period > > has ended. It therefore updates state accordingly. > > > > 11. CPU 1 is still running the same RCU reader, which notices this > > update and thus complains about the too-short grace period. > > I think I get the race but I must confess I'm not very familiar with how this > all materialize on CPU 2's rcu_torture_writer() and CPU 1's rcu_torture_reader(). > > Basically this could trigger on CPU 1 with just doing the following, right? > > rcu_read_lock() > get_state_synchronize_rcu_full(&rgos); > WARN_ON_ONCE(poll_state_synchronize_rcu_full(&rgos)) > rcu_read_unlock() CPU 1 would do rcu_read_lock()/checks/rcu_read_unlock() as the reader, and CPU 2 would do get_state_synchronize_rcu_full(), later poll_state_synchronize_rcu_full(), which would (erroneously) indicate a completed grace period, so it would update the state, triggering CPU 1's checks. > > The fix is for the get_state_synchronize_rcu_full() function to use > > rcu_state.gp_seq instead of the the root rcu_node structure's ->gp_seq > > field. With this change in place, if step 5's cookie indicates that the > > grace period has not yet started, then any prior code executed by CPU 2 > > must have happened before CPU 1 came online. This will in turn prevent > > CPU 1's code in steps 3 and 11 from spanning CPU 2's grace-period wait, > > thus preventing CPU 1 from being subjected to a too-short grace period. > > > > This commit therefore makes this change. Note that there is no change to > > the poll_state_synchronize_rcu_full() function, which as noted above, > > must continue to use the root rcu_node structure's ->gp_seq field. > > This is of course an asymmetry between these two functions, but is an > > asymmetry that is absolutely required for correct operation. It is a > > common human tendency to greatly value symmetry, and sometimes symmetry > > is a wonderful thing. Other times, symmetry results in poor performance. > > But in this case, symmetry is just plain wrong. > > > > Nevertheless, the asymmetry does require an additional adjustment. > > It is possible for get_state_synchronize_rcu_full() to see a given > > grace period as having started, but for an immediately following > > poll_state_synchronize_rcu_full() to see it as having not yet started. > > Given the current rcu_seq_done_exact() implementation, this will > > result in a false-positive indication that the grace period is done > > from poll_state_synchronize_rcu_full(). This is dealt with by making > > rcu_seq_done_exact() reach back three grace periods rather than just > > two of them. > > > > Although this fixes 91a967fd6934 ("rcu: Add full-sized polling for > > get_completed*() and poll_state*()"), it is not clear that it is worth > > backporting this commit. First, it took me many weeks to convince > > rcutorture to reproduce this more frequently than once per year. 
Second, > > this cannot be reproduced at all without frequent CPU-hotplug operations, > > as in waiting all of 50 milliseconds from the end of the previous > > operation until starting the next one. Third, the TREE03.boot settings > > cause multi-millisecond delays during RCU grace-period initialization, > > which greatly increase the probability of the above sequence of events. > > (Don't do this in production workloads!) Fourth, extremely heavy use of > > get_state_synchronize_rcu_full() and/or poll_state_synchronize_rcu_full() > > is required to reproduce this, and as of v6.12, only kfree_rcu() uses it, > > and even then not particularly heavily. > > I'm wondering, what prevents us from removing rcu_state.gp_seq and rely only on > the root node for the global state ? One scenario comes to mind immediately. There may be others. Suppose we were running with default configuration on a system with "only" eight CPUs. Then there is only the one rcu_node structure, which is both root and leaf. Without rcu_state.gp_seq, there would be no way to communicate the beginning of the grace period to get_state_synchronize_rcu_full() without also allowing quiescent states to be reported. There would thus be no time in which to check for newly onlined/offlined CPUs. Thanx, Paul
Le Fri, Jan 24, 2025 at 07:58:20AM -0800, Paul E. McKenney a écrit : > On Fri, Jan 24, 2025 at 03:49:24PM +0100, Frederic Weisbecker wrote: > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > > > The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() > > > functions use the root rcu_node structure's ->gp_seq field to detect > > > the beginnings and ends of grace periods, respectively. This choice is > > > necessary for the poll_state_synchronize_rcu_full() function because > > > (give or take counter wrap), the following sequence is guaranteed not > > > to trigger: > > > > > > get_state_synchronize_rcu_full(&rgos); > > > synchronize_rcu(); > > > WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos)); > > > > > > The RCU callbacks that awaken synchronize_rcu() instances are > > > guaranteed not to be invoked before the root rcu_node structure's > > > ->gp_seq field is updated to indicate the end of the grace period. > > > However, these callbacks might start being invoked immediately > > > thereafter, in particular, before rcu_state.gp_seq has been updated. > > > Therefore, poll_state_synchronize_rcu_full() must refer to the > > > root rcu_node structure's ->gp_seq field. Because this field is > > > updated under this structure's ->lock, any code following a call to > > > poll_state_synchronize_rcu_full() will be fully ordered after the > > > full grace-period computation, as is required by RCU's memory-ordering > > > semantics. > > > > > > By symmetry, the get_state_synchronize_rcu_full() function should also > > > use this same root rcu_node structure's ->gp_seq field. But it turns out > > > that symmetry is profoundly (though extremely infrequently) destructive > > > in this case. To see this, consider the following sequence of events: > > > > > > 1. CPU 0 starts a new grace period, and updates rcu_state.gp_seq > > > accordingly. > > > > > > 2. As its first step of grace-period initialization, CPU 0 examines > > > the current CPU hotplug state and decides that it need not wait > > > for CPU 1, which is currently offline. > > > > > > 3. CPU 1 comes online, and updates its state. But this does not > > > affect the current grace period, but rather the one after that. > > > After all, CPU 1 was offline when the current grace period > > > started, so all pre-existing RCU readers on CPU 1 must have > > > completed or been preempted before it last went offline. > > > The current grace period therefore has nothing it needs to wait > > > for on CPU 1. > > > > > > 4. CPU 1 switches to an rcutorture kthread which is running > > > rcutorture's rcu_torture_reader() function, which starts a new > > > RCU reader. > > > > > > 5. CPU 2 is running rcutorture's rcu_torture_writer() function > > > and collects a new polled grace-period "cookie" using > > > get_state_synchronize_rcu_full(). Because the newly started > > > grace period has not completed initialization, the root rcu_node > > > structure's ->gp_seq field has not yet been updated to indicate > > > that this new grace period has already started. > > > > > > This cookie is therefore set up for the end of the current grace > > > period (rather than the end of the following grace period). > > > > > > 6. CPU 0 finishes grace-period initialization. > > > > > > 7. If CPU 1’s rcutorture reader is preempted, it will be added to > > > the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not > > > set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer > > > will not be updated. 
Thus, this grace period will not wait on > > > it. Which is only fair, given that the CPU did not come online > > > until after the grace period officially started. > > > > > > 8. CPUs 0 and 2 then detect the new grace period and then report > > > a quiescent state to the RCU core. > > > > > > 9. Because CPU 1 was offline at the start of the current grace > > > period, CPUs 0 and 2 are the only CPUs that this grace period > > > needs to wait on. So the grace period ends and post-grace-period > > > cleanup starts. In particular, the root rcu_node structure's > > > ->gp_seq field is updated to indicate that this grace period > > > has now ended. > > > > > > 10. CPU 2 continues running rcu_torture_writer() and sees that, > > > from the viewpoint of the root rcu_node structure consulted by > > > the poll_state_synchronize_rcu_full() function, the grace period > > > has ended. It therefore updates state accordingly. > > > > > > 11. CPU 1 is still running the same RCU reader, which notices this > > > update and thus complains about the too-short grace period. > > > > I think I get the race but I must confess I'm not very familiar with how this > > all materialize on CPU 2's rcu_torture_writer() and CPU 1's rcu_torture_reader(). > > > > Basically this could trigger on CPU 1 with just doing the following, right? > > > > rcu_read_lock() > > get_state_synchronize_rcu_full(&rgos); > > WARN_ON_ONCE(poll_state_synchronize_rcu_full(&rgos)) > > rcu_read_unlock() > > CPU 1 would do rcu_read_lock()/checks/rcu_read_unlock() as the > reader, and CPU 2 would do get_state_synchronize_rcu_full(), later > poll_state_synchronize_rcu_full(), which would (erroneously) indicate > a completed grace period, so it would update the state, triggering CPU > 1's checks.

I see, so if I generalize the problem beyond rcutorture, this looks like this, right?

CPU 0                                     CPU 1                               CPU 2
-----                                     ------                              -----

rcu_seq_start(rcu_state.gp_seq)
                                          // CPU boots
                                          rcu_read_lock()
                                          r0 = rcu_dereference(*X)
                                                                              r1 = *X
                                                                              *X = NULL
                                                                              // relies on rnp->gp_seq and not rcu_state.gp_seq
                                                                              get_state_synchronize_rcu_full(&rgos)
WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);
rcu_report_qs_rdp()
                                                                              rcu_report_qs_rdp()
                                                                              rcu_report_qs_rnp()
                                                                              rcu_report_qs_rsp()
                                                                              if (poll_state_synchronize_rcu_full(&rgos))
                                                                                  kfree(r1)
                                          // play with r0 and crash

> > I'm wondering, what prevents us from removing rcu_state.gp_seq and rely only on > > the root node for the global state ? > > One scenario comes to mind immediately. There may be others. > > Suppose we were running with default configuration on a system with > > "only" eight CPUs. Then there is only the one rcu_node structure, > > which is both root and leaf. Without rcu_state.gp_seq, there > > would be no way to communicate the beginning of the grace period to > > get_state_synchronize_rcu_full() without also allowing quiescent states > > to be reported. There would thus be no time in which to check for newly > > onlined/offlined CPUs.

Heh, that makes sense! Though perhaps that qsmaskinit[next] handling part could be done before rcu_seq_start()? Thanks. > > Thanx, Paul
On Fri, Jan 24, 2025 at 05:42:29PM +0100, Frederic Weisbecker wrote: > Le Fri, Jan 24, 2025 at 07:58:20AM -0800, Paul E. McKenney a écrit : > > On Fri, Jan 24, 2025 at 03:49:24PM +0100, Frederic Weisbecker wrote: > > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > > > > The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() > > > > functions use the root rcu_node structure's ->gp_seq field to detect > > > > the beginnings and ends of grace periods, respectively. This choice is > > > > necessary for the poll_state_synchronize_rcu_full() function because > > > > (give or take counter wrap), the following sequence is guaranteed not > > > > to trigger: > > > > > > > > get_state_synchronize_rcu_full(&rgos); > > > > synchronize_rcu(); > > > > WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos)); > > > > > > > > The RCU callbacks that awaken synchronize_rcu() instances are > > > > guaranteed not to be invoked before the root rcu_node structure's > > > > ->gp_seq field is updated to indicate the end of the grace period. > > > > However, these callbacks might start being invoked immediately > > > > thereafter, in particular, before rcu_state.gp_seq has been updated. > > > > Therefore, poll_state_synchronize_rcu_full() must refer to the > > > > root rcu_node structure's ->gp_seq field. Because this field is > > > > updated under this structure's ->lock, any code following a call to > > > > poll_state_synchronize_rcu_full() will be fully ordered after the > > > > full grace-period computation, as is required by RCU's memory-ordering > > > > semantics. > > > > > > > > By symmetry, the get_state_synchronize_rcu_full() function should also > > > > use this same root rcu_node structure's ->gp_seq field. But it turns out > > > > that symmetry is profoundly (though extremely infrequently) destructive > > > > in this case. To see this, consider the following sequence of events: > > > > > > > > 1. CPU 0 starts a new grace period, and updates rcu_state.gp_seq > > > > accordingly. > > > > > > > > 2. As its first step of grace-period initialization, CPU 0 examines > > > > the current CPU hotplug state and decides that it need not wait > > > > for CPU 1, which is currently offline. > > > > > > > > 3. CPU 1 comes online, and updates its state. But this does not > > > > affect the current grace period, but rather the one after that. > > > > After all, CPU 1 was offline when the current grace period > > > > started, so all pre-existing RCU readers on CPU 1 must have > > > > completed or been preempted before it last went offline. > > > > The current grace period therefore has nothing it needs to wait > > > > for on CPU 1. > > > > > > > > 4. CPU 1 switches to an rcutorture kthread which is running > > > > rcutorture's rcu_torture_reader() function, which starts a new > > > > RCU reader. > > > > > > > > 5. CPU 2 is running rcutorture's rcu_torture_writer() function > > > > and collects a new polled grace-period "cookie" using > > > > get_state_synchronize_rcu_full(). Because the newly started > > > > grace period has not completed initialization, the root rcu_node > > > > structure's ->gp_seq field has not yet been updated to indicate > > > > that this new grace period has already started. > > > > > > > > This cookie is therefore set up for the end of the current grace > > > > period (rather than the end of the following grace period). > > > > > > > > 6. CPU 0 finishes grace-period initialization. > > > > > > > > 7. 
If CPU 1’s rcutorture reader is preempted, it will be added to > > > > the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not > > > > set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer > > > > will not be updated. Thus, this grace period will not wait on > > > > it. Which is only fair, given that the CPU did not come online > > > > until after the grace period officially started. > > > > > > > > 8. CPUs 0 and 2 then detect the new grace period and then report > > > > a quiescent state to the RCU core. > > > > > > > > 9. Because CPU 1 was offline at the start of the current grace > > > > period, CPUs 0 and 2 are the only CPUs that this grace period > > > > needs to wait on. So the grace period ends and post-grace-period > > > > cleanup starts. In particular, the root rcu_node structure's > > > > ->gp_seq field is updated to indicate that this grace period > > > > has now ended. > > > > > > > > 10. CPU 2 continues running rcu_torture_writer() and sees that, > > > > from the viewpoint of the root rcu_node structure consulted by > > > > the poll_state_synchronize_rcu_full() function, the grace period > > > > has ended. It therefore updates state accordingly. > > > > > > > > 11. CPU 1 is still running the same RCU reader, which notices this > > > > update and thus complains about the too-short grace period. > > > > > > I think I get the race but I must confess I'm not very familiar with how this > > > all materialize on CPU 2's rcu_torture_writer() and CPU 1's rcu_torture_reader(). > > > > > > Basically this could trigger on CPU 1 with just doing the following, right? > > > > > > rcu_read_lock() > > > get_state_synchronize_rcu_full(&rgos); > > > WARN_ON_ONCE(poll_state_synchronize_rcu_full(&rgos)) > > > rcu_read_unlock() > > > > CPU 1 would do rcu_read_lock()/checks/rcu_read_unlock() as the > > reader, and CPU 2 would do get_state_synchronize_rcu_full(), later > > poll_state_synchronize_rcu_full(), which would (erroneously) indicate > > a completed grace period, so it would update the state, triggering CPU > > 1's checks. > > I see, so if I generalize the problem beyond rcutorture, this looks like this, > right? > > CPU 0 CPU 1 CPU 2 > ----- ------ ----- > > rcu_seq_start(rcu_state.gp_seq) > // CPU boots > rcu_read_lock() > r0 = rcu_dereference(*X) > r1 = *X > *X = NULL > // relies on rnp->gp_seq and not rcu_state.gp_seq > get_state_synchronize_rcu_full(&gros) > WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq); > rcu_report_qs_rdp() > rcu_report_qs_rdp() > rcu_report_qs_rnp() > rcu_report_qs_rsp() > if (poll_state_synchronize_rcu_full(&rgos)) > kfree(r1) > // play with r0 and crash That would do it! > > > I'm wondering, what prevents us from removing rcu_state.gp_seq and rely only on > > > the root node for the global state ? > > > > One scenario comes to mind immediately. There may be others. > > > > Suppose we were running with default configuration on a system with > > "only" eight CPUs. Then there is only the one rcu_node structure, > > which is both root and leaf. Without rcu_state.gp_seq, there > > would be no way to communicate the beginning of the grace period to > > get_state_synchronize_rcu_full() without also allowing quiescent states > > to be reported. There would thus be no time in which to check for newly > > onlined/offlined CPUs. > > Heh, that makes sense! Though perhaps that qsmaskinit[next] handling part > could be done before rcu_seq_start()? 
If we do that, aren't we vulnerable to a CPU coming online just after we handled qsmaskinit{,next} and just before we do rcu_seq_start()? Thanx, Paul
Le Fri, Jan 24, 2025 at 11:40:54AM -0800, Paul E. McKenney a écrit : > > > > I'm wondering, what prevents us from removing rcu_state.gp_seq and rely only on > > > > the root node for the global state ? > > > > > > One scenario comes to mind immediately. There may be others. > > > > > > Suppose we were running with default configuration on a system with > > > "only" eight CPUs. Then there is only the one rcu_node structure, > > > which is both root and leaf. Without rcu_state.gp_seq, there > > > would be no way to communicate the beginning of the grace period to > > > get_state_synchronize_rcu_full() without also allowing quiescent states > > > to be reported. There would thus be no time in which to check for newly > > > onlined/offlined CPUs. > > > > Heh, that makes sense! Though perhaps that qsmaskinit[next] handling part > > could be done before rcu_seq_start()? > > If we do that, aren't we vulnerable to a CPU coming online just after > we handled qsmaskinit{,next} and just before we do rcu_seq_start()? But then that CPU is guaranteed to see the pre-GP accesses because rcutree_report_cpu_starting() holds the rnp lock. Hmm but it won't see the starting gp_seq and it might therefore be one GP too early while accelerating its own callbacks (mis-attributing them the current GP in progress instead of a subsequent one) even though other CPUs may have reported their own QS already. Because I only now realize this necessary order while starting a grace period: 1) rcu_seq_start(rcu_state.gp_seq) 2) go through ->qsmaskinit[next] _while holding leaf rnp locks_ 3) go through the whole tree breadth first to reflect rcu_state.gp_seq to rnp->gp_seq. CPUs may start reporting QS concurrently once the first leaf node is reached And the step 2) with its fully ordered locking on leaves which release the write to rcu_state.gp_seq is what makes sure that if any CPU from step 3) has already reported a QS, it is guaranteed that any call to rcu_seq_snap() under any other leaf rnp locking will return the sequence of the subsequent GP number and not the current one in progress. This is what makes RCU an entertaining companion. You (think you) understand it in the evening and then you forget it all over again in the next morning. (And in lunch time you stare at things...). Thanks.
On Fri, Jan 24, 2025 at 11:25:01PM +0100, Frederic Weisbecker wrote: > Le Fri, Jan 24, 2025 at 11:40:54AM -0800, Paul E. McKenney a écrit : > > > > > I'm wondering, what prevents us from removing rcu_state.gp_seq and rely only on > > > > > the root node for the global state ? > > > > > > > > One scenario comes to mind immediately. There may be others. > > > > > > > > Suppose we were running with default configuration on a system with > > > > "only" eight CPUs. Then there is only the one rcu_node structure, > > > > which is both root and leaf. Without rcu_state.gp_seq, there > > > > would be no way to communicate the beginning of the grace period to > > > > get_state_synchronize_rcu_full() without also allowing quiescent states > > > > to be reported. There would thus be no time in which to check for newly > > > > onlined/offlined CPUs. > > > > > > Heh, that makes sense! Though perhaps that qsmaskinit[next] handling part > > > could be done before rcu_seq_start()? > > > > If we do that, aren't we vulnerable to a CPU coming online just after > > we handled qsmaskinit{,next} and just before we do rcu_seq_start()? > > But then that CPU is guaranteed to see the pre-GP accesses because > rcutree_report_cpu_starting() holds the rnp lock. > > Hmm but it won't see the starting gp_seq and it might therefore be one GP > too early while accelerating its own callbacks (mis-attributing them the > current GP in progress instead of a subsequent one) even though other CPUs > may have reported their own QS already. > > Because I only now realize this necessary order while starting a grace period: > > 1) rcu_seq_start(rcu_state.gp_seq) > 2) go through ->qsmaskinit[next] _while holding leaf rnp locks_ > 3) go through the whole tree breadth first to reflect rcu_state.gp_seq to > rnp->gp_seq. CPUs may start reporting QS concurrently once the first leaf > node is reached > > And the step 2) with its fully ordered locking on leaves which release the write > to rcu_state.gp_seq is what makes sure that if any CPU from step 3) has already > reported a QS, it is guaranteed that any call to rcu_seq_snap() under any other > leaf rnp locking will return the sequence of the subsequent GP number and not the > current one in progress. Very good!!! > This is what makes RCU an entertaining companion. You (think you) understand > it in the evening and then you forget it all over again in the next morning. > (And in lunch time you stare at things...). That sums up many of my entertaining experiences with RCU over the past three decades. ;-) Thanx, Paul
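[A very rough paraphrase, in code form, of the three-step ordering Frederic lays out above. The real sequence lives in rcu_gp_init(); everything here is simplified rather than exact.]

	/* Simplified sketch of the grace-period-start ordering; not the real rcu_gp_init(). */
	static void gp_start_ordering_sketch(void)
	{
		struct rcu_node *rnp;

		/* 1) Officially start the grace period. */
		rcu_seq_start(&rcu_state.gp_seq);

		/* 2) Capture CPU-hotplug state while holding each leaf rcu_node lock. */
		rcu_for_each_leaf_node(rnp) {
			raw_spin_lock_irq_rcu_node(rnp);
			rnp->qsmaskinit = rnp->qsmaskinitnext;
			raw_spin_unlock_irq_rcu_node(rnp);
		}

		/*
		 * 3) Propagate rcu_state.gp_seq to each rnp->gp_seq breadth first.
		 *    Quiescent states may start being reported as soon as the
		 *    first leaf is reached.
		 */
		rcu_for_each_node_breadth_first(rnp) {
			raw_spin_lock_irq_rcu_node(rnp);
			WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);
			raw_spin_unlock_irq_rcu_node(rnp);
		}
	}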
Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index 2f9c9272cd486..d2a91f705a4ab 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > { > unsigned long cur_s = READ_ONCE(*sp); > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1)); > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); This might need a comment. The way I understand it is that rcu_state.gp_seq might be seen started while root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below 2 GPs to safely deduce we wrapped around. Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ? Or am I missing something? Thanks.
On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote: > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > index 2f9c9272cd486..d2a91f705a4ab 100644 > > --- a/kernel/rcu/rcu.h > > +++ b/kernel/rcu/rcu.h > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > > { > > unsigned long cur_s = READ_ONCE(*sp); > > > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1)); > > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); > > This might need a comment. Good point! Would you like to propose one? > The way I understand it is that rcu_state.gp_seq might be seen started while > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below > 2 GPs to safely deduce we wrapped around. Exactly! > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ? Quite possibly. I freely admit that I allowed a bit of slop because time was of the essence (holidays and all that) and also it does not hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing of the common-case 2^64 cycle. It would not hurt to be exact, but it would be necessary to convince ourselves that we were not off by one in the wrong direction. I would be happy to see a patch, as long as it was sufficiently convincing. > Or am I missing something? Not that I can see. So the answer is probably "yes". ;-) Thanx, Paul
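[For concreteness, the patched helper with the interpretation above spelled out as a comment, assuming the in-tree values RCU_SEQ_CTR_SHIFT == 2 and therefore RCU_SEQ_STATE_MASK == 3, so that each grace period advances a gp_seq counter by 4. The _sketch suffix is only to mark this as an annotated copy, not new code.]

	/*
	 * With RCU_SEQ_STATE_MASK == 3:
	 *
	 *   old guard:  2 * RCU_SEQ_STATE_MASK + 1   ==  7  (just under 2 GPs)
	 *   new guard:  3 * RCU_SEQ_STATE_MASK + 1   == 10  (just under 3 GPs)
	 *   Frederic's: 2 * (RCU_SEQ_STATE_MASK + 1) ==  8  (exactly 2 GPs)
	 *
	 * A cookie taken from rcu_state.gp_seq can legitimately be up to two
	 * grace periods (8 counts) ahead of the root rcu_node's ->gp_seq, so a
	 * guard narrower than 8 lets that legitimate gap trip the wrap test
	 * below and yield a false "done".
	 */
	static inline bool rcu_seq_done_exact_sketch(unsigned long *sp, unsigned long s)
	{
		unsigned long cur_s = READ_ONCE(*sp);

		return ULONG_CMP_GE(cur_s, s) ||
		       ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
	}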
Le Fri, Jan 24, 2025 at 04:01:55PM -0800, Paul E. McKenney a écrit : > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote: > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > > index 2f9c9272cd486..d2a91f705a4ab 100644 > > > --- a/kernel/rcu/rcu.h > > > +++ b/kernel/rcu/rcu.h > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > > > { > > > unsigned long cur_s = READ_ONCE(*sp); > > > > > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1)); > > > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); > > > > This might need a comment. > > Good point! Would you like to propose one? Ok. > > > The way I understand it is that rcu_state.gp_seq might be seen started while > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below > > 2 GPs to safely deduce we wrapped around. > > Exactly! > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ? > > Quite possibly. I freely admit that I allowed a bit of slop because > time was of the essence (holidays and all that) and also it does not > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing > of the common-case 2^64 cycle. It would not hurt to be exact, but it > would be necessary to convince ourselves that we were not off by one in > the wrong direction. > > I would be happy to see a patch, as long as it was sufficiently > convincing. I'm not so much concerned about being exact but rather about making sure we still understand what we did within one year. We can leave one more grace period than what we expect out of paranoia but, the most important is that we comment about what we expect and why. Let me prepare a patch for that. In the meantime for your patch: Reviewed-by: Frederic Weisbecker <frederic@kernel.org> > > > Or am I missing something? > > Not that I can see. So the answer is probably "yes". ;-) Thanks! :-) > > Thanx, Paul
On Sat, Jan 25, 2025 at 03:56:16PM +0100, Frederic Weisbecker wrote: > Le Fri, Jan 24, 2025 at 04:01:55PM -0800, Paul E. McKenney a écrit : > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote: > > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > > > index 2f9c9272cd486..d2a91f705a4ab 100644 > > > > --- a/kernel/rcu/rcu.h > > > > +++ b/kernel/rcu/rcu.h > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > > > > { > > > > unsigned long cur_s = READ_ONCE(*sp); > > > > > > > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1)); > > > > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); > > > > > > This might need a comment. > > > > Good point! Would you like to propose one? > > Ok. Sounds good! > > > The way I understand it is that rcu_state.gp_seq might be seen started while > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below > > > 2 GPs to safely deduce we wrapped around. > > > > Exactly! > > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ? > > > > Quite possibly. I freely admit that I allowed a bit of slop because > > time was of the essence (holidays and all that) and also it does not > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing > > of the common-case 2^64 cycle. It would not hurt to be exact, but it > > would be necessary to convince ourselves that we were not off by one in > > the wrong direction. > > > > I would be happy to see a patch, as long as it was sufficiently > > convincing. > > I'm not so much concerned about being exact but rather about making > sure we still understand what we did within one year. We can leave one > more grace period than what we expect out of paranoia but, the most > important is that we comment about what we expect and why. Let me > prepare a patch for that. Even better! > In the meantime for your patch: > > Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Thank you, and I will apply this on my next rebase. Thanx, Paul > > > Or am I missing something? > > > > Not that I can see. So the answer is probably "yes". ;-) > > Thanks! :-) > > > > > Thanx, Paul
Hi, Paul and Frederic, [...] > > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote: > > > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > > > > index 2f9c9272cd486..d2a91f705a4ab 100644 > > > > > --- a/kernel/rcu/rcu.h > > > > > +++ b/kernel/rcu/rcu.h > > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > > > > > { > > > > > unsigned long cur_s = READ_ONCE(*sp); > > > > > > > > > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1)); > > > > > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); > > > > [...] > > > > The way I understand it is that rcu_state.gp_seq might be seen started while > > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq > > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below > > > > 2 GPs to safely deduce we wrapped around. > > > > > > Exactly! > > > > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ? > > > > > > Quite possibly. I freely admit that I allowed a bit of slop because > > > time was of the essence (holidays and all that) and also it does not > > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing > > > of the common-case 2^64 cycle. It would not hurt to be exact, but it > > > would be necessary to convince ourselves that we were not off by one in > > > the wrong direction. > > > > > > I would be happy to see a patch, as long as it was sufficiently > > > convincing. > > > > I'm not so much concerned about being exact but rather about making > > sure we still understand what we did within one year. We can leave one > > more grace period than what we expect out of paranoia but, the most > > important is that we comment about what we expect and why. Let me > > prepare a patch for that. > > Even better! Do we really have users who could pass such a huge delta wrapped around value to poll() i.e > ULONG_MAX/2 ? For 32-bit, that would be 2 billion count since the get() (500 million GPs on 32-bit?). I am curious if such a scenario should be a WARN() because also: If more than ULONG_MAX/2 values are possible after get(), then a full or multiple ULONG_MAX wraparound is also possible. This means then both rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway for such stale get() values. I do agree with both your points on the side effect of the patch to rcu_seq_done_exact(), but I am not fully convinced myself about utility of rcu_seq_done_exact() versus the rcu_seq_done() but I could be missing something. Otherwise I analyzed the patch and it makes sense to me: Reviewed-by:Joel Fernandes (Google) <joel@joelfernandes.org> thanks, - Joel
On Sun, Jan 26, 2025 at 8:13 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > Hi, Paul and Frederic, > > [...] > > > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote: > > > > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > > > > > index 2f9c9272cd486..d2a91f705a4ab 100644 > > > > > > --- a/kernel/rcu/rcu.h > > > > > > +++ b/kernel/rcu/rcu.h > > > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > > > > > > { > > > > > > unsigned long cur_s = READ_ONCE(*sp); > > > > > > > > > > > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1)); > > > > > > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); > > > > > > [...] > > > > > The way I understand it is that rcu_state.gp_seq might be seen started while > > > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq > > > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below > > > > > 2 GPs to safely deduce we wrapped around. > > > > > > > > Exactly! > > > > > > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ? > > > > > > > > Quite possibly. I freely admit that I allowed a bit of slop because > > > > time was of the essence (holidays and all that) and also it does not > > > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing > > > > of the common-case 2^64 cycle. It would not hurt to be exact, but it > > > > would be necessary to convince ourselves that we were not off by one in > > > > the wrong direction. > > > > > > > > I would be happy to see a patch, as long as it was sufficiently > > > > convincing. > > > > > > I'm not so much concerned about being exact but rather about making > > > sure we still understand what we did within one year. We can leave one > > > more grace period than what we expect out of paranoia but, the most > > > important is that we comment about what we expect and why. Let me > > > prepare a patch for that. > > > > Even better! > > Do we really have users who could pass such a huge delta wrapped > around value to poll() i.e > ULONG_MAX/2 ? For 32-bit, that would be > 2 billion count since the get() (500 million GPs on 32-bit?). I am > curious if such a scenario should be a WARN() because also: If more > than ULONG_MAX/2 values are possible after get(), then a full or > multiple ULONG_MAX wraparound is also possible. This means then both > rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway for > such stale get() values. > > I do agree with both your points on the side effect of the patch to > rcu_seq_done_exact(), but I am not fully convinced myself about > utility of rcu_seq_done_exact() versus the rcu_seq_done() but I could > be missing something. I want to modify my comment on reliability. rcu_seq_done_exact() is certainly more "reliable" than rcu_seq_done() for wraparound delta > ULONG_MAX/2. Still with such a huge wrap around it is not fail proof if it lands within the "3 Grace period" window, so if it is not super reliable and if the user should not rely on it, then I wonder if it is better to not do it and instead warn the user they are playing with fire. But a counter-argument might be, landing within the 3 GP window is quite low probability... 
thanks, - Joel > > Otherwise I analyzed the patch and it makes sense to me: > > Reviewed-by:Joel Fernandes (Google) <joel@joelfernandes.org> > > thanks, > > - Joel
On Sun, Jan 26, 2025 at 08:22:23PM -0500, Joel Fernandes wrote: > On Sun, Jan 26, 2025 at 8:13 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > Hi, Paul and Frederic, > > > > [...] > > > > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote: > > > > > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > > > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > > > > > > index 2f9c9272cd486..d2a91f705a4ab 100644 > > > > > > > --- a/kernel/rcu/rcu.h > > > > > > > +++ b/kernel/rcu/rcu.h > > > > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > > > > > > > { > > > > > > > unsigned long cur_s = READ_ONCE(*sp); > > > > > > > > > > > > > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1)); > > > > > > > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); > > > > > > > > [...] > > > > > > The way I understand it is that rcu_state.gp_seq might be seen started while > > > > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq > > > > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below > > > > > > 2 GPs to safely deduce we wrapped around. > > > > > > > > > > Exactly! > > > > > > > > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ? > > > > > > > > > > Quite possibly. I freely admit that I allowed a bit of slop because > > > > > time was of the essence (holidays and all that) and also it does not > > > > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing > > > > > of the common-case 2^64 cycle. It would not hurt to be exact, but it > > > > > would be necessary to convince ourselves that we were not off by one in > > > > > the wrong direction. > > > > > > > > > > I would be happy to see a patch, as long as it was sufficiently > > > > > convincing. > > > > > > > > I'm not so much concerned about being exact but rather about making > > > > sure we still understand what we did within one year. We can leave one > > > > more grace period than what we expect out of paranoia but, the most > > > > important is that we comment about what we expect and why. Let me > > > > prepare a patch for that. > > > > > > Even better! > > > > Do we really have users who could pass such a huge delta wrapped > > around value to poll() i.e > ULONG_MAX/2 ? For 32-bit, that would be > > 2 billion count since the get() (500 million GPs on 32-bit?). I am > > curious if such a scenario should be a WARN() because also: If more > > than ULONG_MAX/2 values are possible after get(), then a full or > > multiple ULONG_MAX wraparound is also possible. This means then both > > rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway for > > such stale get() values. > > > > I do agree with both your points on the side effect of the patch to > > rcu_seq_done_exact(), but I am not fully convinced myself about > > utility of rcu_seq_done_exact() versus the rcu_seq_done() but I could > > be missing something. > > I want to modify my comment on reliability. rcu_seq_done_exact() is > certainly more "reliable" than rcu_seq_done() for wraparound delta > > ULONG_MAX/2. Still with such a huge wrap around it is not fail proof > if it lands within the "3 Grace period" window, so if it is not super > reliable and if the user should not rely on it, then I wonder if it is > better to not do it and instead warn the user they are playing with > fire. 
But a counter-argument might be, landing within the 3 GP window > is quite low probability... It is also a harmless false negative. And another poll within a few hundred milliseconds will set things straight. In contrast, if we used rcu_seq_done(), it would be a good long time before we got out of false-negative territory. On a 32-bit system, if there was an expedited grace period every 10 microseconds (just barely possible on small systems), then a 32-bit counter would wrap in about 12 hours. So there would be a six-hour false-negative zone. So an expedited grace period every 10 microseconds combined with a six-hour polling delay is unlikely, but not out of the realm of possibility. Please feel free to nominate a comment. > > Otherwise I analyzed the patch and it makes sense to me: > > > > Reviewed-by:Joel Fernandes (Google) <joel@joelfernandes.org> Thank you! I will apply this on my next review. Thanx, Paul
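[A sketch of the comparison behind that false-negative zone, assuming the usual mainline definitions of the wraparound-aware comparison macros.]

	#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))
	#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))

	/*
	 * rcu_seq_done() is just ULONG_CMP_GE(cur, s): it reports "done" only
	 * while cur - s lies in the first half of the counter cycle.  Once the
	 * counter advances more than half a cycle past a stale cookie, the
	 * check flips back to "not done" for the remaining half cycle, which
	 * is the roughly six-hour window in the example above.
	 * rcu_seq_done_exact() adds the reach-back test, shrinking the
	 * false-negative window to the few counts just below the cookie.
	 */
	static inline bool rcu_seq_done_sketch(unsigned long *sp, unsigned long s)
	{
		return ULONG_CMP_GE(READ_ONCE(*sp), s);
	}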
On Sun, Jan 26, 2025 at 9:03 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Sun, Jan 26, 2025 at 08:22:23PM -0500, Joel Fernandes wrote: > > On Sun, Jan 26, 2025 at 8:13 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > Hi, Paul and Frederic, > > > > > > [...] > > > > > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote: > > > > > > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > > > > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > > > > > > > index 2f9c9272cd486..d2a91f705a4ab 100644 > > > > > > > > --- a/kernel/rcu/rcu.h > > > > > > > > +++ b/kernel/rcu/rcu.h > > > > > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > > > > > > > > { > > > > > > > > unsigned long cur_s = READ_ONCE(*sp); > > > > > > > > > > > > > > > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1)); > > > > > > > > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); > > > > > > > > > > [...] > > > > > > > The way I understand it is that rcu_state.gp_seq might be seen started while > > > > > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq > > > > > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below > > > > > > > 2 GPs to safely deduce we wrapped around. > > > > > > > > > > > > Exactly! > > > > > > > > > > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ? > > > > > > > > > > > > Quite possibly. I freely admit that I allowed a bit of slop because > > > > > > time was of the essence (holidays and all that) and also it does not > > > > > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing > > > > > > of the common-case 2^64 cycle. It would not hurt to be exact, but it > > > > > > would be necessary to convince ourselves that we were not off by one in > > > > > > the wrong direction. > > > > > > > > > > > > I would be happy to see a patch, as long as it was sufficiently > > > > > > convincing. > > > > > > > > > > I'm not so much concerned about being exact but rather about making > > > > > sure we still understand what we did within one year. We can leave one > > > > > more grace period than what we expect out of paranoia but, the most > > > > > important is that we comment about what we expect and why. Let me > > > > > prepare a patch for that. > > > > > > > > Even better! > > > > > > Do we really have users who could pass such a huge delta wrapped > > > around value to poll() i.e > ULONG_MAX/2 ? For 32-bit, that would be > > > 2 billion count since the get() (500 million GPs on 32-bit?). I am > > > curious if such a scenario should be a WARN() because also: If more > > > than ULONG_MAX/2 values are possible after get(), then a full or > > > multiple ULONG_MAX wraparound is also possible. This means then both > > > rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway for > > > such stale get() values. > > > > > > I do agree with both your points on the side effect of the patch to > > > rcu_seq_done_exact(), but I am not fully convinced myself about > > > utility of rcu_seq_done_exact() versus the rcu_seq_done() but I could > > > be missing something. > > > > I want to modify my comment on reliability. rcu_seq_done_exact() is > > certainly more "reliable" than rcu_seq_done() for wraparound delta > > > ULONG_MAX/2. 
Still with such a huge wrap around it is not fail proof > > if it lands within the "3 Grace period" window, so if it is not super > > reliable and if the user should not rely on it, then I wonder if it is > > better to not do it and instead warn the user they are playing with > > fire. But a counter-argument might be, landing within the 3 GP window > > is quite low probability... > > It is also a harmless false negative. And another poll within a few > hundred milliseconds will set things straight. In contrast, if we > used rcu_seq_done(), it would be a good long time before we got out of > false-negative territory. True! > On a 32-bit system, if there was an expedited grace period every 10 > microseconds (just barely possible on small systems), then a 32-bit > counter would wrap in about 12 hours. So there would be a six-hour > false-negative zone. > > So an expedited grace period every 10 microseconds combined with > a six-hour polling delay is unlikely, but not out of the realm of > possibility. Assuming that every 10 microsecond of the entire 6 hours is used for an expedited GP, but I agree with your point. > Please feel free to nominate a comment. Ok thanks, I'll see what Frederic comes up with and can take it from there. Thanks! - Joel
On Sun, Jan 26, 2025 at 9:55 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Sun, Jan 26, 2025 at 9:03 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Sun, Jan 26, 2025 at 08:22:23PM -0500, Joel Fernandes wrote: > > > On Sun, Jan 26, 2025 at 8:13 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > Hi, Paul and Frederic, > > > > > > > > [...] > > > > > > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote: > > > > > > > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > > > > > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > > > > > > > > index 2f9c9272cd486..d2a91f705a4ab 100644 > > > > > > > > > --- a/kernel/rcu/rcu.h > > > > > > > > > +++ b/kernel/rcu/rcu.h > > > > > > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > > > > > > > > > { > > > > > > > > > unsigned long cur_s = READ_ONCE(*sp); > > > > > > > > > > > > > > > > > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1)); > > > > > > > > > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); > > > > > > > > > > > > [...] > > > > > > > > The way I understand it is that rcu_state.gp_seq might be seen started while > > > > > > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq > > > > > > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below > > > > > > > > 2 GPs to safely deduce we wrapped around. > > > > > > > > > > > > > > Exactly! > > > > > > > > > > > > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ? > > > > > > > > > > > > > > Quite possibly. I freely admit that I allowed a bit of slop because > > > > > > > time was of the essence (holidays and all that) and also it does not > > > > > > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing > > > > > > > of the common-case 2^64 cycle. It would not hurt to be exact, but it > > > > > > > would be necessary to convince ourselves that we were not off by one in > > > > > > > the wrong direction. > > > > > > > > > > > > > > I would be happy to see a patch, as long as it was sufficiently > > > > > > > convincing. > > > > > > > > > > > > I'm not so much concerned about being exact but rather about making > > > > > > sure we still understand what we did within one year. We can leave one > > > > > > more grace period than what we expect out of paranoia but, the most > > > > > > important is that we comment about what we expect and why. Let me > > > > > > prepare a patch for that. > > > > > > > > > > Even better! > > > > > > > > Do we really have users who could pass such a huge delta wrapped > > > > around value to poll() i.e > ULONG_MAX/2 ? For 32-bit, that would be > > > > 2 billion count since the get() (500 million GPs on 32-bit?). I am > > > > curious if such a scenario should be a WARN() because also: If more > > > > than ULONG_MAX/2 values are possible after get(), then a full or > > > > multiple ULONG_MAX wraparound is also possible. This means then both > > > > rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway for > > > > such stale get() values. > > > > > > > > I do agree with both your points on the side effect of the patch to > > > > rcu_seq_done_exact(), but I am not fully convinced myself about > > > > utility of rcu_seq_done_exact() versus the rcu_seq_done() but I could > > > > be missing something. 
> > > > > > I want to modify my comment on reliability. rcu_seq_done_exact() is > > > certainly more "reliable" than rcu_seq_done() for wraparound delta > > > > ULONG_MAX/2. Still with such a huge wrap around it is not fail proof > > > if it lands within the "3 Grace period" window, so if it is not super > > > reliable and if the user should not rely on it, then I wonder if it is > > > better to not do it and instead warn the user they are playing with > > > fire. But a counter-argument might be, landing within the 3 GP window > > > is quite low probability... > > > > It is also a harmless false negative. And another poll within a few > > hundred milliseconds will set things straight. In contrast, if we > > used rcu_seq_done(), it would be a good long time before we got out of > > false-negative territory. > > True! > > > On a 32-bit system, if there was an expedited grace period every 10 > > microseconds (just barely possible on small systems), then a 32-bit > > counter would wrap in about 12 hours. So there would be a six-hour > > false-negative zone. > > > > So an expedited grace period every 10 microseconds combined with > > a six-hour polling delay is unlikely, but not out of the realm of > > possibility. > > Assuming that every 10 microsecond of the entire 6 hours is used for > an expedited GP, but I agree with your point. > > > Please feel free to nominate a comment. > > Ok thanks, I'll see what Frederic comes up with and can take it from there. > I forgot to ask is the reason for keeping rcu_seq_done_exact() to have more complexity only when needed, or does it make sense to rename rcu_seq_done_exact() as rcu_seq_done_done() and get rid of the old rcu_seq_done()? thanks, - Joel
On Sun, Jan 26, 2025 at 09:58:11PM -0500, Joel Fernandes wrote: > On Sun, Jan 26, 2025 at 9:55 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > On Sun, Jan 26, 2025 at 9:03 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > On Sun, Jan 26, 2025 at 08:22:23PM -0500, Joel Fernandes wrote: > > > > On Sun, Jan 26, 2025 at 8:13 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > Hi, Paul and Frederic, > > > > > > > > > > [...] > > > > > > > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote: > > > > > > > > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > > > > > > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > > > > > > > > > index 2f9c9272cd486..d2a91f705a4ab 100644 > > > > > > > > > > --- a/kernel/rcu/rcu.h > > > > > > > > > > +++ b/kernel/rcu/rcu.h > > > > > > > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > > > > > > > > > > { > > > > > > > > > > unsigned long cur_s = READ_ONCE(*sp); > > > > > > > > > > > > > > > > > > > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1)); > > > > > > > > > > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); > > > > > > > > > > > > > > [...] > > > > > > > > > The way I understand it is that rcu_state.gp_seq might be seen started while > > > > > > > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq > > > > > > > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below > > > > > > > > > 2 GPs to safely deduce we wrapped around. > > > > > > > > > > > > > > > > Exactly! > > > > > > > > > > > > > > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ? > > > > > > > > > > > > > > > > Quite possibly. I freely admit that I allowed a bit of slop because > > > > > > > > time was of the essence (holidays and all that) and also it does not > > > > > > > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing > > > > > > > > of the common-case 2^64 cycle. It would not hurt to be exact, but it > > > > > > > > would be necessary to convince ourselves that we were not off by one in > > > > > > > > the wrong direction. > > > > > > > > > > > > > > > > I would be happy to see a patch, as long as it was sufficiently > > > > > > > > convincing. > > > > > > > > > > > > > > I'm not so much concerned about being exact but rather about making > > > > > > > sure we still understand what we did within one year. We can leave one > > > > > > > more grace period than what we expect out of paranoia but, the most > > > > > > > important is that we comment about what we expect and why. Let me > > > > > > > prepare a patch for that. > > > > > > > > > > > > Even better! > > > > > > > > > > Do we really have users who could pass such a huge delta wrapped > > > > > around value to poll() i.e > ULONG_MAX/2 ? For 32-bit, that would be > > > > > 2 billion count since the get() (500 million GPs on 32-bit?). I am > > > > > curious if such a scenario should be a WARN() because also: If more > > > > > than ULONG_MAX/2 values are possible after get(), then a full or > > > > > multiple ULONG_MAX wraparound is also possible. This means then both > > > > > rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway for > > > > > such stale get() values. 
> > > > > > > > > > I do agree with both your points on the side effect of the patch to > > > > > rcu_seq_done_exact(), but I am not fully convinced myself about > > > > > utility of rcu_seq_done_exact() versus the rcu_seq_done() but I could > > > > > be missing something. > > > > > > > > I want to modify my comment on reliability. rcu_seq_done_exact() is > > > > certainly more "reliable" than rcu_seq_done() for wraparound delta > > > > > ULONG_MAX/2. Still with such a huge wrap around it is not fail proof > > > > if it lands within the "3 Grace period" window, so if it is not super > > > > reliable and if the user should not rely on it, then I wonder if it is > > > > better to not do it and instead warn the user they are playing with > > > > fire. But a counter-argument might be, landing within the 3 GP window > > > > is quite low probability... > > > > > > It is also a harmless false negative. And another poll within a few > > > hundred milliseconds will set things straight. In contrast, if we > > > used rcu_seq_done(), it would be a good long time before we got out of > > > false-negative territory. > > > > True! > > > > > On a 32-bit system, if there was an expedited grace period every 10 > > > microseconds (just barely possible on small systems), then a 32-bit > > > counter would wrap in about 12 hours. So there would be a six-hour > > > false-negative zone. > > > > > > So an expedited grace period every 10 microseconds combined with > > > a six-hour polling delay is unlikely, but not out of the realm of > > > possibility. > > > > Assuming that every 10 microsecond of the entire 6 hours is used for > > an expedited GP, but I agree with your point. > > > > > Please feel free to nominate a comment. > > > > Ok thanks, I'll see what Frederic comes up with and can take it from there. > > I forgot to ask is the reason for keeping rcu_seq_done_exact() to have > more complexity only when needed, or does it make sense to rename > rcu_seq_done_exact() as rcu_seq_done_done() and get rid of the old > rcu_seq_done()? We might well be able to use the rcu_seq_done_exact() algorithm for all the uses, but that will require very careful review and testing. Please keep in mind that there are quite a few uses, that the benefit is small, and I was lazy. ;-) The difference with the polling APIs is that a user might well legitimately hang onto a cookie for six hours, for example, by having used it to tag a data structure in a block cache or similar. Thanx, Paul
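A minimal kernel-style sketch of the "hang onto a cookie" pattern described above; the cache_entry structure and the two helpers are invented for illustration, while struct rcu_gp_oldstate, get_state_synchronize_rcu_full(), and poll_state_synchronize_rcu_full() are the full-sized polling APIs under discussion:

#include <linux/rcupdate.h>
#include <linux/types.h>

/* Hypothetical cache entry tagged with a full-sized polled-GP cookie. */
struct cache_entry {
	struct rcu_gp_oldstate rgos;
	void *payload;
};

/* At insertion time, record the current grace-period state. */
static void cache_entry_tag(struct cache_entry *e)
{
	get_state_synchronize_rcu_full(&e->rgos);
}

/*
 * Possibly hours later, check whether a full grace period has elapsed
 * since the entry was tagged.  A false negative here is harmless, since
 * a later poll will eventually return true; a false positive would be
 * a too-short grace period, which is what the reach-back adjustment to
 * rcu_seq_done_exact() guards against.
 */
static bool cache_entry_gp_elapsed(struct cache_entry *e)
{
	return poll_state_synchronize_rcu_full(&e->rgos);
}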
On Mon, Jan 27, 2025 at 11:49 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Sun, Jan 26, 2025 at 09:58:11PM -0500, Joel Fernandes wrote: > > On Sun, Jan 26, 2025 at 9:55 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > On Sun, Jan 26, 2025 at 9:03 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > On Sun, Jan 26, 2025 at 08:22:23PM -0500, Joel Fernandes wrote: > > > > > On Sun, Jan 26, 2025 at 8:13 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > Hi, Paul and Frederic, > > > > > > > > > > > > [...] > > > > > > > > > On Sat, Jan 25, 2025 at 12:03:58AM +0100, Frederic Weisbecker wrote: > > > > > > > > > > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > > > > > > > > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > > > > > > > > > > index 2f9c9272cd486..d2a91f705a4ab 100644 > > > > > > > > > > > --- a/kernel/rcu/rcu.h > > > > > > > > > > > +++ b/kernel/rcu/rcu.h > > > > > > > > > > > @@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > > > > > > > > > > > { > > > > > > > > > > > unsigned long cur_s = READ_ONCE(*sp); > > > > > > > > > > > > > > > > > > > > > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1)); > > > > > > > > > > > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); > > > > > > > > > > > > > > > > [...] > > > > > > > > > > The way I understand it is that rcu_state.gp_seq might be seen started while > > > > > > > > > > root_rnp->gp_seq is not. So rcu_seq_snap() on the started rcu_state.gp_seq > > > > > > > > > > may return maximum 2 full GPs ahead of root_rnp->gp_seq. And therefore it takes below > > > > > > > > > > 2 GPs to safely deduce we wrapped around. > > > > > > > > > > > > > > > > > > Exactly! > > > > > > > > > > > > > > > > > > > Should it be ULONG_CMP_LT(cur_s, s - (2 * (RCU_SEQ_STATE_MASK + 1))) ? > > > > > > > > > > > > > > > > > > Quite possibly. I freely admit that I allowed a bit of slop because > > > > > > > > > time was of the essence (holidays and all that) and also it does not > > > > > > > > > hurt much to lose a couple of counts out of a 2^32 cycle, to say nothing > > > > > > > > > of the common-case 2^64 cycle. It would not hurt to be exact, but it > > > > > > > > > would be necessary to convince ourselves that we were not off by one in > > > > > > > > > the wrong direction. > > > > > > > > > > > > > > > > > > I would be happy to see a patch, as long as it was sufficiently > > > > > > > > > convincing. > > > > > > > > > > > > > > > > I'm not so much concerned about being exact but rather about making > > > > > > > > sure we still understand what we did within one year. We can leave one > > > > > > > > more grace period than what we expect out of paranoia but, the most > > > > > > > > important is that we comment about what we expect and why. Let me > > > > > > > > prepare a patch for that. > > > > > > > > > > > > > > Even better! > > > > > > > > > > > > Do we really have users who could pass such a huge delta wrapped > > > > > > around value to poll() i.e > ULONG_MAX/2 ? For 32-bit, that would be > > > > > > 2 billion count since the get() (500 million GPs on 32-bit?). I am > > > > > > curious if such a scenario should be a WARN() because also: If more > > > > > > than ULONG_MAX/2 values are possible after get(), then a full or > > > > > > multiple ULONG_MAX wraparound is also possible. 
This means then both > > > > > > rcu_seq_done() and rcu_seq_done_exact() become unreliable anyway for > > > > > > such stale get() values. > > > > > > > > > > > > I do agree with both your points on the side effect of the patch to > > > > > > rcu_seq_done_exact(), but I am not fully convinced myself about > > > > > > utility of rcu_seq_done_exact() versus the rcu_seq_done() but I could > > > > > > be missing something. > > > > > > > > > > I want to modify my comment on reliability. rcu_seq_done_exact() is > > > > > certainly more "reliable" than rcu_seq_done() for wraparound delta > > > > > > ULONG_MAX/2. Still with such a huge wrap around it is not fail proof > > > > > if it lands within the "3 Grace period" window, so if it is not super > > > > > reliable and if the user should not rely on it, then I wonder if it is > > > > > better to not do it and instead warn the user they are playing with > > > > > fire. But a counter-argument might be, landing within the 3 GP window > > > > > is quite low probability... > > > > > > > > It is also a harmless false negative. And another poll within a few > > > > hundred milliseconds will set things straight. In contrast, if we > > > > used rcu_seq_done(), it would be a good long time before we got out of > > > > false-negative territory. > > > > > > True! > > > > > > > On a 32-bit system, if there was an expedited grace period every 10 > > > > microseconds (just barely possible on small systems), then a 32-bit > > > > counter would wrap in about 12 hours. So there would be a six-hour > > > > false-negative zone. > > > > > > > > So an expedited grace period every 10 microseconds combined with > > > > a six-hour polling delay is unlikely, but not out of the realm of > > > > possibility. > > > > > > Assuming that every 10 microsecond of the entire 6 hours is used for > > > an expedited GP, but I agree with your point. > > > > > > > Please feel free to nominate a comment. > > > > > > Ok thanks, I'll see what Frederic comes up with and can take it from there. > > > > I forgot to ask is the reason for keeping rcu_seq_done_exact() to have > > more complexity only when needed, or does it make sense to rename > > rcu_seq_done_exact() as rcu_seq_done_done() and get rid of the old > > rcu_seq_done()? > > We might well be able to use the rcu_seq_done_exact() algorithm for > all the uses, but that will require very careful review and testing. > Please keep in mind that there are quite a few uses, that the benefit > is small, and I was lazy. ;-) > > The difference with the polling APIs is that a user might well > legitimately hang onto a cookie for six hours, for example, by having > used it to tag a data structure in a block cache or similar. Thanks for clarification. I will work on a patch along these lines and see if it is worth doing. - Joel
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 2f9c9272cd486..d2a91f705a4ab 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
 {
 	unsigned long cur_s = READ_ONCE(*sp);
 
-	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
+	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
 }
 
 /*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 980f1fa719665..71620a8a2eb3d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4204,14 +4204,17 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
  */
 void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 {
-	struct rcu_node *rnp = rcu_get_root();
-
 	/*
 	 * Any prior manipulation of RCU-protected data must happen
 	 * before the loads from ->gp_seq and ->expedited_sequence.
 	 */
 	smp_mb(); /* ^^^ */
-	rgosp->rgos_norm = rcu_seq_snap(&rnp->gp_seq);
+
+	// Yes, rcu_state.gp_seq, not rnp_root->gp_seq, the latter's use
+	// in poll_state_synchronize_rcu_full() notwithstanding. Use of
+	// the latter here would result in too-short grace periods due to
+	// interactions with newly onlined CPUs.
+	rgosp->rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
 	rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence);
 }
 EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_full);
The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions use the root rcu_node structure's ->gp_seq field to detect the beginnings and ends of grace periods, respectively. This choice is necessary for the poll_state_synchronize_rcu_full() function because (give or take counter wrap), the following sequence is guaranteed not to trigger: get_state_synchronize_rcu_full(&rgos); synchronize_rcu(); WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos)); The RCU callbacks that awaken synchronize_rcu() instances are guaranteed not to be invoked before the root rcu_node structure's ->gp_seq field is updated to indicate the end of the grace period. However, these callbacks might start being invoked immediately thereafter, in particular, before rcu_state.gp_seq has been updated. Therefore, poll_state_synchronize_rcu_full() must refer to the root rcu_node structure's ->gp_seq field. Because this field is updated under this structure's ->lock, any code following a call to poll_state_synchronize_rcu_full() will be fully ordered after the full grace-period computation, as is required by RCU's memory-ordering semantics. By symmetry, the get_state_synchronize_rcu_full() function should also use this same root rcu_node structure's ->gp_seq field. But it turns out that symmetry is profoundly (though extremely infrequently) destructive in this case. To see this, consider the following sequence of events: 1. CPU 0 starts a new grace period, and updates rcu_state.gp_seq accordingly. 2. As its first step of grace-period initialization, CPU 0 examines the current CPU hotplug state and decides that it need not wait for CPU 1, which is currently offline. 3. CPU 1 comes online, and updates its state. But this does not affect the current grace period, but rather the one after that. After all, CPU 1 was offline when the current grace period started, so all pre-existing RCU readers on CPU 1 must have completed or been preempted before it last went offline. The current grace period therefore has nothing it needs to wait for on CPU 1. 4. CPU 1 switches to an rcutorture kthread which is running rcutorture's rcu_torture_reader() function, which starts a new RCU reader. 5. CPU 2 is running rcutorture's rcu_torture_writer() function and collects a new polled grace-period "cookie" using get_state_synchronize_rcu_full(). Because the newly started grace period has not completed initialization, the root rcu_node structure's ->gp_seq field has not yet been updated to indicate that this new grace period has already started. This cookie is therefore set up for the end of the current grace period (rather than the end of the following grace period). 6. CPU 0 finishes grace-period initialization. 7. If CPU 1’s rcutorture reader is preempted, it will be added to the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer will not be updated. Thus, this grace period will not wait on it. Which is only fair, given that the CPU did not come online until after the grace period officially started. 8. CPUs 0 and 2 then detect the new grace period and then report a quiescent state to the RCU core. 9. Because CPU 1 was offline at the start of the current grace period, CPUs 0 and 2 are the only CPUs that this grace period needs to wait on. So the grace period ends and post-grace-period cleanup starts. In particular, the root rcu_node structure's ->gp_seq field is updated to indicate that this grace period has now ended. 10. 
CPU 2 continues running rcu_torture_writer() and sees that, from the viewpoint of the root rcu_node structure consulted by the poll_state_synchronize_rcu_full() function, the grace period has ended. It therefore updates state accordingly.

11.	CPU 1 is still running the same RCU reader, which notices this update and thus complains about the too-short grace period.

The fix is for the get_state_synchronize_rcu_full() function to use rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq field. With this change in place, if step 5's cookie indicates that the grace period has not yet started, then any prior code executed by CPU 2 must have happened before CPU 1 came online. This will in turn prevent CPU 1's code in steps 3 and 11 from spanning CPU 2's grace-period wait, thus preventing CPU 1 from being subjected to a too-short grace period.

This commit therefore makes this change. Note that there is no change to the poll_state_synchronize_rcu_full() function, which, as noted above, must continue to use the root rcu_node structure's ->gp_seq field. This is of course an asymmetry between these two functions, but it is an asymmetry that is absolutely required for correct operation. It is a common human tendency to greatly value symmetry, and sometimes symmetry is a wonderful thing. Other times, symmetry results in poor performance. But in this case, symmetry is just plain wrong.

Nevertheless, the asymmetry does require an additional adjustment. It is possible for get_state_synchronize_rcu_full() to see a given grace period as having started, but for an immediately following poll_state_synchronize_rcu_full() to see it as having not yet started. Given the current rcu_seq_done_exact() implementation, this will result in a false-positive indication that the grace period is done from poll_state_synchronize_rcu_full(). This is dealt with by making rcu_seq_done_exact() reach back three grace periods rather than just two of them.

Although this fixes 91a967fd6934 ("rcu: Add full-sized polling for get_completed*() and poll_state*()"), it is not clear that it is worth backporting this commit. First, it took me many weeks to convince rcutorture to reproduce this more frequently than once per year. Second, this cannot be reproduced at all without frequent CPU-hotplug operations, as in waiting all of 50 milliseconds from the end of the previous operation until starting the next one. Third, the TREE03.boot settings cause multi-millisecond delays during RCU grace-period initialization, which greatly increase the probability of the above sequence of events. (Don't do this in production workloads!) Fourth, extremely heavy use of get_state_synchronize_rcu_full() and/or poll_state_synchronize_rcu_full() is required to reproduce this, and as of v6.12, only kfree_rcu() uses it, and even then not particularly heavily.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

---

Changes since RFC v1:

o	Update rcu_seq_done_exact() to handle backwards-gp_seq effects of get_state_synchronize_rcu_full() using rcu_state.gp_seq and poll_state_synchronize_rcu_full() continuing to use the root rcu_node structure's ->gp_seq field. With this commit, rcutorture passes 10 hours of 400 instances of TREE03. Without this commit, there are about 50 too-short grace periods per hour, again with 400 instances of TREE03.

o	Add TREE03.boot's multi-millisecond delays for grace-period initialization.
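As a worked example of the "additional adjustment" paragraph above, here is a small standalone sketch (userspace C; seq_snap() mirrors the rounding that rcu_seq_snap() performs in kernel/rcu/rcu.h, and the specific numeric values are illustrative):

#include <limits.h>
#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)

/*
 * Mirror of rcu_seq_snap(): the ->gp_seq value at which a grace period
 * starting after this snapshot is guaranteed to have completed.
 */
static unsigned long seq_snap(unsigned long seq)
{
	return (seq + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
}

int main(void)
{
	unsigned long rcu_state_gp_seq = 9;	/* new GP started: counter 2, state 1 */
	unsigned long root_rnp_gp_seq = 8;	/* root rcu_node not yet updated: counter 2, state 0 */
	unsigned long cookie = seq_snap(rcu_state_gp_seq);

	/* cookie == 16, which is 8 counts (two full GPs) ahead of the root value. */
	printf("cookie=%lu root=%lu lag=%lu\n",
	       cookie, root_rnp_gp_seq, cookie - root_rnp_gp_seq);

	/*
	 * rcu_seq_done_exact() must therefore not treat a cookie that is
	 * up to 2 * (RCU_SEQ_STATE_MASK + 1) == 8 counts ahead of the
	 * root ->gp_seq as a wrapped-around "completed long ago" value;
	 * reaching back 3 * RCU_SEQ_STATE_MASK + 1 == 10 counts keeps
	 * this case "not yet done", with a couple of counts of slop.
	 */
	return 0;
}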