Message ID | 20220809034517.3867176-5-joel@joelfernandes.org (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Implement call_rcu_lazy() and miscellaneous fixes | expand |
[Sorry, adding back the CC list] On Mon, Aug 8, 2022 at 11:45 PM Joel Fernandes (Google) <joel@joelfernandes.org> wrote: > > This is required to prevent callbacks triggering RCU machinery too > quickly and too often, which adds more power to the system. > > When testing, we found that these paths were invoked often when the > system is not doing anything (screen is ON but otherwise idle). Unfortunately, I am seeing a slow down in ChromeOS boot performance after applying this particular patch. It is the first time I could test ChromeOS boot times with the series since it was hard to find a ChromeOS device that runs the upstream kernel. Anyway, Vlad, Neeraj, do you guys also see slower boot times with this patch? I wonder if the issue is with wake up interaction with the nocb GP threads. We ought to disable lazy RCU during boot since it would have little benefit anyway. But I am also concerned about some deeper problem I did not catch before. I'll look into tracing the fs paths to see if I can narrow down what's causing it. Will also try a newer kernel, I am currently testing on 5.19-rc4. Thanks, - Joel
On Thu, Aug 18, 2022 at 1:23 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > [Sorry, adding back the CC list] > > On Mon, Aug 8, 2022 at 11:45 PM Joel Fernandes (Google) > <joel@joelfernandes.org> wrote: > > > > This is required to prevent callbacks triggering RCU machinery too > > quickly and too often, which adds more power to the system. > > > > When testing, we found that these paths were invoked often when the > > system is not doing anything (screen is ON but otherwise idle). > > Unfortunately, I am seeing a slow down in ChromeOS boot performance > after applying this particular patch. It is the first time I could > test ChromeOS boot times with the series since it was hard to find a > ChromeOS device that runs the upstream kernel. > > Anyway, Vlad, Neeraj, do you guys also see slower boot times with this > patch? I wonder if the issue is with wake up interaction with the nocb > GP threads. > > We ought to disable lazy RCU during boot since it would have little > benefit anyway. But I am also concerned about some deeper problem I > did not catch before. > > I'll look into tracing the fs paths to see if I can narrow down what's > causing it. Will also try a newer kernel, I am currently testing on > 5.19-rc4. I got somewhere with this. It looks like queuing CBs as lazy CBs instead of normal CBs, are triggering expedited stalls during the boot process: 39.949198] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { } 28 jiffies s: 69 root: 0x0/. No idea how/why lazy RCU CBs would be related to expedited GP issues, but maybe something hangs and causes that side-effect. initcall_debug did not help, as it seems initcalls all work fine, and then 8 seconds after the boot, it starts slowing down a lot, followed by the RCU stall messages. As a next step I'll enable ftrace during the boot to see if I can get more insight. But I believe, its not the FS layer, the FS layer just triggers lazy CBs, but there is something wrong with the core lazy-RCU work itself. This kernel is 5.19-rc4. I'll also try to rebase ChromeOS on more recent kernels and debug. Thanks, - Joel
On Thu, Aug 18, 2022 at 7:05 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Thu, Aug 18, 2022 at 1:23 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > [Sorry, adding back the CC list] > > > > On Mon, Aug 8, 2022 at 11:45 PM Joel Fernandes (Google) > > <joel@joelfernandes.org> wrote: > > > > > > This is required to prevent callbacks triggering RCU machinery too > > > quickly and too often, which adds more power to the system. > > > > > > When testing, we found that these paths were invoked often when the > > > system is not doing anything (screen is ON but otherwise idle). > > > > Unfortunately, I am seeing a slow down in ChromeOS boot performance > > after applying this particular patch. It is the first time I could > > test ChromeOS boot times with the series since it was hard to find a > > ChromeOS device that runs the upstream kernel. > > > > Anyway, Vlad, Neeraj, do you guys also see slower boot times with this > > patch? I wonder if the issue is with wake up interaction with the nocb > > GP threads. > > > > We ought to disable lazy RCU during boot since it would have little > > benefit anyway. But I am also concerned about some deeper problem I > > did not catch before. > > > > I'll look into tracing the fs paths to see if I can narrow down what's > > causing it. Will also try a newer kernel, I am currently testing on > > 5.19-rc4. > > I got somewhere with this. It looks like queuing CBs as lazy CBs > instead of normal CBs, are triggering expedited stalls during the boot > process: > > 39.949198] rcu: INFO: rcu_preempt detected expedited stalls on > CPUs/tasks: { } 28 jiffies s: 69 root: 0x0/. > > No idea how/why lazy RCU CBs would be related to expedited GP issues, > but maybe something hangs and causes that side-effect. > > initcall_debug did not help, as it seems initcalls all work fine, and > then 8 seconds after the boot, it starts slowing down a lot, followed > by the RCU stall messages. As a next step I'll enable ftrace during > the boot to see if I can get more insight. But I believe, its not the > FS layer, the FS layer just triggers lazy CBs, but there is something > wrong with the core lazy-RCU work itself. > > This kernel is 5.19-rc4. I'll also try to rebase ChromeOS on more > recent kernels and debug. More digging, thanks to trace_event= boot option , I find that the boot process does have some synchronous waits, and though these are "non-lazy", for some reason the lazy CBs that were previously queued are making them wait for the *full* lazy duration. Which points to a likely bug in the lazy RCU logic. These synchronous CBs should never be waiting like the lazy ones: [ 17.715904] => trace_dump_stack [ 17.715904] => __wait_rcu_gp [ 17.715904] => synchronize_rcu [ 17.715904] => selinux_netcache_avc_callback [ 17.715904] => avc_ss_reset [ 17.715904] => sel_write_enforce [ 17.715904] => vfs_write [ 17.715904] => ksys_write [ 17.715904] => do_syscall_64 [ 17.715904] => entry_SYSCALL_64_after_hwframe I'm tired so I'll resume the debug later.
On Thu, Aug 18, 2022 at 9:21 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Thu, Aug 18, 2022 at 7:05 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > On Thu, Aug 18, 2022 at 1:23 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > [Sorry, adding back the CC list] > > > > > > On Mon, Aug 8, 2022 at 11:45 PM Joel Fernandes (Google) > > > <joel@joelfernandes.org> wrote: > > > > > > > > This is required to prevent callbacks triggering RCU machinery too > > > > quickly and too often, which adds more power to the system. > > > > > > > > When testing, we found that these paths were invoked often when the > > > > system is not doing anything (screen is ON but otherwise idle). > > > > > > Unfortunately, I am seeing a slow down in ChromeOS boot performance > > > after applying this particular patch. It is the first time I could > > > test ChromeOS boot times with the series since it was hard to find a > > > ChromeOS device that runs the upstream kernel. > > > > > > Anyway, Vlad, Neeraj, do you guys also see slower boot times with this > > > patch? I wonder if the issue is with wake up interaction with the nocb > > > GP threads. > > > > > > We ought to disable lazy RCU during boot since it would have little > > > benefit anyway. But I am also concerned about some deeper problem I > > > did not catch before. > > > > > > I'll look into tracing the fs paths to see if I can narrow down what's > > > causing it. Will also try a newer kernel, I am currently testing on > > > 5.19-rc4. > > > > I got somewhere with this. It looks like queuing CBs as lazy CBs > > instead of normal CBs, are triggering expedited stalls during the boot > > process: > > > > 39.949198] rcu: INFO: rcu_preempt detected expedited stalls on > > CPUs/tasks: { } 28 jiffies s: 69 root: 0x0/. > > > > No idea how/why lazy RCU CBs would be related to expedited GP issues, > > but maybe something hangs and causes that side-effect. > > > > initcall_debug did not help, as it seems initcalls all work fine, and > > then 8 seconds after the boot, it starts slowing down a lot, followed > > by the RCU stall messages. As a next step I'll enable ftrace during > > the boot to see if I can get more insight. But I believe, its not the > > FS layer, the FS layer just triggers lazy CBs, but there is something > > wrong with the core lazy-RCU work itself. > > > > This kernel is 5.19-rc4. I'll also try to rebase ChromeOS on more > > recent kernels and debug. > > More digging, thanks to trace_event= boot option , I find that the > boot process does have some synchronous waits, and though these are > "non-lazy", for some reason the lazy CBs that were previously queued > are making them wait for the *full* lazy duration. Which points to a > likely bug in the lazy RCU logic. These synchronous CBs should never > be waiting like the lazy ones: > [ 17.715904] init-1 2..... 8917039us : <stack trace> > [ 17.715904] => trace_dump_stack > [ 17.715904] => __wait_rcu_gp > [ 17.715904] => synchronize_rcu > [ 17.715904] => selinux_netcache_avc_callback > [ 17.715904] => avc_ss_reset > [ 17.715904] => sel_write_enforce > [ 17.715904] => vfs_write > [ 17.715904] => ksys_write > [ 17.715904] => do_syscall_64 > [ 17.715904] => entry_SYSCALL_64_after_hwframe > > I'm tired so I'll resume the debug later. Rushikesh, btw do note that using jiffies_till_first_fqs would also likely slow these synchronous waits down, and thus slow down the boot process for ChromeOS. This could be what you're seeing with the jiffies option. No idea why you were not seeing this in earlier experiments, maybe there was a change in ChromeOS with how selinux enforce option was being written out, or something. This write syscall is being done by the init process. I'll add this to the slides as well as another call_rcu_lazy() selling point (vs using jiffies) :) - Joel
On Thu, Aug 18, 2022 at 09:21:56PM -0400, Joel Fernandes wrote: > On Thu, Aug 18, 2022 at 7:05 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > On Thu, Aug 18, 2022 at 1:23 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > [Sorry, adding back the CC list] > > > > > > On Mon, Aug 8, 2022 at 11:45 PM Joel Fernandes (Google) > > > <joel@joelfernandes.org> wrote: > > > > > > > > This is required to prevent callbacks triggering RCU machinery too > > > > quickly and too often, which adds more power to the system. > > > > > > > > When testing, we found that these paths were invoked often when the > > > > system is not doing anything (screen is ON but otherwise idle). > > > > > > Unfortunately, I am seeing a slow down in ChromeOS boot performance > > > after applying this particular patch. It is the first time I could > > > test ChromeOS boot times with the series since it was hard to find a > > > ChromeOS device that runs the upstream kernel. > > > > > > Anyway, Vlad, Neeraj, do you guys also see slower boot times with this > > > patch? I wonder if the issue is with wake up interaction with the nocb > > > GP threads. > > > > > > We ought to disable lazy RCU during boot since it would have little > > > benefit anyway. But I am also concerned about some deeper problem I > > > did not catch before. > > > > > > I'll look into tracing the fs paths to see if I can narrow down what's > > > causing it. Will also try a newer kernel, I am currently testing on > > > 5.19-rc4. > > > > I got somewhere with this. It looks like queuing CBs as lazy CBs > > instead of normal CBs, are triggering expedited stalls during the boot > > process: > > > > 39.949198] rcu: INFO: rcu_preempt detected expedited stalls on > > CPUs/tasks: { } 28 jiffies s: 69 root: 0x0/. > > > > No idea how/why lazy RCU CBs would be related to expedited GP issues, > > but maybe something hangs and causes that side-effect. > > > > initcall_debug did not help, as it seems initcalls all work fine, and > > then 8 seconds after the boot, it starts slowing down a lot, followed > > by the RCU stall messages. As a next step I'll enable ftrace during > > the boot to see if I can get more insight. But I believe, its not the > > FS layer, the FS layer just triggers lazy CBs, but there is something > > wrong with the core lazy-RCU work itself. > > > > This kernel is 5.19-rc4. I'll also try to rebase ChromeOS on more > > recent kernels and debug. > > More digging, thanks to trace_event= boot option , I find that the > boot process does have some synchronous waits, and though these are > "non-lazy", for some reason the lazy CBs that were previously queued > are making them wait for the *full* lazy duration. Which points to a > likely bug in the lazy RCU logic. These synchronous CBs should never > be waiting like the lazy ones: > > [ 17.715904] => trace_dump_stack > [ 17.715904] => __wait_rcu_gp > [ 17.715904] => synchronize_rcu > [ 17.715904] => selinux_netcache_avc_callback > [ 17.715904] => avc_ss_reset > [ 17.715904] => sel_write_enforce > [ 17.715904] => vfs_write > [ 17.715904] => ksys_write > [ 17.715904] => do_syscall_64 > [ 17.715904] => entry_SYSCALL_64_after_hwframe > > I'm tired so I'll resume the debug later. At times like this, I often pull the suspect code into userspace and run it through its paces. In this case, a bunch of call_rcu_lazy() invocations into an empty bypass list, followed by a call_rcu() invocation, then a check to make sure that the bypass list is no longer lazy. Thanx, Paul
On 8/18/2022 10:35 PM, Paul E. McKenney wrote: > On Thu, Aug 18, 2022 at 09:21:56PM -0400, Joel Fernandes wrote: >> On Thu, Aug 18, 2022 at 7:05 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>> >>> On Thu, Aug 18, 2022 at 1:23 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>>> >>>> [Sorry, adding back the CC list] >>>> >>>> On Mon, Aug 8, 2022 at 11:45 PM Joel Fernandes (Google) >>>> <joel@joelfernandes.org> wrote: >>>>> >>>>> This is required to prevent callbacks triggering RCU machinery too >>>>> quickly and too often, which adds more power to the system. >>>>> >>>>> When testing, we found that these paths were invoked often when the >>>>> system is not doing anything (screen is ON but otherwise idle). >>>> >>>> Unfortunately, I am seeing a slow down in ChromeOS boot performance >>>> after applying this particular patch. It is the first time I could >>>> test ChromeOS boot times with the series since it was hard to find a >>>> ChromeOS device that runs the upstream kernel. >>>> >>>> Anyway, Vlad, Neeraj, do you guys also see slower boot times with this >>>> patch? I wonder if the issue is with wake up interaction with the nocb >>>> GP threads. >>>> >>>> We ought to disable lazy RCU during boot since it would have little >>>> benefit anyway. But I am also concerned about some deeper problem I >>>> did not catch before. >>>> >>>> I'll look into tracing the fs paths to see if I can narrow down what's >>>> causing it. Will also try a newer kernel, I am currently testing on >>>> 5.19-rc4. >>> >>> I got somewhere with this. It looks like queuing CBs as lazy CBs >>> instead of normal CBs, are triggering expedited stalls during the boot >>> process: >>> >>> 39.949198] rcu: INFO: rcu_preempt detected expedited stalls on >>> CPUs/tasks: { } 28 jiffies s: 69 root: 0x0/. >>> >>> No idea how/why lazy RCU CBs would be related to expedited GP issues, >>> but maybe something hangs and causes that side-effect. >>> >>> initcall_debug did not help, as it seems initcalls all work fine, and >>> then 8 seconds after the boot, it starts slowing down a lot, followed >>> by the RCU stall messages. As a next step I'll enable ftrace during >>> the boot to see if I can get more insight. But I believe, its not the >>> FS layer, the FS layer just triggers lazy CBs, but there is something >>> wrong with the core lazy-RCU work itself. >>> >>> This kernel is 5.19-rc4. I'll also try to rebase ChromeOS on more >>> recent kernels and debug. >> >> More digging, thanks to trace_event= boot option , I find that the >> boot process does have some synchronous waits, and though these are >> "non-lazy", for some reason the lazy CBs that were previously queued >> are making them wait for the *full* lazy duration. Which points to a >> likely bug in the lazy RCU logic. These synchronous CBs should never >> be waiting like the lazy ones: >> >> [ 17.715904] => trace_dump_stack >> [ 17.715904] => __wait_rcu_gp >> [ 17.715904] => synchronize_rcu >> [ 17.715904] => selinux_netcache_avc_callback >> [ 17.715904] => avc_ss_reset >> [ 17.715904] => sel_write_enforce >> [ 17.715904] => vfs_write >> [ 17.715904] => ksys_write >> [ 17.715904] => do_syscall_64 >> [ 17.715904] => entry_SYSCALL_64_after_hwframe >> >> I'm tired so I'll resume the debug later. > > At times like this, I often pull the suspect code into userspace and > run it through its paces. In this case, a bunch of call_rcu_lazy() > invocations into an empty bypass list, followed by a call_rcu() > invocation, then a check to make sure that the bypass list is no longer > lazy. Thanks a lot for this great debug idea, I will look into it. Thanks, - Joel
On 8/18/2022 10:45 PM, Joel Fernandes wrote: > > > On 8/18/2022 10:35 PM, Paul E. McKenney wrote: >> On Thu, Aug 18, 2022 at 09:21:56PM -0400, Joel Fernandes wrote: >>> On Thu, Aug 18, 2022 at 7:05 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>>> >>>> On Thu, Aug 18, 2022 at 1:23 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>>>> >>>>> [Sorry, adding back the CC list] >>>>> >>>>> On Mon, Aug 8, 2022 at 11:45 PM Joel Fernandes (Google) >>>>> <joel@joelfernandes.org> wrote: >>>>>> >>>>>> This is required to prevent callbacks triggering RCU machinery too >>>>>> quickly and too often, which adds more power to the system. >>>>>> >>>>>> When testing, we found that these paths were invoked often when the >>>>>> system is not doing anything (screen is ON but otherwise idle). >>>>> >>>>> Unfortunately, I am seeing a slow down in ChromeOS boot performance >>>>> after applying this particular patch. It is the first time I could >>>>> test ChromeOS boot times with the series since it was hard to find a >>>>> ChromeOS device that runs the upstream kernel. >>>>> >>>>> Anyway, Vlad, Neeraj, do you guys also see slower boot times with this >>>>> patch? I wonder if the issue is with wake up interaction with the nocb >>>>> GP threads. >>>>> >>>>> We ought to disable lazy RCU during boot since it would have little >>>>> benefit anyway. But I am also concerned about some deeper problem I >>>>> did not catch before. >>>>> >>>>> I'll look into tracing the fs paths to see if I can narrow down what's >>>>> causing it. Will also try a newer kernel, I am currently testing on >>>>> 5.19-rc4. >>>> >>>> I got somewhere with this. It looks like queuing CBs as lazy CBs >>>> instead of normal CBs, are triggering expedited stalls during the boot >>>> process: >>>> >>>> 39.949198] rcu: INFO: rcu_preempt detected expedited stalls on >>>> CPUs/tasks: { } 28 jiffies s: 69 root: 0x0/. >>>> >>>> No idea how/why lazy RCU CBs would be related to expedited GP issues, >>>> but maybe something hangs and causes that side-effect. >>>> >>>> initcall_debug did not help, as it seems initcalls all work fine, and >>>> then 8 seconds after the boot, it starts slowing down a lot, followed >>>> by the RCU stall messages. As a next step I'll enable ftrace during >>>> the boot to see if I can get more insight. But I believe, its not the >>>> FS layer, the FS layer just triggers lazy CBs, but there is something >>>> wrong with the core lazy-RCU work itself. >>>> >>>> This kernel is 5.19-rc4. I'll also try to rebase ChromeOS on more >>>> recent kernels and debug. >>> >>> More digging, thanks to trace_event= boot option , I find that the >>> boot process does have some synchronous waits, and though these are >>> "non-lazy", for some reason the lazy CBs that were previously queued >>> are making them wait for the *full* lazy duration. Which points to a >>> likely bug in the lazy RCU logic. These synchronous CBs should never >>> be waiting like the lazy ones: >>> >>> [ 17.715904] => trace_dump_stack >>> [ 17.715904] => __wait_rcu_gp >>> [ 17.715904] => synchronize_rcu >>> [ 17.715904] => selinux_netcache_avc_callback >>> [ 17.715904] => avc_ss_reset >>> [ 17.715904] => sel_write_enforce >>> [ 17.715904] => vfs_write >>> [ 17.715904] => ksys_write >>> [ 17.715904] => do_syscall_64 >>> [ 17.715904] => entry_SYSCALL_64_after_hwframe >>> >>> I'm tired so I'll resume the debug later. >> >> At times like this, I often pull the suspect code into userspace and >> run it through its paces. In this case, a bunch of call_rcu_lazy() >> invocations into an empty bypass list, followed by a call_rcu() >> invocation, then a check to make sure that the bypass list is no longer >> lazy. > > Thanks a lot for this great debug idea, I will look into it. It seems to be a subtle issue when a large number of callbacks are queued trigging the lock-contention code, which happens at boot. It appears the non-lazy ones and lazy ones collide, so you have the lazy timer which wins, and then the regular bypass lock-contention timer is not allowed to do its thing. Due to this, the rcuog thread wakes up much later than a jiffie. Things are much better with the following change. However, this brings me to a question about lock-contention based or any deferring and boot time. If you have a path like selinux doing a synchronize_rcu(), shouldn't we skip the jiffie waiting for the bypass timer? Otherwise things synchronously waiting will slow down more than usual. Maybe bypassing should not be done for any case until boot up is done. I'm curious to see if that improves boot time. @@ -580,7 +585,11 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, len = rcu_segcblist_n_cbs(&rdp->cblist); bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); lazy_len = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass); - if (was_alldone) { + + // If we are in lazy-mode, we still need to do a wake up even if + // all CBs were previously done. Otherwise the GP thread will + // wait for the full lazy duration. + if (was_alldone || (READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_LAZY)) { rdp->qlen_last_fqs_check = len; // Only lazy CBs in bypass list if (lazy_len && bypass_len == lazy_len) { --- Thanks, - Joel
On Fri, Aug 19, 2022 at 12:30:49PM -0400, Joel Fernandes wrote: > On 8/18/2022 10:45 PM, Joel Fernandes wrote: > > On 8/18/2022 10:35 PM, Paul E. McKenney wrote: > >> On Thu, Aug 18, 2022 at 09:21:56PM -0400, Joel Fernandes wrote: > >>> On Thu, Aug 18, 2022 at 7:05 PM Joel Fernandes <joel@joelfernandes.org> wrote: > >>>> > >>>> On Thu, Aug 18, 2022 at 1:23 PM Joel Fernandes <joel@joelfernandes.org> wrote: > >>>>> > >>>>> [Sorry, adding back the CC list] > >>>>> > >>>>> On Mon, Aug 8, 2022 at 11:45 PM Joel Fernandes (Google) > >>>>> <joel@joelfernandes.org> wrote: > >>>>>> > >>>>>> This is required to prevent callbacks triggering RCU machinery too > >>>>>> quickly and too often, which adds more power to the system. > >>>>>> > >>>>>> When testing, we found that these paths were invoked often when the > >>>>>> system is not doing anything (screen is ON but otherwise idle). > >>>>> > >>>>> Unfortunately, I am seeing a slow down in ChromeOS boot performance > >>>>> after applying this particular patch. It is the first time I could > >>>>> test ChromeOS boot times with the series since it was hard to find a > >>>>> ChromeOS device that runs the upstream kernel. > >>>>> > >>>>> Anyway, Vlad, Neeraj, do you guys also see slower boot times with this > >>>>> patch? I wonder if the issue is with wake up interaction with the nocb > >>>>> GP threads. > >>>>> > >>>>> We ought to disable lazy RCU during boot since it would have little > >>>>> benefit anyway. But I am also concerned about some deeper problem I > >>>>> did not catch before. > >>>>> > >>>>> I'll look into tracing the fs paths to see if I can narrow down what's > >>>>> causing it. Will also try a newer kernel, I am currently testing on > >>>>> 5.19-rc4. > >>>> > >>>> I got somewhere with this. It looks like queuing CBs as lazy CBs > >>>> instead of normal CBs, are triggering expedited stalls during the boot > >>>> process: > >>>> > >>>> 39.949198] rcu: INFO: rcu_preempt detected expedited stalls on > >>>> CPUs/tasks: { } 28 jiffies s: 69 root: 0x0/. > >>>> > >>>> No idea how/why lazy RCU CBs would be related to expedited GP issues, > >>>> but maybe something hangs and causes that side-effect. > >>>> > >>>> initcall_debug did not help, as it seems initcalls all work fine, and > >>>> then 8 seconds after the boot, it starts slowing down a lot, followed > >>>> by the RCU stall messages. As a next step I'll enable ftrace during > >>>> the boot to see if I can get more insight. But I believe, its not the > >>>> FS layer, the FS layer just triggers lazy CBs, but there is something > >>>> wrong with the core lazy-RCU work itself. > >>>> > >>>> This kernel is 5.19-rc4. I'll also try to rebase ChromeOS on more > >>>> recent kernels and debug. > >>> > >>> More digging, thanks to trace_event= boot option , I find that the > >>> boot process does have some synchronous waits, and though these are > >>> "non-lazy", for some reason the lazy CBs that were previously queued > >>> are making them wait for the *full* lazy duration. Which points to a > >>> likely bug in the lazy RCU logic. These synchronous CBs should never > >>> be waiting like the lazy ones: > >>> > >>> [ 17.715904] => trace_dump_stack > >>> [ 17.715904] => __wait_rcu_gp > >>> [ 17.715904] => synchronize_rcu > >>> [ 17.715904] => selinux_netcache_avc_callback > >>> [ 17.715904] => avc_ss_reset > >>> [ 17.715904] => sel_write_enforce > >>> [ 17.715904] => vfs_write > >>> [ 17.715904] => ksys_write > >>> [ 17.715904] => do_syscall_64 > >>> [ 17.715904] => entry_SYSCALL_64_after_hwframe > >>> > >>> I'm tired so I'll resume the debug later. > >> > >> At times like this, I often pull the suspect code into userspace and > >> run it through its paces. In this case, a bunch of call_rcu_lazy() > >> invocations into an empty bypass list, followed by a call_rcu() > >> invocation, then a check to make sure that the bypass list is no longer > >> lazy. > > > > Thanks a lot for this great debug idea, I will look into it. > > It seems to be a subtle issue when a large number of callbacks are > queued trigging the lock-contention code, which happens at boot. It > appears the non-lazy ones and lazy ones collide, so you have the lazy > timer which wins, and then the regular bypass lock-contention timer is > not allowed to do its thing. Due to this, the rcuog thread wakes up much > later than a jiffie. Good show, and glad you found it! > Things are much better with the following change. However, this brings > me to a question about lock-contention based or any deferring and boot time. > > If you have a path like selinux doing a synchronize_rcu(), shouldn't we > skip the jiffie waiting for the bypass timer? Otherwise things > synchronously waiting will slow down more than usual. Maybe bypassing > should not be done for any case until boot up is done. I'm curious to > see if that improves boot time. Why not simply disable laziness at boot time and enable it only after booting is complete? The exiting rcupdate.rcu_normal_after_boot kernel boot parameter uses a similar scheme. > @@ -580,7 +585,11 @@ static void __call_rcu_nocb_wake(struct rcu_data > *rdp, bool was_alldone, > len = rcu_segcblist_n_cbs(&rdp->cblist); > bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); > lazy_len = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass); > - if (was_alldone) { > + > + // If we are in lazy-mode, we still need to do a wake up even if > + // all CBs were previously done. Otherwise the GP thread will > + // wait for the full lazy duration. > + if (was_alldone || (READ_ONCE(rdp->nocb_defer_wakeup) == > RCU_NOCB_WAKE_LAZY)) { > rdp->qlen_last_fqs_check = len; > // Only lazy CBs in bypass list > if (lazy_len && bypass_len == lazy_len) { And this change looks plausible, though as always, the system's opinion carries much more weight than does mine. Thanx, Paul
On 8/19/2022 1:12 PM, Paul E. McKenney wrote: > On Fri, Aug 19, 2022 at 12:30:49PM -0400, Joel Fernandes wrote: >> On 8/18/2022 10:45 PM, Joel Fernandes wrote: >>> On 8/18/2022 10:35 PM, Paul E. McKenney wrote: >>>> On Thu, Aug 18, 2022 at 09:21:56PM -0400, Joel Fernandes wrote: >>>>> On Thu, Aug 18, 2022 at 7:05 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>>>>> >>>>>> On Thu, Aug 18, 2022 at 1:23 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>>>>>> >>>>>>> [Sorry, adding back the CC list] >>>>>>> >>>>>>> On Mon, Aug 8, 2022 at 11:45 PM Joel Fernandes (Google) >>>>>>> <joel@joelfernandes.org> wrote: >>>>>>>> >>>>>>>> This is required to prevent callbacks triggering RCU machinery too >>>>>>>> quickly and too often, which adds more power to the system. >>>>>>>> >>>>>>>> When testing, we found that these paths were invoked often when the >>>>>>>> system is not doing anything (screen is ON but otherwise idle). >>>>>>> >>>>>>> Unfortunately, I am seeing a slow down in ChromeOS boot performance >>>>>>> after applying this particular patch. It is the first time I could >>>>>>> test ChromeOS boot times with the series since it was hard to find a >>>>>>> ChromeOS device that runs the upstream kernel. >>>>>>> >>>>>>> Anyway, Vlad, Neeraj, do you guys also see slower boot times with this >>>>>>> patch? I wonder if the issue is with wake up interaction with the nocb >>>>>>> GP threads. >>>>>>> >>>>>>> We ought to disable lazy RCU during boot since it would have little >>>>>>> benefit anyway. But I am also concerned about some deeper problem I >>>>>>> did not catch before. >>>>>>> >>>>>>> I'll look into tracing the fs paths to see if I can narrow down what's >>>>>>> causing it. Will also try a newer kernel, I am currently testing on >>>>>>> 5.19-rc4. >>>>>> >>>>>> I got somewhere with this. It looks like queuing CBs as lazy CBs >>>>>> instead of normal CBs, are triggering expedited stalls during the boot >>>>>> process: >>>>>> >>>>>> 39.949198] rcu: INFO: rcu_preempt detected expedited stalls on >>>>>> CPUs/tasks: { } 28 jiffies s: 69 root: 0x0/. >>>>>> >>>>>> No idea how/why lazy RCU CBs would be related to expedited GP issues, >>>>>> but maybe something hangs and causes that side-effect. >>>>>> >>>>>> initcall_debug did not help, as it seems initcalls all work fine, and >>>>>> then 8 seconds after the boot, it starts slowing down a lot, followed >>>>>> by the RCU stall messages. As a next step I'll enable ftrace during >>>>>> the boot to see if I can get more insight. But I believe, its not the >>>>>> FS layer, the FS layer just triggers lazy CBs, but there is something >>>>>> wrong with the core lazy-RCU work itself. >>>>>> >>>>>> This kernel is 5.19-rc4. I'll also try to rebase ChromeOS on more >>>>>> recent kernels and debug. >>>>> >>>>> More digging, thanks to trace_event= boot option , I find that the >>>>> boot process does have some synchronous waits, and though these are >>>>> "non-lazy", for some reason the lazy CBs that were previously queued >>>>> are making them wait for the *full* lazy duration. Which points to a >>>>> likely bug in the lazy RCU logic. These synchronous CBs should never >>>>> be waiting like the lazy ones: >>>>> >>>>> [ 17.715904] => trace_dump_stack >>>>> [ 17.715904] => __wait_rcu_gp >>>>> [ 17.715904] => synchronize_rcu >>>>> [ 17.715904] => selinux_netcache_avc_callback >>>>> [ 17.715904] => avc_ss_reset >>>>> [ 17.715904] => sel_write_enforce >>>>> [ 17.715904] => vfs_write >>>>> [ 17.715904] => ksys_write >>>>> [ 17.715904] => do_syscall_64 >>>>> [ 17.715904] => entry_SYSCALL_64_after_hwframe >>>>> >>>>> I'm tired so I'll resume the debug later. >>>> >>>> At times like this, I often pull the suspect code into userspace and >>>> run it through its paces. In this case, a bunch of call_rcu_lazy() >>>> invocations into an empty bypass list, followed by a call_rcu() >>>> invocation, then a check to make sure that the bypass list is no longer >>>> lazy. >>> >>> Thanks a lot for this great debug idea, I will look into it. >> >> It seems to be a subtle issue when a large number of callbacks are >> queued trigging the lock-contention code, which happens at boot. It >> appears the non-lazy ones and lazy ones collide, so you have the lazy >> timer which wins, and then the regular bypass lock-contention timer is >> not allowed to do its thing. Due to this, the rcuog thread wakes up much >> later than a jiffie. > > Good show, and glad you found it! Thanks! >> Things are much better with the following change. However, this brings >> me to a question about lock-contention based or any deferring and boot time. >> >> If you have a path like selinux doing a synchronize_rcu(), shouldn't we >> skip the jiffie waiting for the bypass timer? Otherwise things >> synchronously waiting will slow down more than usual. Maybe bypassing >> should not be done for any case until boot up is done. I'm curious to >> see if that improves boot time. > > Why not simply disable laziness at boot time and enable it only after > booting is complete? The exiting rcupdate.rcu_normal_after_boot kernel > boot parameter uses a similar scheme. That sounds like the right thing to good, but unfortunately it wont help this problem. The boot time issue happens after init has started. So the OS is still "booting" even though the kernel has. Also the problem can happen after boot as well, like if RCU lazy/non-lazy callbacks come back to back quickly, or so. But yes nonetheless, I can see the value of disabling it till the in-kernel boot completets. >> @@ -580,7 +585,11 @@ static void __call_rcu_nocb_wake(struct rcu_data >> *rdp, bool was_alldone, >> len = rcu_segcblist_n_cbs(&rdp->cblist); >> bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); >> lazy_len = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass); >> - if (was_alldone) { >> + >> + // If we are in lazy-mode, we still need to do a wake up even if >> + // all CBs were previously done. Otherwise the GP thread will >> + // wait for the full lazy duration. >> + if (was_alldone || (READ_ONCE(rdp->nocb_defer_wakeup) == >> RCU_NOCB_WAKE_LAZY)) { >> rdp->qlen_last_fqs_check = len; >> // Only lazy CBs in bypass list >> if (lazy_len && bypass_len == lazy_len) { > > And this change looks plausible, though as always, the system's opinion > carries much more weight than does mine. Sounds good, thanks, I am testing it more. Will update it for v4. Thanks, - Joel
On Fri, Aug 19, 2022 at 2:14 PM Joel Fernandes <joel@joelfernandes.org> wrote: [..] > >> Things are much better with the following change. However, this brings > >> me to a question about lock-contention based or any deferring and boot time. > >> > >> If you have a path like selinux doing a synchronize_rcu(), shouldn't we > >> skip the jiffie waiting for the bypass timer? Otherwise things > >> synchronously waiting will slow down more than usual. Maybe bypassing > >> should not be done for any case until boot up is done. I'm curious to > >> see if that improves boot time. > > > > Why not simply disable laziness at boot time and enable it only after > > booting is complete? The exiting rcupdate.rcu_normal_after_boot kernel > > boot parameter uses a similar scheme. > > That sounds like the right thing to good, but unfortunately it wont help > this problem. The boot time issue happens after init has started. So the > OS is still "booting" even though the kernel has. > > Also the problem can happen after boot as well, like if RCU > lazy/non-lazy callbacks come back to back quickly, or so. > > But yes nonetheless, I can see the value of disabling it till the > in-kernel boot completets. My mail client is acting weird. I meant to add to this, I wonder if there is a way other subsystems detect when userspace boots using some heuristic? Thanks, - Joel
On Fri, Aug 19, 2022 at 02:17:32PM -0400, Joel Fernandes wrote: > On Fri, Aug 19, 2022 at 2:14 PM Joel Fernandes <joel@joelfernandes.org> wrote: > [..] > > >> Things are much better with the following change. However, this brings > > >> me to a question about lock-contention based or any deferring and boot time. > > >> > > >> If you have a path like selinux doing a synchronize_rcu(), shouldn't we > > >> skip the jiffie waiting for the bypass timer? Otherwise things > > >> synchronously waiting will slow down more than usual. Maybe bypassing > > >> should not be done for any case until boot up is done. I'm curious to > > >> see if that improves boot time. > > > > > > Why not simply disable laziness at boot time and enable it only after > > > booting is complete? The exiting rcupdate.rcu_normal_after_boot kernel > > > boot parameter uses a similar scheme. > > > > That sounds like the right thing to good, but unfortunately it wont help > > this problem. The boot time issue happens after init has started. So the > > OS is still "booting" even though the kernel has. > > > > Also the problem can happen after boot as well, like if RCU > > lazy/non-lazy callbacks come back to back quickly, or so. > > > > But yes nonetheless, I can see the value of disabling it till the > > in-kernel boot completets. > > My mail client is acting weird. I meant to add to this, I wonder if > there is a way other subsystems detect when userspace boots using some > heuristic? I don't know of one, but I bet that ChromeOS has ways. If nothing else, might you add a sysfs write to one of the boot-up phases? Thanx, Paul
On Fri, Aug 19, 2022 at 2:26 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Fri, Aug 19, 2022 at 02:17:32PM -0400, Joel Fernandes wrote: > > On Fri, Aug 19, 2022 at 2:14 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > [..] > > > >> Things are much better with the following change. However, this brings > > > >> me to a question about lock-contention based or any deferring and boot time. > > > >> > > > >> If you have a path like selinux doing a synchronize_rcu(), shouldn't we > > > >> skip the jiffie waiting for the bypass timer? Otherwise things > > > >> synchronously waiting will slow down more than usual. Maybe bypassing > > > >> should not be done for any case until boot up is done. I'm curious to > > > >> see if that improves boot time. > > > > > > > > Why not simply disable laziness at boot time and enable it only after > > > > booting is complete? The exiting rcupdate.rcu_normal_after_boot kernel > > > > boot parameter uses a similar scheme. > > > > > > That sounds like the right thing to good, but unfortunately it wont help > > > this problem. The boot time issue happens after init has started. So the > > > OS is still "booting" even though the kernel has. > > > > > > Also the problem can happen after boot as well, like if RCU > > > lazy/non-lazy callbacks come back to back quickly, or so. > > > > > > But yes nonetheless, I can see the value of disabling it till the > > > in-kernel boot completets. > > > > My mail client is acting weird. I meant to add to this, I wonder if > > there is a way other subsystems detect when userspace boots using some > > heuristic? > > I don't know of one, but I bet that ChromeOS has ways. If nothing else, > might you add a sysfs write to one of the boot-up phases? Yes, that's possible :) Thanks, I will consider this idea. Thanks, - Joel
On 8/19/2022 2:14 PM, Joel Fernandes wrote: > > > On 8/19/2022 1:12 PM, Paul E. McKenney wrote: >> On Fri, Aug 19, 2022 at 12:30:49PM -0400, Joel Fernandes wrote: >>> On 8/18/2022 10:45 PM, Joel Fernandes wrote: >>>> On 8/18/2022 10:35 PM, Paul E. McKenney wrote: >>>>> On Thu, Aug 18, 2022 at 09:21:56PM -0400, Joel Fernandes wrote: >>>>>> On Thu, Aug 18, 2022 at 7:05 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>>>>>> >>>>>>> On Thu, Aug 18, 2022 at 1:23 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>>>>>>> >>>>>>>> [Sorry, adding back the CC list] >>>>>>>> >>>>>>>> On Mon, Aug 8, 2022 at 11:45 PM Joel Fernandes (Google) >>>>>>>> <joel@joelfernandes.org> wrote: >>>>>>>>> >>>>>>>>> This is required to prevent callbacks triggering RCU machinery too >>>>>>>>> quickly and too often, which adds more power to the system. >>>>>>>>> >>>>>>>>> When testing, we found that these paths were invoked often when the >>>>>>>>> system is not doing anything (screen is ON but otherwise idle). >>>>>>>> >>>>>>>> Unfortunately, I am seeing a slow down in ChromeOS boot performance >>>>>>>> after applying this particular patch. It is the first time I could >>>>>>>> test ChromeOS boot times with the series since it was hard to find a >>>>>>>> ChromeOS device that runs the upstream kernel. >>>>>>>> >>>>>>>> Anyway, Vlad, Neeraj, do you guys also see slower boot times with this >>>>>>>> patch? I wonder if the issue is with wake up interaction with the nocb >>>>>>>> GP threads. >>>>>>>> >>>>>>>> We ought to disable lazy RCU during boot since it would have little >>>>>>>> benefit anyway. But I am also concerned about some deeper problem I >>>>>>>> did not catch before. >>>>>>>> >>>>>>>> I'll look into tracing the fs paths to see if I can narrow down what's >>>>>>>> causing it. Will also try a newer kernel, I am currently testing on >>>>>>>> 5.19-rc4. >>>>>>> >>>>>>> I got somewhere with this. It looks like queuing CBs as lazy CBs >>>>>>> instead of normal CBs, are triggering expedited stalls during the boot >>>>>>> process: >>>>>>> >>>>>>> 39.949198] rcu: INFO: rcu_preempt detected expedited stalls on >>>>>>> CPUs/tasks: { } 28 jiffies s: 69 root: 0x0/. >>>>>>> >>>>>>> No idea how/why lazy RCU CBs would be related to expedited GP issues, >>>>>>> but maybe something hangs and causes that side-effect. >>>>>>> >>>>>>> initcall_debug did not help, as it seems initcalls all work fine, and >>>>>>> then 8 seconds after the boot, it starts slowing down a lot, followed >>>>>>> by the RCU stall messages. As a next step I'll enable ftrace during >>>>>>> the boot to see if I can get more insight. But I believe, its not the >>>>>>> FS layer, the FS layer just triggers lazy CBs, but there is something >>>>>>> wrong with the core lazy-RCU work itself. >>>>>>> >>>>>>> This kernel is 5.19-rc4. I'll also try to rebase ChromeOS on more >>>>>>> recent kernels and debug. >>>>>> >>>>>> More digging, thanks to trace_event= boot option , I find that the >>>>>> boot process does have some synchronous waits, and though these are >>>>>> "non-lazy", for some reason the lazy CBs that were previously queued >>>>>> are making them wait for the *full* lazy duration. Which points to a >>>>>> likely bug in the lazy RCU logic. These synchronous CBs should never >>>>>> be waiting like the lazy ones: >>>>>> >>>>>> [ 17.715904] => trace_dump_stack >>>>>> [ 17.715904] => __wait_rcu_gp >>>>>> [ 17.715904] => synchronize_rcu >>>>>> [ 17.715904] => selinux_netcache_avc_callback >>>>>> [ 17.715904] => avc_ss_reset >>>>>> [ 17.715904] => sel_write_enforce >>>>>> [ 17.715904] => vfs_write >>>>>> [ 17.715904] => ksys_write >>>>>> [ 17.715904] => do_syscall_64 >>>>>> [ 17.715904] => entry_SYSCALL_64_after_hwframe >>>>>> >>>>>> I'm tired so I'll resume the debug later. >>>>> >>>>> At times like this, I often pull the suspect code into userspace and >>>>> run it through its paces. In this case, a bunch of call_rcu_lazy() >>>>> invocations into an empty bypass list, followed by a call_rcu() >>>>> invocation, then a check to make sure that the bypass list is no longer >>>>> lazy. >>>> >>>> Thanks a lot for this great debug idea, I will look into it. >>> >>> It seems to be a subtle issue when a large number of callbacks are >>> queued trigging the lock-contention code, which happens at boot. It >>> appears the non-lazy ones and lazy ones collide, so you have the lazy >>> timer which wins, and then the regular bypass lock-contention timer is >>> not allowed to do its thing. Due to this, the rcuog thread wakes up much >>> later than a jiffie. >> >> Good show, and glad you found it! > > Thanks! > >>> Things are much better with the following change. However, this brings >>> me to a question about lock-contention based or any deferring and boot time. >>> >>> If you have a path like selinux doing a synchronize_rcu(), shouldn't we >>> skip the jiffie waiting for the bypass timer? Otherwise things >>> synchronously waiting will slow down more than usual. Maybe bypassing >>> should not be done for any case until boot up is done. I'm curious to >>> see if that improves boot time. >> >> Why not simply disable laziness at boot time and enable it only after >> booting is complete? The exiting rcupdate.rcu_normal_after_boot kernel >> boot parameter uses a similar scheme. > > That sounds like the right thing to good, but unfortunately it wont help > this problem. The boot time issue happens after init has started. So the > OS is still "booting" even though the kernel has. > > Also the problem can happen after boot as well, like if RCU > lazy/non-lazy callbacks come back to back quickly, or so. > > But yes nonetheless, I can see the value of disabling it till the > in-kernel boot completets. > >>> @@ -580,7 +585,11 @@ static void __call_rcu_nocb_wake(struct rcu_data >>> *rdp, bool was_alldone, >>> len = rcu_segcblist_n_cbs(&rdp->cblist); >>> bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); >>> lazy_len = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass); >>> - if (was_alldone) { >>> + >>> + // If we are in lazy-mode, we still need to do a wake up even if >>> + // all CBs were previously done. Otherwise the GP thread will >>> + // wait for the full lazy duration. >>> + if (was_alldone || (READ_ONCE(rdp->nocb_defer_wakeup) == >>> RCU_NOCB_WAKE_LAZY)) { >>> rdp->qlen_last_fqs_check = len; >>> // Only lazy CBs in bypass list >>> if (lazy_len && bypass_len == lazy_len) { >> >> And this change looks plausible, though as always, the system's opinion >> carries much more weight than does mine. > > Sounds good, thanks, I am testing it more. Will update it for v4. We could also do the following, I tested it and it fixes it. It seems more maintainable and less fragile, but it comes at a slightly higher (but likely negligible) cost. If there are lazy CBs queued, and any non-lazy one comes, then the first non-lazy one is not considered to be added to the bypass list but hopefully that's Ok with you. Later non-lazy ones will be added to the bypass. @@ -484,9 +490,17 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, // since we are kick-starting RCU GP processing anyway for the non-lazy // one, we can just reuse that GP for the already queued-up lazy ones. if ((rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) || - (lazy && n_lazy_cbs >= qhimark)) { + (!lazy && n_lazy_cbs) || + (lazy && n_lazy_cbs >= qhimark)) { rcu_nocb_lock(rdp); - *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); + + // This variable helps decide if a wakeup of the rcuog thread + // is needed. It is passed to __call_rcu_nocb_wake() by the + // caller. If only lazy CBs were previously queued and this one + // is non-lazy, make sure the caller does a wake up. + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist) || + (!lazy && n_lazy_cbs); + if (*was_alldone) trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, lazy ? TPS("FirstLazyQ") : TPS("FirstQ")); @@ -500,7 +514,8 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) || ncbs >= qhimark) { rcu_nocb_lock(rdp); if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy, false)) { - *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist) || + (!lazy && n_lazy_cbs); if (*was_alldone) trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, lazy ? TPS("FirstLazyQ") : TPS("FirstQ"));
On Fri, Aug 19, 2022 at 03:40:00PM -0400, Joel Fernandes wrote: > On 8/19/2022 2:14 PM, Joel Fernandes wrote: > > On 8/19/2022 1:12 PM, Paul E. McKenney wrote: > >> On Fri, Aug 19, 2022 at 12:30:49PM -0400, Joel Fernandes wrote: > >>> On 8/18/2022 10:45 PM, Joel Fernandes wrote: > >>>> On 8/18/2022 10:35 PM, Paul E. McKenney wrote: > >>>>> On Thu, Aug 18, 2022 at 09:21:56PM -0400, Joel Fernandes wrote: > >>>>>> On Thu, Aug 18, 2022 at 7:05 PM Joel Fernandes <joel@joelfernandes.org> wrote: > >>>>>>> > >>>>>>> On Thu, Aug 18, 2022 at 1:23 PM Joel Fernandes <joel@joelfernandes.org> wrote: > >>>>>>>> > >>>>>>>> [Sorry, adding back the CC list] > >>>>>>>> > >>>>>>>> On Mon, Aug 8, 2022 at 11:45 PM Joel Fernandes (Google) > >>>>>>>> <joel@joelfernandes.org> wrote: > >>>>>>>>> > >>>>>>>>> This is required to prevent callbacks triggering RCU machinery too > >>>>>>>>> quickly and too often, which adds more power to the system. > >>>>>>>>> > >>>>>>>>> When testing, we found that these paths were invoked often when the > >>>>>>>>> system is not doing anything (screen is ON but otherwise idle). > >>>>>>>> > >>>>>>>> Unfortunately, I am seeing a slow down in ChromeOS boot performance > >>>>>>>> after applying this particular patch. It is the first time I could > >>>>>>>> test ChromeOS boot times with the series since it was hard to find a > >>>>>>>> ChromeOS device that runs the upstream kernel. > >>>>>>>> > >>>>>>>> Anyway, Vlad, Neeraj, do you guys also see slower boot times with this > >>>>>>>> patch? I wonder if the issue is with wake up interaction with the nocb > >>>>>>>> GP threads. > >>>>>>>> > >>>>>>>> We ought to disable lazy RCU during boot since it would have little > >>>>>>>> benefit anyway. But I am also concerned about some deeper problem I > >>>>>>>> did not catch before. > >>>>>>>> > >>>>>>>> I'll look into tracing the fs paths to see if I can narrow down what's > >>>>>>>> causing it. Will also try a newer kernel, I am currently testing on > >>>>>>>> 5.19-rc4. > >>>>>>> > >>>>>>> I got somewhere with this. It looks like queuing CBs as lazy CBs > >>>>>>> instead of normal CBs, are triggering expedited stalls during the boot > >>>>>>> process: > >>>>>>> > >>>>>>> 39.949198] rcu: INFO: rcu_preempt detected expedited stalls on > >>>>>>> CPUs/tasks: { } 28 jiffies s: 69 root: 0x0/. > >>>>>>> > >>>>>>> No idea how/why lazy RCU CBs would be related to expedited GP issues, > >>>>>>> but maybe something hangs and causes that side-effect. > >>>>>>> > >>>>>>> initcall_debug did not help, as it seems initcalls all work fine, and > >>>>>>> then 8 seconds after the boot, it starts slowing down a lot, followed > >>>>>>> by the RCU stall messages. As a next step I'll enable ftrace during > >>>>>>> the boot to see if I can get more insight. But I believe, its not the > >>>>>>> FS layer, the FS layer just triggers lazy CBs, but there is something > >>>>>>> wrong with the core lazy-RCU work itself. > >>>>>>> > >>>>>>> This kernel is 5.19-rc4. I'll also try to rebase ChromeOS on more > >>>>>>> recent kernels and debug. > >>>>>> > >>>>>> More digging, thanks to trace_event= boot option , I find that the > >>>>>> boot process does have some synchronous waits, and though these are > >>>>>> "non-lazy", for some reason the lazy CBs that were previously queued > >>>>>> are making them wait for the *full* lazy duration. Which points to a > >>>>>> likely bug in the lazy RCU logic. These synchronous CBs should never > >>>>>> be waiting like the lazy ones: > >>>>>> > >>>>>> [ 17.715904] => trace_dump_stack > >>>>>> [ 17.715904] => __wait_rcu_gp > >>>>>> [ 17.715904] => synchronize_rcu > >>>>>> [ 17.715904] => selinux_netcache_avc_callback > >>>>>> [ 17.715904] => avc_ss_reset > >>>>>> [ 17.715904] => sel_write_enforce > >>>>>> [ 17.715904] => vfs_write > >>>>>> [ 17.715904] => ksys_write > >>>>>> [ 17.715904] => do_syscall_64 > >>>>>> [ 17.715904] => entry_SYSCALL_64_after_hwframe > >>>>>> > >>>>>> I'm tired so I'll resume the debug later. > >>>>> > >>>>> At times like this, I often pull the suspect code into userspace and > >>>>> run it through its paces. In this case, a bunch of call_rcu_lazy() > >>>>> invocations into an empty bypass list, followed by a call_rcu() > >>>>> invocation, then a check to make sure that the bypass list is no longer > >>>>> lazy. > >>>> > >>>> Thanks a lot for this great debug idea, I will look into it. > >>> > >>> It seems to be a subtle issue when a large number of callbacks are > >>> queued trigging the lock-contention code, which happens at boot. It > >>> appears the non-lazy ones and lazy ones collide, so you have the lazy > >>> timer which wins, and then the regular bypass lock-contention timer is > >>> not allowed to do its thing. Due to this, the rcuog thread wakes up much > >>> later than a jiffie. > >> > >> Good show, and glad you found it! > > > > Thanks! > > > >>> Things are much better with the following change. However, this brings > >>> me to a question about lock-contention based or any deferring and boot time. > >>> > >>> If you have a path like selinux doing a synchronize_rcu(), shouldn't we > >>> skip the jiffie waiting for the bypass timer? Otherwise things > >>> synchronously waiting will slow down more than usual. Maybe bypassing > >>> should not be done for any case until boot up is done. I'm curious to > >>> see if that improves boot time. > >> > >> Why not simply disable laziness at boot time and enable it only after > >> booting is complete? The exiting rcupdate.rcu_normal_after_boot kernel > >> boot parameter uses a similar scheme. > > > > That sounds like the right thing to good, but unfortunately it wont help > > this problem. The boot time issue happens after init has started. So the > > OS is still "booting" even though the kernel has. > > > > Also the problem can happen after boot as well, like if RCU > > lazy/non-lazy callbacks come back to back quickly, or so. > > > > But yes nonetheless, I can see the value of disabling it till the > > in-kernel boot completets. > > > >>> @@ -580,7 +585,11 @@ static void __call_rcu_nocb_wake(struct rcu_data > >>> *rdp, bool was_alldone, > >>> len = rcu_segcblist_n_cbs(&rdp->cblist); > >>> bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); > >>> lazy_len = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass); > >>> - if (was_alldone) { > >>> + > >>> + // If we are in lazy-mode, we still need to do a wake up even if > >>> + // all CBs were previously done. Otherwise the GP thread will > >>> + // wait for the full lazy duration. > >>> + if (was_alldone || (READ_ONCE(rdp->nocb_defer_wakeup) == > >>> RCU_NOCB_WAKE_LAZY)) { > >>> rdp->qlen_last_fqs_check = len; > >>> // Only lazy CBs in bypass list > >>> if (lazy_len && bypass_len == lazy_len) { > >> > >> And this change looks plausible, though as always, the system's opinion > >> carries much more weight than does mine. > > > > Sounds good, thanks, I am testing it more. Will update it for v4. > > We could also do the following, I tested it and it fixes it. It seems more maintainable > and less fragile, but it comes at a slightly higher (but likely negligible) cost. If there > are lazy CBs queued, and any non-lazy one comes, then the first non-lazy one is not > considered to be added to the bypass list but hopefully that's Ok with you. Later non-lazy > ones will be added to the bypass. At first I was concerned that you intended to reorder the callbacks, but fortunately that is not what the patch below does. ;-) But don't you also need to clear the "lazy" flag at some point in this execution path? After all, once a non-lazy callback arrives, all the callbacks are treated as if they are non-lazy, correct? Thanx, Paul > @@ -484,9 +490,17 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head > *rhp, > // since we are kick-starting RCU GP processing anyway for the non-lazy > // one, we can just reuse that GP for the already queued-up lazy ones. > if ((rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) || > - (lazy && n_lazy_cbs >= qhimark)) { > + (!lazy && n_lazy_cbs) || > + (lazy && n_lazy_cbs >= qhimark)) { > rcu_nocb_lock(rdp); > - *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); > + > + // This variable helps decide if a wakeup of the rcuog thread > + // is needed. It is passed to __call_rcu_nocb_wake() by the > + // caller. If only lazy CBs were previously queued and this one > + // is non-lazy, make sure the caller does a wake up. > + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist) || > + (!lazy && n_lazy_cbs); > + > if (*was_alldone) > trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, > lazy ? TPS("FirstLazyQ") : TPS("FirstQ")); > @@ -500,7 +514,8 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head > *rhp, > if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) || ncbs >= qhimark) { > rcu_nocb_lock(rdp); > if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy, false)) { > - *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); > + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist) || > + (!lazy && n_lazy_cbs); > if (*was_alldone) > trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, > lazy ? TPS("FirstLazyQ") : TPS("FirstQ"));
On 8/19/2022 3:58 PM, Paul E. McKenney wrote: > On Fri, Aug 19, 2022 at 03:40:00PM -0400, Joel Fernandes wrote: >> On 8/19/2022 2:14 PM, Joel Fernandes wrote: >>> On 8/19/2022 1:12 PM, Paul E. McKenney wrote: >>>> On Fri, Aug 19, 2022 at 12:30:49PM -0400, Joel Fernandes wrote: >>>>> On 8/18/2022 10:45 PM, Joel Fernandes wrote: >>>>>> On 8/18/2022 10:35 PM, Paul E. McKenney wrote: >>>>>>> On Thu, Aug 18, 2022 at 09:21:56PM -0400, Joel Fernandes wrote: >>>>>>>> On Thu, Aug 18, 2022 at 7:05 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>>>>>>>> >>>>>>>>> On Thu, Aug 18, 2022 at 1:23 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>>>>>>>>> >>>>>>>>>> [Sorry, adding back the CC list] >>>>>>>>>> >>>>>>>>>> On Mon, Aug 8, 2022 at 11:45 PM Joel Fernandes (Google) >>>>>>>>>> <joel@joelfernandes.org> wrote: >>>>>>>>>>> >>>>>>>>>>> This is required to prevent callbacks triggering RCU machinery too >>>>>>>>>>> quickly and too often, which adds more power to the system. >>>>>>>>>>> >>>>>>>>>>> When testing, we found that these paths were invoked often when the >>>>>>>>>>> system is not doing anything (screen is ON but otherwise idle). >>>>>>>>>> >>>>>>>>>> Unfortunately, I am seeing a slow down in ChromeOS boot performance >>>>>>>>>> after applying this particular patch. It is the first time I could >>>>>>>>>> test ChromeOS boot times with the series since it was hard to find a >>>>>>>>>> ChromeOS device that runs the upstream kernel. >>>>>>>>>> >>>>>>>>>> Anyway, Vlad, Neeraj, do you guys also see slower boot times with this >>>>>>>>>> patch? I wonder if the issue is with wake up interaction with the nocb >>>>>>>>>> GP threads. >>>>>>>>>> >>>>>>>>>> We ought to disable lazy RCU during boot since it would have little >>>>>>>>>> benefit anyway. But I am also concerned about some deeper problem I >>>>>>>>>> did not catch before. >>>>>>>>>> >>>>>>>>>> I'll look into tracing the fs paths to see if I can narrow down what's >>>>>>>>>> causing it. Will also try a newer kernel, I am currently testing on >>>>>>>>>> 5.19-rc4. >>>>>>>>> >>>>>>>>> I got somewhere with this. It looks like queuing CBs as lazy CBs >>>>>>>>> instead of normal CBs, are triggering expedited stalls during the boot >>>>>>>>> process: >>>>>>>>> >>>>>>>>> 39.949198] rcu: INFO: rcu_preempt detected expedited stalls on >>>>>>>>> CPUs/tasks: { } 28 jiffies s: 69 root: 0x0/. >>>>>>>>> >>>>>>>>> No idea how/why lazy RCU CBs would be related to expedited GP issues, >>>>>>>>> but maybe something hangs and causes that side-effect. >>>>>>>>> >>>>>>>>> initcall_debug did not help, as it seems initcalls all work fine, and >>>>>>>>> then 8 seconds after the boot, it starts slowing down a lot, followed >>>>>>>>> by the RCU stall messages. As a next step I'll enable ftrace during >>>>>>>>> the boot to see if I can get more insight. But I believe, its not the >>>>>>>>> FS layer, the FS layer just triggers lazy CBs, but there is something >>>>>>>>> wrong with the core lazy-RCU work itself. >>>>>>>>> >>>>>>>>> This kernel is 5.19-rc4. I'll also try to rebase ChromeOS on more >>>>>>>>> recent kernels and debug. >>>>>>>> >>>>>>>> More digging, thanks to trace_event= boot option , I find that the >>>>>>>> boot process does have some synchronous waits, and though these are >>>>>>>> "non-lazy", for some reason the lazy CBs that were previously queued >>>>>>>> are making them wait for the *full* lazy duration. Which points to a >>>>>>>> likely bug in the lazy RCU logic. These synchronous CBs should never >>>>>>>> be waiting like the lazy ones: >>>>>>>> >>>>>>>> [ 17.715904] => trace_dump_stack >>>>>>>> [ 17.715904] => __wait_rcu_gp >>>>>>>> [ 17.715904] => synchronize_rcu >>>>>>>> [ 17.715904] => selinux_netcache_avc_callback >>>>>>>> [ 17.715904] => avc_ss_reset >>>>>>>> [ 17.715904] => sel_write_enforce >>>>>>>> [ 17.715904] => vfs_write >>>>>>>> [ 17.715904] => ksys_write >>>>>>>> [ 17.715904] => do_syscall_64 >>>>>>>> [ 17.715904] => entry_SYSCALL_64_after_hwframe >>>>>>>> >>>>>>>> I'm tired so I'll resume the debug later. >>>>>>> >>>>>>> At times like this, I often pull the suspect code into userspace and >>>>>>> run it through its paces. In this case, a bunch of call_rcu_lazy() >>>>>>> invocations into an empty bypass list, followed by a call_rcu() >>>>>>> invocation, then a check to make sure that the bypass list is no longer >>>>>>> lazy. >>>>>> >>>>>> Thanks a lot for this great debug idea, I will look into it. >>>>> >>>>> It seems to be a subtle issue when a large number of callbacks are >>>>> queued trigging the lock-contention code, which happens at boot. It >>>>> appears the non-lazy ones and lazy ones collide, so you have the lazy >>>>> timer which wins, and then the regular bypass lock-contention timer is >>>>> not allowed to do its thing. Due to this, the rcuog thread wakes up much >>>>> later than a jiffie. >>>> >>>> Good show, and glad you found it! >>> >>> Thanks! >>> >>>>> Things are much better with the following change. However, this brings >>>>> me to a question about lock-contention based or any deferring and boot time. >>>>> >>>>> If you have a path like selinux doing a synchronize_rcu(), shouldn't we >>>>> skip the jiffie waiting for the bypass timer? Otherwise things >>>>> synchronously waiting will slow down more than usual. Maybe bypassing >>>>> should not be done for any case until boot up is done. I'm curious to >>>>> see if that improves boot time. >>>> >>>> Why not simply disable laziness at boot time and enable it only after >>>> booting is complete? The exiting rcupdate.rcu_normal_after_boot kernel >>>> boot parameter uses a similar scheme. >>> >>> That sounds like the right thing to good, but unfortunately it wont help >>> this problem. The boot time issue happens after init has started. So the >>> OS is still "booting" even though the kernel has. >>> >>> Also the problem can happen after boot as well, like if RCU >>> lazy/non-lazy callbacks come back to back quickly, or so. >>> >>> But yes nonetheless, I can see the value of disabling it till the >>> in-kernel boot completets. >>> >>>>> @@ -580,7 +585,11 @@ static void __call_rcu_nocb_wake(struct rcu_data >>>>> *rdp, bool was_alldone, >>>>> len = rcu_segcblist_n_cbs(&rdp->cblist); >>>>> bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); >>>>> lazy_len = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass); >>>>> - if (was_alldone) { >>>>> + >>>>> + // If we are in lazy-mode, we still need to do a wake up even if >>>>> + // all CBs were previously done. Otherwise the GP thread will >>>>> + // wait for the full lazy duration. >>>>> + if (was_alldone || (READ_ONCE(rdp->nocb_defer_wakeup) == >>>>> RCU_NOCB_WAKE_LAZY)) { >>>>> rdp->qlen_last_fqs_check = len; >>>>> // Only lazy CBs in bypass list >>>>> if (lazy_len && bypass_len == lazy_len) { >>>> >>>> And this change looks plausible, though as always, the system's opinion >>>> carries much more weight than does mine. >>> >>> Sounds good, thanks, I am testing it more. Will update it for v4. >> >> We could also do the following, I tested it and it fixes it. It seems more maintainable >> and less fragile, but it comes at a slightly higher (but likely negligible) cost. If there >> are lazy CBs queued, and any non-lazy one comes, then the first non-lazy one is not >> considered to be added to the bypass list but hopefully that's Ok with you. Later non-lazy >> ones will be added to the bypass. > > At first I was concerned that you intended to reorder the callbacks, > but fortunately that is not what the patch below does. ;-) > > But don't you also need to clear the "lazy" flag at some point in this > execution path? After all, once a non-lazy callback arrives, all the > callbacks are treated as if they are non-lazy, correct? Oh, I did not see a need to clear this flag, I did not want to modify the flag because it is set by the caller and is more to signal what the caller intended than what its treated as. IMO it should not effect the treatment of the lazy CB as a non-lazy one, because the effect is the same as if something was queued into the bypass list. The bypass list is flushed first before the "non-bypass" (which could be non-lazy) CB is requested to be queued by the caller, so we're doing the same thing as the existing bypass logic. Basically these comments: // If there hasn't yet been all that many ->cblist enqueues // this jiffy, tell the caller to enqueue onto ->cblist. But flush // ->nocb_bypass first. WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, lazy, false)); Also the issue that the below diff fixed is more about the non-lazy one being treated as a lazy one than the other way. Let me know if I missed something. Thanks, - Joel > Thanx, Paul > >> @@ -484,9 +490,17 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head >> *rhp, >> // since we are kick-starting RCU GP processing anyway for the non-lazy >> // one, we can just reuse that GP for the already queued-up lazy ones. >> if ((rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) || >> - (lazy && n_lazy_cbs >= qhimark)) { >> + (!lazy && n_lazy_cbs) || >> + (lazy && n_lazy_cbs >= qhimark)) { >> rcu_nocb_lock(rdp); >> - *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); >> + >> + // This variable helps decide if a wakeup of the rcuog thread >> + // is needed. It is passed to __call_rcu_nocb_wake() by the >> + // caller. If only lazy CBs were previously queued and this one >> + // is non-lazy, make sure the caller does a wake up. >> + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist) || >> + (!lazy && n_lazy_cbs); >> + >> if (*was_alldone) >> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, >> lazy ? TPS("FirstLazyQ") : TPS("FirstQ")); >> @@ -500,7 +514,8 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head >> *rhp, >> if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) || ncbs >= qhimark) { >> rcu_nocb_lock(rdp); >> if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy, false)) { >> - *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); >> + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist) || >> + (!lazy && n_lazy_cbs); >> if (*was_alldone) >> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, >> lazy ? TPS("FirstLazyQ") : TPS("FirstQ"));
diff --git a/fs/dcache.c b/fs/dcache.c index 93f4f5ee07bf..7f51bac390c8 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -366,7 +366,7 @@ static void dentry_free(struct dentry *dentry) if (unlikely(dname_external(dentry))) { struct external_name *p = external_name(dentry); if (likely(atomic_dec_and_test(&p->u.count))) { - call_rcu(&dentry->d_u.d_rcu, __d_free_external); + call_rcu_lazy(&dentry->d_u.d_rcu, __d_free_external); return; } } @@ -374,7 +374,7 @@ static void dentry_free(struct dentry *dentry) if (dentry->d_flags & DCACHE_NORCU) __d_free(&dentry->d_u.d_rcu); else - call_rcu(&dentry->d_u.d_rcu, __d_free); + call_rcu_lazy(&dentry->d_u.d_rcu, __d_free); } /* diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 971f98af48ff..57b3f781760c 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -729,7 +729,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi) * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make * use of the rbn field. */ - call_rcu(&epi->rcu, epi_rcu_free); + call_rcu_lazy(&epi->rcu, epi_rcu_free); percpu_counter_dec(&ep->user->epoll_watches); diff --git a/fs/file_table.c b/fs/file_table.c index 5424e3a8df5f..417f57e9cb30 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -56,7 +56,7 @@ static inline void file_free(struct file *f) security_file_free(f); if (!(f->f_mode & FMODE_NOACCOUNT)) percpu_counter_dec(&nr_files); - call_rcu(&f->f_u.fu_rcuhead, file_free_rcu); + call_rcu_lazy(&f->f_u.fu_rcuhead, file_free_rcu); } /* diff --git a/fs/inode.c b/fs/inode.c index bd4da9c5207e..38fe040ddbd6 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -312,7 +312,7 @@ static void destroy_inode(struct inode *inode) return; } inode->free_inode = ops->free_inode; - call_rcu(&inode->i_rcu, i_callback); + call_rcu_lazy(&inode->i_rcu, i_callback); } /**
This is required to prevent callbacks triggering RCU machinery too quickly and too often, which adds more power to the system. When testing, we found that these paths were invoked often when the system is not doing anything (screen is ON but otherwise idle). Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> --- fs/dcache.c | 4 ++-- fs/eventpoll.c | 2 +- fs/file_table.c | 2 +- fs/inode.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-)