Message ID | 20221019225144.2500095-13-paulmck@kernel.org (mailing list archive)
---|---
State | Superseded
Commit | bc940d469f48b6b01ca4ac7353b85ac8f1101d72
Series | Lazy call_rcu() updates for v6.2
Hello,

On Wed, Oct 19, 2022 at 6:51 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> From: Uladzislau Rezki <urezki@gmail.com>
>
> call_rcu() changes to save power will slow down RCU workqueue items
> queued via queue_rcu_work(). This may not be an issue, however we cannot
> assume that workqueue users are OK with long delays. Use
> call_rcu_flush() API instead which reverts to the old behavior.

On ChromeOS, I can see that queue_rcu_work() is pretty noisy and the
batching is much better if we can just keep it as call_rcu() instead
of call_rcu_flush().

Is there really any reason to keep it as call_rcu_flush()? If I
recall, the real reason Vlad's system was slowing down was because of
scsi, and the queue_rcu_work() conversion was really a red herring.

Vlad, any thoughts?

thanks,

 - Joel

> Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  kernel/workqueue.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 7cd5f5e7e0a1b..b4b0e828b529e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
>
>  	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
>  		rwork->wq = wq;
> -		call_rcu(&rwork->rcu, rcu_work_rcufn);
> +		call_rcu_flush(&rwork->rcu, rcu_work_rcufn);
>  		return true;
>  	}
>
> --
> 2.31.1.189.g2e36527f23
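[For context on what queue_rcu_work() gives its callers: the work function runs only after a full grace period, so with the lazy call_rcu() its execution can be deferred for seconds. Below is a minimal sketch of the usage pattern the patch affects; struct foo and the function names are hypothetical, while queue_rcu_work(), INIT_RCU_WORK(), and to_rcu_work() are the real workqueue.h APIs.]

```c
#include <linux/workqueue.h>
#include <linux/slab.h>

/* Hypothetical object reclaimed via the rcu_work pattern. */
struct foo {
	struct rcu_work rwork;
	/* ... payload ... */
};

/* Runs from the workqueue only after a grace period has elapsed,
 * so no RCU reader can still hold a reference to @f. */
static void foo_reclaim_workfn(struct work_struct *work)
{
	struct foo *f = container_of(to_rcu_work(work), struct foo, rwork);

	kfree(f);
}

static void foo_release(struct foo *f)
{
	INIT_RCU_WORK(&f->rwork, foo_reclaim_workfn);
	/* With the lazy call_rcu(), this may not run for seconds;
	 * the patch above switches it to call_rcu_flush() instead. */
	queue_rcu_work(system_wq, &f->rwork);
}
```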
On Sun, Oct 23, 2022 at 08:36:00PM -0400, Joel Fernandes wrote:
> Hello,
>
> On Wed, Oct 19, 2022 at 6:51 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > From: Uladzislau Rezki <urezki@gmail.com>
> >
> > call_rcu() changes to save power will slow down RCU workqueue items
> > queued via queue_rcu_work(). This may not be an issue, however we cannot
> > assume that workqueue users are OK with long delays. Use
> > call_rcu_flush() API instead which reverts to the old behavior.
>
> On ChromeOS, I can see that queue_rcu_work() is pretty noisy and the
> batching is much better if we can just keep it as call_rcu() instead
> of call_rcu_flush().
>
> Is there really any reason to keep it as call_rcu_flush()? If I
> recall, the real reason Vlad's system was slowing down was because of
> scsi, and the queue_rcu_work() conversion was really a red herring.

There are less than 20 invocations of queue_rcu_work(), so it should
be possible to look through each. The low-risk approach is of course to
have queue_rcu_work() use call_rcu_flush().

The next approach might be to have a Kconfig option and/or kernel
boot parameter that allowed a per-system choice.

But it would not hurt to double-check on Android.

							Thanx, Paul

> Vlad, any thoughts?
>
> thanks,
>
>  - Joel
>
> > Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > ---
> >  kernel/workqueue.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 7cd5f5e7e0a1b..b4b0e828b529e 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
> >
> >  	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> >  		rwork->wq = wq;
> > -		call_rcu(&rwork->rcu, rcu_work_rcufn);
> > +		call_rcu_flush(&rwork->rcu, rcu_work_rcufn);
> >  		return true;
> >  	}
> >
> > --
> > 2.31.1.189.g2e36527f23
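[The per-system choice Paul floats could look something like the sketch below: a boot-time knob steering queue_rcu_work() between the two APIs. This is entirely hypothetical; neither the parameter nor a matching Kconfig option exists in the series. call_rcu_flush() and rcu_work_rcufn are taken from the patch above.]

```c
#include <linux/moduleparam.h>

/* Hypothetical knob: rcu_work_flush=0 would opt a system into lazy
 * grace periods for queue_rcu_work(); the default keeps the series'
 * low-risk flush behavior. */
static bool rcu_work_flush = true;
module_param(rcu_work_flush, bool, 0444);

/* Sketch of how queue_rcu_work() might consult it (cf. the patch above). */
static void queue_rcu_work_start_gp(struct rcu_work *rwork)
{
	if (rcu_work_flush)
		call_rcu_flush(&rwork->rcu, rcu_work_rcufn);
	else
		call_rcu(&rwork->rcu, rcu_work_rcufn);
}
```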
> On Sun, Oct 23, 2022 at 08:36:00PM -0400, Joel Fernandes wrote:
> > Hello,
> >
> > On Wed, Oct 19, 2022 at 6:51 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > From: Uladzislau Rezki <urezki@gmail.com>
> > >
> > > call_rcu() changes to save power will slow down RCU workqueue items
> > > queued via queue_rcu_work(). This may not be an issue, however we cannot
> > > assume that workqueue users are OK with long delays. Use
> > > call_rcu_flush() API instead which reverts to the old behavior.
> >
> > On ChromeOS, I can see that queue_rcu_work() is pretty noisy and the
> > batching is much better if we can just keep it as call_rcu() instead
> > of call_rcu_flush().
> >
> > Is there really any reason to keep it as call_rcu_flush()? If I
> > recall, the real reason Vlad's system was slowing down was because of
> > scsi and the queue_rcu_work() conversion was really a red herring.
>
<snip>
*** drivers/acpi/osl.c:
acpi_os_drop_map_ref[401]     queue_rcu_work(system_wq, &map->track.rwork);

*** drivers/gpu/drm/i915/gt/intel_execlists_submission.c:
virtual_context_destroy[3653] queue_rcu_work(system_wq, &ve->rcu);

*** fs/aio.c:
free_ioctx_reqs[632]          queue_rcu_work(system_wq, &ctx->free_rwork);

*** fs/fs-writeback.c:
inode_switch_wbs[604]         queue_rcu_work(isw_wq, &isw->work);
cleanup_offline_cgwb[676]     queue_rcu_work(isw_wq, &isw->work);

*** include/linux/workqueue.h:
__printf[446]                 extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);

*** kernel/cgroup/cgroup.c:
css_release_work_fn[5253]     queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
css_create[5384]              queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);

*** kernel/rcu/tree.c:
kfree_rcu_monitor[3192]       queue_rcu_work(system_wq, &krwp->rcu_work);

*** net/core/skmsg.c:
sk_psock_drop[852]            queue_rcu_work(system_wq, &psock->rwork);

*** net/sched/act_ct.c:
tcf_ct_flow_table_put[355]    queue_rcu_work(act_ct_wq, &ct_ft->rwork);

*** net/sched/cls_api.c:
tcf_queue_work[225]           return queue_rcu_work(tc_filter_wq, rwork);
<snip>

There are 9 users of the queue_rcu_work() function. I think there can be
a side effect if we keep it as the lazy variant. Please note that I have
not checked all those users.

> There are less than 20 invocations of queue_rcu_work(), so it should
> be possible to look through each. The low-risk approach is of course to
> have queue_rcu_work() use call_rcu_flush().
>
> The next approach might be to have a Kconfig option and/or kernel
> boot parameter that allowed a per-system choice.
>
> But it would not hurt to double-check on Android.
>
I did not see such noise, but I will come back with some data on a 5.10
kernel today.

> > Vlad, any thoughts?
>
At least for kvfree_rcu() I would like to keep the sync variant, because
we have the below patch that improves batching:

<snip>
commit 51824b780b719c53113dc39e027fbf670dc66028
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Thu Jun 30 18:33:35 2022 +0200

    rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval

    Currently the monitor work is scheduled with a fixed interval of HZ/20,
    which is roughly 50 milliseconds. The drawback of this approach is
    low utilization of the 512 page slots in scenarios with infrequent
    kvfree_rcu() calls. For example on an Android system:
<snip>

Apparently I see it in the "dev" branch only.

--
Uladzislau Rezki
> > On Sun, Oct 23, 2022 at 08:36:00PM -0400, Joel Fernandes wrote: > > > Hello, > > > > > > On Wed, Oct 19, 2022 at 6:51 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > From: Uladzislau Rezki <urezki@gmail.com> > > > > > > > > call_rcu() changes to save power will slow down RCU workqueue items > > > > queued via queue_rcu_work(). This may not be an issue, however we cannot > > > > assume that workqueue users are OK with long delays. Use > > > > call_rcu_flush() API instead which reverts to the old behavio > > > > > > On ChromeOS, I can see that queue_rcu_work() is pretty noisy and the > > > batching is much better if we can just keep it as call_rcu() instead > > > of call_rcu_flush(). > > > > > > Is there really any reason to keep it as call_rcu_flush() ? If I > > > recall, the real reason Vlad's system was slowing down was because of > > > scsi and the queue_rcu_work() conversion was really a red herring. > > > <snip> > *** drivers/acpi/osl.c: > acpi_os_drop_map_ref[401] queue_rcu_work(system_wq, &map->track.rwork); > > *** drivers/gpu/drm/i915/gt/intel_execlists_submission.c: > virtual_context_destroy[3653] queue_rcu_work(system_wq, &ve->rcu); > > *** fs/aio.c: > free_ioctx_reqs[632] queue_rcu_work(system_wq, &ctx->free_rwork); > > *** fs/fs-writeback.c: > inode_switch_wbs[604] queue_rcu_work(isw_wq, &isw->work); > cleanup_offline_cgwb[676] queue_rcu_work(isw_wq, &isw->work); > > *** include/linux/workqueue.h: > __printf[446] extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork); > > *** kernel/cgroup/cgroup.c: > css_release_work_fn[5253] queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork); > css_create[5384] queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork); > > *** kernel/rcu/tree.c: > kfree_rcu_monitor[3192] queue_rcu_work(system_wq, &krwp->rcu_work); > > *** net/core/skmsg.c: > sk_psock_drop[852] queue_rcu_work(system_wq, &psock->rwork); > > *** net/sched/act_ct.c: > tcf_ct_flow_table_put[355] queue_rcu_work(act_ct_wq, &ct_ft->rwork); > > *** net/sched/cls_api.c: > tcf_queue_work[225] return queue_rcu_work(tc_filter_wq, rwork); > <snip> > There are 9 users of the queue_rcu_work() functions. I think there can be > a side effect if we keep it as lazy variant. Please note that i have not > checked all those users. > > > There are less than 20 invocations of queue_rcu_work(), so it should > > be possible look through each. The low-risk approach is of course to > > have queue_rcu_work() use call_rcu_flush(). > > > > The next approach might be to have a Kconfig option and/or kernel > > boot parameter that allowed a per-system choice. > > > > But it would not hurt to double-check on Android. > > > I did not see such noise but i will come back some data on 5.10 kernel > today. 
>
Home screen swipe:
<snip>
 <...>-15   [003] d..1 202.142205: rcu_batch_start: rcu_preempt CBs=105 bl=10
 <...>-55   [001] d..1 202.166174: rcu_batch_start: rcu_preempt CBs=135 bl=10
 <...>-26   [001] d..1 202.402182: rcu_batch_start: rcu_preempt CBs=221 bl=10
 rcuop/3-40 [003] d..1 202.650323: rcu_batch_start: rcu_preempt CBs=213 bl=10
 rcuop/3-40 [000] d..1 203.210537: rcu_batch_start: rcu_preempt CBs=90 bl=10
 rcuop/5-55 [001] d..1 204.675671: rcu_batch_start: rcu_preempt CBs=14 bl=10
 rcuop/2-33 [002] d..1 205.162229: rcu_batch_start: rcu_preempt CBs=649 bl=10
 rcuop/3-40 [000] d..1 205.418214: rcu_batch_start: rcu_preempt CBs=291 bl=10
 rcuop/3-40 [003] d..1 206.134204: rcu_batch_start: rcu_preempt CBs=174 bl=10
 rcuop/0-15 [003] d..1 206.726311: rcu_batch_start: rcu_preempt CBs=738 bl=10
 rcuop/1-26 [001] d..1 206.814168: rcu_batch_start: rcu_preempt CBs=865 bl=10
 rcuop/3-40 [003] d..1 207.278178: rcu_batch_start: rcu_preempt CBs=287 bl=10
 rcuop/1-26 [001] d..1 208.826279: rcu_batch_start: rcu_preempt CBs=506 bl=10
<snip>

An app launch:
<snip>
 rcuop/3-40 [002] d..1 322.118620: rcu_batch_start: rcu_preempt CBs=99 bl=10
 rcuop/4-48 [005] dn.1 322.454052: rcu_batch_start: rcu_preempt CBs=270 bl=10
 rcuop/5-55 [005] d..1 322.454109: rcu_batch_start: rcu_preempt CBs=91 bl=10
 rcuop/5-55 [007] d..1 322.470054: rcu_batch_start: rcu_preempt CBs=106 bl=10
 rcuop/6-62 [005] d..1 322.482120: rcu_batch_start: rcu_preempt CBs=231 bl=10
 rcuop/4-48 [001] d..1 322.494150: rcu_batch_start: rcu_preempt CBs=227 bl=10
 <...>-69   [002] d..1 322.502442: rcu_batch_start: rcu_preempt CBs=3350 bl=26
 rcuop/1-26 [001] d..1 322.646099: rcu_batch_start: rcu_preempt CBs=1685 bl=13
 rcuop/2-33 [001] d..1 322.670071: rcu_batch_start: rcu_preempt CBs=438 bl=10
 rcuop/1-26 [001] d..1 322.674120: rcu_batch_start: rcu_preempt CBs=18 bl=10
 rcuop/2-33 [003] d..1 322.690152: rcu_batch_start: rcu_preempt CBs=10 bl=10
 rcuop/1-26 [002] d..1 322.698104: rcu_batch_start: rcu_preempt CBs=10 bl=10
 rcuop/3-40 [002] d..1 322.706167: rcu_batch_start: rcu_preempt CBs=313 bl=10
 rcuop/2-33 [003] d..1 322.710075: rcu_batch_start: rcu_preempt CBs=15 bl=10
 rcuop/3-40 [002] d..1 322.742137: rcu_batch_start: rcu_preempt CBs=13 bl=10
 rcuop/5-55 [000] d..1 322.754270: rcu_batch_start: rcu_preempt CBs=157 bl=10
 rcuop/3-40 [000] d..1 322.762182: rcu_batch_start: rcu_preempt CBs=17 bl=10
 rcuop/2-33 [003] d..1 322.774088: rcu_batch_start: rcu_preempt CBs=38 bl=10
 rcuop/3-40 [000] d..1 322.778131: rcu_batch_start: rcu_preempt CBs=23 bl=10
 rcuop/1-26 [002] d..1 322.790105: rcu_batch_start: rcu_preempt CBs=33 bl=10
 rcuop/4-48 [001] d..1 322.798074: rcu_batch_start: rcu_preempt CBs=340 bl=10
 rcuop/2-33 [002] d..1 322.806158: rcu_batch_start: rcu_preempt CBs=18 bl=10
 rcuop/1-26 [002] d..1 322.814057: rcu_batch_start: rcu_preempt CBs=18 bl=10
 rcuop/0-15 [001] d..1 322.822476: rcu_batch_start: rcu_preempt CBs=333 bl=10
 rcuop/4-48 [003] d..1 322.830102: rcu_batch_start: rcu_preempt CBs=11 bl=10
 rcuop/2-33 [001] d..1 322.846109: rcu_batch_start: rcu_preempt CBs=80 bl=10
 rcuop/3-40 [001] d..1 322.854162: rcu_batch_start: rcu_preempt CBs=145 bl=10
 rcuop/4-48 [003] d..1 322.874129: rcu_batch_start: rcu_preempt CBs=21 bl=10
 rcuop/3-40 [001] d..1 322.878149: rcu_batch_start: rcu_preempt CBs=43 bl=10
 rcuop/3-40 [001] d..1 322.906273: rcu_batch_start: rcu_preempt CBs=10 bl=10
 rcuop/4-48 [001] d..1 322.918201: rcu_batch_start: rcu_preempt CBs=23 bl=10
 rcuop/2-33 [001] d..1 322.926212: rcu_batch_start: rcu_preempt CBs=86 bl=10
 rcuop/2-33 [001] d..1 322.946251: rcu_batch_start: rcu_preempt CBs=12 bl=10
 rcuop/5-55 [003] d..1 322.954482: rcu_batch_start: rcu_preempt CBs=70 bl=10
 rcuop/2-33 [003] d..1 322.978146: rcu_batch_start: rcu_preempt CBs=20 bl=10
 rcuop/1-26 [002] d..1 323.014290: rcu_batch_start: rcu_preempt CBs=230 bl=10
 rcuop/4-48 [001] d..1 323.026119: rcu_batch_start: rcu_preempt CBs=73 bl=10
 rcuop/5-55 [003] d..1 323.026175: rcu_batch_start: rcu_preempt CBs=94 bl=10
 rcuop/3-40 [001] d..1 323.035310: rcu_batch_start: rcu_preempt CBs=70 bl=10
 rcuop/0-15 [001] d..1 323.046231: rcu_batch_start: rcu_preempt CBs=165 bl=10
 rcuop/6-62 [005] d..1 323.066132: rcu_batch_start: rcu_preempt CBs=179 bl=10
 rcuop/1-26 [002] d..1 323.174202: rcu_batch_start: rcu_preempt CBs=61 bl=10
 rcuop/2-33 [003] d..1 323.190203: rcu_batch_start: rcu_preempt CBs=80 bl=10
 rcuop/3-40 [003] d..1 323.206210: rcu_batch_start: rcu_preempt CBs=84 bl=10
 rcuop/2-33 [003] d..1 323.226880: rcu_batch_start: rcu_preempt CBs=5 bl=10
<snip>

It is on Android with a 5.10 kernel running. I do not see that
queue_rcu_work() makes any noise.

Joel, could you please post your batch_start tracepoint output where you
see the noise?

--
Uladzislau Rezki
On Mon, Oct 24, 2022 at 02:23:39PM +0200, Uladzislau Rezki wrote: > > > On Sun, Oct 23, 2022 at 08:36:00PM -0400, Joel Fernandes wrote: > > > > On Wed, Oct 19, 2022 at 6:51 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > > > From: Uladzislau Rezki <urezki@gmail.com> > > > > > > > > > > call_rcu() changes to save power will slow down RCU workqueue items > > > > > queued via queue_rcu_work(). This may not be an issue, however we cannot > > > > > assume that workqueue users are OK with long delays. Use > > > > > call_rcu_flush() API instead which reverts to the old behavio > > > > > > > > On ChromeOS, I can see that queue_rcu_work() is pretty noisy and the > > > > batching is much better if we can just keep it as call_rcu() instead > > > > of call_rcu_flush(). > > > > > > > > Is there really any reason to keep it as call_rcu_flush() ? If I > > > > recall, the real reason Vlad's system was slowing down was because of > > > > scsi and the queue_rcu_work() conversion was really a red herring. > > > > > <snip> > > *** drivers/acpi/osl.c: > > acpi_os_drop_map_ref[401] queue_rcu_work(system_wq, &map->track.rwork); > > > > *** drivers/gpu/drm/i915/gt/intel_execlists_submission.c: > > virtual_context_destroy[3653] queue_rcu_work(system_wq, &ve->rcu); > > > > *** fs/aio.c: > > free_ioctx_reqs[632] queue_rcu_work(system_wq, &ctx->free_rwork); > > > > *** fs/fs-writeback.c: > > inode_switch_wbs[604] queue_rcu_work(isw_wq, &isw->work); > > cleanup_offline_cgwb[676] queue_rcu_work(isw_wq, &isw->work); > > > > *** include/linux/workqueue.h: > > __printf[446] extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork); > > > > *** kernel/cgroup/cgroup.c: > > css_release_work_fn[5253] queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork); > > css_create[5384] queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork); > > > > *** kernel/rcu/tree.c: > > kfree_rcu_monitor[3192] queue_rcu_work(system_wq, &krwp->rcu_work); > > > > *** net/core/skmsg.c: > > sk_psock_drop[852] queue_rcu_work(system_wq, &psock->rwork); > > > > *** net/sched/act_ct.c: > > tcf_ct_flow_table_put[355] queue_rcu_work(act_ct_wq, &ct_ft->rwork); > > > > *** net/sched/cls_api.c: > > tcf_queue_work[225] return queue_rcu_work(tc_filter_wq, rwork); > > <snip> > > There are 9 users of the queue_rcu_work() functions. I think there can be > > a side effect if we keep it as lazy variant. Please note that i have not > > checked all those users. > > > > > There are less than 20 invocations of queue_rcu_work(), so it should > > > be possible look through each. The low-risk approach is of course to > > > have queue_rcu_work() use call_rcu_flush(). Yes, once I get to a device (tomorrow), I'll look more. Last I checked it was kvfree_rcu() -- this was a few weeks/months ago though. > > > The next approach might be to have a Kconfig option and/or kernel > > > boot parameter that allowed a per-system choice. > > > > > > But it would not hurt to double-check on Android. > > > > > I did not see such noise but i will come back some data on 5.10 kernel > > today. 
> >
> Home screen swipe:
> <snip>
>  <...>-15   [003] d..1 202.142205: rcu_batch_start: rcu_preempt CBs=105 bl=10
>  <...>-55   [001] d..1 202.166174: rcu_batch_start: rcu_preempt CBs=135 bl=10
>  <...>-26   [001] d..1 202.402182: rcu_batch_start: rcu_preempt CBs=221 bl=10
>  rcuop/3-40 [003] d..1 202.650323: rcu_batch_start: rcu_preempt CBs=213 bl=10
>  rcuop/3-40 [000] d..1 203.210537: rcu_batch_start: rcu_preempt CBs=90 bl=10
>  rcuop/5-55 [001] d..1 204.675671: rcu_batch_start: rcu_preempt CBs=14 bl=10
>  rcuop/2-33 [002] d..1 205.162229: rcu_batch_start: rcu_preempt CBs=649 bl=10
>  rcuop/3-40 [000] d..1 205.418214: rcu_batch_start: rcu_preempt CBs=291 bl=10
>  rcuop/3-40 [003] d..1 206.134204: rcu_batch_start: rcu_preempt CBs=174 bl=10
>  rcuop/0-15 [003] d..1 206.726311: rcu_batch_start: rcu_preempt CBs=738 bl=10
>  rcuop/1-26 [001] d..1 206.814168: rcu_batch_start: rcu_preempt CBs=865 bl=10
>  rcuop/3-40 [003] d..1 207.278178: rcu_batch_start: rcu_preempt CBs=287 bl=10
>  rcuop/1-26 [001] d..1 208.826279: rcu_batch_start: rcu_preempt CBs=506 bl=10
> <snip>

This looks fine to me, but..

> An app launch:
> <snip>
>  rcuop/3-40 [002] d..1 322.118620: rcu_batch_start: rcu_preempt CBs=99 bl=10
>  rcuop/4-48 [005] dn.1 322.454052: rcu_batch_start: rcu_preempt CBs=270 bl=10
>  rcuop/5-55 [005] d..1 322.454109: rcu_batch_start: rcu_preempt CBs=91 bl=10
>  rcuop/5-55 [007] d..1 322.470054: rcu_batch_start: rcu_preempt CBs=106 bl=10
>  rcuop/6-62 [005] d..1 322.482120: rcu_batch_start: rcu_preempt CBs=231 bl=10
>  rcuop/4-48 [001] d..1 322.494150: rcu_batch_start: rcu_preempt CBs=227 bl=10
>  <...>-69   [002] d..1 322.502442: rcu_batch_start: rcu_preempt CBs=3350 bl=26
>  rcuop/1-26 [001] d..1 322.646099: rcu_batch_start: rcu_preempt CBs=1685 bl=13
>  rcuop/2-33 [001] d..1 322.670071: rcu_batch_start: rcu_preempt CBs=438 bl=10
>  rcuop/1-26 [001] d..1 322.674120: rcu_batch_start: rcu_preempt CBs=18 bl=10
>  rcuop/2-33 [003] d..1 322.690152: rcu_batch_start: rcu_preempt CBs=10 bl=10
>  rcuop/1-26 [002] d..1 322.698104: rcu_batch_start: rcu_preempt CBs=10 bl=10
>  rcuop/3-40 [002] d..1 322.706167: rcu_batch_start: rcu_preempt CBs=313 bl=10
>  rcuop/2-33 [003] d..1 322.710075: rcu_batch_start: rcu_preempt CBs=15 bl=10

The above does not look fine to me (^^^) from a Lazy-RCU PoV. Here, RCU
callbacks are being invoked every 10-20 ms. The batching I seek is of the
order of seconds, when the system is relatively idle.

Why is Lazy-RCU not in effect for the app launch? IOW, which callback is
causing Lazy-RCU to not be lazy here? Could it be queue_rcu_work()?
Whenever a non-lazy callback is queued, all the lazy ones are "promoted"
to non-lazy. That's why I am asking. Even if you queue one non-lazy
callback at a high enough frequency, the lazy ones will no longer give you
batching or the benefits of laziness.
> rcuop/3-40 [002] d..1 322.742137: rcu_batch_start: rcu_preempt CBs=13 bl=10 > rcuop/5-55 [000] d..1 322.754270: rcu_batch_start: rcu_preempt CBs=157 bl=10 > rcuop/3-40 [000] d..1 322.762182: rcu_batch_start: rcu_preempt CBs=17 bl=10 > rcuop/2-33 [003] d..1 322.774088: rcu_batch_start: rcu_preempt CBs=38 bl=10 > rcuop/3-40 [000] d..1 322.778131: rcu_batch_start: rcu_preempt CBs=23 bl=10 > rcuop/1-26 [002] d..1 322.790105: rcu_batch_start: rcu_preempt CBs=33 bl=10 > rcuop/4-48 [001] d..1 322.798074: rcu_batch_start: rcu_preempt CBs=340 bl=10 > rcuop/2-33 [002] d..1 322.806158: rcu_batch_start: rcu_preempt CBs=18 bl=10 > rcuop/1-26 [002] d..1 322.814057: rcu_batch_start: rcu_preempt CBs=18 bl=10 > rcuop/0-15 [001] d..1 322.822476: rcu_batch_start: rcu_preempt CBs=333 bl=10 > rcuop/4-48 [003] d..1 322.830102: rcu_batch_start: rcu_preempt CBs=11 bl=10 > rcuop/2-33 [001] d..1 322.846109: rcu_batch_start: rcu_preempt CBs=80 bl=10 > rcuop/3-40 [001] d..1 322.854162: rcu_batch_start: rcu_preempt CBs=145 bl=10 > rcuop/4-48 [003] d..1 322.874129: rcu_batch_start: rcu_preempt CBs=21 bl=10 > rcuop/3-40 [001] d..1 322.878149: rcu_batch_start: rcu_preempt CBs=43 bl=10 > rcuop/3-40 [001] d..1 322.906273: rcu_batch_start: rcu_preempt CBs=10 bl=10 > rcuop/4-48 [001] d..1 322.918201: rcu_batch_start: rcu_preempt CBs=23 bl=10 > rcuop/2-33 [001] d..1 322.926212: rcu_batch_start: rcu_preempt CBs=86 bl=10 > rcuop/2-33 [001] d..1 322.946251: rcu_batch_start: rcu_preempt CBs=12 bl=10 > rcuop/5-55 [003] d..1 322.954482: rcu_batch_start: rcu_preempt CBs=70 bl=10 > rcuop/2-33 [003] d..1 322.978146: rcu_batch_start: rcu_preempt CBs=20 bl=10 > rcuop/1-26 [002] d..1 323.014290: rcu_batch_start: rcu_preempt CBs=230 bl=10 > rcuop/4-48 [001] d..1 323.026119: rcu_batch_start: rcu_preempt CBs=73 bl=10 > rcuop/5-55 [003] d..1 323.026175: rcu_batch_start: rcu_preempt CBs=94 bl=10 > rcuop/3-40 [001] d..1 323.035310: rcu_batch_start: rcu_preempt CBs=70 bl=10 > rcuop/0-15 [001] d..1 323.046231: rcu_batch_start: rcu_preempt CBs=165 bl=10 > rcuop/6-62 [005] d..1 323.066132: rcu_batch_start: rcu_preempt CBs=179 bl=10 > rcuop/1-26 [002] d..1 323.174202: rcu_batch_start: rcu_preempt CBs=61 bl=10 > rcuop/2-33 [003] d..1 323.190203: rcu_batch_start: rcu_preempt CBs=80 bl=10 > rcuop/3-40 [003] d..1 323.206210: rcu_batch_start: rcu_preempt CBs=84 bl=10 > rcuop/2-33 [003] d..1 323.226880: rcu_batch_start: rcu_preempt CBs=5 bl=10 And for several seconds you have the same thing going ^^^. > <snip> > > It is on Android with 5.10 kernel running. I do not see that queue_rcu_work() makes > some noise. Your rcu_batch_start tracepoint output above does not really reveal much information about which callbacks are lazy and which are not. rcu_invoke_callback is better in the sense at least you have the name of the callback and can take a guess. > Joel Could you please post your batch_start trace point output where you see the noise? Sure, I'll do that once I get to a device. thanks, - Joel
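[Joel's "promotion" point above can be put in code form. The toy model below is not the kernel's actual callback-list implementation; it only illustrates the rule that a single non-lazy enqueue cancels the laziness of everything already queued. All names, including the start_grace_period_soon() stub, are hypothetical.]

```c
#include <stdbool.h>
#include <stddef.h>

struct cb {
	struct cb *next;
};

struct cblist {
	struct cb *head;
	long lazy_len;	/* callbacks content to wait a long time */
};

/* Hypothetical stand-in for kicking the grace-period machinery. */
static void start_grace_period_soon(void)
{
}

static void enqueue_cb(struct cblist *cl, struct cb *rhp, bool lazy)
{
	rhp->next = cl->head;
	cl->head = rhp;

	if (lazy) {
		cl->lazy_len++;			/* may batch for seconds */
	} else {
		cl->lazy_len = 0;		/* "promote" all lazy CBs */
		start_grace_period_soon();	/* everything runs soon */
	}
}
```

In this model, one caller issuing non-lazy enqueues every 10-20 ms keeps lazy_len pinned at zero, which is exactly the defeat-the-batching effect being debated for queue_rcu_work().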
On Mon, Oct 24, 2022 at 02:31:15PM +0000, Joel Fernandes wrote: > On Mon, Oct 24, 2022 at 02:23:39PM +0200, Uladzislau Rezki wrote: > > > > On Sun, Oct 23, 2022 at 08:36:00PM -0400, Joel Fernandes wrote: > > > > > On Wed, Oct 19, 2022 at 6:51 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > > > > > From: Uladzislau Rezki <urezki@gmail.com> > > > > > > > > > > > > call_rcu() changes to save power will slow down RCU workqueue items > > > > > > queued via queue_rcu_work(). This may not be an issue, however we cannot > > > > > > assume that workqueue users are OK with long delays. Use > > > > > > call_rcu_flush() API instead which reverts to the old behavio > > > > > > > > > > On ChromeOS, I can see that queue_rcu_work() is pretty noisy and the > > > > > batching is much better if we can just keep it as call_rcu() instead > > > > > of call_rcu_flush(). > > > > > > > > > > Is there really any reason to keep it as call_rcu_flush() ? If I > > > > > recall, the real reason Vlad's system was slowing down was because of > > > > > scsi and the queue_rcu_work() conversion was really a red herring. > > > > > > > <snip> > > > *** drivers/acpi/osl.c: > > > acpi_os_drop_map_ref[401] queue_rcu_work(system_wq, &map->track.rwork); > > > > > > *** drivers/gpu/drm/i915/gt/intel_execlists_submission.c: > > > virtual_context_destroy[3653] queue_rcu_work(system_wq, &ve->rcu); > > > > > > *** fs/aio.c: > > > free_ioctx_reqs[632] queue_rcu_work(system_wq, &ctx->free_rwork); > > > > > > *** fs/fs-writeback.c: > > > inode_switch_wbs[604] queue_rcu_work(isw_wq, &isw->work); > > > cleanup_offline_cgwb[676] queue_rcu_work(isw_wq, &isw->work); > > > > > > *** include/linux/workqueue.h: > > > __printf[446] extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork); > > > > > > *** kernel/cgroup/cgroup.c: > > > css_release_work_fn[5253] queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork); > > > css_create[5384] queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork); > > > > > > *** kernel/rcu/tree.c: > > > kfree_rcu_monitor[3192] queue_rcu_work(system_wq, &krwp->rcu_work); > > > > > > *** net/core/skmsg.c: > > > sk_psock_drop[852] queue_rcu_work(system_wq, &psock->rwork); > > > > > > *** net/sched/act_ct.c: > > > tcf_ct_flow_table_put[355] queue_rcu_work(act_ct_wq, &ct_ft->rwork); > > > > > > *** net/sched/cls_api.c: > > > tcf_queue_work[225] return queue_rcu_work(tc_filter_wq, rwork); > > > <snip> > > > There are 9 users of the queue_rcu_work() functions. I think there can be > > > a side effect if we keep it as lazy variant. Please note that i have not > > > checked all those users. > > > > > > > There are less than 20 invocations of queue_rcu_work(), so it should > > > > be possible look through each. The low-risk approach is of course to > > > > have queue_rcu_work() use call_rcu_flush(). > > Yes, once I get to a device (tomorrow), I'll look more. Last I checked it was > kvfree_rcu() -- this was a few weeks/months ago though. > > > > > The next approach might be to have a Kconfig option and/or kernel > > > > boot parameter that allowed a per-system choice. > > > > > > > > But it would not hurt to double-check on Android. > > > > > > > I did not see such noise but i will come back some data on 5.10 kernel > > > today. 
> > > > > Home screen swipe: > > <snip> > > <...>-15 [003] d..1 202.142205: rcu_batch_start: rcu_preempt CBs=105 bl=10 > > <...>-55 [001] d..1 202.166174: rcu_batch_start: rcu_preempt CBs=135 bl=10 > > <...>-26 [001] d..1 202.402182: rcu_batch_start: rcu_preempt CBs=221 bl=10 > > rcuop/3-40 [003] d..1 202.650323: rcu_batch_start: rcu_preempt CBs=213 bl=10 > > rcuop/3-40 [000] d..1 203.210537: rcu_batch_start: rcu_preempt CBs=90 bl=10 > > rcuop/5-55 [001] d..1 204.675671: rcu_batch_start: rcu_preempt CBs=14 bl=10 > > rcuop/2-33 [002] d..1 205.162229: rcu_batch_start: rcu_preempt CBs=649 bl=10 > > rcuop/3-40 [000] d..1 205.418214: rcu_batch_start: rcu_preempt CBs=291 bl=10 > > rcuop/3-40 [003] d..1 206.134204: rcu_batch_start: rcu_preempt CBs=174 bl=10 > > rcuop/0-15 [003] d..1 206.726311: rcu_batch_start: rcu_preempt CBs=738 bl=10 > > rcuop/1-26 [001] d..1 206.814168: rcu_batch_start: rcu_preempt CBs=865 bl=10 > > rcuop/3-40 [003] d..1 207.278178: rcu_batch_start: rcu_preempt CBs=287 bl=10 > > rcuop/1-26 [001] d..1 208.826279: rcu_batch_start: rcu_preempt CBs=506 bl=10 > > <snip> > > This looks fine to me, but.. > > > An app launch: > > <snip> > > rcuop/3-40 [002] d..1 322.118620: rcu_batch_start: rcu_preempt CBs=99 bl=10 > > rcuop/4-48 [005] dn.1 322.454052: rcu_batch_start: rcu_preempt CBs=270 bl=10 > > rcuop/5-55 [005] d..1 322.454109: rcu_batch_start: rcu_preempt CBs=91 bl=10 > > rcuop/5-55 [007] d..1 322.470054: rcu_batch_start: rcu_preempt CBs=106 bl=10 > > rcuop/6-62 [005] d..1 322.482120: rcu_batch_start: rcu_preempt CBs=231 bl=10 > > rcuop/4-48 [001] d..1 322.494150: rcu_batch_start: rcu_preempt CBs=227 bl=10 > > <...>-69 [002] d..1 322.502442: rcu_batch_start: rcu_preempt CBs=3350 bl=26 > > rcuop/1-26 [001] d..1 322.646099: rcu_batch_start: rcu_preempt CBs=1685 bl=13 > > rcuop/2-33 [001] d..1 322.670071: rcu_batch_start: rcu_preempt CBs=438 bl=10 > > rcuop/1-26 [001] d..1 322.674120: rcu_batch_start: rcu_preempt CBs=18 bl=10 > > rcuop/2-33 [003] d..1 322.690152: rcu_batch_start: rcu_preempt CBs=10 bl=10 > > rcuop/1-26 [002] d..1 322.698104: rcu_batch_start: rcu_preempt CBs=10 bl=10 > > rcuop/3-40 [002] d..1 322.706167: rcu_batch_start: rcu_preempt CBs=313 bl=10 > > rcuop/2-33 [003] d..1 322.710075: rcu_batch_start: rcu_preempt CBs=15 bl=10 > > The above does not look fine to me (^^^) from a Lazu-RCU PoV. Here, RCU > callbacks are being invoked every 10-20ms. The batching I seek is of the > order of seconds, when the system is relatively idle. > > Why is Lazy-RCU not in effect for app launch? IOW, which callback is causing > Lazy-RCU to not be lazy here? Could it be queue_rcu_work()? Whenever a > non-lazy callback is queued, all the lazy ones are 'promoted' to non-lazy. > That's why I am asking. Even if you queue one non-lazy callback at a high > enough frequency, the lazy ones will no longer giving you batching or the > benefits of laziness. 
> > > rcuop/3-40 [002] d..1 322.742137: rcu_batch_start: rcu_preempt CBs=13 bl=10 > > rcuop/5-55 [000] d..1 322.754270: rcu_batch_start: rcu_preempt CBs=157 bl=10 > > rcuop/3-40 [000] d..1 322.762182: rcu_batch_start: rcu_preempt CBs=17 bl=10 > > rcuop/2-33 [003] d..1 322.774088: rcu_batch_start: rcu_preempt CBs=38 bl=10 > > rcuop/3-40 [000] d..1 322.778131: rcu_batch_start: rcu_preempt CBs=23 bl=10 > > rcuop/1-26 [002] d..1 322.790105: rcu_batch_start: rcu_preempt CBs=33 bl=10 > > rcuop/4-48 [001] d..1 322.798074: rcu_batch_start: rcu_preempt CBs=340 bl=10 > > rcuop/2-33 [002] d..1 322.806158: rcu_batch_start: rcu_preempt CBs=18 bl=10 > > rcuop/1-26 [002] d..1 322.814057: rcu_batch_start: rcu_preempt CBs=18 bl=10 > > rcuop/0-15 [001] d..1 322.822476: rcu_batch_start: rcu_preempt CBs=333 bl=10 > > rcuop/4-48 [003] d..1 322.830102: rcu_batch_start: rcu_preempt CBs=11 bl=10 > > rcuop/2-33 [001] d..1 322.846109: rcu_batch_start: rcu_preempt CBs=80 bl=10 > > rcuop/3-40 [001] d..1 322.854162: rcu_batch_start: rcu_preempt CBs=145 bl=10 > > rcuop/4-48 [003] d..1 322.874129: rcu_batch_start: rcu_preempt CBs=21 bl=10 > > rcuop/3-40 [001] d..1 322.878149: rcu_batch_start: rcu_preempt CBs=43 bl=10 > > rcuop/3-40 [001] d..1 322.906273: rcu_batch_start: rcu_preempt CBs=10 bl=10 > > rcuop/4-48 [001] d..1 322.918201: rcu_batch_start: rcu_preempt CBs=23 bl=10 > > rcuop/2-33 [001] d..1 322.926212: rcu_batch_start: rcu_preempt CBs=86 bl=10 > > rcuop/2-33 [001] d..1 322.946251: rcu_batch_start: rcu_preempt CBs=12 bl=10 > > rcuop/5-55 [003] d..1 322.954482: rcu_batch_start: rcu_preempt CBs=70 bl=10 > > rcuop/2-33 [003] d..1 322.978146: rcu_batch_start: rcu_preempt CBs=20 bl=10 > > rcuop/1-26 [002] d..1 323.014290: rcu_batch_start: rcu_preempt CBs=230 bl=10 > > rcuop/4-48 [001] d..1 323.026119: rcu_batch_start: rcu_preempt CBs=73 bl=10 > > rcuop/5-55 [003] d..1 323.026175: rcu_batch_start: rcu_preempt CBs=94 bl=10 > > rcuop/3-40 [001] d..1 323.035310: rcu_batch_start: rcu_preempt CBs=70 bl=10 > > rcuop/0-15 [001] d..1 323.046231: rcu_batch_start: rcu_preempt CBs=165 bl=10 > > rcuop/6-62 [005] d..1 323.066132: rcu_batch_start: rcu_preempt CBs=179 bl=10 > > rcuop/1-26 [002] d..1 323.174202: rcu_batch_start: rcu_preempt CBs=61 bl=10 > > rcuop/2-33 [003] d..1 323.190203: rcu_batch_start: rcu_preempt CBs=80 bl=10 > > rcuop/3-40 [003] d..1 323.206210: rcu_batch_start: rcu_preempt CBs=84 bl=10 > > rcuop/2-33 [003] d..1 323.226880: rcu_batch_start: rcu_preempt CBs=5 bl=10 > > And for several seconds you have the same thing going ^^^. > > > <snip> > > > > It is on Android with 5.10 kernel running. I do not see that queue_rcu_work() makes > > some noise. > > Your rcu_batch_start tracepoint output above does not really reveal much > information about which callbacks are lazy and which are not. > rcu_invoke_callback is better in the sense at least you have the name of the > callback and can take a guess. > > > Joel Could you please post your batch_start trace point output where you see the noise? > > Sure, I'll do that once I get to a device. You guys might need to agree on the definition of "good" here. Or maybe understand the differences in your respective platforms' definitions of "good". ;-) Thanx, Paul
>
> You guys might need to agree on the definition of "good" here. Or maybe
> understand the differences in your respective platforms' definitions of
> "good". ;-)
>
Indeed. Bad is when it happens once per millisecond, indefinitely :) At
least in such a workload I can detect a power delta and power gain.
Anyway, below is a new trace where I do not use the "flush" variant for
kvfree_rcu():

<snip>
1. Home screen swipe:
 rcuop/0-15 [003] d..1 1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
 rcuop/2-33 [002] d..1 1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
 rcuop/3-40 [001] d..1 1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
 rcuop/1-26 [003] d..1 1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
 rcuop/4-48 [001] d..1 1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
 rcuop/5-55 [002] d..1 1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
 rcuop/6-62 [005] d..1 1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
 rcuop/2-33 [002] d..1 1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
 rcuop/0-15 [003] d..1 1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23

2. App launches:
 rcuop/4-48 [005] d..1 1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
 rcuop/7-69 [007] d..1 1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
 rcuop/5-55 [004] d..1 1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
 rcuop/0-15 [004] d..1 1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
 rcuop/1-26 [006] d..1 1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
 rcuop/2-33 [006] d..1 1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
 rcuop/3-40 [006] d..1 1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
 rcuop/4-48 [002] d..1 1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
 rcuop/7-69 [001] d..1 1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
 <...>-62   [002] d..1 1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
 rcuop/6-62 [000] d..1 1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
 <...>-62   [003] d..1 1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
 <...>-26   [001] d..1 1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
 rcuop/2-33 [001] d..1 1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
 <...>-40   [001] d..1 1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
 rcuop/2-33 [005] d..1 1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
 rcuop/2-33 [005] d..1 1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
 rcuop/2-33 [005] d..1 1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
 rcuop/0-15 [002] d..1 1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
 rcuop/0-15 [003] d..1 1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
 rcuop/5-55 [004] d..1 1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
 rcuop/5-55 [004] d..1 1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
 rcuop/6-62 [001] dn.1 1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
 rcuop/6-62 [006] d..1 1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
 rcuop/0-15 [003] d..1 1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
 rcuop/0-15 [003] d..1 1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
<snip>

It is much better. But as I wrote earlier, there is a patch that I
submitted some time ago improving kvfree_rcu() batching:

<snip>
commit 51824b780b719c53113dc39e027fbf670dc66028
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Thu Jun 30 18:33:35 2022 +0200

    rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval

    Currently the monitor work is scheduled with a fixed interval of HZ/20,
    which is roughly 50 milliseconds. The drawback of this approach is
    low utilization of the 512 page slots in scenarios with infrequent
    kvfree_rcu() calls. For example on an Android system:
<snip>

The trace that I posted was taken without it.

--
Uladzislau Rezki
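[For reference, the mechanism that commit tunes works roughly as sketched below: kvfree_rcu() stashes pointers into page-backed blocks and a delayed "monitor" work drains them after a grace period. The structure, field names, and the interval heuristic here are all illustrative, not the actual kernel/rcu/tree.c code; only struct delayed_work and schedule_delayed_work() are real APIs.]

```c
#include <linux/workqueue.h>

/* Illustrative per-CPU batching state; not the real kfree_rcu_cpu. */
struct kvfree_batch {
	struct delayed_work monitor_work;
	int nr_queued;			/* pointers waiting in page slots */
};

#define DRAIN_JIFFIES_FIXED	(HZ / 20)	/* the old 50 ms interval */
#define DRAIN_JIFFIES_RELAXED	(HZ)		/* when slots fill slowly */
#define PAGE_SLOTS		512

/* The commit's idea, approximately: with few queued pointers, wait
 * longer before draining so more kvfree_rcu() calls share one grace
 * period and the page slots are better utilized. */
static void schedule_monitor(struct kvfree_batch *kb)
{
	unsigned long delay = kb->nr_queued < PAGE_SLOTS / 2 ?
			      DRAIN_JIFFIES_RELAXED : DRAIN_JIFFIES_FIXED;

	schedule_delayed_work(&kb->monitor_work, delay);
}
```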
On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote: > > > > You guys might need to agree on the definition of "good" here. Or maybe > > understand the differences in your respective platforms' definitions of > > "good". ;-) > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use > workload a can detect a power delta and power gain. Anyway, below is a new > trace where i do not use "flush" variant for the kvfree_rcu(): > > <snip> > 1. Home screen swipe: > rcuop/0-15 [003] d..1 1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10 > rcuop/2-33 [002] d..1 1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10 > rcuop/3-40 [001] d..1 1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11 > rcuop/1-26 [003] d..1 1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16 > rcuop/4-48 [001] d..1 1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10 > rcuop/5-55 [002] d..1 1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10 > rcuop/6-62 [005] d..1 1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10 > rcuop/2-33 [002] d..1 1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29 > rcuop/0-15 [003] d..1 1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23 > 2. App launches: > rcuop/4-48 [005] d..1 1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47 > rcuop/7-69 [007] d..1 1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42 > rcuop/5-55 [004] d..1 1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66 > rcuop/0-15 [004] d..1 1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19 > rcuop/1-26 [006] d..1 1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20 > rcuop/2-33 [006] d..1 1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18 > rcuop/3-40 [006] d..1 1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28 > rcuop/4-48 [002] d..1 1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81 > rcuop/7-69 [001] d..1 1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98 > <...>-62 [002] d..1 1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110 > rcuop/6-62 [000] d..1 1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84 > <...>-62 [003] d..1 1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14 > <...>-26 [001] d..1 1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45 > rcuop/2-33 [001] d..1 1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29 > <...>-40 [001] d..1 1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50 > rcuop/2-33 [005] d..1 1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10 > rcuop/2-33 [005] d..1 1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10 > rcuop/2-33 [005] d..1 1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10 > rcuop/0-15 [002] d..1 1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49 > rcuop/0-15 [003] d..1 1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10 > rcuop/5-55 [004] d..1 1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125 > rcuop/5-55 [004] d..1 1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61 > rcuop/6-62 [001] dn.1 1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65 > rcuop/6-62 [006] d..1 1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10 > rcuop/0-15 [003] d..1 1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14 > rcuop/0-15 [003] d..1 1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10 > <snip> > > it is much more better. But. 
As i wrote earlier there is a patch that i have submitted > some time ago improving kvfree_rcu() batching: > > <snip> > commit 51824b780b719c53113dc39e027fbf670dc66028 > Author: Uladzislau Rezki (Sony) <urezki@gmail.com> > Date: Thu Jun 30 18:33:35 2022 +0200 > > rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval > > Currently the monitor work is scheduled with a fixed interval of HZ/20, > which is roughly 50 milliseconds. The drawback of this approach is > low utilization of the 512 page slots in scenarios with infrequence > kvfree_rcu() calls. For example on an Android system: > <snip> > > The trace that i posted was taken without it. And if I am not getting too confused, that patch is now in mainline. So it does make sense to rely on it, then. ;-) Thanx, Paul
On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> >
> > You guys might need to agree on the definition of "good" here. Or maybe
> > understand the differences in your respective platforms' definitions of
> > "good". ;-)
> >
> Indeed. Bad is when it happens once per millisecond, indefinitely :) At
> least in such a workload I can detect a power delta and power gain.

To me once per ms is really bad, and once per 20 ms indefinitely is also
not ideal ;-). Just to give you a sense of why I feel this: the periodic
RCU thread wakeups can disturb CPUidle. The act of queuing a callback,
plus the grace-period delay, plus the RCU threads running, is enough to
disrupt the overlap between CPUidle time and the grace-period delay.
Further, the idle governor will refrain from entering deeper CPUidle
states because it will see timers queued in the near future to wake up
the RCU grace-period kthreads.

> Anyway, below is a new trace where I do not use the "flush" variant for
> kvfree_rcu():
>
> <snip>
> 1. Home screen swipe:
>  rcuop/0-15 [003] d..1 1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
>  rcuop/2-33 [002] d..1 1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
>  rcuop/3-40 [001] d..1 1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
>  rcuop/1-26 [003] d..1 1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
>  rcuop/4-48 [001] d..1 1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
>  rcuop/5-55 [002] d..1 1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
>  rcuop/6-62 [005] d..1 1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
>  rcuop/2-33 [002] d..1 1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
>  rcuop/0-15 [003] d..1 1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
>
> 2. App launches:
>  rcuop/4-48 [005] d..1 1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
>  rcuop/7-69 [007] d..1 1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
>  rcuop/5-55 [004] d..1 1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
>  rcuop/0-15 [004] d..1 1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
>  rcuop/1-26 [006] d..1 1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
>  rcuop/2-33 [006] d..1 1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
>  rcuop/3-40 [006] d..1 1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
>  rcuop/4-48 [002] d..1 1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
>  rcuop/7-69 [001] d..1 1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
>  <...>-62   [002] d..1 1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
>  rcuop/6-62 [000] d..1 1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
>  <...>-62   [003] d..1 1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
>  <...>-26   [001] d..1 1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
>  rcuop/2-33 [001] d..1 1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
>  <...>-40   [001] d..1 1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
>  rcuop/2-33 [005] d..1 1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
>  rcuop/2-33 [005] d..1 1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
>  rcuop/2-33 [005] d..1 1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
>  rcuop/0-15 [002] d..1 1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
>  rcuop/0-15 [003] d..1 1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
>  rcuop/5-55 [004] d..1 1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
>  rcuop/5-55 [004] d..1 1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
>  rcuop/6-62 [001] dn.1 1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
>  rcuop/6-62 [006] d..1 1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
>  rcuop/0-15 [003] d..1 1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
>  rcuop/0-15 [003] d..1 1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> <snip>
>
> It is much better. But as I wrote earlier, there is a patch that I
> submitted some time ago improving kvfree_rcu() batching:

Yes, it seems much better than your last traces! I'd propose to drop this
patch because, as you show, it affects not only yours but ChromeOS too.
It appears kvfree_rcu()'s use of queue_rcu_work() is a perfect candidate
for call_rcu() batching, because it is purely driven by memory pressure.
And we have a shrinker for lazy-RCU as well. For non-kvfree uses, we can
introduce a queue_rcu_work_flush() if need be.

What do you think?

thanks,

 - Joel
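[The queue_rcu_work_flush() Joel floats might simply mirror the queue_rcu_work() body from the patch at the top of the thread. A hypothetical sketch only; no such function exists in the series.]

```c
/* Hypothetical: identical to queue_rcu_work(), but always uses the
 * non-lazy call_rcu_flush() for callers that need their work to run
 * after a promptly started grace period. */
bool queue_rcu_work_flush(struct workqueue_struct *wq, struct rcu_work *rwork)
{
	struct work_struct *work = &rwork->work;

	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
		rwork->wq = wq;
		call_rcu_flush(&rwork->rcu, rcu_work_rcufn);
		return true;
	}

	return false;
}
```

With that in place, queue_rcu_work() itself could stay on the lazy call_rcu(), and only the handful of latency-sensitive call sites from Vlad's list would need auditing and conversion.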
On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote: > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote: > > > > > > You guys might need to agree on the definition of "good" here. Or maybe > > > understand the differences in your respective platforms' definitions of > > > "good". ;-) > > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use > > workload a can detect a power delta and power gain. Anyway, below is a new > > trace where i do not use "flush" variant for the kvfree_rcu(): > > > > <snip> > > 1. Home screen swipe: > > rcuop/0-15 [003] d..1 1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10 > > rcuop/2-33 [002] d..1 1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10 > > rcuop/3-40 [001] d..1 1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11 > > rcuop/1-26 [003] d..1 1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16 > > rcuop/4-48 [001] d..1 1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10 > > rcuop/5-55 [002] d..1 1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10 > > rcuop/6-62 [005] d..1 1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10 > > rcuop/2-33 [002] d..1 1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29 > > rcuop/0-15 [003] d..1 1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23 > > 2. App launches: > > rcuop/4-48 [005] d..1 1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47 > > rcuop/7-69 [007] d..1 1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42 > > rcuop/5-55 [004] d..1 1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66 > > rcuop/0-15 [004] d..1 1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19 > > rcuop/1-26 [006] d..1 1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20 > > rcuop/2-33 [006] d..1 1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18 > > rcuop/3-40 [006] d..1 1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28 > > rcuop/4-48 [002] d..1 1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81 > > rcuop/7-69 [001] d..1 1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98 > > <...>-62 [002] d..1 1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110 > > rcuop/6-62 [000] d..1 1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84 > > <...>-62 [003] d..1 1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14 > > <...>-26 [001] d..1 1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45 > > rcuop/2-33 [001] d..1 1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29 > > <...>-40 [001] d..1 1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50 > > rcuop/2-33 [005] d..1 1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10 > > rcuop/2-33 [005] d..1 1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10 > > rcuop/2-33 [005] d..1 1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10 > > rcuop/0-15 [002] d..1 1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49 > > rcuop/0-15 [003] d..1 1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10 > > rcuop/5-55 [004] d..1 1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125 > > rcuop/5-55 [004] d..1 1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61 > > rcuop/6-62 [001] dn.1 1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65 > > rcuop/6-62 [006] d..1 1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10 > > rcuop/0-15 [003] d..1 1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14 > > rcuop/0-15 [003] d..1 1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10 > > <snip> > > > > it is much more better. But. 
As i wrote earlier there is a patch that i have submitted > > some time ago improving kvfree_rcu() batching: > > > > <snip> > > commit 51824b780b719c53113dc39e027fbf670dc66028 > > Author: Uladzislau Rezki (Sony) <urezki@gmail.com> > > Date: Thu Jun 30 18:33:35 2022 +0200 > > > > rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval > > > > Currently the monitor work is scheduled with a fixed interval of HZ/20, > > which is roughly 50 milliseconds. The drawback of this approach is > > low utilization of the 512 page slots in scenarios with infrequence > > kvfree_rcu() calls. For example on an Android system: > > <snip> > > > > The trace that i posted was taken without it. > > And if I am not getting too confused, that patch is now in mainline. > So it does make sense to rely on it, then. ;-) > Right. urezki@pc638:~/data/raid0/coding/linux.git$ git tag --contains 51824b780b719c53113dc39e027fbf670dc66028 v6.1-rc1 v6.1-rc2 urezki@pc638:~/data/raid0/coding/linux.git$ -- Uladzislau Rezki
On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote: > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote: > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote: > > > > > > > > You guys might need to agree on the definition of "good" here. Or maybe > > > > understand the differences in your respective platforms' definitions of > > > > "good". ;-) > > > > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use > > > workload a can detect a power delta and power gain. Anyway, below is a new > > > trace where i do not use "flush" variant for the kvfree_rcu(): > > > > > > <snip> > > > 1. Home screen swipe: > > > rcuop/0-15 [003] d..1 1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10 > > > rcuop/2-33 [002] d..1 1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10 > > > rcuop/3-40 [001] d..1 1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11 > > > rcuop/1-26 [003] d..1 1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16 > > > rcuop/4-48 [001] d..1 1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10 > > > rcuop/5-55 [002] d..1 1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10 > > > rcuop/6-62 [005] d..1 1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10 > > > rcuop/2-33 [002] d..1 1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29 > > > rcuop/0-15 [003] d..1 1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23 > > > 2. App launches: > > > rcuop/4-48 [005] d..1 1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47 > > > rcuop/7-69 [007] d..1 1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42 > > > rcuop/5-55 [004] d..1 1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66 > > > rcuop/0-15 [004] d..1 1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19 > > > rcuop/1-26 [006] d..1 1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20 > > > rcuop/2-33 [006] d..1 1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18 > > > rcuop/3-40 [006] d..1 1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28 > > > rcuop/4-48 [002] d..1 1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81 > > > rcuop/7-69 [001] d..1 1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98 > > > <...>-62 [002] d..1 1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110 > > > rcuop/6-62 [000] d..1 1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84 > > > <...>-62 [003] d..1 1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14 > > > <...>-26 [001] d..1 1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45 > > > rcuop/2-33 [001] d..1 1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29 > > > <...>-40 [001] d..1 1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50 > > > rcuop/2-33 [005] d..1 1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10 > > > rcuop/2-33 [005] d..1 1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10 > > > rcuop/2-33 [005] d..1 1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10 > > > rcuop/0-15 [002] d..1 1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49 > > > rcuop/0-15 [003] d..1 1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10 > > > rcuop/5-55 [004] d..1 1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125 > > > rcuop/5-55 [004] d..1 1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61 > > > rcuop/6-62 [001] dn.1 1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65 > > > rcuop/6-62 [006] d..1 1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10 > > > rcuop/0-15 [003] d..1 1844.659812: rcu_batch_start: 
rcu_preempt CBs=1851 bl=14
> > > rcuop/0-15 [003] d..1 1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > > <snip>
> > >
> > > It is much better. But as I wrote earlier, there is a patch that I
> > > submitted some time ago improving kvfree_rcu() batching:
> > >
> > > <snip>
> > > commit 51824b780b719c53113dc39e027fbf670dc66028
> > > Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > Date:   Thu Jun 30 18:33:35 2022 +0200
> > >
> > >     rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval
> > >
> > >     Currently the monitor work is scheduled with a fixed interval of HZ/20,
> > >     which is roughly 50 milliseconds. The drawback of this approach is
> > >     low utilization of the 512 page slots in scenarios with infrequent
> > >     kvfree_rcu() calls. For example on an Android system:
> > > <snip>
> > >
> > > The trace that I posted was taken without it.
> >
> > And if I am not getting too confused, that patch is now in mainline.
> > So it does make sense to rely on it, then. ;-)
>
> Right.
>
> urezki@pc638:~/data/raid0/coding/linux.git$ git tag --contains
> 51824b780b719c53113dc39e027fbf670dc66028
> v6.1-rc1
> v6.1-rc2
> urezki@pc638:~/data/raid0/coding/linux.git$

Just in case: 5.10 + "rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval":

<snip>
1. Home screen swipe:
 rcuop/3-40 [003] d..1  94.202849: rcu_batch_start: rcu_preempt CBs=664 bl=10
 rcuop/4-48 [001] d..1  95.999352: rcu_batch_start: rcu_preempt CBs=252 bl=10
 rcuop/6-62 [002] d..1  97.534875: rcu_batch_start: rcu_preempt CBs=152 bl=10
 rcuop/5-55 [003] d..1  98.042912: rcu_batch_start: rcu_preempt CBs=189 bl=10
 rcuop/0-15 [002] d..1  98.306769: rcu_batch_start: rcu_preempt CBs=1457 bl=11
 rcuop/1-26 [000] d..1  99.582931: rcu_batch_start: rcu_preempt CBs=2115 bl=16
 rcuop/2-33 [003] d..1  99.582935: rcu_batch_start: rcu_preempt CBs=2019 bl=15
 rcuop/3-40 [001] d..1  99.838885: rcu_batch_start: rcu_preempt CBs=1168 bl=10
 rcuop/1-26 [000] d..1 100.603496: rcu_batch_start: rcu_preempt CBs=168 bl=10

2. App launches:
 rcuop/4-48 [007] d..1 102.910580: rcu_batch_start: rcu_preempt CBs=1150 bl=10
 rcuop/6-62 [007] d..1 102.910682: rcu_batch_start: rcu_preempt CBs=1001 bl=10
 rcuop/5-55 [007] d..1 103.166607: rcu_batch_start: rcu_preempt CBs=939 bl=10
 rcuop/0-15 [007] d..1 104.450598: rcu_batch_start: rcu_preempt CBs=1694 bl=13
 rcuop/5-55 [006] d..1 104.478640: rcu_batch_start: rcu_preempt CBs=3125 bl=24
 rcuop/3-40 [007] d..1 104.958565: rcu_batch_start: rcu_preempt CBs=1108 bl=10
 rcuop/7-69 [007] d..1 106.238634: rcu_batch_start: rcu_preempt CBs=10275 bl=80
 rcuop/4-48 [007] d..1 107.258586: rcu_batch_start: rcu_preempt CBs=8142 bl=63
 rcuop/7-69 [007] d..1 107.260769: rcu_batch_start: rcu_preempt CBs=1880 bl=14
 rcuop/2-33 [007] d..1 107.526638: rcu_batch_start: rcu_preempt CBs=1968 bl=15
 rcuop/1-26 [007] d..1 107.542612: rcu_batch_start: rcu_preempt CBs=1796 bl=14
 rcuop/5-55 [007] d..1 108.286588: rcu_batch_start: rcu_preempt CBs=3547 bl=27
 rcuop/6-62 [007] d..1 108.287639: rcu_batch_start: rcu_preempt CBs=5820 bl=45
 rcuop/7-69 [007] d..1 108.290548: rcu_batch_start: rcu_preempt CBs=2430 bl=18
 rcuop/0-15 [000] d..1 109.826843: rcu_batch_start: rcu_preempt CBs=2282 bl=17
 rcuop/3-40 [002] d..1 110.595455: rcu_batch_start: rcu_preempt CBs=1960 bl=15
 <...>-48   [005] d..1 112.390702: rcu_batch_start: rcu_preempt CBs=5143 bl=40
 rcuop/7-69 [004] d..1 112.402607: rcu_batch_start: rcu_preempt CBs=3379 bl=26
 rcuop/2-33 [005] d..1 112.638614: rcu_batch_start: rcu_preempt CBs=3223 bl=25
 rcuop/1-26 [004] d..1 112.638617: rcu_batch_start: rcu_preempt CBs=3026 bl=23
 rcuop/5-55 [007] d..1 113.402581: rcu_batch_start: rcu_preempt CBs=7251 bl=56
 rcuop/6-62 [007] d..1 113.658582: rcu_batch_start: rcu_preempt CBs=7035 bl=54
<snip>

--
Uladzislau Rezki
On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > >
> > > > > You guys might need to agree on the definition of "good" here. Or maybe
> > > > > understand the differences in your respective platforms' definitions of
> > > > > "good". ;-)
> > > > >
> > > > Indeed. Bad is when it fires once per millisecond indefinitely :) At least
> > > > in such a workload i can detect a power delta and power gain. Anyway, below
> > > > is a new trace where i do not use the "flush" variant for the kvfree_rcu():
> > > >
> > > > <snip>
> > > > 1. Home screen swipe:
> > > > rcuop/0-15 [003] d..1 1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > > > rcuop/2-33 [002] d..1 1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > > > rcuop/3-40 [001] d..1 1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > > > rcuop/1-26 [003] d..1 1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > > > rcuop/4-48 [001] d..1 1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > > > rcuop/5-55 [002] d..1 1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > > > rcuop/6-62 [005] d..1 1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > > > rcuop/2-33 [002] d..1 1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > > > rcuop/0-15 [003] d..1 1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > > >
> > > > 2. App launches:
> > > > rcuop/4-48 [005] d..1 1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > > > rcuop/7-69 [007] d..1 1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > > > rcuop/5-55 [004] d..1 1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > > > rcuop/0-15 [004] d..1 1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > > > rcuop/1-26 [006] d..1 1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > > > rcuop/2-33 [006] d..1 1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > > > rcuop/3-40 [006] d..1 1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > > > rcuop/4-48 [002] d..1 1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > > > rcuop/7-69 [001] d..1 1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > > > <...>-62 [002] d..1 1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > > > rcuop/6-62 [000] d..1 1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > > > <...>-62 [003] d..1 1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > > > <...>-26 [001] d..1 1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > > > rcuop/2-33 [001] d..1 1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > > > <...>-40 [001] d..1 1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > > > rcuop/2-33 [005] d..1 1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > > > rcuop/2-33 [005] d..1 1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > > > rcuop/2-33 [005] d..1 1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > > > rcuop/0-15 [002] d..1 1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > > > rcuop/0-15 [003] d..1 1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > > > rcuop/5-55 [004] d..1 1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > > > rcuop/5-55 [004] d..1 1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > > > rcuop/6-62 [001] dn.1 1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > > > rcuop/6-62 [006] d..1 1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > > > rcuop/0-15 [003] d..1 1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > > > rcuop/0-15 [003] d..1 1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > > > <snip>

Definitely better, but I'd still ask why not just rely on the lazy
batching that we now have, since it is a memory-pressure-related use
case. Or another approach could be, for CONFIG_RCU_LAZY, don't
disturb the lazy-RCU batching by queuing these "free memory" CBs; and
instead keep your improved kvfree_rcu() batching only for
!CONFIG_RCU_LAZY.
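To make the !CONFIG_RCU_LAZY idea above concrete, here is a minimal
sketch of how the kvfree_rcu() drain interval could be gated on the
Kconfig option. Everything in it is hypothetical, not mainline code:
the struct layout is a stand-in for the real per-CPU state in
kernel/rcu/tree.c, and KFREE_DRAIN_JIFFIES_LAZY is made up.

<snip>
#include <linux/kconfig.h>
#include <linux/workqueue.h>

/* Hypothetical stand-in for the real per-CPU state in kernel/rcu/tree.c. */
struct kfree_rcu_cpu {
	struct delayed_work monitor_work;	/* drains the cached pointers */
};

#define KFREE_DRAIN_JIFFIES	  (HZ / 20)	/* ~50 ms, the historical interval */
#define KFREE_DRAIN_JIFFIES_LAZY  (10 * HZ)	/* made-up lazy-friendly interval */

static void schedule_kvfree_drain(struct kfree_rcu_cpu *krcp)
{
	/*
	 * With lazy call_rcu() doing the batching, the drain can afford
	 * to wait much longer; otherwise keep the short interval.
	 */
	unsigned long delay = IS_ENABLED(CONFIG_RCU_LAZY) ?
			      KFREE_DRAIN_JIFFIES_LAZY : KFREE_DRAIN_JIFFIES;

	schedule_delayed_work(&krcp->monitor_work, delay);
}
<snip>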
On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > [ . . . ]
>
> Definitely better, but I'd still ask why not just rely on the lazy
> batching that we now have, since it is a memory-pressure-related use
> case. Or another approach could be, for CONFIG_RCU_LAZY, don't
> disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> instead keep your improved kvfree_rcu() batching only for
> !CONFIG_RCU_LAZY.

Given that making the kvfree_rcu()-level batching conditional on
CONFIG_RCU_LAZY would complicate the code, what bad thing happens when
keeping the kvfree_rcu()-level batching unconditionally?

							Thanx, Paul
On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > [ . . . ]
>
> Definitely better, but I'd still ask why not just rely on the lazy
> batching that we now have, since it is a memory-pressure-related use
> case. Or another approach could be, for CONFIG_RCU_LAZY, don't
> disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> instead keep your improved kvfree_rcu() batching only for
> !CONFIG_RCU_LAZY.
>
1. Double-batching?

The kvfree_rcu() interface itself keeps track of when to reclaim:
  a) when a page is full;
  b) when there is a high storm of freeing over RCU;
  c) when there is a low-memory condition.

Such control stays inside kvfree_rcu(). Converting it to the lazy
variant:
  a) loses that control, which will become a problem;
  b) improves nothing.

2. Converting queue_rcu_work() to the lazy variant breaks the human
interpretation of when a queued work is supposed to run. People do
not expect seconds when they queue the work. Same for kvfree_rcu():
we do not expect it either; we even used a high-priority queue in the
beginning. There are ~10 users who queue the work, and they did not
expect it to run in 10 seconds when they wrote the code.

3. With "rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval" there is no
sense in doing it. Same data in active and idle use cases.

--
Uladzislau Rezki
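A rough sketch of the control described in point 1, with made-up
helper names standing in for the real internal checks in
kernel/rcu/tree.c:

<snip>
#include <linux/types.h>

struct kfree_rcu_cpu;

/* Made-up helper declarations so the sketch is self-contained. */
static bool krcp_page_is_full(struct kfree_rcu_cpu *krcp);
static bool krcp_freeing_storm(struct kfree_rcu_cpu *krcp);
static bool krcp_low_on_memory(void);

/*
 * Illustration only: the three conditions under which kvfree_rcu()'s
 * own batching drains right away instead of waiting out the normal
 * KFREE_DRAIN_JIFFIES interval.
 */
static bool kvfree_need_immediate_drain(struct kfree_rcu_cpu *krcp)
{
	return krcp_page_is_full(krcp) ||	/* a) the 512-slot page filled up */
	       krcp_freeing_storm(krcp) ||	/* b) a high storm of freeing over RCU */
	       krcp_low_on_memory();		/* c) a low-memory condition */
}
<snip>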
On Mon, Oct 24, 2022 at 1:40 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> > [ . . . ]
>
> 1. Double-batching?
>
> The kvfree_rcu() interface itself keeps track of when to reclaim:
>   a) when a page is full;
>   b) when there is a high storm of freeing over RCU;
>   c) when there is a low-memory condition.
>
> Such control stays inside kvfree_rcu(). Converting it to the lazy
> variant:
>   a) loses that control, which will become a problem;
>   b) improves nothing.

AFAICS, the only thing being changed is when you are giving memory
back to the system. So you will be holding on to memory a bit longer.
And there are shrinkers that are already flushing those. I don't
think the users of kvfree_rcu() want to free memory instantly. If
there is such a use case, please share it.

> 2. Converting queue_rcu_work() to the lazy variant breaks the human
> interpretation of when a queued work is supposed to run. People do
> not expect seconds when they queue the work.

Which people? ;)

> Same for kvfree_rcu(): we do not expect it either; we even used a
> high-priority queue in the beginning. There are ~10 users who queue
> the work, and they did not expect it to run in 10 seconds when they
> wrote the code.

That's a bit of a misinterpretation of what I'm saying. A variant
queue_rcu_work_flush() can be added for those users (such as ones
that are not freeing memory).

Thanks.
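At this point in the series, queue_rcu_work_flush() does not exist;
the sketch below only illustrates the split being proposed, with the
flush variant as a hypothetical API. The two example callers and
their names are made up for illustration.

<snip>
#include <linux/workqueue.h>

/* Hypothetical API, not in mainline: a non-lazy counterpart to
 * queue_rcu_work() that would queue the callback via the non-lazy
 * call_rcu_flush() proposed in this series. */
bool queue_rcu_work_flush(struct workqueue_struct *wq,
			  struct rcu_work *rwork);

static void example_reclaim_user(struct rcu_work *rwork)
{
	/* Freeing memory: batching/laziness is acceptable here. */
	queue_rcu_work(system_wq, rwork);
}

static void example_time_sensitive_user(struct rcu_work *rwork)
{
	/* Cannot tolerate multi-second delays: would use the
	 * hypothetical flush variant instead. */
	queue_rcu_work_flush(system_wq, rwork);
}
<snip>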
On Mon, Oct 24, 2022 at 1:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> > [ . . . ]
> >
> > Definitely better, but I'd still ask why not just rely on the lazy
> > batching that we now have, since it is a memory-pressure-related use
> > case. Or another approach could be, for CONFIG_RCU_LAZY, don't
> > disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> > instead keep your improved kvfree_rcu() batching only for
> > !CONFIG_RCU_LAZY.
>
> Given that making the kvfree_rcu()-level batching conditional on
> CONFIG_RCU_LAZY would complicate the code, what bad thing happens when
> keeping the kvfree_rcu()-level batching unconditionally?

The bad thing happening is power impact. There is a noticeable impact
in our testing, and when we dropped this particular patch, we got much
better results.

I also ran rcutop, and I see that without the patch I have several
seconds of laziness at a time, unlike with the patch.

Even in the beginning, when I came up with an implementation for
call_rcu_lazy(), I had to mark queue_rcu_work() as lazy as well, since
it was quite frequent (on ChromeOS). But when we introduced the
flush() API, I forgot to not use flush() on it. Unfortunately, this
patch slipped into my last series while Vlad and I were debugging the
SCSI issue, and it did not really help for the SCSI issue itself.

Thanks,

- Joel
On Mon, Oct 24, 2022 at 4:12 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Mon, Oct 24, 2022 at 1:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > [ . . . ]
>
> The bad thing happening is power impact. There is a noticeable impact
> in our testing, and when we dropped this particular patch, we got much
> better results.
>
> [ . . . ]
>
> Even in the beginning, when I came up with an implementation for
> call_rcu_lazy(), I had to mark queue_rcu_work() as lazy as well, since
> it was quite frequent (on ChromeOS). But when we introduced the
> flush() API, I forgot to not use flush() on it. Unfortunately, this
> patch slipped into my last series while Vlad and I were debugging the
> SCSI issue, and it did not really help for the SCSI issue itself.

I could try to run Vlad's other mainline patch itself and measure
power; I'll get back on that. Thanks!
On Mon, Oct 24, 2022 at 04:12:59PM -0400, Joel Fernandes wrote:
> On Mon, Oct 24, 2022 at 1:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > [ . . . ]
> >
> > Given that making the kvfree_rcu()-level batching conditional on
> > CONFIG_RCU_LAZY would complicate the code, what bad thing happens when
> > keeping the kvfree_rcu()-level batching unconditionally?
>
> The bad thing happening is power impact. There is a noticeable impact
> in our testing, and when we dropped this particular patch, we got much
> better results.
>
> I also ran rcutop, and I see that without the patch I have several
> seconds of laziness at a time, unlike with the patch.

Fair point, but is this visible at the power meter?

							Thanx, Paul

> Even in the beginning, when I came up with an implementation for
> call_rcu_lazy(), I had to mark queue_rcu_work() as lazy as well, since
> it was quite frequent (on ChromeOS). But when we introduced the
> flush() API, I forgot to not use flush() on it. Unfortunately, this
> patch slipped into my last series while Vlad and I were debugging the
> SCSI issue, and it did not really help for the SCSI issue itself.
>
> Thanks,
>
> - Joel
On Mon, Oct 24, 2022 at 4:19 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Oct 24, 2022 at 04:12:59PM -0400, Joel Fernandes wrote:
> > [ . . . ]
> >
> > The bad thing happening is power impact. There is a noticeable impact
> > in our testing, and when we dropped this particular patch, we got much
> > better results.
> >
> > I also ran rcutop, and I see that without the patch I have several
> > seconds of laziness at a time, unlike with the patch.
>
> Fair point, but is this visible at the power meter?

Yes, it is, and it came up as part of the debugging I did; I am not
making it up ;-) The delta in power is 10%. As you saw in Vlad's
traces as well, kvfree_rcu() can be called quite frequently.

- Joel
On Mon, Oct 24, 2022 at 04:08:17PM -0400, Joel Fernandes wrote:
> On Mon, Oct 24, 2022 at 1:40 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > [ . . . ]
> >
> > 1. Double-batching?
> >
> > The kvfree_rcu() interface itself keeps track of when to reclaim:
> >   a) when a page is full;
> >   b) when there is a high storm of freeing over RCU;
> >   c) when there is a low-memory condition.
> >
> > Such control stays inside kvfree_rcu(). Converting it to the lazy
> > variant:
> >   a) loses that control, which will become a problem;
> >   b) improves nothing.
>
> AFAICS, the only thing being changed is when you are giving memory
> back to the system. So you will be holding on to memory a bit longer.
> And there are shrinkers that are already flushing those. I don't
> think the users of kvfree_rcu() want to free memory instantly. If
> there is such a use case, please share it.
>
Actually, ideally, we want to free memory ASAP. The problem with an
extra 10 seconds is a big amount of unreclaimed memory, which can lead
to more frequent memory pressure and shrinking. Ideally we do not want
to trigger any shrinker at all, because for us that is a big slowdown
in device behaviour.

> > 2. Converting queue_rcu_work() to the lazy variant breaks the human
> > interpretation of when a queued work is supposed to run. People do
> > not expect seconds when they queue the work.
>
> Which people? ;)
>
The people who wrote the code :)

> > Same for kvfree_rcu(): we do not expect it either; we even used a
> > high-priority queue in the beginning. There are ~10 users who queue
> > the work, and they did not expect it to run in 10 seconds when they
> > wrote the code.
>
> That's a bit of a misinterpretation of what I'm saying. A variant
> queue_rcu_work_flush() can be added for those users (such as ones
> that are not freeing memory).
>
If it is added for kvfree_rcu(), that is totally fine, because there
is batching in place.

--
Uladzislau Rezki
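For context, the caller-visible side of the batching being debated is
just an ordinary kvfree_rcu() invocation; when the memory actually
comes back is internal drain policy. A minimal usage example, with
struct foo and its fields made up for illustration:

<snip>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Made-up example structure; only the embedded rcu_head matters. */
struct foo {
	int data;
	struct rcu_head rcu;
};

static void release_foo(struct foo *fp)
{
	/*
	 * Queue fp to be kvfree()d after a grace period. Whether the
	 * memory returns in ~50 ms or after seconds is exactly the
	 * drain/batching policy discussed in this thread.
	 */
	kvfree_rcu(fp, rcu);
}
<snip>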
On Mon, Oct 24, 2022 at 04:16:20PM -0400, Joel Fernandes wrote:
> On Mon, Oct 24, 2022 at 4:12 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > [ . . . ]
> >
> > Even in the beginning, when I came up with an implementation for
> > call_rcu_lazy(), I had to mark queue_rcu_work() as lazy as well, since
> > it was quite frequent (on ChromeOS). But when we introduced the
> > flush() API, I forgot to not use flush() on it. Unfortunately, this
> > patch slipped into my last series while Vlad and I were debugging the
> > SCSI issue, and it did not really help for the SCSI issue itself.
>
> I could try to run Vlad's other mainline patch itself and measure
> power; I'll get back on that. Thanks!
>
That makes sense. It would be good to have a look at your power figures
and traces.

--
Uladzislau Rezki
On Tue, Oct 25, 2022 at 6:48 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Mon, Oct 24, 2022 at 04:16:20PM -0400, Joel Fernandes wrote:
> > [...]
> > I could try to run Vlad's other mainline patch itself and measure
> > power, I'll get back on that. Thanks!
> >
> That makes sense. It would be good to have a look at your power figures
> and traces.

If you don't mind, could you backport that patch to 5.10?
Here is my 5.10 tree for reference (without the patch):
https://github.com/joelagnel/linux-kernel.git (branch
5.10-v9-minus-queuework-plus-kfreebatch)

and I am getting conflicts if I cherry-pick:
51824b780b71 ("rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval")

I am assuming you have already done the backport, which is why you got
the traces above. If so, I would appreciate a link to your branch so I
don't mess the backport up!

- Joel
On Tue, Oct 25, 2022 at 11:05:57AM -0400, Joel Fernandes wrote:
> [...]
>
> If you don't mind, could you backport that patch to 5.10?
> Here is my 5.10 tree for reference (without the patch):
> https://github.com/joelagnel/linux-kernel.git (branch
> 5.10-v9-minus-queuework-plus-kfreebatch)
>
> I am assuming you have already done the backport, which is why you got
> the traces above. If so, I would appreciate a link to your branch so I
> don't mess the backport up!
>
Sure. I sent you the patches privately, so I do not want to paste a lot
of code here and make extra line-noise.

--
Uladzislau Rezki
On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > >
> > > You guys might need to agree on the definition of "good" here. Or maybe
> > > understand the differences in your respective platforms' definitions of
> > > "good". ;-)
> > >
> > Indeed. Bad is when it fires once per millisecond, indefinitely :) At least
> > in such a workload I can detect a power delta and power gain. Anyway, below
> > is a new trace where I do not use the "flush" variant for kvfree_rcu():
> >
> > <snip>
> > 1. Home screen swipe:
> > rcuop/0-15 [003] d..1 1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > rcuop/2-33 [002] d..1 1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > rcuop/3-40 [001] d..1 1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > rcuop/1-26 [003] d..1 1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > rcuop/4-48 [001] d..1 1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > rcuop/5-55 [002] d..1 1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > rcuop/6-62 [005] d..1 1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > rcuop/2-33 [002] d..1 1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > rcuop/0-15 [003] d..1 1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > 2. App launches:
> > rcuop/4-48 [005] d..1 1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > rcuop/7-69 [007] d..1 1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > rcuop/5-55 [004] d..1 1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > rcuop/0-15 [004] d..1 1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > rcuop/1-26 [006] d..1 1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > rcuop/2-33 [006] d..1 1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > rcuop/3-40 [006] d..1 1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > rcuop/4-48 [002] d..1 1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > rcuop/7-69 [001] d..1 1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > <...>-62 [002] d..1 1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > rcuop/6-62 [000] d..1 1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > <...>-62 [003] d..1 1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > <...>-26 [001] d..1 1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > rcuop/2-33 [001] d..1 1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > <...>-40 [001] d..1 1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > rcuop/2-33 [005] d..1 1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > rcuop/2-33 [005] d..1 1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > rcuop/2-33 [005] d..1 1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > rcuop/0-15 [002] d..1 1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > rcuop/0-15 [003] d..1 1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > rcuop/5-55 [004] d..1 1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > rcuop/5-55 [004] d..1 1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > rcuop/6-62 [001] dn.1 1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > rcuop/6-62 [006] d..1 1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > rcuop/0-15 [003] d..1 1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > rcuop/0-15 [003] d..1 1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > <snip>
> >
> > It is much better. But as I wrote earlier, there is a patch that I
> > submitted some time ago improving kvfree_rcu() batching:
> >
> > <snip>
> > commit 51824b780b719c53113dc39e027fbf670dc66028
> > Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Date: Thu Jun 30 18:33:35 2022 +0200
> >
> >     rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval
> >
> >     Currently the monitor work is scheduled with a fixed interval of HZ/20,
> >     which is roughly 50 milliseconds. The drawback of this approach is
> >     low utilization of the 512 page slots in scenarios with infrequent
> >     kvfree_rcu() calls. For example on an Android system:
> > <snip>
> >
> > The trace that I posted was taken without it.
>
> And if I am not getting too confused, that patch is now in mainline.
> So it does make sense to rely on it, then. ;-)

Vlad's patch to change the KFREE_DRAIN_JIFFIES to 5 seconds seems reasonable
to me. However, can we unify KFREE_DRAIN_JIFFIES and LAZY_FLUSH_JIFFIES?

One at 5 and the other at 10 seems odd, especially because the former seems to
negate the effects of the latter, and anyone tweaking that in the future (say
via new command line options) should probably tweak them together to increase
batching.

Testing shows significantly better batching with Vlad's updates; however, I am
wondering why the rcu_callback fires in pairs like that from separate
kworkers:

kworker/6:1-157 [006] d..1 288.861610: rcu_callback: rcu_preempt rhp=0000000079b895f9 func=rcu_work_rcufn 1214
kworker/4:2-158 [004] d..1 288.861612: rcu_callback: rcu_preempt rhp=00000000d83fcc90 func=rcu_work_rcufn 798

I wonder if the queued kwork is executing twice accidentally, or something.
This kernel does have the additional trace patch below, fyi.

Another thought I have: we could just keep kvfree_rcu() mapped to
call_rcu() via a config option, say CONFIG_RCU_LAZY_KFREE, or something.
I am personally not much of a fan of the kfree_rcu()-induced additional
timer wakeups and kworker queue+wakeup, which we don't need per se if we
are already batching with the lazyfied call_rcu(). Too many moving parts,
which might hurt power.

---8<-----------------------

From: Joel Fernandes <joelaf@google.com>
Subject: [PATCH] debug: reorder trace_rcu_callback

Signed-off-by: Joel Fernandes <joelaf@google.com>
---
 kernel/rcu/tree.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 73feb09fd51b..a7c175e9533a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2978,10 +2978,6 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 	}
 
 	check_cb_ovld(rdp);
-	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
-		return; // Enqueued onto ->nocb_bypass, so just leave.
-	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
-	rcu_segcblist_enqueue(&rdp->cblist, head);
 	if (__is_kvfree_rcu_offset((unsigned long)func))
 		trace_rcu_kvfree_callback(rcu_state.name, head,
 					  (unsigned long)func,
@@ -2990,6 +2986,11 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 		trace_rcu_callback(rcu_state.name, head,
 				   rcu_segcblist_n_cbs(&rdp->cblist));
 
+	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
+		return; // Enqueued onto ->nocb_bypass, so just leave.
+	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
+	rcu_segcblist_enqueue(&rdp->cblist, head);
+
 	/* Go handle any RCU core processing required. */
 	if (IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
 	    unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) {
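To make the unification Joel is asking about concrete, here is a sketch only (not a posted patch): both intervals derived from one shared knob. The knob name is made up; LAZY_FLUSH_JIFFIES and KFREE_DRAIN_JIFFIES are the existing macros from the series and mainline respectively:

<snip>
/* Sketch: one hypothetical knob (RCU_LAZY_BATCH_SECS) feeding both
 * batching intervals, so future tuning moves them together. */
#define RCU_LAZY_BATCH_SECS	10
#define LAZY_FLUSH_JIFFIES	(RCU_LAZY_BATCH_SECS * HZ)
#define KFREE_DRAIN_JIFFIES	(RCU_LAZY_BATCH_SECS * HZ)
<snip>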
On Fri, Oct 28, 2022 at 09:23:47PM +0000, Joel Fernandes wrote:
> [...]
>
> Testing shows significantly better batching with Vlad's updates; however, I am
> wondering why the rcu_callback fires in pairs like that from separate
> kworkers:
>
> kworker/6:1-157 [006] d..1 288.861610: rcu_callback: rcu_preempt rhp=0000000079b895f9 func=rcu_work_rcufn 1214
> kworker/4:2-158 [004] d..1 288.861612: rcu_callback: rcu_preempt rhp=00000000d83fcc90 func=rcu_work_rcufn 798

I think this is just 2 kvfree_call_rcu() happening on 2 different CPUs,
and then the draining happened on 2 different kworkers, so it appears
normal.

Here is also some more evidence from the user side of kvfree_call_rcu()
noisiness on ChromeOS.
So we definitely need the batching to happen on ChromeOS:

kworker/u24:6-1448 [005] 77.290344: function: kvfree_call_rcu <-- cfg80211_update_known_bss
ThreadPoolForeg-5130 [011] 77.301101: function: kvfree_call_rcu <-- cgroup_migrate_finish
irq/144-iwlwifi-1010 [004] 77.314367: function: kvfree_call_rcu <-- cfg80211_update_known_bss
ThreadPoolSingl-2050 [004] 77.330359: function: kvfree_call_rcu <-- cgroup_migrate_finish
ThreadPoolSingl-2050 [004] 77.330362: function: kvfree_call_rcu <-- cgroup_migrate_finish
ThreadPoolForeg-5130 [011] 77.331513: function: kvfree_call_rcu <-- cgroup_migrate_finish
patchpaneld-2195 [009] 77.337726: function: kvfree_call_rcu <-- neigh_flush_dev
patchpaneld-2195 [009] 77.337737: function: kvfree_call_rcu <-- __hw_addr_del_entry
patchpaneld-2195 [009] 77.337744: function: kvfree_call_rcu <-- addrconf_ifdown
patchpaneld-2195 [009] 77.337744: function: kvfree_call_rcu <-- __hw_addr_del_entry
irq/144-iwlwifi-1010 [004] 77.633595: function: kvfree_call_rcu <-- cfg80211_update_known_bss
irq/144-iwlwifi-1010 [004] 77.633609: function: kvfree_call_rcu <-- cfg80211_update_known_bss
irq/144-iwlwifi-1010 [004] 77.769844: function: kvfree_call_rcu <-- cfg80211_update_known_bss
kworker/u24:1-9 [008] 77.769858: function: kvfree_call_rcu <-- cfg80211_update_known_bss
irq/144-iwlwifi-1010 [004] 77.880114: function: kvfree_call_rcu <-- cfg80211_update_known_bss
kworker/u24:6-1448 [005] 77.880129: function: kvfree_call_rcu <-- cfg80211_update_known_bss
irq/144-iwlwifi-1010 [004] 77.880131: function: kvfree_call_rcu <-- cfg80211_update_known_bss
kworker/u24:6-1448 [005] 77.880133: function: kvfree_call_rcu <-- cfg80211_update_known_bss
virtio_gpu-5882 [010] 78.337302: function: kvfree_call_rcu <-- context_close
virtio_gpu-5882 [010] 78.337303: function: kvfree_call_rcu <-- i915_driver_postclose
virtio_gpu-5882 [010] 78.346058: function: kvfree_call_rcu <-- fence_notify
virtio_gpu-5882 [010] 78.346070: function: kvfree_call_rcu <-- fence_notify
virtio_gpu-5882 [010] 78.346079: function: kvfree_call_rcu <-- fence_notify
virtio_gpu-5882 [010] 78.346086: function: kvfree_call_rcu <-- fence_notify
virtio_gpu-5882 [010] 78.346184: function: kvfree_call_rcu <-- fence_notify
virtio_gpu-5882 [010] 78.346196: function: kvfree_call_rcu <-- fence_notify

thanks,

- Joel

> [...]
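The per-CPU behavior Joel describes can be pictured with a simplified sketch; the struct below is illustrative only and is not the kernel's actual kfree_rcu_cpu layout:

<snip>
/* Simplified sketch: each CPU owns its own kvfree batching state, so
 * two kvfree_call_rcu() calls landing on two CPUs are drained by two
 * different kworkers - hence trace events appearing in pairs. */
struct krc_sketch {
	struct delayed_work monitor_work;	/* per-CPU drain work */
	struct rcu_head *head;			/* objects queued on this CPU */
};

static DEFINE_PER_CPU(struct krc_sketch, krc_sketch);
<snip>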
On Fri, Oct 28, 2022 at 09:23:47PM +0000, Joel Fernandes wrote:
> [...]
>
> Vlad's patch to change the KFREE_DRAIN_JIFFIES to 5 seconds seems reasonable
> to me. However, can we unify KFREE_DRAIN_JIFFIES and LAZY_FLUSH_JIFFIES?
>
This is very good.

Below is a plot that I have taken during one use-case. It is about three
apps used in parallel. It was done by running a "monkey" test:

wget ftp://vps418301.ovh.net/incoming/monkey_3_apps_slab_usage_5_minutes.png

I set up three apps as the usage scenario: Google Chrome, YouTube and Camera.
I logged the Slab metric from /proc/meminfo. The sampling rate is 0.1 second.

Please have a look at the results. It reflects what I am saying: the
non-flush kvfree RCU variant makes memory usage higher, which is not
acceptable for our mobile devices and workloads.

> One at 5 and the other at 10 seems odd, especially because the former seems to
> negate the effects of the latter, and anyone tweaking that in the future (say
> via new command line options) should probably tweak them together to increase
> batching.
>
Well. Convert 5 seconds to 10? What will it solve for you? We can do it,
and from a kvfree_rcu() perspective nothing really changes.

> Testing shows significantly better batching with Vlad's updates; however, I am
> wondering why the rcu_callback fires in pairs like that from separate
> kworkers:
>
> kworker/6:1-157 [006] d..1 288.861610: rcu_callback: rcu_preempt rhp=0000000079b895f9 func=rcu_work_rcufn 1214
> kworker/4:2-158 [004] d..1 288.861612: rcu_callback: rcu_preempt rhp=00000000d83fcc90 func=rcu_work_rcufn 798
>
> I wonder if the queued kwork is executing twice accidentally, or something.
>
Because a kfree_rcu_cpu() is a per-cpu thing.

> This kernel does have the additional trace patch below, fyi.
>
> Another thought I have: we could just keep kvfree_rcu() mapped to
> call_rcu() via a config option, say CONFIG_RCU_LAZY_KFREE, or something.
>
I am not sure you need it, really. Whether you wake up "rcuop" or whatever
at a 0.5-second interval or a 5-second interval, you will not notice
anything in terms of power between the two.

--
Uladzislau Rezki
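For anyone wanting to reproduce the measurement Vlad describes, a small user-space sampler (a sketch in the spirit of his setup, not his actual tooling) that logs the Slab line from /proc/meminfo every 0.1 second:

<snip>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[256];

	for (;;) {
		/* Re-open /proc/meminfo on every sample to get fresh values. */
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "Slab:", 5)) {
				fputs(line, stdout);	/* e.g. "Slab:  123456 kB" */
				break;
			}
		}
		fclose(f);
		fflush(stdout);
		usleep(100 * 1000);	/* 0.1-second sampling rate */
	}
}
<snip>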
> On Oct 31, 2022, at 9:21 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
>
> [...]

Please let us try to trim emails. That goes for me too.

> This is very good.
>
> Below is a plot that I have taken during one use-case. It is about three
> apps used in parallel. It was done by running a "monkey" test:
>
> wget ftp://vps418301.ovh.net/incoming/monkey_3_apps_slab_usage_5_minutes.png
>
> I set up three apps as the usage scenario: Google Chrome, YouTube and Camera.
> I logged the Slab metric from /proc/meminfo. The sampling rate is 0.1 second.
>
> Please have a look at the results. It reflects what I am saying: the
> non-flush kvfree RCU variant makes memory usage higher, which is not
> acceptable for our mobile devices and workloads.

Thanks, I'll take a closer look at the data (currently commuting), but
here's a quick reply:

I am curious that with the 5 second timer, you are delaying RCU anyway.
Are you saying that adding another 10 on top (due to lazifying) seems to
be causing issues? I find it hard to believe that you cannot give the
shrinker enough work within 5 seconds, such that it also triggers the
issues you're seeing. However, the workload and data speak.
>> One at 5 and the other at 10 seems odd, especially because the former seems to
>> negate the effects of the latter, and anyone tweaking that in the future (say
>> via new command line options) should probably tweak them together to increase
>> batching.
>>
> Well. Convert 5 seconds to 10? What will it solve for you? We can do it,
> and from a kvfree_rcu() perspective nothing really changes.

True. In fact, with my last patch, I almost never even see the need to go
to RCU. However, my point with unification is just to keep things simple
for the user when two knobs do the same thing. Granted, this is a
compile-time knob, but that might change in the future. We already have
enough knobs in RCU and, as you guys know, I'm a fan of not letting the
user mess things up too much.

>> Testing shows significantly better batching with Vlad's updates; however, I am
>> wondering why the rcu_callback fires in pairs like that from separate
>> kworkers:
>>
>> kworker/6:1-157 [006] d..1 288.861610: rcu_callback: rcu_preempt rhp=0000000079b895f9 func=rcu_work_rcufn 1214
>> kworker/4:2-158 [004] d..1 288.861612: rcu_callback: rcu_preempt rhp=00000000d83fcc90 func=rcu_work_rcufn 798
>>
>> I wonder if the queued kwork is executing twice accidentally, or something.
>>
> Because a kfree_rcu_cpu() is a per-cpu thing.

Right, got it.

>> This kernel does have the additional trace patch below, fyi.
>>
>> Another thought I have: we could just keep kvfree_rcu() mapped to
>> call_rcu() via a config option, say CONFIG_RCU_LAZY_KFREE, or something.
>>
> I am not sure you need it, really. Whether you wake up "rcuop" or whatever
> at a 0.5-second interval or a 5-second interval, you will not notice
> anything in terms of power between the two.

Yes, you are right. This is not needed considering the improvements you
recently made.

Cheers,

- Joel

> --
> Uladzislau Rezki
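On the "might change in the future" point: a sketch of what promoting the compile-time drain interval to a boot-time knob could look like. The parameter name is hypothetical; nothing like this was posted in the thread:

<snip>
#include <linux/moduleparam.h>

/* Hypothetical: rcutree.kfree_drain_jiffies=N on the kernel command line,
 * instead of the compile-time KFREE_DRAIN_JIFFIES constant. */
static int kfree_drain_jiffies = 5 * HZ;
module_param(kfree_drain_jiffies, int, 0444);
<snip>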
On Mon, Oct 31, 2022 at 9:21 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Fri, Oct 28, 2022 at 09:23:47PM +0000, Joel Fernandes wrote:
> > [...]
> > Vlad's patch to change the KFREE_DRAIN_JIFFIES to 5 seconds seems reasonable
> > to me. However, can we unify KFREE_DRAIN_JIFFIES and LAZY_FLUSH_JIFFIES?
> >
> This is very good.
>
> Below is a plot that I have taken during one use-case. It is about three
> apps used in parallel. It was done by running a "monkey" test:
>
> wget ftp://vps418301.ovh.net/incoming/monkey_3_apps_slab_usage_5_minutes.png
>
> I set up three apps as the usage scenario: Google Chrome, YouTube and Camera.
> I logged the Slab metric from /proc/meminfo. The sampling rate is 0.1 second.
>
> Please have a look at the results. It reflects what I am saying: the
> non-flush kvfree RCU variant makes memory usage higher, which is not
> acceptable for our mobile devices and workloads.

That does look higher, though honestly only by about ~5%. But that's just
the effect of more "laziness". The graph itself does not show a higher
number of shrinker invocations; in fact, I think shrinker invocations are
not happening much, which is why the slab holds more memory. The system
may not be under memory pressure?

Anyway, I agree with your point of view and I think my concern does
not even occur with the latest patch on avoiding RCU that I posted
[1], so I come in peace.

[1] https://lore.kernel.org/rcu/20221029132856.3752018-1-joel@joelfernandes.org/

I am going to start merging all the lazy patches to ChromeOS 5.10 now,
including your kfree updates, except for [1] while we discuss it.
> On Mon, Oct 31, 2022 at 9:21 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > [...]
>
> That does look higher, though honestly only by about ~5%. But that's just
> the effect of more "laziness". The graph itself does not show a higher
> number of shrinker invocations; in fact, I think shrinker invocations are
> not happening much, which is why the slab holds more memory. The system
> may not be under memory pressure?
>
The idea is to minimize the possibility of entering a low-memory
condition. That is bad for users from a sluggishness point of view. I am
saying this in the context of Android devices.

> Anyway, I agree with your point of view and I think my concern does
> not even occur with the latest patch on avoiding RCU that I posted
> [1], so I come in peace.
>
> [1] https://lore.kernel.org/rcu/20221029132856.3752018-1-joel@joelfernandes.org/
>
I will have a look at it.
> I am going to start merging all the lazy patches to ChromeOS 5.10 now,
> including your kfree updates, except for [1] while we discuss it.
>
Good for ChromeOS users :)

--
Uladzislau Rezki
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7cd5f5e7e0a1b..b4b0e828b529e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
 
 	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
 		rwork->wq = wq;
-		call_rcu(&rwork->rcu, rcu_work_rcufn);
+		call_rcu_flush(&rwork->rcu, rcu_work_rcufn);
 		return true;
 	}
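For context on what the one-line change above affects, a minimal sketch of a hypothetical queue_rcu_work() user follows; with the patch, the grace period gating the work item takes the prompt (flush) path instead of the lazy one. The function and variable names here are made up for illustration:

<snip>
#include <linux/workqueue.h>

static struct rcu_work cleanup_rwork;

/* Runs on system_wq only after an RCU grace period has elapsed, i.e.
 * after all pre-existing RCU readers are done. */
static void free_old_state(struct work_struct *work)
{
	/* reclaim of the old state goes here */
}

static void start_cleanup(void)	/* hypothetical caller */
{
	INIT_RCU_WORK(&cleanup_rwork, free_old_state);
	queue_rcu_work(system_wq, &cleanup_rwork);
}
<snip>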